I recently published dt, a modern data transformation tool designed to make the awk workflows commonly used on this blog more intuitive, expressive, and fast. dt is written in Rust, so it compiles to a single binary that runs anywhere, and it uses Polars for the actual data processing, giving you columnar operations that handle large files efficiently. The syntax uses explicit functions (filter(), select(), mutate()) chained together with pipes, so transformations read like a recipe instead of a regex puzzle. There is also an interactive REPL that shows the result after each operation, letting you build complex pipelines step by step, catch mistakes early, and undo errors with .undo.
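As a minimal taste of the pipe syntax (a sketch using only the functions covered below, with a hypothetical tab-separated data.tsv as input):

read('data.tsv', header=false) | select($1, $2) | write('subset.tsv', header=false)

This reads the file, keeps the first two columns, and writes the result — each stage feeding its output to the next via the pipe.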


Installation

Getting started with dt is simple: run the command for your operating system in your terminal:

macOS/Linux:

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/system0x7/dt/releases/latest/download/data-transform-installer.sh | sh

Windows:

powershell -ExecutionPolicy ByPass -c "irm https://github.com/system0x7/dt/releases/latest/download/data-transform-installer.ps1 | iex"

Via Cargo (any platform with Rust installed):

cargo install data-transform

The installers will add dt to your PATH automatically. After installation, verify it’s working:

dt --version

You can now launch the interactive REPL by simply running dt in the terminal (type .exit to leave).

Note: When using the REPL, commands should be entered and run one at a time. If you are copying examples from this post, paste each line separately to ensure they execute correctly.
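For example, a short session might look like this (a sketch, entered one line at a time; REPL output is omitted here):

dt
pops = read('pops', header=false)
pops | select($1)
.undo
.exit

After each line, the REPL prints the resulting table, and .undo reverts the last operation if something went wrong.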


How dt Simplifies awk

For a full reference, visit the dt reference. To demonstrate how dt simplifies common workflows used on this blog, let’s revisit some of the awk commands from earlier posts.

Example 1: Filtering Samples by Population

The awk version:

awk -v OFS='\t' '
  FNR==1 {f++}
  f==1 {want[$1]=1; next}               # populations to keep
  f==2 {if ($3 in want) keep[$1]=1; next} # reference.ind (IID in col1, pop in col3)
  f==3 {if ($2 in keep) print $1,$2}    # reference.fam (FID col1, IID col2)
' pops reference.ind reference.fam > pops.keep

To do this in dt, we first load the three files separately:

pops = read('pops', header=false)
ind_file = read('reference.ind', header=false)
fam_file = read('reference.fam', header=false)

Now we filter the ind file for the sample iids:

iids = ind_file | filter($3 in pops) | select($1)

In words, this means: take the ind_file we previously loaded, filter its third column (note the dollar notation for accessing unnamed columns) based on the population labels in the loaded pops file, and select the first column (which contains the sample IIDs).

In the next step, we filter the reference fam file in a similar way:

keep_file = fam_file | filter($2 in iids) | select($1, $2)

keep_file | write('pops.keep', header=false)

This translates to: take the loaded fam_file, filter its second column (which contains the IIDs) based on the IIDs we want, and select (“keep”) only the first (FIDs) and second (IIDs) columns. The result is a PLINK-compatible keep file containing only the samples whose population labels are listed, one per row, in the pops file.

This last step can also be written more concisely by piping the selected table directly into the output:

fam_file | filter($2 in iids) | select($1, $2) | write('pops.keep', header=false) 

Example 2: String Manipulation with Lookups

Below is a rather obscure-looking awk command from this previous post: SmartPCA Tutorial: How to Run PCA on Genetic Data (EIGENSTRAT).

awk -F',' -v OFS=',' '
  NR == FNR {
    line = $0
    gsub(/^[[:space:]]+/, "", line)
    if (line == "") next

    split(line, a, /[[:space:]]+/)
    if (a[1] == "" || a[3] == "") next

    label = a[3]
    sub(/\.(AG|DG|HO|SG)$/, "", label)
    pop[a[1]] = label
    next
  }

  {
    if (split($1, parts, ":") >= 2 && (parts[2] in pop))
      $1 = pop[parts[2]] ":" parts[2]
    print
  }
' data.ind smartpca.csv > smartpca_with_labels.csv

In the dt REPL, we can do this step-by-step. First, we load the needed files:

pca = read('smartpca.csv', header=false)
ind = read('data.ind', header=false)

Next, we clean up the population labels in the ind file by removing the suffixes .AG, .DG, .HO, or .SG from the third column:

labels = ind | mutate($3 = replace($3, re('\.(AG|DG|HO|SG)$'), ''))

This creates a lookup table where the first column contains IDs and the third column contains cleaned population labels.

Now we transform the PCA file. For each row, we extract the IID from the first column (the part after the colon, hence [1] instead of [0]), look up its population label in the cleaned labels table we just built, and rebuild (“mutate”) the first column as label:IID:

result = pca | mutate($1 = lookup(labels, split($1, ':')[1], on=$1, return=$3) + ':' + split($1, ':')[1])

In plain language: split the first column by : to get the IID, look it up in our labels table to find the cleaned population label, then rebuild the column by concatenating the label with the IID using : as separator.

The split() function takes the column to split (here $1), the delimiter to split on (: in quotes), and which part to extract: [0] for the part before the delimiter or [1] for the part after. The lookup() function takes four arguments: the lookup table (labels), the key to search for (the extracted IID), the column in the lookup table to match against (on=$1), and the column to return (return=$3). The rest of the mutation concatenates strings together using the + operator, joining the looked-up label with : and the original IID.
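To make the indexing concrete, consider a hypothetical first-column value $1 = 'GBR:HG00096' (these expressions are a sketch; in practice they appear inside mutate() or filter() calls as shown above):

split($1, ':')[0]
split($1, ':')[1]

The first expression yields 'GBR' (before the delimiter) and the second yields 'HG00096' (after it), which is why the pipeline above uses [1] to extract the IID.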

Finally, write the result:

result | write('smartpca_with_labels.csv', header=false)

The syntax in the second example may look unfamiliar at first, but dt follows a consistent pattern, and after a few uses the operations become intuitive. The pipeline structure makes your work self-documenting, and the REPL lets you verify each step interactively.

If you encounter bugs or have feature requests, you can report them on GitHub.