Text processing with shell utilities (Linux / Cygwin)

This is a small collection of tips for processing text data with the Bash shell and some common utilities. I used these to help process numerical results that my research code had outputted in the form of text files. By automating the processing of text files, I saved a lot of time when preparing and condensing data for publication and thesis writing. Linux users should already have most if not all of the required utilities installed. Windows users should install Cygwin, which gives a Bash shell and all the necessary utilities.

Since Windows and Unix text files differ in the line ending characters used, be aware that there can be some issues when the line ending in a file is not what is expected. The Cygwin tools expect and produce Unix line endings. To convert between Unix and Windows (DOS) line endings, use dos2unix and unix2dos, both tools that come with Cygwin. iconv is another tool that can convert between many different text formats and character encodings.
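If dos2unix is not available, the same conversion can be done with tr, which is part of the standard utilities; a minimal sketch (dos.txt is a made-up example file):

```shell
# Create a small file with Windows (CRLF) line endings.
printf 'line one\r\nline two\r\n' > dos.txt

# Delete the carriage return characters, leaving Unix (LF)
# line endings; this is what dos2unix does for you.
tr -d '\r' < dos.txt > unix.txt
```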

Concatenate files line by line

If you have two files with the same number of lines and you want to concatenate them line by line, you can use the paste command. For example, suppose a.txt contains:

1
2

and b.txt contains:

a
b

and you want to join a.txt and b.txt line by line like the following (the lines from each file are separated by a tab character):

1	a
2	b
Then use the following command:

paste a.txt b.txt > c.txt

You can specify a delimiter to use between the lines of each file:

paste -d "," a.txt b.txt > c.txt

This will produce:

1,a
2,b
Merge specific CSV columns line by line

If you have two comma-separated value (CSV) files and you want to merge specific columns from each file into a new file line by line (the equivalent of opening the CSV files in Excel or another spreadsheet program, copying the desired column from each file and pasting the columns into a new file), you can use the paste command:

paste -d "," <(cut -d \, -f 1 FILE1) <(cut -d \, -f 2 FILE2) > outfile.csv

The two cut commands run first because of the <() process substitution syntax, which presents each command's output to paste as if it were a file. The cut commands extract column 1 (specified by -f 1, which means “field 1” in cut terminology) from FILE1 and column 2 from FILE2. The -d option tells cut which delimiter separates the fields (columns) in the input files; in this case, since FILE1 and FILE2 are CSV files, the delimiter is a comma.

The paste command then pastes the extracted columns together, with a comma between the columns. The output file is therefore also a CSV document. The above takes column 1 from FILE1 and column 2 from FILE2 and pastes them side by side in outfile.csv. I used this method extensively to summarize CSV files that were generated by a research program.
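To make this concrete, here is the same command run on two small made-up files standing in for FILE1 and FILE2 (the contents are arbitrary examples):

```shell
# Two small CSV files standing in for FILE1 and FILE2.
printf '1,x\n2,y\n' > FILE1
printf 'a,p\nb,q\n' > FILE2

# Paste column 1 of FILE1 next to column 2 of FILE2.
paste -d "," <(cut -d \, -f 1 FILE1) <(cut -d \, -f 2 FILE2) > outfile.csv

# outfile.csv now contains:
# 1,p
# 2,q
```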

Extracting a range of lines from a text file

To print a specific line range from a file (in this case, print lines 2 to 4 from somefile.txt), use the sed utility:

sed -n 2,4p somefile.txt

You can also print multiple line ranges. For example, the following prints lines 1 to 2 and then line 4:

sed -n -e 1,2p -e 4p somefile.txt

Of course you can append > outfile.txt to save the output from the above to a file.
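On very large files it can be worth telling sed to quit as soon as the last wanted line has been printed, so that it does not scan the remainder of the file; a small variation on the command above (somefile.txt is recreated here as a sample):

```shell
# A five line sample file.
printf 'a\nb\nc\nd\ne\n' > somefile.txt

# Print lines 2 to 4, then quit at line 4 rather than
# reading the rest of the file.
sed -n '2,4p;4q' somefile.txt
# Prints:
# b
# c
# d
```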

Adding delimiters to fixed width data

If a file has lines that are of the format YYYYMMDD (i.e. dates), you can add delimiters so that the output file has lines with the format YYYY MM DD (using a space as a delimiter). To do this, use the cut utility:

cut -c1-4,5-6,7-8 --output-delimiter=' ' textfile.txt > outfile.txt

Cut will print characters 1 to 4 (corresponding to YYYY), 5 to 6 (MM) and 7 to 8 (DD) with the specified delimiter (in this case, a space) between them.
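For example, with a made-up textfile.txt of dates (note that --output-delimiter is a GNU cut extension and may not exist in other cut implementations):

```shell
# A file of YYYYMMDD dates.
printf '20210314\n19991231\n' > textfile.txt

# Split each date into YYYY, MM and DD separated by spaces.
cut -c1-4,5-6,7-8 --output-delimiter=' ' textfile.txt > outfile.txt

# outfile.txt now contains:
# 2021 03 14
# 1999 12 31
```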

Multiple File Search and Replace

The following replaces 'foo' in all files under the current directory (and all subdirectories, recursively) with 'bar':

find ./ -type f -exec sed -i 's/foo/bar/g' {} \;

You can use any valid find option. For example, to limit the replacement to HTML files in the directory tree:

find ./ -type f -name '*.html' -exec sed -i 's/foo/bar/g' {} \;

I have noticed that sed -i rewrites every file it is given, so it changes the modification date of all files, even files that contain no match. If this matters, I would suggest using grep to get the list of files that contain the match and then processing only the files in that list.
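One way to do that is to let grep -l produce the list of matching files and hand it to sed, so only files that actually contain the pattern are rewritten; a sketch using made-up file names:

```shell
# A directory with one file that contains 'foo' and one that
# does not.
mkdir -p searchdir
printf 'foo here\n' > searchdir/match.txt
printf 'nothing here\n' > searchdir/other.txt

# grep -rl lists only the files containing 'foo'; sed then
# edits just those, leaving other.txt completely untouched.
grep -rl 'foo' searchdir | xargs sed -i 's/foo/bar/g'

# searchdir/match.txt now contains 'bar here'.
```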

Count number of lines matching a pattern

You can count the number of lines matching a certain pattern with the -c option in grep. To show all lines matching the pattern, followed by the count, use the following:

grep PATTERN file; grep -c PATTERN file

There are other ways to do this with tools other than grep, but I have not needed them.

You can store the count to a Bash variable like this:

count=`grep -c PATTERN file`
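Note that -c counts matching lines, not total occurrences; a line that contains the pattern twice still counts once. A quick demonstration with a made-up fruit.txt:

```shell
# Three lines; the first contains 'apple' twice.
printf 'apple apple\nbanana\napple tart\n' > fruit.txt

count=`grep -c apple fruit.txt`
echo $count
# Prints 2 (two matching lines, even though 'apple' appears
# three times in the file).
```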

Print a text file backwards

The cat command will print the file contents in the normal order. To reverse the order, use the tac command:

tac filename.txt

This will print the lines out one by one, but in reverse order (last line printed first). It does not reverse the lines themselves.
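For example:

```shell
printf 'first\nsecond\nthird\n' | tac
# Prints:
# third
# second
# first
```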

Grep for multiple words

You can grep for two (or more) words at once. For example, create a file.txt containing:

One
Two
Three
Four

Then use the following grep command (the -E option enables extended regular expressions, which lets you use the | as an “or” operator):

grep -E 'One|Two|Three' file.txt

The output is:

One
Two
Three

Copyright © 1997 - 2021 Peter Yu