This is a small collection of tips for processing text data with the Bash shell and some common utilities. I used these to help process numerical results that my research code had outputted in the form of text files. By automating the processing of text files, I saved a lot of time when preparing and condensing data for publication and thesis writing. Linux users should already have most if not all of the required utilities installed. Windows users should install Cygwin, which gives a Bash shell and all the necessary utilities.
Since Windows and Unix text files use different line-ending characters, be aware that problems can arise when a file's line endings are not what a tool expects. The Cygwin tools expect and produce Unix line endings. To convert between Unix and Windows (DOS) line endings, use dos2unix and unix2dos, both of which come with Cygwin. iconv is another tool that can convert between many different text formats and character encodings.
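As a minimal sketch of the conversion (the file names here are made up for the demo; dos2unix may not be installed on every system, so the runnable line uses tr -d '\r', which performs the same CRLF-to-LF conversion):

```shell
# Make a file with Windows (CRLF) line endings.
printf 'one\r\ntwo\r\n' > dosfile.txt

# With the Cygwin packages installed you would run:
#   dos2unix dosfile.txt      # convert in place to Unix (LF) endings
#   unix2dos dosfile.txt      # convert back to DOS (CRLF) endings
# A portable equivalent of dos2unix is to strip the carriage returns:
tr -d '\r' < dosfile.txt > unixfile.txt
```

iconv works similarly for character encodings, e.g. iconv -f UTF-8 -t ISO-8859-1 in.txt > out.txt converts a UTF-8 file to Latin-1.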
If you have two files with the same number of lines and you want to concatenate them line by line, you can use the paste command. For example, suppose a.txt contains:
1
2
and b.txt contains:
a
b
and you want to join files a.txt and b.txt line by line like the following (the lines from each file are separated by a tab character):
1	a
2	b
Then use the following command:
paste a.txt b.txt > c.txt
You can specify a delimiter to use between the lines of each file:
paste -d "," a.txt b.txt > c.txt
This will produce:
1,a
2,b
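The whole example can be reproduced in one self-contained snippet (using the same file names as above):

```shell
# Recreate the two input files from the example.
printf '1\n2\n' > a.txt
printf 'a\nb\n' > b.txt

# The default delimiter is a tab:
paste a.txt b.txt

# A comma as the delimiter, saved to c.txt:
paste -d "," a.txt b.txt > c.txt
cat c.txt
```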
If you have two comma-separated value (CSV) files and you want to merge specific columns from each into a new file line by line (equivalent to opening the CSV files in Excel or another spreadsheet program, copying the desired column from each file, and pasting the columns into a new file), you can use the paste command:
paste -d "," <(cut -d \, -f 1 FILE1) <(cut -d \, -f 2 FILE2) > outfile.csv
The two cut commands run first because of the <() process substitution. They extract column 1 (specified by -f 1, which means "field 1" in cut terminology) from FILE1 and column 2 from FILE2. cut uses the -d option to know which delimiter separates the fields (columns) in the files. In this case, since FILE1 and FILE2 are CSV files, the delimiter is a comma.
The paste command then joins the extracted columns, with a comma between them, so the output file is also a CSV document. The above takes column 1 from FILE1 and column 2 from FILE2 and pastes them side by side in outfile.csv. I used this method extensively to summarize CSV files generated by a research program.
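As a self-contained illustration (the contents of FILE1 and FILE2 here are hypothetical two-column CSV data made up for the demo; note that <() process substitution requires Bash, not plain sh):

```shell
# Hypothetical sample data: two small two-column CSV files.
printf 'x,1\ny,2\n' > FILE1
printf 'p,9\nq,8\n' > FILE2

# Column 1 of FILE1 next to column 2 of FILE2:
paste -d "," <(cut -d , -f 1 FILE1) <(cut -d , -f 2 FILE2) > outfile.csv
cat outfile.csv
```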
To print a specific line range from a file (in this case, lines 2 to 4 of somefile.txt), use the sed utility:
sed -n 2,4p somefile.txt
You can also print multiple line ranges (lines 1 to 2, then line 4):
sed -n -e 1,2p -e 4p somefile.txt
The above will print lines 1 to 2 and then print line 4.
Of course, you can append > outfile.txt to save the output from the above to a file.
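A quick self-contained check of both forms (the five-line file is made up for the demo):

```shell
# A small test file, one numbered line per row.
printf 'line1\nline2\nline3\nline4\nline5\n' > somefile.txt

sed -n 2,4p somefile.txt            # prints line2, line3, line4
sed -n -e 1,2p -e 4p somefile.txt   # prints line1, line2, line4
```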
If a file has lines in the format YYYYMMDD (i.e. dates), you can add delimiters so that the output file has lines in the format YYYY MM DD (using a space as the delimiter). To do this, use the cut utility:
cat textfile.txt | cut -c1-4,5-6,7-8 --output-delimiter=' ' > outfile.txt
Cut will print characters 1 to 4 (corresponding to YYYY), 5 to 6 (MM) and 7 to 8 (DD) with the specified delimiter (in this case, a space) between them.
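For example, with a couple of made-up dates (note that --output-delimiter is a GNU cut option, so this works on Linux and Cygwin but not with BSD cut; also, cut can read the file directly, so the cat is optional):

```shell
# Two sample YYYYMMDD dates, one per line.
printf '20240131\n20231225\n' > textfile.txt

# Split each line into YYYY, MM, DD separated by spaces.
cut -c1-4,5-6,7-8 --output-delimiter=' ' textfile.txt > outfile.txt
cat outfile.txt
```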
The following replaces 'foo' in all files under the current directory (and all subdirectories, recursively) with 'bar':
find ./ -type f -exec sed -i 's/foo/bar/g' {} \;
You can use any valid find option. For example, to limit the replacement to HTML files in the directory:
find ./ -type f -name '*.html' -exec sed -i 's/foo/bar/g' {} \;
I have noticed that sed changes the modification date of all files, even files that do not match. If this matters, I suggest using grep to get the list of files that contain the match and then processing each file in the grep result one by one.
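One way to sketch that grep-then-sed approach (the directory and file names are made up for the demo; -l lists only matching file names, and the GNU -Z / -0 options use NUL separators so unusual file names survive). Only files that actually contain the pattern are rewritten, so the others keep their modification dates:

```shell
mkdir -p demo
printf 'foo here\n'  > demo/match.txt
printf 'no change\n' > demo/other.txt

# grep -r recurses, -l prints matching file names only;
# sed then edits only those files in place.
grep -rlZ 'foo' demo | xargs -0 sed -i 's/foo/bar/g'
```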
You can count the number of lines matching a certain pattern with grep's -c option. To show all lines matching the pattern, followed by the count, use the following:
grep PATTERN file; grep -c PATTERN file
There are other ways of doing this with tools other than grep, but I have not had to use them.
You can store the count to a Bash variable like this:
count=`grep -c PATTERN file`
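For instance (the file contents are made up; note that -c counts matching lines, not total occurrences, so a line containing the pattern twice still counts once). The $(...) form of command substitution is equivalent to the backticks and nests more cleanly:

```shell
# Sample data: two of the three lines contain "apple".
printf 'apple\nbanana\napple pie\n' > file

grep -c apple file           # prints 2 (two lines match)
count=$(grep -c apple file)  # same value, stored in a variable
echo "$count"
```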
The cat command prints the file contents in the normal order. To reverse the order, use the tac command:
tac filename.txt
This will print the lines out one by one, but in reverse order (last line printed first). It does not reverse the lines themselves.
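A quick demonstration with a three-line file (contents made up):

```shell
printf 'first\nsecond\nthird\n' > filename.txt
tac filename.txt    # prints: third, second, first
```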
You can grep for two (or more) words at once. For example, create a file containing:
One
Two
Three
Four
Five
Then use the following grep command (the -E option enables extended regular expressions, which lets you use | as an "or" operator):
grep -E 'One|Two|Three' file.txt
The output is:
One
Two
Three