This is a small collection of tips for processing text data with the Bash shell and some common utilities. I used these to process numerical results that my research code wrote to text files. Automating this processing saved me a lot of time when preparing and condensing data for publication and thesis writing. Linux users should already have most, if not all, of the required utilities installed. Windows users should install Cygwin, which provides a Bash shell and all the necessary utilities.
Windows and Unix text files use different line endings (CRLF versus LF), so tools can misbehave when a file's line endings are not what they expect. The Cygwin tools expect and produce Unix line endings. To convert between Unix and Windows (DOS) line endings, use dos2unix and unix2dos, both of which come with Cygwin. A related tool, iconv, converts between character encodings (for example, Latin-1 to UTF-8); it does not change line endings.
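As a rough sketch of how to check for Windows line endings before converting, the following counts the lines that end in a carriage return, converts the file, and counts again. The file name results.txt and its contents are made up for illustration, and the fallback to tr is my own addition for systems where dos2unix is not installed.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample file with Windows (CRLF) line endings.
printf '1.5\r\n2.5\r\n' > "$workdir/results.txt"

# Count the lines that end in a carriage return.
cr=$(printf '\r')
crlf_before=$(grep -c "${cr}\$" "$workdir/results.txt")

# Convert to Unix line endings: prefer dos2unix, otherwise strip CRs with tr.
if command -v dos2unix >/dev/null 2>&1; then
    dos2unix "$workdir/results.txt"
else
    tr -d '\r' < "$workdir/results.txt" > "$workdir/results.tmp"
    mv "$workdir/results.tmp" "$workdir/results.txt"
fi

crlf_after=$(grep -c "${cr}\$" "$workdir/results.txt")
echo "CRLF lines before: $crlf_before, after: $crlf_after"
```

Note that the tr fallback removes every carriage return in the file, not only those at the end of a line, which is usually fine for plain numerical data.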
If you have two files with the same number of lines and you want to concatenate them line by line, you can use the paste command. For example, suppose you have files a.txt and b.txt, where a.txt contains:
1
2
and b.txt contains:
a
b
and you want to join files a.txt and b.txt line by line like the following (the lines from each file are separated by a tab character):
1	a
2	b
Then use the following command:
paste a.txt b.txt > c.txt
You can specify a delimiter to use between the lines of each file:
paste -d "," a.txt b.txt > c.txt
This will produce:
1,a
2,b
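To try both forms of paste without touching real data, here is a small self-contained sketch. It recreates a.txt and b.txt in a scratch directory; the output names tab.txt and comma.txt are arbitrary choices for the demonstration.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Recreate the two example files, one value per line.
printf '1\n2\n' > a.txt
printf 'a\nb\n' > b.txt

# Default delimiter is a tab; -d selects a different one.
paste a.txt b.txt > tab.txt
paste -d "," a.txt b.txt > comma.txt

cat comma.txt
```

If the files have different numbers of lines, paste does not fail; it treats the missing lines of the shorter file as empty.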
If you have two comma-separated value (CSV) files and you want to merge specific columns from each into a new file line by line (the equivalent of opening both files in Excel or another spreadsheet program, copying the desired column from each, and pasting them into a new file), you can combine the paste and cut commands:
paste -d "," <(cut -d \, -f 1 FILE1) <(cut -d \, -f 2 FILE2) > outfile.csv
The <( ) syntax is Bash process substitution: each cut command runs and its output is handed to paste as if it were a file. The cut commands extract column 1 (specified by -f 1, which means “field 1” in cut terminology) from FILE1 and column 2 from FILE2. The -d option tells cut which delimiter separates the fields (columns). In this case, since FILE1 and FILE2 are CSV files, the delimiter is a comma.
The paste command then pastes the extracted columns together, with a comma between the columns. The output file is therefore also a CSV document. The above takes column 1 from FILE1 and column 2 from FILE2 and pastes them side by side in outfile.csv. I used this method extensively to summarize CSV files that were generated by a research program.
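The column merge can be tried end to end with the sketch below. The file names file1.csv and file2.csv and their contents are invented for the demonstration, and the script needs Bash because plain sh does not support process substitution.

```shell
#!/usr/bin/env bash
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Two hypothetical CSV files with the same number of rows.
printf 'run1,0.10,0.20\nrun2,0.30,0.40\n' > "$workdir/file1.csv"
printf 'run1,5,6\nrun2,7,8\n' > "$workdir/file2.csv"

# Column 1 of the first file next to column 2 of the second file.
paste -d "," \
    <(cut -d , -f 1 "$workdir/file1.csv") \
    <(cut -d , -f 2 "$workdir/file2.csv") \
    > "$workdir/outfile.csv"

cat "$workdir/outfile.csv"
```

Keep in mind that cut splits on every comma, so this approach only suits simple CSV files without quoted fields that contain commas.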
To print a specific line range from a file (in this case, print lines 2 to 4 from somefile.txt), use the sed utility:
sed -n 2,4p somefile.txt
You can also print multiple line ranges (lines 1 to 2, then line 4):
sed -n -e 1,2p -e 4p somefile.txt
Lines are printed in the order they appear in the file, regardless of the order of the -e expressions.
Of course you can append > outfile.txt to save the output from the above to a file.
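Here is a short sketch of the line-range commands run against a throwaway five-line file. The file contents are invented, and the last command (quitting after line 4 with q) is an optional refinement of my own that saves time on large files by not reading the rest of the input.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical five-line input file.
printf 'line1\nline2\nline3\nline4\nline5\n' > "$workdir/somefile.txt"

# Lines 2 to 4.
range=$(sed -n '2,4p' "$workdir/somefile.txt")

# Lines 1 to 2, then line 4.
multi=$(sed -n -e '1,2p' -e '4p' "$workdir/somefile.txt")

# Same as the first command, but stop reading once line 4 has been printed.
early=$(sed -n '2,4p;4q' "$workdir/somefile.txt")

echo "$range"
echo "$multi"
```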
If a file has lines that are of the format YYYYMMDD (i.e. dates), you can add delimiters so that the output file has lines with the format YYYY MM DD (using a space as a delimiter). To do this, use the cut utility:
cut -c1-4,5-6,7-8 --output-delimiter=' ' textfile.txt > outfile.txt
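cut prints characters 1 to 4 (corresponding to YYYY), 5 to 6 (MM) and 7 to 8 (DD) with the specified delimiter (in this case, a space) between them. The --output-delimiter option is a GNU extension, so it may not be available in other implementations of cut.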
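If cut on your system lacks the --output-delimiter option, the same date splitting can be done with sed instead. This is an alternative technique rather than the cut method described above, and the sample dates are invented.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical input: one YYYYMMDD date per line.
printf '20230115\n20231231\n' > "$workdir/textfile.txt"

# Capture year, month and day as groups and re-emit them with a separator.
# Lines that are not exactly eight digits pass through unchanged.
spaced=$(sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})$/\1 \2 \3/' "$workdir/textfile.txt")
dashed=$(sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})$/\1-\2-\3/' "$workdir/textfile.txt")

echo "$spaced"
echo "$dashed"
```

Because the separator is just part of the replacement text, switching from spaces to dashes or slashes only requires changing that one string.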
The following replaces 'foo' in all files under the current directory (and all subdirectories, recursively) with 'bar':
find ./ -type f -exec sed -i 's/foo/bar/g' {} \;
You can use any valid option with find. For example, to limit the replacement to HTML files in the directory:
find ./ -type f -name '*.html' -exec sed -i 's/foo/bar/g' {} \;
I have noticed that sed -i changes the modification date of every file it processes, even files that contain no match, because it rewrites each file. If this matters, I suggest using grep -l to list only the files that contain the match and then running sed on just those files.
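One possible way to restrict the replacement to files that actually contain the pattern is sketched below. It assumes the GNU versions of grep, xargs and sed (the -Z, -0, -r and -i options used here), and the file names are invented for the demonstration.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical files: two contain 'foo', one does not.
mkdir "$workdir/sub"
printf 'foo here\n' > "$workdir/match.txt"
printf 'foo foo\n' > "$workdir/sub/deep.txt"
printf 'nothing here\n' > "$workdir/other.txt"

# grep -r searches recursively, -l prints only the names of matching files,
# and -Z separates the names with NUL bytes so odd file names are safe.
# xargs -0 reads that list; -r skips running sed when the list is empty.
grep -rlZ 'foo' "$workdir" | xargs -0 -r sed -i 's/foo/bar/g'

cat "$workdir/match.txt" "$workdir/sub/deep.txt" "$workdir/other.txt"
```

Since sed is never run on other.txt, its modification date stays as it was.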
You can count the number of lines matching a certain pattern with the -c option in grep. To show all lines matching the pattern, followed by the count, use the following:
grep PATTERN file; grep -c PATTERN file
There are other ways of doing this with tools other than grep, but I have not had to use them.
You can store the count to a Bash variable like this:
count=$(grep -c PATTERN file)
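A small sketch of using the stored count follows, with an invented log file. It also shows one thing that is easy to miss: grep -c counts matching lines, not individual occurrences. Counting occurrences with grep -o piped to wc -l is an addition of mine, not something from the tips above.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical log: the first line contains the pattern twice.
printf 'PASS PASS\nFAIL\nPASS\n' > "$workdir/log.txt"

# Number of lines that contain the pattern.
line_count=$(grep -c 'PASS' "$workdir/log.txt")

# Number of individual occurrences: -o prints each match on its own line.
match_count=$(grep -o 'PASS' "$workdir/log.txt" | wc -l)

echo "Matching lines: $line_count"
echo "Total matches: $match_count"
```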
The cat command will print the file contents in the normal order. To reverse the order, use the tac command:
tac filename.txt
This will print the lines out one by one, but in reverse order (last line printed first). It does not reverse the lines themselves.
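To make the distinction concrete, this sketch compares tac with rev, which reverses the characters within each line instead. The rev command is not mentioned above; I am assuming it is installed, which is normally the case on Linux and Cygwin. The file contents are invented.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical three-line file.
printf 'abc\ndef\nghi\n' > "$workdir/filename.txt"

# tac reverses the order of the lines.
reversed_order=$(tac "$workdir/filename.txt")

# rev keeps the line order but reverses the characters in each line.
reversed_chars=$(rev "$workdir/filename.txt")

echo "$reversed_order"
echo "$reversed_chars"
```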
You can grep for two (or more) words at once. For example, create a file containing:
One
Two
Three
Four
Five
Then use the following grep command (the -E option enables extended regular expressions, which lets you use the | as an “or” operator):
grep -E 'One|Two|Three' file.txt
The output is:
One
Two
Three
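The multi-word search can be reproduced with the sketch below, which builds the five-line file in a scratch directory. It also shows an equivalent form using repeated -e options, which avoids extended regular expressions; that variant is my addition.

```shell
# Work in a scratch directory that is removed when the script exits.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# One word per line, as in the example file.
printf 'One\nTwo\nThree\nFour\nFive\n' > "$workdir/file.txt"

# Extended regular expression with | as the "or" operator.
with_regex=$(grep -E 'One|Two|Three' "$workdir/file.txt")

# Equivalent: give each pattern its own -e option.
with_patterns=$(grep -e 'One' -e 'Two' -e 'Three' "$workdir/file.txt")

echo "$with_regex"
```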
Discussion
File1:
X
Y
Z
File2:
A,B,C,D
X,E,F,L
Y,T,I,o
I need output like:
X,F
Y,I
Please help and revert urgently!!
I was wondering if you could assist me in using this command a little more.
I need to search a particular directory and its subdirectories for all files beginning with "A", then concatenate all the files' contents into one file.
Any guidance as to how this can be accomplished?