Bioinformatics/Unix tips


I find myself doing lots of data manipulation in Unix these days and am forever googling how to do the simplest of things (one of my many life ambitions is to be able to tar files without looking up the syntax).


So, I think it's time that I start my own collection of code snippets to help out others…

Finding and replacing text using sed


Adding text before and/or after using awk

Say for example you have a list of samples, generated by using ls in your directory:


sampleA.fq_1 sampleB.fq_1 sampleC.fq_1 … sampleZZ.fq_1

sampleA.fq_2 sampleB.fq_2 sampleC.fq_2 … sampleZZ.fq_2

and you want to use this list to write some analysis code, but you want to strip off the file extension, and precede each entry with “-s” and follow it with a “\”


As a one-liner it looks like this:

ls *.fq_1 | sed 's/\.fq_1//g' | awk '{printf("-s %s \\\n",$0)}' > sample_list_formatted


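Let's break it down:

The first piece of the pipeline lists (ls) the files that match my criteria (names ending in .fq_1). Matching only the _1 files means each sample appears once rather than twice. When its output goes into a pipe, ls prints one name per line.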

sampleA.fq_1 sampleB.fq_1 sampleC.fq_1 … sampleZZ.fq_1

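This is passed to the second command, following the pipe character (|), which uses sed to strip off the extension using sed 's/findthis/replacewith/g'

where s stands for “substitute” and g for “global”, meaning every match on a line rather than just the first. Note that an unescaped . in the pattern matches any character, so \.fq_1 is the stricter way to match a literal dot.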

sampleA sampleB sampleC … sampleZZ
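The substitution step can be tried on its own. This is a minimal sketch in which printf stands in for ls, and the file names and the a-a-a string are made up purely for illustration:

```shell
# Stand-in for `ls *.fq_1`: three hypothetical file names, one per line.
names=$(printf '%s\n' sampleA.fq_1 sampleB.fq_1 sampleC.fq_1)

# Strip the extension. The dot is escaped so it matches a literal "." only.
stripped=$(printf '%s\n' "$names" | sed 's/\.fq_1//g')
printf '%s\n' "$stripped"

# Without g, only the first match on each line is replaced.
first_only=$(echo 'a-a-a' | sed 's/a/b/')
# With g, every match on the line is replaced.
every_one=$(echo 'a-a-a' | sed 's/a/b/g')
echo "$first_only"
echo "$every_one"
```

For the file names here the g makes no difference, since the extension occurs once per line, but it matters as soon as a pattern can match more than once on a line.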

The output of this is again passed through a pipe (|) to awk, which adds the text “-s” to the front of each element and a “\” to the end of each element, followed by a line break. The syntax for this is quite cryptic, so let's go through it. The awk command uses printf, which takes a format string, fills in the values that follow it, and prints the result. In this case it prints “-s” before the string (%s) and a backslash (\) after it. The spaces in the format string are also included in the output. (NOTE: the backslash is escaped by a second backslash because backslashes are special characters, but more on this in a different post.) The \n ends each formatted element with a newline; unlike print, printf does not add one itself.

awk runs the block in braces once for every line of its input (in this case the lines arriving through the pipe), and $0 is shorthand for “the entire current line”. If we had a multi-column file we could process individual columns using $1, or $2, or $whatever, to apply the formatting to that column only.

-s sampleA \

-s sampleB \

-s sampleC \

-s sampleZZ \
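To see the difference between the whole line and a single column, here is a small sketch using a two-column input. The second column of numbers is invented for illustration and is not part of the original sample list:

```shell
# Two hypothetical input lines, each with two whitespace-separated columns.
input=$(printf '%s\n' 'sampleA 100' 'sampleB 250')

# $0 is the whole current line; awk runs the block once per input line.
whole=$(printf '%s\n' "$input" | awk '{printf("-s %s \\\n", $0)}')

# $1 is only the first column of the current line.
first=$(printf '%s\n' "$input" | awk '{printf("-s %s \\\n", $1)}')

printf '%s\n' "$whole"
printf '%s\n' "$first"
```

With $0 the numbers stay in the output; with $1 only the sample names are kept.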

Finally, this output is written to a file using (>) and a filename. Omitting this would direct the output to the screen instead. I often omit it in the first instance and replace it with another pipe and head, just to see what the output will look like before writing a file.

ls *.fq_1 | sed 's/\.fq_1//g' | awk '{printf("-s %s \\\n",$0)}' | head
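If you want to try the whole pipeline without touching real data, something like the following should work. It is only a sketch: the throwaway directory and the three empty sample files are assumptions made for the demonstration.

```shell
# Work in a throwaway directory so no real data is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Create empty, hypothetical paired files for three samples.
touch sampleA.fq_1 sampleA.fq_2 sampleB.fq_1 sampleB.fq_2 sampleC.fq_1 sampleC.fq_2

# Run the pipeline, writing the formatted list to a file.
ls *.fq_1 | sed 's/\.fq_1//g' | awk '{printf("-s %s \\\n",$0)}' > sample_list_formatted

# Keep a copy of the result, show it, then clean up.
formatted=$(cat sample_list_formatted)
printf '%s\n' "$formatted"
cd / && rm -r "$tmpdir"
```

Each sample should appear once, prefixed with -s and ending in a backslash with nothing after it.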

