SLIDE 5 awk
- Conditional statements:
Display expression levels for the gene NANOG:
$ awk '{ if (/NANOG/) print $0 }' foo.txt
$ awk '/NANOG/ { print $0 }' foo.txt
$ awk '/NANOG/' foo.txt
Add line number to the above output:
$ awk '/NANOG/ { print NR"\t"$0 }' foo.txt
NR: line number of the current row
- Looping:
Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript:
$ awk '{ total = $4 + $5 + $6; avg = total/3; print $0"\t"avg }' foo.txt
$ awk '{ total = 0; for (i=4; i<=6; i++) total = total + $i; avg = total/3; print $0"\t"avg }' foo.txt
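The conditional and looping forms above can be tried on a small made-up table. This is a minimal sketch: the file name demo_expr.txt, its layout (three annotation columns followed by three expression values), and the numbers are all assumptions for illustration, not data from the slides.

```shell
# Build a tiny tab-separated table (hypothetical values):
# columns 1-3 are annotation, columns 4-6 are expression levels.
printf 'T1\tNANOG\tchr12\t10\t20\t30\n'  > demo_expr.txt
printf 'T2\tSOX2\tchr3\t1\t2\t6\n'      >> demo_expr.txt

# Conditional: keep only NANOG rows, prefixed with the line number (NR).
matched=$(awk '/NANOG/ { print NR "\t" $0 }' demo_expr.txt)

# Looping: sum fields 4..6 for each row, then print the gene name
# (field 2) together with the average of those three fields.
averaged=$(awk -F'\t' '{ total = 0
                         for (i = 4; i <= 6; i++) total += $i
                         print $2 "\t" total / 3 }' demo_expr.txt)

echo "$matched"
echo "$averaged"
```

Changing the loop bounds (4 and 6 here) is all that is needed to average a different range of columns, which is the main advantage of the loop over writing out $4 + $5 + $6 by hand.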
bioawk*
- Extension of awk for commonly used file formats in bioinformatics
$ bioawk -c help
bed:
    1:chrom 2:start 3:end 4:name 5:score 6:strand 7:thickstart 8:thickend 9:rgb 10:blockcount 11:blocksizes 12:blockstarts
sam:
    1:qname 2:flag 3:rname 4:pos 5:mapq 6:cigar 7:rnext 8:pnext 9:tlen 10:seq 11:qual
vcf:
    1:chrom 2:pos 3:id 4:ref 5:alt 6:qual 7:filter 8:info
gff:
    1:seqname 2:source 3:feature 4:start 5:end 6:score 7:filter 8:strand 9:group 10:attribute
fastx:
    1:name 2:seq 3:qual 4:comment
*https://github.com/lh3/bioawk
bioawk: Examples
- Print transcript info and chr from a gff/gtf file (2 ways)
$ bioawk -c gff '{print $group "\t" $seqname}' Homo_sapiens.GRCh37.75.canonical.gtf
$ bioawk -c gff '{print $9 "\t" $1}' Homo_sapiens.GRCh37.75.canonical.gtf
Sample output:
gene_id "ENSG00000223972"; transcript_id "ENST00000518655";    chr1
gene_id "ENSG00000223972"; transcript_id "ENST00000515242";    chr1
- Convert a fastq file into fasta (2 ways)
$ bioawk -c fastx '{print ">" $name "\n" $seq}' sequences.fastq
$ bioawk -c fastx '{print ">" $1 "\n" $2}' sequences.fastq
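Where bioawk is not installed, the same conversion can be approximated in plain awk. This is a rough sketch, not bioawk's parser: it assumes strict four-line FASTQ records (no wrapped sequences), and the file name demo.fastq and its two reads are made up for illustration.

```shell
# Write a two-read FASTQ file (hypothetical reads).
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > demo.fastq

# Line 1 of each 4-line record is the header: swap the leading "@" for ">".
# Line 2 is the sequence: print it unchanged. Lines 3 and 4 are dropped.
fasta=$(awk 'NR % 4 == 1 { print ">" substr($0, 2) }
             NR % 4 == 2 { print }' demo.fastq)

echo "$fasta"
```

Unlike $name in bioawk, substr($0, 2) keeps the whole header line, including any comment after the read name.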
Summarize by Columns:
groupBy (from bedtools)
- Option -g: column(s) for grouping
- Option -c: column(s) to be summarized (opCol)
- Operation(s) applied to opCol (option -o):
sum, count, min, max, mean, median, stdev, collapse (comma-separated list), distinct (non-redundant comma-separated list)
Print the gene ID (1st column), the gene symbol (3rd column), and a list of transcript IDs (2nd column):
$ sort -k1,1 Ensembl_info.txt | groupBy -g 1 -c 3,2 -o distinct,collapse

Input:
!Ensembl Gene ID   !Ensembl Transcript ID   !Symbol
ENSG00000281518    ENST00000627423          FOXO6
ENSG00000281518    ENST00000630406          FOXO6
ENSG00000280680    ENST00000625523          HHAT
ENSG00000280680    ENST00000627903          HHAT
ENSG00000280680    ENST00000626327          HHAT
ENSG00000281614    ENST00000629761          INPP5D
ENSG00000281614    ENST00000630338          INPP5D

Partial output:
!Ensembl Gene ID   !Symbol   !Ensembl Transcript ID
ENSG00000281518    FOXO6     ENST00000627423,ENST00000630406
ENSG00000280680    HHAT      ENST00000625523,ENST00000626327,ENST00000627903
Input file must be pre-sorted by grouping column(s)!
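To see what the grouping step computes, and why sorted input matters, the groupBy command above can be imitated in plain awk. This is only an illustrative sketch, not how bedtools implements groupBy. The file name demo_ensembl.txt is an assumption, the rows are a subset of the input shown above (header omitted), and the sketch assumes each gene ID has exactly one symbol.

```shell
# Three-column, tab-separated input: gene ID, transcript ID, symbol.
# Rows are deliberately interleaved so that sorting is required.
printf 'ENSG00000281518\tENST00000627423\tFOXO6\n'  > demo_ensembl.txt
printf 'ENSG00000280680\tENST00000625523\tHHAT\n'  >> demo_ensembl.txt
printf 'ENSG00000281518\tENST00000630406\tFOXO6\n' >> demo_ensembl.txt
printf 'ENSG00000280680\tENST00000627903\tHHAT\n'  >> demo_ensembl.txt

# Sort by column 1 so rows of the same gene are adjacent, then walk the rows:
# a change in column 1 closes the current group and starts a new one.
grouped=$(sort -k1,1 -k2,2 demo_ensembl.txt | awk -F'\t' '
    $1 != gene {                      # first row of a new group
        if (NR > 1) print gene "\t" symbol "\t" ids
        gene = $1; symbol = $3; ids = $2
        next
    }
    { ids = ids "," $2 }              # same group: collapse transcript IDs
    END { if (NR > 0) print gene "\t" symbol "\t" ids }
')

echo "$grouped"
```

Because the awk program only compares each row with the previous one, running it on the unsorted file would report each gene twice. That is the same reason groupBy needs its input pre-sorted by the grouping column(s).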