Awk, Awk: Pattern matching and processing language



SLIDE 1

CSC209 Fall 2001 Karen Reid 1

Awk, Awk

What is AWK?

  • Pattern matching and processing language
  • Looks for pattern in file
  • If pattern matches, do something
  • Many details handled automatically
  • Very easy to write one-off programs (write and throw away)

What’s it good for?

  • data manipulation (omitting part of file, counting occurrences)
  • rapid prototyping
  • converting file formats

Features

  • awk is data-driven as opposed to procedural
  • This means you think about the format of the data you’re trying to manipulate vs. what to do
  • Highly automated (record retrieval, breakdown into fields, type conversion)
  • no variable declarations
  • usual programming constructs

SLIDE 2

History

  • The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan
  • created in 1977 at AT&T Bell Labs
  • 1985: nawk
  • many versions: awk, nawk, POSIX awk, gawk (gawk is on cdf)

The commandline

  • Use as part of a pipeline
  • For simple things, specify the program directly on the command line:

    example: ls -l | awk '{ print $2 }' prints the second column

  • for more complex things, put the program in a file and use -f

    example: awk -f myscript inputfile | lp

  • awk reads from stdin (when no input file is given) and prints to stdout
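As a rough sketch of the two invocation styles above, the following runs the same one-line program first inline and then from a file with -f. The sample data and the temporary script file are made up for illustration; they are not part of the course files.

```shell
# Hypothetical sample data: a name and two numbers per line.
data='alice 10 12
bob 20 18'

# Style 1: program given directly on the command line.
inline=$(printf '%s\n' "$data" | awk '{ print $2 }')

# Style 2: the same program stored in a file and run with -f.
script=$(mktemp)
printf '%s\n' '{ print $2 }' > "$script"
from_file=$(printf '%s\n' "$data" | awk -f "$script")
rm -f "$script"

echo "$inline"
echo "$from_file"
```

Both commands should print the second column, since only the way the program text reaches awk differs.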

Patterns and Actions

  • 3 main blocks: BEGIN, processing block, END
  • 2 parts to statements: pattern and action
  • patterns tell awk what to match
  • actions tell awk what to do if there is a match
  • can omit either one, but not both
  • no pattern = match everything
  • no action = print

Example:

    seawolf:~% head -5 file1
    Baker, Chase 29 GMUP 56.28 57.79
    Frohlich, Jon 29 UTAH 49.10 49.20
    Kittredge, Brad 25 TOC 45.05 46.22
    Liggett, Michael 27 DYNA 47.25 48.12
    Linderman, Ross 25 PNA 52.55 51.17
    seawolf:~% head -5 file1 | awk '/^Kit/ {print $0}'
    Kittredge, Brad 25 TOC 45.05 46.22
    seawolf:~% head -5 file1 | awk '/.*/'
    Baker, Chase 29 GMUP 56.28 57.79
    Frohlich, Jon 29 UTAH 49.10 49.20
    Kittredge, Brad 25 TOC 45.05 46.22
    Liggett, Michael 27 DYNA 47.25 48.12
    Linderman, Ross 25 PNA 52.55 51.17
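The rule that either the pattern or the action may be omitted can be sketched as follows. The two input lines are invented sample data in the same shape as file1.

```shell
data='Baker, Chase 29
Kittredge, Brad 25'

# Pattern only: the default action prints the whole matching record.
pattern_only=$(printf '%s\n' "$data" | awk '/^Kit/')

# Action only: the action runs for every record.
action_only=$(printf '%s\n' "$data" | awk '{ print $2 }')

echo "$pattern_only"
echo "$action_only"
```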

SLIDE 3

Input

  • awk works with records; a record defaults to a single line
  • reading records is automatic, no read statement
  • next tells awk to stop processing the current record and move on to the next one
  • exit stops reading input and causes the program to go to the END block
  • exit within END causes awk to quit

    seawolf:~% head -5 file1
    Baker, Chase 29 GMUP 56.28 57.79
    Frohlich, Jon 29 UTAH 49.10 49.20
    …
    seawolf:~% cat testawk
    /.*/ {print $1}
    /.*/ {print $2}
    seawolf:~% head -5 file1 | awk -f testawk
    Baker,
    Chase
    Frohlich,
    Jon
    …
    seawolf:~% cat testawk2
    /.*/ {print $1; next}
    /.*/ {print $2}
    seawolf:~% head -5 file1 | awk -f testawk2
    Baker,
    Frohlich,
    …
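A small sketch of how next and exit interact with the END block, using made-up numeric input rather than the swim data:

```shell
out=$(printf '1\n2\n3\n4\n' | awk '
$1 == 2 { next }          # skip record 2; later rules never see it
$1 == 4 { exit }          # stop reading; control passes to END
        { print "saw", $1 }
END     { print "done after", NR, "records" }')

echo "$out"
```

Record 2 is skipped by next, and record 4 triggers exit, so the last rule only ever prints records 1 and 3 before END runs.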

Fields

  • a field is a group of characters separated by the field separator
  • the variable FS holds the field separator
  • set it with BEGIN { FS = "," } or change it with -Fchar on the command line
  • likewise, OFS holds the output field separator
  • predefined: $1 = 1st field, $2 = 2nd field etc.
  • $0 = entire record (line)

Example:

    seawolf:~% head -5 file1
    Baker, Chase 29 GMUP 56.28 57.79
    Frohlich, Jon 29 UTAH 49.10 49.20
    Kittredge, Brad 25 TOC 45.05 46.22
    Liggett, Michael 27 DYNA 47.25 48.12
    Linderman, Ross 25 PNA 52.55 51.17
    seawolf:~% head -5 file1 | awk -F, '{print $2}'
    Chase 29 GMUP 56.28 57.79
    Jon 29 UTAH 49.10 49.20
    Brad 25 TOC 45.05 46.22
    Michael 27 DYNA 47.25 48.12
    Ross 25 PNA 52.55 51.17
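To illustrate FS and OFS together, here is a minimal sketch that reads comma-separated input and writes it back with a different separator. The input lines and the " | " separator are arbitrary choices for the example.

```shell
out=$(printf 'Baker,Chase,29\nFrohlich,Jon,29\n' |
  awk 'BEGIN { FS = ","; OFS = " | " } { print $2, $1 }')

echo "$out"
```

The comma in the print statement is what inserts OFS between the two fields.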

SLIDE 4

format

  • mostly free format
  • if more than 1 statement per line, use ; to separate
  • good idea to just always use ; (like C)
  • at least opening { of action must be on the same line as pattern
  • comments use #

Patterns

6 types in total:

  • BEGIN
  • END
  • Expressions
  • String Patterns
  • Range Patterns
  • Compound Patterns

BEGIN and END

  • BEGIN always matches before 1st input record
  • used to initialize variables
  • must be 1st pattern if used (some versions)
  • END always matches after last input record is read
  • use it for things like printing totals
  • must be last pattern if used

    seawolf:~% cat swimresults | wc -l
    15
    seawolf:~% awk 'END { print("Total lines",NR); }' swimresults
    Total lines 15
    seawolf:~% head -5 swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    Baker, Chase 29 GMUP 56.28 57.79
    ...
    seawolf:~% cat testawk3
    BEGIN { OFS=","; print "First Name","Last Name"; print "----------","---------"; }
    { print $2,$1;}
    seawolf:~% head -5 swimresults | sed -e 's/,/ /g' | awk -f testawk3
    First Name,Last Name
    ----------,---------
    Jeffrey,Stanford
    Michael,Liggett
    Chase,Baker
    ...
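The typical division of labour between BEGIN, the processing block, and END (initialize, accumulate, report) might look like this on a tiny made-up input:

```shell
out=$(printf '3\n4\n5\n' | awk '
BEGIN { total = 0; print "start" }     # runs before any input is read
      { total += $1 }                  # runs once per record
END   { print "records:", NR, "total:", total }')

echo "$out"
```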

SLIDE 5

Expressions

  • an expression is an operator together with its operands
  • can compare both numbers and strings
  • type conversion is automatic
  • type of operand depends on operator

    Operator   Meaning
    ==         is equal to
    <          less than
    >          greater than
    <=         less than or equal to
    >=         greater than or equal to
    !=         not equal to
    ~          matched by
    !~         not matched by

Automatic type conversion

When using a comparison operator:

  • if both operands are numbers, then they will be compared numerically
  • if both are strings, they are compared in collation order
  • if one is a number while the other is a string, they are treated as if both are strings

    $ cat awktest1
    $6 < $5 {print $1,$2,$5,$6;}
    $ cat swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    Baker, Chase 29 GMUP 56.28 57.79
    ...
    $ awk -f awktest1 swimresults
    Stanford, Jeffrey 47.07 46.32
    ...
    $ cat awktest2
    $1 > "P" {print $1,$2;}
    $ awk -f awktest2 swimresults
    Stanford, Jeffrey
    Richner, Thomas
    ...

  • $3 > "A" {print $0} prints nothing. Why?
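The three comparison cases can be tried out with a short sketch like the one below. The input "10 9" is invented, LC_ALL=C is set here only to pin down the collation order, and appending "" is one common way to force a field to be treated as a string.

```shell
out=$(printf '10 9\n' | LC_ALL=C awk '{
  if ($1 > $2)             print "numeric: 10 > 9"            # both fields look numeric
  if (($1 "") < ($2 ""))   print "string: 10 sorts before 9"  # both forced to strings
  if ($1 < "A")            print "mixed: 10 sorts before A"   # number vs string constant
}')

echo "$out"
```

All three tests are expected to succeed: 10 is larger than 9 as a number, but the string "10" sorts before both "9" and "A" because comparison goes character by character.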

String Matching

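3 forms:

  • /string/ - matches if string occurs anywhere in the record
  • expression ~ /string/ - matches if string occurs in the expression (for example a single field)
  • expression !~ /string/ - matches if string does not occur in the expression
  • ~ and !~ can therefore deal with a more specific scope than the whole record
  • e.g. $1 ~ /ttt*/ matches

    Liggett, Michael 27 DYNA 47.25 48.12
    Kittredge, Brad 25 TOC 45.05 46.22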
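The difference in scope between a bare /string/ pattern and the ~ and !~ operators can be sketched like this. The second input line is invented so that "tt" appears in the record but not in the first field.

```shell
out=$(printf 'Liggett, Michael 27\nBaker, Scott 29\n' | awk '
/tt/        { print "record match:", $1 }       # looks at the whole record
$1 ~ /tt/   { print "field 1 match:", $1 }      # looks at field 1 only
$1 !~ /tt/  { print "field 1 no match:", $1 }')

echo "$out"
```

The second record matches /tt/ because of "Scott", even though its first field does not contain "tt".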

SLIDE 6

Range Patterns

  • 2 patterns separated by a comma (,)
  • action is performed for all lines between the 1st occurrence of the 1st pattern and the 1st occurrence of the second pattern (both inclusive)
  • if the 2nd pattern is never matched, matches everything from the 1st occurrence on

Example:

    $ cat awktest3
    $1 ~ /^B.*/, $1~/^M.*/ { print $0}
    $ sort swimresults |awk -f awktest3
    Baker, Chase 29 GMUP 56.28 57.79
    Frohlich, Jon 29 UTAH 49.10 49.20
    Kittredge, Brad 25 TOC 45.05 46.22
    Liggett, Michael 27 DYNA 47.25 48.12
    Linderman, Ross 25 PNA 52.55 51.17
    McCormick, Aaron 27 RMM 49.00 49.30
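A stripped-down sketch of a range pattern, with made-up START and STOP marker lines standing in for the two patterns:

```shell
out=$(printf 'a\nSTART\nb\nSTOP\nc\n' | awk '/START/,/STOP/')

echo "$out"
```

The lines that match the first and second pattern are themselves part of the range, so both markers are printed along with the line between them.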

Compound patterns

  • can use logical operators to combine patterns
  • !, ||, &&
  • example:

    $ awk '$6 < 47 && $3 >=28 {print $0}' swimresults
    Wanie, Lee 28 TOC 46.00 46.39
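As a further sketch of combining patterns, the following uses all three logical operators on invented three-column data (name, age, time):

```shell
out=$(printf 'ann 30 45.5\nbob 25 46.0\ncy 31 50.2\n' | awk '
$2 >= 28 && $3 < 47     { print $1 }               # both conditions must hold
!($2 >= 28) || $3 > 50  { print $1, "excluded" }   # either condition is enough
')

echo "$out"
```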

Actions

  • Tells awk what to do when a pattern is found
  • surrounded by {}
  • Includes: variables, loops, data structures (arrays)

SLIDE 7

Variables

  • 3 types: user defined, field variables, predefined
  • no declaration: "you use it, therefore it is"
  • auto-initialized to 0 (or the empty string), but always initialize anyway
  • predefined variables are all uppercase
  • case sensitive
  • no type declaration, conversion is automatic
  • if conversion to a number fails, gives value of 0

Example:

  • calculate the average final swim time (recall 6th column was final time):

    BEGIN { totalTime=0; }
    { totalTime+=$6 }
    END { print "average time:",totalTime/NR; }

    $ awk -f avgtime.awk swimresults
    average time: 50.7953
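The points about automatic initialization and conversion can be checked with a small sketch. The variable names and the input line are made up for the example.

```shell
out=$(printf '3 apples\n' | awk '{
  count += $1                # count was never declared; it starts out as 0
  bad = $2 + 0               # "apples" is not numeric, so it converts to 0
  label = count " items"     # a number is converted to a string by concatenation
  print count, bad, label
}')

echo "$out"
```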

Built in variables (awk)

  • FILENAME - name of current input file
  • FS - field separator, defaults to space
  • OFS - output field separator, default space
  • ORS - output record separator, default newline
  • NR - number of records read thus far
  • NF - number of fields in the current record

Example

    $ cat starlight
    Star light, star bright,
    First star I see tonight,
    I wish I may, I wish I might,
    Get to play Halflife2 in the coming nights
    $ awk '{ print "line",NR,NF,"words:",$0}' starlight
    line 1 4 words: Star light, star bright,
    line 2 5 words: First star I see tonight,
    line 3 8 words: I wish I may, I wish I might,
    line 4 8 words: Get to play Halflife2 in the coming nights
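A short sketch of NR, NF and ORS working together on invented input. It also uses $NF, which is the last field of the record because NF holds the field count.

```shell
out=$(printf 'a b c\nd e\n' | awk 'BEGIN { ORS = ";" } { print NR, NF, $NF }')

echo "$out"
```

Because ORS is set to ";", each print ends with a semicolon instead of a newline, so the two records come out on one line.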

SLIDE 8

Built-in Variables

  • RS - record separator, default to new line (Nawk)
  • ARGC - number of commandline args (Nawk)
  • ARGV - the actual commandline args
  • ENVIRON - environment variables (POSIX Awk)
  • IGNORECASE - regular expressions become case insensitive (gawk)

Example:

    $ awk 'BEGIN { print "Num args=",ARGC,"args=",ARGV[0],ARGV[1],ARGV[2];}' starlight blah
    Num args= 3 args= awk starlight blah
    $ awk 'BEGIN { print "HOMEDIR=",ENVIRON["HOME"];}'
    HOMEDIR= /home/ken
    $ awk '/baker/ {print $0};' swimresults
    $ awk 'BEGIN {IGNORECASE=1;}; /baker/ {print $0};' swimresults
    Baker, Chase 29 GMUP 56.28 57.79
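ARGC, ARGV and ENVIRON can be explored without any input file, since a program that consists only of a BEGIN block never reads its file arguments. In this sketch the arguments "first" and "second" and the GREETING environment variable are hypothetical names chosen for the example.

```shell
out=$(GREETING=hello awk 'BEGIN {
  print ARGC                                   # counts the command name plus arguments
  for (i = 1; i < ARGC; i++) print i, ARGV[i]
  print ENVIRON["GREETING"]
}' first second)

echo "$out"
```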

Conditions - if

  • if statement is very much like C

    #if.awk
    {
      if($6<49) {
        print($1,$2,"fast");
      } else {
        print($1,$2,"slow");
      }
    }

    $ head -3 swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    Baker, Chase 29 GMUP 56.28 57.79
    $ head -3 swimresults | awk -f if.awk
    Stanford, Jeffrey fast
    Liggett, Michael fast
    Baker, Chase slow

Loops

  • Syntax very much like C
  • Some examples:

    {
      i=1;
      while(i<=NF) {
        print $i;
        i++;
      }
    }

SLIDE 9

Results:

    $ head -2 swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    $ head -2 swimresults | awk -f while.awk
    Stanford,
    Jeffrey
    25
    HIMA
    47.07
    46.32
    Liggett,
    Michael
    27
    DYNA
    47.25
    48.12

break & continue

  • do...while is analogous
  • break and continue have the same semantics as in C
  • Scan each line for the occurrence of 25

    {
      for(i=1;i<=NF;i++) {
        if ($i ~ /25/) {
          print("field",i,":",$i);
          break;
        }
      }
    }
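Since the example above only shows break, here is a small sketch that uses both continue and break in one loop. The input numbers are invented.

```shell
out=$(printf '5 -1 7 0 9\n' | awk '{
  sum = 0
  for (i = 1; i <= NF; i++) {
    if ($i < 0) continue     # skip negative fields
    if ($i == 0) break       # stop at the first zero
    sum += $i
  }
  print "sum:", sum
}')

echo "$out"
```

The -1 is skipped and the loop ends at the 0, so only 5 and 7 are added.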

String functions

  • length(str)
  • substr(string,position,length)
  • gsub(r,s,t) - substitutes s into t whenever r occurs (nawk)
  • sub(r,s,t) - like gsub, but only subs once
  • toupper(str) (posix awk)
  • tolower(str)

String examples:

    $ head -5 swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    Baker, Chase 29 GMUP 56.28 57.79
    Kittredge, Brad 25 TOC 45.05 46.22
    Richner, Thomas 27 UCLA 50.00 48.79
    $ head -5 swimresults | awk '{print(length($2),substr($2,1,3),toupper($2));}'
    7 Jef JEFFREY
    7 Mic MICHAEL
    5 Cha CHASE
    4 Bra BRAD
    6 Tho THOMAS
    $

SLIDE 10

Another string example:

    $ head -5 swimresults | awk '{print(gsub(/e/,"!",$2),"occurrence(s)"); print($2);}'
    2 occurrence(s)
    J!ffr!y
    1 occurrence(s)
    Micha!l
    1 occurrence(s)
    Chas!
    0 occurrence(s)
    Brad
    0 occurrence(s)
    Thomas
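To contrast sub with gsub, the following sketch applies both to copies of the same made-up word and prints the return value (the number of substitutions) next to the result:

```shell
out=$(printf 'banana\n' | awk '{
  s = $0; n1 = sub(/a/, "A", s)      # replaces only the first match
  g = $0; n2 = gsub(/a/, "A", g)     # replaces every match
  print n1, s
  print n2, g
  print length($0), substr($0, 2, 3), toupper($0)
}')

echo "$out"
```

Both functions modify their third argument in place, which is why the record is copied into s and g first.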

Numeric functions

  • int(x) - returns the integer part of x (truncates towards 0): int(3) is 3, int(3.9) is 3, int(-3.9) is -3, and int(-3) is -3 as well.
  • sqrt(x) - returns the positive square root of x.
  • exp(x) - returns the exponential of x (e ^ x)
  • log(x) - returns the natural logarithm of x
  • sin(x) - returns the sine of x, with x in radians.
  • cos(x) - returns the cosine of x, with x in radians.
  • atan2(y, x) - returns the arctangent of y / x in radians.
  • rand() - returns a random number r with 0 <= r < 1
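A few of these functions can be checked from a BEGIN block without any input. This is only a sketch; the rand() value itself is not printed because it differs between awk implementations and runs.

```shell
out=$(awk 'BEGIN {
  print int(3.9), int(-3.9)            # truncation towards 0
  print sqrt(16), exp(0), log(1)
  printf "%.4f\n", atan2(0, -1)        # pi, rounded to four decimals
  r = rand()
  msg = (r >= 0 && r < 1) ? "rand in range" : "rand out of range"
  print msg
}')

echo "$out"
```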

Arrays

  • awk arrays are dynamic
  • no need to declare size
  • associative arrays, not like "normal arrays"
  • indices are stored as strings
  • array item "15" < "3" so element "15" stored before element "3"

Normal arrays:

  • must be declared
  • contiguous chunk of memory
  • index 0 = 1st element, index 1 = 2nd element etc.
  • impossible to add to array
  • A contiguous array of four elements might look like this:

    +---------+---------+--------+---------+
    |    8    |  "foo"  |   ""   |    30   |    value
    +---------+---------+--------+---------+
         0         1         2         3        index

SLIDE 11

Associative arrays

  • each array is a collection of pairs: an index, and its corresponding array element value
  • order is irrelevant
  • is sparse (i.e., some indices can be missing)
  • new pairs can be added at any time

Associative arrays - example

    Element 4    Value 30
    Element 2    Value "foo"
    Element 1    Value 8
    Element 3    Value ""

Add a tenth element whose value is "number ten":

    Element 10   Value "number ten"
    Element 4    Value 30
    Element 2    Value "foo"
    Element 1    Value 8
    Element 3    Value ""

Array access

  • myarray["2"] to access the "2" element
  • a reference automatically creates that array element
  • an array element that has no recorded value has default value ""
  • test existence with: index in array
  • has the value 1 (true) if array[index] exists, and 0 (false) if it does not exist.
  • doesn't create element if it is not present
  • delete myarray["2"] - removes the entry forever

Array access example:

    BEGIN {i=1}
    { names[$2] = i++; }
    END {
      if("Ross" in names)
        print("Ross was",names["Ross"],"th person there");
      else
        print("Ross wasn't there!");
    }
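The difference between testing with in and referencing an element directly can be sketched as follows; the array name seen and its indices are made up for the example.

```shell
out=$(awk 'BEGIN {
  seen["a"] = 1
  print ("b" in seen)        # 0: the in test does not create the element
  x = seen["b"]              # a plain reference creates seen["b"] as ""
  print ("b" in seen)        # 1
  delete seen["b"]
  print ("b" in seen)        # 0 again
  print length(x)            # 0: the default value is the empty string
}')

echo "$out"
```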

SLIDE 12

For loop for arrays

  • awk has a special kind of for statement for scanning an array:
  • for (var in array) body
  • executes body once for each index in array that the program has previously used, with the variable var set to that index.

    BEGIN {i=1}
    { names[$2] = i++; }
    END { for (n in names) print n; }

    Output:
    Steven
    Patrick
    Thomas
    Michael
    Jon
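A common use of for (var in array) is counting occurrences. In this sketch the team codes are invented sample input, and the result is piped through sort because awk does not promise any particular order when it walks an array.

```shell
out=$(printf 'TOC\nDYNA\nTOC\nUCLA\nTOC\n' |
  awk '{ count[$1]++ } END { for (team in count) print team, count[team] }' |
  LC_ALL=C sort)

echo "$out"
```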

functions

  • only in nawk & gawk

    function add(a,b) {
      a=a+b;
      return a;
    }
    { print(add($5,$6)); }

    $ head -2 swimresults
    Stanford, Jeffrey 25 HIMA 47.07 46.32
    Liggett, Michael 27 DYNA 47.25 48.12
    $ head -2 swimresults | awk -f add.awk
    93.39
    95.37

Functions variables

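  • scalar parameters are passed by value (arrays are passed by reference)
  • local variables must be declared as extra parameters, by convention separated from the real parameters by extra space
  • function myfunc(a, b,    x) <-- a,b are parameters, x is a local variable
  • otherwise, variables are global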
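A minimal sketch of local versus global variables in a function. The function name sum_to and all variable names are hypothetical; the extra spaces in the parameter list are only a convention for marking i and total as locals.

```shell
out=$(awk '
function sum_to(n,    i, total) {   # i and total are locals (extra parameters)
  for (i = 1; i <= n; i++) total += i
  n = 0                             # changes only the local copy of n
  return total
}
BEGIN {
  i = 99; limit = 4
  print sum_to(limit), limit, i     # globals limit and i are untouched
}')

echo "$out"
```

Without the extra parameters, the loop counter i inside the function would overwrite the global i.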