Language Processing. Credits: Sommerville, Chapter 13.4; Andy Pimentel, University of Amsterdam; David Albrecht, Monash University; Charles A. Ofria, Michigan State University; Wuwei Shen, Western Michigan University. Instructor: Peter Baumann


SLIDE 1

320312 Software Engineering (P. Baumann)

Language Processing

Instructor: Peter Baumann
email: p.baumann@jacobs-university.de
tel: -3178
office: room 88, Research 1

Sommerville, Chapter 13.4

Credits: Andy Pimentel, University of Amsterdam; David Albrecht, Monash University; Charles A. Ofria, Michigan State University; Wuwei Shen, Western Michigan University

SLIDE 2

To warm up…

  • "Parser development is still a black art."
    (Paul Klint et al., Towards an engineering discipline for GRAMMARWARE, in: ACM TOSEM, May 2005)
  • Some magic (in APL):
      • Sort word list X: [APL expression not preserved]
      • All primes up to R: [APL expression not preserved]

SLIDE 3

Roadmap

  • Compilers & Co
  • Flex & Bison: The Mechanics
  • Grammars and actions
  • Error handling and debugging
  • Wrap-up

SLIDE 4

Language Processing Systems

  • Accept a natural or artificial language as input and generate some other representation of that language
  • Compiler: generates machine code; ex: gcc
  • Interpreter: acts immediately on instructions while they are being processed; ex: SQL, JS
  • Used when the easiest way to solve a problem is to describe an algorithm or data
  • Meta-CASE tools process tool descriptions, method rules, etc., to generate tools

SLIDE 5

Roadmap

  • Compilers & Co
  • Flex & Bison: The Mechanics
      • YACC / bison
      • (F)lex
      • Their interplay, and how to code it
  • Grammars and actions
  • Error handling and debugging
  • Wrap-up

SLIDE 6

What is Bison?

  • YACC ("Yet Another Compiler Compiler") = a parser generator
  • Bison = GNU yacc
      • generates LALR(1) parsers by default
  • Parser generator = tool producing a parser for a given grammar
      • i.e., it produces the source code of a syntactic analyzer for the corresponding language
      • the parser uses a stack to remember all parse tree nodes generated so far (so the stack is empty at the end)
  • Input: myparser.y[pp] containing grammar (= rules) + actions
  • Output: C[++] program myparser.c[pp]
      • + optionally a header file of tokens, myparser.h

SLIDE 7

What is (F)lex?

  • Lex = a scanner generator
  • Flex = "fast lex"
  • Patterns are written as regular expressions
  • Input: scanner.l containing patterns (= rules) + actions
  • Output: C program scanner.c (flex's default output name is lex.yy.c)
  • Typically, the generated scanner produces tokens for the (YACC-generated) parser
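
To make "pattern = regular expression" concrete, here is a hand-written C sketch of the work a scanner has to do for one pattern, [+-]?[0-9]+ (the NUM pattern used later in this deck). The function name match_num is hypothetical, and Flex itself generates a table-driven automaton rather than code of this shape.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the longest prefix of s
 * that matches [+-]?[0-9]+, or 0 if no prefix matches. */
size_t match_num(const char *s)
{
    size_t i = 0;

    if (s[i] == '+' || s[i] == '-')          /* optional sign: [+-]? */
        i++;

    size_t digits_start = i;
    while (isdigit((unsigned char)s[i]))     /* one or more digits: [0-9]+ */
        i++;

    if (i == digits_start)                   /* no digit seen: no match */
        return 0;
    return i;
}
```

Called on "12 + 2" the function consumes only the leading "12"; a lone "+" is rejected because the pattern demands at least one digit.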

SLIDE 8

Synopsis: How F&B Do The Job

  • Flex grammar defines single tokens ("words")
      • Regular expressions
  • Bison grammar defines admissible sequences ("sentences")
      • Context-free grammar (and more)
  • Ex: for the input "12 + 2"

Flex rules:

    [+-]?[0-9]+    return NUM;
    "+"            return PLUS;
    [ \t\n]+       /* do nothing */

Bison rule (using the token PLUS that the scanner returns):

    expr : NUM PLUS NUM ;
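
SLIDE 9

Flex & Bison: a Team

[Diagram: main() calls Bison's yyparse(), which requests one token at a time from Flex's yylex() (nextToken = yylex()). For the input "12 + 2", yylex() matches [0-9]+ and reports "saw token NUM!"; once yyparse() has seen the sequence NUM '+' NUM, it answers "OK!".]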
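
The division of labour in the diagram can be imitated in a few lines of plain C. This is only a sketch under stated assumptions: next_token() stands in for yylex(), parse_sum() stands in for yyparse(), and the token codes are invented for the illustration. The point is the pull model, in which the parser asks the scanner for exactly one token whenever it needs one.

```c
#include <ctype.h>

/* Hypothetical token codes; 0 means end of input, as with yylex(). */
enum { TOK_END = 0, TOK_NUM = 258, TOK_PLUS = 259, TOK_BAD = 260 };

static const char *input;   /* text that is still to be scanned */

/* Stand-in for yylex(): skip blanks, return the code of the next token. */
static int next_token(void)
{
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;
    if (*input == '\0')
        return TOK_END;
    if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input))
            input++;
        return TOK_NUM;
    }
    if (*input == '+') {
        input++;
        return TOK_PLUS;
    }
    input++;                 /* consume the unknown character */
    return TOK_BAD;
}

/* Stand-in for yyparse(): accepts exactly NUM '+' NUM.
 * Returns 0 on success and 1 on a syntax error, like yyparse(). */
int parse_sum(const char *text)
{
    input = text;
    if (next_token() != TOK_NUM)  return 1;
    if (next_token() != TOK_PLUS) return 1;
    if (next_token() != TOK_NUM)  return 1;
    return next_token() == TOK_END ? 0 : 1;
}
```

parse_sum("12 + 2") succeeds, while "12 +" and "12 + 2 3" are rejected because the token stream ends too early or continues too long.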

SLIDE 10

Bison File Format

    Definitions
    %%
    Rules
    %%
    Supplementary Code

  • Definitions
      • Code to be copied before generated code
      • Definitions used: tokens, types, etc.
  • Rules
      • Pairs of production rules and actions
      • Production rules describe a CFG
  • Supplementary Code
      • User subroutines
      • Copied after the end of the bison-generated code

The identical LEX format was actually taken from this...

SLIDE 11

Bison Definitions Section

    %{
    #include <stdio.h>
    #include <stdlib.h>
    %}
    %token ID NUM
    %start expr

  • %{ … %}: typedefs, includes, namespaces, …; copied literally into the C source
  • %token: terminal symbols
  • %start: start symbol (a non-terminal, obviously)

SLIDE 12

Bison Rules Section

  • Contains grammar
      • referring to previously defined non-terminals and terminals
  • Example:

    expr   : expr '+' term
           | term
           ;
    term   : term '*' factor
           | factor
           ;
    factor : '(' expr ')'
           | ID
           | NUM
           ;

  • A literal token such as '+' is a char, not a string! Be nice, define PLUS

SLIDE 13

Bison Code Section

  • "any other code"
  • main(), yyerror(), ...

    /* called by bison code when a syntax error is encountered
       (cf. the special token error); uses <stdio.h> from the definitions section */
    void yyerror(const char* err)
    {
        fprintf(stderr, "Syntax error: %s\n", err);
    }

    int main()
    {
        return yyparse();   /* this is the parser */
    }

SLIDE 14

FLEX

    Definitions
    %%
    Rules
    %%
    Supplementary Code

cf. YACC / bison!

  • Definitions
      • Code to be copied before generated code
      • Definitions used: tokens, etc.
  • Rules
      • Pairs of rules (patterns) and actions
      • Rules describe regular expressions
  • Supplementary Code
      • User subroutines
      • Copied after the end of the flex-generated code

SLIDE 15

FLEX Code Example

    %{
    #include <stdio.h>
    #include "parser.h"
    %}
    id      [_a-zA-Z][_a-zA-Z0-9]*
    num     [+-]?[0-9]+
    semi    [;]
    wspc    [ \t\n]+
    %%
    {id}    { return ID; }
    {num}   { return NUM; }
    {semi}  { return SEMI; }
    {wspc}  {;}

  • The token codes ID, NUM, SEMI are defined in bison's parser.h
  • Each action returns only the token tag; the actual value is passed elsewhere

SLIDE 16

Sidebar: If I Don't Want to Use FLEX

Pseudocode:

    #include "parser.h"

    int yylex()
    {
        if (it's a num)           return NUM;
        else if (it's an id)      return ID;
        else if (end of input)    return 0;
        else if (it's an error)   return -1;
    }
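
The pseudocode above can be turned into a small runnable scanner. The following is a minimal sketch, assuming the input is a string in memory; set_input() and the concrete token numbers are hypothetical (with Bison, NUM and ID would come from parser.h). It follows the slide's convention of returning -1 for an error; note that Bison's documentation treats any zero or negative return value as end of input, so a dedicated positive error token may be preferable in practice.

```c
#include <ctype.h>

/* Hypothetical token codes; with Bison they would come from parser.h. */
enum { NUM = 258, ID = 259 };

/* Assumption for this sketch: the whole input is a string in memory. */
static const char *cursor = "";

void set_input(const char *text) { cursor = text; }

/* Same contract as the pseudocode: token code, 0 at end of input,
 * -1 for a character that starts no known token. */
int yylex(void)
{
    while (isspace((unsigned char)*cursor))          /* skip whitespace */
        cursor++;

    if (*cursor == '\0')
        return 0;                                    /* end of input */

    if (isdigit((unsigned char)*cursor)) {           /* it's a num */
        while (isdigit((unsigned char)*cursor))
            cursor++;
        return NUM;
    }

    if (isalpha((unsigned char)*cursor) || *cursor == '_') {   /* it's an id */
        while (isalnum((unsigned char)*cursor) || *cursor == '_')
            cursor++;
        return ID;
    }

    cursor++;                                        /* step over the bad character */
    return -1;                                       /* it's an error */
}
```

After set_input("x1 42 ?"), successive calls yield ID, NUM, -1 and then 0 for every further call.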

SLIDE 17

Flex/Bison Code: How to Compile & Link

    $ flex scanner.l
    $ bison -d -o myparser.cpp myparser.ypp
    $ gcc -o parser myparser.cpp lex.yy.c -ly -lfl

Without -o, bison would name its output myparser.tab.cpp.

[Diagram: myparser.ypp -> bison -> myparser.cpp; scanner.l -> flex -> lex.yy.c; both -> gcc -> parser]

SLIDE 18

Roadmap

  • Compilers & Co
  • Flex & Bison: The Mechanics
  • Grammars and actions
  • Error handling and debugging
  • Wrap-up

SLIDE 19

Semantic Actions in Bison: Overview

  • Action = code executed when rule is applied
      • Any C/C++ code
      • Ex: expression evaluation, symbol table insertion / lookup, parse tree build-up
  • How to pass information between rules?
      • Attribute values store intermediate results
      • $1, $2, … = result from evaluating (non-)terminal #1, #2, … on the rule's right-hand side
      • $$ = result of the current rule, i.e., the value of its left-hand side

    expr : expr '+' term   { $$ = $1 + $3; }
         | term            { $$ = $1; }
         ;
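
One way to picture $1, $3 and $$ is as slots on a value stack that runs in parallel to the parse stack. The C sketch below illustrates that idea only; ValueStack, push() and reduce_add() are hypothetical names, and the code Bison actually generates looks different.

```c
#include <stddef.h>

/* Hypothetical value stack: one semantic value per grammar symbol. */
typedef struct {
    int    values[64];
    size_t top;            /* number of values currently on the stack */
} ValueStack;

void push(ValueStack *s, int v) { s->values[s->top++] = v; }

/* Reduce by  expr : expr '+' term  { $$ = $1 + $3; }
 * The three right-hand-side values are the top three stack slots. */
void reduce_add(ValueStack *s)
{
    int v1 = s->values[s->top - 3];   /* $1: value of expr */
    int v3 = s->values[s->top - 1];   /* $3: value of term */
    /* $2 belongs to the '+' token; its value is not used. */
    s->top -= 3;                      /* pop the right-hand side */
    push(s, v1 + v3);                 /* $$: value of the new expr */
}
```

Pushing 12, a dummy value for '+', and 2, then calling reduce_add(), leaves a single entry holding 14, which is the value the enclosing rule would see as its own $1 or $3.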

SLIDE 20

Dynamics of Rule Processing

  • Rule "fires" = right-hand side reduced to left-hand non-terminal
  • Rules reduced bottom-up
  • Successful if, at EOF, only axiom remains (empty stack)

    term   : term '*' factor   { $$ = $1 * $3; }
           | factor            { $$ = $1; }
           ;
    factor : NUM               { $$ = yylval; }
           ;

[Diagram: for the input "2 * 3", Flex delivers the tokens NUM '*' NUM with yylval = 2 and yylval = 3. Bison reduces bottom-up: NUM -> factor ($$ = 2), factor -> term ($$ = 2), NUM -> factor ($$ = 3), term '*' factor -> term ($$ = 6).]

SLIDE 21

Semantic Actions: Larger Example

    expr   : expr '+' term     { $$ = $1 + $3; }
           | term              { $$ = $1; }
           ;
    term   : term '*' factor   { $$ = $1 * $3; }
           | factor            { $$ = $1; }
           ;
    factor : '(' expr ')'      { $$ = $2; }
           | ID                { $$ = lookup(symbolTable, yylval); }
           | NUM               { $$ = yylval; }
           ;

Does not run like this; yylval is trickier in real life and usually needs a union!
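
The remark about the union comes from the fact that different tokens carry different kinds of values: a NUM carries a number, an ID carries a name. In Bison this is declared with %union; the plain C sketch below shows only the underlying idea. SemanticValue, set_number() and set_name() are hypothetical names chosen for the illustration.

```c
#include <string.h>

/* Hypothetical semantic value: a NUM carries an int, an ID carries a name.
 * Bison's %union produces a comparable union type for yylval. */
typedef union {
    int  number;          /* value of a NUM token */
    char name[32];        /* spelling of an ID token */
} SemanticValue;

/* What a scanner action for NUM would do before returning the token. */
void set_number(SemanticValue *v, int n) { v->number = n; }

/* What a scanner action for ID would do; overlong names are truncated. */
void set_name(SemanticValue *v, const char *text)
{
    strncpy(v->name, text, sizeof v->name - 1);
    v->name[sizeof v->name - 1] = '\0';
}
```

Because a union stores only one member at a time, the token code returned by the scanner is what tells the parser which member is valid.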

SLIDE 22

Roadmap

  • Compilers & Co
  • Flex & Bison: The Mechanics
  • Grammars and actions
  • Error handling and debugging
  • Wrap-up

SLIDE 23

Error Handling: Catch & Recover & Report

  • Good error handling includes
      • Elastic recovery from errors, graceful continuation
      • Meaningful diagnostic messages

SLIDE 24

Error Handling: Catch & Recover

  • Elastic recovery from errors, graceful continuation
  • Implementing good error handling can be extremely tricky!
  • Example, good for line-oriented input (lab assembler!):
      • bison's standard token error eats up all tokens that are not understood
      • Predefined macros (yyerrok, yyclearin) reset the parser state for meaningful continuation
      • + individual actions (message output, …)

    line : /* empty */
         | line whatever
         | line error                     /* std error token */
             {
               yyerror( "Failure :-(" );  /* msg output etc. */
               yyerrok;                   /* reset parser: leave error-recovery mode */
               yyclearin;                 /* discard the current lookahead token */
             }
         ;
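
The recovery idea for line-oriented input, namely give up on the broken line and resume with the next one, can be sketched without Bison. The code below is an analogy, not Bison's actual recovery algorithm; the token codes and count_good_lines() are hypothetical, and the "grammar" is simply that every line consists of one WORD followed by NEWLINE.

```c
#include <stddef.h>

/* Hypothetical token codes for a line-oriented language. */
enum { TOK_END = 0, TOK_WORD = 258, TOK_NEWLINE = 259, TOK_BAD = 260 };

/* Counts well-formed lines (WORD NEWLINE) in a token array that ends
 * with TOK_END. On an unexpected token it skips ahead to the next
 * NEWLINE and resumes there. The number of skipped lines goes to *errors. */
int count_good_lines(const int *tokens, int *errors)
{
    size_t i = 0;
    int good = 0;
    *errors = 0;

    while (tokens[i] != TOK_END) {
        if (tokens[i] == TOK_WORD && tokens[i + 1] == TOK_NEWLINE) {
            good++;
            i += 2;
            continue;
        }
        (*errors)++;                        /* report once per bad line */
        while (tokens[i] != TOK_END && tokens[i] != TOK_NEWLINE)
            i++;                            /* discard the rest of the line */
        if (tokens[i] == TOK_NEWLINE)
            i++;                            /* resume after the line break */
    }
    return good;
}
```

One malformed line costs one error report, and the lines after it are still checked normally, which is the "graceful continuation" the slide asks for.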

SLIDE 25

Error Handling: Syntactic vs Semantic Errors

  • Syntactic error: caught by parser, need to manually cure & reset:

    line : /* empty */
         | line whatever
         | line error
             {
               yyerror( "Failure :-(" );
               yyerrok;     /* reset parser: leave error-recovery mode */
               yyclearin;   /* discard the current lookahead token */
             }
         ;

  • Semantic error: caught by your action, ignored by parser:

    expr : expr '/' expr
             {
               if ($3 == 0.0)
                   yyerror("div by zero");
               else
                   $$ = $1 / $3;
             }
         ;
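
The semantic check from the division rule can be isolated in an ordinary C function. This is a sketch with hypothetical names (checked_divide, semantic_errors). One deliberate difference from the action above: on error the action leaves $$ unset, whereas this sketch returns 0.0 so that the result is always defined; that choice is an assumption, not part of the slide.

```c
#include <stdio.h>

int semantic_errors = 0;   /* hypothetical counter of reported errors */

/* Stand-in for the action of  expr : expr '/' expr.
 * The parser has already accepted the syntax; the zero check is ours.
 * On error a message is printed and 0.0 is returned as a defined value. */
double checked_divide(double left, double right)
{
    if (right == 0.0) {
        fprintf(stderr, "div by zero\n");
        semantic_errors++;
        return 0.0;
    }
    return left / right;
}
```

The parser keeps running after such an error; inspecting semantic_errors at the end of the parse lets the caller decide whether the overall result can be trusted.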

SLIDE 26

Error Handling: Report

  • Provide meaningful diagnostic output!
      • Compiler needs to give the programmer good advice
      • Useful information: line number, column number, violating token, what was understood vs what was expected, …
  • Examples:
      • Bad: "Syntax error"
      • Good: "Line 15, column 21, near token 'flip': found unknown instruction parameter 'coin'"
  • Note: find the current line number in the global variable yylineno (in flex, enabled with %option yylineno)
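
A message in the style of the "good" example is easy to assemble once line, column and token are known. The helper below is a minimal sketch; format_diagnostic() is a hypothetical name, and how the column number is tracked is left open here.

```c
#include <stdio.h>

/* Hypothetical helper: writes a diagnostic in the style of the "good"
 * example into buf. Returns the number of characters the full message
 * needs, exactly as snprintf() does. */
int format_diagnostic(char *buf, size_t size, int line, int column,
                      const char *token, const char *message)
{
    return snprintf(buf, size, "Line %d, column %d, near token '%s': %s",
                    line, column, token, message);
}
```

A yyerror() implementation could call this with yylineno as the line argument and then print the buffer to stderr.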

SLIDE 27

Debugging My Parser

  • If it doesn't do what is expected:
      • gcc -DYYDEBUG …
      • In your program code:

    extern int yydebug;
    yydebug = 1;

  • …will give a (very!) verbose log of bison rule processing

SLIDE 28

Roadmap

  • Compilers & Co
  • Flex & Bison: The Mechanics
  • Grammars and actions
  • Error handling and debugging
  • Wrap-up

SLIDE 29

Sample Makefile

    LEX = flex
    YACC = bison
    CC = gcc

    calc: parser.o scanner.o
        $(CC) -o calc parser.o scanner.o -ly -lfl

    scanner.o: parser.h scanner.c

    scanner.c: scanner.l parser.h
        $(LEX) -oscanner.c scanner.l

    parser.o: parser.cpp parser.h

    parser.cpp parser.h: parser.ypp
        $(YACC) -o parser.cpp --defines=parser.h parser.ypp

SLIDE 30

Bored, Lonely? Try This!

  • bison -v parser.y
      • Produces a state machine description in parser.output (y.output with classic yacc or bison -y)
  • flex -d scanner.l
      • Flex-generated state automaton will generate (tons of) runtime output
SLIDE 31

31 320312 Software Engineering (P. Baumann)

ANTLR

  • „ANother Tool for Language Recognition“
  • Java based
  • www.antlr.org
  • ANTLRWorks: Sophisticated GUI grammar development environment
  • Grammar editor, interpreter, debugger
  • Ambiguous path visualization
slide-32
SLIDE 32

32 320312 Software Engineering (P. Baumann)

ANTLR

SLIDE 33

Background & Beyond

  • Books about compiler design
      • “lex & yacc”, by John Levine et al.
      • “Principles of Compiler Design”, by A. V. Aho & J. D. Ullman
      • “The Unix Programming Environment”, by Kernighan & Pike
  • Domain Specific Languages (DSLs)
      • Executable meta-language
      • Applied Metamodelling: A Foundation for Language Driven Development
      • SCALA