Regular Expressions for Linguists: A Life Skill

SLIDE 1

Regular Expressions for Linguists: A Life Skill

Michael Yoshitaka Erlewine

mitcho@mitcho.com

Hackl Lab Turkshop March 2013

SLIDE 2

Regular Expressions

What are regular expressions?

  • Regular Expressions (aka regexes or regexps) are a way of telling your computer what to do.

  • Specifically, look at (a lot of) text and find things or make changes.
  • These are both things which we might want to do as linguists.

SLIDE 3

Why are we learning it now?

  • Constructing input for Turk can involve manipulating a lot of text. In particular, you might want to test different systematic variants of your sentences.
  • Cutting and pasting is highly prone to errors.
  • Regular expressions are a quick way to make these systematic changes.

SLIDE 4

Note

You will not master regular expressions today. Regular expressions are learned through practice.

SLIDE 5

Tools

Today we will use regexes as a glorified find/replace tool in a text editor. Free editors with good regex support include:

  • TextWrangler for Mac
  • Notepad++ for Windows
      • Download the NppToolBucket plugin and move the dll file to C:\Program Files\Notepad++\plugins. This adds a better find/replace window to Notepad++.
  • Komodo Edit for Mac/Windows/Linux

Hopefully you’ve already installed one of those.

SLIDE 6

Get these now

☞ Download the Regular Expressions cheat sheet: http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf
☞ Download some sample files to play with here: http://web.mit.edu/hackl/www/lab/turkshop/examples/week2-regex.zip

SLIDE 7

What do you do with a regex?

  • Find
  • Find all/count
  • Find and replace

Make sure you know how to do these things in your editor.

  1. Open the file lookingglass.txt in your editor and count the occurrences of cat. Try replacing all the cats with dogs. (Don’t save; undo.) What problem does this have?

Make sure you turn on “regular expressions” (TextWrangler: “grep”) and “wrap around” in your search window. Notice that you can specify case-sensitivity.
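The same three operations (find, count, replace) can be sketched outside an editor. The snippet below is only an illustration in Python's `re` module, whose regex flavor may differ in small ways from your editor's; the sample sentence is made up and is not taken from lookingglass.txt.

```python
import re

# Hypothetical sample text, not taken from lookingglass.txt.
text = "The cat sat. Cats catch mice; the cat will scatter them."

# Plain case-sensitive search: matches "cat" anywhere, even inside longer words.
matches = re.findall(r"cat", text)
print(len(matches))

# A case-insensitive search also picks up the "Cat" in "Cats".
matches_ci = re.findall(r"cat", text, flags=re.IGNORECASE)
print(len(matches_ci))

# A naive replace also rewrites the inside of "catch" and "scatter".
replaced = re.sub(r"cat", "dog", text)
print(replaced)
```

The replace step shows the kind of problem the exercise asks about: a bare cat also matches inside other words.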

SLIDE 8

Basic matching

Look at the “basic matching” section of the cheat sheet.

  1. Does the text have any tabs?
  2. Search for Alice\n. What did you find?
  3. Find five-letter words with \s\w\w\w\w\w\s. What problem(s) does this have?
  4. This file annoyingly has line breaks in the middle of sentences. Find \n and replace them all with one space each. (Don’t save; undo.) Did this do what we wanted?
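As a rough illustration of how these patterns behave, here is a small Python sketch. The sample strings are invented for the example, and Python's `re` is assumed to treat \n, \s and \w the way the cheat sheet describes.

```python
import re

# Hypothetical sample text, not taken from lookingglass.txt.
text = "Alice was happy.\nAlice\nsaw a white rabbit there.\n"

# "Alice\n" only matches an "Alice" that is directly followed by a line break.
alice_at_line_end = re.findall(r"Alice\n", text)
print(alice_at_line_end)

# Five word characters with one whitespace character on each side.
five_letter = re.findall(r"\s\w\w\w\w\w\s", text)
print(five_letter)

# Replacing every line break with a space also flattens paragraph breaks.
wrapped = "one\ntwo\n\nthree"
flattened = re.sub(r"\n", " ", wrapped)
print(repr(flattened))
```

In this sample the five-letter search misses the first "Alice" (nothing before it), and "happy." and "there." (punctuation after them), and every match includes the surrounding whitespace.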

SLIDE 9

Character classes

Look at the “character classes” section of the cheat sheet.

  1. Find sequences of three vowels in a row with [aeiou][aeiou][aeiou].
  2. Find sequences of three capital letters in a row with [A-Z][A-Z][A-Z]. Make sure to turn on case-sensitivity!
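A minimal Python sketch of the two character-class searches, using an invented sentence. The last search shows what can happen when case-sensitivity is switched off (here simulated with `re.IGNORECASE`).

```python
import re

# Hypothetical sample sentence for illustration only.
text = "The Queen said NASA is a beautiful acronym."

# Three vowels in a row (lowercase only, because the class lists only lowercase letters).
three_vowels = re.findall(r"[aeiou][aeiou][aeiou]", text)
print(three_vowels)

# Three capital letters in a row, case-sensitive (Python's default).
three_caps = re.findall(r"[A-Z][A-Z][A-Z]", text)
print(three_caps)

# With case-insensitive matching, [A-Z] matches lowercase letters too.
three_any = re.findall(r"[A-Z][A-Z][A-Z]", text, flags=re.IGNORECASE)
print(three_any)
```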

SLIDE 10

Boundaries

Look at the “boundaries” section of the cheat sheet.

  1. Search for \bcat\b. What did you find?
  2. Search for \bcat\w. What did you find?
  3. Find some words that start with “q”.
  4. How often is “Alice” at the beginning of a line? At the end of a line?
  5. Does ^through pick out the same things as \nthrough?
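The boundary patterns can be tried on a small made-up text, as in the Python sketch below. One assumption to note: in Python, ^ and $ only match at line starts and ends when the `re.MULTILINE` flag is given, whereas editors usually work line by line already.

```python
import re

# Hypothetical four-line sample text, not taken from lookingglass.txt.
text = "through the door\nAlice saw a quiet cat.\nthrough it the cats catch\nAlice"

whole_word = re.findall(r"\bcat\b", text)    # "cat" as a complete word
cat_plus = re.findall(r"\bcat\w", text)      # word-initial "cat" plus one more character
q_words = re.findall(r"\bq\w*", text)        # words starting with "q"

# Line anchors need re.MULTILINE in Python.
alice_start = re.findall(r"^Alice", text, flags=re.MULTILINE)
alice_end = re.findall(r"Alice$", text, flags=re.MULTILINE)

# ^through also matches on the very first line; \nthrough needs a preceding line break.
caret_through = re.findall(r"^through", text, flags=re.MULTILINE)
newline_through = re.findall(r"\nthrough", text)

print(whole_word, cat_plus, q_words)
print(len(alice_start), len(alice_end))
print(len(caret_through), len(newline_through))
```

Note that the \nthrough matches include the line break itself, which matters as soon as you replace rather than just search.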

SLIDE 11

Disjunctions

Disjunctions are pretty straightforward.

  1. Search for (cat|dog)s. What did you find?
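A short Python sketch of why the parentheses matter, on an invented string. One Python-specific detail: `re.findall` returns the contents of the group when a pattern has one, so `re.finditer` is used here to see the full matches.

```python
import re

# Hypothetical sample string for illustration only.
text = "cats and dogs, a cat, a dog, hotdogs"

# With parentheses, the "s" applies to both alternatives.
grouped = [m.group(0) for m in re.finditer(r"(cat|dog)s", text)]
print(grouped)

# Without parentheses, the alternatives are "cat" and "dogs".
ungrouped = re.findall(r"cat|dogs", text)
print(ungrouped)
```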

Moving on...

SLIDE 12

“Quantifiers”

Look at the “quantifiers” section of the cheat sheet.

  1. Find all words that end in -ing. Make sure you select the entire -ing-suffixed word.
  2. Find all words that end in -ing or the plural -ings.
  3. How many exactly-ten-letter words are there?
  4. How many matches of \b[\w’]{10}\b are there? What is this?
  5. What’s the longest word in the text?
  6. What’s the longest word in the text which can be typed using only the keys on the top row of your keyboard?
  7. Find the line that both the “Red Queen” and “Red King” are in.
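One possible set of patterns for these exercises is sketched below in Python, on invented sample text. These are not the only solutions, and in an editor the "longest word" questions would more likely be answered by raising the number in something like \b\w{15,}\b until nothing matches.

```python
import re

# Hypothetical sample text for illustration only.
text = "The Red Queen was singing proper songs; the typewriter kept its sayings."

ing_words = re.findall(r"\b\w+ing\b", text)      # whole words ending in -ing
ings_words = re.findall(r"\b\w+ings?\b", text)   # -ing or plural -ings
ten_letters = re.findall(r"\b\w{10}\b", text)    # exactly ten word characters

# Longest word overall, and longest word typed on the top letter row only.
longest = max(re.findall(r"\b\w+\b", text), key=len)
top_row = re.findall(r"\b[qwertyuiop]+\b", text, flags=re.IGNORECASE)

# A line containing both names, in either order ("." does not cross line breaks).
lines = "The Red Queen left.\nThe Red King and the Red Queen met.\nThe Red King slept."
both = re.findall(r"^.*Red (?:Queen.*Red King|King.*Red Queen).*$", lines, flags=re.MULTILINE)

print(ing_words, ings_words, ten_letters)
print(longest, top_row)
print(both)
```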

SLIDE 13

Special characters

Special characters are pretty straightforward too. If your editor ever dies and tells you your regular expression is bad, most likely you forgot to escape something.
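To make the escaping point concrete, here is a small Python sketch with an invented string. Python reports a bad pattern by raising `re.error`; your editor will show its own message instead.

```python
import re

# Hypothetical sample string for illustration only.
text = "Is it 3.5 or 325? (Maybe.)"

# An unescaped "." matches any character, so "325" is found as well.
unescaped = re.findall(r"3.5", text)
# Escaping the dot restricts the match to a literal ".".
escaped = re.findall(r"3\.5", text)

# An unescaped "(" opens a group that is never closed: the pattern is invalid.
try:
    re.compile(r"(Maybe")
    is_bad = False
except re.error:
    is_bad = True

# Escaped parentheses and dot match the literal text.
literal = re.findall(r"\(Maybe\.\)", text)
# re.escape builds an escaped pattern from a literal string.
literal_auto = re.findall(re.escape("(Maybe.)"), text)

print(unescaped, escaped, is_bad, literal, literal_auto)
```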

SLIDE 14

Backreferences

Look at the “backreferences” section of the cheat sheet.

  1. Search for \b(\w+)␣\1\b. What did you find?
  2. Search for \b(\w)\w*␣\1\w*\b. What does this find?
  3. Replace all the line breaks inside paragraphs, but keep the paragraph breaks intact.
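The sketch below tries the two backreference searches in Python on invented sentences, writing the ␣ as an ordinary space. The last part shows one possible approach to the paragraph exercise, assuming paragraphs are separated by exactly one blank line; it is an illustration, not necessarily the intended solution.

```python
import re

# \1 must repeat whatever the first group matched: doubled words.
doubled = [m.group(0) for m in
           re.finditer(r"\b(\w+) \1\b", "She said that that was the the end.")]

# Here the group is a single letter: adjacent words with the same first letter.
same_initial = [m.group(0) for m in
                re.finditer(r"\b(\w)\w* \1\w*\b", "the twins sang some songs")]

# Replace a line break only when it sits between two non-line-break characters,
# putting both neighbours back with \1 and \2 in the replacement.
wrapped = "a line\nbroken here.\n\nNew para\ngoes on."
unwrapped = re.sub(r"([^\n])\n([^\n])", r"\1 \2", wrapped)

print(doubled, same_initial)
print(repr(unwrapped))
```

Matches do not overlap, so "some songs" is not reported once "sang some" has been matched; for the same reason, the replacement can miss a line break next to a one-character line.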

SLIDE 15

Realistic (important for next week!) exercises

Open file blocking.txt.

  1. Take each sentence and replace “made V” with the appropriate “Ved”. (Don’t save; undo.)
  2. Take each line and turn it into two lines: one as is and the other with “make V” replaced with the appropriate “Ved”.

Open file quantifiers.txt. Each sentence contains two quantifiers in curly braces.

  1. Take each sentence and split it into two sentences, using the options in the curly braces one at a time. Add “a” and “b” to the sentence numbers at the same time.
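The general technique for both files is to capture the parts you want to keep and reuse them in the replacement. The Python sketch below assumes made-up line formats: the actual contents of blocking.txt and quantifiers.txt are not shown in these slides, so the sample lines, the comma-separated brace format, and the patterns are all hypothetical.

```python
import re

# Hypothetical blocking-style line: the verb directly follows "made".
blocking_line = "The coach made jump the team."
# Capture the verb and reuse it with "ed" attached.
blocked = re.sub(r"\bmade (\w+)", r"\1ed", blocking_line)
print(blocked)

# Hypothetical quantifier format: "N. ... {Q1,Q2} ..." with one brace pair per line.
quantifier_text = "1. {Every,Some} student left.\n2. Mary saw {no,most} films."

# Groups: 1 = sentence number, 2 = text before the braces,
# 3 and 4 = the two options, 5 = text after the braces.
pattern = r"^(\d+)\. (.*)\{(\w+),(\w+)\}(.*)$"
replacement = r"\1a. \2\3\5\n\1b. \2\4\5"
split_text = re.sub(pattern, replacement, quantifier_text, flags=re.MULTILINE)
print(split_text)
```

A blanket "ed" suffix goes wrong for irregular verbs and for verbs already ending in "e", so the real file may need more than one pass.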
