Literary Data: Some Approaches
Andrew Goldstone
http://www.rci.rutgers.edu/~ag978/litdata
February 19, 2015. Data-type wrap-up; regular expressions.

factors
laureate_genre <- factor(c("novel", "short story", "novel", "poetry", "novel"))
laureate_genre
[1] novel       short story novel       poetry
[5] novel
Levels: novel poetry short story
▶ levels(laureate_genre) for the levels
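A minimal sketch of what a factor stores, reusing the vector from the slide. The names genre_codes and genre_counts are illustrative and do not come from the slides.

```r
# Same vector as on the slide.
laureate_genre <- factor(c("novel", "short story", "novel", "poetry", "novel"))

# The distinct categories, in sorted order.
genre_levels <- levels(laureate_genre)

# Under the hood each element is an integer code pointing into the levels.
genre_codes <- as.integer(laureate_genre)

# table() counts how many elements fall in each level.
genre_counts <- table(laureate_genre)

print(genre_levels)
print(genre_codes)
print(genre_counts)
```

The integer codes are why a factor is a poor fit for free text but a good fit for a small, fixed set of categories.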

hierarchy
clues <- list(
    absent=c("Assyrian", "Sensible Course"),
    present=list(
        unnecessary=c("Duchess", "Race"),
        necessary=list(
            invisible=c("Scandal", "Twisted"),
            visible=list(
                undecodable=c("Boscombe", "Five"),
                decodable=c("Red-Headed", "Identity")
            )
        )
    )
)

clues
$absent
[1] "Assyrian"        "Sensible Course"
$present
$present$unnecessary
[1] "Duchess" "Race"
$present$necessary
$present$necessary$invisible
[1] "Scandal" "Twisted"
$present$necessary$visible
$present$necessary$visible$undecodable
[1] "Boscombe" "Five"
$present$necessary$visible$decodable
[1] "Red-Headed" "Identity"

clues$present$unnecessary
[1] "Duchess" "Race"
clues$present$necessary$visible$decodable
[1] "Red-Headed" "Identity"
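A small sketch of the other ways into a nested list, using a trimmed-down stand-in for clues (only two branches are kept here, as an assumption to keep the example short). The variable names via_dollar, via_brackets, and still_a_list are illustrative.

```r
# Trimmed stand-in for the slide's clues list.
clues <- list(
    absent = c("Assyrian", "Sensible Course"),
    present = list(unnecessary = c("Duchess", "Race"))
)

# $ and [[ ]] both step down one level and return the element itself.
via_dollar <- clues$present$unnecessary
via_brackets <- clues[["present"]][["unnecessary"]]

# Single brackets keep the list wrapper: the result is a list of length one.
still_a_list <- clues["absent"]

print(names(clues))
print(via_brackets)
print(still_a_list)
```

The double-bracket form is handy when the element name is held in a variable, since $ only takes a literal name.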

data frames
laureates[1:5, c("firstname", "surname", "year", "bornCountry")]
  firstname      surname year bornCountry
1   Patrick      Modiano 2014      France
2     Alice        Munro 2013      Canada
3        Mo          Yan 2012       China
4     Tomas  Tranströmer 2011      Sweden
5     Mario Vargas Llosa 2010        Peru

frame indexing
frm[rows, cols]
▶ blank: keep them all
▶ number: choose that row/column
▶ numeric vector: choose those rows/columns
▶ logical: filter those rows/columns
▶ character vector: choose these named rows/columns (but rows don’t have names by default)

shorthand
frm$colname == frm[, "colname"]
frm$colname[rows] == frm[rows, "colname"]
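The indexing forms above can be tried on a toy frame. This is a sketch, assuming a three-row frame called frm built from values shown on the data frames slide; it is not the full laureates data.

```r
# Toy frame built from three rows shown on the data frames slide.
frm <- data.frame(
    surname = c("Modiano", "Munro", "Yan"),
    year = c(2014, 2013, 2012),
    bornCountry = c("France", "Canada", "China"),
    stringsAsFactors = FALSE
)

one_row <- frm[2, ]                           # number: one row, all columns
two_names <- frm[c(1, 3), "surname"]          # numeric vector of rows, named column
filtered <- frm[frm$year < 2014, c("surname", "year")]  # logical filter on rows
short_form <- frm$surname[2]                  # shorthand for frm[2, "surname"]

print(one_row)
print(two_names)
print(filtered)
print(short_form)
```

Picking a single column by name drops down to a plain vector, which is why two_names prints without row labels.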

query logic
recent_flags <- laureates$year >= 2010
laureates[recent_flags, c("surname", "bornCity")]
       surname  bornCity
1      Modiano     Paris
2        Munro   Wingham
3          Yan     Gaomi
4  Tranströmer Stockholm
5 Vargas Llosa  Arequipa
▶ homework questions?
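A sketch of what the flags vector looks like on its own and how two conditions combine. The two short vectors are a stand-in for columns of laureates, with values borrowed from the ordering slide; late_flags and sixties_late are illustrative names.

```r
# Stand-in columns (values borrowed from the ordering slide).
surname <- c("Agnon", "Aleixandre", "Andric", "Asturias", "Beckett")
year <- c(1966, 1977, 1961, 1967, 1969)

# A comparison on a vector gives one TRUE/FALSE per element.
late_flags <- year >= 1967

# TRUE counts as 1, so sum() counts the matches; which() gives their positions.
n_late <- sum(late_flags)
late_positions <- which(late_flags)

# & combines two logical vectors element by element.
sixties_late <- surname[late_flags & year < 1970]

print(late_flags)
print(n_late)
print(sixties_late)
```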

ordering
laureates[order(laureates$surname, laureates$firstname)[1:5], c("surname", "year")]
      surname year
49      Agnon 1966
38 Aleixandre 1977
54     Andric 1961
48   Asturias 1967
46    Beckett 1969
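order() does not sort anything itself: it returns the row positions in sorted order, which are then used as a row index. A minimal sketch on a five-row stand-in frame (called small here, an assumed name) built from the rows printed above:

```r
# Five-row stand-in built from the output above.
small <- data.frame(
    surname = c("Agnon", "Aleixandre", "Andric", "Asturias", "Beckett"),
    year = c(1966, 1977, 1961, 1967, 1969),
    stringsAsFactors = FALSE
)

# Positions of the rows from earliest to latest year.
positions <- order(small$year)

# Use the positions as a row index to get the reordered frame.
by_year <- small[positions, ]

# decreasing=TRUE reverses the direction.
newest_first <- small[order(small$year, decreasing = TRUE), ]

print(positions)
print(by_year)
print(newest_first$surname)
```

Note that the row labels travel with the rows, which is why the slide's output shows 49, 38, 54, and so on rather than 1 to 5.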

string search
grep(pattern, s) # which elements match pattern?
▶ Which elements of laureates$bornCountry contain "now"?

grep("now", laureates$bornCountry)
laureates$bornCountry[grep("now", laureates$bornCountry)]

or, for short
grep("now", laureates$bornCountry, value=T)
 [1] "Persia (now Iran)"
 [2] "Free City of Danzig (now Poland)"
 [3] "USSR (now Russia)"
 [4] "Austria-Hungary (now Czech Republic)"
 [5] "Russian Empire (now Lithuania)"
 [6] "Crete (now Greece)"
 [7] "Russian Empire (now Poland)"
 [8] "Austria-Hungary (now Ukraine)"
 [9] "Ottoman Empire (now Turkey)"
[10] "Bosnia (now Bosnia and Herzegovina)"
[11] "French Algeria (now Algeria)"
[12] "Russian Empire (now Finland)"
[13] "Russian Empire (now Poland)"
[14] "Prussia (now Germany)"
[15] "Prussia (now Germany)"
[16] "East Friesland (now Germany)"
[17] "British India (now India)"
[18] "Tuscany (now Italy)"
[19] "Schleswig (now Germany)"
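A sketch of the three shapes a search result can take, on a four-element stand-in vector (born is an assumed name; two of its values come from the output above). grepl() is a companion base R function that returns a logical vector instead of positions.

```r
# Stand-in for laureates$bornCountry.
born <- c("France", "Persia (now Iran)", "Canada", "USSR (now Russia)")

positions <- grep("now", born)               # which elements match
matches <- grep("now", born, value = TRUE)   # the matching strings themselves
flags <- grepl("now", born)                  # TRUE/FALSE for every element

print(positions)
print(matches)
print(flags)
```

The logical form is the one that drops straight into frame indexing as a row filter. Writing TRUE in full rather than T is a cautious habit, since T is an ordinary variable that can be reassigned.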

string substitution
gsub(pattern, replacement, s) # globally replace
gsub("male", "male-identified", laureates$gender)
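One thing worth checking with a substitution like the one above: the pattern "male" also occurs inside "female". A small sketch on a hypothetical three-element gender vector (the values are assumed, not taken from the data), with an anchored variant using the ^ and $ markers covered later in these slides:

```r
# Hypothetical values standing in for laureates$gender.
gender <- c("male", "female", "male")

# The bare pattern matches inside "female" too.
loose <- gsub("male", "male-identified", gender)

# Anchoring to the whole string leaves "female" alone.
strict <- gsub("^male$", "male-identified", gender)

# sub() replaces only the first match in each string; gsub() replaces all.
first_only <- sub("a", "A", "banana")
every <- gsub("a", "A", "banana")

print(loose)
print(strict)
print(c(first_only, every))
```

Whether the loose or the strict behavior is wanted depends on the task; the point is only that the difference is easy to miss.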

a new grammar: patterns
grep("197.", laureates$year, value=T)
[1] "1979" "1978" "1977" "1976" "1975" "1974" "1973"
[8] "1972" "1971" "1970"
▶ most characters match themselves
▶ . matches any character

meta: backslash
\\   next normal character is special
\\d  a digit
\\s  a white-space character (space, tab, …)
\\w  a “word character” (letters…)
\\D  anything but a digit
\\S  anything but white space
\\W  anything but a word character

grep("\\W", laureates$surname, value=T)
[1] "Vargas Llosa" "Le Clézio" "García Márquez"
[4] "Martin du Gard" "O'Neill" "von Heidenstam"
[Edited 2/21/15: This used to show perl=T but that was misleading. That option can work around some encoding issues, but there are other, more comprehensive approaches to encoding problems, which I’ve used to fix this slide. See Gries for examples of the kind of patterns that require perl=T.]
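A sketch of the backslash classes on a short ASCII-only vector (the vector x is a stand-in; ASCII is chosen deliberately, since how accented letters are classified can depend on encoding, as the note above suggests).

```r
# ASCII-only stand-in strings.
x <- c("Vargas Llosa", "O'Neill", "Yan", "1984")

has_nonword <- grepl("\\W", x)   # a space or an apostrophe counts as non-word
has_digit <- grepl("\\d", x)
has_space <- grepl("\\s", x)
all_nondigit <- grepl("^\\D+$", x)   # nothing but non-digits from start to end

print(has_nonword)
print(has_digit)
print(has_space)
print(all_nondigit)
```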

zero-width
^    the start of the string
$    the end of the string
\\b  a word boundary
grep("Hungary", laureates$bornCountry, value=T)
grep("^Hungary", laureates$bornCountry, value=T)
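A sketch of how the zero-width markers narrow a search, on a two-element stand-in for bornCountry. The string "snowy" is a made-up test word for the word-boundary case.

```r
# Stand-in values; the second appears in the earlier "now" output.
born <- c("Hungary", "Austria-Hungary (now Ukraine)")

anywhere <- grep("Hungary", born)    # matches both
at_start <- grep("^Hungary", born)   # only the string that begins with it
at_end <- grep("Hungary$", born)     # only the string that ends with it

# \\b requires an edge of a word, so "now" inside "snowy" does not count.
whole_word <- grepl("\\bnow\\b", c("Persia (now Iran)", "snowy"))

print(anywhere)
print(at_start)
print(at_end)
print(whole_word)
```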

make-your-own classes
▶ [...] matches exactly one of the characters listed inside, except:
▶ a-z means the range (code order)
▶ initial ^ means opposite day (match any one character not listed)
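A sketch of the three bracket behaviors, using surnames from earlier slides and one of the messy year strings from the cleanup slide.

```r
x <- c("Munro", "Yan", "Modiano")

# One character from the listed set, at the end of the string.
ends_in_vowel <- grepl("[aeiou]$", x)

# A range: one capital letter from A through M, at the start.
starts_a_to_m <- grepl("^[A-M]", x)

# Initial ^ inside the brackets negates: delete everything that is not a digit.
digits_only <- gsub("[^0-9]", "", "[1795?]")

print(ends_in_vowel)
print(starts_a_to_m)
print(digits_only)
```

Note the two unrelated jobs of ^: outside brackets it anchors to the start of the string, inside brackets (in first position) it negates the class.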

quantifiers
?      one or none of previous
*      zero or more
+      one or more
{n}    exactly n
{n,m}  from n to m (can omit either)
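A sketch that runs each quantifier against the same made-up strings, so the differences line up. The spellings and number strings are hypothetical test data; the patterns are anchored with ^ and $ so that the whole string has to fit.

```r
# Made-up spellings with zero, one, and two u's.
x <- c("color", "colour", "colouur")

zero_or_one <- grepl("^colou?r$", x)
zero_or_more <- grepl("^colou*r$", x)
one_or_more <- grepl("^colou+r$", x)
exactly_two <- grepl("^colou{2}r$", x)

# {n,m}: between two and four digits, and nothing else.
two_to_four <- grepl("^\\d{2,4}$", c("7", "96", "1792", "17920"))

print(rbind(zero_or_one, zero_or_more, one_or_more, exactly_two))
print(two_to_four)
```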

spacey <- c("Doris Lessing", "Doris  Lessing", "Doris   Lessing")
grep("Doris Lessing", spacey)
[1] 1
▶ How to match all three?

grep("Doris\\s+Lessing", spacey)
[1] 1 2 3

anchors
grep("^M.*o", laureates$surname, value=T)
[1] "Modiano" "Munro" "Morrison" "Mahfouz"
[5] "Milosz" "Montale" "Mommsen"
grep("^M.*o$", laureates$surname, value=T)
[1] "Modiano" "Munro"

meta: backslash (2)
\\          next special character is normal
\\. \\*     a literal period, a literal asterisk
\\+ \\?     literal + and ?
\\( \\[ \\{ literal (, literal [, literal {
\\\\        literal backslash
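A sketch of escaping in practice, using the messy year strings from the cleanup slide. The fixed=TRUE argument of grepl() is a base R alternative that treats the whole pattern as literal text, so nothing needs escaping.

```r
tricky_years <- c("1774.", "[1793]", "[1795?]", "1792-96.")

# Unescaped, . matches any character, so every string matches.
any_char <- grepl(".", tricky_years)

# Escaped, it matches only a literal period.
has_period <- grepl("\\.", tricky_years)

# A literal opening bracket.
has_bracket <- grepl("\\[", tricky_years)

# fixed=TRUE: the pattern is plain text, so ? is just a question mark.
has_qmark <- grepl("?", tricky_years, fixed = TRUE)

print(any_char)
print(has_period)
print(has_bracket)
print(has_qmark)
```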

time to get grammatical
(...)q     quantifier q applies to everything in (...)
(...|...)  one or the other of the sides of the |
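A sketch of both constructions. The first names come from the output on the next slide; "abab" and "aba" are made-up test strings.

```r
firstnames <- c(
    "Sir Winston Leonard Spencer",
    "Count Maurice (Mooris) Polidore Marie Bernhard",
    "Carl Gustaf Verner"
)

# Alternation: either title, followed by a space, at the start.
has_title <- grepl("^(Sir|Count) ", firstnames)

# A quantifier after a group repeats the whole group.
repeated_ab <- grepl("^(ab)+$", c("abab", "aba"))

print(has_title)
print(repeated_ab)
```

Without the parentheses, "^ab+$" would mean an a followed by one or more b's, which is a different pattern.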

grep("^(\\w+ ){2,}", laureates$firstname, value=T)
 [1] "Sir Vidiadhar Surajprasad"
 [2] "Sir Winston Leonard Spencer"
 [3] "André Paul Guillaume"
 [4] "Carl Friedrich Georg"
 [5] "Carl Gustaf Verner"
 [6] "Gerhart Johann Robert"
 [7] "Count Maurice (Mooris) Polidore Marie Bernhard"
 [8] "Paul Johann Ludwig"
 [9] "Selma Ottilia Lovisa"
[10] "Christian Matthias Theodor"

pattern substitution
many_names <- laureates$firstname[c(7, 99)]
many_names
[1] "Jean-Marie Gustave" "Selma Ottilia Lovisa"
gsub("(\\w+) .*$", "\\1", many_names)
[1] "Jean-Marie" "Selma"
▶ in substitution string, \\n corresponds to nth parenthesized expression in pattern
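A sketch with two parenthesized groups, to show that \\1 and \\2 can be reordered in the replacement. The full_names vector is a stand-in assembled from names that appear elsewhere in these slides.

```r
full_names <- c("Doris Lessing", "Alice Munro")

# Group 1 is the first word, group 2 the second; swap them around a comma.
inverted <- sub("^(\\w+) (\\w+)$", "\\2, \\1", full_names)

# Keeping only group 1 discards the rest, as in the slide's example.
first_only <- sub("^(\\w+) .*$", "\\1", full_names)

print(inverted)
print(first_only)
```

Because each pattern here is anchored at both ends, it can match at most once per string, so sub() and gsub() would give the same result.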

cleanup
tricky_years <- c("1774.", "[1793]", "[1795?]", "1792-96.")
gsub("^som(eth)ing$", "\\1", tricky_years)

gsub("^\\D*(\\d{4}).*$", "\\1", tricky_years)
[1] "1774" "1793" "1795" "1792"
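A possible next step, sketched under one assumption: that some entries might contain no four-digit run at all. The extra value "n.d." is hypothetical and added only to show that case. When the pattern does not match, gsub() hands the string back unchanged, so it helps to check with grepl() before converting to numbers.

```r
# The slide's strings plus one hypothetical entry with no year in it.
tricky_years <- c("1774.", "[1793]", "[1795?]", "1792-96.", "n.d.")

# Which entries contain a run of four digits at all?
has_year <- grepl("\\d{4}", tricky_years)

# Start with all-missing, then fill in only the entries that matched.
years <- rep(NA_integer_, length(tricky_years))
years[has_year] <- as.integer(
    gsub("^\\D*(\\d{4}).*$", "\\1", tricky_years[has_year])
)

print(has_year)
print(years)
```

Leaving unmatched entries as NA keeps the result the same length as the input, so it can go back into a frame as a new column.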

next
▶ Hockey, McCarty, McPherson, Kirschenbaum
▶ http://www.rci.rutgers.edu/~ag978/litdata/hw5
▶ read Gries according to the guide in homework 5
▶ groups…
