Using character n-grams to classify native language in a non-native English corpus of transcribed speech

Using character n-grams to classify native language in a non-native English corpus of transcribed speech
Charlotte Vaughn, Janet Pierrehumbert, Hannah Rohde
Northwestern University
AACL 2009 | University of Alberta | October 10

Authorship attribution (Mosteller and Wallace, 1964; Koppel, Schler, and Zigdon, 2005)
▸ Use various components of writing (e.g. syntactic, stylistic, discourse-level) to determine aspects of the author’s identity
  – e.g. gender, emotional state, native language, actual identity

Native language classification (Tsur and Rappoport, 2007)
▸ Examined English writing from the International Corpus of Learner English (ICLE)
  – Used subcorpora from 5 different native language backgrounds: Bulgarian, Czech, French, Russian, Spanish
▸ Divided each document into character n-grams
  – e.g. ‘bigrams’ = ‘_b’, ‘bi’, ‘ig’, ‘gr’, ‘ra’, ‘am’, ‘ms’, and ‘s_’
▸ Used multi-class support vector machine (SVM) to classify each document by native language of writer
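The n-gram split described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by Tsur and Rappoport; the function names `char_ngrams` and `doc_ngrams` are hypothetical, and padding with a single `_` per word boundary is an assumption based on the ‘bigrams’ example.

```python
def char_ngrams(word, n):
    """Return the character n-grams of one word, padded with '_' at both ends."""
    padded = "_" + word + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def doc_ngrams(text, n):
    """Return the character n-grams of a whole text, with '_' standing in for spaces.

    Assumption: runs of whitespace collapse to a single '_'.
    """
    padded = "_" + "_".join(text.split()) + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("bigrams", 2))
# ['_b', 'bi', 'ig', 'gr', 'ra', 'am', 'ms', 's_']
```

Note that `doc_ngrams` also produces n-grams that span a word boundary (for example `t_h` in a trigram model), which `char_ngrams` applied word by word would not.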

Findings (Tsur and Rappoport, 2007)
▸ Obtained 65.6% accuracy in identifying native language of the author based on character bigrams alone
  – Compared with 20% random baseline accuracy, 46.78% accuracy for character unigrams, and 59.67% for character trigrams

Interpretation (Tsur and Rappoport, 2007)
▸ Speculated that “use of L2 words is strongly influenced by L1 sounds and sound patterns” (p. 16)
  bigrams ≈ diphones
▸ Language transfer evident on many levels
  – Effect of L1 on L2 pronunciation is widely attested (Flege, 1987, 1995; Mack, 2003)
▸ But, what if your L1 background doesn’t just affect how you say words in your L2, but what words you use in the first place?

Drawbacks and open questions from Tsur and Rappoport (2007)
▸ How generalizable are these results to speech?
  – Writing is a more conscious, deliberate process than speech
  – If this really is a phonological process, we might expect stronger effects in speech
▸ Used corpus uncontrolled for topic content
  – Did use tf-idf measure to address possible content bias, but nonetheless a highly variable corpus
▸ What is driving this effect?
  – Little evidence offered for the L1-driven phonological hypothesis
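For readers unfamiliar with the tf-idf measure mentioned above, the sketch below shows one common variant: relative term frequency times the log of inverse document frequency. The slides do not say which variant Tsur and Rappoport used, so the formula and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Weight = (count / document length) * log(number of documents / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


docs = [["th", "he", "he"], ["th", "ca"]]
weights = tf_idf(docs)
print(weights[0])  # 'th' occurs in every document, so its weight is 0.0
```

A term that appears in every document is weighted to zero, which is how the measure discounts features that do not help tell documents apart.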

Goals of present study
▸ Extend methodology to naturalistic speech data
▸ Use semantically controlled corpus to minimize variability in topic or register
▸ Explore classifier input in order to pinpoint the source(s) of the effect

The corpus (Van Engen, Baese-Berk, Baker, Choi, Kim, and Bradlow, in press)
▸ The Wildcat Corpus of Native- and Foreign-Accented English (from Northwestern University)
  – Both scripted and spontaneous speech recordings
  – Orthographically transcribed
  – 24 native English speakers & 52 non-native English speakers
    English (n=24), Korean (n=20), Mandarin Chinese (n=20), Indian (n=2), Spanish (n=2), Turkish (n=2), Italian (n=1), Iranian (n=1), Japanese (n=1), Macedonian (n=1), Russian (n=1), Thai (n=1)
  – Designed in part to examine communication between talkers of different language backgrounds

Diapix task  (Van Engen, Baese‐Berk, Baker, Choi, Kim, and Bradlow,  in press) 

Subcorpus details
                            English    Korean     Mandarin   Total
                            (n = 24)   (n = 20)   (n = 20)
Word tokens                 15,617     17,253     19,168     52,038
Word types                  981        927        915        1,461
Word type/token ratio       0.063      0.054      0.048
Unique character bigrams    402        382        378
Unique character trigrams   2,141      2,006      1,982
Space = _    Apostrophe = ‘
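The type/token ratios in the table follow directly from the type and token counts. As a quick check, the sketch below recomputes them from the reported figures; the helper `type_token_ratio` is a hypothetical name showing how the ratio would be computed from a raw token list.

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of word tokens."""
    return len(set(tokens)) / len(tokens)


# (word types, word tokens) as reported in the subcorpus table
reported = {
    "English": (981, 15617),
    "Korean": (927, 17253),
    "Mandarin": (915, 19168),
}

for name, (types, tokens) in reported.items():
    print(name, round(types / tokens, 3))
# English 0.063
# Korean 0.054
# Mandarin 0.048

# The three token counts sum to the reported total.
print(sum(tokens for _, tokens in reported.values()))  # 52038
```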

Classifier
▸ k Nearest Neighbors (kNN)
  – k = number of neighbors
  [Figure: a test vector (5, 3, 0) on axes /ab/, /bc/, /cd/, shown with Native English, Native Korean, and Native Mandarin vectors and an angle θ]
  – 1 speaker = 1 document = 1 vector
    • Multidimensional vectors of frequencies represent either: all words, all bigrams, or all trigrams
  – Random 80% documents training, 20% testing
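A kNN classifier over frequency vectors can be sketched as follows. The θ on the slide suggests vectors are compared by angle, so this sketch uses cosine similarity, but that is an assumption: the slides do not name the distance measure. The function names, the tie handling, and the toy training vectors are all hypothetical.

```python
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse {feature: frequency} vectors."""
    dot = sum(value * v.get(feature, 0) for feature, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(test_vec, training, k):
    """training: list of (vector, label) pairs. Returns the majority label of the k nearest."""
    ranked = sorted(training, key=lambda pair: cosine(test_vec, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def split_documents(docs, train_fraction=0.8, seed=0):
    """Randomly split documents into training and testing portions."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data: one vector per speaker, invented for illustration.
training = [
    ({"ab": 5, "bc": 2}, "English"),
    ({"bc": 1, "cd": 6}, "Korean"),
    ({"ab": 1, "cd": 4}, "Mandarin"),
]
print(knn_classify({"ab": 5, "bc": 3}, training, k=1))  # English
```

With 64 documents (one per speaker in the three subcorpora), `split_documents` at 80% yields 51 training and 13 testing documents.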

Results (in percent correct)
k    Words   Bigrams   Trigrams
1    69.2    69.5      69.2
4    53.8    61.5      76.9
8    69.2    61.5      69.2
Little decrease in accuracy after removing most frequent words

What is doing the classifying?
▸ Pick out n-grams that are:
  – maximally variant in frequency between language backgrounds
  – fairly frequent
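One way to operationalize the two criteria above is to rank n-grams by the variance of their relative frequency across language groups, after discarding n-grams below a raw-count threshold. The slides do not give the actual selection procedure, so the variance score, the `min_total` threshold, and the function name are assumptions.

```python
from statistics import pvariance


def discriminative_ngrams(group_counts, min_total=50, top=20):
    """group_counts: {group: {ngram: count}}.

    Keeps n-grams with at least `min_total` occurrences overall ("fairly frequent"),
    then ranks them by variance of relative frequency across groups
    ("maximally variant"). Returns the `top` highest-ranked n-grams.
    """
    totals = {g: sum(counts.values()) or 1 for g, counts in group_counts.items()}
    vocab = set().union(*group_counts.values())

    scored = []
    for ngram in vocab:
        raw = sum(counts.get(ngram, 0) for counts in group_counts.values())
        if raw < min_total:
            continue
        rel = [group_counts[g].get(ngram, 0) / totals[g] for g in group_counts]
        scored.append((pvariance(rel), ngram))

    scored.sort(reverse=True)
    return [ngram for _, ngram in scored[:top]]


# Invented counts for illustration only.
counts = {
    "English": {"n't": 80, "to_": 100, "oh_": 20},
    "Korean": {"n't": 20, "to_": 100, "oh_": 80},
    "Mandarin": {"n't": 20, "to_": 100, "oh_": 79, "zz_": 1},
}
print(discriminative_ngrams(counts, min_total=50, top=2))
```

In the toy data, `to_` is frequent but used at the same rate by every group, so it ranks last; `zz_` is dropped for being too rare.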

What is doing the classifying?
▸ Look for possible phonological effects
  – Maybe English speakers use words with difficult consonant clusters that non-native speakers avoid?

st_   just  just  just  first  first  first 

So what  is  doing the classifying?  ▸ A number of things… 

Case 1: Single function word
to_   to  to  to  to
N-gram significant because of one single function word
Other examples:
  ut_ = ‘but’ and ‘about’
  _wi and ll_ = ‘will’

Case 2: Single interjection
oh_   oh  oh  oh
N-gram significant because of one single interjection or discourse marker
Other examples:
  hm_ = ‘mhm’
  yes = ‘yes’
  no_ = ‘no’

Case 3: Single morpheme
n’t   don’t  don’t  don’t  doesn’t  doesn’t  didn’t  didn’t  can’t  didn’t
N-gram significant because of one single morpheme

Combination of cases
_ho   to  how  how  how  house  house  honey  holding
Function and content words
Vocabulary items

Combination of cases
_ca   cat  to  case  cat  can  carrying  can  cat  can
Content and function words
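The case-by-case inspection above amounts to asking which words contribute to a given trigram. A minimal sketch of that lookup is below; the function name and the toy token list are hypothetical, and it only considers n-grams that fall inside a single padded word, not ones that span two words.

```python
from collections import Counter


def words_behind_ngram(tokens, ngram):
    """Count the word tokens whose padded form ('_' + word + '_') contains the n-gram."""
    return Counter(word for word in tokens if ngram in "_" + word + "_")


# Invented transcript fragment for illustration.
tokens = ["just", "first", "to", "just", "how", "house", "don’t", "can’t", "cat", "can"]

print(words_behind_ngram(tokens, "st_"))  # Counter({'just': 2, 'first': 1})
print(words_behind_ngram(tokens, "to_"))  # Counter({'to': 1})
print(words_behind_ngram(tokens, "n’t"))  # don’t and can’t, once each
```

An n-gram whose counter holds a single word corresponds to Cases 1 and 2, one shared by several inflected forms to Case 3, and one spread over unrelated words to the combination cases.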

Back to Tsur and Rappoport  ▸ How generalizable are their results to speech?  – Classifier performs well on orthographically transcribed speech  ▸ Have we determined what is driving this  effect?  – Appears to be more lexical than phonological 

Conclusions
▸ Can obtain successful classification using simple orthographic transcription
  – No phonetically or morphologically tagged corpus appears to be necessary
▸ Main action areas are morphosyntax and lexical semantics
▸ Classifier’s statistical power derived from collapsing across related cases
  – Trigrams do this best

Thank you:
Tyler Kendall
Bei Yu
Ann Bradlow
Language Dynamics Lab at Northwestern University
Speech Communication Research Group at Northwestern University
