Introduction to CL Session 1: 7/08/2011

SLIDE 1

Introduction to CL

Session 1: 7/08/2011

SLIDE 2

What is computational linguistics?

Processing natural language text by computers

– for practical applications
– ... or linguistic research

  • Among practical applications

– Sometimes the computer only needs to classify or transform the text
– ... but sometimes it needs to “understand”
– Ex: Watson, winner of ‘Jeopardy’

  • CL vs. NLP (natural language processing)

SLIDE 3

NLP applications

  • Automatic speech recognition (ASR):

speech → text

  • Machine translation (MT):

L1 → L2

  • Information retrieval (IR):

Query + documents → a subset of the documents

  • Information extraction (IE):

document → “database”
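
The IR mapping above (query + documents → a subset of the documents) can be sketched in a few lines. This is only a toy Boolean-AND matcher, not how a real search engine ranks results; the function names `tokenize` and `retrieve` and the sample documents are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents):
    """Return the documents that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    return [doc for doc in documents if terms <= tokenize(doc)]

# Hypothetical mini-collection, invented for this sketch.
docs = [
    "He visited New York in 2003.",
    "Watson won Jeopardy.",
    "New York is a large city.",
]
hits = retrieve("new york", docs)
print(hits)  # the first and third documents
```

Real IR systems usually rank documents by relevance instead of returning an unordered exact-match subset.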

SLIDE 4

NLP applications (cont)

  • Question answering (QA):

Question + documents → Answer

  • Summarization:

documents → summary

  • Natural language generation (NLG):

representation → text

SLIDE 5

Other Applications

  • Call Center
  • Spam filter
  • Spell checker
  • Sentiment analysis: product reviews
  • Bio-NLP: processing clinical data
  • ….
SLIDE 6

Basic NLP tasks: Shallow processing

  • Tokenization:

– He visited New York in 2003.

  • Morphological analysis:

– visited → visit + -ed

  • Part-of-speech tagging

– He/Pron visited/V New/?? York/N in/Prep 2003/CD

  • Named-entity tagging

– He visited [LOCATION New York] in [YEAR 2003]

  • Chunking

– [NP He] [V visited] [NP New York] in [NP 2003]
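
As a rough illustration of the first two shallow tasks, the sketch below tokenizes the example sentence with a regular expression and strips a regular past-tense suffix. Both rules are simplifying assumptions made for this example: the tokenizer does not treat “New York” as one unit, and the suffix rule would mishandle many English words.

```python
import re

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def analyze(word):
    """Naive morphological analysis: split off a regular past-tense '-ed'."""
    if word.endswith("ed") and len(word) > 4:
        return (word[:-2], "-ed")
    return (word, "")

tokens = tokenize("He visited New York in 2003.")
print(tokens)              # ['He', 'visited', 'New', 'York', 'in', '2003', '.']
print(analyze("visited"))  # ('visit', '-ed')
```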

SLIDE 7

Basic NLP tasks: Deep processing

  • Parsing

– (S (NP (PRON he)) (VP (V visited) ….)

  • Semantic analysis

– Semantic tagging: [AGENT He] visited [DEST New York] ….
– Meaning: visit(he, New-York)

  • Discourse

– Co-reference: “He” refers to “John”
– Discourse structure

  • Dialogue
  • Generation
SLIDE 8

Ambiguity

  • Phonological ambiguity: (ASR)

– “too”, “two”, “to”
– “ice cream” vs. “I scream”
– “ta” in Mandarin: he, she, or it

  • Morphological ambiguity: (morphological analysis)

– unlockable: [[un-lock]-able] vs. [un-[lock-able]]

  • Syntactic ambiguity: (parsing)

– John saw a man with a telescope.
– Time flies like an arrow.
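
To make the syntactic ambiguity concrete, the sketch below counts the parse trees of “John saw a man with a telescope” with a CKY-style chart. The tiny grammar and lexicon are assumptions written for this one sentence, not a grammar from the slides; under them the sentence gets two parses, one attaching the PP to the verb phrase and one attaching it to “a man”.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form: (parent, left child, right child).
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP modifies the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP modifies the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "John": ["NP"], "saw": ["V"], "a": ["Det"],
    "man": ["N"], "with": ["P"], "telescope": ["N"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees rooted in `root` that span all words."""
    n = len(words)
    chart = defaultdict(int)  # chart[(i, j, X)] = number of X-trees over words[i:j]
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1, cat)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, root)]

n_parses = count_parses("John saw a man with a telescope".split())
print(n_parses)  # 2
```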

SLIDE 9

Ambiguity (cont)

  • Lexical ambiguity: (WSD)

– Ex: “bank”, “saw”, “run”

  • Semantic ambiguity: (semantic representation)

– Ex: every boy loves his mother
– Ex: John and Mary bought a house

  • Discourse ambiguity:

– Susan called Mary. She was sick. (coreference resolution)
– It is pretty hot here. (intention resolution)

  • Machine translation:

– “brother”, “cousin”, “uncle”, etc.

SLIDE 10

Ambiguity resolution

  • Rule-based or knowledge-based:

– Parsing: “I saw a man with a hat” vs. “I saw a man with a telescope (in my hand)”
– WSD: “bank”
– MT: “brother”, “cousin”, “uncle”

  • Statistical approach:

– Requires training data
– Builds a statistical model
– Knowledge and rules can be incorporated into the model as features etc.
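
The statistical recipe above (training data, then a model) can be sketched for WSD of “bank” with a small Naïve Bayes classifier over context words. The four training sentences and the sense labels `finance` and `river` are invented toy data, and add-one smoothing is one choice among several.

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (sense of "bank", context sentence).
TRAIN = [
    ("finance", "deposit money in the bank account"),
    ("finance", "the bank approved the loan"),
    ("river", "fishing on the river bank"),
    ("river", "the bank of the river was muddy"),
]

def train(examples):
    """Collect sense counts, per-sense word counts, and the vocabulary."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, context in examples:
        sense_counts[sense] += 1
        for word in context.split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab

def classify(context, model):
    """Return the sense with the highest log-probability under Naive Bayes."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense in sense_counts:
        score = math.log(sense_counts[sense] / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[sense][word] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best, best_score = sense, score
    return best

model = train(TRAIN)
print(classify("a loan from the bank", model))  # finance
print(classify("the muddy river bank", model))  # river
```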

SLIDE 11

Major approaches to NLP

  • Rule-based approach
  • Statistical approach

– Supervised learning
– Semi-supervised learning
– Unsupervised learning

SLIDE 12

Supervised learning algorithms

  • Hidden Markov Model (HMM)
  • Decision tree
  • Decision list
  • Naïve Bayes
  • Transformation-based Learning (TBL)
  • Maximum Entropy (MaxEnt)
  • Support Vector Machine (SVM)
  • Conditional Random Field (CRF)
SLIDE 13

Data

  • Raw text:

– Monolingual: English/Chinese/Arabic Gigawords
– Parallel data: UN data, EuroParl

  • Treebank:

– Syntactic treebanks: a set of parse trees
– Proposition Bank
– Discourse Treebank

  • Dictionaries
  • WordNet
  • FrameNet
SLIDE 14

[Figure: diagram relating Applications, tasks (Task1, Task2, …, Task_i), ML1, ML2, …, ML_m, and D1, D2, …, D_n]

SLIDE 15

The role of linguistic knowledge in NLP

  • Suppose an NLP system is language-independent.
  • Good or bad?

– Good: it can be ported to many languages without any changes.
– Bad: it cannot take advantage of properties of certain languages.

  • How to incorporate (linguistic) knowledge in statistical systems?

– the design of models
– as features
– as filters
– …

  • Building a treebank is an effective way to do so.
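
Two of the options above, knowledge “as features” and “as filters”, might look like this for a POS tagger. It is a sketch under assumptions: the feature names, the small tag dictionary, and the tag set (borrowed from the tagging example on slide 6 plus a hypothetical `Adj`) are illustrative and do not come from any particular system.

```python
# Hypothetical tag dictionary: linguistic knowledge used as a filter.
TAG_DICTIONARY = {"he": {"Pron"}, "in": {"Prep"}, "visited": {"V"}}
ALL_TAGS = {"Pron", "V", "N", "Prep", "CD", "Adj"}

def features(tokens, i):
    """Knowledge as features: suffix, capitalization, digits, previous word."""
    word = tokens[i]
    return {
        "word=" + word.lower(): 1,
        "suffix2=" + word[-2:].lower(): 1,
        "is_capitalized": int(word[0].isupper()),
        "is_digit": int(word.isdigit()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
    }

def candidate_tags(word):
    """Knowledge as a filter: only allow tags the dictionary licenses."""
    return TAG_DICTIONARY.get(word.lower(), ALL_TAGS)

tokens = ["He", "visited", "New", "York", "in", "2003"]
print(features(tokens, 1))
print(candidate_tags("He"))  # {'Pron'}
```

A statistical tagger would score only the candidate tags using the feature values, so the dictionary prunes the search while the features inform the model.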