Natural Language Processing Zhiyuan Liu THUNLP - - PowerPoint PPT Presentation

SLIDE 1

Natural Language Processing

Zhiyuan Liu THUNLP liuzy@tsinghua.edu.cn

SLIDE 2

What is Natural Language Processing?

The nature of NLP is structure prediction!

Input: 救援队正组织力量接应灾民下山 ("The rescue team is organizing its forces to help the disaster victims down the mountain")
Structure prediction
Output: syntactic structure, semantic structure

SLIDE 3

Complexity of NLP

  • The search space of possible syntactic trees for a sentence grows exponentially with sentence length. For a sentence of $n$ words, the number of binary trees is the Catalan number

$$C_{n-1} = \frac{1}{n}\binom{2n-2}{n-1}$$

(Church and Patil, 1982)

Sentence length   Number of binary trees
1                 1
2                 1
3                 2
4                 5
5                 14
6                 42
7                 132
8                 429
9                 1,430
10                4,862
11                16,796
12                58,786
13                208,012
14                742,900
15                2,674,440
16                9,694,845
17                35,357,670
18                129,644,790
19                477,638,700
20                1,767,263,190

This is similar to the search problem in board games such as chess and Go.
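The growth shown in the table can be reproduced with a short calculation. The sketch below assumes the counts are Catalan numbers, so that a sentence of n words has C_{n-1} binary trees; the function name `binary_tree_count` is introduced here for illustration and does not come from the slides.

```python
from math import comb


def binary_tree_count(sentence_length: int) -> int:
    """Count the binary trees over a sentence with the given number of words.

    Assumption: the count is the Catalan number C_{n-1} for n words.
    """
    if sentence_length < 1:
        raise ValueError("sentence_length must be at least 1")
    n = sentence_length - 1
    # Catalan number: C_n = (2n choose n) / (n + 1); the division is exact.
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    for length in (1, 5, 10, 20):
        print(length, binary_tree_count(length))
```

Even at 20 words the count exceeds 1.7 billion, which is why exhaustive enumeration of parses is not practical.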

SLIDE 4

Complexity of NLP

  • Solution: find the optimal structure, regularized with prior syntactic and semantic knowledge
  • The regularized search for NLP is difficult

– Variety
– Recursion
– Ambiguity
– …

SLIDE 5

Complexity of NLP: Variety

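Examples of informal online Chinese:

亲,看帖要回帖哦! ("Dear, remember to reply after reading the post!")
走召弓虽 (the characters 超强, "super strong", each split into two parts)
1314 (sounds like 一生一世, "for a whole lifetime")
菌男霉女 (a pun on 俊男美女, "handsome men and beautiful women")
屌丝 (slang for a self-described "loser")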

SLIDE 6

Complexity of NLP: Recursion

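他觉得 ("He feels")
他觉得自己丑 ("He feels that he is ugly")
他觉得她认为自己丑 ("He feels that she thinks she is ugly")
他觉得昨天下午的聚会她认为自己丑 ("He feels that at yesterday afternoon's party she thought she was ugly")
他觉得他记得昨天下午的聚会她认为自己丑 ("He feels that he remembers that at yesterday afternoon's party she thought she was ugly")
他觉得就在刚才他记得昨天下午的聚会她认为自己丑 ("He feels that just now he remembered that at yesterday afternoon's party she thought she was ugly")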

SLIDE 7

Complexity of NLP: Recursion

SLIDE 8

Complexity of NLP: Recursion

Noam Chomsky: "We hypothesize that FLN [the faculty of language in the narrow sense] only includes recursion and is the only uniquely human component of the faculty of language."

SLIDE 9

Complexity of NLP: Ambiguity

SLIDE 10

Complexity of NLP: Ambiguity

SLIDE 11

Complexity of NLP: Ambiguity

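领导:你这是什么意思? (Boss: "What do you mean by this?")
小明:没什么意思。意思意思。 (Xiaoming: "Nothing much. Just a small token.")
领导:你这就不够意思了。 (Boss: "That is not very decent of you.")
小明:小意思,小意思。 (Xiaoming: "Just a trifle, just a trifle.")
领导:你这人真有意思。 (Boss: "You really are an interesting person.")
小明:其实也没有别的意思。 (Xiaoming: "Honestly, I have no other intention.")
领导:那我就不好意思了。 (Boss: "Then I will be so bold as to accept.")
小明:是我不好意思。 (Xiaoming: "I am the one who should be embarrassed.")

Question: what does "意思" mean in each line of the dialogue above?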

SLIDE 12

Complexity of NLP: Ambiguity

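1. 冬天:能穿多少穿多少; 夏天:能穿多少穿多少。 ("Winter: wear as much as you can; summer: wear as little as you can.")
2. 剩女产生的原因有两个:一是谁都看不上;二是谁都看不上。 ("There are two reasons for 'leftover women': first, she looks down on everyone; second, everyone looks down on her.")
3. 地铁里听到一个女孩大概是给男朋友打电话:“我已经到西直门了,你快出来往地铁站走。如果你到了,我还没到,你就等着吧。如果我到了,你还没到,你就等着吧。” (Overheard on the subway, a girl apparently phoning her boyfriend: "I am already at Xizhimen, so hurry out and walk to the subway station. If you get there and I have not arrived, just wait for me. If I get there and you have not arrived, just you wait.")

Question: explain the difference between the identical-looking phrases in each example.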

SLIDE 14

Complexity of NLP: Ambiguity

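Here 对 can mean "correct", "to check (answers)", or "yes".

W:小明,那些题你对了吗? (W: "Xiaoming, have you checked those problems?")
M:对了,但有些题没有对。 (M: "I have, but some of them were not right.")
W:看你这样似乎很多题都没有对。 (W: "From the look of you, it seems many of them were not right.")
M:对呀。见那么多题不对,我都不敢继续对下去了。 (M: "Yes. Seeing so many wrong, I did not dare keep checking.")
W:这么说,你后面的题都没对了? (W: "So you did not check the remaining ones?")
M:对。 (M: "Right.")

Question: what does this dialogue mean?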

SLIDE 15

Scientific Impact of NLP

  • Turing Test: a test of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human

SLIDE 16

Scientific Impact of NLP

  • Original version: the Imitation Game

SLIDE 18

Scientific Impact of NLP

  • 2011: the IBM Watson DeepQA system competed on Jeopardy! and won first place
  • A new milestone for AI after Deep Blue defeated the world chess champion in 1997

Q: Who was presidentially pardoned on September 8, 1974? A: Nixon.

SLIDE 19

Application Impact of NLP

  • Nature (2011): natural language QA will be the next generation of search engines
  • Gartner Hype Cycle 2012

SLIDE 20

Application Impact of NLP

  • IT giants launch their NLP products

Examples: Apple Siri, Skype Translator, Sogou Input, Google Knowledge Graph

SLIDE 21

Application Impact of NLP

  • Many research grants in NLP come from the US government and military

Project Name                              Release   Start   Grant
Machine Reading                           2007      2008    $67.4 million
Deep Exploration and Filtering of Text    2012      2013    $25.0 million

SLIDE 22

Impact of Chinese NLP

  • The US government regards Chinese as a key language
  • Many institutes take Chinese NLP as a research area

SLIDE 23

Impact of Chinese NLP

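[Figure: International Evaluation on Chinese NLP Tasks. Performance (0% to 100%), World Best vs. China Best, for Chinese word segmentation, Chinese dependency parsing, Chinese semantic role labeling, Chinese semantic dependency parsing, Chinese coreference resolution, Chinese IR4QA, and Chinese-English machine translation]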

SLIDE 24

TYPICAL APPLICATIONS OF NLP

SLIDE 25

Search Engines

SLIDE 26

Online Advertisement

SLIDE 27

Content-based Recommendation

SLIDE 28

Personal Assistant

SLIDE 29

Machine Translation

SLIDE 30

Document Summarization

SLIDE 31

Sentiment Analysis and Opinion Mining

SLIDE 32

Key-phrase Extraction

SLIDE 33

Computational Social Sciences

  • Culturomics (Chinese: 文化组学): http://www.culturomics.org
  • Harvard researchers use keywords over Google Books (5 million books from 1800 to 2000) to study the evolution of human culture
  • Google Books N-grams: https://books.google.com/ngrams
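As a rough illustration of the keyword-over-time idea, the sketch below computes the relative frequency of a word per year. The tiny corpus and the function name `yearly_frequency` are made up for illustration; this is not the Culturomics pipeline or the Google Books data format.

```python
from collections import Counter


def yearly_frequency(corpus, keyword):
    """Return {year: occurrences of keyword / total tokens in that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(keyword.lower())
    # Skip years with no tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


# Hypothetical toy corpus of (year, text) pairs.
toy_corpus = [
    (1900, "the telegraph office sent the telegraph message"),
    (1950, "the telephone replaced the telegraph in most homes"),
    (2000, "the internet and the telephone connect people"),
]

if __name__ == "__main__":
    print(yearly_frequency(toy_corpus, "telegraph"))
```

Normalizing by the total token count per year matters because the number of books differs greatly between years.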

SLIDE 34

Computational Social Sciences

  • Evolution of irregular verbs in English

SLIDE 35

Computational Social Sciences

SLIDE 36

Computational Social Sciences

SLIDE 37

Computational Social Sciences

[Figure: famous persons, birth location → death location. Panels: Winckelmann Corpus, Freebase]

SLIDE 38

Computational Social Sciences

  • Study human psychology through language use

Cristian Danescu-Niculescu-Mizil, with Dan Jurafsky, Jure Leskovec, Christopher Potts. No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. WWW 2013. Best Paper Award.

SLIDE 39

TYPICAL TASKS IN NLP

SLIDE 40

Advances in Natural Language Processing

  • Julia Hirschberg, Columbia University
  • AAAI, ACL Fellow
  • Christopher Manning, Stanford University
  • ACM, AAAI, ACL Fellow
  • Google Scholar citations > 50,000

SLIDE 41

NLP Tasks

SLIDE 42

Two Drivers of Big Data NLP

  • Annotated language resources, e.g., LDC (Linguistic Data Consortium)

– Founded in 1992, about 700 datasets
– Covering speech, syntax, translation, and semantics

SLIDE 43

Two Drivers of Big Data NLP

  • Public Evaluations, e.g., CoNLL Shared Tasks

SLIDE 44

Key Factors for NLP Developments

[Diagram: four factors (Computation Power, Language Resource, Machine Learning, Linguistic Theories) with the labels CPU/GPU, Distributed, LDC, KB, DL, PGM, Syntactic, Semantic]

SLIDE 45

Deep Learning

  • Learn deep structure from big data

[Pictured: Geoffrey Hinton, Judea Pearl. Caption: Turing Award Winner]

SLIDE 46

Deep Learning

  • Deep learning has achieved great success in speech recognition and image annotation

[Figure labels: Speech Recognition; Google Brain; error rate decreases by more than 30%]

SLIDE 47

Deep Learning

  • DL has not yet achieved as significant an improvement on NLP, but it can avoid the feature engineering of conventional methods
  • Brain-inspired methods for language learning

SLIDE 48

Typical NLP Tasks

  • Machine Translation
  • Speech Dialog Systems and Chat-bots
  • Machine Reading
  • Sentiment Analysis and Opinion Mining

SLIDE 49

Machine Translation

[Timeline of approaches: rule-based (from 1960), statistics-based and phrase-based (1990s), neural-based (2015)]

SLIDE 50

Machine Translation

  • Consider more discourse information to make translation more fluent

Feature Weight Optimization for Discourse-Level SMT (DiscoMT 2013)

SLIDE 51

Machine Translation

  • Computer-assisted translation

Predictive Translation Memory: A Mixed-Initiative System for Human Language Translation (2014)

SLIDE 52

Speech Dialog Systems and Chat-bots

  • Speech Recognition (ASR)
  • Dialog Management (DM)
  • Action
  • Text-to-Speech Synthesis (TTS)
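The four components can be read as stages of one pipeline. The sketch below wires them together with stand-in functions; the function names, the keyword rules, and the use of plain strings in place of audio are all assumptions made for illustration, not calls into any real ASR or TTS library.

```python
def recognize_speech(audio: str) -> str:
    """ASR stand-in: in this sketch the 'audio' is already a transcript."""
    return audio.strip().lower()


def manage_dialog(utterance: str) -> str:
    """DM stand-in: map the recognized utterance to an action name."""
    if "weather" in utterance:
        return "report_weather"
    if "time" in utterance:
        return "report_time"
    return "ask_to_repeat"


def perform_action(action: str) -> str:
    """Action stand-in: produce the response text for the chosen action."""
    responses = {
        "report_weather": "It is sunny today.",
        "report_time": "It is ten o'clock.",
        "ask_to_repeat": "Sorry, could you say that again?",
    }
    return responses[action]


def synthesize_speech(text: str) -> str:
    """TTS stand-in: tag the text instead of producing audio."""
    return f"<speech>{text}</speech>"


def run_turn(audio: str) -> str:
    """One dialog turn: ASR -> DM -> Action -> TTS."""
    utterance = recognize_speech(audio)
    action = manage_dialog(utterance)
    return synthesize_speech(perform_action(action))


if __name__ == "__main__":
    print(run_turn("What is the weather like?"))
```

Keeping each stage behind its own function means a stand-in can later be swapped for a real component without touching the others.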

SLIDE 53

Machine Reading

[Timeline: Cyc (1985), WordNet (1990), Wikipedia (2005-2010), HowNet (知网)]

SLIDE 54

Machine Reading

[Diagram stages: Information Detection, Information Extraction, Knowledge Linking, Knowledge Fusion. Example knowledge-graph nodes: Fuel Pump, Pump Relay, Shorts, Cold weather, Headlight Fails, Running hot, Engine Stalls, At low speeds]

SLIDE 55

Knowledge Graphs

SLIDE 56

Construction of Knowledge Graphs

SLIDE 57

Application of Knowledge Graphs

SLIDE 58

Sentiment Analysis and Opinion Mining

  • Infer personal states via text or speech

– Including opinions, emotions, …

  • Detect opinion holders and targets
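A minimal sketch of inferring polarity from text, assuming a tiny hand-made word list. The lexicon, the function name `polarity`, and the counting rule are illustrative assumptions; the slides do not prescribe a method, and practical systems are usually trained on annotated data.

```python
# Hypothetical toy lexicon; real lexicons are far larger.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}


def polarity(text: str) -> str:
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(polarity("I love this phone, the screen is great!"))
```

Word counting of this kind ignores negation and sarcasm, and it does not identify opinion holders or targets.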

SLIDE 59

RECOMMENDED READINGS

SLIDE 60

NLP Books

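Introduction to Information Retrieval (Chinese edition: 信息检索导论). Authors: Christopher D. Manning, Hinrich Schütze, Prabhakar Raghavan. Translator: Wang Bin (王斌). Publisher: Posts & Telecom Press (人民邮电出版社).

Foundations of Statistical Natural Language Processing (Chinese edition: 统计自然语言处理基础). Authors: Chris Manning, Hinrich Schütze. Translators: Yuan Chunfa (苑春法), Li Wei (李伟), Li Qingzhong (李庆中). Publisher: Publishing House of Electronics Industry (电子工业出版社). Published: 2005-01-01. Pages: 432.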

SLIDE 61

Machine Learning Books

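Pattern Recognition and Machine Learning. Author: Christopher M. Bishop. Publisher: Springer. Published: 2007-10-1. Pages: 738.

Statistical Learning Methods (统计学习方法). Author: Li Hang (李航). Publisher: Tsinghua University Press (清华大学出版社). Published: 2012-3. Pages: 235. Price: 38.00 yuan. ISBN: 9787302275954.

Machine Learning (机器学习). Author: Zhou Zhihua (周志华). Publisher: Tsinghua University Press (清华大学出版社). Published: January 2016. ISBN: 9787302423287.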

SLIDE 62

Academic Websites

  • Google Scholar: http://scholar.google.com/
  • ACM Portal: http://dl.acm.org/
  • ACL Anthology: http://www.aclweb.org/anthology/

– ACL, EMNLP, NAACL, COLING
– CCL, NLPCC

SLIDE 63

INTRODUCTION TO THUNLP

SLIDE 64

General Information

  • The lab was founded in the 1970s. It is the earliest and most influential NLP institute in China
  • Lab leader: Prof. Maosong Sun
  • Research Interests

– Chinese Information Processing
– Social Computing
– Statistical Machine Translation
– Cross-lingual Information Retrieval

  • Research Projects

– Deep Learning for Chinese Information Processing, National Basic Research Program of China (973 Program)
– Diffusion Mechanisms of Chinese Memes on Social Media, National Social Science Fund Major Program

SLIDE 65

Research Faculty

  • Prof. Maosong Sun: Chinese information processing, social computation
  • Associate Prof. Yang Liu: statistical machine translation, cross-lingual IR
  • Assistant Prof. Zhiyuan Liu: semantics, knowledge graphs, social computation

SLIDE 66

Group Photo (2014 New Year)

SLIDE 67

Research Cooperation

Cooperation topics: Knowledge Graph, SMT, Large-Scale Machine Learning, Social Computation, Chinese Information Processing, Mobile User Analysis

SLIDE 68

Chinese Information Processing

Lexicon

  • Chinese Word Segmentation and POS Tagging
  • Chinese Abbreviation Detection
  • Chinese Related Word Detection

Syntax

  • Syntactic and Dependency Parsing
  • Sentiment Analysis and Opinion Mining

Document

  • Text Classification
  • Keyphrase Extraction and Summarization
  • Entity Recognition and Disambiguation
  • Document Topic Modeling

Application

  • Error-Tolerant Chinese Pinyin Input Method
  • Relation Extraction and Knowledge Graph

SLIDE 69

Social Computation

Content

  • Microblog CWS and POS Tagging
  • Microblog Event Detection and Classification
  • Social Tag Suggestion

User

  • User Profiling and Interest Mining
  • User Relation Computation

Global

  • New Word Detection and Diffusion Analysis
  • Microblog Forwarding Prediction

Sociology

  • Microblog Misinformation Analysis

SLIDE 70

Machine Translation

Translation Related Techniques

  • Sentence Alignment
  • Word Alignment
  • Parallel Corpora Construction

Basic Theory

  • Translation Modeling
  • Search Algorithm Optimization
  • Deep Learning

Cross-Lingual IR

  • Bilingual Term Extraction
  • Cross-lingual IR

Advanced Topics

  • Semantic Parsing
  • Semantic-based SMT

SLIDE 71

Open Source Code

  • Chinese NLP

– THULAC: An Efficient Lexical Analyzer for Chinese. [homepage][Git C++][Git Java][Git Python]
– THUCTC: An Efficient Chinese Text Classifier. [homepage][Git Java]

  • General NLP

– THUNRE: A package of Neural Relation Extraction. [Git]
– THUNSC: A package of Neural Sentiment Classification. [Git]
– THUTAG: A package of Keyphrase Extraction and Social Tag Suggestion. [Git]
– PLDA+: A package of Parallel LDA. [Git]

SLIDE 72

Open Source Code

  • Knowledge Representation Learning

– KB2E: A package of Knowledge Base to Embeddings. [Git]
– KR-EAR: Knowledge Representation Learning with Entities, Attributes and Relations. [Git]
– TKRL: Type-embodied Knowledge Representation Learning. [Git]
– DKRL: Description-embodied Knowledge Representation Learning. [Git]

  • Language Representation Learning

– MMDW: Max-Margin DeepWalk. [Git]
– CWE: Character Word Embeddings. [Git]
– CLWE: Cross-Lingual Word Embeddings. [Homepage]
– OIWE: Online Interpretable Word Embeddings. [Git]
– TADW: Text-Associated DeepWalk. [Git]
– TWE: Topical Word Embeddings. [Git]

SLIDE 73

Academic Exchange

SLIDE 74

Thank You!

http://nlp.csai.tsinghua.edu.cn/~lzy/
liuzy@tsinghua.edu.cn