Natural Language Processing
Zhiyuan Liu THUNLP liuzy@tsinghua.edu.cn
What is Natural Language Processing?
Structure Prediction
Input: 救援队正组织力量接应灾民下山 (The rescue team is organizing forces to help the disaster victims down the mountain)
Output: Syntactic Structure, Semantic Structure
Number of possible trees of a sentence: grows exponentially with sentence length (Church and Patil, 1982)
Sentence length    Number of binary structure trees
1                  1
2                  1
3                  2
4                  5
5                  14
6                  42
7                  132
8                  429
9                  1,430
10                 4,862
11                 16,796
12                 58,786
13                 208,012
14                 742,900
15                 2,674,440
16                 9,694,845
17                 35,357,670
18                 129,644,790
19                 477,638,700
20                 1,767,263,190
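The counts for longer sentences follow the Catalan numbers: a sentence of n words has C(n-1) binary bracketings. The sketch below is an illustration under that assumption, not part of the original slides; the function name `binary_tree_count` is hypothetical.

```python
from math import comb  # available in Python 3.8+


def binary_tree_count(sentence_length: int) -> int:
    """Count the binary bracketings over `sentence_length` words.

    Assumption: the count is the Catalan number C(n-1), where
    C(k) = comb(2k, k) / (k + 1) and n is the sentence length.
    """
    if sentence_length < 1:
        raise ValueError("sentence_length must be at least 1")
    k = sentence_length - 1
    # The division is always exact for Catalan numbers.
    return comb(2 * k, k) // (k + 1)


if __name__ == "__main__":
    for length in (1, 4, 10, 20):
        print(length, binary_tree_count(length))
```

A closed form is used here rather than enumerating trees, since enumeration itself becomes infeasible at the lengths where the growth matters.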
Similar to the search problem in board games such as chess and Go
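Examples
– 亲,看帖要回帖哦! (Dear, remember to reply after reading the post!)
– 走召弓虽 (超强, "super strong", written with each character split in two)
– 1314 (一生一世, "for a lifetime", a number homophone)
– 菌男霉女 (a pun on 俊男美女, "handsome men and beautiful women")
– 屌丝 (diaosi, self-deprecating slang, roughly "loser")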
We hypothesize that FLN only includes recursion and is the only uniquely human component of the faculty of language.
Q: Who was presidentially pardoned on September 8, 1974? A: Nixon.
Apple Siri, Skype Translator, Sogou Input, Google Knowledge Graph
Project Name                              Release    Start    Grant
Machine Reading                           2007       2008     $67.4 million
Deep Exploration and Filtering of Text    2012       2013     $25.0 million
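[Chart: Performance in International Evaluation on Chinese NLP Tasks (0% to 100%), comparing World Best and China Best on Chinese word segmentation, Chinese dependency parsing, Chinese semantic role labeling, Chinese semantic dependency parsing, Chinese coreference resolution, Chinese IR4QA, and Chinese-English machine translation]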
Famous Persons: Birth Location → Death Location (Winckelmann Corpus, Freebase)
Cristian Danescu-Niculescu-Mizil, with Dan Jurafsky, Jure Leskovec, Christopher Potts. No country for old members: User lifecycle and linguistic change in online communities. WWW 2013. Best Paper Award.
[Diagram: Computation Power (CPU/GPU, Distributed); Language Resource (LDC, KB); Machine Learning (DL, PGM); Linguistic Theories (Syntactic, Semantic)]
Geoffrey Hinton Judea Pearl Turing Award Winner
Speech Recognition (Google Brain): error rate decreases by more than 30%
[Timeline: Rule-based (1960), Statistics-based (1990s), Phrase-based (1990s), Neural-based (2015)]
Feature Weight Optimization for Discourse-Level SMT (DiscoMT 2013)
Predictive translation memory: A mixed-initiative system for human language translation (2014)
1985
Cyc
1990
WordNet Wikipedia
2005-2010
HowNet (知网)
Knowledge Fusion
Fuel Pump Pump Relay Shorts Cold weather Headlight Fails Running hot Engine Stalls At low speeds
Information Extraction, Knowledge Linking, Information Detection
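– Introduction to Information Retrieval (Chinese edition: 信息检索导论). Authors: Christopher D. Manning / Hinrich Schütze / Prabhakar Raghavan. Translator: Wang Bin (王斌). Publisher: Posts & Telecom Press (人民邮电出版社).
– Foundations of Statistical Natural Language Processing (Chinese edition: 统计自然语言处理基础). Authors: Chris Manning / Hinrich Schütze. Translators: Yuan Chunfa (苑春法) / Li Wei (李伟) / Li Qingzhong (李庆中). Publisher: Publishing House of Electronics Industry (电子工业出版社). Published: 2005-01-01. Pages: 432.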
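– Pattern Recognition and Machine Learning. Author: Christopher M. Bishop. Publisher: Springer. Published: 2007-10-1. Pages: 738.
– Statistical Learning Methods (统计学习方法). Author: Li Hang (李航). Publisher: Tsinghua University Press (清华大学出版社). Published: 2012-3. Pages: 235. Price: 38.00 yuan. ISBN: 9787302275954.
– Machine Learning (机器学习). Author: Zhou Zhihua (周志华). Publisher: Tsinghua University Press (清华大学出版社). Published: January 2016. ISBN: 9787302423287.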
– Chinese Information Processing
– Social Computing
– Statistical Machine Translation
– Cross-lingual Information Retrieval

– Deep Learning for Chinese Information Processing. National Basic Research Program of China (973 Program)
– Diffusion Mechanisms of Chinese Memes on Social Media. National Social Science Fund Major Program
Chinese Information Processing, Social Computation
Associate Prof. Yang Liu: Statistical Machine Translation, Cross-lingual IR
Assistant Prof. Zhiyuan Liu: Semantics, knowledge graphs, social computation
Knowledge Graph, SMT, Large-Scale Machine Learning, Social Computation, SMT, SMT, Chinese Information Processing, Mobile User Analysis
Translation Related Techniques
– THULAC: An Efficient Lexical Analyzer for Chinese. [homepage][Git C++][Git Java][Git Python]
– THUCTC: An Efficient Chinese Text Classifier. [homepage][Git Java]
– THUNRE: A package of Neural Relation Extraction. [Git]
– THUNSC: A package of Neural Sentiment Classification. [Git]
– THUTAG: A package of Keyphrase Extraction and Social Tag Suggestion. [Git]
– PLDA+: A package of Parallel LDA. [Git]
– KB2E: A package of Knowledge Base to Embeddings. [Git]
    – KR-EAR: Knowledge Representation Learning with Entities, Attributes and Relations. [Git]
    – TKRL: Type-embodied Knowledge Representation Learning. [Git]
    – DKRL: Description-embodied Knowledge Representation Learning. [Git]
– MMDW: Max-Margin DeepWalk. [Git]
– CWE: Character Word Embeddings. [Git]
– CLWE: Cross-Lingual Word Embeddings. [Homepage]
– OIWE: Online Interpretable Word Embeddings. [Git]
– TADW: Text-Associated DeepWalk. [Git]
– TWE: Topical Word Embeddings. [Git]