Chinese Event Extraction, School of Data Science, Fudan University


SLIDE 1

School of Data Science, Fudan University

Chinese Event Extraction

杨依莹

2017.11.22

SLIDE 2

Outline

  • 1. ACE program
  • 2. Assignment 3: Chinese event extraction
  • 3. CRF++: Yet Another CRF toolkit

Current section: 1. ACE program

SLIDE 3

ACE program

Automatic Content Extraction (ACE) program:

  • The objective of the ACE program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and of text derived from ASR and OCR).
  • The program covers English, Arabic and Chinese texts.
  • The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.

SLIDE 4

ACE program

Automatic Content Extraction (ACE) program:

Given a text in natural language, the ACE challenge is to detect:

  • 1. Entities mentioned in the text, such as persons, organizations, locations, facilities and weapons.
  • 2. Relations between entities, such as "person A is the manager of company B". Relation types include role, part, located, near and social.
  • 3. Events mentioned in the text, such as interaction, movement, transfer, creation and destruction.

SLIDE 5

ACE program

Automatic Content Extraction (ACE) program:

An example of text (figure not preserved)

SLIDE 6

ACE program: entity

  • Entity Detection and Tracking (EDT)
  • The ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.
  • For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured.

SLIDE 7

ACE program: relation

  • Relation Detection and Characterization (RDC)
  • RDC involves identifying relations between entities.
  • For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.

SLIDE 8

ACE program: relation

  • Create new structured knowledge bases, useful for any application
  • Augment current knowledge bases
  • Adding words to the WordNet thesaurus, or facts to Freebase or DBpedia
  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.
  • Freebase: a collaboratively edited knowledge base whose data came partly from Wikipedia (e.g. infoboxes).
  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement for the Freebase API.
  • Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?
  • (acted-in ?x “E.T.”)(is-a ?y actor)(granddaughter-of ?x ?y)
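The question-answering query on this slide is a conjunction of triple patterns that share the variables ?x and ?y. A minimal sketch of how such a query could be answered, assuming a tiny hand-written fact set (`FACTS` and `answer` are hypothetical names for illustration, not the API of any real knowledge base):

```python
# Toy knowledge base of (relation, subject, object) facts, written by hand
# for illustration only.
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("is-a", "Drew Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def answer(facts):
    """Return every ?y such that (acted-in ?x "E.T."), (is-a ?y actor)
    and (granddaughter-of ?x ?y) all hold for some ?x."""
    results = set()
    for rel, x, movie in facts:
        if rel != "acted-in" or movie != "E.T.":
            continue
        # ?x is now bound; look for a grandparent ?y who is an actor.
        for rel2, x2, y in facts:
            if rel2 == "granddaughter-of" and x2 == x and ("is-a", y, "actor") in facts:
                results.add(y)
    return results


print(answer(FACTS))
```

Relation extraction is what would populate a fact set like this from raw text; the nested loops simply join the three patterns on their shared variables.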

SLIDE 9

ACE program: relation

Automatic Content Extraction (ACE) program:

  • 7 types and 17 subtypes of relations from the “Relation Extraction Task”:

    PHYSICAL:             Located, Near
    PART-WHOLE:           Geographical, Subsidiary
    PERSON-SOCIAL:        Business, Family, Lasting Personal
    ORG AFFILIATION:      Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
    GENERAL AFFILIATION:  Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
    ARTIFACT:             User-Owner-Inventor-Manufacturer

SLIDE 10

ACE program: relation

  • Physical-Located (PER-GPE): He was in Tennessee
  • Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
  • Person-Social-Family (PER-PER): John’s wife Yoko
  • Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…

SLIDE 11

ACE program: relation

  • Using Patterns to Extract Relations
  • Lexico-syntactic patterns
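A lexico-syntactic pattern can be sketched as a regular expression over the surface text. The example below is only an illustration: the two patterns are hand-written from the relation examples shown earlier in the deck, and a run of capitalized tokens stands in for a real named entity recognizer (both are assumptions, as are the names `PATTERNS` and `extract_with_patterns`).

```python
import re

# A run of capitalized tokens stands in for a recognized entity mention
# (an assumption; a real system would use a named entity recognizer).
ENTITY = r"([A-Z][\w.]*(?: [A-Z][\w.]*)*)"

# Each entry pairs a compiled pattern with the relation label it signals.
PATTERNS = [
    (re.compile(ENTITY + r", the parent company of " + ENTITY), "Part-Whole-Subsidiary"),
    (re.compile(ENTITY + r", (?:co-)?founder of " + ENTITY), "Org-AFF-Founder"),
]


def extract_with_patterns(text):
    """Return (relation, left_entity, right_entity) for every pattern match."""
    found = []
    for regex, relation in PATTERNS:
        for match in regex.finditer(text):
            found.append((relation, match.group(1), match.group(2)))
    return found


print(extract_with_patterns("Steve Jobs, co-founder of Apple, spoke first."))
```

Patterns of this kind tend to be precise but cover few phrasings, which is one motivation for the learning-based approaches on the following slides.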

SLIDE 12

ACE program: relation

  • Supervised Learning
  • 1. Find all pairs of named entities
  • 2. Decide whether the 2 entities are related
  • 3. If yes, classify the relation
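The three steps can be sketched as a small pipeline. In the sketch below the two classifiers are replaced by a hand-written lookup table (`TOY_LABELS`, a hypothetical stand-in for trained models), so only the control flow is meant to be representative.

```python
from itertools import combinations

# Hand-written stand-in for the two trained classifiers (illustrative only).
TOY_LABELS = {
    ("American Airlines", "AMR"): "Part-Whole-Subsidiary",
    ("American Airlines", "Tim Wagner"): "Org-AFF-Employment",
}


def candidate_pairs(entities):
    """Step 1: enumerate every pair of entity mentions in a sentence."""
    return list(combinations(entities, 2))


def is_related(e1, e2):
    """Step 2: binary decision (here a table lookup instead of a model)."""
    return (e1, e2) in TOY_LABELS


def classify(e1, e2):
    """Step 3: assign a relation type to a related pair."""
    return TOY_LABELS[(e1, e2)]


def supervised_extract(entities):
    """Run the three steps and return (entity1, entity2, relation) triples."""
    return [(e1, e2, classify(e1, e2))
            for e1, e2 in candidate_pairs(entities)
            if is_related(e1, e2)]


print(supervised_extract(["American Airlines", "AMR", "Tim Wagner"]))
```

Splitting the decision into a cheap binary filter followed by a multi-class classifier keeps the many unrelated pairs from reaching the more expensive second model.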

SLIDE 13

ACE program: relation

  • Supervised Learning
  • The most important step: classification
  • e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
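Classification needs features for each candidate pair. The sketch below computes a few commonly used ones (head words, entity types, words between the mentions) for the example sentence; the feature set and the `(start, end, type)` mention format are assumptions chosen for illustration, not the features prescribed by the slides.

```python
TOKENS = ("American Airlines , a unit of AMR , immediately matched the move , "
          "spokesman Tim Wagner said .").split()


def pair_features(tokens, m1, m2):
    """Build a feature dict for two mentions.

    Each mention is a hypothetical (start, end, entity_type) triple of token
    offsets, with end exclusive and m1 appearing before m2.
    """
    s1, e1, t1 = m1
    s2, e2, t2 = m2
    between = tokens[e1:s2]
    return {
        "m1_head": tokens[e1 - 1],        # last token as a crude head word
        "m2_head": tokens[e2 - 1],
        "type_pair": f"{t1}-{t2}",        # concatenated entity types
        "words_between": tuple(between),  # words separating the mentions
        "n_between": len(between),
    }


# "American Airlines" spans tokens 0-1, "AMR" is token 6.
print(pair_features(TOKENS, (0, 2, "ORG"), (6, 7, "ORG")))
```

Dictionaries like this can be fed to any standard classifier after one-hot encoding.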

SLIDE 14

ACE program: relation

  • Semi-supervised Learning
  • 1. Start with a few high-precision seed patterns or seed tuples.
  • 2. Find sentences that contain the entities in a seed pair.
  • 3. Extract and generalize the context to learn new patterns.
  • This may cause semantic drift.

SLIDE 15

ACE program: relation

  • Semi-supervised Learning
  • To avoid semantic drift, we introduce a confidence value.
  • Set conservative confidence thresholds for the acceptance of new patterns and tuples.
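One bootstrapping iteration with a confidence threshold can be sketched as follows. This is a deliberately simplified illustration: sentences are assumed to be pre-reduced to `(entity1, context, entity2)` triples, a pattern is just the exact context string, and confidence is taken to be the share of a pattern's matches that are already known tuples. All of these are assumptions, not the scoring used by any particular system.

```python
def bootstrap_step(corpus, seed_tuples, min_confidence=0.5):
    """Run one iteration and return (accepted_patterns, new_tuples)."""
    known = set(seed_tuples)
    # Contexts of sentences that mention a known pair become candidate patterns.
    candidates = {ctx for e1, ctx, e2 in corpus if (e1, e2) in known}

    accepted = {}
    for pattern in candidates:
        matches = [(e1, e2) for e1, ctx, e2 in corpus if ctx == pattern]
        # Confidence: fraction of the pattern's matches that are known tuples.
        confidence = sum(pair in known for pair in matches) / len(matches)
        if confidence >= min_confidence:
            accepted[pattern] = confidence

    # Accepted patterns propose new tuples for the next iteration.
    new_tuples = {(e1, e2) for e1, ctx, e2 in corpus
                  if ctx in accepted and (e1, e2) not in known}
    return accepted, new_tuples


CORPUS = [
    ("Steve Jobs", ", co-founder of", "Apple"),
    ("Bill Gates", ", co-founder of", "Microsoft"),
    ("Steve Jobs", "visited", "Apple"),
    ("Tim Wagner", "visited", "Tennessee"),
    ("John", "visited", "Yoko"),
]

print(bootstrap_step(CORPUS, {("Steve Jobs", "Apple")}))
```

With the threshold at 0.5 the noisy pattern "visited" is rejected; lowering it to 0 would accept that pattern and pull unrelated pairs into the tuple set, which is the semantic drift the threshold is meant to limit.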

SLIDE 16

ACE program: event

Automatic Content Extraction (ACE) program:

  • Event Detection and Characterization (EDC)
SLIDE 17

Outline

  • 1. ACE program
  • 2. Assignment 3: Chinese event extraction
  • 3. CRF++: Yet Another CRF toolkit

Current section: 2. Assignment 3: Chinese event extraction

SLIDE 18

Description

  • In this assignment, you will use sequence labeling models for Chinese event extraction.
  • Event information is defined in two parts:
  • Trigger: the main word that most clearly expresses the occurrence of an event.
  • Argument: an entity, temporal expression or value that plays a certain role in the event.
  • For example: “英特尔在中国成立了研究中心” ("Intel established a research center in China")
  • “成立” (established) is the trigger, of type Business
  • “英特尔” (Intel), “中国” (China) and “研究中心” (research center) are the arguments, of types Agent, Place and Org
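As a sequence labeling problem, the example sentence could be represented as below. The word segmentation is an assumption made for illustration, and the label strings follow the 'T_type' / 'A_type' / 'O' scheme from the formal definition of the task; the exact label spellings in the provided files may differ.

```python
# Segmentation of the example sentence (assumed for illustration).
words = ["英特尔", "在", "中国", "成立", "了", "研究中心"]

# Trigger labeling: only the trigger word carries a T_<type> label.
trigger_labels = ["O", "O", "O", "T_Business", "O", "O"]

# Argument labeling: each argument carries an A_<role> label.
argument_labels = ["A_Agent", "O", "A_Place", "O", "O", "A_Org"]

# One word per line, labels separated by tabs.
for word, trigger, argument in zip(words, trigger_labels, argument_labels):
    print(f"{word}\t{trigger}\t{argument}")
```

The two subtasks use separate label sequences over the same words, which is why the assignment ships separate trigger and argument files.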

SLIDE 19

Description

  • This task is separated into two subtasks:
  • Trigger labeling: identify the trigger word in the sentence and classify it into one of 8 types.
  • Argument labeling: identify all the arguments in the sentence and classify them into 35 types (all types can be found in the training file).
  • You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.
  • Note that the performance of HMM can be very poor.
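For orientation, an HMM tagger can be sketched in a few dozen lines: count label transitions and word emissions on the training data, smooth them, and decode with Viterbi. The class below is a minimal illustration under stated assumptions (bigram transitions, add-k smoothing, log-space scores, the hypothetical name `HMMTagger`); it is not a reference solution for the assignment.

```python
import math
from collections import Counter, defaultdict


class HMMTagger:
    """Bigram HMM with add-k smoothing and Viterbi decoding."""

    def __init__(self, k=1.0):
        self.k = k

    def fit(self, sentences):
        """sentences: list of [(word, label), ...] lists."""
        self.trans = defaultdict(Counter)  # previous label -> next-label counts
        self.emit = defaultdict(Counter)   # label -> word counts
        self.vocab = set()
        for sentence in sentences:
            prev = "<s>"
            for word, label in sentence:
                self.trans[prev][label] += 1
                self.emit[label][word] += 1
                self.vocab.add(word)
                prev = label
        self.labels = sorted(self.emit)
        return self

    def _log_trans(self, prev, label):
        total = sum(self.trans[prev].values())
        return math.log((self.trans[prev][label] + self.k)
                        / (total + self.k * len(self.labels)))

    def _log_emit(self, label, word):
        total = sum(self.emit[label].values())
        # The extra +1 reserves probability mass for unseen words.
        return math.log((self.emit[label][word] + self.k)
                        / (total + self.k * (len(self.vocab) + 1)))

    def predict(self, words):
        """Return the most likely label sequence for a list of words."""
        if not words:
            return []
        score = {y: self._log_trans("<s>", y) + self._log_emit(y, words[0])
                 for y in self.labels}
        backpointers = []
        for word in words[1:]:
            new_score, pointers = {}, {}
            for y in self.labels:
                best = max(self.labels, key=lambda p: score[p] + self._log_trans(p, y))
                new_score[y] = score[best] + self._log_trans(best, y) + self._log_emit(y, word)
                pointers[y] = best
            score = new_score
            backpointers.append(pointers)
        # Follow the back-pointers from the best final label.
        path = [max(self.labels, key=score.get)]
        for pointers in reversed(backpointers):
            path.append(pointers[path[-1]])
        return path[::-1]
```

Because an HMM only sees the current word and the previous label, it cannot use overlapping context features the way a CRF can, which is consistent with the warning above about its performance.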
SLIDE 20

Formal Definition

Input

A sequence of segmented Chinese words.

Output

Label each word with ‘T_type’ (trigger), ‘A_type’ (argument) or ‘O’ (neither trigger nor argument). Save your predicted label after the real label, separated by a tab.

Figures (images not preserved): fig. 1: input; fig. 2: training instance; fig. 3: testing result
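Reading the data and writing predictions in this layout can be sketched as below, assuming one word and its label per line separated by a tab, with a blank line between instances. The function names are hypothetical, and the exact layout should be checked against the provided files.

```python
def read_instances(lines):
    """Group 'word<TAB>label' lines into instances separated by blank lines."""
    instances, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                instances.append(current)
                current = []
            continue
        word, label = line.split("\t")[:2]
        current.append((word, label))
    if current:
        instances.append(current)
    return instances


def format_results(instances, predictions):
    """Emit 'word<TAB>real label<TAB>predicted label' lines,
    with a blank line after each instance."""
    out = []
    for instance, predicted in zip(instances, predictions):
        for (word, gold), guess in zip(instance, predicted):
            out.append(f"{word}\t{gold}\t{guess}")
        out.append("")
    return "\n".join(out)


sample = ["英特尔\tO\n", "成立\tT_Business\n", "\n", "中国\tO\n"]
instances = read_instances(sample)
print(format_results(instances, [["O", "T_Business"], ["O"]]))
```

`read_instances` accepts any iterable of lines, so an open file object (for example one opened with `encoding="utf-8"`) can be passed directly.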

SLIDE 21

Provided Files

  • trigger_train.txt & trigger_test.txt:
  • These two files contain 1,918 and 669 instances for training and testing, respectively.
  • Each line contains one word and its label, separated by a tab.
  • Instances are separated by a blank line.
  • argument_train.txt & argument_test.txt:
  • These two files contain 2,131 and 997 instances for training and testing, respectively.
  • Your job is to predict the label sequence for the instances in the test files and write your predictions to the result files. The labels in the test files are only for evaluation.
  • eval.py
  • This file can help you evaluate your model’s recall, accuracy, precision and F1-score.
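The contents of the provided eval.py are not shown in the slides. As an illustration only, the four metrics could be computed at the token level as in the sketch below, which treats every non-'O' label as a positive; the provided script may define them differently, and `evaluate` is a hypothetical name.

```python
def evaluate(gold, predicted):
    """Token-level accuracy, precision, recall and F1 over flat label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    correct = sum(g == p for g, p in pairs)
    # A true positive is a non-'O' prediction that matches the gold label.
    true_pos = sum(g == p and g != "O" for g, p in pairs)
    n_pred = sum(p != "O" for p in predicted)
    n_gold = sum(g != "O" for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = correct / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


print(evaluate(["O", "T_Business", "O", "T_Business"],
               ["O", "T_Business", "T_Business", "O"]))
```

Since most words are labeled 'O', accuracy alone can look high even for a model that never predicts a trigger, which is why precision, recall and F1 are reported alongside it.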

SLIDE 22

Submission

  • Generate a zip file and name it “sid_homework-3.zip”.
  • It should include a Python file named “extraction.py”, two output files named “trigger_result.txt” and “argument_result.txt”, and a written report named “chinese event extraction.pdf”.
  • Program: code should be written in Python.
  • Report: the report must be written in English and be no more than 4 pages long.

SLIDE 23

Evaluation

  • We will mark your homework based on the following four criteria:
  • Final accuracy (20%)
  • Program (30%)
  • Report (40%)
  • HMM implementation (10%)
SLIDE 24

Due

  • Submit your homework via the E-learning system.
  • Deadline: midnight on December 8th, 2017
  • If you have any questions about this homework, send an email to the TA or to our course mailbox.
  • TA in charge: 杨依莹 (zoeyangyy@163.com)
SLIDE 25

Outline

  • 1. ACE program
  • 2. Assignment 3: Chinese event extraction
  • 3. CRF++: Yet Another CRF toolkit

Current section: 3. CRF++: Yet Another CRF toolkit

SLIDE 26

CRF++: Yet Another CRF toolkit

  • CRF++ ( http://taku910.github.io/crfpp/ ) is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
  • CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.

SLIDE 27

CRF++: Yet Another CRF toolkit

  • Template basics
  • Each line in the template file denotes one template. In each template, the special macro %x[row,col] is used to specify a token in the input data: row is the position relative to the current token, and col is the absolute column index.
  • Some examples of the replacements:

Input data:

    He        PRP  B-NP
    reckons   VBZ  B-VP
    the       DT   B-NP   << CURRENT
    current   JJ   I-NP
    account   NN   I-NP

    template           expanded feature
    %x[0,0]            the
    %x[0,1]            DT
    %x[-1,0]           reckons
    %x[-2,1]           PRP
    %x[0,0]/%x[0,1]    the/DT
    ABC%x[0,1]123      ABCDT123
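The macro expansion can be emulated in a few lines of Python, which may help when designing templates. This is only a sketch of the substitution rule, not CRF++'s own code; in particular, the `_B` placeholder returned for rows outside the sentence is a simplification assumed here.

```python
import re

# Input data from the example; the current token is "the" (row index 2).
DATA = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, data, position):
    """Replace each %x[row,col] with the token at (position + row, col)."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(data):
            return data[row][col]
        return "_B"  # simplified placeholder for rows outside the sentence
    return MACRO.sub(lookup, template)


for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, DATA, 2))
```

For the assignment data, column 0 would be the word itself, so templates such as %x[-1,0] and %x[1,0] give the model access to neighboring words.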

SLIDE 28

CRF++: Yet Another CRF toolkit

  • Training (encoding)
  • Use the crf_learn command:

% crf_learn template_file train_file model_file

  • There are 4 major parameters to control the training condition:
  • -a CRF-L2 or CRF-L1: changes the regularization algorithm. The default setting is L2. Generally speaking, L2 performs slightly better than L1.
  • -c float: changes the hyper-parameter for the CRFs. This parameter trades off overfitting against underfitting.
  • -f NUM: sets the cut-off threshold for the features. CRF++ uses the features that occur no fewer than NUM times in the given training data. The default value is 1.
  • -p NUM: if the PC has multiple CPUs, you can make training faster by using multi-threading. NUM is the number of threads.
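Since the assignment code is written in Python, the training command can be assembled there and run as an external process. The helper below is a hypothetical wrapper that only builds the argument list from the four options described above; the file names are placeholders, and actually running it requires crf_learn to be installed and on the PATH.

```python
import shlex


def crf_learn_command(template, train, model,
                      algorithm="CRF-L2", c=1.0, min_freq=1, threads=1):
    """Build the crf_learn argument list for the four main options."""
    return [
        "crf_learn",
        "-a", algorithm,      # regularization: CRF-L2 or CRF-L1
        "-c", str(c),         # hyper-parameter balancing over- and underfitting
        "-f", str(min_freq),  # feature frequency cut-off
        "-p", str(threads),   # number of threads
        template, train, model,
    ]


cmd = crf_learn_command("template", "trigger_train.txt", "trigger_model",
                        c=1.5, threads=4)
print(shlex.join(cmd))
# To train for real: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.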

SLIDE 29

CRF++: Yet Another CRF toolkit

  • Testing (decoding)
  • Use the crf_test command:

% crf_test -m model_file test_files

  • Here model_file is the file that crf_learn creates, and test_files are the test data to which you want to assign sequential tags. These files have to be written in the same format as the training file.
  • The -v option sets the verbose level; the default value is 0. With a higher level, you can also get the marginal probability for each tag and a conditional probability for the output.

% crf_test -v1 -m model test.data | head

    Rockwell       NNP  B  B/0.992465
    International  NNP  I  I/0.979089
    Corp.          NNP  I  I/0.954883
    's             POS  B  B/0.986396
    Tulsa          NNP  I  I/0.991966
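The verbose output can be parsed back into Python for evaluation or for writing the result files. The sketch below assumes the layout shown above (whitespace-separated columns, the last one being tag/probability) and also assumes that each sentence's output may begin with a "# probability" line, as in the CRF++ documentation; `parse_verbose_output` is a hypothetical helper name.

```python
def parse_verbose_output(lines):
    """Parse `crf_test -v1` lines into (token, gold, predicted, probability)."""
    rows = []
    for line in lines:
        line = line.strip()
        # Skip sentence breaks and the per-sentence probability line.
        # Caveat: a token that itself starts with '#' would also be skipped.
        if not line or line.startswith("#"):
            continue
        columns = line.split()
        tag, _, prob = columns[-1].rpartition("/")
        rows.append((columns[0], columns[-2], tag, float(prob)))
    return rows


sample = [
    "# 0.478113",
    "Rockwell\tNNP\tB\tB/0.992465",
    "International\tNNP\tI\tI/0.979089",
    "",
]
for row in parse_verbose_output(sample):
    print(row)
```

Without -v, the last column holds only the predicted tag, so the probability split would not apply; the marginal probabilities are mainly useful for inspecting low-confidence predictions.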

SLIDE 30

Thanks for your attention!