Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu Center for Computational Learning Systems (CCLS) Columbia University
March 6, 2009 CICLING Reducing Noise in Labels & Features 2
Motivation: Secondary Electrical Grid
Structures at 2nd Ave & 83rd Street, Manhattan
- Manholes
- Service boxes
Serious event: manhole fire in the Village, April 2008
A dense network of structures and cables provides power to NYC buildings
Emergency Control System (ECS) Ticket
1 MR. ROBERT TOBIA (718)555‐5124 ‐ SMOKING. COVER OFF.‐RMKS:
2 01/06/03 08:40 MDETHUILOT DISPATCHED BY 55988
3 01/06/03 09:30 MDETHUILOT ARRIVED BY 55988
4 01/06/02 09:55 THUILOT REPORTS NO SMOKE ON ARRIVAL. THERE IS
5 A SHUNT ON LOCATION ‐ SHUNT & SERVICE NOT EFFECTED.
. . .
8 REQUESTING FLUSH/ORDERED (#2836).
9 ******* NO PARKING : TUES. & FRIDAY, 11:30AM ‐ 1PM ****** RV
10 01/06/03 10:45 THUILOT REPORTS BUILDING 260 W.139 ST.
11 COMPLAINED OF LIGHT PROBLEMS. FOUND 1‐PHASE DOWN ‐ BRIDGED
12 @ 10:30 ( 2‐PHASE SERVICE ) CONSUMER IS CONTENT.
. . .
18 01/06/03 18:45 FERNANDEZ REPORTS THAT IN SB‐521117 F/O254
19 W139 ST. HE CUT OUT A 3W2W COPPERED JT & REPLACED IT W/
20 A 4W NEO CRAB....BY USING 1 LEG OFF THE 7W FROM THE HE
21 WAS ABLE TO PUSH THE MISSING PHASE BACK TO 260, BRIDGE
22 REMOVED....@ THIS TIME FERNANDEZ REPORTS THERE ARE MORE
23 B/O'S & 2 MORE JTS TO C/O, WILL F/U W/ MORE INFO....TCP
Outline
- NLP/IE versus real world problem and data
- Ranking problem: structure vulnerability
- ECS ticket classification problem
– Relation to labels on structures
– Relation to feature representation of structures
- Annotation task: can humans classify tickets?
- Results
– Overall noise reduction
– Improvements to top of list
- Importance of knowledge transfer paths for ML
Typical Impasse
- A real world “database” has free text fields that could provide new relations in an rdb
- Institutional owner gives db to NLP group for data mining
– abysmal gap in domain knowledge
03/06/09 SMH S/W/C BROAD & MAIN FITZSIMMONS REPORTS THE TBL HOLE IS SB-00001 FOUND ON ....SMOKING LIGHTY
NIST 2007 ACE (Automatic Content Extraction)
Results in max/avg value score (roughly, accuracy)
- Entity mentions (5 sites participating; 7 major entity types, e.g., geopolitical, facility, org):
– Broadcast news: 65.9/52.7
– Newswire: 58.1/44.0
– Telephone: 49.2/35.5
– Usenet: 44.0/31.4
- Events (1 site; 8 major event types, e.g., business, meeting, conflict):
– Broadcast news: 12.9
– Newswire: 15.9
– Telephone: 6.6
– Usenet: 11.3
CCLS/Consolidated Edison Collaboration
- Idea(lization):
– Help reduce serious events in the secondary electrical grid
– Use 10 years of Emergency Control System (ECS) trouble ticket data (plus other data sources)
   - A succession of automated/free‐text entries in one ticket
   - A procedure for assigning a “trouble type” to each ticket
– Rank vulnerability of structures to “serious events”
- Reality:
– Data dump of very noisy data
– No operational definition of “serious”
Related Work
- Devaney & Ram, 2005: case‐based reasoning
– 10,000 maintenance logs, machine X
– Unsupervised text clustering, OWL/RDF domain model
- Liddy et al., 2006: sublanguage analysis
– ECS trouble tickets 1995‐2005: 70K train, 7K test, 100 eval
– Reclassification of MSE (misc) trouble type tickets into two trouble types, SMH and WL
- Oza et al., in press: similar gap in domain knowledge for a complex domain
– 800,000 reports from aeronautics db
– SVM and non‐negative matrix factorization on BOW document representation for topical classification (similar to LSA)
Scope of Structure Ranking Problem
- Number of structures in Manhattan: 51,912
- ECS tickets for Manhattan
– Relevant Trouble Types (N=21): 61,730
– Number of structures in tickets: 27,235 (44%)
- Number of “serious” events per year depends on the definition
– Fires and explosions (MHX, MHF, MHO): ~150 (0.6% of structures in ECS)
– Other events, e.g., smoking manholes: ~470 (1.8% of structures in ECS)
Learning Approach to Structure Ranking
- Formulated as a supervised bipartite ranking problem
– A real‐valued score is assigned to each structure
– Goal is to rank positively labeled examples above negatively labeled examples
- Learning algorithm
– Maximizes a weighted version of the AUC
– Here we used SVM‐perf (Joachims, T., 2005)
– We have also used P‐Norm Push (Rudin, C., 2008; a generalization of RankBoost)
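The ranking objective can be illustrated with the plain, unweighted AUC: the fraction of (positive, negative) pairs that the scores order correctly. This is only a sketch of the quantity being measured, not the SVM‐perf or P‐Norm Push implementation, which optimize weighted variants; the function name and toy data are hypothetical.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: one score per structure, label 1 = had a serious event.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = pairwise_auc(toy_scores, toy_labels)
```

An AUC of 1.0 would mean every positively labeled structure is scored above every negatively labeled one; 0.5 corresponds to a random ordering.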
Event Classification: Labels and Features
Depends on defining “serious event”
- Label structures: did structure s_i have a serious event in year Y_j?
- Identify a small number of explanatory features
– Four ECS‐based features affect the top of the list
   - Did s_i have a serious event recently (> Y_j - 3 and < Y_j)?
   - How many recent tickets mention s_i?
   - Did s_i have a serious event in the past (> 1996 and < Y_j)?
   - How many past tickets mention s_i?
– One cable density feature affects the rest of the list
- Train on 2005, test on 2006, evaluate on 2007
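The four ECS‐based features can be sketched as simple counts and flags over the tickets that mention a structure. The `Ticket` record, its field names, and the toy data are hypothetical; only the year windows (recent: after Y‐3 and before Y; past: after 1996 and before Y) come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    year: int              # year the ticket was opened
    structures: frozenset  # structure ids mentioned in the ticket
    serious: bool          # whether the ticket was classified as a serious event


def ecs_features(structure_id, label_year, tickets):
    """Compute the four ticket-based features for one structure and one label year."""
    mentions = [t for t in tickets if structure_id in t.structures]
    recent = [t for t in mentions if label_year - 3 < t.year < label_year]
    past = [t for t in mentions if 1996 < t.year < label_year]
    return {
        "serious_recent": any(t.serious for t in recent),
        "n_recent_tickets": len(recent),
        "serious_past": any(t.serious for t in past),
        "n_past_tickets": len(past),
    }


# Hypothetical toy tickets for a service box "SB-1".
toy_tickets = [
    Ticket(1999, frozenset({"SB-1"}), True),
    Ticket(2003, frozenset({"SB-1", "MH-2"}), False),
    Ticket(2004, frozenset({"SB-1"}), False),
    Ticket(2005, frozenset({"SB-1"}), True),  # label year itself: excluded from features
]
toy_features = ecs_features("SB-1", 2005, toy_tickets)
```

Tickets from the label year are excluded from the features, so the label (a serious event in that year) never leaks into its own predictors.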
Baseline Event Classification
- Length constraint: At least 3 free‐text lines
- Not all tickets correspond to distinct events (referred tickets; no work performed; non‐secondary)
- ECS Ticket Trouble Types (N=21)
– MHX/MHF/MHO: good indicator event is serious
– SMH: moderate indicator event is serious
– ACB: good indicator event is not serious
– 16 other trouble types: generally not serious
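A toy stand‐in for a trouble‐type baseline might look like the following. The slides do not spell out the actual baseline procedure, so the mapping here is an assumption: MHX/MHF/MHO and SMH are treated as serious, every other trouble type as a potential precursor, and tickets with fewer than 3 free‐text lines are excluded. All names are hypothetical.

```python
# Assumption: SMH is grouped with the serious types, although the slide
# only calls it a "moderate" indicator.
ASSUMED_SERIOUS_TYPES = {"MHX", "MHF", "MHO", "SMH"}
MIN_FREE_TEXT_LINES = 3  # length constraint from the slide


def baseline_class(trouble_type, n_free_text_lines):
    """Assign a ticket to 'serious', 'precursor', or 'exclude' from metadata only."""
    if n_free_text_lines < MIN_FREE_TEXT_LINES:
        return "exclude"
    if trouble_type in ASSUMED_SERIOUS_TYPES:
        return "serious"
    return "precursor"


toy_classes = [
    baseline_class("MHF", 12),
    baseline_class("ACB", 5),
    baseline_class("SMH", 2),
]
```

The point of such a baseline is that it never reads the free text, which is why it can disagree with expert judgment on individual tickets.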
ECS Tickets
- Enormous length variation: 1‐522 lines
- Varying proportion of free text lines: 0‐69%
- Fragmentary and telegraphic language
- Specialized terminology (sublanguage)
– CRAB, C&R, TROUBLE HOLE, FLUSH
- Intra‐word line breaks: AFFECTE/ D
- Misspellings inflate vocabulary size
– Before normalization: ~57K unigram types
– After normalization: ~22K unigram types
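The effect of spelling normalization on vocabulary size can be sketched by counting distinct unigram types before and after mapping known variants to one form. The correction table and sample lines are hypothetical; the slides do not describe the normalization procedure that was actually used.

```python
import re

# Hypothetical hand-built correction table (variant -> canonical form).
SPELLING_FIXES = {
    "LIGHTY": "LIGHTLY",
    "SMOKNG": "SMOKING",
}


def unigram_types(lines, normalize=False):
    """Return the set of distinct upper-cased tokens, optionally spell-corrected."""
    types = set()
    for line in lines:
        for token in re.findall(r"[A-Za-z0-9&/']+", line):
            token = token.upper()
            if normalize:
                token = SPELLING_FIXES.get(token, token)
            types.add(token)
    return types


toy_lines = ["SMOKING LIGHTY", "smoking lightly", "SMOKNG COVER OFF"]
raw_types = unigram_types(toy_lines)
norm_types = unigram_types(toy_lines, normalize=True)
```

On this toy input the type count drops from 6 to 4, the same kind of shrinkage the slide reports at corpus scale.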
Human Annotation Task
To acquire an extensional definition of “serious”
- Data: 171 ECS tickets; text only, no access to trouble type, etc.
- Annotators: 2 domain experts
- Task: sort tickets into one of three classes
- 1. Serious event
- 2. Potential precursor event
- 3. Exclude as irrelevant (e.g., not secondary; not an event)
Experts versus Baseline
- Kappa agreement coefficient results
– Ranges from 1 (perfect agreement) through 0 (chance‐level agreement) to -1 (perfect disagreement)
– Experts with baseline (3‐way kappa): 0.25
– Experts with each other: 0.49
- Trouble type does not correspond to expert judgment
- Experts have moderate agreement: the task is subjective
- Difficult prediction problem
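For two annotators, a kappa coefficient can be sketched as Cohen's kappa: observed agreement corrected by the agreement expected from each annotator's label frequencies. This is an illustration of the general idea only; the slides do not say which kappa variant was computed, and the 3‐way figure would need a multi‐rater formulation. The toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two experts on six tickets.
expert_1 = ["serious", "precursor", "precursor", "exclude", "precursor", "serious"]
expert_2 = ["serious", "precursor", "exclude", "exclude", "precursor", "precursor"]
toy_kappa = cohen_kappa(expert_1, expert_2)
```

Here the experts agree on 4 of 6 tickets, but kappa is lower than 0.67 because part of that agreement would be expected by chance.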
Expert vs. Baseline, Annotated Tickets
Ticket Category   Non-Event           Precursor           Serious             Expert
Type              Baseline  Experts   Baseline  Experts   Baseline  Experts   Disagree
ACB               -         -         21        16        -         3         2
MHX/F/O           -         2         1         -         9         7         -
SMH               -         3         -         7         27        15        2
Other             8         17        106       58        -         4         35
Totals            8         22        128       81        36        29        39
Expert vs. Baseline, All Tickets
Ticket Category   Precursor           Serious
Type              Baseline  Rules     Baseline  Rules
ACB               6,171     5,364     192       162
MHX/F/O           -         25        1,785     1,481
SMH               -         1,105     4,906     3,397
Other             25,776    16,978    81        75
Totals            31,947    23,472    6,964     5,115
- Baseline Precursor + Serious = 38,911
- Rules Precursor + Serious = 28,587
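The two totals above follow directly from the column totals in the table, and their difference gives the number of tickets the rules no longer count as events. A short arithmetic check (variable names are illustrative only):

```python
# Column totals from the table (Precursor and Serious, Baseline vs. Rules).
baseline = {"precursor": 31_947, "serious": 6_964}
rules = {"precursor": 23_472, "serious": 5_115}

baseline_total = sum(baseline.values())
rules_total = sum(rules.values())

# Tickets dropped by the rule-based classification, absolute and relative.
removed = baseline_total - rules_total
relative_reduction = removed / baseline_total
```

So the rule‐based classification keeps 10,324 fewer tickets than the baseline, a reduction of roughly 27%.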
Results: AUC scores
- Best improvement on Test Set (2006)
- Obscures what changed:
– Many large demotions of structures that are not so vulnerable (e.g., 759/52K to 3105/52K)
– Side‐effect: small promotions of vulnerable structures (e.g., 69/52K to 45/52K)
AUC          TRAIN   TEST
Baseline     67.63   65.01
Rule-based   68.29   67.55
Changes to Top of Ranked List
- Jaccard coefficient finds the “similarity” of two sets (range is 0 to 1, with J=1 when A=B)
- For N=5 to 1000, compare the top N structures of the ranked list from the baseline classification of events versus the rule‐based classification
- E.g., at N=500 (J=0.75), about one in seven structures in the top 500 of the Rules ranking is not in the top 500 of the Baseline ranking
J(A, B) = |A ∩ B| / |A ∪ B|

N         5     10    15    20    100   500   1000
Jaccard   0.25  0.33  0.50  0.60  0.72  0.75  0.81
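The top‐N comparison can be sketched as the Jaccard coefficient of the first N entries of two ranked lists. The helper names and the toy rankings are hypothetical; the last helper relies on the observation that two sets of equal size N sharing k elements have J = k / (2N - k).

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def top_n_jaccard(ranking_a, ranking_b, n):
    """Compare the top-n entries of two ranked lists, ignoring order within the top n."""
    return jaccard(ranking_a[:n], ranking_b[:n])


def shared_from_jaccard(j, n):
    """Invert J = k / (2n - k) for two sets of equal size n: k = 2nJ / (1 + J)."""
    return 2 * n * j / (1 + j)


# Hypothetical rankings of structure ids, most vulnerable first.
baseline_rank = ["s1", "s2", "s3", "s4", "s5", "s6"]
rules_rank = ["s2", "s1", "s5", "s3", "s7", "s4"]
toy_j = top_n_jaccard(baseline_rank, rules_rank, 4)
toy_shared = shared_from_jaccard(toy_j, 4)
```

Because the union grows as the overlap shrinks, the Jaccard value is always lower than the fraction of shared structures; the inverse helper converts between the two for equal‐size top‐N lists.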
Transferring Knowledge/Expertise to ML
Expert‐labeled data is one way: new ways needed
- Annotation task facilitated knowledge transfer from experts to machine learner via manual step
– From small set of carefully selected tickets, manually identified rules generalized well
– Too small and noisy a dataset to handle automatically given the precision requirements
- Use many untrained labelers, as in Snow et al., 2008, “Cheap and fast, but is it good?” (EMNLP)
- Have experts label features instead of instances, as in Druck et al., 2008, “Learning from labeled features using generalized expectation criteria” (SIGIR)