SLIDE 1

Reducing Noise in Labels and Features for a Real World Dataset: Application of NLP Corpus Annotation Methods

Rebecca J. Passonneau, Cynthia Rudin, Axinia Radeva, and Zhi An Liu
Center for Computational Learning Systems (CCLS), Columbia University
CICLING, March 6, 2009

SLIDE 2

Motivation: Secondary Electrical Grid

Structures at 2nd Ave & 83rd Street, Manhattan

  • Manholes
  • Service boxes

Serious event: manhole fire in the Village, April 2008.
A dense network of structures and cables provides power to NYC buildings.

SLIDE 3

Emergency Control System (ECS) Ticket

 1 MR. ROBERT TOBIA (718)555-5124 - SMOKING. COVER OFF.-RMKS:
 2 01/06/03 08:40 MDETHUILOT DISPATCHED BY 55988
 3 01/06/03 09:30 MDETHUILOT ARRIVED BY 55988
 4 01/06/02 09:55 THUILOT REPORTS NO SMOKE ON ARRIVAL. THERE IS
 5 A SHUNT ON LOCATION - SHUNT & SERVICE NOT EFFECTED.
 . . .
 8 REQUESTING FLUSH/ORDERED (#2836).
 9 ******* NO PARKING : TUES. & FRIDAY, 11:30AM - 1PM ****** RV
10 01/06/03 10:45 THUILOT REPORTS BUILDING 260 W.139 ST.
11 COMPLAINED OF LIGHT PROBLEMS. FOUND 1-PHASE DOWN - BRIDGED
12 @ 10:30 ( 2-PHASE SERVICE ) CONSUMER IS CONTENT.
 . . .
18 01/06/03 18:45 FERNANDEZ REPORTS THAT IN SB-521117 F/O254
19 W139 ST. HE CUT OUT A 3W2W COPPERED JT & REPLACED IT W/
20 A 4W NEO CRAB....BY USING 1 LEG OFF THE 7W FROM THE HE
21 WAS ABLE TO PUSH THE MISSING PHASE BACK TO 260, BRIDGE
22 REMOVED....@ THIS TIME FERNANDEZ REPORTS THERE ARE MORE
23 B/O'S & 2 MORE JTS TO C/O, WILL F/U W/ MORE INFO....TCP

SLIDE 4

Outline

  • NLP/IE versus real-world problem and data
  • Ranking problem: structure vulnerability
  • ECS ticket classification problem
    – Relation to labels on structures
    – Relation to feature representation of structures
  • Annotation task: can humans classify tickets?
  • Results
    – Overall noise reduction
    – Improvements to top of list
  • Importance of knowledge transfer paths for ML

SLIDE 5

Typical Impasse

  • A real-world “database” has free-text fields that could provide new relations in an RDB
  • Institutional owner gives the db to an NLP group for data mining – abysmal gap in domain knowledge


03/06/09 SMH S/W/C BROAD & MAIN FITZSIMMONS REPORTS THE TBL HOLE IS SB-00001 FOUND ON ....SMOKING LIGHTY

SLIDE 6

NIST 2007 ACE (Automatic Content Extraction)

Results in max/avg value score (roughly, accuracy)

  • Entity mentions (5 sites participating; 7 major entity types, e.g., geopolitical, facility, org):
    – Broadcast news: 65.9/52.7
    – Newswire: 58.1/44.0
    – Telephone: 49.2/35.5
    – Usenet: 44.0/31.4
  • Events (1 site; 8 major event types, e.g., business, meeting, conflict):
    – Broadcast news: 12.9
    – Newswire: 15.9
    – Telephone: 6.6
    – Usenet: 11.3

SLIDE 7

CCLS/Consolidated Edison Collaboration

  • Idea(lization):
    – Help reduce serious events in the secondary electrical grid
    – Use 10 years of Emergency Control System (ECS) trouble ticket data (plus other data sources)
      • A succession of automated/free-text entries in one ticket
      • A procedure for assigning a “trouble type” to each ticket
    – Rank vulnerability of structures to “serious events”
  • Reality:
    – Data dump of very noisy data
    – No operational definition of “serious”

SLIDE 8

Related Work

  • Devaney & Ram, 2005: case-based reasoning
    – 10,000 maintenance logs, machine X
    – Unsupervised text clustering, OWL/RDF domain model
  • Liddy et al., 2006: sublanguage analysis
    – ECS trouble tickets 1995–2005: 70K train, 7K test, 100 eval
    – Reclassification of MSE (misc) trouble type tickets into two trouble types, SMH and WL
  • Oza et al., in press: similar gap in domain knowledge for a complex domain
    – 800,000 reports from an aeronautics db
    – SVM and non-negative matrix factorization on BOW document representations for topical classification (similar to LSA)

SLIDE 9

Scope of Structure Ranking Problem

  • Number of structures in Manhattan: 51,912
  • ECS tickets for Manhattan:
    – Relevant trouble types (N=21): 61,730 tickets
    – Number of structures in tickets: 27,235 (44%)
  • Number of “serious” events per year depends on the definition:
    – Fires and explosions (MHX, MHF, MHO): ~150 (0.6% of structures in ECS)
    – Other events, e.g., smoking manholes: ~470 (1.8% of structures in ECS)

SLIDE 10

Learning Approach to Structure Ranking

  • Formulated as a supervised bipartite ranking problem
    – A real-valued score is assigned to each structure
    – Goal is to rank positively-labeled examples above negatively-labeled examples
  • Learning algorithm (see the sketch below)
    – Maximizes a weighted version of the AUC
    – Here we used SVM-perf (Joachims, T., 2005)
    – We have also used P-Norm Push (Rudin, C., 2008; a generalization of RankBoost)
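The slide names SVM-perf and the P-Norm Push; neither implementation is reproduced here. As a minimal sketch of the underlying objective only, the NumPy code below descends a pairwise hinge loss over positive/negative pairs, a standard convex surrogate for 1 − AUC. The pair weighting mentioned above, and all tuning, are omitted; this is not the authors' setup.

```python
import numpy as np

def train_bipartite_ranker(X, y, lr=0.01, epochs=200, batch=256, seed=0):
    """Learn w so that scores X @ w rank positives above negatives.

    Descends the pairwise hinge loss sum_{i+, j-} max(0, 1 - w.(x_i - x_j)),
    a convex surrogate for 1 - AUC. Illustrative only; not SVM-perf.
    """
    rng = np.random.default_rng(seed)
    pos, neg = X[y == 1], X[y == 0]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Stochastic step over a random sample of positive/negative pairs.
        i = rng.integers(len(pos), size=batch)
        j = rng.integers(len(neg), size=batch)
        diff = pos[i] - neg[j]
        viol = diff @ w < 1.0  # pairs mis-ranked or inside the margin
        if viol.any():
            w += lr * diff[viol].mean(axis=0)  # push violating pairs apart
    return w

def auc(scores, y):
    """Empirical AUC: fraction of positive/negative pairs ranked correctly."""
    d = scores[y == 1][:, None] - scores[y == 0][None, :]
    return (d > 0).mean() + 0.5 * (d == 0).mean()
```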

SLIDE 11

Event Classification: Labels and Features

Depends on defining “serious event”

  • Label structures: did si have a serious event in year Yj?
  • Identify a small number of explanatory features
    – Four ECS-based features affect the top of the list (see the sketch below):
      • Did si have a serious event recently (> Yj−3 & < Yj)?
      • How many recent tickets mention si?
      • Did si have a serious event in the past (> 1996 & < Yj)?
      • How many past tickets mention si?
    – One cable density feature affects the rest of the list
  • Train on 2005, test on 2006, evaluate on 2007
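A sketch of how the four ECS-based features could be computed for a structure si in prediction year Yj. The ticket schema here (dicts with keys "structure_id", "year", "serious") is a hypothetical stand-in; the slides do not show the actual data representation.

```python
def ecs_features(tickets, s, year):
    """Four ECS-derived features for structure s in prediction year `year`.

    `tickets`: list of dicts with hypothetical keys "structure_id", "year",
    and "serious" (a boolean produced by the event classification).
    """
    mine = [t for t in tickets if t["structure_id"] == s and t["year"] < year]
    recent = [t for t in mine if t["year"] > year - 3]  # window (Yj-3, Yj)
    past = [t for t in mine if t["year"] > 1996]        # window (1996, Yj)
    return {
        "recent_serious": any(t["serious"] for t in recent),
        "recent_ticket_count": len(recent),
        "past_serious": any(t["serious"] for t in past),
        "past_ticket_count": len(past),
    }
```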

SLIDE 12

Baseline Event Classification

  • Length constraint: at least 3 free-text lines
  • Not all tickets correspond to distinct events (referred tickets; no work performed; non-secondary)
  • ECS ticket trouble types (N=21); see the sketch below:
    – MHX/MHF/MHO: good indicator event is serious
    – SMH: moderate indicator event is serious
    – ACB: good indicator event is not serious
    – 16 other trouble types: generally not serious
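As far as the slides describe it, the baseline keys only on the length constraint and the trouble type. A minimal sketch, with the exact rule set assumed rather than quoted from the paper (the assignments below match the annotated-ticket table on slide 16):

```python
def baseline_label(trouble_type, n_free_text_lines):
    """Baseline event label from the trouble type alone (assumed rules)."""
    if n_free_text_lines < 3:                       # length constraint fails
        return "non-event"
    if trouble_type in {"MHX", "MHF", "MHO", "SMH"}:
        return "serious"                            # good/moderate indicators
    return "precursor"                              # ACB and 16 other types
```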

SLIDE 13

ECS Tickets

  • Enormous length variation: 1–522 lines
  • Varying proportion of free-text lines: 0–69%
  • Fragmentary and telegraphic language
  • Specialized terminology (sublanguage)
    – CRAB, C&R, TROUBLE HOLE, FLUSH
  • Intra-word line breaks: AFFECTE/ D
  • Misspellings inflate vocabulary size (see the sketch below)
    – Before normalization: ~57K unigram types
    – After normalization: ~22K unigram types
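The slides do not specify the normalization that shrank the vocabulary from ~57K to ~22K unigram types. One simple, assumed approach is to rejoin intra-word line breaks and map out-of-lexicon tokens to a close in-lexicon match; both heuristics below are illustrative stand-ins, not the paper's procedure.

```python
import difflib
import re
from collections import Counter

def unigram_types(ticket_texts, normalize=None):
    """Count unigram types, optionally applying a token normalizer."""
    counts = Counter()
    for text in ticket_texts:
        # Rejoin breaks like "AFFECTE/ D" -> "AFFECTED". Naive: it would
        # also join abbreviations such as "W/ A", so a real pass needs care.
        text = re.sub(r"(\w)/ +(\w)", r"\1\2", text)
        for tok in re.findall(r"[A-Z0-9&']+", text.upper()):
            counts[normalize(tok) if normalize else tok] += 1
    return counts

def make_normalizer(lexicon):
    """Map unknown tokens to the closest lexicon entry (assumed heuristic)."""
    lex_set = set(lexicon)
    lex_list = sorted(lex_set)
    def normalize(tok):
        if tok in lex_set:
            return tok
        match = difflib.get_close_matches(tok, lex_list, n=1, cutoff=0.8)
        return match[0] if match else tok
    return normalize
```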

SLIDE 14

Human Annotation Task

To acquire an extensional definition of “serious”

  • Data: 171 ECS tickets; text only, no access to trouble type, etc.
  • Annotators: 2 domain experts
  • Task: sort tickets into one of three classes:
    1. Serious event
    2. Potential precursor event
    3. Exclude as irrelevant (e.g., not secondary; not an event)

SLIDE 15

Experts versus Baseline

  • Kappa agreement coefficient results (see the sketch below)
    – Ranges from 1 (perfect agreement) through 0 (random) to −1 (perfect disagreement)
    – Experts with baseline (3-way kappa): 0.25
    – Experts with each other: 0.49
  • Trouble type does not correspond to expert judgment
  • Experts have moderate agreement – the task is subjective
  • Difficult prediction problem
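For the two-expert figure, Cohen's kappa applies; the 3-way figure needs a multi-coder generalization (e.g., Fleiss' kappa), not shown here. A minimal two-coder sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders: (p_o - p_e) / (1 - p_e).

    p_o is observed agreement; p_e is the agreement expected by chance
    from each coder's label marginals.
    """
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Identical labelings give kappa = 1.0; chance-level agreement gives ~0.
```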

SLIDE 16

Expert vs. Baseline, Annotated Tickets

Ticket Type   Non-Event            Precursor            Serious              Experts
              Baseline   Experts   Baseline   Experts   Baseline   Experts   Disagree
ACB                  –         –         21        16          –         3          2
MHX/F/O              –         2          1         –          9         7          –
SMH                  –         3          –         7         27        15          2
Other                8        17        106        58          –         4         35
Totals               8        22        128        81         36        29         39

SLIDE 17

Expert vs. Baseline, All Tickets

Ticket Type   Precursor             Serious
              Baseline    Rules     Baseline    Rules
ACB              6,171    5,364          192      162
MHX/F/O              –       25        1,785    1,481
SMH                  –    1,105        4,906    3,397
Other           25,776   16,978           81       75
Totals          31,947   23,472        6,964    5,115

  • Baseline Precursor + Serious = 38,911
  • Rules Precursor + Serious = 28,587

SLIDE 18

Results: AUC scores

  • Best improvement on Test Set (2006)
  • The AUC score obscures what changed:
    – Many large demotions of structures that are not so vulnerable (e.g., 759/52K to 3105/52K)
    – Side-effect: small promotions of vulnerable structures (e.g., 69/52K to 45/52K)

              TRAIN    TEST
  Baseline    67.63    65.01
  Rule-based  68.29    67.55

SLIDE 19

Changes to Top of Ranked List

  • The Jaccard coefficient measures the “similarity” of two sets (range 0 to 1, with J = 1 when A = B):
    J(A, B) = |A ∩ B| / |A ∪ B|
  • For N = 5 to 1000, compare the top N structures of the ranked list from the baseline classification of events versus the rule-based classification (see the sketch below)
  • E.g., every fourth structure in the top 500 of the Rules ranking is not in the top 500 of the Baseline ranking


N         5      10     15     20     100    500    1000
Jaccard   0.25   0.33   0.50   0.60   0.72   0.75   0.81
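Computing the table above is direct; a small sketch, assuming each ranking is a list of structure ids sorted from most to least vulnerable:

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topn_jaccard(baseline_ranking, rules_ranking,
                 ns=(5, 10, 15, 20, 100, 500, 1000)):
    """Jaccard similarity of the top-N structures under the two rankings."""
    return {n: jaccard(baseline_ranking[:n], rules_ranking[:n]) for n in ns}
```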

SLIDE 20

Transferring Knowledge/Expertise to ML

Expert-labeled data is one way; new ways are needed

  • The annotation task facilitated knowledge transfer from experts to the machine learner via a manual step
    – From a small set of carefully selected tickets, manually identified rules generalized well
    – Too small and noisy a dataset to handle automatically, given the precision requirements
  • Use many untrained labelers, as in Snow et al., 2008, “Cheap and fast, but is it good?” (EMNLP)
  • Have experts label features instead of instances, as in Druck et al., 2008, “Learning from labeled features using generalized expectation criteria” (SIGIR)
