Automation in Information Extraction and Integration - Sunita Sarawagi - PowerPoint PPT Presentation



SLIDE 1

Automation in Information Extraction and Integration

Sunita Sarawagi
IIT Bombay
sunita@it.iitb.ac.in

SLIDE 2

Data integration

The process of integrating data from multiple, heterogeneous, loosely structured information sources into a single well-defined structured database

A tedious exercise involving

schema mapping,

structure/information extraction,

duplicate elimination,

missing value substitution,

error detection,

standardization

SLIDE 3

Application scenarios

Large enterprises:

Phenomenal amount of time and resources spent on data cleaning

Example: Segmenting and merging name-address lists during data warehousing

Web:

Creating structured databases from distributed unstructured web pages

Citation databases: Citeseer and Cora

Other scientific applications

Bio-informatics

Extracting gene relations from medical text (KDD cup 2002)

SLIDE 4

Case study: CiteSeer

Paper location:

Extract information from specific publisher web sites

Extract ps/pdf files by searching the web with terms like "publications"

Information extracted from papers:

Title, author from header

Extract citation entries

Bibliography section

Separate into individual records

Segment into title, author, date, page numbers etc.

Duplicate elimination across several citations to a paper (de-duplication)

SLIDE 5

Recent trends

Classical problem that has bothered researchers and practitioners for decades

Several existing commercial solutions for enterprise data integration [mid-80s]

Manual, domain-specific, data-driven script-based tools

Example: Name/address cleaning

Require high expertise to code and maintain

Desire to view the "Web as a database" got machine learning researchers working on cleaning

SLIDE 6

Scope of the tutorial

Novel application of data mining and machine learning techniques to automate data cleaning operations.

Distill recent research results from various areas:

Machine learning, data mining, information retrieval, natural language processing, web wrapper extraction

Focus on two operations

Information Extraction

Duplicate elimination

SLIDE 7

Outline

Information Extraction

Rule-based methods

Probabilistic methods

Duplicate elimination

Reducing the need for training data:

Active learning

Bootstrapping from structured databases

Semi-supervised learning

Summary and research problems

SLIDE 8

Information Extraction (IE)

The IE task: Given,

E: a set of structured elements (target schema)

S: an unstructured source S

extract all instances of E from S

Varying levels of difficulty depending on input and kind of extracted patterns

Text segmentation: Extraction by segmenting text

HTML wrapper: Extraction from formatted text

Classical IE: Extraction from free-format text

SLIDE 9

IE by text segmentation

Source: concatenation of structured elements with limited reordering and some missing fields

Example: Addresses, bib records

House number  Building  Road  City  Zip

[Figure: an example address and an example citation string, each segmented into the labeled fields]

Author  Year  Title  Journal  Volume  Page  State

SLIDE 10
SLIDE 11

HTML Wrappers

Record level:

Extracting elements of a single list of homogeneous records from a page

Discovering record boundaries by detecting regularity

Page-level:

Extracting elements of multiple kinds of records

Example: name, courses, publications from home pages

Site-level:

Example: populating a university database from pages of a university web site
SLIDE 12

IE from free-format text

Examples:

Gene interactions from medical articles

Part number, problem description from emails in help centers

Structured records describing an accident from insurance claims

Merging companies, their roles and amounts from news articles

Focus of NL researchers [Message Understanding Conferences (MUC)]

Requires deep linguistic and semantic analysis

We will discuss: Shallow IE based on syntactic cues

SLIDE 13

IE via machine learning

Given several examples showing the position of structured elements in text, train a model to identify them in unseen text.

Issues:

What are the input features?

Build per-element classifiers or a single joint classifier?

Which type of classifier to use?

How much training data is required?

Can one tell when the extractor is likely wrong?
SLIDE 14

Input features

Content of the element

Specific keywords like street, zip, vol, pp

Properties of words like capitalization, parts of speech, number

Formatting information: e.g., font, size

Inter-element sequencing

Intra-element sequencing

Element length

External database

Dictionary words

Semantic relationship between words

Richer structure: trees, tables

SLIDE 15

Structure of IE models
SLIDE 16

Rule-based IE models

Stalker (Muslea et al 2001)

Model type: Rules with conjuncts and disjuncts

For each element, two rules: start rule R1 and end rule R2

Features:

html tags primarily

punctuation

predefined text features: isNumber, isCapitalized

Relationship between elements:

Independent within the same level of hierarchy

Training method: basic sequential rule covering algorithm

SLIDE 17

Example

Author:

R1: skipTo(<li>)

R2: skipTo(()

Title:

R1: skipTo(<B>) OR skipTo(")

disjunction

R2: skipTo(</B>) OR skipTo(")

Volume:

R1: skipTo(<B>) skipUntil(Number)

conjunction

R2: skipTo(</B>)
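A skipTo rule chain is easy to emulate. Below is a minimal, hypothetical sketch (not Stalker's actual implementation) of applying a start rule and an end landmark to pull fields out of an invented HTML fragment:

```python
def skip_to(text, landmarks, start=0):
    """Advance past each landmark in turn; return the position just
    after the last landmark, or -1 if any landmark is missing."""
    pos = start
    for lm in landmarks:
        hit = text.find(lm, pos)
        if hit < 0:
            return -1
        pos = hit + len(lm)
    return pos

def extract_field(text, start_rule, end_landmark):
    """Start rule: a skipTo chain. End rule: position of the end landmark."""
    s = skip_to(text, start_rule)
    if s < 0:
        return None
    e = text.find(end_landmark, s)
    return text[s:e] if e >= 0 else None

# Invented example record in the style of the slide's bib entries
html = '<li>V. Borkar (<B>Automatic text segmentation</B>, <B>27</B>)</li>'
title = extract_field(html, ['<B>'], '</B>')   # R1: skipTo(<B>), R2: skipTo(</B>)
author = extract_field(html, ['<li>'], ' (')   # R1: skipTo(<li>), R2: skipTo(()
```

The disjunction and conjunction on the slide correspond to trying alternative landmark chains and chaining several landmarks in one `skip_to` call, respectively.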

SLIDE 18

Limitations of rule-based approach

As in WIEN, Stalker

No ordering dependency between elements

Non-overlap of elements not exploited

Position information ignored

Content largely ignored

Heuristics needed to order rule firing

SLIDE 19

Finite state machines

Model ordering relationships between elements (SoftMealy, Hsu 1998)

Node: elements to be extracted

Transition edge: rules marking the start of an element.

Rules are similar to those in STALKER.

When more than one rule fires, apply the more specific rule

All allowable permutations must appear in training data

SLIDE 20

IE with Hidden Markov Models

Probabilistic models for IE

[Figure: example HMM with transition probabilities (0.9, 0.5, 0.5, 0.8, 0.2, 0.1), per-state emission probability tables (e.g. A/B/C with 0.6/0.3/0.1, X/B/Z with 0.4/0.2/0.4, Y/A/C with 0.1/0.1/0.8), and a length distribution (dddd/dd with 0.8/0.2)]

SLIDE 21

HMM Structure

  • Naïve model: One state per element
  • Nested model: Each element is another HMM

SLIDE 22

HMM Dictionary

For each word (= feature), associate the probability of emitting that word

Multinomial model

More advanced models with overlapping features of a word, for example: part of speech, capitalized or not, type (number, letter, word, etc.)

Maximum entropy models (McCallum 2000)

SLIDE 23

Learning model parameters

When training data defines a unique path through the HMM:

Transition probabilities

Probability of transitioning from state i to state j =
(number of transitions from i to j) / (total transitions out of state i)

Emission probabilities

Probability of emitting symbol k from state i =
(number of times k generated from i) / (number of transitions from i)

When training data defines multiple paths:

A more general EM-like algorithm (Baum-Welch)
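In the fully observed case, the counts above reduce to a few lines of code. A small illustrative sketch (the field names and toy address data are invented):

```python
from collections import Counter

def train_hmm(tagged_seqs):
    """MLE estimates from sequences of (word, state) pairs in which the
    state path is fully observed, as on the slide."""
    trans, emit = Counter(), Counter()
    out_of, visits = Counter(), Counter()
    for seq in tagged_seqs:
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[(s1, s2)] += 1      # transitions from i to j
            out_of[s1] += 1           # total transitions out of i
        for w, s in seq:
            emit[(s, w)] += 1         # times symbol k generated from i
            visits[s] += 1
    p_trans = {k: c / out_of[k[0]] for k, c in trans.items()}
    p_emit = {k: c / visits[k[0]] for k, c in emit.items()}
    return p_trans, p_emit

data = [[('52', 'House'), ('Goregaon', 'Road'), ('Mumbai', 'City')],
        [('7', 'House'), ('Mahim', 'City')]]
p_trans, p_emit = train_hmm(data)
# P(Road | House) = 1 of the 2 transitions out of House = 0.5
```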

SLIDE 24

Using the HMM to segment

Find the highest probability path through the HMM.

Viterbi: quadratic dynamic programming algorithm
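The Viterbi recursion itself is short. A self-contained sketch (the states and probabilities are toy values invented for illustration; unseen symbols get a small floor probability rather than zero):

```python
def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-9):
    """Highest-probability state path through the HMM for `obs`.
    O(len(obs) * |states|^2) dynamic program."""
    # delta[s] = (best prob of any path ending in s, that path)
    delta = {s: (start_p.get(s, floor) * emit_p[s].get(obs[0], floor), [s])
             for s in states}
    for sym in obs[1:]:
        new = {}
        for s in states:
            prob, prev = max(
                (delta[p][0] * trans_p[p].get(s, floor), p) for p in states)
            new[s] = (prob * emit_p[s].get(sym, floor), delta[prev][1] + [s])
        delta = new
    return max(delta.values())[1]

states = ['House', 'Road', 'City']
start_p = {'House': 0.8, 'Road': 0.1, 'City': 0.1}
trans_p = {'House': {'Road': 0.9, 'City': 0.1},
           'Road': {'City': 0.9, 'Road': 0.1},
           'City': {'City': 1.0}}
emit_p = {'House': {'52': 0.9},
          'Road': {'Goregaon': 0.5, 'Road': 0.4},
          'City': {'Mumbai': 0.8}}
path = viterbi(['52', 'Goregaon', 'Road', 'Mumbai'],
               states, start_p, trans_p, emit_p)
# path is ['House', 'Road', 'Road', 'City']
```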

SLIDE 25

Comparative Evaluation

Naïve model – one state per element in the HMM

Independent HMM – one HMM per element

Rule learning method – Rapier

Nested model – each state in the Naïve model replaced by a HMM

SLIDE 26

Results: Comparative Evaluation

The Nested model does best in all three cases

(from Borkar 2001)

Dataset                  Elements  Instances
US Addresses             6         740
Company Addresses        6         769
IITB student Addresses   17        2388

SLIDE 27

HMM approach: summary

Inter-element sequencing → outer HMM transitions

Intra-element sequencing → inner HMM

Element length → multi-state inner HMM

Characteristic words → dictionary

Non-overlapping tags → global optimization

SLIDE 28

Information Extraction: summary

Rule-based:

And/or combination with heuristics to control firing

Brittle to variations in data

Require less training data; wrappers reported to learn with < 10 examples

Used in HTML wrappers

Probabilistic:

Joint probability distribution, more elegant

Might get hard in general

Can handle variations

Used for text segmentation and NE extraction

Feature engineering is key: have to model how to combine them without undue complexity

SLIDE 29

Outline

Information Extraction

Rule-based methods

Probabilistic methods

Duplicate elimination

Reducing the need for training data:

Active learning

Bootstrapping from structured databases

Semi-supervised learning

Summary and research problems

SLIDE 30

The de-duplication problem

Given a list of semi-structured records, find all records that refer to the same entity

Example applications:

Data warehousing: merging name/address lists

Entity: a) Person b) Household

Automatic citation databases (Citeseer): references

Entity: paper

SLIDE 31

Challenges

Errors and inconsistencies in data

Spotting duplicates might be hard as they may be spread far apart:

may not be group-able using obvious keys

Domain-specific

Existing manual approaches require re-tuning with every new domain

SLIDE 32

Example: citations from Citeseer

Our prior:

duplicate when author, title, booktitle and year match

Duplicates with variant author lists:

  • L. Breiman, L. Friedman, and P. Stone, (1984).

  • Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.

Duplicates with variant venue strings:

  • In VLDB-94

  • In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

SLIDE 33

  • Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.

  • P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983

Non-duplicates that look deceptively similar:

  • H. Balakrishnan, S. Seshan, and R. H. Katz., Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks, ACM Wireless Networks, 1(4), December 1995.

  • H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz, "Improving TCP/IP Performance over Wireless Networks," Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.

SLIDE 34

Learning the de-duplication function

Given examples of duplicate and non-duplicate pairs, learn to predict if a pair is duplicate or not.

Input features:

Various kinds of similarity functions between attributes

Edit distance, Soundex, N-grams on text attributes

Absolute difference on numeric attributes

Some attribute similarity functions are incompletely specified

Example: weighted distances with parameterized weights

Need to learn the weights first.
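As a concrete instance of one such textual similarity feature, here is a small sketch of an n-gram (trigram) Jaccard similarity between two attribute values (illustrative only; a real system combines several such functions into a feature vector per record pair):

```python
def trigrams(s):
    """Set of character trigrams of a lower-cased string."""
    s = s.lower()
    return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}

def jaccard(a, b):
    """Trigram Jaccard similarity between two attribute strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Two variant title strings score close to 1, unrelated strings near 0
sim = jaccard('Mental Models', 'Mental models.')
```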

SLIDE 35

The learning approach

[Figure: record pairs mapped to attribute-similarity vectors; labeled duplicate/non-duplicate vectors train a classifier that is then applied to all candidate pairs]
SLIDE 36

Learning attribute similarity functions

String edit distance with parameters:

C(x,y): cost of replacing x with y

d: cost of deleting a character

i: cost of inserting a character

Learning parameters from examples showing matchings

Transformed examples: a sequence of Match, Insert, Delete operations

Train a stochastic model on the sequence

[Bilenko & Mooney, 2002] [Ristad & Yianilos, 1998]
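Once C(x,y), d and i are fixed (or learned), the distance itself is a standard dynamic program. A sketch with hand-picked costs standing in for learned parameters:

```python
def weighted_edit_distance(s, t, sub_cost, del_cost=1.0, ins_cost=1.0):
    """Edit distance with parameterized costs: sub_cost(x, y) plays the
    role of C(x, y); del_cost and ins_cost are d and i."""
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for r in range(1, m + 1):
        D[r][0] = D[r - 1][0] + del_cost
    for c in range(1, n + 1):
        D[0][c] = D[0][c - 1] + ins_cost
    for r in range(1, m + 1):
        for c in range(1, n + 1):
            D[r][c] = min(D[r - 1][c] + del_cost,               # delete s[r-1]
                          D[r][c - 1] + ins_cost,               # insert t[c-1]
                          D[r - 1][c - 1] + sub_cost(s[r - 1], t[c - 1]))
    return D[m][n]

# Hypothetical learned substitution cost: free match, 1.5 per mismatch
C = lambda x, y: 0.0 if x == y else 1.5
dist = weighted_edit_distance('VLDB', 'VLDB-94', C)   # three insertions
```

Learning replaces the hand-picked `C`, `del_cost` and `ins_cost` with values fit to the Match/Insert/Delete sequences derived from example matchings.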

SLIDE 37

Summary: De-duplication

Previous work concentrated on designing good static, domain-specific string similarity functions

Recent spate of work on dynamic learning-based approaches appears promising

Two levels:

Attribute-level: Tuning parameters of existing string similarity functions to match examples

Record-level: Classifiers like SVMs and decision trees used to combine the similarity along various attributes, saving the effort of tuning thresholds and conditions

SLIDE 38

Outline

Information Extraction

Rule-based methods

Probabilistic methods

Duplicate elimination

Reducing the need for training data:

Active learning

Bootstrapping from structured databases

Semi-supervised learning

Summary and research problems

SLIDE 39

Active learning

Ordinary learner:

learns from a fixed set of labeled training data

Active learner:

Selects unlabeled examples from a large pool and interactively seeks their labels from a user

Careful selection of examples could lead to faster convergence

Useful when unlabeled examples are abundant and labeling them requires human effort

SLIDE 40

Example: active learning

[Figure: with only a few labeled examples, many separators are consistent with the data; the most informative instance to label next is the one with highest prediction uncertainty]
SLIDE 41

Measuring prediction certainty

Classifier-specific methods:

Support vector machines: distance from separator

Naïve Bayes classifier: posterior probability of winning class

Decision tree classifier: weighted sum of distance from different boundaries, error of the leaf, depth of the leaf, etc.

Committee-based approach (Seung, Opper, and Sompolinsky 1992):

Disagreements amongst members of a committee

Most successfully used method

SLIDE 42

Forming a classifier committee

Randomly perturb learnt parameters

Probabilistic classifiers:

Sample from the posterior distribution on parameters given training data.

Example: binomial parameter p has a beta distribution with mean p

Discriminative classifiers:

Random boundary in the uncertainty region

SLIDE 43

Committee-based algorithm

Train k classifiers C1, C2, .. Ck on training data

For each unlabeled instance x:

Find predictions y1, .., yk from the k classifiers

Compute uncertainty U(x) as the entropy of the above y-s

Pick instances with highest uncertainty

U(x) = - Σ_c (n_c / k) log (n_c / k), where n_c is the number of committee members predicting class c
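A minimal sketch of this selection loop (the committee members here are just callables; any k trained classifiers would do, and the thresholding "classifiers" are invented for illustration):

```python
from collections import Counter
from math import log

def uncertainty(votes):
    """Entropy of the committee's predictions y1..yk for one instance."""
    k = len(votes)
    return -sum((c / k) * log(c / k) for c in Counter(votes).values())

def most_uncertain(pool, committee):
    """Pick the unlabeled instance the committee disagrees on most."""
    return max(pool, key=lambda x: uncertainty([clf(x) for clf in committee]))

# Toy committee: three "classifiers" thresholding a similarity score
committee = [lambda x, t=t: x > t for t in (0.3, 0.5, 0.7)]
pool = [0.1, 0.6, 0.9]
pick = most_uncertain(pool, committee)   # 0.6 splits the committee 2-1
```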
SLIDE 44

Active learning in deduplication with decision trees

Forming a committee of trees by random perturbation

Selecting the split attribute:

Normally: attribute with lowest entropy

Perturbed: random attribute within close range of the lowest

Selecting a split point:

Normally: midpoint of the range with lowest entropy

Perturbed: a random point anywhere in the range with lowest entropy
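The perturbed split-attribute choice can be sketched as follows (the tolerance value and the attribute/entropy table are invented for illustration):

```python
import random

def perturbed_split_attribute(entropy_by_attr, tol=0.05, rng=None):
    """Pick a random attribute whose entropy is within `tol` of the
    best (lowest) entropy, instead of always taking the best one."""
    rng = rng or random.Random()
    best = min(entropy_by_attr.values())
    candidates = [a for a, e in entropy_by_attr.items() if e <= best + tol]
    return rng.choice(candidates)

# Hypothetical entropies of candidate split attributes for a record pair
entropies = {'title_sim': 0.21, 'author_sim': 0.24, 'year_diff': 0.80}
attr = perturbed_split_attribute(entropies)   # 'title_sim' or 'author_sim'
```

Each tree in the committee makes its own random choices, so the trees differ exactly where the data leaves the split decision uncertain.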

SLIDE 45

Speed of convergence

With 100 pairs:

Active learning: 97% (peak)

Random: only 30%

(from Sarawagi 2002)
SLIDE 46

Active learning in IE with HMMs

Forming a committee of HMMs by random perturbation

Emission and transition probabilities are independent multinomial distributions.

Posterior distribution for multinomial parameters:

Dirichlet, with mean estimated using maximum likelihood

Results on part-of-speech tagging (Dagan 1999):

92.6% accuracy using active learning with 20,000 instances as against 100,000 random

SLIDE 47

Active learning in rule-based IE

Stalker (Muslea et al 2000)

Learn two classifiers:

one based on a forward traversal of the document,

second based on a backward traversal

Select for labeling those records that get conflicting predictions from the two

Performance: 85% accuracy without active learning, 94% with active learning

SLIDE 48

Bootstrapping from structured databases

Given a database of structured elements

Example: collection of structured bibtex entries

Segment to best match with the database

HMM:

Initialize the dictionary using the database

Learn transitions using Baum-Welch on unlabeled data

Assigning probabilities is hard

Still open to investigation

Rule-based IE: (Snowball, Agichtein 2000)

SLIDE 49

Semi-supervised learning

Can unlabeled data improve classifier accuracy?

Possibly, for probabilistic classifiers like HMMs

Use labeled data to train an initial model

Use Baum-Welch on unlabeled data to refine the model to maximize data likelihood

Unfortunately, no gain in accuracy reported (Seymore 1999)

Needs further investigation

SLIDE 50

Unsupervised learning in duplicate elimination

[Tailor 2002]

Cluster the similarity vectors of record pairs into three groups

Label the clusters based on distance to the ideal duplicate and non-duplicate vectors.

(optional) Train a classifier on this labeled data

Results: 79.8% accuracy on Walmart's items table.

SLIDE 51

Summary

Information Extraction

Various levels of complexity depending on input

Segmentation, HTML wrappers, free-format

Model-type:

Rule-based and probabilistic (HMM)

Independent or simultaneous

Several research prototypes of each type

Duplicate elimination

Challenging because of variations in data format

Learning applied to design the deduplication function

SLIDE 52

Manual vs learning approach

Manual:

Inspect patterns

Code scripts

Requires high-skill programmer

Learning:

Label examples

Choose & train model

Low-skill, cheaper labor for most part

Feature design and model selection require very high skill

SLIDE 53

Summary

Reducing need for labeled data

Active learning

Various methods proposed

Committee-based sampling most popular

Applications with:

HMMs for IE

Decision trees for deduplication

SLIDE 54

Topics of further research

Information Extraction:

Exploiting higher-level structures in input data, e.g. trees, tables

Integrated learning in the presence of a large structured DB, small labeled data and large unlabeled data

Efficiency in the presence of a large database/dictionary

Wrappers at the web-site level involving several structured tables

Duplicate elimination:

Multi-table de-duplication

Integrating semi-supervised and active learning

Efficient active learning without requiring materialization of all possible pairs

Efficient evaluation of a de-duplication function

SLIDE 55

Topics of further research

Combining machine learning of extraction patterns with human-generated scripts

Updating models as data arrives: continuous learning

Going from research prototypes to robust products and toolkits

SLIDE 56

References

General

H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. VLDB, 2001.

S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.

A. McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore. Cora: Computer science research paper search engine. http://cora.whizbang.com/, 2000.

IEEE Data Engineering special issue on Data Cleaning. http://www.research.microsoft.com/research/db/debull/A00dec/issue.htm, December 2000.

M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 1998.

Information extraction

E. Agichtein, L. Gravano. "Snowball: Extracting relations from large plain-text collections". ACM Intl. Conf. on Digital Libraries, 2000.

D. M. Bikel, S. Miller, R. Schwartz and R. Weischedel. "Nymble: a high-performance learning name-finder". ANLP 1997.

Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD 2001.

Mary Elaine Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI 1999.

D. Freitag and A. McCallum. Information Extraction with HMM Structures Learned by Stochastic Optimization. AAAI 2000.

A. McCallum, D. Freitag and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML 2000.

SLIDE 57

References

K. Seymore, A. McCallum, R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. AAAI Workshop on Machine Learning for Information Extraction, 1999.

S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 1999.

Wrappers

C. Y. Chung, M. Gertz, and N. Sundaresan. Reverse engineering for web data: From visual to semantic structures. ICDE 2002.

William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in html documents. WWW 2002.

David W. Embley, Y. S. Jiang, and Yiu-Kai Ng. Record-boundary discovery in web documents. SIGMOD 1999.

C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems Special Issue on Semistructured Data, 23(8), 1998.

N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. IJCAI, 1997.

L. Liu, C. Pu, and W. Han. XWrap: An XML-enabled wrapper construction system for web information sources. ICDE, 2000.

Ion Muslea, Steven Minton and Craig A. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 2001.

Jussi Myllymaki. Effective web data extraction with standard XML technologies. WWW, 2001.

SLIDE 58

References

Duplicate elimination

A. Z. Broder, S. C. Glassman, M. S. Manasse, Geoffrey Zweig. "Syntactic Clustering of the Web". WWW 1997.

M. G. Elfeky, V. S. Verykios, A. K. Elmagarmid. "Tailor: A record linkage toolkit". ICDE 2002.

S. Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD 2002.

W. E. Winkler. Matching and record linkage. In B. G. C. et al, editor, Business Survey Methods, pages 355-384. New York: J. Wiley, 1995.

Active and semi-supervised learning

Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. J. of Artificial Intelligence Research, 11:335-360, 1999.

Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.

Ion Muslea, Steve Minton, and Craig Knoblock. "Selective sampling with redundant views". AAAI, 2000.

H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.

T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.