Automation in Information Extraction and Integration
Sunita Sarawagi, IIT Bombay
sunita@it.iitb.ac.in
Data integration
- The process of integrating data from multiple, heterogeneous, loosely structured information sources into a single well-defined structured database
- A tedious exercise involving:
  - schema mapping
  - structure/information extraction
  - duplicate elimination
  - missing value substitution
  - error detection
  - standardization
Application scenarios
- Large enterprises:
  - Phenomenal amounts of time and resources spent on data cleaning
  - Example: segmenting and merging name-address lists during data warehousing
- Web:
  - Creating structured databases from distributed, unstructured web pages
    - Citation databases: Citeseer and Cora
- Other scientific applications:
  - Bio-informatics
    - Extracting gene relations from medical text (KDD Cup 2002)
Case study: CiteSeer
- Paper location:
  - Extract information from specific publisher websites
  - Extract ps/pdf files by searching the web with terms like "publications"
- Information extracted from papers:
  - Title and author from the header
  - Citation entries:
    - Find the bibliography section
    - Separate it into individual records
    - Segment each record into title, author, date, page numbers, etc.
- Duplicate elimination across several citations to a paper (de-duplication)
Recent trends
- A classical problem that has bothered researchers and practitioners for decades
- Several existing commercial solutions for enterprise data integration [mid-80s]
  - Manual, domain-specific, data-driven, script-based tools
  - Example: name/address cleaning
  - Require high expertise to code and maintain
- The desire to view the "Web as a database" got machine learning researchers working on cleaning
Scope of the tutorial
- Novel application of data mining and machine learning techniques to automate data cleaning operations
- Distill recent research results from various areas:
  - Machine learning, data mining, information retrieval, natural language processing, web wrapper extraction
- Focus on two operations:
  - Information Extraction
  - Duplicate elimination
Outline
- Information Extraction
  - Rule-based methods
  - Probabilistic methods
- Duplicate elimination
- Reducing the need for training data:
  - Active learning
  - Bootstrapping from structured databases
  - Semi-supervised learning
- Summary and research problems
Information Extraction (IE)
The IE task: given
- E: a set of structured elements (target schema)
- S: an unstructured source
extract all instances of E from S.
- Varying levels of difficulty depending on the input and the kind of extracted patterns:
  - Text segmentation: extraction by segmenting text
  - HTML wrappers: extraction from formatted text
  - Classical IE: extraction from free-format text
IE by text segmentation
- Source: a concatenation of structured elements, with limited reordering and some missing fields
  - Example: addresses, bibliographic records
- Example address fields: House number | Building | Road | City | Zip
- Example citation fields: Author | Year | Title | Journal | Volume | Page
HTML wrappers
- Record level:
  - Extracting elements of a single list of homogeneous records from a page
  - Discovering record boundaries by detecting regularity
- Page level:
  - Extracting elements of multiple kinds of records
    - Example: name, courses, publications from home pages
- Site level:
  - Example: populating a university database from the pages of a university website
IE from free-format text
- Examples:
  - Gene interactions from medical articles
  - Part number and problem description from emails in help centers
  - Structured records describing an accident from insurance claims
  - Merging companies, their roles and amounts from news articles
- The focus of NL researchers [Message Understanding Conferences (MUC)]
- Requires deep linguistic and semantic analysis
- We will discuss: shallow IE based on syntactic cues
IE via machine learning
- Given several examples showing the positions of structured elements in text, train a model to identify them in unseen text
- Issues:
  - What are the input features?
  - Build per-element classifiers or a single joint classifier?
  - Which type of classifier to use?
  - How much training data is required?
  - Can one tell when the extractor is likely wrong?
Input features
- Content of the element
  - Specific keywords like street, zip, vol, pp
  - Properties of words: capitalization, part of speech, is it a number?
- Formatting information: e.g., font, size
- Inter-element sequencing
- Intra-element sequencing
- Element length
- External database
  - Dictionary words
  - Semantic relationships between words
- Richer structure: trees, tables
Structure of IE models

Rule-based IE models
Stalker (Muslea et al. 2001)
- Model type: rules with conjuncts and disjuncts
  - For each element, two rules: a start rule R1 and an end rule R2
- Features:
  - HTML tags primarily
  - Punctuation
  - Predefined text features: isNumber, isCapitalized
- Relationship between elements:
  - Independent within the same level of the hierarchy
- Training method: basic sequential rule-covering algorithm
Example
- Author:
  - R1: skipTo(<li>)
  - R2: skipTo(()
- Title:
  - R1: skipTo(<B>) OR skipTo(")   (disjunction)
  - R2: skipTo(</B>) OR skipTo(")
- Volume:
  - R1: skipTo(<B>) skipUntil(Number)   (conjunction)
  - R2: skipTo(</B>)
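The landmark rules above can be sketched as a tiny interpreter. This is a minimal illustration, not Stalker's actual implementation; the names `skip_to` and `extract` are made up for this sketch.

```python
# Minimal sketch of applying Stalker-style start/end landmark rules.
# skip_to scans forward from `pos` for a landmark and returns the
# position just past it (None if the landmark never occurs).

def skip_to(text, pos, landmark):
    i = text.find(landmark, pos)
    return None if i == -1 else i + len(landmark)

def extract(text, start_landmarks, end_landmark):
    """Apply a conjunction of start rules in sequence, then cut at the end landmark."""
    pos = 0
    for lm in start_landmarks:          # conjunction: landmarks applied one after another
        pos = skip_to(text, pos, lm)
        if pos is None:
            return None
    end = text.find(end_landmark, pos)
    return None if end == -1 else text[pos:end]

# Title rule from the slide: R1 = skipTo(<B>), R2 = skipTo(</B>)
html = '<li> P. Smith <B>Learning to Extract</B> <B>34</B>'
print(extract(html, ['<B>'], '</B>'))   # Learning to Extract
```

A disjunction of rules would simply try each alternative landmark list in turn and keep the first that succeeds.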
Limitations of the rule-based approach
- As in WIEN and Stalker:
  - No ordering dependency between elements
  - Non-overlap of elements not exploited
  - Position information ignored
  - Content largely ignored
  - Heuristics needed to order rule firing
Finite state machines
- Model the ordering relationships between elements (Softmealy, Hsu 1998)
  - Nodes: elements to be extracted
  - Transition edges: rules marking the start of an element
    - The rules are similar to those in Stalker
- When more than one rule fires, apply the more specific rule
- All allowable permutations must appear in the training data
IE with Hidden Markov Models
- Probabilistic models for IE
- (Figure: an example HMM over address elements, with transition probabilities such as 0.9/0.5/0.1 on the edges, a per-state word-emission probability table, and state-length probabilities.)
HMM structure
- Naive model: one state per element
- Nested model: each element is itself another HMM
HMM dictionary
- For each word (= feature), associate the probability of emitting that word
  - Multinomial model
- More advanced models use overlapping features of a word, for example:
  - part of speech
  - capitalized or not
  - type: number, letter, word, etc.
- Maximum entropy models (McCallum 2000)
Learning model parameters
- When the training data defines a unique path through the HMM:
  - Transition probabilities:
    P(transition from state i to state j) = (number of transitions from i to j) / (total transitions out of state i)
  - Emission probabilities:
    P(emitting symbol k from state i) = (number of times k is generated at i) / (number of visits to i)
- When the training data defines multiple paths:
  - A more general EM-like algorithm (Baum-Welch)
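The unique-path case above reduces to counting. A minimal sketch, with made-up state and word names for illustration:

```python
from collections import Counter

# Maximum-likelihood HMM parameter estimation when each training
# sequence defines a unique state path (the simple case on this slide).
# A training sequence is a list of (word, state) pairs.

def train_hmm(tagged_sequences):
    trans, emit = Counter(), Counter()
    out_total, visit_total = Counter(), Counter()
    for seq in tagged_sequences:
        for k, (word, state) in enumerate(seq):
            emit[(state, word)] += 1
            visit_total[state] += 1
            if k + 1 < len(seq):
                nxt_state = seq[k + 1][1]
                trans[(state, nxt_state)] += 1
                out_total[state] += 1
    # P(j | i) = #(i -> j) / #(transitions out of i)
    A = {k: v / out_total[k[0]] for k, v in trans.items()}
    # P(word | i) = #(word emitted at i) / #(visits to i)
    B = {k: v / visit_total[k[0]] for k, v in emit.items()}
    return A, B

data = [[('42', 'HouseNo'), ('Oak', 'Road'), ('St', 'Road')]]
A, B = train_hmm(data)
print(A[('HouseNo', 'Road')])   # 1.0
print(B[('Road', 'Oak')])       # 0.5
```

When a symbol can be emitted from several states, the path is ambiguous and these counts become expected counts, which is exactly what Baum-Welch computes.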
Using the HMM to segment
- Find the highest-probability path through the HMM
- Viterbi: a quadratic dynamic-programming algorithm
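The Viterbi step can be sketched as follows. The two-state model and all probabilities below are illustrative, not taken from the tutorial; log probabilities are used to avoid underflow.

```python
import math

# Viterbi decoding: dynamic programming over (state, position),
# O(n * |states|^2) time, as the slide notes.

def viterbi(words, states, start_p, trans_p, emit_p):
    # V[t][s] = (best log-score of any path ending in s at position t, that path)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(words[0], 1e-9)), [s])
          for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            # best predecessor state for s at this position
            score, path = max((V[-1][p][0] + math.log(trans_p[p][s]), V[-1][p][1])
                              for p in states)
            row[s] = (score + math.log(emit_p[s].get(w, 1e-9)), path + [s])
        V.append(row)
    return max(V[-1].values())[1]

states = ['Author', 'Title']
start = {'Author': 0.9, 'Title': 0.1}
trans = {'Author': {'Author': 0.5, 'Title': 0.5},
         'Title': {'Author': 0.1, 'Title': 0.9}}
emit = {'Author': {'Smith': 0.8, 'Learning': 0.01},
        'Title': {'Smith': 0.05, 'Learning': 0.6}}
print(viterbi(['Smith', 'Learning'], states, start, trans, emit))
```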
Comparative evaluation
- Naive model: one state per element in the HMM
- Independent HMM: one HMM per element
- Rule learning method: Rapier
- Nested model: each state in the naive model replaced by an HMM
Results: comparative evaluation
The nested model does best in all three cases (from Borkar 2001):

  Dataset                | Elements | Instances
  US addresses           | 6        | 740
  Company addresses      | 6        | 769
  IITB student addresses | 17       | 2388
HMM approach: summary
- Inter-element sequencing → outer HMM transitions
- Intra-element sequencing → inner HMM
- Element length → multi-state inner HMM
- Characteristic words → dictionary
- Non-overlapping tags → global optimization
Information Extraction: summary
Rule-based:
- And/or combinations with heuristics to control firing
- Brittle to variations in data
- Require less training data; wrappers reported to learn with < 10 examples
- Used in HTML wrappers

Probabilistic:
- Joint probability distribution; more elegant
- Might get hard in general
- Can handle variations
- Used for text segmentation and NE extraction

Feature engineering is key: one has to model how to combine features without undue complexity.
Outline
- Information Extraction
  - Rule-based methods
  - Probabilistic methods
- Duplicate elimination
- Reducing the need for training data:
  - Active learning
  - Bootstrapping from structured databases
  - Semi-supervised learning
- Summary and research problems
The de-duplication problem
Given a list of semi-structured records, find all records that refer to the same entity.
- Example applications:
  - Data warehousing: merging name/address lists
    - Entity: (a) person, (b) household
  - Automatic citation databases (Citeseer): references
    - Entity: paper
Challenges
- Errors and inconsistencies in the data
- Spotting duplicates can be hard, as they may be spread far apart:
  - they may not be group-able using obvious keys
- Domain-specific:
  - Existing manual approaches require re-tuning for every new domain
Example: citations from Citeseer
- Our prior: duplicate when author, title, booktitle and year match...
- Author matches can be hard:
  - "L. Breiman, L. Friedman, and P. Stone, (1984)." vs.
  - "Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone."
- Venue matches can be harder:
  - "In VLDB-94" vs.
  - "In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994."
- "Johnson-Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press." vs.
  "P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983."
- Similar-looking records that are not duplicates:
  - "H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995." vs.
  - "H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz. Improving TCP/IP Performance over Wireless Networks. Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995."
Learning the de-duplication function
- Given examples of duplicate and non-duplicate pairs, learn to predict whether a pair is a duplicate or not
- Input features:
  - Various kinds of similarity functions between attributes
    - Edit distance, Soundex, n-grams on text attributes
    - Absolute difference on numeric attributes
  - Some attribute similarity functions are incompletely specified
    - Example: weighted distances with parameterized weights
    - Need to learn the weights first
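The feature construction described above can be sketched as follows. The attribute names and the choice of Jaccard similarity over 3-grams are illustrative; any of the similarity functions listed on the slide could fill each slot.

```python
# Sketch: map a pair of records to a vector of attribute similarities,
# on which a duplicate/non-duplicate classifier is then trained.

def ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def ngram_sim(a, b):
    """Jaccard similarity over character 3-grams (one of the text measures above)."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / max(1, len(ga | gb))

def pair_features(rec1, rec2):
    return [
        ngram_sim(rec1['author'], rec2['author']),
        ngram_sim(rec1['title'], rec2['title']),
        abs(rec1['year'] - rec2['year']),        # absolute difference on a numeric attribute
    ]

r1 = {'author': 'L. Breiman', 'title': 'Classification Trees', 'year': 1984}
r2 = {'author': 'Leo Breiman', 'title': 'Classification Trees', 'year': 1984}
f = pair_features(r1, r2)
print(f[1], f[2])   # 1.0 0
```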
The learning approach
(Figure: each record pair is mapped to a vector of attribute similarities; pairs labeled duplicate/non-duplicate train a classifier, which is then applied to the similarity vectors of unlabeled pairs.)

Learning attribute similarity functions
- String edit distance with parameters:
  - C(x, y): cost of replacing x with y
  - d: cost of deleting a character
  - i: cost of inserting a character
- Learning the parameters from examples showing matchings
- Transformed examples: a sequence of
  - Match
  - Insert
  - Delete operations
- Train a stochastic model on the sequence
  - [Ristad & Yianilos 1998], [Bilenko & Mooney 2002]
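The parameterized edit distance above can be sketched as standard dynamic programming. Here the costs are fixed by hand; the cited learning approaches would instead fit C, d and i from matched example pairs.

```python
# Parameterized edit distance: C(x, y) is the substitution cost,
# d the per-character deletion cost, ins the insertion cost.

def edit_distance(s, t, C, d=1.0, ins=1.0):
    m, n = len(s), len(t)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for a in range(1, m + 1):
        D[a][0] = a * d                       # delete all of s[:a]
    for b in range(1, n + 1):
        D[0][b] = b * ins                     # insert all of t[:b]
    for a in range(1, m + 1):
        for b in range(1, n + 1):
            D[a][b] = min(D[a - 1][b] + d,                        # delete s[a-1]
                          D[a][b - 1] + ins,                      # insert t[b-1]
                          D[a - 1][b - 1] + C(s[a - 1], t[b - 1]))  # substitute/match
    return D[m][n]

C = lambda x, y: 0.0 if x == y else 1.0       # uniform substitution cost
print(edit_distance('kitten', 'sitting', C))  # 3.0
```

Lowering, say, the substitution cost between 'e' and 'i' relative to other pairs is exactly the kind of tuning the learned models perform automatically.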
Summary: de-duplication
- Previous work concentrated on designing good static, domain-specific string similarity functions
- A recent spate of work on dynamic, learning-based approaches appears promising
- Two levels:
  - Attribute level: tuning the parameters of existing string similarity functions to match examples
  - Record level: classifiers like SVMs and decision trees used to combine the similarities along various attributes, saving the effort of tuning thresholds and conditions
Outline
- Information Extraction
  - Rule-based methods
  - Probabilistic methods
- Duplicate elimination
- Reducing the need for training data:
  - Active learning
  - Bootstrapping from structured databases
  - Semi-supervised learning
- Summary and research problems
Active learning
- Ordinary learner: learns from a fixed set of labeled training data
- Active learner:
  - Selects unlabeled examples from a large pool and interactively seeks their labels from a user
  - Careful selection of examples can lead to faster convergence
  - Useful when unlabeled examples are abundant and labeling them requires human effort
Example: active learning
(Figure: starting from a separator learned on a few labeled points, the active learner repeatedly asks for the label of the unlabeled point with the highest prediction uncertainty, e.g. the one closest to the current decision boundary.)

Measuring prediction certainty
- Classifier-specific methods:
  - Support vector machines: distance from the separator
  - Naive Bayes classifier: posterior probability of the winning class
  - Decision tree classifier: weighted sum of distances from different boundaries, error of the leaf, depth of the leaf, etc.
- Committee-based approach (Seung, Opper, and Sompolinsky 1992):
  - Disagreement amongst the members of a committee
  - The most successfully used method
Forming a classifier committee
Randomly perturb the learnt parameters:
- Probabilistic classifiers:
  - Sample from the posterior distribution on parameters given the training data
  - Example: a binomial parameter p has a beta distribution with mean p
- Discriminative classifiers:
  - Random boundaries in the uncertainty region
Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk from the k classifiers
  - Compute the uncertainty U(x) as the entropy of these predictions
- Pick the instance with the highest uncertainty
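The selection step above can be sketched directly: uncertainty is the entropy of the committee's vote distribution. The instance names and votes below are made-up illustrations.

```python
import math
from collections import Counter

# Committee-based sampling: U(x) = entropy of the label votes
# y1..yk from the k committee members.

def vote_entropy(votes):
    n = len(votes)
    return -sum((c / n) * math.log(c / n)
                for c in Counter(votes).values())

def pick_most_uncertain(unlabeled_votes):
    """unlabeled_votes: {instance: [y1, ..., yk]} -> instance with max disagreement."""
    return max(unlabeled_votes, key=lambda x: vote_entropy(unlabeled_votes[x]))

votes = {'x1': ['dup', 'dup', 'dup', 'dup'],   # committee agrees: entropy 0
         'x2': ['dup', 'non', 'dup', 'non']}   # maximal disagreement
print(pick_most_uncertain(votes))   # x2
```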
Active learning in deduplication with decision trees
Forming a committee of trees by random perturbation:
- Selecting the split attribute:
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute within close range of the lowest
- Selecting a split point:
  - Normally: the midpoint of the range with the lowest entropy
  - Perturbed: a random point anywhere in the range with the lowest entropy
Speed of convergence
- With 100 pairs:
  - Active learning: 97% (peak)
  - Random: only 30% (from Sarawagi 2002)
Active learning in IE with HMMs
Forming a committee of HMMs by random perturbation:
- Emission and transition probabilities are independent multinomial distributions
- Posterior distribution for the multinomial parameters:
  - Dirichlet, with mean estimated using maximum likelihood
- Results on part-of-speech tagging (Dagan 1999):
  - 92.6% accuracy using active learning with 20,000 instances, as against 100,000 random instances
Active learning in rule-based IE
Stalker (Muslea et al. 2000):
- Learn two classifiers:
  - one based on a forward traversal of the document, the second based on a backward traversal
- Select for labeling those records that get conflicting predictions from the two
- Performance: 85% accuracy without active learning; 94% with active learning
Bootstrapping from structured databases
- Given a database of structured elements
  - Example: a collection of structured BibTeX entries
- Segment to best match the database
- HMMs:
  - Initialize the dictionary using the database
  - Learn transitions using Baum-Welch on unlabeled data
  - Assigning probabilities is hard
  - Still open to investigation
- Rule-based IE: (Snowball, Agichtein 2000)
Semi-supervised learning
- Can unlabeled data improve classifier accuracy?
- Possibly, for probabilistic classifiers like HMMs:
  - Use labeled data to train an initial model
  - Use Baum-Welch on unlabeled data to refine the model to maximize data likelihood
  - Unfortunately, no gain in accuracy reported (Seymore 1999)
- Needs further investigation
Unsupervised learning in duplicate elimination
[Tailor 2002]
- Cluster the similarity vectors of record pairs into three groups
- Label the clusters based on their distance to the ideal duplicate and non-duplicate vectors
- (Optional) Train a classifier on this labeled data
- Results: 79.8% accuracy on Walmart's items table
Summary
- Information Extraction
  - Various levels of complexity depending on the input
    - Segmentation, HTML wrappers, free-format
  - Model type:
    - Rule-based and probabilistic (HMM)
    - Independent or simultaneous
  - Several research prototypes of each type
- Duplicate elimination
  - Challenging because of variations in data format
  - Learning applied to design the deduplication function
Manual vs. learning approach
Manual:
- Inspect patterns
- Code scripts
- Requires a high-skill programmer

Learning:
- Label examples
- Choose & train a model
- Low-skill, cheaper labor for the most part
- Feature design and model selection require very high skill
Summary
- Reducing the need for labeled data
  - Active learning
    - Various methods proposed
    - Committee-based sampling most popular
    - Applications:
      - HMMs for IE
      - Decision trees for deduplication
Topics of further research
- Information Extraction:
  - Exploiting higher-level structures in input data, e.g. trees, tables
  - Integrated learning in the presence of a large structured DB, small labeled data and large unlabeled data
  - Efficiency in the presence of a large database/dictionary
  - Wrappers at the website level involving several structured tables
- Duplicate elimination:
  - Multi-table de-duplication
  - Integrating semi-supervised and active learning
  - Efficient active learning without requiring materialization of all possible pairs
  - Efficient evaluation of a de-duplication function
Topics of further research
- Combining machine learning of extraction patterns with human-generated scripts
- Updating models as data arrives: continuous learning
- Going from research prototypes to robust products and toolkits
References
- General
  - H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: language, model and algorithms. VLDB, 2001.
  - S. Lawrence, C. L. Giles, and K. Bollacker. Digital libraries and autonomous citation indexing. IEEE Computer, 32(6):67-71, 1999.
  - A. McCallum, K. Nigam, J. Reed, J. Rennie, and K. Seymore. Cora: Computer science research paper search engine. http://cora.whizbang.com/, 2000.
  - IEEE Data Engineering special issue on data cleaning. http://www.research.microsoft.com/research/db/debull/A00dec/issue.htm, December 2000.
  - M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1), 1998.
- Information extraction
  - E. Agichtein and L. Gravano. Snowball: extracting relations from large plain-text collections. ACM Intl. Conf. on Digital Libraries, 2000.
  - D. M. Bikel, S. Miller, R. Schwartz and R. Weischedel. Nymble: a high-performance learning name-finder. ANLP, 1997.
  - Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.
  - Mary Elaine Califf and R. J. Mooney. Relational learning of pattern-match rules for information extraction. AAAI, 1999.
  - D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.
  - A. McCallum, D. Freitag and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. ICML, 2000.
References
  - K. Seymore, A. McCallum, and R. Rosenfeld. Learning Hidden Markov Model structure for information extraction. AAAI Workshop on Machine Learning for Information Extraction, 1999.
  - S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 1999.
- Wrappers
  - C. Y. Chung, M. Gertz, and N. Sundaresan. Reverse engineering for web data: from visual to semantic structures. ICDE, 2002.
  - William W. Cohen, Matthew Hurst, and Lee S. Jensen. A flexible learning system for wrapping tables and lists in HTML documents. WWW, 2002.
  - David W. Embley, Y. S. Jiang, and Yiu-Kai Ng. Record-boundary discovery in web documents. SIGMOD, 1999.
  - C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems Special Issue on Semistructured Data, 23(8), 1998.
  - N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper induction for information extraction. IJCAI, 1997.
  - L. Liu, C. Pu, and W. Han. XWrap: an XML-enabled wrapper construction system for web information sources. ICDE, 2000.
  - Ion Muslea, Steven Minton, and Craig A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 2001.
  - Jussi Myllymaki. Effective web data extraction with standard XML technologies. WWW, 2001.
References
- Duplicate elimination
  - A. Z. Broder, S. C. Glassman, M. S. Manasse, and Geoffrey Zweig. Syntactic clustering of the Web. WWW, 1997.
  - M. G. Elfeky, V. S. Verykios, and A. K. Elmagarmid. Tailor: a record linkage toolkit. ICDE, 2002.
  - S. Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.
  - W. E. Winkler. Matching and record linkage. In B. G. C. et al., editor, Business Survey Methods, pages 355-384. New York: J. Wiley, 1995.
- Active and semi-supervised learning
  - Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. J. of Artificial Intelligence Research, 11:335-360, 1999.
  - Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
  - Ion Muslea, Steve Minton, and Craig Knoblock. Selective sampling with redundant views. AAAI, 2000.
  - H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Computational Learning Theory, pages 287-294, 1992.
  - T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.