DataXFormer: Leveraging the Web for Semantic Transformations



  2. DataXFormer: Leveraging the Web for Semantic Transformations
     Ziawasch Abedjan, John Morcos, Michael Gubanov, Ihab F. Ilyas, Mike Stonebraker, Paolo Papotti, Mourad Ouzzani

  3. Integration of multiple sources

  4. Different value representations
     As found in three sources:
       Departure: BOS | Boston, MA (BOS) | Boston – Logan International
       Destination: SFO | San Francisco, CA (SFO) | San Francisco, CA (SFO)
       Cabin: Choice | Coach | Economy Restricted
       Time: 5:50 am | 5:50 AM | 14:26
       Price: 563 $ | 561 $ | 613 €
     After transformation to one representation:
       Departure: Boston (BOS) | Boston (BOS) | Boston (BOS)
       Destination: San Francisco (SFO) | San Francisco (SFO) | San Francisco (SFO)
       Cabin: Economy | Economy | Economy
       Time: 5:50a | 5:50a | 02:26p
       Price: 563 $ | 561 $ | 731 $
       …

  5. Data Transformation Tasks
     • airport code → city
       – BOS → Boston; JFK → New York; ORD → Chicago; BER → Berlin; CDG → Paris
     • model → brand
       – iPhone 6 → Apple Inc.
     • ISBN → title
       – 0-553-57340-3 → "A Game of Thrones"
     • unit conversion
       – 1 mi → 1.6 km
     • date format transformations
       – MM/DD/YYYY → DD/MM/YY
     • long/lat → location
     • language translation
     • currency conversion
       – 1 USD → 0.7? EUR
     • …

  6. Problem Statement: Automatically discover transformations!
     Given (airport → City): BER → ?; JFK → ?; ORD → ?; HBE → ?; IST → ?; FRA → ?; BOS → ?; DFW → ?; …
     Find (airport → City): BER → Berlin; JFK → New York; ORD → Chicago; HBE → Alexandria; IST → Istanbul; FRA → Frankfurt; BOS → Boston; DFW → Dallas; …

  7. Syntactic Transformations
     Liter → Gallon: 1 → 0.26; 5 → 1.04; 100 → 26.42; 34 → 8.98; 6 → 1.58
     US date → EU date: 11/01/2014 → 01.11.2014; 11/02/2014 → 02.11.2014; 10/30/2014 → 30.10.2014; 11/05/2014 → 05.11.2014; 11/04/2014 → 04.11.2014
     GB → MB: 1 → 1,024; 0.49 → 500; 100 → 102,400; 2 → 2,048; 6 → 6,144
     Name → Last name: Michael Stonebraker → Stonebraker; Michael Bay → Bay; Michael Brodie → Brodie; Michael Jordan → Jordan
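
The syntactic transformations on this slide can each be written as a fixed rule over the input value, with no external knowledge needed. The sketch below illustrates that idea; the function names are hypothetical, and the conversion factors (the US liquid gallon and a binary 1 GB = 1,024 MB) are assumptions that reproduce most, but not all, of the rounded values shown on the slide.

```python
from datetime import datetime

# Assumption: the slide means the US liquid gallon (3.785411784 liters).
LITERS_PER_US_GALLON = 3.785411784


def liters_to_gallons(liters: float) -> float:
    """Convert liters to US gallons, rounded to two decimals."""
    return round(liters / LITERS_PER_US_GALLON, 2)


def us_to_eu_date(us_date: str) -> str:
    """Rewrite a MM/DD/YYYY date as DD.MM.YYYY."""
    return datetime.strptime(us_date, "%m/%d/%Y").strftime("%d.%m.%Y")


def gb_to_mb(gigabytes: float) -> float:
    """Convert GB to MB, assuming the binary factor 1 GB = 1,024 MB."""
    return gigabytes * 1024


def last_name(full_name: str) -> str:
    """Take the final whitespace-separated token as the last name."""
    return full_name.split()[-1]
```

For example, `us_to_eu_date("10/30/2014")` yields `"30.10.2014"` and `last_name("Michael Brodie")` yields `"Brodie"`. Semantic transformations such as ISBN → title cannot be expressed this way, because the output is not computable from the input string alone.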

  8. Semantic Transformations
     ISBN → Title: 0-553-57340-3 → A Game of Thrones; 0-553-80202-X → Universe in a Nutshell; 0-671-62964-6 → The Hitchhiker's Guide to the Galaxy; 0-374-53355-7 → Thinking Fast and Slow; 0-875-84585-1 → The Innovator's Dilemma
     Name → Nickname: Michael → Mike; Samuel → Sam; Ziawasch → Zia; Rebecca → Becca
     Airport code → City: BOS → Boston; JFK → New York; ORD → Chicago; BER → Berlin; CDG → Paris
     Model → Category: iPhone 6 → Mobile Phone; MacBook Air → Notebook; Logitech mouse → Accessory; Nexus 5 → Mobile Phone

  9. Problem Statement
     Given (airport → City), with example transformations: BER → Berlin; JFK → New York; ORD → Chicago; HBE → ?; IST → ?; FRA → ?; BOS → ?; DFW → ?; …
     Find (airport → City): BER → Berlin; JFK → New York; ORD → Chicago; HBE → Alexandria; IST → Istanbul; FRA → Frankfurt; BOS → Boston; DFW → Dallas; …

  10. DataXFormer: The Web as general Repository
      Given (airport → City): BER → Berlin; JFK → New York; ORD → Chicago; HBE → ?; IST → ?; FRA → ?; BOS → ?; DFW → ?; …
      Sources: Web Tables, Web Forms

  11. Web Tables
      • Dataset
        – Dresden Web Table Corpus
        – 120 million tables
      • Efficiently discovering transformation examples:
        – Filter irrelevant tables
        – Overcome fragmentation
      • Average row count = 12
        – Dirty and heterogeneous
      Techniques: filter-and-refine approach; multiple iterations and example augmentation; rating transformations based on example hits and majority vote

  12. Transformation task (workflow diagram)
      Input: the airport → City task with examples BER → Berlin, JFK → New York, ORD → Chicago and open values HBE, IST, FRA, BOS, DFW.
      Step labels in the diagram: Filter, Look-up, Refine, Augment query, Augment database.
      Candidate Web tables (Table 1 to Table 4) use differing column headers, such as airport/city, code/location, and apc/location, and contain conflicting rows (for example FRA → Frankfurt and FRA → Hessen).
      Result:
        X | Y | Score | Lineage
        FRA | Frankfurt | 0.83 | T1, T2
        BOS | Boston | 1 | T1
        DFW | Dallas | 0.67 | T2
        FRA | Hessen | 0.67 | T4
        …
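
The filter, look-up, and voting idea behind this slide can be sketched on a toy corpus. This is a simplified illustration under stated assumptions, not the authors' implementation: the table contents, the `min_hits` threshold, and all function names are hypothetical, and the score here is a plain vote share, which is not the rating that produced the scores in the slide's result table.

```python
from collections import Counter, defaultdict

# Toy stand-in for a web table corpus: each table is a list of (key, value) rows.
# Table ids and contents are illustrative only.
TABLES = {
    "T1": [("BER", "Berlin"), ("JFK", "New York"), ("FRA", "Frankfurt"), ("BOS", "Boston")],
    "T2": [("JFK", "New York"), ("BER", "Berlin"), ("FRA", "Frankfurt"), ("DFW", "Dallas")],
    "T4": [("JFK", "New York"), ("BER", "Berlin"), ("ORD", "Illinois"), ("FRA", "Hessen")],
}


def filter_tables(tables, examples, min_hits=2):
    """Keep tables that contain at least `min_hits` of the example pairs."""
    kept = {}
    for table_id, rows in tables.items():
        hits = sum(1 for pair in examples.items() if pair in rows)
        if hits >= min_hits:
            kept[table_id] = rows
    return kept


def look_up(tables, queries):
    """Collect candidate values for each query key, remembering the source table."""
    candidates = defaultdict(list)
    for table_id, rows in tables.items():
        for key, value in rows:
            if key in queries:
                candidates[key].append((value, table_id))
    return candidates


def majority_vote(candidates):
    """Pick the most frequent value per key; score is the share of supporting tables."""
    result = {}
    for key, found in candidates.items():
        counts = Counter(value for value, _ in found)
        value, votes = counts.most_common(1)[0]
        lineage = sorted(t for v, t in found if v == value)
        result[key] = (value, votes / len(found), lineage)
    return result


examples = {"BER": "Berlin", "JFK": "New York", "ORD": "Chicago"}
queries = {"FRA", "BOS", "DFW"}
relevant = filter_tables(TABLES, examples)
answers = majority_vote(look_up(relevant, queries))
```

In this toy run, `FRA` resolves to `Frankfurt` with lineage `T1, T2`, because two of the three tables that mention `FRA` agree. An augmentation step would then add the newly found pairs to `examples` and repeat the loop, which this sketch leaves out.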

  13. Web Forms
      • How to find Web forms?
        – Use search engine
      • How to use a Web form?
        – Generate a wrapper
      • How to avoid high response time?
        – Cache results as new tables
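
The caching point can be illustrated with a small wrapper that stores every answer a form returns, so repeated queries do not trigger another slow request. This is a minimal sketch: `CachedForm` and `fake_airport_form` are hypothetical names, and the form is simulated by a local function rather than a real HTTP call.

```python
from typing import Callable, Dict


class CachedForm:
    """Wrap a slow lookup function and remember every answer it returns."""

    def __init__(self, submit: Callable[[str], str]):
        self._submit = submit            # hypothetical function that queries a web form
        self.table: Dict[str, str] = {}  # cached (input, output) pairs, usable as a new table
        self.calls = 0                   # number of real form submissions

    def transform(self, value: str) -> str:
        if value not in self.table:
            self.calls += 1
            self.table[value] = self._submit(value)
        return self.table[value]


def fake_airport_form(code: str) -> str:
    """Stand-in for a remote web form; a real wrapper would issue an HTTP request."""
    return {"HBE": "Alexandria", "IST": "Istanbul"}.get(code, "?")


form = CachedForm(fake_airport_form)
first = form.transform("IST")
second = form.transform("IST")  # served from the cache, no second submission
```

After both calls, `form.calls` is 1 and `form.table` holds the pair `IST → Istanbul`, which can be treated like one more row of a web table.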

  14. Wrapping Web forms
      • Parse the HTML and find request parameters
      • Locate output path by probing with examples
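
Both wrapping steps can be sketched with Python's standard `html.parser` module. The class names, the sample HTML, and the simple tag-path notation are assumptions made for illustration; a production wrapper would need to handle scripts, malformed markup, and multiple forms per page.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not be pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FormFieldFinder(HTMLParser):
    """Collect the form action and the names of its input fields."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag in ("input", "select", "textarea") and "name" in attrs:
            self.fields.append(attrs["name"])


class OutputLocator(HTMLParser):
    """Record the tag path of every text node equal to the expected output."""

    def __init__(self, expected):
        super().__init__()
        self.expected = expected
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() == self.expected:
            self.paths.append("/".join(self.stack))


# Illustrative form page and response page (not taken from any real site).
FORM_PAGE = '<form action="/lookup"><input type="text" name="code"><input type="submit" value="Go"></form>'
RESPONSE = '<html><body><div id="result"><span>Berlin</span></div></body></html>'

finder = FormFieldFinder()
finder.feed(FORM_PAGE)
finder.close()

# Probe: we know BER should map to Berlin, so find where "Berlin" appears.
locator = OutputLocator("Berlin")
locator.feed(RESPONSE)
locator.close()
```

Here `finder.fields` contains the single request parameter `code`, and `locator.paths` contains `html/body/div/span`. Once the path is known from one probing example, the same path can be read for inputs whose output is unknown.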

  15. Expert System for Corner Cases
      • Evaluate transformations
      • Solve conflicts
      • Create transformations

  16. Experiments
      • Collected 50 queries from computer scientists and Tamr customers
        1. Fahrenheit to Celsius
        2. miles to km
        3. pound to kg
        4. USD to EUR
        5. zip to state
        6. zip to city
        7. UPS tracking to address
        8. English to German
        9. SWIFT code to bank
        10. hex to RGB
        11. ISBN to publisher
        12. ISBN to title
        13. ISBN to author
        14. ISSN to title
        15. IP address to country
        16. domain to primary IP
        17. sentence to language
        18. text to encoding
        19. Gregorian to Hijri
        20. CUSIP to company
        21. CUSIP to ticker
        22. symbol to company
        23. IBAN to bank name
        24. location to temperature
        25. location to humidity
        26. car plate to details
        27. country code to country
        28. ASCII to char
        29. car model to brand
        30. country to demonym
        31. country to language
        32. country to currency
        33. company to BBGID
        34. patent ID to name
        35. city to long/lat
        36. entity to Wikipedia link
        37. entity to Google graph ID
        38. person to Twitter ID
        39. IP to domain
        40. company to CEO
        41. company to industry
        42. US standard to metric
        43. fractions to decimals
        44. country to code
        45. state to state abbrv
        46. time zone to abbrv
        47. city to country
        48. airport code to city
        49. RGB to color
        50. ASCII to Unicode
