Outline Motivation & Goal Framework & Design Examples - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Hazy

Lixing Lian, Cheng Ren

slide-2
SLIDE 2

Outline

¤ Motivation & Goal
¤ Framework & Design
¤ Examples
¤ Future Work
¤ Conclusion

slide-3
SLIDE 3

Two Trends that Drive Hazy

¤ Data in a large number of formats

  • (text, audio, video, OCR, sensor data, etc.)

¤ Arms race to deeply understand data

Statistical tools attack both trends.

Hazy = statistical tools + data management

slide-4
SLIDE 4

Hazy’s Thesis

¤ The next breakthrough in data analysis

  • may not be a new data analysis algorithm…
  • …but may be in the ability to rapidly combine, deploy, and maintain existing algorithms.

slide-5
SLIDE 5

Hazy’s Goal

¤ Making big-data analytics-driven systems easier to build and maintain.
¤ Find common patterns when deploying statistical tools on data.

  • Programming abstractions
  • Infrastructure abstractions
slide-6
SLIDE 6
slide-7
SLIDE 7

Programming abstractions

¤ Enable developers to try many algorithms on the same data set.
¤ When one algorithm improves, all applications using that algorithm improve automatically.

slide-8
SLIDE 8

Infrastructure abstractions

¤ No need to reinvent or reengineer the wheel when adding a new algorithm to the system.
¤ When one component of the infrastructure improves, all algorithms benefit automatically.

slide-9
SLIDE 9

Markov logic

¤ Easily represent common statistical models: logistic regression and conditional random fields
¤ Build more sophisticated statistical models
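To make the logistic regression claim concrete, here is a minimal sketch in Python. It assumes a Markov logic program made of weighted rules of the form feature_i(x) -> label(x); the function names, weights, and feature values are hypothetical and are not taken from the slides. Enumerating the two possible worlds for label(x) and normalizing gives the same probability as a logistic function applied to the summed weights of the features that hold.

```python
import math


def mln_label_probability(weights, features):
    """P(label = True) under weighted rules feature_i -> label.

    A world's score is the total weight of the ground rules it satisfies.
    An implication is satisfied when its body is false or its head is true.
    """
    def score(label):
        return sum(w for w, f in zip(weights, features) if (not f) or label)

    z_false = math.exp(score(False))
    z_true = math.exp(score(True))
    return z_true / (z_false + z_true)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and feature values, for illustration only.
weights = [1.5, -0.5, 2.0]
features = [True, True, False]

p_mln = mln_label_probability(weights, features)
# Only the rules whose feature holds distinguish the two worlds.
p_logistic = sigmoid(sum(w for w, f in zip(weights, features) if f))
```

In this sketch the rule with a false feature is satisfied in both worlds, so its weight cancels out of the ratio, which is why only the active features appear in the logistic form.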

slide-10
SLIDE 10

Markov Logic by Example

slide-11
SLIDE 11

Markov Logic by Example

wrote(s, t) ∧ advisedBy(s, p) -> wrote(p, t)

wrote(Tom, P1), advisedBy(Tom, Jerry) -> wrote(Jerry, P1)
wrote(Tom, P1), advisedBy(Tom, Bob) -> wrote(Bob, P1)
wrote(Bob, P1), advisedBy(Bob, Jerry) -> wrote(Jerry, P1)

Step 1: Grounding
Find the field and extract data

advisee  advisor
Tom      Jerry
Tom      Bob

Step 2: Sampling
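The grounding step (Step 1) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the slides' implementation: advisedBy is treated as a fully known evidence relation, the pair (Bob, Jerry) is taken from the third ground rule on the slide, and the names `advised_by`, `papers`, and `ground_rule` are hypothetical.

```python
from itertools import product

# Evidence, as read off the slide. advisedBy(Bob, Jerry) comes from the
# third ground rule rather than from the advisee/advisor table.
advised_by = [("Tom", "Jerry"), ("Tom", "Bob"), ("Bob", "Jerry")]
papers = ["P1"]


def ground_rule(advised_by, papers):
    """Ground wrote(s, t) AND advisedBy(s, p) -> wrote(p, t).

    One ground clause is produced for every known advisedBy pair and
    every paper constant; each clause is a (body, head) pair of atoms.
    """
    clauses = []
    for (s, p), t in product(advised_by, papers):
        body = (("wrote", s, t), ("advisedBy", s, p))
        head = ("wrote", p, t)
        clauses.append((body, head))
    return clauses


ground_clauses = ground_rule(advised_by, papers)
for body, head in ground_clauses:
    print(body, "->", head)
```

With these inputs the sketch produces the three ground rules listed on the slide, in the same order.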

slide-12
SLIDE 12

Grounding via SQL in Tuffy

Program transformed into many SQL queries (bottom-up)

wrote(s, t) ∧ advisedBy(s, p) -> wrote(p, t)

SELECT w1.id, a.id, w2.id
FROM wrote w1, advisedBy a, wrote w2
WHERE w1.person = a.advisee
  AND w1.paper = w2.paper
  AND a.advisor = w2.person
  AND …
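As a rough illustration of grounding as a relational join, the visible part of the query above can be run against an in-memory SQLite database from Python. The table schemas, the atom ids, and the sample rows are assumptions made for this sketch (Tuffy itself runs on a full RDBMS and its actual schema is not shown on the slide), and the elided trailing condition is left out. An ORDER BY clause is added only to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per ground atom, keyed by an atom id.
conn.executescript("""
CREATE TABLE wrote (id INTEGER PRIMARY KEY, person TEXT, paper TEXT);
CREATE TABLE advisedBy (id INTEGER PRIMARY KEY, advisee TEXT, advisor TEXT);

INSERT INTO wrote VALUES (1, 'Tom', 'P1'), (2, 'Jerry', 'P1'), (3, 'Bob', 'P1');
INSERT INTO advisedBy VALUES (1, 'Tom', 'Jerry'), (2, 'Tom', 'Bob'), (3, 'Bob', 'Jerry');
""")

# Each result row is one ground clause: (body wrote atom, advisedBy atom,
# head wrote atom), all identified by their ids.
rows = conn.execute("""
SELECT w1.id, a.id, w2.id
FROM wrote w1, advisedBy a, wrote w2
WHERE w1.person = a.advisee
  AND w1.paper = w2.paper
  AND a.advisor = w2.person
ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

Each returned triple corresponds to one ground rule from the Markov logic example, which is the sense in which the database's join machinery performs the grounding.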

slide-13
SLIDE 13

Grounding: Top-down vs. Bottom-up

slide-14
SLIDE 14
slide-15
SLIDE 15

Example 1: DeepDive

¤ Enrich Wikipedia with structured data that is extracted from unstructured sources

slide-16
SLIDE 16

DeepDive’s Origin

¤ Build a system that is able to read the Web and answer questions.
¤ Machine Reading: “List members of the Brazilian Olympic Team in this corpus with years of membership”

DeepDive

slide-17
SLIDE 17

DeepDive

slide-18
SLIDE 18

DeepDive

slide-19
SLIDE 19
slide-20
SLIDE 20

DeepDive

Given a name, DeepDive collects all the information related to that name and displays it together.

slide-21
SLIDE 21

Demo

¤ Wikipedia: http://en.wikipedia.org/wiki/Barack_Obama
¤ WiscI: http://research.cs.wisc.edu/hazy/wikidemo/index.php/Barack_Obama
¤ DeepDive: http://research.cs.wisc.edu/hazy/demos/deepdive/index.php/Barack_Obama

DeepDive

slide-22
SLIDE 22

DeepDive: Demo

Tasks it performs:

  • Web Crawling
  • Information Extraction
  • Deep Linguistic Processing
  • Audio/Video Transcription
  • Terabyte Parallel Joins

Some Information:

  • 50TB Data
  • 500K Machine hours
  • 500M Webpages
  • 400K Videos
  • 7Bn Entity Mentions
  • 114M Relationship Mentions

Declare graphical models at Web scale

slide-23
SLIDE 23

Example 2: GeoDeepDive

¤ http://hazy.cs.wisc.edu/demo/geo/
¤ The goal is to help geoscientists extract data that is buried in the text, tables, and figures of journal articles and web sites, sometimes called dark data.
¤ Extends a database called Macrostrat.

slide-24
SLIDE 24

Future work

¤ Assisted Development

  • expertise, experience of data and algorithms

¤ New Data Platforms

  • Hadoop environment
slide-25
SLIDE 25

Conclusion

¤ Key technical hypothesis: A large fraction of the processing performed by applications that use and analyze these new sources of data can be captured using a small handful of primitives.
¤ Hazy group is building several applications
¤ More information: http://hazy.cs.wisc.edu/hazy/

slide-26
SLIDE 26

Questions?