SMART CONSOLIDATION OF PRODUCT INFORMATION



SLIDE 1

SMART CONSOLIDATION OF PRODUCT INFORMATION

Maurice van Keulen1, Dolf Trieschnigg1,2, Brend Wanders1

1University of Twente, Enschede, Netherlands 2Mydatafactory, Meppel, Netherlands

SLIDE 2

What is it
§ Data and specification on parts, substances, etc.
Why is it a problem?
§ High requirements on data quality
§ Errors and duplicates may be costly or even pose health risks
Ø Even so, it is a mess (more on that later!)

28 Oct 2016 DBDBD 2016 - Smart Consolidation of Product Information 2

PRODUCT DATA

WHAT IS IT AND WHY IS IT A PROBLEM?

SLIDE 3

Proposed approach
§ Given catalogue / database with data on products
§ Gather data on the same products from websites (many more or less independent sources)
§ Consolidate: merge and clean
Ø One enriched description of the product


PRODUCT INFORMATION CLEANING AND ENRICHMENT
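
The "consolidate: merge and clean" step can be illustrated with a minimal sketch. It assumes each website yields one flat record per product and that the value reported by most sources wins; the field names, values, and the majority-vote rule are assumptions made for illustration, not the method used in the project.

```python
from collections import Counter, defaultdict


def consolidate(records):
    """Merge several source records for one product into one description.

    Each record maps a field name to a raw value. Field names and values are
    normalized (trimmed, lower-cased), the most frequently reported value per
    field wins, and the share of reports agreeing with it is kept as a rough
    confidence score.
    """
    votes = defaultdict(Counter)
    for record in records:
        for name, value in record.items():
            cleaned = str(value).strip().lower()
            if cleaned:  # ignore empty values
                votes[name.strip().lower()][cleaned] += 1

    merged = {}
    for name, counter in votes.items():
        value, count = counter.most_common(1)[0]
        merged[name] = (value, count / sum(counter.values()))
    return merged


# Made-up records for a single product, as three websites might report it.
sources = [
    {"Inner diameter": "25 mm", "Outer diameter": "52 mm"},
    {"inner diameter": "25 mm ", "outer diameter": "52 mm", "width": "15 mm"},
    {"inner diameter": "20 mm", "width": "15 mm"},
]
result = consolidate(sources)
```

Here two of the three sources agree on the inner diameter, so `result["inner diameter"]` carries "25 mm" with a confidence of about 0.67, while the undisputed fields get 1.0.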

SLIDE 4


PILOT: BALL BEARINGS

1. GIVEN CATALOGUE / DATABASE WITH DATA ON PRODUCTS

SLIDE 5


PILOT: BALL BEARINGS

2. GATHER DATA ON THE SAME PRODUCTS FROM WEBSITES; 3. CONSOLIDATE

§ Get product pages
§ Extract data
§ Consolidate (merge, clean)

SLIDE 6


PILOT: EXPERIENCES

SLIDE 7


PILOT EXPERIENCES

SLIDE 8

So, how to robustly automate this process of gathering, extraction and consolidation of product data?
§ Probabilistic approach throughout
§ Architecture for web harvesting
§ Automatically understand search forms and page structures, extract fields, and handle absurd data and field names
§ Get or automatically produce feedback to decide whether something is good or rubbish
§ Be capable of backing out of a decision to redo something


PROJECT OBJECTIVE

SLIDE 9

§ Flexible and intelligent
§ Backpedal and Redo (data provenance)
§ Flows may try multiple methods, sort out results later
§ Feedback loops to learn from ‘probably good’ data to understand new sites


WEB HARVESTING ARCHITECTURE
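
"Backpedal and Redo" relies on data provenance: if every extracted fact remembers which step produced it, a bad decision can be retracted without touching the rest. The sketch below is one hypothetical way to do that; the class names, step identifiers, and values are invented for illustration and are not taken from the project's code.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    product: str
    name: str
    value: str
    derived_by: str  # provenance: id of the step that produced this fact


@dataclass
class HarvestStore:
    facts: list = field(default_factory=list)

    def record(self, step_id, product, name, value):
        """Store a fact together with the step that derived it."""
        self.facts.append(Fact(product, name, value, step_id))

    def back_out(self, step_id):
        """Retract every fact a step produced; return how many were dropped."""
        kept = [f for f in self.facts if f.derived_by != step_id]
        dropped = len(self.facts) - len(kept)
        self.facts = kept
        return dropped


store = HarvestStore()
# Two extraction methods are tried on the same page (made-up data).
store.record("extract-v1", "bearing-A", "width", "52 mm")
store.record("extract-v1", "bearing-A", "bore", "25 mm")
store.record("extract-v2", "bearing-A", "width", "15 mm")

# Feedback marks extract-v1 as rubbish: back out of it, keep the rest.
dropped = store.back_out("extract-v1")
```

After the call, only the fact derived by `extract-v2` remains, so the flow can redo the failed step with another method.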

SLIDE 10


PROBABILISTIC THROUGHOUT

JudgeD

Probabilistic DataLog
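
To give a feel for what a probabilistic Datalog-style answer means, the sketch below computes the probability of a derived conclusion by enumerating possible worlds over independent uncertain facts. This is plain Python, not JudgeD syntax, and the facts, the rule, and the probabilities are all made up for illustration.

```python
from itertools import product

# Uncertain facts with independent probabilities (made-up numbers).
facts = {
    "same_product(a, b)": 0.8,
    "width(b, 15mm)": 0.9,
}


def rule(world):
    """width(a, 15mm) holds if a and b are the same product and b has that width."""
    return world["same_product(a, b)"] and world["width(b, 15mm)"]


def probability(query, facts):
    """Sum the weights of all possible worlds in which the query holds."""
    names = list(facts)
    total = 0.0
    for truth in product([True, False], repeat=len(names)):
        world = dict(zip(names, truth))
        weight = 1.0
        for name, holds in world.items():
            weight *= facts[name] if holds else 1.0 - facts[name]
        if query(world):
            total += weight
    return total


p = probability(rule, facts)
```

Only the world in which both facts hold satisfies the rule, so `p` is the product of the two probabilities. Enumeration is exponential in the number of facts and is shown here only to make the semantics concrete.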

SLIDE 11

Goal: Enrich and clean product data
Approach
§ Gather and extract from websites
§ Consolidate data of individual products
Solution
§ Intelligent and flexible architecture for web harvesting
§ Probabilistic approach throughout
Repository
§ https://github.com/utdb/combine
Note: academic code, might explode during use


CONCLUSIONS

SLIDE 12


(Francis Bacon, 1605) (Jorge Luis Borges, 1979)