SLIDE 1

Pattern Markup-Language

A tool for simplifying data extraction from semi-structured sources

Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley

SLIDE 2

Many Sites with Genealogical Data

SLIDE 3

SLIDE 4

SLIDE 5

Structural Patterns

SLIDE 6

SLIDE 7

SLIDE 8

SLIDE 9

SLIDE 10

Programmer Defined Regular Expressions

Regular Expression A

SLIDE 11

Programmer Defined Regular Expressions

Regular Expression B

SLIDE 12

Programmer Defined Regular Expressions

Regular Expression C

SLIDE 13

Which Relationships Found?

Given Name, Birth Date, Death Date, Aliases
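The point of slides 10 to 13 can be illustrated with a minimal sketch: programmer-defined regular expressions find individual facts, but the result is a flat collection that says nothing about how the facts relate. The page fragment, the label text, and the names `field_pattern` and `extract_flat` are assumptions made for illustration; the actual Regular Expressions A, B, and C are not shown in the slides.

```python
import re

# Hypothetical page fragment; real genealogy sites will differ.
SAMPLE_HTML = """
<tr><td class="label">Name</td><td>John Henry Doe</td></tr>
<tr><td class="label">Born</td><td>12 Mar 1850</td></tr>
<tr><td class="label">Died</td><td>4 Jul 1921</td></tr>
"""


def field_pattern(label):
    """Build a regex that captures the cell following a given label cell."""
    return re.compile(
        r'<td class="label">' + re.escape(label) + r"</td>\s*<td>(.*?)</td>",
        re.IGNORECASE | re.DOTALL,
    )


# Stand-ins for "Regular Expression A/B/C"; the labels are assumptions.
PATTERNS = {
    "given_name": field_pattern("Name"),
    "birth_date": field_pattern("Born"),
    "death_date": field_pattern("Died"),
}


def extract_flat(html):
    """Return every match per pattern, with no notion of which person owns it."""
    return {name: rx.findall(html) for name, rx in PATTERNS.items()}


facts = extract_flat(SAMPLE_HTML)
print(facts)
```

With several people on one page, each list would simply grow, and nothing in the output would tie a birth date to a name. That is the gap a schema is meant to close.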

SLIDE 14

Simple Schema Represents Relationships

Person
    Birth Date
    Death Date
    Names
        Given
        Aliases
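A schema like this one can be held in memory as a small tree. The sketch below is one possible representation, assuming the nesting Person, then Birth Date, Death Date, and Names, with Given and Aliases under Names; the class name `SchemaNode` and its `paths` method are hypothetical and not part of PatML.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaNode:
    """One node of the schema tree; children express 'belongs to' relationships."""
    name: str
    children: List["SchemaNode"] = field(default_factory=list)

    def paths(self, prefix=""):
        """Yield the slash-separated path of every node in this subtree."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        yield here
        for child in self.children:
            yield from child.paths(here)


# Assumed reading of the slide's diagram.
schema = SchemaNode("Person", [
    SchemaNode("Birth Date"),
    SchemaNode("Death Date"),
    SchemaNode("Names", [SchemaNode("Given"), SchemaNode("Aliases")]),
])

for path in schema.paths():
    print(path)
```

Each path records which parent a fact belongs to, which is the relationship information that a flat list of regular-expression matches lacks.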

SLIDE 15

Combine Schema and Regular Expressions

Person
    Birth Date
    Death Date
    Names
        Given
        Aliases

Regular Expression A, Regular Expression B, Regular Expression D, Regular Expression C

Tree Represented by XML = PatML
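The idea of a schema tree whose nodes carry regular expressions, serialized as XML, can be sketched as follows. This is not PatML syntax: the slides do not show the actual element or attribute names, so `node`, `name`, and `regex` are invented here, as are the sample page and the functions `build_pattern_tree` and `extract`. The sketch assumes each node's expression is applied only inside the text matched by its parent.

```python
import re
import xml.etree.ElementTree as ET


def build_pattern_tree():
    """Build a schema tree whose nodes carry regexes (hypothetical markup)."""
    person = ET.Element("node", name="Person",
                        regex=r'(?s)<div class="person">(.*?)</div>')
    ET.SubElement(person, "node", name="BirthDate", regex=r"Born:\s*([^<]+)")
    ET.SubElement(person, "node", name="DeathDate", regex=r"Died:\s*([^<]+)")
    names = ET.SubElement(person, "node", name="Names")  # grouping node, no regex
    ET.SubElement(names, "node", name="Given", regex=r"Name:\s*([^<]+)")
    ET.SubElement(names, "node", name="Aliases", regex=r"Alias:\s*([^<]+)")
    return person


def extract(node, text):
    """Apply a node's regex, then run its children inside each match."""
    pattern = node.get("regex")
    scopes = re.findall(pattern, text) if pattern else [text]
    children = list(node)
    if not children:
        return [scope.strip() for scope in scopes]
    return [{c.get("name"): extract(c, scope) for c in children}
            for scope in scopes]


# Hypothetical sample page with two people.
PAGE = """
<div class="person">Name: John Doe<br>Alias: Johnny<br>Alias: J. D.<br>
Born: 12 Mar 1850<br>Died: 4 Jul 1921</div>
<div class="person">Name: Mary Roe<br>Born: 3 Jan 1862<br>Died: 9 Sep 1940</div>
"""

tree = build_pattern_tree()
print(ET.tostring(tree, encoding="unicode"))  # the tree as XML
people = extract(tree, PAGE)
print(people)
```

Because the child expressions run only inside each Person match, every extracted date and alias stays attached to the right person, even when the page lists several people.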

SLIDE 16

Person schema tree (Birth Date, Death Date, Names with Given and Aliases) with Regular Expressions A, B, C, D

SLIDE 17

SLIDE 18

SLIDE 19

SLIDE 20

PatML Generation Tools

Person schema tree (Birth Date, Death Date, Names with Given and Aliases) with Regular Expressions A, B, C, D

Schema Generator

Establishes relationships

SLIDE 21

PatML Generation Tools

Person schema tree (Birth Date, Death Date, Names with Given and Aliases) with Regular Expressions A, B, C, D

PatML Editor

Helps write the regular expressions and establish which facts they match

SLIDE 22

SLIDE 23

Using PatML Editor

• Get your schema file
• Browse for sample page
• Add nodes
• Add expressions
• See the highlights in source
• Adjust
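The "add expressions, see the highlights, adjust" loop can be approximated in a few lines. This is a sketch of the idea only, not the editor's implementation: the function names `find_highlights` and `mark_up`, the `[[name|text]]` marker format, and the sample source are all assumptions.

```python
import re


def find_highlights(source, expressions):
    """Return (start, end, node_name) for group 1 of every match."""
    spans = []
    for node_name, pattern in expressions.items():
        for m in re.finditer(pattern, source):
            spans.append((m.start(1), m.end(1), node_name))
    return sorted(spans)


def mark_up(source, spans):
    """Wrap each highlighted span in [[name|...]] so it stands out in plain text."""
    out, cursor = [], 0
    for start, end, name in spans:
        if start < cursor:  # skip overlapping spans in this simple sketch
            continue
        out.append(source[cursor:start])
        out.append(f"[[{name}|{source[start:end]}]]")
        cursor = end
    out.append(source[cursor:])
    return "".join(out)


# Hypothetical sample page source and node expressions.
source = "<b>Name:</b> John Doe <b>Born:</b> 12 Mar 1850"
expressions = {
    "Given": r"Name:</b>\s*([^<]+?)\s*<",
    "BirthDate": r"Born:</b>\s*(.+)$",
}

spans = find_highlights(source, expressions)
marked = mark_up(source, spans)
print(marked)
```

Seeing exactly which characters an expression captures makes it easier to spot one that matches too much or too little before it is saved.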

SLIDE 24

PatML Editor Interface

• Browser with rendered sample page
• Text area with sample page source
• Tree representing PatML structure

SLIDE 25

SLIDE 26

Fast and Versatile

• Regular sites can be integrated in hours
• Adaptable to any type of information

SLIDE 27

Implementation to Date

• Genesis uses PatML files to search a variety of sites
• Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
• Standardizes information for a common data model
• Simultaneously searches other sites (in different formats) for people with similar information
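Standardizing records from differently formatted sites into a common data model might look roughly like the sketch below. It does not describe how Genesis works internally: the record type `PersonRecord`, the site keys `site_a` and `site_b`, their field names, and the accepted date formats are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PersonRecord:
    """Hypothetical common data model shared by all sources."""
    given_name: str
    birth_date: Optional[str]  # ISO 8601 (YYYY-MM-DD), or None if unparseable
    death_date: Optional[str]
    source: str


# Assumed date layouts; a real deployment would need more.
DATE_FORMATS = ("%d %b %Y", "%Y-%m-%d", "%m/%d/%Y")

# Assumed per-site field names (not the real sites' formats).
FIELD_MAPS = {
    "site_a": {"given_name": "Name", "birth_date": "Born", "death_date": "Died"},
    "site_b": {"given_name": "given", "birth_date": "birth", "death_date": "death"},
}


def normalize_date(raw):
    """Try each known layout and return an ISO date string, else None."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def standardize(site, raw):
    """Map one site's extracted fields onto the common record."""
    fields = FIELD_MAPS[site]
    return PersonRecord(
        given_name=raw.get(fields["given_name"], "").strip(),
        birth_date=normalize_date(raw.get(fields["birth_date"])),
        death_date=normalize_date(raw.get(fields["death_date"])),
        source=site,
    )


a = standardize("site_a", {"Name": " John Doe ", "Born": "12 Mar 1850", "Died": "4 Jul 1921"})
b = standardize("site_b", {"given": "John Doe", "birth": "1850-03-12", "death": "07/04/1921"})
print(a)
print(b)
```

Once both records share one shape and one date format, comparing people across sites reduces to comparing fields.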

SLIDE 28

Results

SLIDE 29

Results

• Produced PatML that correctly extracts data from TNG, RGDS, GedCom sites, and Kansas Gunslingers
• The user interface provides an improved debugging environment
• ~1/10 the coding time with PatML generation tools compared to similarly functioning hand-coded parsers

SLIDE 30

Limitations

• Sites must be recognizable with regular expressions
• Even regular sites have page-to-page HTML variations
• Programmer error with regular expressions
• Regular expression operations can be slow
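The page-to-page variation problem is easy to reproduce. In the sketch below, a pattern written against one page fails on a second page that differs only in tag case, attributes, and whitespace, while a looser pattern covers both. Both pages and both patterns are hypothetical. The patterns are compiled once at module level, which avoids repeating that setup for every page but does not address slow matching itself.

```python
import re

# Written against one sample page: exact tags, no tolerance.
BRITTLE = re.compile(r'<td class="label">Born</td><td>(.*?)</td>')

# Tolerates extra attributes, tag case, an optional colon, and whitespace.
TOLERANT = re.compile(
    r"<td\b[^>]*>\s*Born:?\s*</td>\s*<td\b[^>]*>\s*(.*?)\s*</td>",
    re.IGNORECASE | re.DOTALL,
)

# Two hypothetical pages from the "same" site.
PAGE_1 = '<td class="label">Born</td><td>12 Mar 1850</td>'
PAGE_2 = '<TD class="label" width="20%">Born:</TD>\n  <td align="left"> 3 Jan 1862 </td>'


def first_match(pattern, page):
    """Return the first captured value, or None when the pattern misses."""
    m = pattern.search(page)
    return m.group(1) if m else None


for page in (PAGE_1, PAGE_2):
    print(first_match(BRITTLE, page), "|", first_match(TOLERANT, page))
```

Looser patterns trade one limitation for another: they survive more variation but are harder to read and easier to get wrong.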

SLIDE 31

Future work

• Automatic regular expression generation
• Parsing links to extract data on connected pages
• Use in other applications and fields
• XPath approaches
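As a rough sketch of what an XPath-style approach could look like, the example below addresses fields by their position in the document tree instead of by text patterns, and also collects links that a later step could follow to connected pages. It uses the limited XPath subset in Python's `xml.etree.ElementTree` and assumes the page is well-formed XML, which real HTML often is not. The sample page, the class names, and `extract_people` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """
<html><body>
  <div class="person">
    <span class="name">John Doe</span>
    <span class="born">12 Mar 1850</span>
    <a href="family.html">Family</a>
  </div>
</body></html>
"""

# One path expression per schema node (assumed class names).
XPATHS = {
    "Given": ".//span[@class='name']",
    "BirthDate": ".//span[@class='born']",
}


def extract_people(page):
    """Locate each person block, then resolve field paths relative to it."""
    root = ET.fromstring(page.strip())
    people = []
    for person in root.findall(".//div[@class='person']"):
        record = {name: person.findtext(path) for name, path in XPATHS.items()}
        # Links are only collected here; following them is left out.
        record["links"] = [a.get("href") for a in person.findall(".//a[@href]")]
        people.append(record)
    return people


people = extract_people(PAGE)
print(people)
```

Resolving each field relative to its person element gives the same scoping that nesting provides in the schema tree.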