Pattern Markup-Language
A tool for simplifying data extraction from semi-structured sources
Jonathan Baker, Hilton Campbell
Many Sites with Genealogical Data
Structural Patterns
Programmer Defined Regular Expressions
Regular Expression A
Regular Expression B
Regular Expression C
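As a minimal sketch of what programmer-defined regular expressions might look like, the Python snippet below pulls three facts out of a made-up HTML fragment. The sample markup, the pattern keys, and the person data are all assumptions for illustration; the slides do not show the actual Regular Expressions A, B, and C.

```python
import re

# Hypothetical sample page fragment; real genealogy sites will differ.
SAMPLE_HTML = """
<tr><td class="label">Given Name</td><td>Jane</td></tr>
<tr><td class="label">Born</td><td>1 Jan 1850</td></tr>
<tr><td class="label">Died</td><td>2 Feb 1920</td></tr>
"""

# Hypothetical stand-ins for Regular Expressions A, B, and C.
PATTERNS = {
    "given_name": re.compile(r'<td class="label">Given Name</td><td>([^<]+)</td>'),
    "birth_date": re.compile(r'<td class="label">Born</td><td>([^<]+)</td>'),
    "death_date": re.compile(r'<td class="label">Died</td><td>([^<]+)</td>'),
}

def extract_facts(html):
    """Return the first capture of each pattern, or None when it does not match."""
    facts = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(html)
        facts[field] = match.group(1).strip() if match else None
    return facts

print(extract_facts(SAMPLE_HTML))
```

Each expression finds one fact on its own; nothing here says how the facts relate to each other, which is the gap the schema fills.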
Which Relationships Found?
Given Name, Birth Date, Death Date, Aliases
Simple Schema Represents Relationships
Person
  Birth Date
  Death Date
  Names
    Given
    Aliases
Combine Schema and Regular Expressions
Person: Birth Date, Death Date, Names (Given, Aliases)
Regular Expressions A, B, C, and D attached to the schema nodes
Tree Represented by XML = PatML
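The idea of a schema tree that carries regular expressions and is written as XML can be sketched as follows. The element names, the `regex` attribute, and the sample page are illustrative assumptions only; the slides do not show the real PatML syntax, so this should not be read as valid PatML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical PatML-like document. "<" inside an attribute must be
# escaped as &lt; in XML; the parser turns it back into "<".
PATML = r"""
<Person>
  <BirthDate regex="Born:\s*([^&lt;]+)"/>
  <DeathDate regex="Died:\s*([^&lt;]+)"/>
  <Names>
    <Given regex="Given name:\s*([^&lt;]+)"/>
    <Aliases regex="Also known as:\s*([^&lt;]+)"/>
  </Names>
</Person>
"""

# Made-up sample page source.
SAMPLE = (
    "<p>Given name: Jane</p><p>Also known as: Jenny</p>"
    "<p>Born: 1 Jan 1850</p><p>Died: 2 Feb 1920</p>"
)

def apply_node(node, source):
    """Leaves with a regex yield a captured value; inner nodes yield a dict."""
    regex = node.get("regex")
    if regex is not None:
        match = re.search(regex, source)
        return match.group(1).strip() if match else None
    return {child.tag: apply_node(child, source) for child in node}

root = ET.fromstring(PATML.strip())
record = {root.tag: apply_node(root, SAMPLE)}
print(record)
```

The nesting of the output mirrors the nesting of the XML, so the relationships come from the schema while the values come from the expressions.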
PatML Generation Tools
Schema Generator: establishes relationships
PatML Editor: helps write the regular expressions and establish which facts they match
Using PatML Editor
Get your schema file
Browse for sample page
Add nodes
Add expressions
See the highlights in source
Adjust
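The "see the highlights in source" step can be approximated in a few lines. This is only a rough sketch of the idea, not the editor's implementation: the `highlight` function and its marker strings are hypothetical, and it assumes the pattern has one capture group that always participates in a match.

```python
import re

def highlight(source, pattern, open_mark="[[", close_mark="]]"):
    """Wrap the first capture group of every match in visible markers."""
    out = []
    last = 0
    for match in re.finditer(pattern, source):
        start, end = match.span(1)
        out.append(source[last:start])
        out.append(open_mark + source[start:end] + close_mark)
        last = end
    out.append(source[last:])
    return "".join(out)

# Made-up sample page source and expression.
source = "<td>Born</td><td>1 Jan 1850</td>"
pattern = r"<td>Born</td><td>([^<]+)</td>"
print(highlight(source, pattern))
```

Seeing exactly which characters an expression captured makes the "adjust" step a matter of editing the pattern and looking again.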
PatML Editor Interface
Browser with rendered sample page
Text area with sample page source
Tree representing PatML structure
Fast and Versatile
Regular sites can be integrated in hours
Adaptable to any type of information
Implementation to Date
Genesis uses PatML files to search a variety of sites
Searches TNG, Retrospect-GDS, Family Search, GedCom and Kansas Gunslingers
Standardizes information for a common data model
Simultaneously searches other sites (in different formats) for people with similar information
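Standardizing records from differently formatted sites into a common data model might look roughly like the sketch below. The `Person` class, the site keys `site_a` and `site_b`, and their field names are hypothetical; they do not describe how Genesis or any of the listed sites actually represent data.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    """Hypothetical common data model for one extracted person."""
    given_name: str = ""
    birth_date: str = ""
    death_date: str = ""
    aliases: list = field(default_factory=list)

# Hypothetical per-site mapping from raw field names to the common model.
FIELD_MAPS = {
    "site_a": {"Given": "given_name", "Born": "birth_date", "Died": "death_date"},
    "site_b": {"first_name": "given_name", "birth": "birth_date", "death": "death_date"},
}

def standardize(site, raw):
    """Rename known raw fields to the common names and drop the rest."""
    mapping = FIELD_MAPS[site]
    kwargs = {mapping[key]: value.strip() for key, value in raw.items() if key in mapping}
    return Person(**kwargs)

a = standardize("site_a", {"Given": "Jane ", "Born": "1 Jan 1850"})
b = standardize("site_b", {"first_name": "Jane", "birth": "1 Jan 1850", "unused": "x"})
print(a == b)
```

Once every site's output has the same shape, comparing people across sites reduces to comparing ordinary records.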
Results
Produced PatML that correctly extracts data from TNG, RGDS, GedCom Sites, and Kansas Gunslingers
User interface allows for an improved debugging environment
~1/10 coding time with PatML generation tools compared to similarly functioning hand-coded parsers
Limitations
Sites must be recognizable with regular expressions
Even regular sites have page-to-page HTML variations
Programmer error with regular expressions
Regular expression operations can be slow
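To make the page-to-page variation problem concrete, the sketch below compares a strict pattern with a looser one on two made-up versions of the same table cell. Both patterns and both page fragments are assumptions for illustration. Compiling the patterns once, as done here, avoids recompiling them for every page, although it does not remove the cost of matching itself.

```python
import re

# Strict pattern: breaks when attributes, case, or whitespace change.
STRICT = re.compile(r"<td>Born</td><td>([^<]+)</td>")

# Looser pattern: tolerates attributes, letter case, and extra whitespace.
TOLERANT = re.compile(
    r"<td[^>]*>\s*Born\s*</td>\s*<td[^>]*>\s*([^<]+?)\s*</td>",
    re.IGNORECASE,
)

# Two hypothetical renderings of the same fact.
PAGE_1 = "<td>Born</td><td>1 Jan 1850</td>"
PAGE_2 = '<TD class="lbl"> Born </TD>\n<TD class="val"> 1 Jan 1850 </TD>'

def born(page, pattern):
    """Return the captured birth date, or None when the pattern fails."""
    match = pattern.search(page)
    return match.group(1) if match else None

for page in (PAGE_1, PAGE_2):
    print(born(page, STRICT), born(page, TOLERANT))
```

The looser pattern survives the second page, but it is also harder to read, which is one way programmer error creeps in.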
Future work
Automatic regular expression generation
Parsing links to extract data on connected pages
Use in other applications and fields
XPath approaches
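For comparison, an XPath-style lookup could be sketched with Python's standard library as below. This is an assumption about what such an approach might look like, not something the slides describe. Note that `xml.etree.ElementTree` supports only a subset of XPath and requires well-formed markup, which real HTML pages often are not; the sample table and the `value_for` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample fragment.
XHTML = """
<table>
  <tr><td class="label">Born</td><td class="value">1 Jan 1850</td></tr>
  <tr><td class="label">Died</td><td class="value">2 Feb 1920</td></tr>
</table>
"""

root = ET.fromstring(XHTML.strip())

def value_for(root, label):
    """Find the row containing a cell whose text is `label`, return its value cell.

    The label is interpolated directly, so it must not contain quotes.
    """
    cell = root.find(f".//tr[td='{label}']/td[@class='value']")
    return cell.text if cell is not None else None

print(value_for(root, "Born"))
print(value_for(root, "Died"))
```

A path expression addresses the document's structure rather than its character sequence, so whitespace and attribute-order changes that break a regular expression do not affect it.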