SLIDE 1

Analysis of library metadata with Metafacture

Christoph Böhme

SWIB 2013 | 25 November 2013

SLIDE 2

Agenda

  • 13:00  a short introduction to Metafacture
  • 13:30  warm-up exercises
  • 14:30  triples and counting
  • 15:00  exercises on counting data (incl. 30 min coffee break at 15:30)
  • 17:00  joining data sets and analysing them
  • 17:30  exercises on joining data
  • 18:50  wrapping up

SLIDE 3

Part 1: A short introduction to Metafacture

SLIDE 4

Overview of Metafacture

  • Stream modules: building blocks for processing flows
  • Metamorph: stream module with a DSL* for metadata transformation
  • Flux: DSL* for constructing processing flows

*DSL: domain-specific language

SLIDE 5

The basic building block of Metafacture

Stream module

Receives typed input:

  • strings
  • triples
  • objects
  • metadata events

Sends typed output:

  • strings
  • triples
  • objects
  • metadata events

Processes input to create some output. Modules usually perform rather small tasks to foster reusability.

SLIDE 6

A simple processing flow

Read and print a file containing pica records:

  • open-file: receives a file name, sends a file handle
  • as-lines: receives a file handle, sends a string for each line
  • decode-pica: receives a string, sends metadata events
  • encode-formeta: receives metadata events, sends a string
  • write("stdout"): receives a string, sends nothing

The flow starts with a string (the file name) as its initial input.

SLIDE 7

Module configuration

A stream module is configured with

  • either a single mandatory value
  • or optional key-value pairs

SLIDE 8

Describing flows with Flux

    "file.name"
    |open-file
    |as-lines
    |decode-pica
    |encode-formeta(style="multiline")
    |write("stdout");

  • "file.name": a string as the initial input
  • |: modules are connected with a pipe character
  • (style="multiline"): key-value based configuration
  • ("stdout"): mandatory parameter
  • ;: the flow ends with a semicolon

SLIDE 9

Variables and comments in Flux

    default in = "file.name";
    default out = "stdout";

    in
    |open-file
    // ...
    |write(out);

  • Define default values for the variables in and out
  • Use a variable instead of directly entering a string
  • Comments start with two slashes

SLIDE 10

Running Flux scripts

  • The Flux script must be selected in the IDE
  • Choose “Run with Flux” to execute the selected Flux script
  • “Flux Help” outputs a list of all supported modules

SLIDE 11

Representation of metadata in Metafacture: a stream of events

Pica record:

    003@ $0 2809
    033A $n Publisher $p Location

decode-pica turns the record into a sequence of metadata events:

  • Start record 2809
  • Start entity 003@
  • Literal 0: 2809
  • End entity
  • Start entity 033A
  • Literal n: Publisher
  • Literal p: Location
  • End entity
  • End record
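
The event sequence on this slide can be modelled in a few lines of Python. This is only an illustrative sketch, not Metafacture's implementation: the tuple-based event representation and the function name record_to_events are assumptions made for the example, and the record is passed in already parsed rather than decoded from Pica.

```python
def record_to_events(record_id, fields):
    """Yield metadata events for one record (hypothetical event model).

    `fields` is a list of (entity_name, [(literal_name, value), ...]) pairs.
    """
    yield ("start-record", record_id)
    for entity_name, literals in fields:
        yield ("start-entity", entity_name)
        for name, value in literals:
            yield ("literal", name, value)
        yield ("end-entity",)
    yield ("end-record",)


# The Pica record from the slide: 003@ $0 2809 and 033A $n Publisher $p Location
events = list(record_to_events("2809", [
    ("003@", [("0", "2809")]),
    ("033A", [("n", "Publisher"), ("p", "Location")]),
]))

for event in events:
    print(event)
```

The printed tuples follow the same order as the nine events listed on the slide.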

SLIDE 12

Processing metadata events with Metamorph

Input events:

  • Start record id
  • Start entity 021A
  • Literal a: The Trial
  • End entity
  • End record

morph (listen for 021A.a, output as Title)

Output events:

  • Start record id
  • Literal Title: The Trial
  • End record

SLIDE 13

Metamorph: data statements

    <?xml version="1.0" encoding="UTF-8"?>
    <metamorph xmlns="http://www.culturegraph.org/metamorph"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        version="1" entityMarker=".">
      <rules>
        <data source="021A.a" name="Title" />
      </rules>
    </metamorph>

  • entityMarker: separator for entities and literal names
  • source: name of the literal to listen for
  • name: name of the literal that is output
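
To make the effect of a data statement concrete, the following Python sketch filters and renames literals in an event stream. It is a simplified model under stated assumptions, not Metamorph itself: events are plain tuples, apply_data_rules is a hypothetical helper, and rules are given as a dictionary from source path to output name.

```python
def apply_data_rules(events, rules, entity_marker="."):
    """Keep only literals whose path is listed in `rules` and rename them.

    `rules` maps a source path such as "021A.a" to an output literal name.
    """
    path = []  # names of the entities that are currently open
    for event in events:
        kind = event[0]
        if kind == "start-entity":
            path.append(event[1])
        elif kind == "end-entity":
            path.pop()
        elif kind == "literal":
            # Join entity names and literal name with the entity marker
            source = entity_marker.join(path + [event[1]])
            if source in rules:
                yield ("literal", rules[source], event[2])
        else:
            # start-record and end-record pass through unchanged
            yield event


record = [
    ("start-record", "id"),
    ("start-entity", "021A"),
    ("literal", "a", "The Trial"),
    ("end-entity",),
    ("end-record",),
]
result = list(apply_data_rules(record, {"021A.a": "Title"}))
print(result)
```

The result contains the record start, a single top-level literal Title with the value The Trial, and the record end, which matches the output events shown for the morph module.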

SLIDE 14

Metamorph: modifying data

    ...
    <rules>
      <data source="021A.a" name="Title">
        <regexp match="^(The) (.*)$" format="${2}, ${1}" />
      </data>
    </rules>
    ...

Process the data value before outputting it. You can specify multiple functions here.
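
What the regexp function in this example does to a value can be tried out with Python's re module. This is a sketch of the idea only: apply_regexp is a hypothetical helper, and returning None for non-matching values is an assumption (it mirrors the way the choose example later in the deck falls through to its second data statement).

```python
import re


def apply_regexp(value, match, fmt):
    """Return `fmt` with ${n} replaced by capture group n, or None if no match."""
    m = re.search(match, value)
    if m is None:
        return None  # assumption: no output for non-matching values
    # Replace every ${n} placeholder with the text of capture group n
    return re.sub(r"\$\{(\d+)\}", lambda ref: m.group(int(ref.group(1))), fmt)


print(apply_regexp("The Trial", r"^(The) (.*)$", "${2}, ${1}"))  # Trial, The
```

With the pattern from the slide, the leading article is moved behind the rest of the title.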

SLIDE 15

Metamorph: combining data

    ...
    <rules>
      <combine name="Publisher" value="${Pub}: ${Loc}">
        <data source="033A.n" name="Pub" />
        <data source="033A.p" name="Loc" />
      </combine>
    </rules>
    ...

  • name: name of the generated literal. It can include variables, too
  • value: literal value constructed from the variables from the data statements below
  • data: the data statements do not generate output but create variables instead
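
The combine statement can be pictured as template substitution over named variables. The Python sketch below uses string.Template, whose ${name} syntax happens to match the placeholders on the slide. The helper combine and the flat dictionary of literals are assumptions for illustration, and so is the choice to emit nothing when a variable is missing.

```python
from string import Template


def combine(literals, sources, name, value):
    """Build one (name, value) literal from several source literals.

    `literals` maps source paths to values; `sources` maps source paths
    to variable names. Returns None if any variable is missing (assumption).
    """
    variables = {}
    for source, variable in sources.items():
        if source not in literals:
            return None
        variables[variable] = literals[source]
    return (Template(name).substitute(variables),
            Template(value).substitute(variables))


literals = {"033A.n": "Publisher", "033A.p": "Location"}
result = combine(
    literals,
    {"033A.n": "Pub", "033A.p": "Loc"},
    name="Publisher",
    value="${Pub}: ${Loc}",
)
print(result)  # ('Publisher', 'Publisher: Location')
```

The two data statements only fill the variables Pub and Loc; the single output literal is produced by the combine step.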

SLIDE 16

Exercises part 1: Warm-up

SLIDE 17

Part 2: Triples and counting

SLIDE 18

The triple

Triple: Subject, Predicate, Object

Inspired by RDF triples, but subject and predicate do not need to be URIs.

SLIDE 19

Generating triples

stream-to-triples converts metadata events into triples.

Metadata events:

  • Start record id
  • Literal name: Klaus
  • Start entity died
  • Literal when: 1401
  • Literal where: HH
  • End entity
  • End record

Triples:

  • record-id name Klaus (literals on top level)
  • record-id died … (entities on top level, serialised with Formeta)

SLIDE 20

Counting triples

count-triples(countBy="object")

[Figure: the input triples are counted by their object value; the output triples have the predicate "count" and the objects 4, 2 and 3.]

SLIDE 21

Outputting triples

template("${o} times ${s}")

Use ${s}, ${p} and ${o} as placeholders for subject, predicate and object.

  • Input triple: red count 4
  • Output: “4 times red”

SLIDE 22

Counting data values

  • morph: extracts data items that should be counted and outputs them as top level literals
  • stream-to-triples: converts the literals to triples
  • count-triples: counts the different object values of the triples
  • template: optionally converts triples into formatted text

SLIDE 23

Counting data values: flow of data

Input events:

  • Start record id
  • Start entity 033A
  • Literal p: Hamburg
  • End entity
  • End record

After Morph:

  • Start record id
  • Literal loc: Hamburg
  • End record

As a triple: id loc Hamburg

After Count: Hamburg count 1
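
The counting flow (top-level literals, then triples, then counts, then formatted text) can be imitated in Python to see what each step contributes. This is a rough model, not the Metafacture modules: the function names merely echo the Flux module names, only top-level literals are handled, and emitting the counts in sorted order is an assumption made so that the output is deterministic.

```python
from collections import Counter
from string import Template


def stream_to_triples(records):
    """Turn (record_id, [(name, value), ...]) records into triples."""
    for record_id, literals in records:
        for name, value in literals:
            yield (record_id, name, value)


def count_triples_by_object(triples):
    """Emit one (object, "count", n) triple per distinct object value."""
    counts = Counter(obj for _subject, _predicate, obj in triples)
    for obj in sorted(counts):
        yield (obj, "count", counts[obj])


def template(triples, pattern):
    """Format each triple using ${s}, ${p} and ${o} placeholders."""
    for s, p, o in triples:
        yield Template(pattern).substitute(s=s, p=p, o=o)


# Records as they might look after the morph step (values are made up)
records = [
    ("1", [("loc", "Hamburg")]),
    ("2", [("loc", "Frankfurt a. M.")]),
    ("3", [("loc", "Hamburg")]),
]
lines = list(template(
    count_triples_by_object(stream_to_triples(records)),
    "${o} times ${s}",
))
print(lines)  # ['1 times Frankfurt a. M.', '2 times Hamburg']
```

Note how the counted object value becomes the subject of the count triple, which is why the template uses ${s} for the value and ${o} for the number.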

SLIDE 24

Metamorph: choosing data

    ...
    <rules>
      <choose name="Location">
        <data source="033A.p">
          <regexp match="^Ffm$" format="Frankfurt a. M." />
        </data>
        <data source="033A.p" />
      </choose>
    </rules>
    ...

Only the value of the topmost data-statement that generates output is returned by the choose-statement.
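
The choose semantics described here, first data statement with output wins, can be sketched as a list of candidate functions tried in order. The Python below is an illustration under assumptions: choose_location is a hypothetical helper, and a candidate signals "no output" by returning None.

```python
import re


def choose_location(value):
    """Return the output of the topmost candidate that produces a value."""
    candidates = [
        # First data statement: expands the abbreviation, otherwise no output
        lambda v: "Frankfurt a. M." if re.search(r"^Ffm$", v) else None,
        # Second data statement: passes the value through unchanged
        lambda v: v,
    ]
    for candidate in candidates:
        result = candidate(value)
        if result is not None:
            return result
    return None


print(choose_location("Ffm"))      # Frankfurt a. M.
print(choose_location("Hamburg"))  # Hamburg
```

The second candidate acts as a fallback, so every location is output and only the abbreviation is rewritten.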

SLIDE 25

Metamorph: generating constant values

    ...
    <rules>
      <data source="021A.a" name="Title">
        <constant value="All books have the same name" />
      </data>
    </rules>
    ...

No matter what the value of literal 021A.a is, always output the defined value.

SLIDE 26

Exercises part 2: Triples and counting

SLIDE 27

Part 3: Joining data sets and analysing them

SLIDE 28

Joining streams of data

How is this done?

SLIDE 29

Converting triples into records

collect-triples converts triples into metadata events.

Triples:

  • 2 PA OA
  • 2 PB OB
  • 1 PC OC
  • 1 PD OD
  • 2 PE OE
  • 2 PF …

Metadata events:

  • record 2: PA: OA, PB: OB
  • record 1: PC: OC, PD: OD
  • record 2: PE: OE, PF { … }

Sequences of triples with the same subject are merged. Serialised entities are deserialised into entities.

SLIDE 30

Sorting triples

sort-triples

[Figure: a stream of triples before and after sort-triples.]
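
Why sorting matters before collecting can be shown with itertools.groupby, which, like the collecting step described in this part, only merges neighbouring items. The sketch below is a Python model under assumptions: the function names echo the Flux modules, sorting is done by subject only, and the triples follow the order implied by the records on the "Converting triples into records" slide (without the elided PF triple).

```python
from itertools import groupby
from operator import itemgetter


def collect_triples(triples):
    """Merge each run of consecutive triples sharing a subject into a record."""
    for subject, group in groupby(triples, key=itemgetter(0)):
        yield (subject, [(p, o) for _s, p, o in group])


def sort_triples(triples):
    """Sort triples by subject; Python's sort keeps the order within a subject."""
    return sorted(triples, key=itemgetter(0))


triples = [
    ("2", "PA", "OA"),
    ("2", "PB", "OB"),
    ("1", "PC", "OC"),
    ("1", "PD", "OD"),
    ("2", "PE", "OE"),
]

unsorted_records = list(collect_triples(triples))
sorted_records = list(collect_triples(sort_triples(triples)))

print(len(unsorted_records))  # 3: subject 2 appears as two separate records
print(len(sorted_records))    # 2: one record per subject
```

Without sorting, subject 2 is split into two records because its triples are interrupted by those of subject 1; after sorting, each subject forms one sequence and yields one record.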

SLIDE 31

Linking streams in Flux with wormholes

    "file1"
    |open-file
    // ...
    |stream-to-triples
    |@X;

    "file2"
    |open-file
    // ...
    |stream-to-triples
    |@X;

    @X
    |wait-for-inputs("2")
    |sort-triples
    |collect-triples
    |encode-formeta
    |write("stdout");

  • |@X; sends the triples into a “wormhole”
  • @X at the start of a flow receives triples from a “wormhole”
  • These three flows must be defined in the same Flux script

SLIDE 32

Advanced triplification: ID redirection

stream-to-triples(redirect="true")

Metadata events:

  • Start record id
  • Literal _id: new id
  • Literal name: Klaus
  • Literal {to:id}n: v
  • End record

Triples:

  • new id name Klaus (instead of: record id name Klaus)
  • id n v

  • _id: replaces the record id in subjects
  • {to:id}: sets the subject for individual triples

SLIDE 33

Using _id-redirection

Data sets:

  • id: gnd1 …, id: gnd2 …, id: gnd3 …
  • id: HH gnd: 3 …, id: Ffm gnd: 1 …, id: Ks gnd: 2 …

Input events:

  • Start record HH
  • Literal gnd: 3
  • Literal country: D
  • End record

After Morph:

  • Start record HH
  • Literal _id: gnd3
  • Literal country: D
  • Literal wiki-id: HH
  • End record

Triples (the subject gnd3 replaces the record id HH):

  • gnd3 country D
  • gnd3 wiki-id HH

SLIDE 34

Using {to:ID}-redirection

Finding the backlinks: Who links to this record?

Data sets:

  • id: Ffm …, id: Ks …, id: HH …
  • id: gnd1 loc: HH …, id: gnd2 loc: Ffm …, id: gnd3 loc: HH …

Input events:

  • Start record gnd1
  • Literal loc: HH
  • End record

After Morph:

  • Start record gnd1
  • Literal {to:HH}ref: gnd1
  • End record

Triple: HH ref gnd1
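
Both redirection mechanisms can be modelled in one small Python function to check how the subjects come out. This is a sketch of the behaviour shown on the slides, not the module's code: records are simplified to a record id plus top-level literals, and letting a _id literal apply to the whole record regardless of its position is an assumption.

```python
import re

# Matches literal names of the form {to:SUBJECT}name
TO_PREFIX = re.compile(r"^\{to:([^}]*)\}(.*)$")


def stream_to_triples(record_id, literals, redirect=False):
    """Convert top-level literals into (subject, predicate, object) triples."""
    literals = list(literals)
    subject = record_id
    if redirect:
        for name, value in literals:
            if name == "_id":
                subject = value  # _id replaces the record id in subjects
    triples = []
    for name, value in literals:
        if redirect and name == "_id":
            continue  # the _id literal itself produces no triple
        target = TO_PREFIX.match(name) if redirect else None
        if target:
            # {to:X}name sets the subject of this one triple to X
            triples.append((target.group(1), target.group(2), value))
        else:
            triples.append((subject, name, value))
    return triples


wiki = stream_to_triples(
    "HH",
    [("_id", "gnd3"), ("country", "D"), ("wiki-id", "HH")],
    redirect=True,
)
backlink = stream_to_triples("gnd1", [("{to:HH}ref", "gnd1")], redirect=True)

print(wiki)      # [('gnd3', 'country', 'D'), ('gnd3', 'wiki-id', 'HH')]
print(backlink)  # [('HH', 'ref', 'gnd1')]
```

The first call reproduces the _id example (triples filed under gnd3), the second the backlink example (the triple is filed under HH, the record being linked to).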

SLIDE 35

Putting the pieces together

Two flows, one per data set, feed into a common flow:

  • morph: prepares records to generate the right subject values
  • stream-to-triples: conversion with id redirection
  • @: joins the two streams of triples using a “wormhole”
  • wait-for-inputs("2"): waits for all flows writing triples into the “wormhole”
  • sort-triples: ensures that triples with the same subject form a sequence
  • collect-triples: turns the triples back into metadata events

SLIDE 36

Metamorph: what else?

    ...
    <rules>
      <data source="_id" name="recordId" />
      <data source="033A.n" name="Publisher" />
      <data source="_else" />
    </rules>
    ...

  • _id: Metamorph makes the record id of the current record available as _id
  • _else: any literal not handled by any other data-statement is passed to this statement. It can be used to pass data through Metamorph

SLIDE 37

Exercises part 3: Joining data sets and analysing them

SLIDE 38

Wrapping up

SLIDE 39

What did we learn today?

  • Foundations of processing metadata with Flux and Metamorph
  • Exploring data sets by quantifying data values
  • Joining data sets and analysing their relations
  • Typical patterns for analysing data with Metafacture

These patterns are similar to the way Hadoop operates. This makes migration from your desktop to a Hadoop cluster easy.

SLIDE 40

Metafacture

  • Not only designed for data analysis but for metadata processing in general
  • Software tool and library: it can easily be integrated into other applications
  • Flux and Metamorph are extendable
  • It is open source at http://culturegraph.github.io/

SLIDE 41

Job advert

We are looking for a software developer for our solr-based search engine infrastructure.

For more information please visit: http://www.dnb.de/stellen

SLIDE 42

Thank you very much!

Further questions? Contact me at c.boehme@dnb.de or join the mailing list: http://lists.dnb.de/mailman/listinfo/metafacture