Big Data for Libraries, Kia Siang Hock (kia_siang_hock@nlb.gov.sg), PowerPoint Presentation

SLIDE 1

1/90

SLIDE 2

2/90

Workshop 1 Big Data for Libraries

Kia Siang Hock

kia_siang_hock@nlb.gov.sg

SLIDE 3

3/90

The Workshop Programme

14:00 Welcome
14:10 About National Library Board, Singapore
14:20 What is Big Data?
14:40 Big Data in Libraries
15:15 Break
15:45 Examples of Big Data Implementations: Recommendations, Text Analytics, Ngram Viewer, Named Entity Extraction, Image Matching
16:45 More Q&A
17:00 End of workshop

SLIDE 4

4/90

About the National Library Board, Singapore

SLIDE 5

5/90

Libraries & Archives

1 National Library 1 National Archives 26 Public Libraries

About National Library Board, Singapore

SLIDE 6

6/90

Vision

Readers for Life, Learning Communities, Knowledgeable Nation

Mission

We make knowledge come alive, spark imagination and create possibilities.

About National Library Board, Singapore

SLIDE 7

7/90

About National Library Board, Singapore

The Public Library seeks to be a social learning space that nurtures active readers and knowledge seekers, through the provision of relevant, timely and engaging library services and reading programmes, using physical and digital means.

Public Library Services

SLIDE 8

8/90

About National Library Board, Singapore

Only library in Singapore that comprehensively collects content published and distributed in the country, for preservation and long-term access
Enables easy access to the country’s Shared Memory to build rootedness and national identity
Forges international collaborations and advises on library development

National Library

SLIDE 9

9/90

About National Library Board, Singapore

The National Archives of Singapore (NAS) is the official custodian of Singapore’s collective memory. Ranging from government files, private memoirs, historical maps and photographs to oral history interviews and audio-visual materials, the NAS is responsible for the collection, preservation and management of Singapore's public and private archival records.

The Asian Film Archive was founded to preserve the rich film heritage of Singapore and Asian Cinema, to encourage scholarly research on film, and to promote a wider critical appreciation of this art form.

SLIDE 10

10/90

A typical day in Singapore libraries…

79,000 people visit libraries
300 new members join the library
100,000 loans are made
27,000 people attend library programs and exhibitions

SLIDE 11

11/90

About National Library Board, Singapore

Libraries & Archives

1 National Library 1 National Archives 26 Public Libraries

Membership

More than 2m members

Visits

More than 27m visits

Collection

More than 1m titles More than 8.5m items

Loans

More than 35m loans

Online Usage

Digital User Visits: > 11m e-Retrievals: > 70m

FY2013 figures

SLIDE 12

12/90

What is Big Data?

SLIDE 13

13/90

What is Big Data?

“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”

Oxford English Dictionary

“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”

Wikipedia

“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

McKinsey

“The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “… things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value.”

Viktor Mayer-Schönberger & Kenneth Cukier

“The broad range of new and massive data types that have appeared over the last decade or so.”

Tom Davenport

Source: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/

SLIDE 14

14/90

What is Big Data?

Source: http://datascience.berkeley.edu/what-is-big-data/

Top recurrent themes in the definitions of Big Data by 40 thought leaders

SLIDE 15

15/90

The Four V’s of Big Data

Source: http://www.ibmbigdatahub.com/infographic/four-vs-big-data

SLIDE 16

16/90

The Fifth V: Values

Big is relative.

Five broad ways in which using Big Data can create value

Source: Big data: The next frontier for innovation, competition, and productivity (McKinsey) http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

❶ Unlock significant value by making information transparent and usable at much higher frequency
❷ Collects more accurate and detailed performance information
❸ Allows ever-narrower segmentation of customers
❹ Sophisticated analytics can substantially improve decision-making
❺ Improves the development of the next generation of products and services

SLIDE 17

17/90

IDA Infocomm Technology Roadmap

Opportunities

> Analysis of unstructured data such as images and audio on top of text data to unearth insights from a bigger data pool
> Insights from the data analytics outcomes to augment decision-making processes
> Analytics (retrospective to predictive) to proactively identify opportunities or tackle problems

Challenges

> Understanding and framing Big Data problems
> Maturity in some of the underlying analytics algorithms
> Shortage of data analytics talent

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

‘Big Data’ is a key technology theme that will shape the ICT landscape

SLIDE 18

18/90

Technology Stack Radar

Time horizons shown on the radar: < 3 years; 3-5 years

Hadoop MapReduce & distributed file system
NoSQL DBMS
Text Analytics
Visualisation-based discovery
In-memory analytics
Audio analytics
Predictive analytics
Master data management
SaaS-based business analytics

Complex event processing
Data-federation/visualisation
Video analytics
Mobile business analytics
Non-volatile memory

Source: IDA’s Public Consultation on Infocomm Technology Roadmap 2012, 17 Aug 2012 http://www.ida.gov.sg/Technology/20060417212727.aspx

IDA Infocomm Technology Roadmap

‘Big Data’ is a key technology theme that will shape the ICT landscape

SLIDE 19

19/90

Big Data for Libraries

SLIDE 20

20/90

Disclaimers

Not a comprehensive study of the use of big data in libraries.
A practitioner's high-level overview of the use of big data in libraries.
Does not cover big data issues including data management, privacy and ownership.

SLIDE 21

21/90

Big Data Goals

Leverages NLB’s unique data assets for Actionable Insights

Better foresights for future libraries planning
Customer satisfaction improvements with better service offerings
Better usage of NLB services and resources
Unearthing the hidden treasures
Productivity gain with better decisions

Structured & Unstructured Data: Patrons, Books, Loans, Visits, Newspapers, DVDs, VCDs, E-databases, E-Books, Digitised newspapers, Digitised books, Demographics, Locations, Events, Browse Count, Facebook pages, Blogs, Tweets

SLIDE 22

22/90

Big Data for Libraries

Library Planning Patron Profiling Collection Optimisation Business Operations Digital Library Service Delivery

SLIDE 23

23/90

Big Data for Libraries

Library Planning using Geospatial Analytics

Where are our users?
What do they read?
Are our libraries serving the residents in the vicinity?
Where shall we target our outreach campaign?
What is the impact on the usage of existing libraries when a new library opens?
Can our libraries cope with the population growth?

SLIDE 24

24/90

Big Data for Libraries

Patron Profiling & Footfall Analysis to Optimise Use of Library Space

Crowd Density Audience Profiling Human Traffic Flow

Source: Video Analytics as a Service http://vaaas.kaisquare.com/

SLIDE 25

25/90

Big Data for Libraries

Measuring & Analysing Energy Consumption using Smart Meters

SLIDE 26

26/90

Big Data for Libraries

Collection Optimisation – Collection Planning

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf

SLIDE 27

27/90

Big Data for Libraries

Collection Optimisation – Collection Planning

Collection Planning Model

Forecast of usage, Cost of books, Shelf space, Initial collection, Available budget, Min/max collection size

Planned Budget, Planned Acquisition, Planned Weeding, Planned Space, Projected Loans, Planned Final Collection

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf

SLIDE 28

28/90

Big Data for Libraries

Collection Optimisation – Demand Forecast

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf

SLIDE 29

29/90

Big Data for Libraries

Business Operations (Corporate KPIs, Finance, HR)

Source: http://www.ifla.org/files/assets/hq/news/documents/nlb-collection-management-e-newsletter-april-2013.pdf

SLIDE 30

30/90

Big Data for Libraries

Library Analytics Toolkit

Source: https://osc.hul.harvard.edu/liblab/projects/library-analytics-toolkit

The Library Analytics Toolkit is a dashboard that pulls library data together in a way that allows both librarians and library users to identify and respond to trends and changes in collections, usage, and other data

SLIDE 31

31/90

Big Data for Libraries

Integrated & Operational Analytics

~20% of items are borrowed within 3 days of their return

Auto-sorter

Just Return Bin

Patrons can easily access to

Big Data for Libraries

Digital Library - Curation

For staff to generate pathfinders
An easier publishing tool for staff curation of content

Crowdsensing of user interests interfaces with NLB content and pushes recommended content back to patron.

Find Curate Publish

Analyse search keywords at NLB websites

User Interests

Analyse search results & click-throughs

Collection Gaps

Building relationships between entities

Relationships

Analyse social media & relevant websites

Big Data for Libraries

Digital Library – Contextual Discovery

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in Esplanade Park, the Lim Bo Seng Memorial and the Tan Kim Seng Fountain…

Gwee Peng Kwee (Oral History): His daily routine school… Laying of foundation stone and unveiling of Cenotaph…
Dalhousie Obelisk (Article): Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall 'needle-like' monument...
Lim Bo Seng (Article): Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June 1944, Perak, Malaya) was a prominent ...
Master Plan for Singapore - Central Area (1958)

Newspaper articles:
Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)
Lest we forget (8 Nov 1953)
Singapore students learn to care about history (13 Jul 1997)
Arrival of the Prince (31 Mar 1922)
Singapore’s War Memorial (21 Sep 1921)

SLIDE 34

34/90

Big Data for Libraries

Digital Library – Content Analytics to search the ‘Un-searchable’

Image/video search
Voice-to-text
Named entity recognition: Buildings, People, Streets, Dates, Organisations
Cross-language discovery: welcome, 欢迎, Selamat datang, நல்வரவு

SLIDE 35

35/90

Examples of Big Data Implementations

SLIDE 36

36/90

Recommending Good Reads to Patrons

Source: http://www.amazon.com/

SLIDE 37

37/90

Patron also borrowed these titles…

> 33m loans a year > 2m patrons Recommendations tailored to NLB patrons

Flap Your Wings by P. D. Eastman

1,070 patrons borrowed this title:

M00014123D M00025872A M00032776C M00032897A M00039928K M00040123B M00042334H M00045167I M00051921E M00056997H . . .

Recommendations

402 patrons 289 patrons 260 patrons Other titles the 1,070 patrons borrowed:

9342951 12547108 12910631 13085283 . . . 8734188 10247657 13046840 13085283 . . .

Collaborative filtering

SLIDE 38

38/90

A Simple Implementation

loans table columns: book_id, patron_id

select book_id, count(*)
from loans
where patron_id in (select patron_id from loans where book_id = 5127546)
group by book_id
order by 2 desc
limit 20;

Patrons who borrowed title ‘5127546’ also borrowed these other titles:

+----------+----------+
| book_id  | count(*) |
+----------+----------+
|  5127546 |      115 |
|  3652671 |       23 |
|  9136504 |       21 |
|  3857787 |       20 |
|  6132951 |       19 |
|  4235852 |       19 |
|  3049673 |       18 |
| 12863855 |       18 |
|  4624247 |       18 |
|  4643539 |       18 |
|  3718516 |       18 |
|  5018345 |       18 |
|  2908246 |       17 |
|  4235878 |       17 |
|  2085361 |       17 |
|  3718517 |       17 |
|  3260602 |       16 |
|  4317577 |       16 |
|  9043935 |       16 |
|  6373666 |       16 |
+----------+----------+
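The same "patrons also borrowed" idea can be sketched outside the database. The snippet below is a minimal illustration, not NLB's implementation: the function name `also_borrowed` and the sample loan records are hypothetical. Unlike the SQL above, it counts each patron once per title and leaves the queried title out of the results.

```python
from collections import Counter

def also_borrowed(loans, book_id, limit=20):
    """Rank other titles by how many borrowers of `book_id` also borrowed them."""
    # Patrons who borrowed the title of interest.
    borrowers = {patron for patron, book in loans if book == book_id}
    # De-duplicate (patron, book) pairs so repeat loans count once per patron.
    counts = Counter(
        book
        for patron, book in set(loans)
        if patron in borrowers and book != book_id
    )
    return counts.most_common(limit)

# Hypothetical loan records: (patron_id, book_id)
loans = [
    ("P1", 100), ("P1", 200), ("P1", 300),
    ("P2", 100), ("P2", 200),
    ("P3", 100), ("P3", 300), ("P3", 300),
    ("P4", 200), ("P4", 400),
]
recs = also_borrowed(loans, 100)
print(recs)
```

Counting distinct patrons rather than loan rows keeps a single heavy borrower from dominating the ranking; whether that is preferable depends on the collection.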

SLIDE 39

39/90

A Simple Implementation

Title level recommendations Patron level recommendations

NLB Mobile app

SLIDE 40

40/90

Contextual discovery via text analytics

SLIDE 41

41/90

The ability to mine unstructured data is key to an organisation’s competitive advantage

All digital data: 130 EB (2005), 1,227 EB (2010), 7,910 EB (2015)
1 EB (Exabyte) = 1,000,000 TB
90% - unstructured data
68% of all unstructured data in 2015 will be created by consumers
Storage cost per GB (US$), chart values: $20, $10, $0.50
Chart series: All digital data; Unstructured digital data

Sources: IDC’s Digital Universe Study, sponsored by EMC, Jun 2011; IDC - Singapore National Library Transforms Structured and Unstructured Data Into Insights on Cloud, 2014

“For companies who are looking to derive insights from nontransactional data sources with Big Data analytics and cloud technologies, NLB's analytics journey shines the light on approaches and lessons learned in deriving value from both transactional and nontransactional data sources on cloud infrastructure.”

SLIDE 42

42/90

The Growing Digital Collection

Digitised books
Historic newspapers
Images
Oral history recordings
Audio-visual recordings

Other collections: Infopedia articles, Web Archives, Singapore Memories, Music, Posters, Building Plans, Govt Records, Private Records, Maps

SLIDE 43

43/90

NLB users retrieve tens of millions of pieces of e-content every year. It would be really nice if we could convert every single e-retrieval instance into an enriching discovery experience for every single user every time…

SLIDE 44

44/90

Contextual Discovery

NLB users collectively contribute to tens of millions of e-retrievals every year

The Cenotaph, located at Esplanade Park along Connaught Drive, is a war memorial which commemorates the sacrifice of the men who perished during World War I and World War II. It was unveiled on 31 March 1922 by the Prince of Wales. On 28 December 2010, it was gazetted as a national monument together with two other structures in…

Gwee Peng Kwee: His daily routine school… Laying of foundation stone and unveiling of Cenotaph…
Dalhousie Obelisk (Article): Dalhousie Obelisk, landmark, located at Empress Place in the Central Region. The tall…
Lim Bo Seng (Article): Major-General Lim Bo Seng (b. 27 April 1909, Nan Ann, Fujian, China - d. 29 June…
Master Plan for Singapore - Central Area (1958)

Singapore’s War Memorial to the Glorious Dead (11 Nov 1920)
Lest we forget (8 Nov 1953)
Singapore students learn to care about history (13 Jul 1997)
Arrival of the Prince (31 Mar 1922)
Singapore’s War Memorial (21 Sep 1921)

Newspaper articles

SLIDE 45

45/90

Using text analytics to automatically identify related content

For each of the two items: text tokenised; tokens parsed and weighted (TF/IDF)
Similarity of the weighted tokens computed
Similarity = 0.295

SLIDE 46

46/90

Using Mahout to identify related content

What is Apache Mahout?

Scalable, commercial-friendly machine learning for building intelligent applications

Use cases:
Recommendation: User Info + Community Info
Classification: places new items into categories
Clustering: groups documents based on the notion of similarity
Frequent Itemset Mining: analyses items in a group and identifies which items typically appear together

http://mahout.apache.org/

SLIDE 47

47/90

Using Mahout to identify related content

The steps

Obtain the text files:
Obtain the text of the content to be analysed, one file per item. Put them in the “datafiles” folder.

Create the sequence files:
mahout seqdirectory -c UTF-8 -i datafiles -o seqfiles

Create TF/IDF weighted vectors:
mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv

Get the similarity results:
mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
mahout rowsimilarity -i matrix/matrix -o similarity --similarityClassName SIMILARITY_COSINE -m -ess

SLIDE 48

48/90

Using Mahout to identify related content

The steps


Sequence file is a binary key-value file format used extensively in Mahout and Hadoop.

SLIDE 49

49/90

Using Mahout to identify related content

The steps

Create TF/IDF weighted vectors:
mahout seq2sparse -i seqfiles -o vectors -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -wt tfidf -nv

Key parameters:
-s: Min support of the term in the entire collection
-md: Min document frequency
-x: Max document frequency percentage
-ng: Maximum size of the n-grams

SLIDE 50

50/90

Using Mahout to identify related content

The steps

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.

Sources:
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
http://filotechnologia.blogspot.sg/2014/01/a-simple-java-class-for-tfidf-scoring.html
http://criminalintent.org/2011/01/rapid-prototyping-with-mathematica/
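To make the definition concrete, here is a small sketch of one common TF-IDF variant (raw term frequency multiplied by log(N / document frequency)). It is an illustration only: Mahout's own weighting formula may differ in detail, and the function name `tfidf` and the sample sentences are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical mini-corpus
docs = [
    "the cenotaph is a war memorial",
    "the obelisk is a landmark",
    "the memorial was unveiled in 1922",
]
weights = tfidf(docs)
print(weights[0])
```

A term that appears in every document (such as "the") gets weight 0, while a term unique to one document gets the highest weight, which is the behaviour the definition describes.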

SLIDE 51

51/90

Using Mahout to identify related content

The steps

Get the similarity results:
mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
mahout rowsimilarity -i matrix/matrix -o similarity --similarityClassName SIMILARITY_COSINE -m 10

SLIDE 52

52/90

Using Mahout to identify related content

The steps

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle.

① Julie loves me more than Linda loves me
② Jane likes me more than Julie loves me

Term frequency:
① Julie (1) loves (2) me (2) more (1) than (1) Linda (1) likes (0) Jane (0)
② Julie (1) loves (1) me (2) more (1) than (1) Linda (0) likes (1) Jane (1)

$$
\mathrm{sim}(①, ②) = \frac{1 \times 1 + 2 \times 1 + 2 \times 2 + 1 \times 1 + 1 \times 1 + 1 \times 0 + 0 \times 1 + 0 \times 1}{\sqrt{1^2 + 2^2 + 2^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2} \times \sqrt{1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 0^2 + 1^2 + 1^2}} = \frac{9}{\sqrt{12} \times \sqrt{10}} = 0.822
$$
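The worked example can be checked with a few lines of code. This is a minimal sketch using plain Python lists; the function name `cosine_similarity` is chosen for illustration, and the vectors are the term frequencies from the slide.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Term order: Julie, loves, me, more, than, Linda, likes, Jane
s1 = [1, 2, 2, 1, 1, 1, 0, 0]
s2 = [1, 1, 2, 1, 1, 0, 1, 1]
sim = cosine_similarity(s1, s2)
print(round(sim, 3))  # 0.822
```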

SLIDE 53

53/90

Using Mahout to identify related content

The results

mahout seqdumper -i similarity > similarity.txt

Key: 0: Value: {14458:0.2966480826934176, 11399:0.30290014772966095, 12793:0.22009858979452146, 3275:0.1871791030103281, 14613:0.3534278632679437, 4411:0.2516380602790199, 17520:0.3139731583634198, 13611:0.18968888212315968, 14354:0.17673965754661425, 0:1.0000000000000004}
Key: 1: Value: ...

The key and the number before each colon are article IDs. For example, the similarity score between article 0 and article 14458 is 0.297.

Similarity scores are between 0 and 1
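A dump in this shape can be turned into a ranked "related articles" list with a short script. The sketch below assumes each line of similarity.txt looks like the sample shown on the slide; the helper names `parse_similarity_line` and `top_related` are hypothetical, and real seqdumper output may carry extra header lines that this regular expression simply skips.

```python
import re

LINE = re.compile(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}")

def parse_similarity_line(line):
    """Parse 'Key: K: Value: {id:score, ...}' into (K, {id: score})."""
    m = LINE.search(line)
    if m is None:
        return None
    key = int(m.group(1))
    scores = {}
    for pair in m.group(2).split(","):
        if not pair.strip():
            continue  # tolerate an empty value list
        other, score = pair.split(":")
        scores[int(other)] = float(score)
    return key, scores

def top_related(key, scores, n=3):
    """Return the n most similar article IDs, excluding the article itself."""
    ranked = sorted(((s, a) for a, s in scores.items() if a != key), reverse=True)
    return [(a, round(s, 3)) for s, a in ranked[:n]]

line = ("Key: 0: Value: {14458:0.2966480826934176, 11399:0.30290014772966095, "
        "14613:0.3534278632679437, 0:1.0000000000000004}")
key, scores = parse_similarity_line(line)
related = top_related(key, scores)
print(related)
```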

SLIDE 54

54/90

An event unfolds…

SLIDE 55

55/90

Handling large data sets

NewspaperSG

Online resource of current and historic Singapore and Malaya newspapers, including The Straits Times, The Business Times, 星洲日报, 南洋商报, 联合早报, Berita Harian and TODAY

Over 20,000,000 articles published, and growing

SLIDE 56

56/90

Using clustering to handle large datasets

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)

Mahout K-Means Clustering with Cosine Distance
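As a rough illustration of k-means with cosine distance, the sketch below clusters a handful of tiny term-count vectors in plain Python. It is not Mahout's implementation: the function names, the four sample vectors and the choice to seed centroids with the first k points are all assumptions made to keep the example deterministic.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine_distance(a, b):
    # Inputs are unit vectors, so the dot product equals the cosine.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans_cosine(vectors, k, max_iter=20):
    points = [normalise(v) for v in vectors]
    centroids = points[:k]  # simplistic seeding, for illustration only
    assignment = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the normalised mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalise(mean)
    return assignment

# Hypothetical term counts; columns: art, museum, police, arrest
vectors = [
    [3, 1, 0, 0],
    [0, 0, 2, 3],
    [1, 2, 0, 0],
    [0, 1, 3, 1],
]
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

Normalising the vectors first means only the direction of each document vector matters, not its length, which is why cosine distance suits documents of very different sizes.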

SLIDE 57

57/90

Using Mahout to cluster related content

SLIDE 58

58/90

Using Mahout to cluster related content

mahout kmeans -i vectors/tfidf-vectors/ -c initial-clusters -o kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl

Key parameters:
-k: Number of clusters
-x: Maximum number of iterations
-cd: Threshold of convergence (default: 0.5)
-dm: Distance measurement (default: SquaredEuclidean)

Source: https://mahout.apache.org/users/clustering/k-means-commandline.html

SLIDE 59

59/90

Handling large data sets

Using automatic clustering technique

Size of cluster: 52,678
Top 40 stemmed terms: exhibit, art, artist, paint, museum, singapor, work, displai, galleri, open, mr, year, organis, centr, chines, cultur, held, on, nation, world, pictur, photograph, collect, hall, colour, includ, first, time, featur, sculptur, design, dai, peopl, piec, two, societi, part, visitor, fair, intern

Size of cluster: 86,881
Top 40 stemmed terms: olymp, athlet, game, sport, medal, event, team, gold, championship, record, world, singapor, metr, swim, won, year, champion, win, nation, women, time, coach, asian, meet, competit, train, swimmer, two, second, race, compet, first, amateur, bronz, intern, associ, best, finish, yesterdai, silver

Size of cluster: 142,289
Top 40 stemmed terms: school, student, educ, teacher, univers, secondari, singapor, children, year, primari, studi, pupil, parent, mr, teach, colleg, cours, ministri, english, languag, chines, on, institut, examin, time, learn, princip, train, programm, work, graduat, help, nation, class, two, govern, girl, scienc, boi, first

Size of cluster: 125,629
Top 40 stemmed terms: polic, arrest, offic, suspect, two, men, yesterdai, man, investig, report, found, mr, gang, road, raid, on, station, crime, detain, year, night, arm, robberi, car, believ, peopl, forc, charg, singapor, robber, hous, todai, told, stolen, seiz, spokesman, old, held, four, escap

67 clusters averaging 93,000 articles each
Worked well for Chinese and Malay articles too
Close to 1 billion associations identified

SLIDE 60

60/90

Implementations

Implemented in Jul 2013 Implemented in Sep 2013 Implemented in Nov 2013 Implemented in Jul 2014

Infopedia PictureSG NewspaperSG

SLIDE 61

61/90

NLB’s Hadoop cluster for text analytics

VM hosts: VM Host 1, VM Host 2, VM Host 3

Server 1: Cluster Mgr
Server 2: JobTracker
Server 3: NameNode
Server 4: Checkpoint
Server 5: TaskTracker
Server 6: TaskTracker
Server 7: DataNode
Server 8: TaskTracker
Server 9: TaskTracker
Server 10: DataNode
Server 11: TaskTracker
Server 12: TaskTracker
Server 13: DataNode

SLIDE 62

62/90

Benefits of Contextual Discovery

Infopedia, PictureSG

Referrals from Infopedia: 0.14%, 10.65% (after 6 months)
Pageviews per month: 37,841, 84,341
Pageviews per visit: 3.64, 6.41

SLIDE 63

63/90

N-gram Viewer

SLIDE 64

64/90

Google Books Ngram Viewer

Graph showing how phrases have occurred in a corpus over time

https://books.google.com/ngrams
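The idea behind an n-gram viewer can be sketched in a few lines: for each year, count how often a phrase occurs and divide by the total number of n-grams of the same length in that year. The snippet below is only an illustration of that calculation, not how Google Books or Bookworm implement it; the function names and the three sample texts are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def phrase_share_by_year(docs, phrase):
    """Return {year: occurrences of phrase / total n-grams of that length}."""
    n = len(phrase.split())
    hits, totals = Counter(), Counter()
    for year, text in docs:
        grams = ngrams(text.lower().split(), n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Hypothetical (year, text) pairs
docs = [
    (1921, "war memorial unveiled"),
    (1921, "the war memorial fund"),
    (1953, "lest we forget"),
]
series = phrase_share_by_year(docs, "war memorial")
print(series)
```

Plotting `series` against the year gives the kind of curve an n-gram viewer displays.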

SLIDE 65

65/90

Ngram viewer using Bookworm

Open source software from Culturomics

http://bookworm.culturomics.org/

SLIDE 66

66/90

Ngram viewer using Bookworm

To create an Ngram Viewer for your collection

Metadata Catalog: /metadata/jsoncatalog.txt
The list of the metadata for each text

Field Descriptions: /metadata/field_descriptions.json
Describes the properties of each available metadata field

Raw Text: /texts/raw/*.txt
The text files in your collection (in .txt format)

SLIDE 67

67/90

Ngram viewer using Bookworm

To create an Ngram Viewer for your collection


Metadata Catalog (/metadata/jsoncatalog.txt), 3 required fields:

Key | Description of Value
filename | The filename of the corresponding text file (with .txt omitted and no whitespace in the name).
date | The date corresponding to a text file. Dates which are not integers should be specified as a string in the format: YYYY-MM-DD.
searchstring | The HTML code displayed for a text when points are clicked on in the ngram graph.

{"filename": "s1541-104", "date": "1997-2-7", "searchstring": "A bill to extend, reform, and improve agricultural commodity, trade, conservation, and other programs, and for other purposes. | Read at: <a href=\"http://www.govtrack.us/congress/bills/104/s1541\" target=\"_blank\">govtrack.us</a>"}
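Catalog entries like the one above are easy to generate from an existing metadata export. The sketch below is a hypothetical helper, not part of Bookworm: it assumes the catalog holds one JSON object per line, enforces the filename rule from the table, and adds an `enacted` value purely as an illustrative extra field.

```python
import json

REQUIRED = ("filename", "date", "searchstring")

def catalog_line(filename, date, searchstring, **extra):
    """Serialise one catalog entry as a single line of JSON."""
    if any(ch.isspace() for ch in filename) or filename.endswith(".txt"):
        raise ValueError("filename must omit .txt and contain no whitespace")
    entry = {"filename": filename, "date": date, "searchstring": searchstring}
    entry.update(extra)  # optional metadata fields
    return json.dumps(entry)

line = catalog_line(
    "s1541-104",
    "1997-02-07",
    "A bill to extend, reform, and improve agricultural programs",
    enacted="false",  # hypothetical extra metadata field and value
)
entry = json.loads(line)
print(line)
```

Writing one such line per text file, in the same order as the files in /texts/raw/, would produce a catalog in the shape the slide describes, under the one-object-per-line assumption.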

SLIDE 68

68/90

Ngram viewer using Bookworm

To create an Ngram Viewer for your collection


Field Descriptions (/metadata/field_descriptions.json):

Key | Description
field | The name of the metadata variable.
datatype | The type of the data: searchstring, time, categorical, etc.
type | The format of the data: integer, decimal, character, text.
unique | Whether any given text can have only one type of this field (e.g. title) or not (e.g. subject).

{"datatype": "searchstring", "field": "searchstring", "unique": true, "type": "text"},
{"datatype": "categorical", "field": "enacted", "unique": false, "type": "text"},
{"datatype": "time", "field": "date", "unique": true, "type": "character", "derived":[{"resolution":"year"}]}

SLIDE 69

69/90

Ngram viewer using Bookworm

To create an Ngram Viewer for your collection


Example Files:

/texts/raw/s1541-104.txt /texts/raw/hr2854-104.txt

SLIDE 70

70/90

Ngram viewer using Bookworm

Demonstration

http://bookworm.culturomics.org/congress/

SLIDE 71

71/90

Named Entity Recognition

SLIDE 72

72/90

Automatic extraction of time-based and location related information

Time and location are two of the most fundamental ways we organise things.
The automatic extraction of geo- and time-based references from the full text can yield more data than manual tagging.

Resources are time-stamped for discovery on a time-line: 12 Aug 1956, 07 Sep 1971, 30 Mar 1988, 26 Jul 1992, 16 Aug 2002, 11 Feb 2009
Resources can be mapped for contextual discovery
Users navigate through old images of Singapore buildings, streets, satellite images and events via augmented reality apps

SLIDE 73

73/90

Natural Language Processing (NLP)

Typical Steps of NLP

Steps: Sentence detection, Tokenization, Part-of-speech tagging, Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.

SLIDE 74

74/90

Natural Language Processing (NLP)

Typical Steps of NLP

Steps: Sentence detection, Tokenization, Part-of-speech tagging, Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.
Tokenization: This is Mike. → ①This ②is ③Mike

SLIDE 75

75/90

Natural Language Processing (NLP)

Typical Steps of NLP

Steps: Sentence detection, Tokenization, Part-of-speech tagging, Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.
Tokenization: This is Mike. → ①This ②is ③Mike
Part-of-speech tagging: ①This (DT) ②is (VBZ) ③Mike (NNP)

SLIDE 76

76/90

Natural Language Processing (NLP)

Typical Steps of NLP

Steps: Sentence detection, Tokenization, Part-of-speech tagging, Named-entity detection

Sentence detection: Hi, How are you? This is Mike. → ①Hi, How are you? ②This is Mike.
Tokenization: This is Mike. → ①This ②is ③Mike
Part-of-speech tagging: ①This (DT) ②is (VBZ) ③Mike (NNP)
Named-entity detection: Mike [person] was in Singapore [location] on 3rd October [date]; This is Mike [person]
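The first two steps can be approximated with regular expressions, as in the rough sketch below. Real NLP toolkits use trained models and handle abbreviations, quotations and other edge cases that these two hypothetical helpers (`split_sentences`, `tokenize`) do not; unlike the slide, this tokenizer also keeps punctuation as separate tokens.

```python
import re

def split_sentences(text):
    # Split after ., ! or ? when followed by whitespace and a capital letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(sentence):
    # Words (with an optional internal apostrophe) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

sentences = split_sentences("Hi, How are you? This is Mike.")
tokens = tokenize(sentences[1])
print(sentences)
print(tokens)
```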

SLIDE 77

77/90

Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

GATE: General Architecture for Text Engineering
Developed at the University of Sheffield in 1995
A Java suite of tools, GUI & library
Provides means of analyzing text
Makes computers analyze and understand the language that humans use naturally
Plugin to support different languages

ANNIE: A Nearly-New IE system
IE: Information Extraction
Distributed in GATE

SLIDE 78

78/90

Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

Dates in Infopedia articles (http://eresources.nlb.gov.sg/infopedia/)

SLIDE 79

79/90

Natural Language Processing (NLP)

Named Entity Recognition using GATE/ANNIE

Handling local street and building names using ‘Gazetteers’
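The gazetteer idea, a lookup list of known local names, can be illustrated without GATE. The sketch below is plain Python, not the GATE/ANNIE API: the gazetteer entries, their labels and the date pattern are hypothetical examples, and a real gazetteer would hold many more names and handle spelling variants.

```python
import re

# Hypothetical gazetteer entries: known name -> entity type
GAZETTEER = {
    "Esplanade Park": "location",
    "Connaught Drive": "location",
    "Cenotaph": "building",
}
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")

def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.group(), label))
    for m in DATE.finditer(text):
        found.append((m.start(), m.group(), "date"))
    return [(entity, label) for _, entity, label in sorted(found)]

text = ("The Cenotaph, located at Esplanade Park along Connaught Drive, "
        "was unveiled on 31 March 1922.")
entities = find_entities(text)
print(entities)
```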

SLIDE 80

80/90

Natural Language Processing (NLP)

Some other options for NLP

http://www.alchemyapi.com/

Free for up to 1,000 transactions per day

SLIDE 81

81/90

Natural Language Processing (NLP)

Some other options for NLP

http://new.opencalais.com/

Free for up to 5,000 submissions per day

SLIDE 82

82/90

Image matching

SLIDE 83

83/90

Visual Search & Discovery

Visual Search User uploads his old photo to search for similar images

Upload another image 128 similar images found:

Visual Discovery Images without metadata description; cannot use text mining to cluster similar images

SLIDE 84

84/90

Visual Search & Discovery

Indexing: Image Database → Feature Detector & Descriptor Extractor → Features (Super Matrix)
Search: Query Image → Feature Detector & Descriptor Extractor → Descriptor Matcher → Similar Images

Algorithms:
SIFT (Scale Invariant Feature Transform)
SURF (Speeded Up Robust Features)
FAST (Features from Accelerated Segment Test)
BRIEF (Binary Robust Independent Elementary Features)
ORB (Oriented FAST and Rotated BRIEF)
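The descriptor-matching stage can be sketched independently of any vision library. Binary descriptors such as BRIEF and ORB are typically compared by Hamming distance, so the toy example below represents each descriptor as a small integer and ranks database images by how many query descriptors find a close match. The function names, the 8-bit descriptors and the distance threshold are all hypothetical; real ORB descriptors are 256 bits and would come from a feature extractor.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded descriptors."""
    return bin(a ^ b).count("1")

def count_matches(query, candidate, max_distance=1):
    """Count query descriptors whose nearest candidate descriptor is close enough."""
    matches = 0
    for q in query:
        if candidate and min(hamming(q, c) for c in candidate) <= max_distance:
            matches += 1
    return matches

def rank_images(query, database):
    """Order image names by number of matched descriptors, best first."""
    scores = {name: count_matches(query, descs) for name, descs in database.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 8-bit descriptors for a query image and two database images
query = [0b10110010, 0b01101100, 0b11110000]
database = {
    "photo_a": [0b10110011, 0b01101100, 0b11100000],
    "photo_b": [0b00001111, 0b01010101, 0b10101010],
}
ranking = rank_images(query, database)
print(ranking)
```

Brute-force comparison like this scales poorly; large image collections normally rely on an index over the descriptors instead.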

SLIDE 85

85/90

Visual Search & Discovery

Options for image matching

OpenCV (http://opencv.org/)
Free open-source library based on BSD license
Image processing, computer vision and machine learning
Supports large number of algorithms
Key Features:
Optimized for real time image processing and computer vision applications
Interfaces to C++, C, C#, Java, Python
Runs on Windows, Linux, Mac, iOS and Android

ViSenze (https://visenze.com/)
Hosted on cloud
Scalable image database
Search returns result in milliseconds
APIs available

SLIDE 86

86/90

Question?

Kia Siang Hock

kia_siang_hock@nlb.gov.sg

SLIDE 87

87/90

Backup slides

SLIDE 88

88/90

Using Mahout to identify related content

Parameters for seq2sparse command

Option: Overwrite (bool)
Flag: -ow
Description: If set, the output folder is overwritten. If not set, the output folder is created if the folder doesn’t exist. If the output folder does exist, the job fails and an error is thrown. Default is unset.
Default value: NA

Option: Lucene analyzer name (String)
Flag: -a
Description: The class name of the analyzer to use.
Default value: org.apache.lucene.analysis.standard.StandardAnalyzer

Option: Chunk size (int)
Flag: -chunk
Description: The chunk size in MB. For large document collections (sizes in GBs and TBs), you won’t be able to load the entire dictionary into memory during vectorization, so you can split the dictionary into chunks of the specified size and perform the vectorization in multiple stages. It’s recommended you keep this size to 80 percent of the Java heap size of the Hadoop child nodes to prevent the vectorizer from hitting the heap limit.
Default value: 100

Option: Weighting (String)
Flag: -wt
Description: The weighting scheme to use: tf for term-frequency based weighting and tfidf for TF-IDF based weighting.
Default value: tfidf

Option: Minimum support (int)
Flag: -s
Description: The minimum frequency of the term in the entire collection to be considered as a part of the dictionary file. Terms with lesser frequency are ignored.
Default value: 2

Option: Minimum document frequency (int)
Flag: -md
Description: The minimum number of documents the term should occur in to be considered a part of the dictionary file. Any term with lesser frequency is ignored.
Default value: 1

SLIDE 89

89/90

Using Mahout to identify related content

Parameters for seq2sparse command

Option: Max document frequency percentage (int)
Flag: -x
Description: The maximum number of documents the term should occur in to be considered a part of the dictionary file. This is a mechanism to prune out high frequency terms (stop-words). Any word that occurs in more than the specified percentage of documents is ignored.
Default value: 99

Option: N-gram size (int)
Flag: -ng
Description: The maximum size of n-grams to be selected from the collection of documents.
Default value: 1

Option: Minimum log-likelihood ratio (LLR) (float)
Flag: -ml
Description: This flag works only when n-gram size is greater than 1. Very significant n-grams have large scores, such as 1000; less significant ones have lower scores. Although there’s no specific method for choosing this value, the rule of thumb is that n-grams with a LLR value less than 1.0 are irrelevant.
Default value: 1.0

Option: Normalization (float)
Flag: -n
Description: The normalization value to use in the Lp space. A detailed explanation of normalization is given in section 8.4. The default scheme is to not normalize the weights.

Option: Create sequential access sparse vectors (bool)
Flag: -seq
Description: If set, the output vectors are created as SequentialAccessSparseVectors. By default the dictionary vectorizer generates RandomAccessSparseVectors. The former gives higher performance on certain algorithms like k-means and SVD due to the sequential nature of vector operations. By default the flag is unset.
Default value: NA

SLIDE 90

90/90

Using Mahout to cluster related content

Parameters for kmeans command

Option | Description
-input (-i) input | Path to job input directory. Must be a SequenceFile of VectorWritable
-clusters (-c) clusters | The input centroids, as Vectors. Must be a SequenceFile of Writable, Cluster/Canopy. If k is also specified, then a random set of vectors will be selected and written out to this path first
-output (-o) output | The directory pathname for output.
-distanceMeasure (-dm) distanceMeasure | The classname of the DistanceMeasure. Default is SquaredEuclidean
-convergenceDelta (-cd) convergenceDelta | The convergence delta value. Default is 0.5
-maxIter (-x) maxIter | The maximum number of iterations.
-maxRed (-r) maxRed | The number of reduce tasks. Defaults to 2
-k (-k) k | The k in k-Means. If specified, then a random selection of k Vectors will be chosen as the Centroid and written to the clusters input path.
-overwrite (-ow) | If present, overwrite the output directory before running job
-clustering (-cl) | If present, run clustering after the iterations have taken place