Efficient Entity Annotation for Large Scale Web Archives
Elena Demidova, Julian Szymanski, Sergej Zerr and Karol Draszawka
L3S Research Center, Hannover, Germany; Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Poland

Barack Hussein Obama II (US /bəˈrɑːk huːˈseɪn ɵˈbɑːmə/; born August 4, 1961) is the 44th and current President of the United States, and the first African American to hold the office. Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review. He was a community organizer in Chicago before earning his law degree. He worked as a civil rights attorney and taught constitutional law at University of Chicago Law School from 1992 to 2004. He served three terms representing the 13th District in the Illinois Senate from 1997 to 2004, running unsuccessfully for the United States House of Representatives in 2000.
Source: https://en.wikipedia.org/wiki/Barack_Obama

Danqi Chen and Christopher D. Manning. 2014. A Fast and Accurate Dependency Parser using Neural Networks. Proceedings of EMNLP 2014.

Stefan Siersdorfer, Hanno Ackermann, Philipp Kemkes, Sergej Zerr. 2015. Who with Whom and How? - Extracting Large Social Networks using Search Engines (full paper, in press). 24th ACM Conference on Information and Knowledge Management (CIKM), Melbourne, Australia.

Naive extraction algorithms, running times

Alexandria Project (Foundations for Temporal Retrieval, Exploration and Analytics in Web - ERC 339233): access to over 80 TB of compressed web pages, 1995-2014. Experiments were conducted on a sample of 2.5 million web pages.

                         Dictionary based             NLP based
Total runtime (ms)       35 357 949 (9h 42min)        133 265 869 (37h 1min)
Algorithm runtime (ms)   3 328 493 (56 min) (x27)     99 232 735 (27h 34min)
Entities found           3 487 182                    8 188 293 persons
Distinct                 156 197 all entities         1 448 828 persons (x10)
Overlap persons          130 900 persons

Hashing and Locality Sensitive Hashing (LSH)

Example LSH baskets (basket IDs run from 00 to 15 in the original figure):
- [kay elizabeth, elizabeth kay, elizabeth kempe]
- [tony de sergio, tony di sergio]
- [lawrence bragg, floss hodges]
- [annie hammond, ann hammond]
- [c. williamson, m. williamson, william williamson, g. williamson, h. williamson, jb williamson, ed williamson, williamson jr.]
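The grouping shown in the baskets above can be sketched with MinHash signatures and banding. The slides do not say which LSH family was used, so MinHash, the character 3-gram features, and the names `minhash`, `lsh_baskets`, `num_hashes` and `rows_per_band` are all assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict


def trigrams(name):
    """Character 3-grams of each whitespace-separated token (assumed feature set)."""
    return {tok[i:i + 3] for tok in name.lower().split() for i in range(len(tok) - 2)}


def minhash(features, num_hashes=16):
    """MinHash signature: for each seed, keep the smallest hash over all features."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha256(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return tuple(signature)


def lsh_baskets(names, num_hashes=16, rows_per_band=2):
    """Put each name into one basket per band; similar names tend to share a basket."""
    baskets = defaultdict(set)
    for name in names:
        features = trigrams(name)
        if not features:  # names with no token of length >= 3 are skipped
            continue
        signature = minhash(features, num_hashes)
        for start in range(0, num_hashes, rows_per_band):
            band = signature[start:start + rows_per_band]
            baskets[(start, band)].add(name)
    return baskets


if __name__ == "__main__":
    names = ["kay elizabeth", "elizabeth kay", "tony de sergio", "floss hodges"]
    for key, members in lsh_baskets(names).items():
        if len(members) > 1:  # only baskets with collisions are interesting
            print(key[0], sorted(members))
```

Because "kay elizabeth" and "elizabeth kay" have identical 3-gram sets, their signatures are equal and they share every basket. Fewer rows per band makes collisions more likely, which is the kind of parametrization trade-off the deck lists as a challenge.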

First experiments

We applied LSH to a set of 291954 person names extracted from the Web archive.

Features:
- 3-grams (Hammond May = {ham, amm, mmo, mon, ond, may})
- First letters (h, m)
- String count

Examined:
- 291954 different person names
- 40200 different features
- Time: ~1 minute

For a sample of 1000 names from baskets of size 1, near duplicates (according to Levenshtein distance) could be found for around 60%.
Example: inga fossa [bossa rosa, hugh foss]: distance (6.0)

a) LSH basket size distribution (max=284)

Basket size   Count
1             267183
2             7749
3             805
4             318
5             161
6             92
7             66
8             44
9             37
10            28
>10           116

b) Levenshtein distance by collisions (max=120)

Levenshtein distance   Count
0                      1665
1                      1426
2                      684
3                      477
4                      499
5                      321
6                      300
7                      274
8                      400
9                      190
10                     207
>10                    3240
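The two text features with a worked example on the slide, and the Levenshtein distance used to judge near duplicates, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the "string count" feature is left out because the slide does not define it.

```python
def trigrams(name):
    """Character 3-grams per token, e.g. 'Hammond May' -> ham, amm, mmo, mon, ond, may."""
    return {tok[i:i + 3] for tok in name.lower().split() for i in range(len(tok) - 2)}


def first_letters(name):
    """First letter of each token, e.g. 'Hammond May' -> h, m."""
    return {tok[0] for tok in name.lower().split()}


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via row-wise dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(sorted(trigrams("Hammond May")))
    print(sorted(first_letters("Hammond May")))
    print(levenshtein("annie hammond", "ann hammond"))    # 2
    print(levenshtein("tony de sergio", "tony di sergio"))  # 1
```

Names that land in the same basket can then be compared pairwise with `levenshtein`, which is how a table such as "Levenshtein distance by collisions" could be produced.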

Conclusion

Observation:
- The dictionary based approach is very fast, but it is limited to a fixed set of strings
- The NLP based approach captures more variations, but it is very slow

Idea:
- (1) Extend the dictionary based approach with near duplicate matching (LSH) to obtain more entity variations efficiently
- (2) Entity grouping by similarity

Challenges:
- Algorithm parametrization
- Feature selection for the similarity measure
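Idea (1) can be illustrated with a dictionary lookup that falls back to near duplicate matching when there is no exact hit. The sketch below swaps in a plain 3-gram inverted index with a Jaccard threshold as a simpler stand-in for LSH; the class name `FuzzyDictionary`, the threshold of 0.6 and the sample names are assumptions for illustration, not part of the presented system.

```python
from collections import defaultdict


def trigrams(name):
    """Character 3-grams of each whitespace-separated token."""
    return {tok[i:i + 3] for tok in name.lower().split() for i in range(len(tok) - 2)}


class FuzzyDictionary:
    """Exact dictionary lookup with a 3-gram Jaccard fallback (stand-in for LSH)."""

    def __init__(self, names, threshold=0.6):
        self.threshold = threshold  # hypothetical cut-off, would need tuning
        self.grams = {name.lower(): trigrams(name) for name in names}
        self.index = defaultdict(set)  # 3-gram -> dictionary names containing it
        for name, grams in self.grams.items():
            for gram in grams:
                self.index[gram].add(name)

    def lookup(self, mention):
        mention = mention.lower()
        if mention in self.grams:  # fast path: plain dictionary hit
            return mention
        grams = trigrams(mention)
        candidates = set()
        for gram in grams:
            candidates |= self.index.get(gram, set())
        best, best_similarity = None, 0.0
        for candidate in candidates:
            other = self.grams[candidate]
            similarity = len(grams & other) / len(grams | other)
            if similarity > best_similarity:
                best, best_similarity = candidate, similarity
        return best if best_similarity >= self.threshold else None


if __name__ == "__main__":
    dictionary = FuzzyDictionary(["annie hammond", "tony de sergio"])
    print(dictionary.lookup("Annie Hammond"))   # exact match after lowercasing
    print(dictionary.lookup("ann hammond"))     # near duplicate, Jaccard 6/8
    print(dictionary.lookup("floss hodges"))    # no shared 3-grams
```

Only dictionary entries that share at least one 3-gram with the mention are compared, so the fallback stays much cheaper than scanning the whole dictionary. The threshold and the feature set are exactly the parametrization and feature selection questions listed under Challenges.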

Discussions / Questions / Remarks
