Wikipedia job extractor by John Helbrink and Love Malmros

Introduction
▪ Text classification is a powerful tool in many areas.
▪ We created a deep learning model that classifies a person's profession based on the first paragraph of their Wikipedia article.
▪ Tools used: Python, Keras, Wikidata, Wikipedia and SPARQL.

What we are going to talk about today
● Related work
● What was the problem?
● How did we solve the problem?
● Preprocessing
● Architecture
● Results
● Demo
● Further improvements

Related work
▪ Earlier project.
▪ Solved the problem using part-of-speech and dependency relations.
▪ We wanted to solve it by using deep learning.

How did we solve the problem?
▪ Keras is a high-level neural networks API, written in Python.
▪ Wikidata is a central database that contains structured Wikipedia data. Extraction using SPARQL.
▪ Wikipedia contained the relevant first paragraph for each person.

Extracting data
Extract occupations from Wikidata
▪ Used SPARQL to extract occupations for 900 000 people.
▪ Converted to a JSON file for easier processing.
Extract first paragraph from Wikipedia
▪ Combined the JSON file with people and corresponding jobs with the docria file containing the first paragraph from Wikipedia.
Combining the jobs with the paragraphs
▪ Generated the final corpus used to train our model, containing all the relevant information needed.
▪ ~900k people.
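
The occupation extraction step could look roughly like the sketch below. The slides do not show the actual query, so everything here is an assumption for illustration: the query shape, the helper names `build_occupation_query`, `group_occupations` and `fetch_occupations`, and the User-Agent string. The Wikidata identifiers are real (P31 = instance of, Q5 = human, P106 = occupation).

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_occupation_query(limit=100):
    """Return a SPARQL query listing humans and their occupations (assumed shape)."""
    return f"""
SELECT ?person ?occupation ?occupationLabel WHERE {{
  ?person wdt:P31 wd:Q5 ;          # instance of: human
          wdt:P106 ?occupation .   # occupation
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
"""


def group_occupations(rows):
    """Group SPARQL JSON result rows into {person URI: [occupation labels]}."""
    jobs = {}
    for row in rows:
        person = row["person"]["value"]
        jobs.setdefault(person, []).append(row["occupationLabel"]["value"])
    return jobs


def fetch_occupations(limit=100):
    """Run the query against Wikidata and return the grouped occupations."""
    params = urllib.parse.urlencode(
        {"query": build_occupation_query(limit), "format": "json"}
    )
    request = urllib.request.Request(
        WIKIDATA_ENDPOINT + "?" + params,
        headers={"User-Agent": "job-extractor-sketch/0.1"},  # hypothetical name
    )
    with urllib.request.urlopen(request) as response:
        rows = json.load(response)["results"]["bindings"]
    return group_occupations(rows)
```

The grouped dictionary can be written out with `json.dump` and later joined with the first paragraphs by person identifier. For ~900k people the query would likely have to run in batches, which this sketch leaves out.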

Preprocessing
The texts:
● Tokenize, max length = 150
● To integer sequences
● Pad sequences with 0s
The occupations:
● One hot encoded [ 1 0 0 1 0 0 0 ]
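
The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' code: whitespace tokenization, padding at the end of the sequence, and the helper names `build_vocab`, `encode_text` and `encode_jobs` are all assumptions.

```python
MAX_LEN = 150  # maximum sequence length, taken from the slide


def build_vocab(texts):
    """Map each token to an integer id; id 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():  # assumption: whitespace tokenizer
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode_text(text, vocab, max_len=MAX_LEN):
    """Tokenize, map to integer ids, truncate to max_len, and pad with 0s."""
    ids = [vocab[t] for t in text.lower().split() if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))  # assumption: pad at the end


def encode_jobs(jobs, all_jobs):
    """Return a 0/1 vector with a 1 at the position of each of the person's jobs."""
    return [1 if job in jobs else 0 for job in all_jobs]


if __name__ == "__main__":
    vocab = build_vocab(["Edmund was Bishop of Durham"])
    print(encode_text("Edmund was Bishop", vocab, max_len=5))  # [1, 2, 3, 0, 0]
    print(encode_jobs({"a", "d"}, ["a", "b", "c", "d", "e", "f", "g"]))
    # [1, 0, 0, 1, 0, 0, 0]
```

A person with two occupations gets two 1s in the label vector, as in the example vector on the slide.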

Scaling the input data in order to make it less imbalanced.
▪ The data was imbalanced: the largest group of jobs (10%) were politicians.
2 new sets:
▪ 50 - 500 (438 jobs)
▪ 100 - 500 (247 jobs)
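
One way to build such sets is sketched below. The slides do not say exactly how the limits were applied, so this assumes that "50 - 500" means: drop jobs with fewer than 50 examples and randomly keep at most 500 examples of any job. The function name `rebalance` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def rebalance(samples, min_count=50, max_count=500, seed=0):
    """Drop rare jobs and cap frequent ones.

    `samples` is a list of (text, job) pairs. ASSUMPTION: a job is kept only
    if it has at least `min_count` examples, and at most `max_count` randomly
    chosen examples are kept per job.
    """
    by_job = defaultdict(list)
    for text, job in samples:
        by_job[job].append(text)

    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    balanced = []
    for job, texts in by_job.items():
        if len(texts) < min_count:
            continue  # too few examples to learn from
        if len(texts) > max_count:
            texts = rng.sample(texts, max_count)  # downsample dominant jobs
        balanced.extend((text, job) for text in texts)
    return balanced
```

Calling `rebalance(samples, 50, 500)` and `rebalance(samples, 100, 500)` would then give the two sets; a higher lower limit leaves fewer job classes.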

Architecture
Constructed our deep learning network by using:
● Linear stack of layers (Sequential)
● Embedding layer
● Bidirectional LSTM (long short-term memory)
● Two hidden layers (dense)
● Softmax and categorical cross entropy
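
A network with the layers listed above might be assembled in Keras roughly as follows. The layer types come from the slide; the layer sizes, the ReLU activations, the Adam optimizer and the function name `build_model` are assumptions, since the slides do not give them.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(vocab_size, num_classes, embedding_dim=100,
                lstm_units=64, hidden_units=128):
    """Sequential model: Embedding -> BiLSTM -> two dense layers -> softmax.

    All sizes are illustrative defaults, not the authors' settings.
    `vocab_size` must count the padding id 0 as well.
    """
    model = keras.Sequential([
        layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        layers.Bidirectional(layers.LSTM(lstm_units)),
        layers.Dense(hidden_units, activation="relu"),    # hidden layer 1
        layers.Dense(hidden_units, activation="relu"),    # hidden layer 2
        layers.Dense(num_classes, activation="softmax"),  # one output per job
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

For the 50 - 500 set this would be called with `num_classes=438`, and the inputs would be the padded integer sequences of length 150.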

How did the road to our final model look?

What different models did we create?
First draft (Binary classification)
▪ Created a model with binary classification, determining whether a person had a certain job or not.
Second draft (Feed forward network)
▪ Did not produce good results, but was a step in the right direction!
Final model (Bidirectional LSTM)
▪ Created a few different models using LSTM with varying results.
▪ Big improvements when scaling imbalanced data.

F1, precision, recall and accuracy
Set                          Classes   Precision   Recall   F1      Accuracy
Limit of 50 - 500 labels     438       0.818       0.809    0.813   0.772
Limit of 100 - 500 labels    247       0.826       0.883    0.853   0.821
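
For reference, F1 is the harmonic mean of precision and recall. The sketch below uses the standard definitions; how the authors averaged the scores over classes is not stated on the slide, and the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from raw prediction counts."""
    predicted = true_positives + false_positives  # everything the model predicted
    actual = true_positives + false_negatives     # everything that was correct
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # Precision and recall of the 438-class model from the table above.
    print(round(f1_score(0.818, 0.809), 3))  # 0.813
```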

Model (50 - 500) sample
Using the model with 438 classes, we supplied the model with the first paragraph corresponding to three people. The model (50-500) predicted:
● Edmund (or Eadmund; died 1041) was Bishop of Durham from 1021 to 1041.
  PREDICTED: PRIEST
  WIKIDATA: PRIEST
● Elita Krūmiņa (born 21 November 1965 in Jelgava) has been the auditor general of Latvia since 21 January 2013. Prior to becoming auditor general, Krūmiņa was a member of the council of Latvia's state audit office; she became a member in 2005 after spending six years at the Latvian Ministry of Finance.
  PREDICTED: POLITICIAN
  WIKIDATA: BANKER AND ECONOMIST
● Heriberto Andrés Bodeant Fernández (born 15 June 1955 in Young, Río Negro Department) is a Uruguayan Roman Catholic cleric.
  PREDICTED: CATHOLIC PRIEST
  WIKIDATA: CATHOLIC PRIEST

Model (100 - 500) sample
Using the model with 247 classes, we supplied the model with the first paragraph corresponding to three people. The model (100-500) predicted:
● John A. ("Jack") LaSota is a former Arizona Attorney General (1977–1978). LaSota also served as Bruce Babbitt's Chief of Staff when the former was governor of Arizona. He is a lobbyist for the firm LaSota & Peters, P.L.C.
  PREDICTED: LOBBYIST AND OTHER
  WIKIDATA: LAWYER AND JUDGE
● Ernesto Alciati (3 December 1901 – November 1984) was an Italian long-distance runner.
  PREDICTED: ATHLETICS COMPETITOR AND MARATHON RUNNER
  WIKIDATA: MARATHON RUNNER

Future work
▪ Extend the algorithm to support other languages as well.
▪ Create a usable web application for classifying texts.
▪ Retrieve more data to increase accuracy and the number of classes.

THANKS!
Any questions?
John Helbrink, mat14jhe@student.lu.se
Love Malmros, sta15lma@student.lu.se
