SLIDE 1

A Simple Approach for Author Profiling in MapReduce

Suraj Maharjan, Prasha Shrestha, and Thamar Solorio

SLIDE 2

Introduction

  • Task
    • Given an anonymous document
    • Predict:
      • Age group [18-24 | 25-34 | 35-49 | 50-64 | 65-plus]
      • Gender [Male | Female]
  • Provided: Training data in English and Spanish
    • English – Blog, Reviews, Social Media, and Twitter
    • Spanish – Blog, Social Media, and Twitter
SLIDE 3

Motivation

  • Started experimenting with PAN’13 data
  • PAN’13 dataset
    • 1.8 GB of training data for English
    • 384 MB of training data for Spanish
  • Explored MapReduce for fast processing of this huge amount of data

SLIDE 4

Data Distribution

                 English                  Spanish
Category         Files    Size (MB)      Files    Size (MB)
Blog             147      7.6            88       8.3
Reviews          4160     18.3           –        –
Social Media     7746     562.3          1272     51.9
Twitter          306      104.0          178      85.0
Total            12359    692.2          1538     145.2

Table 1: Training data distribution.

SLIDE 5

Methodology

  • Preprocessing
    • Sequence File Creation
    • Tokenization
    • DF Calculation
    • Filter
  • Features
    • Word n-grams (unigrams, bigrams, trigrams)
    • Weighting scheme: TF-IDF
  • Classification Algorithm
    • Logistic Regression with L2 norm regularization
SLIDE 6

Tokenization

Preprocessing removes XML and HTML tags from the documents and writes them into sequence files, where each record is keyed by a filename that encodes the labels: <authorid>_<lang>_<age>_<gender>.xml. The tokenization job then runs map tasks in parallel over these files, mapping each (filename, content) pair to (filename, word 1,2,3-grams).

Example:
F1 “A B C C” → [A, B, C, C, A B, A B C, ..]
F2 “B D E A” → [B, D, E, A, B D E, E A, ..]
F3 “A B D E” → [A, B, D, E, B D, B D E, …]
F4 “C C C” → [C, C, C, C C, C C C, …]
F5 “B F G H” → [B, F, G, H, G H, …]
F6 “A E O U” → [A, E, O, U, E O, A U,…]
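
A minimal Hadoop mapper sketch of this job, assuming (filename, cleaned content) pairs read from the sequence files; whitespace tokenization and the class name are illustrative simplifications, not the authors' exact code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative sketch: maps (filename, cleaned content) to (filename, 1-3 grams).
public class TokenizationMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text filename, Text content, Context context)
            throws IOException, InterruptedException {
        String[] words = content.toString().toLowerCase().split("\\s+");
        List<String> grams = new ArrayList<>();
        for (int n = 1; n <= 3; n++) {                    // unigrams, bigrams, trigrams
            for (int i = 0; i + n <= words.length; i++) {
                grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        // One output record per document: filename key, tab-separated n-gram list.
        context.write(filename, new Text(String.join("\t", grams)));
    }
}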

SLIDE 7

DF Calculation Job

The DF count job emits a (token, 1) pair for each token in each document, groups the pairs by token, and reduces them to (token, DF count).

Example (over the six files above):
A → 4, B → 4, C → 2, D → 2, E → 3, F → 1
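
A sketch of this job's mapper and reducer, matching the toy example above. Emitting each distinct token once per document is what makes the reduce-side sum a document frequency rather than a term frequency; class names are illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DfCountJob {
    // Emits each token at most once per document.
    public static class DfMapper extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text filename, Text grams, Context context)
                throws IOException, InterruptedException {
            Set<String> distinct = new HashSet<>(Arrays.asList(grams.toString().split("\t")));
            for (String token : distinct) {
                context.write(new Text(token), ONE);
            }
        }
    }

    // Groups by token and sums the ones: e.g. A -> 4, B -> 4 in the toy example.
    public static class DfReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text token, Iterable<IntWritable> ones, Context context)
                throws IOException, InterruptedException {
            int df = 0;
            for (IntWritable one : ones) {
                df += one.get();
            }
            context.write(token, new IntWritable(df));
        }
    }
}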

SLIDE 8

Filter Job

The filter job is map-only: each mapper loads the DF counts and maps (filename, 1,2,3-grams) to the same pairs with rare n-grams removed.

Example:
F1 [A, B, C, C, A B, A B C, ..] → [A, B, C, C, A B, ..]
F2 [B, D, E, A, B D E, E A, ..] → [B, D, E, A, B D E, ..]
F3 [A, B, D, E, B D, B D E, …] → [A, B, D, E, B D, B D E, …]
F4 [C, C, C, C C, C C C, …] → [C, C, C, C C, …]
F5 [B, F, G, H, G H, …] → [B, H, …]
F6 [A, E, O, U, E O, A U,…] → [A, E,…]
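
A map-only sketch of this filtering step, assuming the DF counts are shipped to each mapper (in practice via the distributed cache); the file name df-counts.txt and the MIN_DF cutoff are illustrative assumptions, not the authors' settings.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: drops n-grams whose document frequency is below a threshold.
public class FilterMapper extends Mapper<Text, Text, Text, Text> {
    private final Map<String, Integer> df = new HashMap<>();
    private static final int MIN_DF = 2; // hypothetical cutoff, not the authors' value

    @Override
    protected void setup(Context context) throws IOException {
        // Load the (token, DF) output of the DF job; path handling simplified.
        try (BufferedReader in = new BufferedReader(new FileReader("df-counts.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");
                df.put(parts[0], Integer.parseInt(parts[1]));
            }
        }
    }

    @Override
    protected void map(Text filename, Text grams, Context context)
            throws IOException, InterruptedException {
        StringBuilder kept = new StringBuilder();
        for (String gram : grams.toString().split("\t")) {
            if (df.getOrDefault(gram, 0) >= MIN_DF) {
                if (kept.length() > 0) kept.append('\t');
                kept.append(gram);
            }
        }
        context.write(filename, new Text(kept.toString()));
    }
}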

SLIDE 9

TF-IDF Job

  • Mapper
    • Setup:
      • Read in the dictionary and DF score files
    • Map:
      • Map(“<authorid>_<lang>_<age>_<gender>.xml”, filtered token list) → (“<authorid>_<lang>_<age>_<gender>.xml”, VectorWritable)
      • Computes TF-IDF scores for each token
      • Creates a RandomAccessSparseVector (mahout-math)
      • Finally writes the vectors (see the sketch below)
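
A minimal sketch of such a mapper using mahout-math vectors, assuming the dictionary (token → index), DF map, and document count are loaded in setup(); the exact weighting variant shown (raw TF × log(N/DF)) is an assumption, not taken from the slides.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class TfIdfMapper extends Mapper<Text, Text, Text, VectorWritable> {
    private final Map<String, Integer> dictionary = new HashMap<>(); // token -> index
    private final Map<String, Integer> df = new HashMap<>();         // token -> DF
    private long numDocs;                                            // total document count

    // setup() would fill dictionary, df, and numDocs from the side files; omitted here.

    @Override
    protected void map(Text filename, Text grams, Context context)
            throws IOException, InterruptedException {
        // Term frequencies for this document.
        Map<String, Integer> tf = new HashMap<>();
        for (String gram : grams.toString().split("\t")) {
            tf.merge(gram, 1, Integer::sum);
        }
        Vector vector = new RandomAccessSparseVector(dictionary.size());
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer index = dictionary.get(e.getKey());
            if (index == null) continue; // token was filtered out
            // Assumed variant: tf-idf(t, d) = tf(t, d) * log(N / df(t))
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            vector.setQuick(index, e.getValue() * idf);
        }
        context.write(filename, new VectorWritable(vector));
    }
}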
SLIDE 10

Training

  • Trained on:
    • Naïve Bayes (MR)
    • Cosine Similarity (MR)
    • Weighted Cosine Similarity (MR)
    • Logistic Regression (LibLinear)
    • SVM (LibLinear)
  • Final model uses LibLinear’s logistic regression (see the sketch below)
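
A minimal sketch of the final training step, assuming the liblinear-java port (de.bwaldvogel.liblinear), where SolverType.L2R_LR is L2-regularized logistic regression; the toy feature vectors, labels, and parameter values below are placeholders, not the authors' configuration.

import de.bwaldvogel.liblinear.Feature;
import de.bwaldvogel.liblinear.FeatureNode;
import de.bwaldvogel.liblinear.Linear;
import de.bwaldvogel.liblinear.Model;
import de.bwaldvogel.liblinear.Parameter;
import de.bwaldvogel.liblinear.Problem;
import de.bwaldvogel.liblinear.SolverType;

public class TrainLogisticRegression {
    public static void main(String[] args) {
        Problem problem = new Problem();
        problem.l = 2;                      // number of training documents (toy)
        problem.n = 4;                      // number of features
        problem.x = new Feature[][] {       // sparse TF-IDF vectors as (index, value)
            { new FeatureNode(1, 0.5), new FeatureNode(3, 1.2) },
            { new FeatureNode(2, 0.9), new FeatureNode(4, 0.3) },
        };
        problem.y = new double[] { 0, 7 };  // class ids from the 10 age-gender classes

        // L2-regularized logistic regression, cost C = 1.0, stopping tolerance 0.01.
        Parameter param = new Parameter(SolverType.L2R_LR, 1.0, 0.01);
        Model model = Linear.train(problem, param);

        double predicted = Linear.predict(model, problem.x[0]);
        System.out.println("predicted class id: " + predicted);
    }
}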
SLIDE 11

Experiments

  • Local Hadoop cluster with 1 master node and 7 slave nodes
  • Each node has 16 cores and 12 GB of memory
  • Training data split in a 70:30 ratio into training and development sets
  • Modeled as a 10-class classification problem (5 age groups × 2 genders; see the sketch below)
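
One possible encoding of the joint label (hypothetical, not from the slides): the 5 age groups and 2 genders are crossed into a single class id in [0, 9].

// Hypothetical encoding of the joint label: 5 age groups x 2 genders = 10 classes.
public class AgeGenderLabel {
    static final String[] AGES = { "18-24", "25-34", "35-49", "50-64", "65-plus" };
    static final String[] GENDERS = { "male", "female" };

    // Maps an (age group, gender) pair to a single class id in [0, 9].
    static int classId(int ageIndex, int genderIndex) {
        return ageIndex * GENDERS.length + genderIndex;
    }

    public static void main(String[] args) {
        System.out.println(classId(1, 0)); // "25-34" + "male" -> class 2
    }
}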
SLIDE 12

Experiments

                             English (%)                              Spanish (%)
Classification Algorithm     Blogs   Reviews  Social Media  Twitter   Blog    Social Media  Twitter
Naïve Bayes                  25.00   18.99    18.33         24.44     40.00   19.68         23.91
Cosine Similarity            20.00   21.63    17.90         30.00     50.00   21.81         26.09
Weighted Cosine Similarity   20.00   21.15    16.78         23.33     40.00   19.68         28.26
Logistic Regression          22.50   21.71    16.78         25.56     35.00   23.67         17.39
SVM                          20.00   20.83    15.92         24.44     35.00   23.14         17.39

Table 3: Accuracy for character 2,3-grams on the cross-validation dataset.

                             English (%)                              Spanish (%)
Classification Algorithm     Blogs   Reviews  Social Media  Twitter   Blog    Social Media  Twitter
Naïve Bayes                  27.50   21.55    20.62         28.89     55.00   20.48         34.78
Cosine Similarity            20.00   23.64    19.72         27.78     35.00   26.33         36.96
Weighted Cosine Similarity   30.00   23.16    19.97         26.67     40.00   22.07         32.61
Logistic Regression          27.50   23.08    20.62         33.33     35.00   25.80         32.61
SVM                          25.00   22.28    19.80         32.22     30.00   26.33         34.78

Table 2: Accuracy for word 1,2,3-grams on the cross-validation dataset.

SLIDE 13

Experiments

                             English (%)                       Spanish (%)
Classification Algorithm     Separate Models  Single Model     Separate Models  Single Model
Naïve Bayes                  21.21            20.13            23.53            21.04
Cosine Similarity            19.89            17.34            27.83            27.60
Weighted Cosine Similarity   21.32            18.18            23.98            24.89
Logistic Regression          21.83            21.92            26.92            28.96
SVM                          20.99            20.48            27.37            28.05

Table 4: Accuracy for single and separate models for all categories.

  • Separate Model: different models for blog, social media, twitter, and reviews per language
  • Single Model: a single, combined model for each language
SLIDE 14

Results

System        Average Accuracy (%)
PAN’14 Best   28.95
Ours          27.60
Baseline      14.04

Table 5: Accuracy comparison with other systems.

  • Number of features in English: 7,299,609
  • Number of features in Spanish: 1,154,270
SLIDE 15

Results

                          Test 1                              Test 2
Language  Category        Both   Age    Gender  Runtime      Both   Age    Gender  Runtime
English   Blog            16.67  25.00  54.17   00:01:50     23.08  38.46  57.69   0:01:56
English   Reviews         20.12  28.05  62.80   00:01:46     22.23  33.31  66.87   0:02:13
English   Social Media    20.09  36.27  53.32   00:07:18     20.62  36.52  53.82   0:26:31
English   Twitter         40.00  43.33  73.33   00:02:01     30.52  44.16  66.88   0:02:31
Spanish   Blog            28.57  42.86  57.14   00:00:35     25.00  46.43  42.86   0:00:39
Spanish   Social Media    30.33  40.16  68.03   00:01:13     28.45  42.76  64.49   0:03:26
Spanish   Twitter         61.54  69.23  88.46   00:00:43     43.33  61.11  65.56   0:01:10

Table 6: Accuracy (%) and runtime by category and language on the test dataset.


SLIDE 16

Conclusion

  • Word n-grams proved to be better features than character n-grams for this task
  • MapReduce is ideal for feature extraction from large datasets
  • Our system works better when there is a large dataset
  • Simple approaches can work
SLIDE 17

Demo

  • http://coral-projects.cis.uab.edu:8080/authorprofile14/

SLIDE 18

Thank you.