
Profiling Big Data sources to assess their selectivity Piet Daas - PowerPoint PPT Presentation



  1. Profiling Big Data sources to assess their selectivity. Piet Daas and Joep Burger. With special thanks to Marco Puts & Dong Nguyen.

  2. Big Data
     - More and more organizations want to use Big Data as a new or additional source of information
     - However, there are some major challenges:
       - Selectivity of Big Data
       - The source does not have to cover the target population completely
       - What part of the population is included?

  3. Profiling: extracting ‘features’
     - Extract background characteristics (‘features’) from the ‘units’ in Big Data in an attempt to determine its selectivity
       - The need for this depends on the ‘type’ of Big Data source and its foreseen use
     - Important background characteristics for statistics are:
       - Persons: gender, age, income, education, origin, urbanicity, household composition, ...
       - Companies: number of employees, turnover, type of economic activity, legal form, ...

  4. Social Media: Twitter as an example
     - On social media, persons, companies and ‘others’ can create an account and post messages
       - In the Netherlands, 70% of the population is active on social media
     - What kind of information about a user is available on Twitter?
       - Focus on gender!
     - Let’s look at a profile: @pietdaas

  5. [Figure: annotated Twitter profile showing the four information sources: 1) Name, 2) Short bio, 3) Message content, 4) Picture]

  6. Studied a Twitter sample
     - From a list of Dutch Twitter users (~330,000), a random sample of 1000 unique ids was drawn
     - Of the sample, 844 profiles still existed:
       - 844 had a name
       - 583 provided a short bio
       - 473 created ‘tweets’
       - 804 had a ‘non-default’ picture [figure: default Twitter picture]
     - Composition of the 844 profiles:
       - 409 men (49%)
       - 282 women (33%)
       - 153 ‘others’ (18%): companies, organizations, dogs, cats, ‘bots’, ...
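The availability of each information source can be expressed as a share of the 844 existing profiles. The following is a minimal sketch that only redoes the arithmetic on the counts reported above; the variable names are illustrative and not taken from the presentation.

```python
# Counts reported on the slide for the 844 sampled profiles that still existed.
existing = 844
counts = {
    "name": 844,
    "short bio": 583,
    "tweets": 473,
    "non-default picture": 804,
}

# Share of existing profiles for which each information source is available.
coverage = {source: n / existing for source, n in counts.items()}

for source, share in coverage.items():
    print(f"{source}: {share:.0%}")

# The manually determined groups (men, women, 'others') partition the 844 profiles.
groups = {"men": 409, "women": 282, "others": 153}
groups_total = sum(groups.values())
```

A short bio is available for roughly seven in ten profiles and tweets for somewhat more than half, so no single source other than the name covers every unit.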

  7. Gender findings: 1) First name
     - Used the Dutch ‘Voornamenbank’ website (first name database)
     - Each name gets a score between 0 and 1 (from female to male); 676 of 844 (80%) names were registered
     - Unknown names were scored -1 (usually companies or organizations)
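The scoring rule above can be sketched as a simple lookup. This is an assumption-laden illustration, not the authors' implementation: `NAME_SCORES` is a hypothetical stand-in for the first name database with made-up values, and taking the first token of the display name as the first name is a simplification.

```python
# Hypothetical lookup table standing in for a first-name database.
# Scores run from 0.0 (female) to 1.0 (male); the values are illustrative only.
NAME_SCORES = {"piet": 1.0, "joep": 1.0, "anna": 0.0}

UNKNOWN = -1.0  # the slide scores unregistered names as -1


def first_name_score(display_name: str) -> float:
    """Return the score of the first token of a display name, or -1 if unknown."""
    tokens = display_name.strip().lower().split()
    if not tokens:
        return UNKNOWN
    return NAME_SCORES.get(tokens[0], UNKNOWN)


print(first_name_score("Piet Daas"))
print(first_name_score("Statistics Netherlands"))
```

Returning a sentinel of -1 keeps unregistered names, which the slide says are usually companies or organizations, separate from the 0 to 1 scale.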

  8. Gender findings: 2) Short bio
     - If a short bio is provided:
       - Quite a number of people mention their ‘position’ in the family: mother, father, papa, mama, ‘son of’, etc.
       - Sometimes occupations that reflect the gender are also mentioned (‘studente’)
       - 155 of 583 (27%) indicated their gender in the short bio
       - Need to check both English and Dutch texts
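One way to picture the short bio check is a keyword match over Dutch and English terms. The sketch below is only an assumption about how such a check could look; the term lists are illustrative and are not the ones used in the study.

```python
import re

# Illustrative Dutch and English keyword lists (assumptions, not the study's lists).
FEMALE_TERMS = {"mother", "mom", "mama", "moeder", "daughter of", "dochter van", "studente"}
MALE_TERMS = {"father", "dad", "papa", "vader", "son of", "zoon van"}


def _contains(text, terms):
    """True if any term occurs in the text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(term)}\b", text) for term in terms)


def gender_from_bio(bio):
    """Return 'female', 'male', or None when the bio gives no clear indication."""
    text = bio.lower()
    female = _contains(text, FEMALE_TERMS)
    male = _contains(text, MALE_TERMS)
    if female and not male:
        return "female"
    if male and not female:
        return "male"
    return None  # no keyword found, or conflicting keywords
```

Whole-word matching matters here: the Dutch ‘studente’ indicates a female student, while the plain word ‘student’ should not trigger a match.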

  9. Gender findings: 3) Tweets content
     - In cooperation with the University of Twente (Dong Nguyen)
     - Machine learning approach that determines gender-specific writing style
       - Language specific: messages need to be Dutch!
       - 437 of 473 (92%) persons that created tweets could be classified

  10. Gender findings: 4) Profile picture
      - Use OpenCV to process pictures:
        1) Face recognition
        2) Standardisation of faces (resize & rotate)
        3) Classify faces according to gender
      - 603 of 804 (75%) profile pictures contained one or more faces

  11. Gender findings: overall results
      - Diagnostic Odds Ratio = (TP/FN) / (FP/TN)
      - log(DOR) = 0 corresponds to random guessing
      - Diagnostic Odds Ratio (log) per information source:
        - First name: 6.41
        - Short bio: 3.50
        - Tweet content: 2.36
        - Picture (faces): 0.72
      - Multi-agent findings
        - Need clever ways to combine these
        - Take the processing efficiency of the ‘agent’ into consideration
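The Diagnostic Odds Ratio formula above can be computed directly from a confusion matrix. The sketch below makes two assumptions: the slide does not state the base of the logarithm, so the natural logarithm is used here, and the function names are illustrative. The counts in the usage lines are made-up numbers, not results from the study.

```python
import math


def diagnostic_odds_ratio(tp, fn, fp, tn):
    """DOR = (TP/FN) / (FP/TN), as defined on the slide."""
    return (tp / fn) / (fp / tn)


def log_dor(tp, fn, fp, tn):
    """Natural log of the DOR (the log base is an assumption)."""
    return math.log(diagnostic_odds_ratio(tp, fn, fp, tn))


# Made-up confusion matrices, for illustration only.
print(log_dor(tp=50, fn=50, fp=50, tn=50))  # a classifier that guesses at random
print(diagnostic_odds_ratio(tp=90, fn=10, fp=20, tn=80))
```

A classifier that guesses at random has equal odds in both groups, so its DOR is 1 and its log(DOR) is 0, which is the reference point given on the slide. The formula is undefined when FN, FP, or TN is zero; this sketch raises `ZeroDivisionError` in that case rather than applying a correction.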

  12. Thank you for your attention!
