How have Data Science Skills Evolved? A case study using embeddings


  1. How have Data Science Skills Evolved? A case study using embeddings. Maryam Jahanshahi, Ph.D., Research Scientist, TapRecruit.co. http://bit.ly/dataengconf2018

  2. TapRecruit uses NLP to understand career content: converting unstructured documents into structured data.
     - Smart Editor for JDs: data-driven suggestions on both the content and language use in job descriptions.
     - Pipeline Health Monitoring: analytics dashboards to help diagnose quality and diversity issues in talent pipelines.
     - Salary Estimation: data-driven salary estimates based on a job’s requirements rather than just title and location.

  3. Language matters in job descriptions. Same title, different job: Finance Manager (Kraft Foods) vs Finance Manager (Roche).
     - Required Experience: Senior (6-8 Years) vs Junior (3 Years)
     - Required Responsibility: No Managerial Experience vs Division Level Controller
     - Preferred Skill: Strategic Finance Role
     - Required Education: MBA / CPA

  4. Language matters in job descriptions. Same title, different job: Finance Manager (Kraft Foods) vs Finance Manager (Roche).
     - Required Experience: Senior (6-8 Years) vs Junior (3 Years)
     - Required Responsibility: No Managerial Experience vs Division Level Controller
     - Preferred Skill: Strategic Finance Role
     - Required Education: MBA / CPA
     Different title, same job: Performance Marketing Manager (PocketGems) vs Senior Analyst, Customer Strategy (The Gap).
     - Required Experience: Mid-Level vs Mid-Level
     - Required Skills: Quantitative Focus vs Quantitative Focus
     - Required Experience: iBanking Expertise vs Finance Expertise
     - Required Skills: Data Analysis Tools (SQL) vs Relational Database Experience
     - Preferred Experience: Consulting Experience Preferred vs External Consulting Experience Preferred
     - Preferred Education: MBA Preferred vs BA in Accounting, Finance, MBA Preferred

  5. How have data science skills changed over time?

  6. Strategies to identify changes within datasets. Manual Feature Extraction (e.g. MBA, SQL, PhD, Tableau, Python, PowerBI): requires a priori selection of key attributes, which makes it difficult to discover new attributes.

  7. Strategies to identify changes within datasets.
     - Manual Feature Extraction (e.g. MBA, SQL, PhD, Tableau, Python, PowerBI): requires a priori selection of key attributes, which makes it difficult to discover new attributes.
     - Dynamic Topic Models: use a bag-of-words approach and require experimentation with the number of topics. Figure (top topic words by year; labels: Matter, Quantum, Electron), adapted from Blei and Lafferty, ICML 2006:
       1880: force, energy, motion, differ, light
       1920: atom, theory, electron, energy, measure
       1960: radiat, energy, electron, measure, ray
       2000: state, energy, electron, magnet, field

  8. Word embeddings capture semantic similarities. Example sentences, with "Python" marked as the word and the surrounding text as its context:
     - Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
     - Experience in Python, Java or other object-oriented programming languages
     - Proficiency programming in Python, Java or C++.

  9. Word embeddings capture semantic similarities. Example sentences, with "Python" marked as the word and the surrounding text as its context:
     - Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
     - Experience in Python, Java or other object-oriented programming languages
     - Proficiency programming in Python, Java or C++.
     Words plotted: Python.

  10. Word embeddings capture semantic similarities. Example sentences, with "Python" marked as the word and the surrounding text as its context:
     - Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
     - Experience in Python, Java or other object-oriented programming languages
     - Proficiency programming in Python, Java or C++.
     Words plotted: Python, Object-orientated, Programming Language, Java, C++.

  11. Word embeddings capture semantic similarities. Example sentences, with "Python" marked as the word and the surrounding text as its context:
     - Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
     - Experience in Python, Java or other object-oriented programming languages
     - Proficiency programming in Python, Java or C++.
     Words plotted: Python, Object-orientated, Programming Language, Java, C++, Esperanto, French, German, Japanese.
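
The word/context idea on these slides can be sketched in a few lines. This is a minimal illustration, assuming a simple whitespace tokenizer and a symmetric window; the function name `context_pairs` and the window of 2 are illustrative choices, not something the talk specifies.

```python
def context_pairs(tokens, window=2):
    """Return (word, context) pairs for every token within `window` positions."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


# Third example sentence from the slide, lowercased and stripped of punctuation.
sentence = "proficiency programming in python java or c++".split()

# Context words that an embedding model would see for the word "python".
python_contexts = [c for w, c in context_pairs(sentence, window=2) if w == "python"]
```

Words that keep appearing with the same context words (here, "java" and "c++" share most of the contexts of "python") are the ones an embedding model tends to place close together.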

  12. Embeddings capture entity relationships. Dimensionality enables comparison between word pairs along many axes. Hierarchies (company and CEO): Exxon and Tillerson, Wal-Mart and McMillon, Viacom and Dauman, Verizon and McAdam, Vodafone and Colao. Adapted from Stanford NLP GloVe Project.

  13. Embeddings capture entity relationships. Dimensionality enables comparison between word pairs along many axes.
     - Hierarchies (company and CEO): Exxon and Tillerson, Wal-Mart and McMillon, Viacom and Dauman, Verizon and McAdam, Vodafone and Colao.
     - Comparatives and Superlatives: slow, slower, slowest; short, shorter, shortest; strong, stronger, strongest.
     Adapted from Stanford NLP GloVe Project.

  14. Embeddings capture entity relationships. Dimensionality enables comparison between word pairs along many axes.
     - Hierarchies (company and CEO): Exxon and Tillerson, Wal-Mart and McMillon, Viacom and Dauman, Verizon and McAdam, Vodafone and Colao.
     - Comparatives and Superlatives: slow, slower, slowest; short, shorter, shortest; strong, stronger, strongest.
     - Analogy: Woman :: Queen as Man :: ? (words plotted: Man, Woman, King, Queen).
     Adapted from Stanford NLP GloVe Project.
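
The "Woman :: Queen as Man :: ?" question can be answered with vector arithmetic. The sketch below assumes hand-made 2-dimensional toy vectors (first axis roughly "royalty", second roughly "gender"); real embeddings have hundreds of dimensions, and the values and the distractor word here are invented purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not taken from any trained model).
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "python": [-1.0, 0.2],  # unrelated distractor
}


def analogy(a, b, c, vectors):
    """Solve a :: b as c :: ? by finding the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a new word.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


answer = analogy("woman", "queen", "man", toy)
```

With these toy values the offset from "woman" to "queen" is the same as the offset from "man" to "king", which is the property the slide illustrates.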

  15. Pretrained embeddings facilitate fast prototyping. Pipeline stages: Corpus Generation, Corpus Processing, Language Model Generation, Language Model Tuning, Final Application.

  16. Pretrained embeddings facilitate fast prototyping. Pipeline stages: Corpus Generation, Corpus Processing, Language Model Generation, Language Model Tuning, Final Application. Available pretrained models by corpus:
     - Twitter: 27 B tokens, 1.2 M vocabulary, GloVe, 25-200 d vectors
     - Common Crawl: 42-840 B tokens, 1.9-2.2 M vocabulary, GloVe, 300 d vectors
     - GoogleNews: 100 B tokens, 3 M vocabulary, word2vec, 300 d vectors
     - Wikipedia: 6 B tokens, 400 k vocabulary, GloVe, 50-300 d vectors
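
For quick prototyping, pretrained vectors distributed as plain text (one word per line followed by its vector components, as the GloVe downloads are) can be read without any special library. The sketch below is a minimal example under that assumption; the three-line sample and its values are made up to stand in for a real file, and `load_text_vectors` is a hypothetical helper name.

```python
import io


def load_text_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:  # skip blank or malformed lines
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny invented stand-in for a real pretrained file; values are illustrative only.
sample = io.StringIO("python 0.1 0.2 0.3\njava 0.2 0.1 0.4\nsql 0.0 0.5 0.1\n")
vectors = load_text_vectors(sample)

# Check how much of a job description's vocabulary the model actually covers.
job_tokens = ["python", "sql", "powerbi", "tableau"]
missing = [t for t in job_tokens if t not in vectors]
coverage = 1 - len(missing) / len(job_tokens)
```

A coverage check like this is a cheap first test of whether a general-purpose model knows the domain vocabulary before investing in a custom one.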

  17. Problems with pretrained embedding models
     - Casing: abbreviations vs words, e.g. IT vs it
     - Out of Vocabulary Words: domain-specific words & acronyms
     - Polysemy: words with multiple meanings, e.g. drive (a car) vs drive (results), Chef (the job) vs Chef (the language)
     - Multi-word Expressions: phrases that have new meanings, e.g. Front-end vs front + end
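
Two of the problems above, casing and multi-word expressions, are often handled before training by normalizing tokens. The sketch below is one possible approach, not the method used in the talk: the `ACRONYMS` and `PHRASES` lookups are hypothetical hand-written lists, and the regular expression is a deliberately crude tokenizer.

```python
import re

# Hypothetical lookups; in practice these would be built from the corpus.
ACRONYMS = {"IT", "SQL"}
PHRASES = {("front", "end"): "front_end", ("machine", "learning"): "machine_learning"}


def normalize(text):
    """Lowercase everything except known acronyms, then merge known phrases."""
    raw = re.findall(r"[A-Za-z+#]+", text)
    tokens = [t if t in ACRONYMS else t.lower() for t in raw]
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])  # e.g. "front", "end" -> "front_end"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = normalize("IT experience: it is a Front-end role")
```

Keeping "IT" distinct from "it" and "front_end" distinct from "front" plus "end" lets the model learn a separate vector for each meaning.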

  18. Tools for developing custom language models. Modularized for different data and modeling requirements. Tools shown: SyntaxNet, CoreNLP.
     - Corpus Processing: tokenization, POS tagging, sentence segmentation, dependency parsing
     - Language Modeling: different word embedding models (GloVe, word2vec, fastText)

  19. Hyperparameter tuning on final model outputs. Window sizes capture semantic similarity vs semantic relatedness.
     - Small Window Size: captures semantic similarity, substitutes and word-level differences. Words plotted: Python, Java, C++, Object-orientated, Programming Language, Esperanto, French, German, Japanese.

  20. Hyperparameter tuning on final model outputs. Window sizes capture semantic similarity vs semantic relatedness.
     - Small Window Size: captures semantic similarity, substitutes and word-level differences. Words plotted: Python, Java, C++, Object-orientated, Programming Language, Esperanto, French, German, Japanese.
     - Large Window Size: captures semantic relatedness, alternatives and domain-level differences. Words plotted: Statistical modeling, SPSS, Software, Python, Programming Language, Java, C++, Object-orientated, Esperanto, French, German, Japanese.
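
The effect of window size can be seen on a single sentence from the earlier slides. This is a small sketch, assuming whitespace tokens with punctuation removed; `contexts_of` is a hypothetical helper, and the window values 2 and 8 are arbitrary.

```python
def contexts_of(tokens, target, window):
    """Collect the set of words within `window` positions of each `target`."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            found.update(tokens[lo:i] + tokens[i + 1:hi])
    return found


tokens = "statistical modeling through software spss or programming language python".split()

small = contexts_of(tokens, "python", window=2)  # only the immediate neighbours
large = contexts_of(tokens, "python", window=8)  # the whole sentence
```

With the small window, "python" only sees "programming" and "language", the same contexts a substitute such as Java would have. With the large window it also sees "statistical", "modeling" and "spss", which ties it to the broader domain rather than to direct substitutes.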

  21. Career language embedding model. Identified equal opportunity and perks language.

  24. Career language embedding model. Identified 'soft' skills and language around experience.

  26. I’ve got 300 dimensions… but time ain’t one

  27. Two approaches to connect embeddings (figure: time slices 2015 to 2018).
     - Static embeddings stitched together. References: Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315.
     - Dynamic embeddings trained together. References: Bamler and Mandt, arXiv:1702.08359; Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607; Rudolph and Blei, arXiv:1703.08052.

  28. Two approaches to connect embeddings (figure: time slices 2015 to 2018).
     - Static embeddings stitched together. Data hungry: each time slice needs sufficient data for a quality embedding. Requires alignment: each time slice is trained independently, so dimensions are not comparable across slices. References: Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315.
     - Dynamic embeddings trained together. References: Bamler and Mandt, arXiv:1702.08359; Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607; Rudolph and Blei, arXiv:1703.08052.
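
To make the alignment requirement concrete: two independently trained slices can encode the same geometry in differently oriented axes, so one slice has to be rotated onto the other before vectors are compared. The sketch below is a deliberately simplified 2-dimensional version that solves only for a rotation angle; in higher dimensions this is usually posed as an orthogonal Procrustes problem and solved with an SVD. The vectors and the 0.7 radian offset are invented for illustration.

```python
import math


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by `theta` radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def best_rotation(source, target):
    """Angle that rotates `source` onto `target` with least squared error (2-D only)."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


# Invented toy slices: 2016 has the same geometry as 2015, arbitrarily rotated.
slice_2015 = {"python": (1.0, 0.0), "sql": (0.0, 1.0), "mba": (-1.0, 0.5)}
slice_2016 = {w: rotate(v, 0.7) for w, v in slice_2015.items()}

# Fit the rotation on the shared vocabulary, then apply it to the whole slice.
shared = sorted(slice_2015)
theta = best_rotation([slice_2016[w] for w in shared], [slice_2015[w] for w in shared])
aligned_2016 = {w: rotate(v, theta) for w, v in slice_2016.items()}
```

After alignment, any remaining distance between a word's 2015 and 2016 vectors reflects a change in usage rather than an arbitrary difference in axes.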

  29. Two approaches to connect embeddings (figure: time slices 2015 to 2018).
     - Static embeddings stitched together. Data hungry: each time slice needs sufficient data for a quality embedding. Requires alignment: each time slice is trained independently, so dimensions are not comparable across slices. References: Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315.
     - Dynamic embeddings trained together. Data efficient: treats each time slice as a sequential latent variable, which accommodates time slices with sparse data. Does not require alignment: treating the time slice as a variable ensures embeddings are connected across slices. References: Bamler and Mandt, arXiv:1702.08359; Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607; Rudolph and Blei, arXiv:1703.08052.
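
One way to picture "trained together" is a term in the training objective that discourages a word's vector from jumping between consecutive time slices. The sketch below assumes a plain squared-distance penalty; the slide does not state the exact objective, so `drift_penalty`, the `strength` parameter and the example trajectories are illustrative assumptions only.

```python
def drift_penalty(trajectory, strength=1.0):
    """Sum of squared distances between a word's vectors in consecutive slices."""
    total = 0.0
    for prev, curr in zip(trajectory, trajectory[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return strength * total


# Invented trajectories for one word across three yearly slices.
smooth = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]]   # drifts gradually
jumpy = [[0.0, 0.0], [1.0, -1.0], [0.2, 0.1]]   # middle slice is an outlier

smooth_cost = drift_penalty(smooth)
jumpy_cost = drift_penalty(jumpy)
```

Adding a penalty of this kind to each slice's embedding loss pulls a slice with little data toward its neighbours, which is why sparse slices remain usable and why no separate alignment step is needed.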
