Eliminating the regular expression


  1. Eliminating the regular expression Datalogue

  2. Tim Delisle: Datalogue, CEO & Co-Founder; Cornell Tech, MS Htech; Merck, Data science & Insights

  3. Feeling the Pain (me, feeling the pain)

  4. Obsession: How might we automate the mundane, painful process of data preparation to get data into the hands of the people who need it?

  5. Data prep means many different things to different people: casual data user, data engineer, data scientist

  6. But the process is similar: (1) Semantic + structural understanding; (2) Parsing of unstructured data; (3) Translation of data from one format to another

  7. Mardi 22 Mars, 1991 | Tuesday April 8th 1962 | 05/5/2017

  8. {day: 22, month: “march”, year: 1991 } {day: 08, month: “april”, year: 1962 } {day: 05, month: “may”, year: 2017 }

  9. (1) Semantic + structural understanding: Mardi 22 Mars, 1991 | Tuesday April 8th 1962 | 05/5/2017

  10. (1) Semantic + structural understanding. Dates: Mardi 22 Mars, 1991 | Tuesday April 8th 1962 | 05/5/2017

  11. (1) Semantic + structural understanding. Dates: Mardi_22_Mars, 1991_ | Tuesday_April_8th_1962_ | 05/5/2017_

  12. (1) Semantic + structural understanding. Dates: Mardi_22_Mars, 1991_ | Tuesday_April_8th_1962_ | 05/5/2017_

  13. (2) Parsing of compound/unstructured data. Dates: Mardi 22 Mars, 1991 | Tuesday April 8th 1962 | 05/5/2017. Token labels: WD (week day), D (day), M (month), Y (year)

  14. (3) Translation of data from one format to another: Mardi 22 Mars, 1991 -> {day: 22, month: “march”, year: 1991 } | Tuesday April 8th 1962 -> {day: 08, month: “april”, year: 1962 } | 05/5/2017 -> {day: 05, month: “may”, year: 2017 }

  15. How do we do this today?!?

  16. Regular expressions

  17. Regex approach. Dates: Mardi 22 Mars, 1991 | Tuesday April 8th 1962 | 05/05/2017. Patterns shown: ([a-zA-Z]{3,}( |,)) and (\d*)*\w*|(\d*)|(\d*)\/(\d*)

  18. “You can write a million test cases and regexs will still blow up in your hands” Jai Chaudhary, Google

  19. Regular expressions

  20. Regular expressions: Regexes…

  21. Regular expressions: Regexes… Impossible to scale!

  22. Regular expressions + Machine Learning

  23. Regex + Machine Learning approach. Dates by field order: Week Day, Day, Month, Year (Mardi 22 Mars, 1991); Week Day, Month, Day, Year (Tuesday March 22nd 1991); Month, Day, Year (03/22/1991). Hand-built features for one token: Length 4, # Letters 0, # Digits 4, # Special chars 0, Index special char -1, … Character types: Text, Numbers, Special chars

  24. Regular expressions + Machine Learning

  25. Regular expressions + Machine Learning: hand generated features

  26. Regular expressions + Machine Learning: hand generated features; hard to scale with new classes

  27. Regular expressions + Machine Learning

  28. Deep Learning

  29. “Convolutional neural networks take advantage of the 2D structure of the input”

  30. Address Phone Number

  31. Phone Number -> Char Embedding -> VDCNN -> Label

  32. Layers: 45 | Params: 1,016,101 | Test Acc: 94% | Highest error rate classes: Name -> Business Name

  33. Char Embedding -> ConvNet + Bidirectional LSTM. Input: 10 Airport Road SE,Salem,NY,97301. Parsed String: AAAAAAAAAAAAAAAAAAUCCCCCUSSUZZZZZ

  34. Layers: 7 | Params: 232,121 | Val Acc: 99.73%

  35. But the process is similar and can be automated: (1) Semantic + structural understanding, using VDCNN; (2) Parsing of unstructured data, using ConvNet + LSTMs; (3) Translation of data from one format to another, using Seq2Seq models

  36. Thank you! Ask away.
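
Note on slide 23: the hand-generated features listed there (length, number of letters, digits and special characters, index of the first special character) can be sketched in a few lines of Python. This is only an illustration, not the speaker's code: the function names `token_features` and `date_features` are hypothetical, and splitting on whitespace is an assumption, since the slide does not say how tokens are cut.

```python
def token_features(token: str) -> dict:
    """Hand-built features for one token, using the feature names from slide 23."""
    # Positions of every character that is neither a letter nor a digit.
    special = [i for i, ch in enumerate(token) if not ch.isalnum()]
    return {
        "length": len(token),
        "n_letters": sum(ch.isalpha() for ch in token),
        "n_digits": sum(ch.isdigit() for ch in token),
        "n_special": len(special),
        # -1 when the token contains no special character, as on the slide.
        "index_special": special[0] if special else -1,
    }


def date_features(text: str) -> list:
    """Featurise each token of a date string.

    Assumption: tokens are separated by whitespace.
    """
    return [(tok, token_features(tok)) for tok in text.split()]


if __name__ == "__main__":
    for tok, feats in date_features("Mardi 22 Mars, 1991"):
        print(tok, feats)
```

For the token `1991` this yields length 4, 0 letters, 4 digits, 0 special characters and index -1, matching the values on the slide. Every new class of data needs its own set of such features, which is the scaling problem slide 26 points at.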

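Note on slide 33: the parsed string carries one label per input character, so turning the model output into fields is a matter of grouping runs of equal labels. The sketch below shows that decoding step only, not the network. The meaning of the letters (A = address, C = city, S = state, Z = zip, U = separator) is an assumption inferred from the example, and `decode` and `LABEL_NAMES` are hypothetical names.

```python
from itertools import groupby

# Assumed meaning of the label letters; the slide shows only the letters.
LABEL_NAMES = {"A": "address", "C": "city", "S": "state", "Z": "zip"}


def decode(text: str, labels: str) -> dict:
    """Group a per-character label string into named fields."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    fields = {}
    pos = 0
    for label, run in groupby(labels):
        n = len(list(run))
        name = LABEL_NAMES.get(label)
        if name is not None:  # skip 'U' (assumed: separator / unlabelled)
            fields.setdefault(name, []).append(text[pos:pos + n])
        pos += n
    # Join pieces in case a field appears in more than one run.
    return {name: " ".join(parts) for name, parts in fields.items()}


if __name__ == "__main__":
    text = "10 Airport Road SE,Salem,NY,97301"
    # Label string of slide 33, written run by run so its length is easy to check.
    labels = "A" * 18 + "U" + "C" * 5 + "U" + "S" * 2 + "U" + "Z" * 5
    print(decode(text, labels))
```

Because the labels are aligned character by character, no delimiter rules or regular expressions are needed at this stage; all of the format knowledge sits in the model that produces the label string.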