
A day in the life of a functional data scientist - Richard Minerich, Director of R&D at Bayard Rock



  1. A day in the life of a functional data scientist Richard Minerich, Director of R&D at Bayard Rock @Rickasaurus

  2. Projecting onto a 2D Plane

  3. The Pairwise Entity Resolution Process
     Blocking
       • In: Two Datasets (Customer Data and Sanctions)
       • Out: Pairs of Somehow Similar Records
     Scoring
       • In: Pairs of Records
       • Out: Probability of Representing Same Entity
     Review
       • In: Records, Probability, Similarity Features
       • Out: True/False Labels (Mostly by Hand)

  4. Blocking
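
The slide itself carries only a title, so the following is a minimal sketch of what a blocking step can look like, not the method used in the talk. The `Person` record, the `blockingKey` function, and the choice of key (first letter of the last name token) are all assumptions made for illustration.

```fsharp
// Hypothetical record type; the real schemas are not shown in the talk.
type Person = { Id: int; Name: string }

// Assumed blocking key: first letter of the last whitespace-separated name token.
let blockingKey (p: Person) =
    let tokens =
        p.Name.ToUpperInvariant().Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    if tokens.Length = 0 then "" else tokens.[tokens.Length - 1].Substring(0, 1)

// Pair each customer only with the sanctions records that share its key,
// instead of scoring every customer against every sanctions record.
let candidatePairs (customers: Person list) (sanctions: Person list) =
    let byKey = sanctions |> List.groupBy blockingKey |> Map.ofList
    customers
    |> List.collect (fun c ->
        match Map.tryFind (blockingKey c) byKey with
        | Some matches -> matches |> List.map (fun s -> (c, s))
        | None -> [])
```

Only the surviving pairs go on to scoring. A key this crude trades recall for speed, so real systems usually combine several keys.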

  5. Scoring: Risk vs Probability (The Ideal)
     [Plot axes: Likely to Launder Money vs Probably the Same Person]

  6. The Reality (Dominated by Garbage)
     [Chart labels: 161,358; Tiny Bump; 937; Upper Threshold; 161]

  7. Let’s dig into a single point: Jimmy Cournoyer (El: 95 / SI: 16)

  8. Citation Network (Safe View)

  9. Relationship Network (Safe View)

  10. Flow of Drugs
      [Diagram labels: Rizzuto Crime Family; Hells Angels; Quebec; British Columbia; Jimmy “Cosmo” / “Superman” Cournoyer; New York/NYC; California; Reinvested in Cocaine; Bonanno Crime Family; El Chapo; Sinaloa Cartel; John “Big Man” Venizelos]

  11. Family & Friends
      [Diagram labels: Jorge Hank Rhon; $100s Millions; Citibank, CH; Brother Murdered]

  12. % Time Spent
      [Bar chart, y-axis 0 to 0.6; categories: Munging Data; Redoing Work / Investigating Problems; Fun Algorithms]

  13. Disgustingly Bad but Fairly Large Datasets
      ▪ Both Wide (many fields) and Tall (many records)
      ▪ From different systems (different encodings)
      ▪ Missing data
      ▪ Poorly merged data
      ▪ Extra data
      ▪ Non-unique IDs
      Example record:
        NAME     LARRY O BRIAN
        STATE    CANADA
        CITY     121 Buffalo Drive, Montreal, Quebec H3G 1Z2
        ADDRESS  NULL
        ZIP      00000
        DOB      10/24/80; 1/1/1979
      Every client is awful in a completely different way.

  14. SAM – Building for Bad Data
      ▪ Lazy Pure Functional Core
      ▪ Programmable Data Cleaning
      ▪ Programmable ETL
      ▪ Ad-Hoc Behaviors
      All with an F# Core and Barb for scripting.
      [Diagram labels: UI (C#) & Analysis (C#); Data & Config In; Data Out; Glue (F# and Barb); Algorithms (F#)]

  15. Other Kinds of Problems (sometimes even my fault)
      ▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins)
      ▪ Wrong version of data (e.g. bad sync in SQL)
      ▪ Bad configuration of dependencies
      The data lives in a locked-down environment, so feedback cycles are slow.
      Lesson: Be Paranoid

  16. F# Tools From Bayard Rock http://github.com/BayardRock

  17. FSharpWebIntellisense https://github.com/BayardRock/FSharpWebIntellisense

  18. iFSharp Notebook https://github.com/BayardRock/IfSharp

  19. Barb, a simple .NET record query language
      Name.Contains "John" and (Age > 20 or Weight > 200)
      https://github.com/Rickasaurus/Barb
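
For comparison, here is a rough sketch of the same filter written as an ordinary F# predicate. It does not use Barb's API, and the `Person` record is a hypothetical type chosen to match the field names in the query.

```fsharp
// Hypothetical record with the fields the Barb query refers to.
type Person = { Name: string; Age: int; Weight: int }

// Plain F# equivalent of: Name.Contains "John" and (Age > 20 or Weight > 200)
let matchesQuery (p: Person) =
    p.Name.Contains("John") && (p.Age > 20 || p.Weight > 200)

// Keep only the records that satisfy the predicate.
let filterPeople (people: Person list) = people |> List.filter matchesQuery
```

The difference is that the F# version is fixed at compile time, whereas a query-language string can be supplied as configuration without recompiling.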

  20. MITIE Dot Net (a wrapper for MIT’s MITIE)
      A Pegasus Airlines plane landed at an Istanbul airport Friday after a passenger "said that there was a bomb on board" and wanted the plane to land in Sochi, Russia, the site of the Winter Olympics, said officials with Turkey's Transportation Ministry.
      Tokens                   Classification
      Pegasus Airlines         ORGANIZATION
      Istanbul                 LOCATION
      Sochi                    LOCATION
      Russia                   LOCATION
      Turkey                   LOCATION
      Transportation Ministry  ORGANIZATION
      https://github.com/BayardRock/MITIE-Dot-Net

  21. Other F# Community Tools (Not by Us)
      ▪ Data Type Providers (SQL, OData, CSV, etc.)
      ▪ Language Type Providers (R, MATLAB, Python soon)
      ▪ Deedle (like Pandas but for F#)
      ▪ F# Charting

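  22. The Magic of Type Providers
      open Microsoft.FSharp.Data.TypeProviders  // needs a reference to FSharp.Data.TypeProviders

      // Types for the service are generated at compile time from its OData metadata
      type Netflix = ODataService<"http://odata.netflix.com">
      let netflix = Netflix.GetDataContext()

      // First 100 titles whose name contains "Avatar", sorted by name
      let avatarTitles =
          query { for t in netflix.Titles do
                  where (t.Name.Contains "Avatar")
                  sortBy t.Name
                  take 100 }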

  23. Type Providers! Libraries For Free! How it works!
      [Diagram labels: The World; Type Provider; Types; Compiler; Erased Types]

  24. Deedle (Like Python’s pandas but for F#) ▪ Designed with Data Type Providers in Mind ▪ Interops with the R Type Provider

  25. But what about algorithmic code?

  26. Ranking vs Regression
      ▪ In Regression you’re trying to guess a number; only distance matters
        ▪ May do a very bad job at ordering
      ▪ In Ranking you’re trying to figure out some order; only order matters
        ▪ May do a very bad job at providing a meaningful number
      Example: You’re a doctor with 20 spots open and 100 patients who want to see you today. Which method would be best for selecting 20?

  27. Regression
      $y = X\beta + \varepsilon$
      y is labels, X is features, β is weights, ε is errors

  28. “OLS” Regression via Gradient Descent in F#
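
The code for this slide did not survive in the transcript. What follows is a minimal sketch of batch gradient descent on mean squared error, written as one plausible version rather than the code shown in the talk. The names `dot` and `olsGradientDescent`, the fixed learning rate, and the fixed iteration count are all assumptions.

```fsharp
// Dot product of two equal-length vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun (acc: float) (x: float) (y: float) -> acc + x * y) 0.0 a b

// Batch gradient descent for least squares.
// xs: one feature row per sample (add a constant 1.0 column for an intercept).
// ys: one label per sample.
let olsGradientDescent (learningRate: float) (iterations: int) (xs: float[][]) (ys: float[]) =
    let n = float xs.Length
    let step (beta: float[]) =
        // residual_i = x_i . beta - y_i
        let residuals = Array.map2 (fun (x: float[]) (y: float) -> dot x beta - y) xs ys
        // gradient_j = (2/n) * sum_i residual_i * x_ij
        beta
        |> Array.mapi (fun j bj ->
            let sum =
                Array.fold2 (fun (acc: float) (x: float[]) (r: float) -> acc + r * x.[j]) 0.0 xs residuals
            bj - learningRate * (2.0 / n) * sum)
    // Apply the update a fixed number of times, starting from all-zero weights.
    let rec loop i beta = if i = 0 then beta else loop (i - 1) (step beta)
    loop iterations (Array.create (xs.[0].Length) 0.0)
```

With noise-free data generated from y = 1 + 2x and a small enough learning rate, the returned weights approach [1; 2].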

  29. Simple Ranking? You Can Use Regression.
      ▪ The features are the difference in would-be regression features
      ▪ The value to predict is the difference in rank
      Select 2 labeled samples randomly => (x1,y1) (x2,y2), then x = x1 – x2 and y = y1 – y2
                         Sample 1   Sample 2   Result
      Names?        (x)      1          1         0
      Addresses?    (x)      1          0         1
      DOB?          (x)      0          1        -1
      Same Person?  (y)      0          0         0

  30. Simple Ranking in F#
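
The code for this slide is also missing from the transcript. The sketch below shows the pairwise-difference transformation described on the previous slide, under assumed names (`Labeled`, `pairDifference`, `samplePairs`); it is not the talk's own code. The difference rows it produces can be passed to any regression fitter.

```fsharp
// Hypothetical container for one labeled sample.
type Labeled = { Features: float[]; Label: float }

// x = x1 - x2, y = y1 - y2
let pairDifference (a: Labeled) (b: Labeled) =
    { Features = Array.map2 (fun (x1: float) (x2: float) -> x1 - x2) a.Features b.Features
      Label = a.Label - b.Label }

// Draw `count` random pairs (with replacement) and turn each into a difference row.
let samplePairs (rng: System.Random) (count: int) (samples: Labeled[]) =
    Array.init count (fun _ ->
        let a = samples.[rng.Next(samples.Length)]
        let b = samples.[rng.Next(samples.Length)]
        pairDifference a b)
```

Applied to the two samples in the table on the previous slide, `pairDifference` yields features [0; 1; -1] and label 0.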

  31. Combined Ranking and Regression – D. Sculley You can improve your regression with ranking, and your ranking with regression. The best of both worlds!

  32. Combined Ranking and Regression – D. Sculley @ Google, Inc
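
One way to picture the combination is a stochastic gradient loop that, at each step, takes either a plain regression example or a pairwise-difference example. The sketch below assumes squared loss for both, a mixing parameter `alpha`, and no regularization. These are simplifications for illustration and should not be read as the exact algorithm from Sculley's paper.

```fsharp
// Hypothetical container for one labeled sample.
type Labeled = { Features: float[]; Label: float }

// One squared-loss SGD step on a single example: w <- w - rate * (w.x - y) * x
let sgdStep (rate: float) (w: float[]) (ex: Labeled) =
    let pred =
        Array.fold2 (fun (acc: float) (wi: float) (xi: float) -> acc + wi * xi) 0.0 w ex.Features
    let err = pred - ex.Label
    Array.map2 (fun (wi: float) (xi: float) -> wi - rate * err * xi) w ex.Features

// Ranking example built from two samples: x = x1 - x2, y = y1 - y2
let difference (a: Labeled) (b: Labeled) =
    { Features = Array.map2 (fun (x1: float) (x2: float) -> x1 - x2) a.Features b.Features
      Label = a.Label - b.Label }

// With probability alpha take a regression step, otherwise a ranking step.
let trainCombined (rng: System.Random) (alpha: float) (rate: float) (steps: int) (data: Labeled[]) =
    let pick () = data.[rng.Next(data.Length)]
    let rec loop i w =
        if i = 0 then w
        else
            let ex = if rng.NextDouble() < alpha then pick () else difference (pick ()) (pick ())
            loop (i - 1) (sgdStep rate w ex)
    loop steps (Array.create (data.[0].Features.Length) 0.0)
```

Setting `alpha` to 1.0 reduces this to plain regression, and 0.0 to pure pairwise ranking; values in between mix the two objectives.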

  33. Thank You! Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp Find out more about F#: http://fsharp.org Contact me on twitter: @Rickasaurus
