The Panama Papers, Graphs, and Data Science: Unravelling the Shady World of Offshore Finance One Data Structure at a Time


  1. The Panama Papers, Graphs, and Data Science: Unravelling the Shady World of Offshore Finance One Data Structure at a Time. Dr. Jim Webber, Chief Scientist, Neo4j

  2. About @jimwebber: graphs, databases, distributed systems. Socialist, activist, agitator, #SJW

  3. #panamapapers

  4. Disclaimer: Offshore companies are not illegal. There is no suggestion that parties listed in the Panama Papers documents have necessarily broken the law or acted improperly.

  5. Almost 400 journalists, based in 76 countries. “Our aim is to bring journalists from different countries together in teams - eliminating rivalry and promoting collaboration. Together, we aim to be the world’s best cross-border investigative team.” (icij.org/about)

  6. You may remember them from... #BahamasLeak

  7. Source Material
     • The ICIJ presentation
     • The Reddit AMA
     • Online publications (SZ, Guardian, TNW et al.)
     • The ICIJ website: https://panamapapers.icij.org/
       • The Power Players
       • Key Numbers & Figures

  8. Hidden Secrets No Longer: The leak exposed the offshore holdings of 12 current and former world leaders, and the dealings of 128 more politicians and public officials around the world. In all: 140 politicians from 50 countries, connected to companies in 21 tax havens.

  9. System Architecture

  10. Stack
      Unstructured data extraction
      ● Nuix professional OCR service
      ● ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), which leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO
      ● Python for wiring
      Database
      ● Apache Solr (open source, Java)
      ● Redis (open source, C)
      ● Neo4j (open source, Java)
      App
      ● Oxwall (open source, secure social network)
      ● Blacklight (open source, Rails)
      ● Linkurious (closed source, JS)
      Other
      ● Redis for queues
      ● Talend for ETL from other DBs
      ● AWS for cloud hardware

  11. Data Flow Architecture (diagram; labels: Raw Files, Meta-Data, Database, Search, Discovery, POWER)

  12. 3 million files for OCR × 10 seconds per file = 1 yr / 35 servers = 1.5 weeks. Investigators used Nuix’s optical character recognition to make millions of scanned documents text-searchable. They used Nuix’s named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.
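
The OCR estimate on slide 12 can be checked with a few lines of arithmetic. This is a rough sketch that uses only the three figures quoted on the slide and assumes perfectly parallel servers with no queueing or retry overhead; the function name `ocr_days` is made up for illustration.

```python
# Back-of-the-envelope check of the OCR estimate, using the slide's figures.
FILES = 3_000_000        # files to OCR (from the slide)
SECONDS_PER_FILE = 10    # OCR time per file (from the slide)
SERVERS = 35             # servers working in parallel (from the slide)

SECONDS_PER_DAY = 24 * 60 * 60


def ocr_days(files, seconds_per_file, servers=1):
    """Wall-clock days, assuming ideal parallelism and no overhead."""
    return files * seconds_per_file / servers / SECONDS_PER_DAY


single = ocr_days(FILES, SECONDS_PER_FILE)
parallel = ocr_days(FILES, SECONDS_PER_FILE, SERVERS)

print(f"1 server:   {single:.0f} days (~{single / 365:.2f} years)")
print(f"{SERVERS} servers: {parallel:.1f} days (~{parallel / 7:.1f} weeks)")
```

One server needs roughly 347 days, and 35 servers bring that down to just under 10 days, which matches the slide's rounded "1 yr" and "1.5 weeks".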

  13. 400 users. Lucene syntax queries with proximity matching!
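
To make "proximity matching" concrete: in Lucene's classic query syntax, a quoted phrase followed by `~N` asks for the terms to occur within N positions of each other. The sketch below builds such a query string and includes a deliberately simplified word-gap check to show the idea. The helper names and the `content` field are hypothetical, and the gap check is not Lucene's actual slop algorithm (which, among other things, penalises reversed term order).

```python
def proximity_query(terms, slop, field=None):
    """Build a Lucene-style proximity query such as "john miller"~3.

    No escaping of special characters is attempted in this sketch.
    """
    query = '"{}"~{}'.format(" ".join(terms), slop)
    return f"{field}:{query}" if field else query


def within_distance(text, first, second, max_gap):
    """Simplified illustration: True if both words occur with at most
    max_gap other words between them, in either order."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    pos_a = [i for i, w in enumerate(words) if w == first.lower()]
    pos_b = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(a - b) - 1 <= max_gap
               for a in pos_a for b in pos_b if a != b)


print(proximity_query(["mossack", "fonseca"], 5))
print(proximity_query(["john", "miller"], 3, field="content"))
print(within_distance("John signed for Miller Holdings", "john", "miller", 3))
```

A query like this lets an investigator find documents where a name and a company appear close together even when they are not adjacent.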

  14. Disconnected Documents

  15. Context is King (diagram of disconnected entities): PERSON {name: "John", last: "Miller", role: "Negotiator"}; PERSON {name: "Jose", last: "Pereia", position: "Governor"}; PERSON {name: "Alice", last: "Smith", role: "Advisor"}; PERSON {name: "Maria", last: "Osara"}; {name: "Some Media Ltd", value: "$70M"}

  16. Context is King (same diagram, now with a MENTIONS relationship {since: Jan 10, 2011} drawn between entities): PERSON {name: "John", last: "Miller", role: "Negotiator"}; PERSON {name: "Jose", last: "Pereia", position: "Governor"}; PERSON {name: "Alice", last: "Smith", role: "Advisor"}; PERSON {name: "Maria", last: "Osara"}; {name: "Some Media Ltd", value: "$70M"}

  17. We need to store and query connections between entities, whether they’re physical or inferred by algorithms or humans.

  18. Neo4j: All about Patterns. Diagram: node Dan, KNOWS, node Ann. In Cypher: (:Person {name:"Dan"})-[:KNOWS]->(:Person {name:"Ann"}), where Person is a label and name is a property. http://neo4j.com/developer/cypher

  19. Cypher: Find Patterns. Diagram: node Dan, KNOWS, node ???. In Cypher: MATCH (:Person {name:"Dan"})-[:KNOWS]->(who:Person) RETURN who, where Person is a label, name is a property, and who is an alias. http://neo4j.com/developer/cypher
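
To show what the MATCH pattern on slide 19 is asking for, here is a minimal in-memory analogue in plain Python. It is only an illustration of the idea of labels, properties and typed relationships. It is not how Neo4j stores or executes anything, and the `nodes`, `relationships` and `match_outgoing` names are invented for this sketch.

```python
# Tiny property graph: nodes carry labels and properties,
# relationships are (start_id, type, end_id) triples.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Dan"}},
    2: {"labels": {"Person"}, "props": {"name": "Ann"}},
}
relationships = [(1, "KNOWS", 2)]


def match_outgoing(start_label, start_props, rel_type, end_label):
    """Rough analogue of:
    MATCH (:start_label {start_props})-[:rel_type]->(who:end_label) RETURN who
    """
    results = []
    for start_id, r_type, end_id in relationships:
        start, end = nodes[start_id], nodes[end_id]
        if (r_type == rel_type
                and start_label in start["labels"]
                and all(start["props"].get(k) == v
                        for k, v in start_props.items())
                and end_label in end["labels"]):
            results.append(end["props"])
    return results


who = match_outgoing("Person", {"name": "Dan"}, "KNOWS", "Person")
print(who)  # [{'name': 'Ann'}]
```

The direction matters: asking who Ann KNOWS returns nothing, because the only relationship runs from Dan to Ann.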

  20. Data Model
      Meta Data Entities
      • Document, Email, Contract, DB-Record
      • Meta: Author, Date, Source, Keywords
      • Conversation: Sender, Receiver, Topic
      Actual Entities
      • Person
      • Representative (Officer)
      • Address
      • Client
      • Company
      • Account

  21. Data Model for Relationships
      Meta-Data
      • sent, received, cc’ed
      • mentioned, topic-of
      • created, signed
      • attached
      • roles
      • family relationships
      Activities
      • open account
      • manage
      • has shares
      • registered address
      • money flow

  22. The Basic ICIJ Data Model

  23. The Basic ICIJ Data Model in Neo4j

  24. The Real ICIJ Data Model

  25. What’s Been Delivered?

  26. Data initially exposed as interactive visualization • Public figures and leaders • Different shell companies & involvements

  27. @apcj @technige

  28. OSS Stack Enables • Find interesting spots with full-text and fuzzy search • See neighbourhoods of suspects and interesting facts • Find connections and shortest paths between seemingly disconnected information • Add new knowledge as relationships enriching the graph structure • Stories emerge from the collaboration • Add more information from other sources
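
The "shortest paths between seemingly disconnected information" point can be sketched with a breadth-first search over a small undirected graph. The names are borrowed from the "Context is King" slides, but every edge below is invented purely for illustration and says nothing about the real data. A graph database would normally do this with a built-in path query rather than hand-written code.

```python
from collections import deque

# Invented example edges: NOT taken from the Panama Papers data.
edges = [
    ("John Miller", "Some Media Ltd"),
    ("Some Media Ltd", "Alice Smith"),
    ("Alice Smith", "Maria Osara"),
    ("Jose Pereia", "Maria Osara"),
]


def build_adjacency(pairs):
    """Turn undirected edge pairs into a neighbour lookup."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbour in sorted(adjacency[current]):
            if neighbour not in previous:
                previous[neighbour] = current
                queue.append(neighbour)
    return None


graph = build_adjacency(edges)
path = shortest_path(graph, "John Miller", "Jose Pereia")
print(" -> ".join(path))
```

Two people with no direct link turn out to be four hops apart, which is the kind of connection that is hard to spot in disconnected documents.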

  29. Neo4j ICIJ Distribution We have also made a distribution of Neo4j available with the data in it. This will allow you to query the database to fully explore from your computer the connections between people and companies. The package also includes a guide that explains how to use Neo4j.

  30. What’s Been Discovered?

  31. Distorting markets: London is wonderful, but expensive. Tax-avoiding investors have been able to distort the property market to suit their objective of capital gains. Tax-avoidance multiplies their advantage to the disadvantage of regular Londoners.

  32. Tax is optional for the rich: Lionel Messi’s net worth is estimated at €200,000,000. Average income in Barcelona is a more modest €33,000 per year at headline 30% income tax. Messi remains entitled to roads, emergency services and all other benefits of citizenry in his host country.

  33. We’re not all in this together: Britain’s former Prime Minister declared that the country was “all in it together” after the 2008 financial collapse. The British people have seen massive declines in education, healthcare, and social services. Cameron benefitted from his dad’s investment fund’s involvement with Mossack Fonseca.

  34. Ice, Ice, Baby: Icelandic citizens took the brunt of their banking system’s collapse. Their Prime Minister had a conflict of interest in deciding how much government money would be used to compensate shareholders. He was a (transitive) beneficiary.

  35. Lava Jato: I’ll leave this one to you, folks. Bad behaviour spans borders, but so does our technology stack and our commitment to building better communities.

  36. What does this mean for us?

  37. Open source data technology democratises the capabilities that were once the domain of the Web giants. What they can do, we can approximate at low cost and with high effectiveness. Bad guys beware - it’s cheap to find you!

  38. One more thing

  39. Enjoy the conference! @jimwebber
