The Panama Papers, Graphs, and Data Science

slide-1
SLIDE 1

The Panama Papers, Graphs, and Data Science

Unravelling the Shady World of Offshore Finance One Data Structure at a Time

Dr. Jim Webber

Chief Scientist, Neo4j

slide-2
SLIDE 2

About @jimwebber

Graphs, databases, distributed systems

Socialist, activist, agitator, #SJW

slide-3
SLIDE 3

#panamapapers

slide-4
SLIDE 4

Disclaimer

Offshore companies are not illegal.

There is no suggestion that parties listed in the Panama Papers documents have necessarily broken the law or acted improperly.

slide-5
SLIDE 5

Almost 400 journalists

Based in 76 countries

“Our aim is to bring journalists from different countries together in teams - eliminating rivalry and promoting collaboration. Together, we aim to be the world’s best cross-border investigative team.”

icij.org/about

slide-6
SLIDE 6

You may remember them from...

#BahamasLeak

slide-7
SLIDE 7
slide-8
SLIDE 8

Source Material

  • The ICIJ presentation
  • The Reddit AMA
  • Online publications (SZ, Guardian, TNW et al.)
  • The ICIJ website
  • https://panamapapers.icij.org/
  • The Power Players
  • Key Numbers & Figures
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Exposed the offshore holdings of 12 current and former world leaders, and the dealings of 128 more politicians and public officials around the world.

In all: 140 politicians from 50 countries, connected to companies in 21 tax havens.

Hidden Secrets No Longer

slide-13
SLIDE 13

System Architecture

slide-14
SLIDE 14

Unstructured data extraction

  • Nuix professional OCR service
  • ICIJ Extract (open source, Java: https://github.com/ICIJ/extract), which leverages Apache Tika, Tesseract OCR and JBIG2-ImageIO

  • Python for wiring

Database

  • Apache Solr (open source, Java)
  • Redis (open source, C)
  • Neo4j (open source, Java)

App

  • Oxwall (open source, secure social network)
  • Blacklight (open source, Rails)
  • Linkurious (closed source, JS)

Other

  • Redis for queues
  • Talend for ETL from other DBs
  • AWS for cloud hardware

Stack

slide-15
SLIDE 15

POWER

Raw Files → Meta-Data → Database → Search → Discovery

Data Flow Architecture

slide-16
SLIDE 16

3 million files for OCR × 10 seconds per file ≈ 1 year on a single machine. Spread across 35 servers ≈ 1.5 weeks.
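
The arithmetic behind that estimate can be checked with a few lines of Python. This is only a sketch: it assumes the work divides evenly across servers with no overhead, and the function name `ocr_estimate` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ocr_estimate(files: int, seconds_per_file: float, servers: int) -> tuple[float, float]:
    """Return (days on one machine, days when split evenly across `servers`)."""
    serial_days = files * seconds_per_file / SECONDS_PER_DAY
    return serial_days, serial_days / servers


# Figures from the slide: 3 million files, 10 seconds each, 35 servers.
serial_days, parallel_days = ocr_estimate(3_000_000, 10, 35)
print(f"one server: {serial_days:.0f} days")
print(f"35 servers: {parallel_days:.1f} days ({parallel_days / 7:.1f} weeks)")
```

The result lands at roughly ten days on 35 servers, which is in line with the slide's rounded figure of 1.5 weeks.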

Investigators used Nuix’s optical character recognition to make millions of scanned documents text-searchable. They used Nuix’s named entity extraction and other analytical tools to identify and cross-reference the names of Mossack Fonseca clients through millions of documents.
slide-17
SLIDE 17

Lucene syntax queries with proximity matching! 400 users
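
As a rough illustration of proximity matching: in Lucene query syntax a quoted phrase followed by `~N` matches when the words occur within N positions of each other. The sketch below only builds such a query and a Solr `/select` URL. The field name `text`, the core name `documents`, the host and the search terms are all assumptions, not details of the ICIJ setup.

```python
from urllib.parse import urlencode


def proximity_query(field: str, terms: list[str], slop: int) -> str:
    """Build a Lucene proximity clause, e.g. text:"nominee director"~5.

    Assumes the terms are plain words with no quotes or special characters.
    """
    phrase = " ".join(terms)
    return f'{field}:"{phrase}"~{slop}'


def solr_select_url(base_url: str, query: str, rows: int = 10) -> str:
    """Encode the query as parameters for a Solr /select request."""
    return f"{base_url}/select?{urlencode({'q': query, 'rows': rows})}"


# Hypothetical field, core and terms, for illustration only.
query = proximity_query("text", ["nominee", "director"], 5)
url = solr_select_url("http://localhost:8983/solr/documents", query)
print(query)
print(url)
```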

slide-18
SLIDE 18

Disconnected Documents

slide-19
SLIDE 19

Context is King

name: "John", last: "Miller", role: "Negotiator"
name: "Maria", last: "Osara"
name: "Some Media Ltd", value: "$70M"
name: "Jose", last: "Pereia", position: "Governor"
name: "Alice", last: "Smith", role: "Advisor"

PERSON PERSON PERSON PERSON

slide-20
SLIDE 20
slide-21
SLIDE 21

Context is King

MENTIONS

name: "John", last: "Miller", role: "Negotiator"
name: "Maria", last: "Osara"
since: Jan 10, 2011
name: "Some Media Ltd", value: "$70M"
name: "Jose", last: "Pereia", position: "Governor"
name: "Alice", last: "Smith", role: "Advisor"

PERSON PERSON PERSON PERSON

slide-22
SLIDE 22

Need to store and query connections between entities. Whether they’re physical or inferred by algorithms or humans.
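
A minimal sketch of what storing such connections could look like, in plain Python rather than a graph database. The node properties and the MENTIONS type come from the earlier slides. The `Document` node, the `Company` label, the `ASSOCIATED_WITH` type and the `source` property (recording who or what asserted the link) are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    properties: dict


@dataclass
class Relationship:
    start: int  # index of the start node
    rel_type: str
    end: int  # index of the end node
    properties: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def add_node(self, label, **properties) -> int:
        self.nodes.append(Node(label, properties))
        return len(self.nodes) - 1

    def relate(self, start, rel_type, end, **properties) -> None:
        self.relationships.append(Relationship(start, rel_type, end, properties))

    def connections(self, node_id):
        """Yield (type, other node, properties) for each relationship touching node_id."""
        for rel in self.relationships:
            if rel.start == node_id:
                yield rel.rel_type, self.nodes[rel.end], rel.properties
            elif rel.end == node_id:
                yield rel.rel_type, self.nodes[rel.start], rel.properties


graph = Graph()
john = graph.add_node("Person", name="John", last="Miller", role="Negotiator")
media = graph.add_node("Company", name="Some Media Ltd", value="$70M")  # label assumed
email = graph.add_node("Document", kind="email")  # hypothetical document node

# Links found in the source material itself.
graph.relate(email, "MENTIONS", john, source="document")
graph.relate(email, "MENTIONS", media, source="document")
# A link added later by a person; type and properties are hypothetical.
graph.relate(john, "ASSOCIATED_WITH", media, source="human")

for rel_type, other, props in graph.connections(john):
    print(rel_type, other.properties, props)
```

Keeping the provenance on the relationship itself means extracted and inferred links sit side by side and can still be told apart at query time.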

slide-23
SLIDE 23

Neo4j: All about Patterns

(:Person {name: "Dan"})-[:KNOWS]->(:Person {name: "Ann"})

Diagram: node Dan, relationship KNOWS, node Ann. Each NODE is annotated with its LABEL (Person) and PROPERTY (name).

http://neo4j.com/developer/cypher

slide-24
SLIDE 24

Cypher: Find Patterns

MATCH (:Person {name: "Dan"})-[:KNOWS]->(who:Person) RETURN who

Diagram: node Dan, relationship KNOWS, unknown node (???). Annotations mark each NODE, LABEL and PROPERTY, and the ALIAS (who).

http://neo4j.com/developer/cypher
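
To make the semantics of that query concrete, here is a plain-Python sketch of the same pattern match over an in-memory graph. It is not how Neo4j evaluates Cypher. It only mirrors what the pattern asks for, using the Dan and Ann data from the previous slide. The helper names `matches` and `find_pattern` are made up.

```python
# Nodes keyed by an arbitrary id; each has a set of labels and a property map.
nodes = {
    1: {"labels": {"Person"}, "properties": {"name": "Dan"}},
    2: {"labels": {"Person"}, "properties": {"name": "Ann"}},
}
# Directed relationships as (start id, type, end id).
relationships = [(1, "KNOWS", 2)]


def matches(node, label, properties=None):
    """True if the node carries the label and every requested property value."""
    wanted = properties or {}
    return label in node["labels"] and all(
        node["properties"].get(key) == value for key, value in wanted.items()
    )


def find_pattern(start_label, start_props, rel_type, end_label):
    """Yield end nodes for (start)-[:rel_type]->(end), like MATCH ... RETURN who."""
    for start_id, kind, end_id in relationships:
        if (
            kind == rel_type
            and matches(nodes[start_id], start_label, start_props)
            and matches(nodes[end_id], end_label)
        ):
            yield nodes[end_id]


who = list(find_pattern("Person", {"name": "Dan"}, "KNOWS", "Person"))
print([node["properties"]["name"] for node in who])  # prints ['Ann']
```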

slide-25
SLIDE 25

Data Model

Meta Data Entities

  • Document, Email, Contract, DB-Record
  • Meta: Author, Date, Source, Keywords
  • Conversation: Sender, Receiver, Topic

Actual Entities

  • Person
  • Representative (Officer)
  • Address
  • Client
  • Company
  • Account
slide-26
SLIDE 26

Data Model for Relationships

Meta-Data

  • sent, received, cc'ed
  • mentioned, topic-of
  • created, signed
  • attached
  • roles
  • family relationships

Activities

  • open account
  • manage
  • has shares
  • registered address
  • money flow
slide-27
SLIDE 27

The Basic ICIJ Data Model

slide-28
SLIDE 28

The Basic ICIJ Data Model in Neo4j

slide-29
SLIDE 29

The Real ICIJ Data Model

slide-30
SLIDE 30

What’s Been Delivered?

slide-31
SLIDE 31

Data initially exposed as interactive visualization

  • Public figures and leaders
  • Different shell companies & involvements
slide-32
SLIDE 32
slide-33
SLIDE 33

@apcj @technige

slide-34
SLIDE 34

OSS Stack Enables

  • Find interesting spots with full-text and fuzzy search
  • See neighbourhoods of suspects and interesting facts
  • Find connections and shortest paths between seemingly disconnected information

  • Add new knowledge as relationships enriching the graph structure
  • Stories emerge from the collaboration
  • Add more information from other sources
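
The shortest-path idea in the list above can be sketched with a breadth-first search in plain Python. This is a stand-in for illustration, not the stack's own implementation (Cypher has a built-in `shortestPath` function for this). The entities and edges below are invented.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected graph; returns a node list or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbours.get(current, ()):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical entities: two people linked only through companies and an address.
edges = [
    ("Person A", "Company X"),
    ("Company X", "Address 1"),
    ("Address 1", "Company Y"),
    ("Company Y", "Person B"),
    ("Person C", "Company Z"),
]
print(shortest_path(edges, "Person A", "Person B"))
print(shortest_path(edges, "Person A", "Person C"))  # no connection: None
```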
slide-35
SLIDE 35

Neo4j ICIJ Distribution

We have also made a distribution of Neo4j available with the data in it. This will allow you to query the database to fully explore from your computer the connections between people and companies. The package also includes a guide that explains how to use Neo4j.

slide-36
SLIDE 36

What’s Been Discovered?

slide-37
SLIDE 37

Distorting markets

London is wonderful, but expensive. Tax-avoiding investors have been able to distort the property market to suit their objective of capital gains.

Tax-avoidance multiplies their advantage to the disadvantage of regular Londoners.

slide-38
SLIDE 38

Tax is optional for the rich

Lionel Messi’s net worth is estimated at €200,000,000. Average income in Barcelona is a more modest €33,000 per year at headline 30% income tax. Messi remains entitled to roads, emergency services and all other benefits of citizenry in his host country.

slide-39
SLIDE 39

We’re not all in this together

Britain’s former Prime Minister declared that the country was “all in it together” after the 2008 financial collapse. The British people have seen massive declines in education, healthcare, and social services. Cameron benefitted from his dad’s investment fund’s involvement with Mossack Fonseca.

slide-40
SLIDE 40

Ice, Ice, Baby

Icelandic citizens took the brunt of their banking system’s collapse. Their Prime Minister had a conflict of interest in deciding how much government money would be used to compensate shareholders. He was a (transitive) beneficiary.

slide-41
SLIDE 41

Lava Jato

I’ll leave this one to you, folks. Bad behaviour spans borders, but so does our technology stack and our commitment to building better communities.

slide-42
SLIDE 42

What does this mean for us?

slide-43
SLIDE 43

Open source data technology democratises the capabilities that were once the domain of the Web giants. What they can do, we can approximate at low cost and with high effectiveness. Bad guys beware - it’s cheap to find you!

slide-44
SLIDE 44

One more thing

slide-45
SLIDE 45
slide-46
SLIDE 46

Enjoy the conference!

@jimwebber