A QUEST TO UNDERSTAND THE ORGANISATION OF LARGE RELATIONAL DATA: FROM WILD GOATS TO BITCOINS
Rémy Cazabet
INTRODUCTION: THE QUEST
QUEST
- Data coming from the real world
- Human/Animal/Natural activity
- Complex Systems
- “Many entities in interaction”
- “The whole is more than the sum of its parts” (…?…)
- Such a system cannot be understood by reductionism: understanding each part very well is not enough to understand how the whole works
Note: why “understand”?
- Goal in itself (physics, sociology, biology, (CS?)…)
- Understanding => building good models => predict, detect “exceptions”, …
TOOL: COMPLEX NETWORKS
- Entities in relations/interaction:
- Individuals exchange information/money/physical things
- Genes/Proteins/Cells interact through known or unknown means
- Web pages/articles/Patents… reference each other
- Individuals/animals/things belong to the same groups/have common traits
- …
- => Entities: nodes
- => Relations: edges
- With/Without attributes (categories, numeric, time, …)
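The mapping from entities and relations to nodes and edges can be sketched in a few lines of plain Python. This is only an illustration: the names in `interactions` and `node_attributes` are hypothetical toy data, and the adjacency-map layout is one possible choice among many.

```python
from collections import defaultdict

# Hypothetical toy data: each record is one relation between two entities,
# with optional edge attributes (category, numeric value, time, ...).
interactions = [
    ("alice", "bob", {"kind": "money", "amount": 12.0}),
    ("bob", "carol", {"kind": "information"}),
    ("alice", "carol", {"kind": "information"}),
    ("carol", "dave", {"kind": "money", "amount": 3.5}),
]

# Node attributes are optional as well (hypothetical values).
node_attributes = {
    "alice": {"group": "A"},
    "bob": {"group": "A"},
    "carol": {"group": "B"},
    "dave": {"group": "B"},
}


def build_network(records):
    """Return an undirected adjacency map: node -> {neighbour: edge attributes}."""
    adjacency = defaultdict(dict)
    for source, target, attributes in records:
        adjacency[source][target] = attributes
        adjacency[target][source] = attributes
    return dict(adjacency)


network = build_network(interactions)
nodes = sorted(network)
# Each undirected edge is listed once, as a sorted pair of node names.
edges = sorted({tuple(sorted((u, v))) for u in network for v in network[u]})
print(nodes)
print(edges)
```

Entities become dictionary keys, relations become entries of the inner dictionaries, and attributes ride along as plain dictionaries.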
- Networks are interesting for their structure, their organisation
- Are the neighbours of my neighbours also my neighbours?
- Are individuals with the same attributes as me more likely to be my neighbours?
- Are there “dense groups” (communities)?
- Are some nodes more “strategically” positioned?
- …
- Objective: understand/discover/analyse/reproduce this structure
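Three of these structural questions have simple standard measures, sketched here in plain Python on a hypothetical five-node network. The local clustering coefficient covers "neighbours of my neighbours", the share of same-attribute edges is a crude look at attribute similarity, and degree centrality is one of many notions of a "strategic" position. The graph and the `group` labels are invented for illustration.

```python
from itertools import combinations

# Hypothetical toy network: node -> set of neighbours (undirected).
neighbours = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
# Hypothetical node attribute.
group = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y"}


def local_clustering(node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    pairs = list(combinations(sorted(neighbours[node]), 2))
    if not pairs:
        return 0.0
    closed = sum(1 for u, v in pairs if v in neighbours[u])
    return closed / len(pairs)


def same_attribute_edge_fraction():
    """Fraction of edges whose two endpoints share the same group."""
    edges = {tuple(sorted((u, v))) for u in neighbours for v in neighbours[u]}
    same = sum(1 for u, v in edges if group[u] == group[v])
    return same / len(edges)


def degree_centrality(node):
    """Number of neighbours divided by the maximum possible number."""
    return len(neighbours[node]) / (len(neighbours) - 1)


for node in sorted(neighbours):
    print(node, local_clustering(node), degree_centrality(node))
print(same_attribute_edge_fraction())
```

A high share of same-attribute edges only hints at homophily. A proper test would compare it with what a random network with the same group sizes would give.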
CHAPTER ONE : WHAT I’VE DONE
SCIENTIFIC JOURNEY
- PhD: Toulouse, Dynamic Community Detection in Temporal Networks
- Postdocs:
- Tokyo (2y), Understanding cooperation in social media
- ENS de Lyon (1y), Understanding usage of Bicycle Sharing Systems
- Paris (1y), Fraud detection in crypto-currencies
IZARDS (WILD GOATS)
- Social animals
- 20y of observations
- (Position/co-location)
- Persistence of groups?
- Despite deaths/climate change?
[Figure: timeline with year labels from 1994 to 2007]
- Can we discover your “social circles” from your ego-networks?
- How do you like it?
TRENDING TOPICS
Detected event     Creation date   End date     Release date   Detection delay (days)
Devil May Cry      02/12/2007      08/08/2008   31/01/2008     -60
Fable 2            06/12/2008      03/02/2009   18/12/2008     -12
Gears Of War 2     14/10/2008      29/12/2008   07/11/2008     -24
Assassin’s Creed   25/01/2008      26/02/2008   31/01/2008     -6
Soul Calibur IV    07/07/2008      15/11/2008   31/07/2008     -24
Uncharted          11/11/2007      02/01/2008   16/11/2007     -5
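The delay column appears to be the creation date of the detected event minus the release date of the game, in days, so a negative value would mean the topic was detected before the release. That reading is an assumption, not something the slide states. The short check below applies it to the dates in the table (read as DD/MM/YYYY).

```python
from datetime import datetime

# Rows taken from the table: (event, creation date, release date), DD/MM/YYYY.
rows = [
    ("Devil May Cry", "02/12/2007", "31/01/2008"),
    ("Fable 2", "06/12/2008", "18/12/2008"),
    ("Gears Of War 2", "14/10/2008", "07/11/2008"),
    ("Assassin's Creed", "25/01/2008", "31/01/2008"),
    ("Soul Calibur IV", "07/07/2008", "31/07/2008"),
    ("Uncharted", "11/11/2007", "16/11/2007"),
]


def parse(day):
    """Parse a DD/MM/YYYY string into a date."""
    return datetime.strptime(day, "%d/%m/%Y").date()


def detection_delay(creation, release):
    """Assumed definition: creation date minus release date, in days."""
    return (parse(creation) - parse(release)).days


delays = {name: detection_delay(creation, release) for name, creation, release in rows}
for name, delay in delays.items():
    print(f"{name}: {delay}")
```

Under this assumption the computed values match the delay column for all six events.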
[Figure: topic keywords over 2007 to 2009: METAL GEAR, JEU (game), DESARMEMENT (disarmament), METAL GEAR SOLID, VIDEO DE JEU (game video)]
SPACE-CORRECTED COMMUNITIES
Normal community detection
Spatially corrected communities
USER IDENTIFICATION IN BITCOIN
[Figure: known services (btc faucet, coinbase, easycoin, easywallet, flexcoin, instawallet, paytunia, strongcoin) plotted against numeric identifiers]
Without community detection
Ground truth
With community detection
[Figure: same services against a longer list of numeric identifiers; legend: GT, H1, H4-l2]
DYNAMIC COMMUNITY DETECTION
- “Community Discovery in Dynamic Networks: A Survey”
- With Giulio Rossetti (Pisa)
- 50 methods, 40-60 pages
- To be (should be) published in ACM Computing Surveys (Slooooow)
TWITTER IN TIME OF CRISIS
[Figure: normalized retweet count over time (per hour), 6th to 24th March; series: IS only, AMP only, Mixed]
MASSIVE PEER COOPERATION PROCESSES
[Figure: diagram with labels GI, LI, AGG, BB]
[Figure: user frequency for Simple variant, Complex variant, Exploiting creation]
[Figure: fraction of famous videos by category and 2nd category (DANCE, MAD, MASHUPS, MUSICALPERFORMANCE, ORIGINALMUSIC, PICTURE, SINGING, VOCALOIDVOICE, VOICE)]
[Figure: category-by-category heatmap with colour key (SINGING, CG3D, DANCE, NOCATEGORY, MASHUPS, MUSIC, MAD, MUSICALPERFORMANCE, MOVIE, ORIGINALMUSIC, ANIMATION, VOICE, OTHER, PICTURE, VOCALOIDVOICE)]
[Figure: cumulative frequency against user rank, by type: views, references]
TEMPORAL PROFILES EVOLUTION
[Figure: four weekly temporal profiles, Monday to Sunday by hour of day, labelled “Commercial”, “Work”, “Bars-Restaurants (?)” and “Leisure”]
NMF: extract temporal profiles
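As a rough illustration of how NMF (non-negative matrix factorisation) can pull temporal profiles out of usage counts, here is a minimal pure-Python sketch. It uses the generic Lee-Seung multiplicative updates for the squared error, which is not necessarily the variant behind these slides. The count matrix `v` (places in rows, time slots in columns) is hypothetical toy data, built by mixing a "morning" and an "evening" profile.

```python
import random


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # Update the profiles: H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * numer[k][j] / (denom[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # Update the weights: W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * numer[i][k] / (denom[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical counts: 4 places x 6 time slots.
morning = [0, 8, 2, 1, 1, 0]
evening = [0, 1, 2, 3, 8, 4]
v = [
    morning,
    evening,
    [a + b for a, b in zip(morning, evening)],
    [2 * a + b for a, b in zip(morning, evening)],
]

w, h = nmf(v, rank=2)
print("temporal profiles (rows of h):", h)
print("weights per place (rows of w):", w)
print("squared error:", frobenius_error(v, w, h))
```

Each row of `h` plays the role of one temporal profile, and each row of `w` says how strongly a place expresses each profile. In practice a numerical library would replace the hand-written matrix code.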
[Figure: panel (c) TPU3, map annotated with: Main city mall, Main commercial street, Main train station, Main campuses of universities]
CHAPTER 2 : WHAT I’M DOING NOW
(Struggling) (Trying to get funding)
DYNAMIC COMMUNITY DETECTION: EMPIRICAL EVALUATION
- Survey: classification, qualitative comparison
- Empirical evaluation => strengths, weaknesses, …
CHAPTER 3: WHAT’S NEXT
- I’m open to all opportunities
- There are “theoretical” questions I would like to explore:
- Community Detection vs. Clustering
- Automatically finding the best network model
- Communities?
- Spatial?
- Embedding? (many works now in ML/Data Mining)
- Core Periphery?
- …
- => Multi-criteria analysis/optimisation: model cost (information theory) vs. model accuracy
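One common way to frame such a two-criteria choice is to keep only the models that no other model beats on both cost and error (the Pareto front). The sketch below assumes that framing, which the slide does not specify. The candidate names echo the list above, but the cost and error numbers are entirely hypothetical placeholders.

```python
# Hypothetical candidates: (model name, description cost in bits, error rate).
# The numbers are made up purely to illustrate the selection step.
candidates = [
    ("communities", 120.0, 0.10),
    ("spatial", 90.0, 0.18),
    ("embedding", 300.0, 0.08),
    ("core-periphery", 150.0, 0.20),
]


def pareto_front(models):
    """Keep models that no other model beats on both cost and error."""
    front = []
    for name, cost, error in models:
        dominated = any(
            other_cost <= cost
            and other_error <= error
            and (other_cost < cost or other_error < error)
            for _, other_cost, other_error in models
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(candidates))
```

Choosing one model from the front still requires a preference between cost and accuracy. The front only removes options that are worse on both.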