BETTER TOGETHER: USING SPARK AND REDSHIFT TO COMBINE YOUR DATA WITH PUBLIC DATASETS



slide-1
SLIDE 1

BETTER TOGETHER

USING SPARK AND REDSHIFT TO COMBINE YOUR DATA WITH PUBLIC DATASETS

EUGENE MANDEL (@EUGMANDEL) JAWBONE QCON SF 2014

slide-2
SLIDE 2

JAWBONE DATA

MOVEMENT SLEEP WORKOUTS MEALS MOOD

slide-3
SLIDE 3

SOUTH NAPA EARTHQUAKE 2014

slide-4
SLIDE 4

[CHART: % OF PEOPLE AWAKE AT 3:25 BY DISTANCE FROM EPICENTER (MILES)]

slide-5
SLIDE 5
slide-6
SLIDE 6

DATA FUSION IS THE PROCESS OF INTEGRATION OF MULTIPLE DATA AND KNOWLEDGE REPRESENTING THE SAME REAL-WORLD OBJECT INTO A CONSISTENT, ACCURATE, AND USEFUL REPRESENTATION.

(WIKIPEDIA)

slide-7
SLIDE 7

DATA FUSION - HOW TO FIND THE ELEPHANT

IMAGE SOURCE: HTTP://COMMONS.WIKIMEDIA.ORG/WIKI/FILE%3ABLIND_MEN_AND_ELEPHANT.PNG

slide-8
SLIDE 8

DATA FUSION

  • POWERFUL BUT HARD
  • DATA IS NOISY
  • DOMAIN UNDERSTANDING IS KEY

slide-9
SLIDE 9

LET’S TALK ABOUT THE WEATHER

slide-10
SLIDE 10

MODEL THE PROBLEM

slide-11
SLIDE 11

[CHART: ACTIVITY VS. AIR TEMP (°F), RELATIONSHIP MARKED "?"]

slide-12
SLIDE 12

FIND THE DATA

slide-13
SLIDE 13
slide-14
SLIDE 14

UNDERSTAND THE DATA

slide-15
SLIDE 15

HOURLY DAILY

slide-16
SLIDE 16

DATA GENERATION PROCESS

  • NETWORK OF WEATHER STATIONS
  • FREQUENCY OF MEASUREMENTS - HOURLY TO DAILY
  • COLLABORATION WITH INTERNATIONAL AGENCIES
  • AGGREGATION AND QA BY NCDC
slide-17
SLIDE 17

UNDERSTAND THE DOMAIN

WEATHER STATION
TIME: 2014-07-09 13:04:00
AIR TEMP: 86°F
PRECIPITATION: 3CM

slide-18
SLIDE 18

QA THE DATA

slide-19
SLIDE 19

BUT ISN’T IT DONE?

slide-20
SLIDE 20

AIR TEMP: 105°F

  • BAKERSFIELD, CA - JULY 17, 15:00
  • DULUTH, MN - JAN 12, 05:00

…MAYBE NOT!

slide-21
SLIDE 21

DATA VALIDATION

  • DOMAIN KNOWLEDGE
  • COMPARE MULTIPLE SOURCES - E.G. CLIMATE
  • MANUAL REVIEW OF FLAGGED DATA POINTS
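
One way to picture the "compare multiple sources" step is a range check against climate data, with anything out of range sent to manual review. The sketch below is only an illustration: the `Reading` type, the `flag_suspect` helper, and the climate bounds are hypothetical placeholders, not real NCDC values or the speaker's code.

```python
from dataclasses import dataclass

# Hypothetical climate bounds: (station, month) -> (low °F, high °F).
# Placeholder numbers for illustration only, not real NCDC records.
CLIMATE_RANGE_F = {
    ("BAKERSFIELD_CA", 7): (50.0, 118.0),
    ("DULUTH_MN", 1): (-45.0, 55.0),
}


@dataclass
class Reading:
    station: str
    month: int
    air_temp_f: float


def flag_suspect(readings, climate_range=CLIMATE_RANGE_F):
    """Return the readings that should go to manual review."""
    suspect = []
    for r in readings:
        bounds = climate_range.get((r.station, r.month))
        if bounds is None:
            # No reference data for this station/month: flag it
            # instead of silently accepting the value.
            suspect.append(r)
            continue
        low, high = bounds
        if not (low <= r.air_temp_f <= high):
            suspect.append(r)
    return suspect


readings = [
    Reading("BAKERSFIELD_CA", 7, 105.0),  # plausible for a July afternoon
    Reading("DULUTH_MN", 1, 105.0),       # implausible for a January morning
]
flagged = flag_suspect(readings)
```

With these placeholder bounds, only the Duluth reading is flagged. The same 105°F value passes or fails depending on place and season, which is why domain knowledge drives the check.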
slide-22
SLIDE 22

JOIN

slide-23
SLIDE 23

DOMAIN SPECIFIC

HOW?

WEATHER STATION B
LAT: 39.35 LON: -74.44
TIME: 2014-07-09 13:00:00
AIR TEMP: 60°F

WEATHER STATION A
LAT: 39.36 LON: -74.45
TIME: 2014-07-09 13:04:00
AIR TEMP: 74°F

ELEVATION: 30FT
ELEVATION: 120FT
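
A minimal sketch of one possible join rule, assuming a user location and timestamp should be matched to the nearest station observation within a distance and time tolerance. The function names, the 10-mile and 30-minute thresholds, and the user coordinates are assumptions for illustration. The sketch ignores elevation, which the slide suggests can matter even for stations this close together.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearest_observation(lat, lon, when, observations,
                        max_miles=10.0, max_gap=timedelta(minutes=30)):
    """Return the closest observation within both tolerances, or None."""
    best, best_dist = None, None
    for obs in observations:
        if abs(obs["time"] - when) > max_gap:
            continue  # too far apart in time
        dist = haversine_miles(lat, lon, obs["lat"], obs["lon"])
        if dist > max_miles:
            continue  # too far apart in space
        if best is None or dist < best_dist:
            best, best_dist = obs, dist
    return best


# The two station records from the slide.
observations = [
    {"station": "A", "lat": 39.36, "lon": -74.45,
     "time": datetime(2014, 7, 9, 13, 4), "air_temp_f": 74.0},
    {"station": "B", "lat": 39.35, "lon": -74.44,
     "time": datetime(2014, 7, 9, 13, 0), "air_temp_f": 60.0},
]

# Hypothetical user location and time.
match = nearest_observation(39.352, -74.442, datetime(2014, 7, 9, 13, 2), observations)
```

Here the user point resolves to station B. A point about a mile away could resolve to station A and get a reading 14°F higher, so the matching rule is itself a modeling decision.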

slide-24
SLIDE 24

DO THE DATASETS INTERSECT ENOUGH?

COVERAGE

  • PLACES
  • TIMES
  • USERS
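
The coverage question can be checked before any join is attempted. The sketch below assumes both datasets can be reduced to comparable keys, here a hypothetical (place, hour) pair, and reports the share of activity keys that have a weather counterpart. The key format and sample values are made up for illustration.

```python
def coverage(activity_keys, weather_keys):
    """Fraction of distinct activity keys that also appear in the weather data."""
    activity = set(activity_keys)
    if not activity:
        return 0.0
    return len(activity & set(weather_keys)) / len(activity)


# Hypothetical keys: (place, hour bucket).
activity_keys = [
    ("SF", "2014-07-09T13"),
    ("SF", "2014-07-09T14"),
    ("NAPA", "2014-07-09T13"),
    ("DULUTH", "2014-07-09T13"),
]
weather_keys = [
    ("SF", "2014-07-09T13"),
    ("SF", "2014-07-09T14"),
    ("NAPA", "2014-07-09T13"),
    ("BAKERSFIELD", "2014-07-09T13"),
]

share = coverage(activity_keys, weather_keys)
```

The same function can be run on places, times, or users alone by passing only that part of the key, which shows where the overlap is thin.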
slide-25
SLIDE 25

ISOLATE THE EFFECT

slide-26
SLIDE 26

CONFOUNDING VARIABLES

  • WEEKDAYS/WEEKENDS
  • DAYLIGHT
  • RAIN/SNOW

WHAT ELSE AFFECTS ACTIVITY?
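
A simple way to keep one confounder from mixing into the temperature effect is to stratify: compute average activity per temperature bin separately for weekdays and weekends. The sketch below assumes records of (date, max temp °F, daily steps). The sample numbers are invented for illustration and are not Jawbone data.

```python
from collections import defaultdict
from datetime import date


def mean_steps_by_stratum(records, bin_width_f=10):
    """Average steps per (day type, temperature bin) group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of steps, count]
    for day, max_temp_f, steps in records:
        # Monday is 0 in date.weekday(), so 5 and 6 are Saturday and Sunday.
        day_type = "weekend" if day.weekday() >= 5 else "weekday"
        # Floor the temperature to the lower edge of its bin.
        temp_bin = int(max_temp_f // bin_width_f) * bin_width_f
        acc = totals[(day_type, temp_bin)]
        acc[0] += steps
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Invented sample records: (date, max temp °F, daily steps).
records = [
    (date(2014, 7, 5), 72, 8000),  # Saturday
    (date(2014, 7, 6), 78, 9000),  # Sunday
    (date(2014, 7, 7), 74, 6000),  # Monday
    (date(2014, 7, 8), 95, 5000),  # Tuesday
]

means = mean_steps_by_stratum(records)
```

Daylight and rain/snow could be handled the same way by adding them to the grouping key, at the cost of fewer records per group.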

slide-27
SLIDE 27

REDSHIFT VS SPARK

slide-28
SLIDE 28

AMAZON REDSHIFT

  • RELATIONAL ANALYTICAL DATABASE BY AMAZON
  • COMPLEX QUERIES ON LARGE DATASETS IN SECONDS
  • SQL INTERFACE (POSTGRES)
  • MANAGED CLUSTER
slide-29
SLIDE 29

EXAMPLE: DAYLIGHT

PYTHON REDSHIFT
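
The code shown on this slide did not survive in the transcript. As a stand-in, here is a sketch of how daylight hours might be computed in Python from latitude and day of year, using a common approximation for solar declination and the sunrise equation. The function name and the choice of formula are assumptions, not the speaker's implementation.

```python
import math


def daylight_hours(latitude_deg, day_of_year):
    """Approximate hours of daylight for a latitude and day of year (1-365)."""
    # Approximate solar declination in degrees (simple cosine model).
    decl_deg = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    # Sunrise equation: cos(hour angle) = -tan(latitude) * tan(declination).
    x = -math.tan(math.radians(latitude_deg)) * math.tan(math.radians(decl_deg))
    # Clamp to handle polar day (24 hours) and polar night (0 hours).
    x = max(-1.0, min(1.0, x))
    hour_angle = math.acos(x)
    return 24.0 * hour_angle / math.pi


equator = daylight_hours(0.0, 80)
summer_40n = daylight_hours(40.0, 172)   # around the June solstice
winter_40n = daylight_hours(40.0, 355)   # around the December solstice
```

The approximation ignores atmospheric refraction and the size of the solar disk, so results can be off by several minutes. That is likely acceptable when daylight is only used as a control variable.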

slide-30
SLIDE 30

SPARK

  • IN-MEMORY DATA PROCESSING FRAMEWORK
  • MODELS COMPUTATION AS A GRAPH OF RDDS (RESILIENT DISTRIBUTED DATASETS)
  • FUNCTIONAL PROGRAMMING MODEL (SCALA, PYTHON)
  • SQL
  • CAN READ FROM SAME SOURCES AS HADOOP
slide-31
SLIDE 31

SPARK

EXAMPLE: DAYLIGHT

slide-32
SLIDE 32

PICK YOUR OWN ADVENTURE

SILVER BULLET?

SPARK

  • PROGRAMMER-FRIENDLY
  • END-TO-END SOLUTION
  • SELF-DOCUMENTING

REDSHIFT

  • EASY TO SHARE DATA WITH NON-DEVELOPERS
  • MANAGED - EASY SCALING
slide-33
SLIDE 33

WHAT DID WE FIND?

slide-34
SLIDE 34

IDEAL TEMP FOR MOVEMENT

[CHART: DAILY STEPS VS. MAX TEMP (°F)]

slide-35
SLIDE 35

AND NOW BY STATE…

[CHART: DAILY STEPS VS. MAX TEMP (°F)]

slide-36
SLIDE 36

HOURLY STEPS BY AIR TEMP

WEEKENDS

slide-37
SLIDE 37

LESS CHOICE = SMALLER EFFECT

WEEKDAYS

slide-38
SLIDE 38

DATA FUSION

  • POWERFUL BUT HARD
  • DATA IS NOISY
  • DOMAIN UNDERSTANDING IS KEY

slide-39
SLIDE 39

THANK YOU!

@EUGMANDEL WWW.LINKEDIN.COM/IN/EUGENEMANDEL