http://www.mmds.org Much of the course will be devoted - PowerPoint PPT Presentation

Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

- Much of the course will be devoted to large-scale computing for data mining
- Challenges:
  - How to distribute computation?
  - Distributed/parallel programming is hard
- Map-reduce addresses all of the above
  - Google's computational/data manipulation model
  - Elegant way to work with big data

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

[Figure: a single machine with CPU, memory, and disk. Labels: Machine Learning, Statistics; "Classical" Data Mining.]

¡ 20+ ¡billion ¡web ¡pages ¡x ¡20KB ¡= ¡400+ ¡TB ¡ ¡ 1 ¡computer ¡reads ¡30-‑35 ¡MB/sec ¡from ¡disk ¡ § ~4 ¡months ¡to ¡read ¡the ¡web ¡ ¡ ~1,000 ¡hard ¡drives ¡to ¡store ¡the ¡web ¡ ¡ Takes ¡even ¡more ¡to ¡ do ¡something ¡useful ¡ ¡ with ¡the ¡data! ¡ ¡ Today, ¡a ¡standard ¡architecture ¡for ¡such ¡ problems ¡is ¡emerging: ¡ § Cluster ¡of ¡commodity ¡Linux ¡nodes ¡ § Commodity ¡network ¡(ethernet) ¡to ¡connect ¡them ¡ J. ¡Leskovec, ¡A. ¡Rajaraman, ¡J. ¡Ullman: ¡Mining ¡of ¡Massive ¡Datasets, ¡hJp://www.mmds.org ¡ 4 ¡

[Figure: cluster architecture. Each rack contains 16-64 nodes, each with its own CPU, memory, and disk, connected through a switch. 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks.]

In 2011 it was guesstimated that Google had 1M machines, http://bit.ly/Shh0RO


- Large-scale computing for data mining problems on commodity hardware
- Challenges:
  - How do you distribute computation?
  - How can we make it easy to write distributed programs?
  - Machines fail:
    - One server may stay up 3 years (1,000 days)
    - If you have 1,000 servers, expect to lose 1/day
    - People estimated Google had ~1M machines in 2011
    - 1,000 machines fail every day!
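The failure estimate follows from a simple rate calculation. The sketch below assumes independent failures and that each server fails on average once per 1,000 days, as the slide suggests; the function name is illustrative.

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1_000):
    """Expected number of machine failures per day.

    Assumption: failures are independent and each server fails on
    average once every mean_uptime_days.
    """
    return num_servers / mean_uptime_days


for n in (1, 1_000, 1_000_000):
    print(f"{n:>9,} servers: {expected_failures_per_day(n):g} failures/day")
```

With 1,000 servers this gives one failure per day, and with ~1M machines about 1,000 per day, matching the numbers on the slide.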

- Issue: Copying data over a network takes time
- Idea:
  - Bring computation close to the data
  - Store files multiple times for reliability
- Map-reduce addresses these problems
  - Google's computational/data manipulation model
  - Elegant way to work with big data
  - Storage Infrastructure: File system
    - Google: GFS. Hadoop: HDFS
  - Programming model
    - Map-Reduce

- Problem:
  - If nodes fail, how to store data persistently?
- Answer:
  - Distributed File System:
    - Provides global file namespace
    - Google GFS; Hadoop HDFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

- Chunk servers
  - File is split into contiguous chunks
  - Typically each chunk is 16-64MB
  - Each chunk replicated (usually 2x or 3x)
  - Try to keep replicas in different racks
- Master node
  - a.k.a. Name Node in Hadoop's HDFS
  - Stores metadata about where files are stored
  - Might be replicated
- Client library for file access
  - Talks to master to find chunk servers
  - Connects directly to chunk servers to access data
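To make the chunking and replication idea concrete, here is a toy sketch of what a master's bookkeeping might look like. It is not how GFS or HDFS actually place replicas: the round-robin policy, the function names, and the rack and server names are all hypothetical, chosen only to show contiguous chunks with one replica per rack.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the upper end of the 16-64MB range
REPLICAS = 3                   # "usually 2x or 3x"


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) pairs for the contiguous chunks of a file."""
    return [(offset, min(chunk_size, file_size - offset))
            for offset in range(0, file_size, chunk_size)]


def place_replicas(num_chunks, servers_by_rack, replicas=REPLICAS):
    """Toy master metadata: chunk index -> list of (rack, server).

    Hypothetical round-robin policy that puts each replica of a chunk
    in a different rack.
    """
    racks = sorted(servers_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    placement = {}
    for chunk in range(num_chunks):
        chosen = []
        for r in range(replicas):
            rack = racks[(chunk + r) % len(racks)]
            servers = servers_by_rack[rack]
            chosen.append((rack, servers[chunk % len(servers)]))
        placement[chunk] = chosen
    return placement


# Hypothetical cluster layout and a 200 MB file.
servers_by_rack = {
    "rack0": ["s0", "s1"],
    "rack1": ["s2", "s3"],
    "rack2": ["s4", "s5"],
    "rack3": ["s6"],
}
chunks = split_into_chunks(200 * 1024 * 1024)
placement = place_replicas(len(chunks), servers_by_rack)
for index, replicas in placement.items():
    print(index, chunks[index], replicas)
```

A client would ask the master for the entry of the chunk it needs and then contact one of the listed chunk servers directly, which is the access pattern the slide describes.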

- Reliable distributed file system
- Data kept in "chunks" spread across machines
- Each chunk replicated on different machines
  - Seamless recovery from disk or machine failure

[Figure: chunks of two files (C0, C1, C2, C3, C5 and D0, D1) replicated across chunk servers 1, 2, 3, ..., N.]

Bring computation directly to the data!
Chunk servers also serve as compute servers

Warm-up task:
- We have a huge text document
- Count the number of times each distinct word appears in the file
- Sample application:
  - Analyze web server logs to find popular URLs

Case 1:
  - File too large for memory, but all <word, count> pairs fit in memory
Case 2:
- Count occurrences of words:
  - words(doc.txt) | sort | uniq -c
  - where words takes a file and outputs the words in it, one per line
- Case 2 captures the essence of MapReduce
  - Great thing is that it is naturally parallelizable
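Both cases can be sketched in Python. The tokenizer below (lowercasing and a simple regular expression) is an assumption, since the slides do not define words precisely. Note also that sorted() holds all words in memory, whereas the Unix sort command can spill to disk, so this mirrors only the shape of the pipeline, not its scalability.

```python
import re
from collections import Counter
from itertools import groupby


def words(lines):
    """Yield the words in an iterable of text lines, one at a time.

    Assumption: words are lowercased runs of letters, digits, and apostrophes.
    """
    for line in lines:
        yield from re.findall(r"[a-z0-9']+", line.lower())


def count_in_memory(lines):
    """Case 1: stream the input, keep only the <word, count> pairs in memory."""
    return Counter(words(lines))


def count_by_sorting(lines):
    """Case 2: mimic `words | sort | uniq -c`.

    Sorting puts equal words next to each other, so counting reduces to
    measuring the length of each run.
    """
    return [(word, sum(1 for _ in run))
            for word, run in groupby(sorted(words(lines)))]


lines = ["the quick brown fox", "the lazy dog and the fox"]
print(count_in_memory(lines).most_common(2))
print(count_by_sorting(lines))
```

The sort step plays the role of grouping by key and the run-length count plays the role of a reduce, which is why the pipeline is a useful mental model for MapReduce.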

- Sequentially read a lot of data
- Map:
  - Extract something you care about
- Group by key: Sort and Shuffle
- Reduce:
  - Aggregate, summarize, filter or transform
- Write the result

Outline stays the same, Map and Reduce change to fit the problem
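The outline above can be written as a small single-process skeleton in which only the map and reduce functions change per problem. This is a minimal sketch for illustration, not Google's or Hadoop's implementation: the names map_reduce, wc_map, and wc_reduce are made up here, and everything runs in memory on one machine.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run the Map, Group-by-key, Reduce outline over (key, value) records."""
    # Map: extract intermediate (key, value) pairs from every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Group by key: sort, then collect all values that share a key.
    groups = defaultdict(list)
    for k, v in sorted(intermediate, key=lambda kv: kv[0]):
        groups[k].append(v)

    # Reduce: aggregate each group into output (key, value) pairs.
    output = []
    for k, values in groups.items():
        output.extend(reduce_fn(k, values))
    return output


def wc_map(doc_name, text):
    """Word count map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield (word, 1)


def wc_reduce(word, counts):
    """Word count reduce: sum the counts collected for one word."""
    yield (word, sum(counts))


docs = [("d1", "a b a"), ("d2", "b c")]
print(map_reduce(docs, wc_map, wc_reduce))
```

Swapping in a different pair of functions, for example a map that emits (url, 1) per log line, reuses the same skeleton unchanged.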

[Figure: the Map step. Each input key-value pair (k, v) is passed to map, which produces intermediate key-value pairs.]

[Figure: the Reduce step. Intermediate key-value pairs are grouped by key into key-value groups (one key with all of its values); reduce is applied to each group and produces output key-value pairs.]
