Introduction Wordcount K-Means RHadoop Wrap-Up
Efficient Analysis of Big Data and Big Models through Distributed Computation
Benjamin E. Bagozzi & John Beieler
The Pennsylvania State University
Big Data Week Presentation Series Penn
Why Move to Distributed Computation?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
- Grace Hopper
Hadoop
- An open source framework for distributed computing
- Two primary subprojects:
- MapReduce: distributed data processing
- HDFS: distributed data storage
- MapReduce jobs typically written in Java
- Hadoop Streaming: API for using MapReduce with other languages
- E.g., Ruby, Python, R
- Additional subprojects: Pig, HBase, ZooKeeper, Hive, Chukwa, etc.
MapReduce in Detail
- A two-step paradigm for big data processing
- To implement:
- Specify key-value pairs as input & output for each phase
- Specify two functions: map function and reduce function
- Map phase: perform a transformation (e.g., field extraction, parsing, filtering) on each individual piece of data (e.g., row of text, tweet, vector) and output a key-value pair
- Reduce phase: (1) sort and group output by key, (2) compute an aggregate function over the values associated with each key, (3) output aggregates to disk
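The two phases above can be sketched in a few lines of single-machine Python. This is only an in-memory illustration: the names `map_reduce`, `map_fn`, and `reduce_fn` are hypothetical and not part of Hadoop's API, and a real job would distribute these steps across nodes and write the aggregates to disk rather than return them.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    """Run a toy, in-memory MapReduce over an iterable of records.

    map_fn(record) yields (key, value) pairs;
    reduce_fn(key, values) returns one aggregate per key.
    """
    # Map phase: transform each record independently into key-value pairs.
    pairs = [pair for record in records for pair in map_fn(record)]

    # Reduce phase, step 1: sort and group the map output by key.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))

    # Reduce phase, step 2: aggregate the values associated with each key.
    return {key: reduce_fn(key, [v for _, v in group]) for key, group in grouped}


# Hypothetical example: longest line length per first character of the line.
def first_char_and_length(line):
    if line:
        yield line[0], len(line)


def longest(key, values):
    return max(values)


result = map_reduce(["apple", "avocado", "bean"], first_char_and_length, longest)
```

Only `map_fn` and `reduce_fn` change from task to task; the sort-and-group step in between stays the same, which is the part a framework such as Hadoop takes over.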
Hadoop vs. Other Parallelization Approaches
- Other Parallelization Approaches:
- Break tasks up by hand, submit pieces individually to HPCs
- Split tasks via other parallelization paradigms (e.g., MPI)
- Hadoop Drawbacks:
- More complex (debugging, configuration)
- Less intuitive, steep learning curve
- Availability & access
- Bleeding edge
- Hadoop Benefits:
- Flat scalability & efficient processing
- Open Source
- Integration with other languages, computing tasks
- Reliable/robust big data storage and processing
Hadoop on SDSC’s ‘Gordon’ Supercomputer
- Overviews of Gordon can be found here and here
- Available via the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE)
- Register at XSEDE (free)
- Request or join (via, e.g., PSU’s Campus Champion Allocation) a Gordon allocation (not always free)
- Benefits:
- Full base Hadoop framework available (see here)
- Easy Hadoop job scheduling/submission via MyHadoop
- Drawbacks (as of April 2013):
- Hadoop complements (e.g., Hive, HBase, Pig) aren’t available
- Relevant libraries (e.g., for R and Python) aren’t installed
A Selection of Hadoop’s Built-in (Java) Example Scripts
- wordcount: A map/reduce program that counts the words in the input files
- aggregatewordcount: Aggregate map/reduce program to count words in input files
- multifilewc: A job that counts words from several files
- grep: A map/reduce program that counts the matches of a regex in the input
- dbcount: An example job that counts the pageview counts from a database
- randomwriter: A map/reduce program that writes 10GB of random data per node
- randomtextwriter: A map/reduce program that writes 10GB random text per node
- sort: A map/reduce program that sorts the data written by the random writer
- secondarysort: An example defining a secondary sort to the reduce
- teragen/terasort/teravalidate: terabyte generate/sort/validate
- pi: A map/reduce program that estimates Pi using a Monte Carlo method
For online tutorials on these, see here, here, and here.
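As one illustration of how these examples decompose into map and reduce steps, the Monte Carlo idea behind `pi` can be sketched as follows. This is a simplified single-machine sketch, not Hadoop's Java implementation; the function names, the number of tasks, the points per task, and the seeds are all assumptions made for illustration.

```python
import random


def map_task(seed, n_points):
    """Map: sample points in the unit square, count those inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, n_points


def reduce_counts(partials):
    """Reduce: sum the per-task counts and turn the ratio into an estimate of pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


# Four independent "map tasks", each with its own seed (illustrative values).
partials = [map_task(seed, 50_000) for seed in range(4)]
pi_estimate = reduce_counts(partials)
```

Each map task needs only a seed and a sample size, so the tasks can run on separate nodes without communicating until the final sum.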
Running Example: ICEWS News-Story Corpus
- 60 European and Middle Eastern countries
- All politically relevant stories, January 2001 to July 2011
- Document: Individual news-story (first 3-4 sentences)
- 6,681,537 Stories
- Removed: punctuation, stopwords, numbers, proper nouns, etc.
- Stemmed words
Applying wordcount to Entire News-Story Corpus
- What are the most frequent (stemmed) words?
- Map Stage: assign <key,value> pairs to corpus
- Read in X-lines (stories) of text from corpus
- Input <key,value>: <line-number, line-of-text>
- Output <key,value>: <word, one>
- Reduce Stage: sum the individual <key,value> pairs from the Map tasks
- Input <key,value>: <word, one>
- Output <key,value>: <word, occurrence>
- 1 node → 8 minutes; 4 nodes → 5 minutes
- 406,466 unique “words”
For online tutorials on Hadoop’s wordcount, see here, here, and here.
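The same map and reduce stages can also be written in the style of Hadoop Streaming. The sketch below is an illustration under stated assumptions rather than the Java `wordcount` used for the results above: `mapper` and `reducer` are hypothetical names, the input is assumed to hold one whitespace-tokenized story per line, and the sort that Streaming performs between the two stages is simulated with `sorted`.

```python
from itertools import groupby


def mapper(lines):
    """Map: emit a tab-separated <word, 1> pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_pairs):
    """Reduce: sum the counts of each word; input must already be sorted by word."""
    parsed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


# Toy corpus of already-stemmed stories (illustrative only).
stories = ["govern meet report", "govern countri", "countri govern"]
counts = dict(reducer(sorted(mapper(stories))))
```

In an actual Streaming job the two functions would live in separate scripts that read from standard input and print to standard output, and would be passed to the Streaming jar through its `-mapper` and `-reducer` options.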
Word-Stem Frequencies for ICEWS News Corpus
[Figure: word cloud of word-stem frequencies in the ICEWS news corpus]
Top 20 Word-Stems
[Figure: bar chart of the 20 most frequent word-stems: countri, govern, meet, report, offici, told, news, peopl, forc, minist, agenc, militari, leader, visit, attack, talk, secur, region, kill, elect; axis tick values not recoverable]
Cluster Analysis of Country News-Reports
- Do country-news reports cluster in interesting ways?
- K-Means on all 60 countries’ news reports (1000 per country)
- Specify 60 clusters and set no. of iterations to 10
- Examine variation in cluster assignments across countries
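Examining variation in cluster assignments amounts to a country-by-cluster cross-tabulation. A minimal sketch, assuming the K-Means output is available as hypothetical `(country, cluster_id)` pairs, one per story (the slides do not describe the actual output format, and the values below are made up):

```python
from collections import Counter, defaultdict


def crosstab(assignments):
    """Count, for each country, how many stories fall in each cluster."""
    table = defaultdict(Counter)
    for country, cluster_id in assignments:
        table[country][cluster_id] += 1
    return table


# Made-up assignments for illustration; not results from the corpus.
assignments = [("TUR", 4), ("TUR", 4), ("TUR", 7), ("IRQ", 40), ("IRQ", 4)]
table = crosstab(assignments)
```

Reading one cluster's counts across all countries, e.g. `table[c][4]` for each country `c`, gives the per-country values that a map of that cluster would shade.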
K-Means with MapReduce/Java
- Map Stage: assign word values to (new) minimum-distance clusters
- Read in vectorized story-words (V) and previous centers
- For each word (v), apply distance function to find nearest center
- Output <key,value>: <center_i, v>
- Reduce Stage: row bind and average the <key,value> pairs from the Map tasks
- Input <key,value>: <center_i, v>
- Output: new center_i = mean(<center_i, V>)
- Non-Hadoop → 1 hr 45 min; Hadoop → 51 min
For running K-Means in Java, see here. For extending K-Means in Java to MapReduce and Hadoop, see here.
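One iteration of the scheme above can be sketched on a single machine. This is an illustrative Python sketch, not the Java/Hadoop code behind the timings: the function names are hypothetical, squared Euclidean distance is an assumed choice of distance function, and the vectors are toy data.

```python
from collections import defaultdict


def nearest_center(v, centers):
    """Return the index of the center closest to v (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(range(len(centers)), key=lambda i: dist2(centers[i]))


def kmeans_map(vectors, centers):
    """Map: emit <center index, vector> for each vector."""
    for v in vectors:
        yield nearest_center(v, centers), v


def kmeans_reduce(pairs, centers):
    """Reduce: group vectors by center index and average them into new centers."""
    groups = defaultdict(list)
    for i, v in pairs:
        groups[i].append(v)
    new_centers = list(centers)  # a center with no assigned vectors is left unchanged
    for i, vs in groups.items():
        new_centers[i] = tuple(sum(col) / len(vs) for col in zip(*vs))
    return new_centers


vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(10):  # fixed number of iterations, as in the slides
    centers = kmeans_reduce(kmeans_map(vectors, centers), centers)
```

Because every iteration needs the centers from the previous one, each pass is a separate map/reduce round; only the work within a round is parallel.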
Cluster 4
[Figure: countries (labeled by ISO code) shaded by Cluster 4 assignments; legend bins: [0,10), [10,20), ..., [80,90]]
Cluster 7
[Figure: countries (labeled by ISO code) shaded by Cluster 7 assignments; legend bins: [0,5), [5,10), ..., [40,45]]
Cluster 40
[Figure: countries (labeled by ISO code) shaded by Cluster 40 assignments; legend bins: [0,10), [10,20), ..., [120,130]]
Moving from Hadoop/Java to RHadoop
- Write and implement Hadoop jobs in R via Hadoop Streaming
- This requires three RHadoop packages: rhdfs, rmr, rhbase
- Also requires additional prerequisite packages (e.g., rJava)
- Also requires that you install and build Thrift
- Install & set-up Hadoop, RHadoop, and Streaming on EC2
- Use Amazon Elastic MapReduce (EMR) to run RHadoop via Streaming
Overviews of RHadoop, and installation info: here, here, here, and here.
Ready-Made Examples for RHadoop
- Basic data analysis
- Word count
- Logistic Regression
- K-Means (also here)
- Linear Least Squares
Getting Started on Hadoop
- For those interested in trying out Hadoop on Gordon...
- QuaSSIHadoop.zip
- Readme, .sh scripts, output files, error files, and all necessary input files for 4 basic Hadoop jobs:
- Simple: a simple setup and usage example
- TestDFS: distributed file system (DFS) I/O benchmark
- TeraSort: sorting benchmark
- Wordcount: word frequencies