CS520 Data Integration, Warehousing, and Provenance
- 7. Big Data Systems and Integration
IIT DBGroup
Boris Glavic
http://www.cs.iit.edu/~glavic/
http://www.cs.iit.edu/~cs520/
http://www.cs.iit.edu/~dbgroup/
Outline 0) Course Info 1)
CS520 - 7) Big Data Analytics
– Overview of two types of systems
– What is new compared to single node systems?
– How do these systems change our approach to integration/analytics?
– Bulk processing
– Fault tolerance
– Load balancing
– DBMS
– If you have many machines, failures are the norm
– Need mechanisms for the system to cope with failures
– This is called fault-tolerance
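A quick back-of-the-envelope calculation illustrates why failures become the norm as the number of machines grows. The sketch below assumes that machines fail independently and uses a hypothetical per-machine failure probability of 0.1% per day; both are illustrative assumptions, not figures from the lecture.

```python
def prob_at_least_one_failure(num_machines, p_fail):
    """Probability that at least one of num_machines fails.

    Assumes independent failures with the same probability p_fail
    per machine (an illustrative simplification).
    """
    return 1.0 - (1.0 - p_fail) ** num_machines


# Hypothetical numbers: 0.1% daily failure probability per machine.
single = prob_at_least_one_failure(1, 0.001)
cluster = prob_at_least_one_failure(1000, 0.001)
print(f"1 machine:     {single:.4f}")
print(f"1000 machines: {cluster:.4f}")
```

Under these assumptions a single machine almost never fails on a given day, while a 1000-machine cluster sees at least one failure on most days.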
– Clients submit jobs to the job tracker
– Job tracker decides #mappers, #reducers and which nodes to use
– Job tracker sends jobs to task tracker on worker nodes
– Tries to schedule map jobs on nodes that store the chunk processed by a job
– Job tracker monitors progress
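The locality preference described above can be sketched as a simple greedy assignment. This is a toy model, not Hadoop's actual scheduler: the function name `schedule_map_tasks`, the dictionaries, and the node and chunk names are all hypothetical.

```python
def schedule_map_tasks(chunk_locations, free_slots):
    """Greedily assign each chunk to a worker, preferring data-local workers.

    chunk_locations: dict chunk id -> list of nodes storing a replica
    free_slots: dict node -> number of map tasks the node can still accept
    Returns dict chunk id -> chosen node (chunks that fit nowhere are omitted).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for chunk, replicas in chunk_locations.items():
        # First choice: a node that stores the chunk and has a free slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        # Fallback: any node with a free slot (chunk is read over the network).
        candidates = local or [n for n in sorted(slots) if slots[n] > 0]
        if not candidates:
            continue
        node = candidates[0]
        assignment[chunk] = node
        slots[node] -= 1
    return assignment


chunks = {"c1": ["n1", "n2"], "c2": ["n1"], "c3": ["n3"]}
assignment = schedule_map_tasks(chunks, {"n1": 1, "n2": 1, "n3": 0, "n4": 1})
print(assignment)  # {'c1': 'n1', 'c2': 'n2', 'c3': 'n4'}
```

Only c1 runs data-local here: c2 loses its local slot on n1 to c1, and c3's only replica sits on a node with no free slot. A greedy pass like this is not optimal; a real scheduler has more information and more options.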
– Each mapper reads its assigned chunk from HDFS, translates the input into key-value pairs, and applies the map UDF to every (k,v)
– Outputs are written to disk with one file per reducer (hashing on key)
– Job tracker may spawn additional mappers if mappers are not making progress
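The mapper behaviour described above can be sketched in a few lines. This is an in-memory toy, not Hadoop code: lists stand in for the per-reducer files, `word_count_map` is a hypothetical map UDF, and hashing with `zlib.crc32` is an assumption chosen only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict


def word_count_map(key, value):
    """Hypothetical map UDF: emit (word, 1) for every word in a line."""
    for word in value.split():
        yield word, 1


def partition(key, num_reducers):
    """Pick the reducer for a key by hashing the key."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def run_mapper(records, map_udf, num_reducers):
    """Apply map_udf to every (k, v) and split the output per reducer.

    Returns dict reducer id -> list of (key, value) pairs sorted by key;
    each list stands in for one output file on the mapper's local disk.
    """
    buckets = defaultdict(list)
    for k, v in records:
        for out_k, out_v in map_udf(k, v):
            buckets[partition(out_k, num_reducers)].append((out_k, out_v))
    return {r: sorted(buckets[r]) for r in range(num_reducers)}


# The chunk is modelled as (line offset, line text) pairs.
chunk = [(0, "big data big systems"), (1, "data integration")]
outputs = run_mapper(chunk, word_count_map, num_reducers=2)
for reducer_id, pairs in outputs.items():
    print(reducer_id, pairs)
```

Because the reducer is chosen by hashing the key, all pairs with the same key end up in the same per-reducer file, no matter which mapper produced them.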
– Mappers send their output files to the reducers (scp)
– Each reducer merges and sorts these input files on key values
– Essentially the merge phase of a sort where runs already exist
– Reducer applies the reduce UDF to each key and associated list of values
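The reducer side can be sketched as a merge of already-sorted runs followed by grouping on the key. This is a minimal in-memory illustration, assuming each input run is already sorted by key; `sum_reduce` is a hypothetical reduce UDF and `run_reducer` is not a Hadoop API.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sum_reduce(key, values):
    """Hypothetical reduce UDF: sum all values of a key."""
    return key, sum(values)


def run_reducer(sorted_runs, reduce_udf):
    """Merge key-sorted runs and apply reduce_udf once per distinct key.

    sorted_runs: list of lists of (key, value) pairs, each sorted by key
    (one run per mapper in this sketch).
    """
    # Only a merge is needed because every run is already sorted.
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [
        reduce_udf(k, [v for _, v in group])
        for k, group in groupby(merged, key=itemgetter(0))
    ]


runs = [[("a", 1), ("c", 1)], [("a", 1), ("b", 1)]]
result = run_reducer(runs, sum_reduce)
print(result)  # [('a', 2), ('b', 1), ('c', 1)]
```

Because the merged stream is ordered by key, all values of a key arrive consecutively, so the reducer can process one key at a time without holding its whole input in memory.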