  1. Data Intensive Scalable Computing
     Randal E. Bryant, Carnegie Mellon University
     http://www.cs.cmu.edu/~bryant

  2. Examples of Big Data Sources
     • Wal-Mart
       • 267 million items/day, sold at 6,000 stores
       • HP built them a 4 PB data warehouse
       • Mine data to manage the supply chain, understand market trends, formulate pricing strategies
     • LSST
       • Chilean telescope will scan the entire sky every 3 days
       • A 3.2 gigapixel digital camera
       • Generates 30 TB/day of image data

  3. Why So Much Data?
     • We Can Get It: automation + the Internet
     • We Can Keep It: Seagate Barracuda, 1.5 TB @ $150 (10¢/GB)
     • We Can Use It: scientific breakthroughs, business process efficiencies, realistic special effects, better health care
     • Could We Do More? Apply more computing power to this data

  4. Google Data Center, The Dalles, Oregon
     • Hydroelectric power @ 2¢/kWh
     • 50 megawatts: enough to power 6,000 homes

  5. Varieties of Cloud Computing
     • Hosted services: “I don’t want to be a system administrator. You handle my data & applications.”
       • Documents, web-based email, etc.
       • Can access from anywhere
       • Easy sharing and collaboration
     • Data-intensive scalable computing (DISC): “I’ve got terabytes of data. Tell me what they mean.”
       • Very large, shared data repository
       • Complex analysis

  6. Oceans of Data, Skinny Pipes
     1 terabyte: easy to store, hard to move.

     Disks              | MB/s    | Time to move 1 TB
     Seagate Barracuda  | 115     | 2.3 hours
     Seagate Cheetah    | 125     | 2.2 hours

     Connection         | MB/s    | Time to move 1 TB
     Home Internet      | < 0.625 | > 18.5 days
     Gigabit Ethernet   | < 125   | > 2.2 hours
     PSC Teragrid       | < 3,750 | > 4.4 minutes
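
     The times in the table follow directly from the bandwidths. A minimal Python sketch (my addition, not part of the deck) reproduces the column; the Barracuda row comes out at 2.4 hours versus the slide's 2.3, presumably a rounding difference:

```python
# Reproduce the "time to move 1 TB" column from the stated bandwidths.
TB_IN_MB = 1_000_000  # 1 TB in MB, decimal units as on the slide

links = {
    "Seagate Barracuda": 115,    # MB/s
    "Seagate Cheetah":   125,
    "Home Internet":     0.625,
    "Gigabit Ethernet":  125,
    "PSC Teragrid":      3_750,
}

for name, mb_per_s in links.items():
    seconds = TB_IN_MB / mb_per_s
    if seconds < 3600:
        print(f"{name:18s} {seconds / 60:6.1f} minutes")
    elif seconds < 86400:
        print(f"{name:18s} {seconds / 3600:6.1f} hours")
    else:
        print(f"{name:18s} {seconds / 86400:6.1f} days")
```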

  7. Data-Intensive System Challenge
     For computation that accesses 1 TB in 5 minutes:
     • Data distributed over 100+ disks, assuming uniform data partitioning
     • Compute using 100+ processors
     • Connected by gigabit Ethernet (or equivalent)
     System requirements:
     • Lots of disks
     • Lots of processors
     • Located in close proximity, within reach of a fast local-area network
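
     A quick back-of-the-envelope check (my numbers and variable names, not from the talk) makes the sizing concrete:

```python
# Check the "1 TB in 5 minutes" requirement against per-disk bandwidth.
data_mb  = 1_000_000   # 1 TB in MB
deadline = 5 * 60      # seconds
n_disks  = 100         # uniform partitioning, as the slide assumes

aggregate = data_mb / deadline     # ~3,333 MB/s across the whole system
per_disk  = aggregate / n_disks    # ~33 MB/s per disk

print(f"aggregate {aggregate:,.0f} MB/s; per disk {per_disk:.0f} MB/s")
# ~33 MB/s is well under a single disk's ~115 MB/s streaming rate, so
# 100+ disks meet the deadline, but only if the network linking them to
# the 100+ processors can also carry ~3.3 GB/s in aggregate.
```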

  8. Desiderata for DISC Systems
     • Focus on data: terabytes, not tera-FLOPS
     • Problem-centric programming: platform-independent expression of data parallelism
     • Interactive access: from simple queries to massive computations
     • Robust fault tolerance: component failures handled as routine events
     Contrast this with existing supercomputer / HPC systems.

  9. System Comparison: Programming Models
     [Figure: software stacks. Conventional supercomputer: application programs → software packages → machine-dependent programming model → hardware. DISC: application programs → machine-independent programming model → runtime system → hardware.]
     Conventional supercomputers:
     • Programs described at a very low level; specify detailed control of processing & communications
     • Rely on a small number of software packages, written by specialists
     • Limits the classes of problems & solution methods
     DISC:
     • Application programs written in terms of high-level operations on data
     • Runtime system controls scheduling, load balancing, …

  10. System Comparison: Reliability
     Runtime errors are commonplace in large-scale systems: hardware failures, transient errors, software bugs.
     Conventional supercomputers: “brittle” systems
     • Main recovery mechanism is to recompute from the most recent checkpoint
     • Must bring down the system for diagnosis, repair, or upgrades
     DISC: flexible error detection and recovery
     • Runtime system detects and diagnoses errors
     • Selective use of redundancy and dynamic recomputation
     • Replace or upgrade components while the system is running
     • Requires a flexible programming model & runtime environment

  11. Exploring Parallel Computation Models
     [Figure: spectrum of models from low-communication, coarse-grained (SETI@home, MapReduce) to high-communication, fine-grained (MPI, threads, PRAM).]
     DISC + MapReduce provides coarse-grained parallelism:
     • Computation done by independent processes
     • File-based communication
     Observations:
     • Relatively “natural” programming model
     • Research issue to explore full potential and limits: Dryad project at MSR, Pig project at Yahoo!

  12. Message Passing / Shared Memory: Existing HPC Machines
     [Figures: processes P1..P5 exchanging messages; processes P1..P5 attached to a common shared memory.]
     Characteristics:
     • Long-lived processes
     • Make use of spatial locality
     • Hold all program data in memory
     • High-bandwidth communication
     Strengths:
     • High utilization of resources
     • Effective for many scientific applications
     Weaknesses:
     • Very brittle: relies on everything working correctly and in close synchrony (see the message-passing sketch below)
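
     For contrast with the Map/Reduce sketches further down, here is a minimal message-passing example using mpi4py (my illustration; the talk shows diagrams, not code). Each rank is a long-lived process holding its data in memory, and the exchange silently assumes both ranks stay alive:

```python
# Minimal message-passing sketch. Run with e.g.: mpiexec -n 2 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = list(range(1000))           # rank 0 holds its data in memory
    comm.send(data, dest=1, tag=0)     # blocking send to rank 1
    result = comm.recv(source=1, tag=1)
    print("rank 0 got partial sum:", result)
elif rank == 1:
    data = comm.recv(source=0, tag=0)  # blocks until rank 0's message arrives
    comm.send(sum(data), dest=0, tag=1)
# If either rank dies mid-exchange, the other blocks forever: exactly
# the brittleness the slide describes.
```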

  13. HPC Fault Tolerance
     [Figure: timeline of processes P1..P5 with periodic checkpoints; on failure, all processes restore to the last checkpoint and the intervening computation is wasted.]
     Checkpoint:
     • Periodically store the state of all processes
     • Significant I/O traffic
     Restore:
     • When a failure occurs, reset state to that of the last checkpoint
     • All intervening computation is wasted
     Performance scaling: very sensitive to the number of failing components
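
     A toy model (entirely my assumptions: independent component failures, a shared MTBF, average rollback of half a checkpoint interval, checkpoint I/O cost ignored) illustrates that sensitivity:

```python
# Estimate the fraction of computation thrown away as the machine grows.
def wasted_fraction(n_components, mtbf_s, checkpoint_interval_s):
    machine_failure_rate = n_components / mtbf_s    # failures per second
    lost_per_failure = checkpoint_interval_s / 2    # average rollback
    return machine_failure_rate * lost_per_failure  # fraction of time wasted

# Assumed numbers: 3-year component MTBF, hourly checkpoints.
for n in (100, 1_000, 10_000):
    f = wasted_fraction(n, mtbf_s=3 * 365 * 86400, checkpoint_interval_s=3600)
    print(f"{n:6d} components: ~{f:.1%} of computation wasted")
# Waste grows linearly with component count: ~0.2% at 100 components,
# ~19% at 10,000, before even counting the checkpoint I/O itself.
```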

  14. Map/Reduce Operation
     [Figure: a pipeline of map stages feeding reduce stages.]
     Characteristics:
     • Computation broken into many short-lived tasks: mapping, reducing
     • Use disk storage to hold intermediate results
     Strengths:
     • Great flexibility in placement, scheduling, and load balancing
     • Handle failures by recomputation
     • Can access large data sets
     Weaknesses:
     • Higher overhead
     • Lower raw performance
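
     A minimal in-process sketch (my illustration; real systems spill the shuffle to disk and distribute the tasks across machines) of the map → shuffle → reduce pattern, using word count:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs from one input record.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: combine all values that share a key.
    return word, sum(counts)

documents = ["the map step runs anywhere", "the reduce step runs anywhere"]

# Shuffle: group intermediate pairs by key. A real system writes these
# to disk and repartitions them across machines; a failed task is
# simply re-run from its input file.
groups = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        groups[word].append(count)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# {'the': 2, 'map': 1, 'step': 2, 'runs': 2, 'anywhere': 2, 'reduce': 1}
```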

  15. Generalizing Map/Reduce (e.g., Microsoft Dryad project)
     [Figure: acyclic graph of operators Op1 … Opk applied to inputs x1 … xn.]
     Computational model:
     • Acyclic graph of operators, but expressed as a textual program
     • Each operator takes a collection of objects and produces objects
     • Purely functional model
     Implementation concepts:
     • Objects stored in files or memory
     • Any object may be lost; any operator may fail
     • Replicate & recompute for fault tolerance
     • Dynamic scheduling: # operators >> # processors
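
     A small sketch (my construction, loosely in the spirit of the operator-graph model above, not Dryad's actual API) of a purely functional graph whose materialized results can be lost and rebuilt by recomputation:

```python
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
        self._result = None                  # object held in file/memory

    def evaluate(self):
        if self._result is None:             # lost or never computed
            args = [n.evaluate() for n in self.inputs]
            self._result = self.op(*args)
        return self._result

    def drop(self):
        self._result = None                  # simulate losing the object

# A tiny three-operator graph: source -> filter -> count.
source = Node(lambda: [1, 2, 3, 4, 5, 6])
evens  = Node(lambda xs: [x for x in xs if x % 2 == 0], source)
count  = Node(len, evens)

print(count.evaluate())     # 3
evens.drop(); count.drop()  # a failure loses the stored intermediates
print(count.evaluate())     # 3 again: rebuilt because operators are pure
```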

  16. Concluding Thoughts
     Data-intensive computing becoming commonplace:
     • Facilities available from Google/IBM, Yahoo!, …
     • Hadoop becoming the platform of choice
     Lots of applications are fairly straightforward:
     • Use Map to do embarrassingly parallel execution
     • Make use of Hadoop’s load balancing and reliable file system
     What remains:
     • Integrating more demanding forms of computation: computations over large graphs, sparse numerical applications
     • Challenges: programming, implementation efficiency
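
     The “fairly straightforward” cases can be as small as two scripts. A sketch of a Hadoop Streaming word count (Streaming pipes records through stdin/stdout; the file names below are illustrative):

```python
# mapper.py -- Hadoop Streaming mapper. Hadoop runs one copy per input
# split: the embarrassingly parallel Map the slide mentions.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")   # emit key<TAB>value
```

```python
# reducer.py -- Streaming delivers mapper output sorted by key, so equal
# keys arrive consecutively and can be summed in one pass.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

     A job like this is submitted with something along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py`; HDFS replication and automatic task re-execution provide the load balancing and reliability the slide refers to.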
