Data Intensive Scalable Computing

Randal E. Bryant, Carnegie Mellon University, http://www.cs.cmu.edu/~bryant


  1. Data Intensive Scalable Computing
     Randal E. Bryant, Carnegie Mellon University
     http://www.cs.cmu.edu/~bryant

  2. Examples of Big Data Sources
     Wal-Mart
     • 267 million items/day, sold at 6,000 stores
     • HP building them a 4 PB data warehouse
     • Mine data to manage supply chain, understand market trends, formulate pricing strategies
     Sloan Digital Sky Survey
     • New Mexico telescope captures 200 GB image data / day
     • Latest dataset release: 10 TB, 287 million celestial objects
     • SkyServer provides SQL access
     • Next generation LSST even bigger

  3. Our Data-Driven World
     Science
     • Databases from astronomy, genomics, natural languages, seismic modeling, …
     Humanities
     • Scanned books, historic documents, …
     Commerce
     • Corporate sales, stock market transactions, census, airline traffic, …
     Entertainment
     • Internet images, Hollywood movies, MP3 files, …
     Medicine
     • MRI & CT scans, patient records, …

  4. Cloud Computing Varieties
     “I’ve got terabytes of data. Tell me what they mean.”
     • Very large, shared data repository
     • Complex analysis
     • Data-intensive scalable computing (DISC)
     “I don’t want to be a system administrator. You handle my data & applications.”
     • Hosted services
     • Documents, web-based email, etc.
     • Can access from anywhere
     • Easy sharing and collaboration

  5. CS Research Issues
     Applications
     • Language translation, image processing, …
     Application Support
     • Machine learning over very large data sets
     • Web crawling
     Programming
     • Abstract programming models to support large-scale computation
     • Distributed databases
     System Design
     • Error detection & recovery mechanisms
     • Resource scheduling and load balancing
     • Distribution and sharing of data across system

  6. Getting Started
     Goal
     • Get faculty & students active in DISC
     Software: Hadoop
     • Open source project inspired by Google infrastructure
     • Distributed file system
     • MapReduce programming environment
     • Supported and used by Yahoo
     • Prototype on single machine, map onto cluster
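The "prototype on single machine" idea from the Getting Started slide can be illustrated with a minimal word-count sketch of the MapReduce programming model. This is plain single-process Python, not Hadoop's actual API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for illustration. On a cluster, a framework such as Hadoop would run the map and reduce steps on many machines and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical toy input, not data from the presentation.
    docs = ["big data big clusters", "data intensive computing"]
    print(word_count(docs))
```

Because the map and reduce functions only see one document or one key at a time, the same logic could in principle be distributed without changing it, which is the property the slide's "map onto cluster" bullet relies on.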

  7. Hardware: Rely on Kindness of Others
     • Google setting up dedicated cluster for university use
     • Loaded with open-source software
     • Including Hadoop
     • IBM providing additional software support
     • NSF will determine how facility should be used

  8. More Sources of Kindness
     • Yahoo: Major supporter of Hadoop
     • Yahoo plans to work with other universities

  9. Big-Data Computing Study Group
     • Co-organized by REB & Thomas Kwan (Yahoo!)
     • Supported by Computing Community Consortium

  10. BDCSG Activities
      Hadoop Summit
      • 350+ people showed up
      • Power of Open Source
      Data-Intensive Computing Symposium
      • ~100 from universities, companies, govt. labs, NSF
      • 14 invited speakers
      • Google, Yahoo!, Microsoft, Intel
      • CMU, UC Berkeley, Cornell, MIT, Johns Hopkins, UIUC, UW
      • NSF

  11. NSF Involvement

  12. Curriculum Development
      • Workshop for educators July 16–18, 2008

  13. Christophe Bisciglia
      • UW/Google
      • Catalyst / instigator

  14. Future Workshops

  15. Concluding Thoughts
      The World is Ready for a New Approach to Large-Scale Computing
      • Optimized for data-driven applications
      • Technology favoring centralized facilities
      • Storage capacity & computer power growing faster than network bandwidth
      Industry is Catching on Quickly
      • Large crowd for Hadoop Summit
      University Researchers / Educators Eager to Get Involved
      • Spans wide range of CS disciplines
      • Across multiple institutions
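The point about network bandwidth lagging behind storage can be made concrete with a rough back-of-the-envelope calculation. The sketch below estimates how long it would take to copy the 10 TB Sloan Digital Sky Survey release mentioned earlier over a single link. The 100 Mbit/s link speed is an assumption chosen only for illustration and does not come from the presentation, and `transfer_days` is a hypothetical helper name.

```python
def transfer_days(size_bytes, link_bits_per_second):
    """Return the days needed to move size_bytes over a link, ignoring overhead."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    dataset_bytes = 10 * 10**12          # 10 TB dataset release (from the slides)
    assumed_link_bps = 100 * 10**6       # ASSUMPTION: 100 Mbit/s link, illustrative only
    days = transfer_days(dataset_bytes, assumed_link_bps)
    print(f"{days:.1f} days")            # prints "9.3 days"
```

Under that assumed link speed the copy takes more than a week, which is one way to read the slide's argument for keeping computation in a centralized facility next to the data.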
