Data Intensive Scalable Computing



SLIDE 1

Data Intensive Scalable Computing

Randal E. Bryant
Carnegie Mellon University
http://www.cs.cmu.edu/~bryant

SLIDE 2

Examples of Big Data Sources

Wal-Mart
  • 267 million items/day, sold at 6,000 stores
  • HP building them 4PB data warehouse
  • Mine data to manage supply chain, understand market trends, formulate pricing strategies

Sloan Digital Sky Survey
  • New Mexico telescope captures 200 GB image data / day
  • Latest dataset release: 10 TB, 287 million celestial objects
  • SkyServer provides SQL access
  • Next generation LSST even bigger

SLIDE 3

Our Data-Driven World

Science
  • Data bases from astronomy, genomics, natural languages, seismic modeling, …

Humanities
  • Scanned books, historic documents, …

Commerce
  • Corporate sales, stock market transactions, census, airline traffic, …

Entertainment
  • Internet images, Hollywood movies, MP3 files, …

Medicine
  • MRI & CT scans, patient records, …

SLIDE 4

Cloud Computing Varieties

“I’ve got terabytes of data. Tell me what they mean.”
  • Very large, shared data repository
  • Complex analysis
  • Data-intensive scalable computing (DISC)

“I don’t want to be a system administrator. You handle my data & applications.”
  • Hosted services
  • Documents, web-based email, etc.
  • Can access from anywhere
  • Easy sharing and collaboration

SLIDE 5

CS Research Issues

Applications
  • Language translation, image processing, …

Application Support
  • Machine learning over very large data sets
  • Web crawling

Programming
  • Abstract programming models to support large-scale computation
  • Distributed databases

System Design
  • Error detection & recovery mechanisms
  • Resource scheduling and load balancing
  • Distribution and sharing of data across system

SLIDE 6

Getting Started

Goal
  • Get faculty & students active in DISC

Software: Hadoop
  • Open source project inspired by Google infrastructure
      • Distributed file system
      • MapReduce programming environment
  • Supported and used by Yahoo
  • Prototype on single machine, map onto cluster
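The MapReduce programming model named on this slide can be illustrated with a small single-process sketch. This is plain Python, not the Hadoop API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical helpers invented for illustration, and the word-count task is an assumed example rather than one taken from the slides. It only shows the three phases (map, group by key, reduce) that a framework such as Hadoop would distribute across a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts collected for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, group-by-key (shuffle), and reduce in a single process.

    A real framework would run mappers and reducers on many machines;
    here everything happens locally (an assumption made for illustration).
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: two short "documents".
lines = ["big data big clusters", "data intensive computing"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
print(word_counts)
```

Because the mapper and reducer are pure functions of their inputs, the same two functions could in principle be handed to a distributed runtime unchanged, which is the idea behind "prototype on single machine, map onto cluster".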

SLIDE 7

Hardware: Rely on Kindness of Others

  • Google setting up dedicated cluster for university use
  • Loaded with open-source software
      • Including Hadoop
  • IBM providing additional software support
  • NSF will determine how facility should be used.

SLIDE 8

More Sources of Kindness

  • Yahoo: Major supporter of Hadoop
  • Yahoo plans to work with other universities

SLIDE 9

Big-Data Computing Study Group

  • Co-organized by REB & Thomas Kwan (Yahoo!)
  • Supported by Computing Community Consortium

SLIDE 10

BDCSG Activities

Hadoop Summit
  • 350+ people showed up
  • Power of Open Source

Data-Intensive Computing Symposium
  • ~100 from universities, companies, govt. labs, NSF
  • 14 invited speakers
      • Google, Yahoo!, Microsoft, Intel
      • CMU, UC Berkeley, Cornell, MIT, Johns Hopkins, UIUC, UW
      • NSF

SLIDE 11

NSF Involvement

SLIDE 12

Curriculum Development

  • Workshop for educators
  • July 16–18, 2008

SLIDE 13

Christophe Bisciglia

  • UW/Google
  • Catalyst / instigator

SLIDE 14

Future Workshops

SLIDE 15

Concluding Thoughts

The World is Ready for a New Approach to Large-Scale Computing
  • Optimized for data-driven applications
  • Technology favoring centralized facilities
      • Storage capacity & computer power growing faster than network bandwidth

Industry is Catching on Quickly
  • Large crowd for Hadoop Summit

University Researchers / Educators Eager to Get Involved
  • Spans wide range of CS disciplines
  • Across multiple institutions
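The point that storage grows faster than network bandwidth, and that this favors centralized facilities, can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions: the 1 Gb/s link speed is a hypothetical figure not taken from the slides, decimal units are assumed (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), protocol overhead is ignored, and `transfer_days` is a helper name invented here. The data sizes are the 10 TB and 4 PB figures quoted on Slide 2.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link at full, sustained speed."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


ASSUMED_LINK_BPS = 1e9  # hypothetical 1 Gb/s link (assumption, not from the slides)

sky_survey_days = transfer_days(10e12, ASSUMED_LINK_BPS)  # 10 TB dataset release
warehouse_days = transfer_days(4e15, ASSUMED_LINK_BPS)    # 4 PB data warehouse

print(f"10 TB: {sky_survey_days:.2f} days")
print(f"4 PB:  {warehouse_days:.2f} days")
```

Under these assumptions the 10 TB release moves in under a day, while the 4 PB warehouse would take roughly a year, which suggests why it can be more practical to bring the computation to the data than to ship the data across a network.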