Distributed Computing at Rackspace (Hai.Thai@rackspace.com)


slide-1
SLIDE 1

Distributed Computing at Rackspace

Hai.Thai@rackspace.com

slide-3
SLIDE 3

About: Me

ME

  • '09 Tech grad
  • B.S. Computer Engineering
  • 4 years at Rackspace
slide-5
SLIDE 5

About: Rackspace

  • Managed + Cloud hosting
  • Cloud Applications: Email
slide-6
SLIDE 6

About: Rackspace

  • Office in Blacksburg
  • 100 best companies to work for
  • We’re hiring!
slide-7
SLIDE 7

The Big Picture

Data is VALUABLE

Data is growing

  • More sources + more data per source
  • Faster than individual devices can keep up
  • Years of information
slide-8
SLIDE 8

The Big Picture: Rackspace

At Rackspace e-mail

  • 2.5 Million mailboxes
  • 50-100 Million messages / day
  • 300-400 GB raw log data / day
  • Hundreds of servers
  • TBs of stored log data
slide-9
SLIDE 9

The Big Picture: Rackspace

How do we…

  • Aggregate
  • Store
  • Analyze
  • Access
slide-10
SLIDE 10

The Big Picture: Rackspace

How do we…

Get Value?

slide-11
SLIDE 11

The Problem

With mail logs, we can:

  • Help customers
  • Diagnose the system
  • Understand and plan
slide-12
SLIDE 12

Aggregation

  • Multi-Source Single-Sink
  • Real-world network
  • Hardware Failure
slide-13
SLIDE 13

Storage

  • Distributed
  • Fault tolerant
  • Horizontally scalable
  • Easy
slide-14
SLIDE 14

Serving Logs

Make logs accessible for:

  • Support to help customers
  • Operations to diagnose errors
slide-15
SLIDE 15

Serving Logs

The challenge: Volume

  • 400+ GB / day ≈ 300 MB / min
  • Must be timely
  • Related log data may be disjoint
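The volume figure above can be checked with a short calculation. This is only a sketch: it takes the 400 GB lower bound from the slide and assumes 1 GB = 1024 MB, so the real rate for "400+" GB will be somewhat higher, which is consistent with the rounded 300 MB / min.

```python
# Back-of-the-envelope ingest rate from the daily log volume on the slide.
GB_PER_DAY = 400            # lower bound ("400+ GB / day")
MINUTES_PER_DAY = 24 * 60   # 1440


def mb_per_minute(gb_per_day: float, mb_per_gb: int = 1024) -> float:
    """Average ingest rate in MB/min for a given daily volume."""
    return gb_per_day * mb_per_gb / MINUTES_PER_DAY


def mb_per_second(gb_per_day: float, mb_per_gb: int = 1024) -> float:
    """Average ingest rate in MB/s for a given daily volume."""
    return mb_per_minute(gb_per_day, mb_per_gb) / 60


rate_min = mb_per_minute(GB_PER_DAY)
rate_sec = mb_per_second(GB_PER_DAY)
print(f"{rate_min:.0f} MB/min, {rate_sec:.1f} MB/s")  # 284 MB/min, 4.7 MB/s
```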
slide-16
SLIDE 16

Serving Logs

  • Index data with Hadoop MapReduce
  • Serve indexes in Solr

Hadoop + Solr

slide-17
SLIDE 17

Serving Logs: Indexing

MapReduce:

  • History on distributed systems:
      • Google
  • Easily distributed
  • Map step: emits key -> value pairs
  • Reduce step: receives all values for a key
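The map and reduce steps can be illustrated with a tiny single-process sketch. It is not Hadoop: the `map_reduce` driver, `word_mapper`, and `count_reducer` are hypothetical names chosen for illustration, and the in-memory grouping stands in for the shuffle a real cluster performs between the two steps.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map step: emit key -> value pairs
            groups[key].append(value)       # "shuffle": collect values per key
    # reduce step: each key is reduced with all of its values
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


counts = map_reduce(
    ["connect from hostname", "disconnect from hostname"],
    word_mapper,
    count_reducer,
)
print(counts)  # {'connect': 1, 'disconnect': 1, 'from': 2, 'hostname': 2}
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread across many machines, which is what makes the model easy to distribute.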
slide-18
SLIDE 18

Serving Logs: Indexing

MapReduce for mail logs:

  • Map step:
      • Parse raw log
  • Reduce step:
      • Aggregate related log lines
      • Generate relevant structure for queries
      • Output as Solr index
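The steps above could look roughly like the following sketch. It assumes the Postfix queue ID is the key that relates log lines, which the slides do not state; the regular expressions, the document fields, and the function names are all illustrative, and the reducer returns plain dictionaries rather than writing a real Solr index.

```python
import re
from collections import defaultdict

# Syslog-style Postfix line that carries a queue ID, e.g.
# "Nov 12 17:36:54 host postfix/qmgr[9489]: 1DBD21B48AE: from=<...>, ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"postfix/(?P<daemon>\w+)\[\d+\]: "
    r"(?P<queue_id>[0-9A-F]{6,}): "
    r"(?P<message>.*)$"
)
SENDER_RE = re.compile(r"\bfrom=<([^>]*)>")
RECIPIENT_RE = re.compile(r"\bto=<([^>]*)>")
STATUS_RE = re.compile(r"\bstatus=(\w+)")


def map_line(line):
    """Map step: parse one raw line and emit (queue_id, event); skip other lines."""
    m = LINE_RE.match(line)
    if m is None:
        return  # connect/disconnect/NOQUEUE lines carry no queue ID
    yield m["queue_id"], {"host": m["host"], "message": m["message"]}


def reduce_message(queue_id, events):
    """Reduce step: fold all events for one queue ID into a searchable document."""
    doc = {"id": queue_id, "sender": None, "recipients": [],
           "status": None, "removed": False, "hosts": set()}
    for event in events:
        msg = event["message"]
        doc["hosts"].add(event["host"])
        if (m := SENDER_RE.search(msg)):
            doc["sender"] = m.group(1)
        if (m := RECIPIENT_RE.search(msg)):
            doc["recipients"].append(m.group(1))
        if (m := STATUS_RE.search(msg)):
            doc["status"] = m.group(1)
        if msg == "removed":
            doc["removed"] = True
    doc["hosts"] = sorted(doc["hosts"])
    return doc


def index_logs(lines):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    grouped = defaultdict(list)
    for line in lines:
        for queue_id, event in map_line(line):
            grouped[queue_id].append(event)
    return {qid: reduce_message(qid, events) for qid, events in grouped.items()}


SAMPLE = [
    "Nov 12 17:36:54 gate8.gate.sat.mlsrvr.com postfix/smtpd[2552]: connect from hostname",
    "Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/qmgr[9489]: 1DBD21B48AE: "
    "from=<mapreduce@mailtrust.com>, size=5950, nrcpt=1 (queue active)",
    "Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/smtp[15928]: 732196384ED: "
    "to=<mapreduce@mailtrust.com>, relay=hostname[ip], conn_use=2, delay=0.69, "
    "delays=0.04/0.44/0.04/0.17, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 02E1544C005)",
    "Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/qmgr[13764]: 732196384ED: removed",
]
docs = index_logs(SAMPLE)
print(docs["732196384ED"])
```

The walrus operator (`:=`) requires Python 3.8 or later.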
slide-19
SLIDE 19

Serving Logs: Indexing

Nov 12 17:36:54 gate8.gate.sat.mlsrvr.com postfix/smtpd[2552]: connect from hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/qmgr[9489]: 1DBD21B48AE: from=<mapreduce@mailtrust.com>, size=5950, nrcpt=1 (queue active)
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtpd[28085]: disconnect from hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: too many errors after DATA from hostname
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/smtp[15928]: 732196384ED: to=<mapreduce@mailtrust.com>, relay=hostname[ip], conn_use=2, delay=0.69, delays=0.04/0.44/0.04/0.17, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 02E1544C005)
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/smtpd[22593]: disconnect from hostname
Nov 12 17:36:54 gate10.gate.sat.mlsrvr.com postfix/smtpd[10311]: connect from hostname
Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtp[28107]: D42001B48B5: to=<mapreduce@mailtrust.com>, relay=hostname[ip], delay=0.32, delays=0.28/0/0/0.04, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 1DBD21B48AE)
Nov 12 17:36:54 gate20.gate.sat.mlsrvr.com postfix/smtpd[27168]: disconnect from hostname
Nov 12 17:36:54 gate5.gate.sat.mlsrvr.com postfix/qmgr[1209]: 645965A0224: removed
Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/qmgr[13764]: 732196384ED: removed
Nov 12 17:36:54 gate1.gate.sat.mlsrvr.com postfix/smtpd[26394]: NOQUEUE: reject: RCPT from hostname 554 5.7.1 <mapreduce@mailtrust.com>: Client host rejected: The sender's mail server is blocked; from=<mapreduce@mailtrust.com> to=<mapreduce@mailtrust.com> proto=ESMTP helo=<mapreduce@mailtrust.com>

slide-21
SLIDE 21

Serving Logs: Searching

  • Full text search + advanced search features
  • Supports distributed operation
  • Horizontally scalable
slide-22
SLIDE 22

Serving Logs: Searching

Our Solr cluster:

  • Separate from Hadoop
  • Pulls indexed data and merges into memory
  • Subset of logs searchable
  • Shard data based on time
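Time-based sharding means a query only needs to touch the shards that cover the requested time window. The sketch below assumes one Solr core per day; the host name, the `logs_YYYYMMDD` core naming, the `recipient` field, and the dates are hypothetical and not taken from the slides. The `q`, `rows`, `wt`, and `shards` request parameters are standard Solr, with `shards` listing the cores a distributed search should fan out to.

```python
from datetime import date, timedelta
from typing import List
from urllib.parse import urlencode

SOLR_HOST = "solr.example.com:8983"  # hypothetical host


def shards_for_range(start: date, end: date) -> List[str]:
    """Return one shard address per day in [start, end] (assumed daily cores)."""
    shards = []
    day = start
    while day <= end:
        shards.append(f"{SOLR_HOST}/solr/logs_{day:%Y%m%d}")
        day += timedelta(days=1)
    return shards


def build_query_url(query: str, start: date, end: date, rows: int = 10) -> str:
    """Build a distributed-search URL that only hits shards in the time window."""
    shards = shards_for_range(start, end)
    if not shards:
        raise ValueError("end date is before start date")
    params = {"q": query, "rows": rows, "wt": "json", "shards": ",".join(shards)}
    # Any shard can coordinate the distributed request; use the first one.
    return f"http://{shards[0]}/select?{urlencode(params)}"


url = build_query_url(
    "recipient:mapreduce@mailtrust.com",  # hypothetical field name
    date(2012, 11, 12),                   # arbitrary example dates
    date(2012, 11, 13),
)
print(url)
```

Keeping the shard list short is what lets only a subset of logs be searchable at a time: older days can simply be dropped from the list of cores.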
slide-23
SLIDE 23

Analytics

Hadoop MapReduce

  • Large sets of data
  • 100s of GBs per job; potentially TBs
  • Full power of MapReduce
  • Hadoop Streaming
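Hadoop Streaming runs any executable as the mapper or reducer: each reads lines on standard input and writes tab-separated key/value lines on standard output, and the reducer receives its input sorted by key. The sketch below counts delivery statuses and is only an illustration of that contract, not one of the jobs described in the talk. The functions take iterables of lines so the flow can be simulated locally; in a real job each would be its own script looping over `sys.stdin` and printing its output.

```python
import re
from itertools import groupby

STATUS_RE = re.compile(r"\bstatus=(\w+)")


def mapper(lines):
    """Emit 'status<TAB>1' for every log line that records a delivery status."""
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            yield f"{m.group(1)}\t1"


def reducer(sorted_lines):
    """Sum counts per key; input must already be sorted by key, as Hadoop guarantees."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


sample = [
    "Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/smtp[15928]: 732196384ED: "
    "to=<mapreduce@mailtrust.com>, relay=hostname[ip], conn_use=2, delay=0.69, "
    "delays=0.04/0.44/0.04/0.17, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 02E1544C005)",
    "Nov 12 17:36:54 relay2.relay.sat.mlsrvr.com postfix/smtp[28107]: D42001B48B5: "
    "to=<mapreduce@mailtrust.com>, relay=hostname[ip], delay=0.32, "
    "delays=0.28/0/0/0.04, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 1DBD21B48AE)",
    "Nov 12 17:36:54 gate2.gate.sat.mlsrvr.com postfix/qmgr[13764]: 732196384ED: removed",
]

# sorted() stands in for the sort/shuffle Hadoop performs between the two scripts.
result = list(reducer(sorted(mapper(sample))))
print(result)  # ['sent\t2']
```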
slide-24
SLIDE 24

Challenges

Building on top of HDFS

  • Easy to use, but bare-bones
  • Custom organization on top of filesystem
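One common way to impose organization on a plain filesystem is a time-partitioned directory layout, so a job can select its input by path alone. The layout below (`/logs/<source>/<YYYY>/<MM>/<DD>/<HH>`) is an assumption for illustration; the slides do not describe the actual structure used.

```python
import posixpath
from datetime import datetime, timedelta

LOG_ROOT = "/logs"  # hypothetical root directory in HDFS


def hour_dir(source: str, ts: datetime) -> str:
    """Directory holding logs from `source` for the hour containing `ts`."""
    return posixpath.join(LOG_ROOT, source, f"{ts:%Y/%m/%d/%H}")


def dirs_for_range(source: str, start: datetime, end: datetime) -> list:
    """All hourly directories a job must read to cover [start, end]."""
    current = start.replace(minute=0, second=0, microsecond=0)
    dirs = []
    while current <= end:
        dirs.append(hour_dir(source, current))
        current += timedelta(hours=1)
    return dirs


inputs = dirs_for_range(
    "postfix",
    datetime(2012, 11, 12, 17, 36, 54),  # arbitrary example window
    datetime(2012, 11, 12, 19, 5, 0),
)
print(inputs)
```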
slide-25
SLIDE 25

Challenges

In Flight Refactor

  • Original design assumed perfect information
  • Redesign around delayed logs/events
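Designing around delayed logs usually means bucketing by the time an event happened rather than the time it arrived, and remembering which already-indexed buckets have since received more data. The class below is a minimal sketch of that idea under those assumptions; it is not the design used in the talk, and all names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


class HourlyBuckets:
    """Group log lines by event hour and track buckets that need re-indexing."""

    def __init__(self):
        self.events = defaultdict(list)  # bucket start -> log lines
        self.indexed = set()             # buckets already indexed once
        self.dirty = set()               # indexed buckets that got late data

    @staticmethod
    def bucket_of(event_time: datetime) -> datetime:
        """Truncate an event timestamp to the start of its hour."""
        return event_time.replace(minute=0, second=0, microsecond=0)

    def add(self, event_time: datetime, line: str) -> None:
        """File a line under its event hour, however late it arrives."""
        bucket = self.bucket_of(event_time)
        self.events[bucket].append(line)
        if bucket in self.indexed:
            self.dirty.add(bucket)       # late arrival: schedule a re-index

    def mark_indexed(self, bucket: datetime) -> None:
        """Record that a bucket's index is up to date."""
        self.indexed.add(bucket)
        self.dirty.discard(bucket)


buckets = HourlyBuckets()
buckets.add(datetime(2012, 11, 12, 17, 36, 54), "on-time line")
buckets.mark_indexed(datetime(2012, 11, 12, 17))
buckets.add(datetime(2012, 11, 12, 17, 59, 1), "line that arrived after indexing")
print(sorted(buckets.dirty))  # the 17:00 bucket needs re-indexing
```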
slide-26
SLIDE 26

Challenges

  • Parsing application logs requires domain knowledge
  • Develop services based on distributed systems for solutions to use, rather than solutions built around technology

slide-27
SLIDE 27

The Future

  • Streaming vs Batching
  • Solr Cloud
  • New Logging solution
slide-28
SLIDE 28

Takeaway

  • Use of Hadoop + MapReduce to solve our data problem
  • Solutions must be created to extract value from growing data
  • Example of a real-world distributed system
slide-29
SLIDE 29

Distributed Systems

Big Data is only one of the areas of growth in distributed systems

We need YOU

RackerTalent.com

slide-30
SLIDE 30

Resources

  • lucene.apache.org/solr
  • hadoop.apache.org
  • Hadoop: The Definitive Guide