Understanding Issue Correlations: A Case Study of the Hadoop System



slide-1
SLIDE 1

Understanding Issue Correlations: A Case Study of the Hadoop System

Jian Huang

Xuechen Zhang† Karsten Schwan †

slide-6
SLIDE 6

Why Does Issue Study Matter?

Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System
  • Error-prone
  • Hard to Debug

Issue Study
  • Issue Pattern
  • Better Software & Debugging Tools

slide-10
SLIDE 10

Hadoop: A Representative Distributed System

Learn from issues: more than 6 years of experience.

[Figure: The Evolution of Apache Hadoop. Number of reported issues (x1000) per year, 2008-2015, for HDFS (storage), MapReduce (computation), ……]

slide-12
SLIDE 12

What Can We Learn From Issues?

Related Work
  • What Bugs Live in the Cloud? [Gunawi et al., SoCC’14]
  • A Study of Linux File System Evolution [Lu et al., FAST’13]
  • ……

Our Focus: Issue Correlations
  • Tools
  • Programming
  • Systems

slide-14
SLIDE 14

Our Findings

  • Half of the issues are independent
  • MapReduce issues tend to relate to YARN
  • One third of the issues have similar causes
  • ......

Tools, Programming, Systems

  • Memory: GC is still the No. 1 concern
  • Storage: the claim of “99.99% data reliability” is challenged
  • Programming: one third of the issues relate to interfaces
  • Tools: logging in Hadoop is error-prone
  • ......

slide-19
SLIDE 19

Methodology Used in Our Study

Hadoop Ecosystem: HDFS, HBase, HCatalog, Mahout, MapReduce, Cascading, Hive, Pig, Flume

Studied systems: Storage and Computation

  • Closed issues: 2359 and 2340
  • Examined issues: 2180 and 2038
  • Sampling period: ~6 years and 5 years
  • Sampling rate: 89.8%
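
The 89.8% sampling rate can be reproduced from the four counts on this slide. The sketch below assumes that 2180 and 2038 are the examined counts and that 2359 and 2340 are the closed counts; this pairing is inferred because it is the one that yields 89.8%. The class and method names are illustrative only.

```java
// Illustrative helper; not part of the study's tooling.
final class SamplingRate {
    /** Returns examined / closed as a percentage. */
    static double percent(int examined, int closed) {
        if (closed <= 0) {
            throw new IllegalArgumentException("closed must be positive");
        }
        return 100.0 * examined / closed;
    }

    public static void main(String[] args) {
        int examined = 2180 + 2038; // assumed: examined issues of the two systems
        int closed = 2359 + 2340;   // assumed: closed issues of the two systems
        System.out.printf("%d of %d closed issues examined: %.1f%%%n",
                examined, closed, percent(examined, closed));
    }
}
```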

slide-21
SLIDE 21

Methodology Used in Our Study

For each issue we examine: Description, Patches, Follow-up Discussions, Source Code Analysis

Labeling: each issue is recorded in HPatchDB with the fields
  • IssueID
  • Create/Commit Time
  • Subcomponent
  • Type
  • Causes
  • CorrelatedIssueID
  • ……
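
One way to picture a labeled entry is as a small immutable value class. This is a sketch only: the field names follow the slide, but the field types, the class name LabeledIssue, and the isIndependent helper are assumptions, and the sample values in main are made up.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical shape of one labeled issue; types are assumptions.
final class LabeledIssue {
    final String issueId;
    final Instant createTime;
    final Instant commitTime;
    final String subcomponent;
    final String type;
    final List<String> causes;
    final List<String> correlatedIssueIds;

    LabeledIssue(String issueId, Instant createTime, Instant commitTime,
                 String subcomponent, String type,
                 List<String> causes, List<String> correlatedIssueIds) {
        this.issueId = issueId;
        this.createTime = createTime;
        this.commitTime = commitTime;
        this.subcomponent = subcomponent;
        this.type = type;
        // Defensive copies keep the record immutable.
        this.causes = Collections.unmodifiableList(new ArrayList<>(causes));
        this.correlatedIssueIds =
                Collections.unmodifiableList(new ArrayList<>(correlatedIssueIds));
    }

    /** Assumed meaning of "independent": no correlated issue was recorded. */
    boolean isIndependent() {
        return correlatedIssueIds.isEmpty();
    }

    public static void main(String[] args) {
        // Made-up sample values, for illustration only.
        LabeledIssue issue = new LabeledIssue(
                "EXAMPLE-1",
                Instant.parse("2013-01-01T00:00:00Z"),
                Instant.parse("2013-01-08T00:00:00Z"),
                "example-subcomponent", "bug",
                Arrays.asList("memory"),
                Arrays.asList("EXAMPLE-2"));
        System.out.println(issue.issueId + " independent? " + issue.isIndependent());
    }
}
```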

slide-22
SLIDE 22

Where Are the Correlated Issues From?

Do you know where I’m from?

External Correlation: correlated issues appear in other systems
Internal Correlation: correlated issues appear in the same system

Percentage of issues by number of correlated issues:

                          #Correlated Issues
                          0        1        2        3        >=4
  External   HDFS         94.7%    4.8%     0.5%
             MapReduce    79.3%    17.1%    2.8%     0.5%     0.3%
  Internal   HDFS         52.7%    32.8%    9.1%     3.1%     2.3%
             MapReduce    59.3%    32.7%    5.6%     1.3%     1.0%

  • External: a significant number of issues are independent.
  • External, MapReduce: half of the correlated issues are from YARN.
  • Internal: half of the issues are independent.
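
The internal/external split can be sketched as a comparison of project prefixes. The code below assumes JIRA-style keys of the form PROJECT-NUMBER and assumes the table's buckets are 0, 1, 2, 3 and >=4; the class name, method names and the sample keys are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative classifier; not the authors' tooling.
final class CorrelationScope {
    /** Project prefix of a key, e.g. "HDFS" for "HDFS-123" (assumed key format). */
    static String project(String issueKey) {
        int dash = issueKey.lastIndexOf('-');
        if (dash <= 0) {
            throw new IllegalArgumentException("not a PROJECT-NUMBER key: " + issueKey);
        }
        return issueKey.substring(0, dash);
    }

    /** Number of correlated issues filed against the same system. */
    static int countInternal(String issueKey, List<String> correlated) {
        String own = project(issueKey);
        int n = 0;
        for (String other : correlated) {
            if (project(other).equals(own)) {
                n++;
            }
        }
        return n;
    }

    /** Number of correlated issues filed against other systems. */
    static int countExternal(String issueKey, List<String> correlated) {
        return correlated.size() - countInternal(issueKey, correlated);
    }

    /** Column label for a count: "0", "1", "2", "3" or ">=4". */
    static String bucket(int count) {
        return count >= 4 ? ">=4" : Integer.toString(count);
    }

    public static void main(String[] args) {
        // Made-up keys, for illustration only.
        List<String> correlated = Arrays.asList("YARN-2", "MAPREDUCE-3", "YARN-4");
        System.out.println("internal bucket: " + bucket(countInternal("MAPREDUCE-1", correlated)));
        System.out.println("external bucket: " + bucket(countExternal("MAPREDUCE-1", correlated)));
    }
}
```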

slide-30
SLIDE 30

How Are the Issues Correlated?

Do you know our relationship?

  • Similar Causes: issues have similar causes
  • Blocking Other Issues: issues need to be fixed before other issues can be fixed
  • Fix on Fix: issues are caused by fixing other issues

[Figure: Percentage (%) of HDFS and MapReduce issues for each correlation type: Similar Causes, Blocking Other Issues, Fix on Fix]

  • 26-33% of the issues have similar causes.
  • Issues that block others appear more frequently in HDFS.
  • Mostly due to functional dependency.
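
Per-type percentages like those in the chart could be tallied from labeled data roughly as follows. This is a sketch under assumptions: the enum mirrors the three correlation types named in the slides, each label stands for one issue carrying that type, and totalIssues also counts issues with no label. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// The three correlation types named in the slides.
enum CorrelationKind { SIMILAR_CAUSES, BLOCKING_OTHER_ISSUES, FIX_ON_FIX }

// Illustrative tally; not the authors' tooling.
final class CorrelationShare {
    /** Percentage of totalIssues that carry each correlation type. */
    static Map<CorrelationKind, Double> percentages(List<CorrelationKind> labels,
                                                    int totalIssues) {
        if (totalIssues <= 0) {
            throw new IllegalArgumentException("totalIssues must be positive");
        }
        Map<CorrelationKind, Integer> counts = new EnumMap<>(CorrelationKind.class);
        for (CorrelationKind kind : labels) {
            counts.merge(kind, 1, Integer::sum);
        }
        Map<CorrelationKind, Double> result = new EnumMap<>(CorrelationKind.class);
        for (CorrelationKind kind : CorrelationKind.values()) {
            result.put(kind, 100.0 * counts.getOrDefault(kind, 0) / totalIssues);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up labels for ten hypothetical issues, four of which are labeled.
        List<CorrelationKind> labels = Arrays.asList(
                CorrelationKind.SIMILAR_CAUSES, CorrelationKind.SIMILAR_CAUSES,
                CorrelationKind.SIMILAR_CAUSES, CorrelationKind.FIX_ON_FIX);
        System.out.println(percentages(labels, 10));
    }
}
```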

slide-36
SLIDE 36

On the Issue Correlations with System Characteristics

Tools Programming Systems

47% 27% 26%

slide-39
SLIDE 39

How Do Issues Relate to Systems?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: security, networking, storage, file system, memory, cache]

Memory:
  • LightWeightGSet vs. java.util structures
  • Object cache for long-lived objects: ReplicasMap, ReplicasInfo

GC is still the No. 1 concern; memory-friendly objects are preferred.
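
To illustrate why a memory-friendly structure can be preferable to a general-purpose java.util collection such as HashSet, which allocates a node object per entry, here is a minimal sketch of an intrusive hash set: each element carries its own chain link, so the set itself allocates only one fixed array. This is not HDFS's LightWeightGSet; the names Linked, IntrusiveSet and BlockRecord are hypothetical.

```java
// Elements store their own "next" pointer, so no per-entry wrapper is needed.
interface Linked {
    Linked getNext();
    void setNext(Linked next);
}

// Hypothetical fixed-capacity intrusive hash set (never resized in this sketch).
final class IntrusiveSet<E extends Linked> {
    private final Linked[] buckets;
    private int size;

    IntrusiveSet(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        buckets = new Linked[capacity];
    }

    private int index(Object o) {
        return (o.hashCode() & 0x7fffffff) % buckets.length;
    }

    /** Adds e unless an equal element is already present; returns true if added. */
    boolean add(E e) {
        int i = index(e);
        for (Linked cur = buckets[i]; cur != null; cur = cur.getNext()) {
            if (cur.equals(e)) {
                return false;
            }
        }
        e.setNext(buckets[i]); // link in at the head of the chain
        buckets[i] = e;
        size++;
        return true;
    }

    boolean contains(Object key) {
        for (Linked cur = buckets[index(key)]; cur != null; cur = cur.getNext()) {
            if (cur.equals(key)) {
                return true;
            }
        }
        return false;
    }

    int size() {
        return size;
    }
}

// Example element type, identified by a numeric id.
final class BlockRecord implements Linked {
    final long blockId;
    private Linked next;

    BlockRecord(long blockId) { this.blockId = blockId; }

    public Linked getNext() { return next; }
    public void setNext(Linked next) { this.next = next; }

    @Override public boolean equals(Object o) {
        return o instanceof BlockRecord && ((BlockRecord) o).blockId == blockId;
    }
    @Override public int hashCode() { return Long.hashCode(blockId); }
}
```

The trade-off in this sketch is that an element can belong to only one such set at a time, because the link lives inside the element.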

slide-40
SLIDE 40

How Do Issues Relate to Systems?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: security, networking, storage, file system, memory, cache]

File system semantics: namespace management, file permission, consistency (e.g., fsck), etc.

Many issues that occur in file systems such as EXT4 also appear in Hadoop.

slide-41
SLIDE 41

How Do Issues Relate to Systems?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: security, networking, storage, file system, memory, cache]

Issues in the rack placement policy: 0.16% of blocks have all of their replicas in the same rack upon system upgrade.

The claim of 99.99% data reliability in cloud storage is challenged.
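
The condition behind the 0.16% figure, a block whose replicas all share one rack, can be checked with a few lines. This is a minimal sketch: the class and method names and the rack labels are hypothetical, and it is not the HDFS placement policy itself.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Illustrative check; rack names and method names are assumptions.
final class RackSpread {
    /** True when the replicas of one block sit on at least two distinct racks. */
    static boolean spansMultipleRacks(List<String> replicaRacks) {
        return new HashSet<>(replicaRacks).size() >= 2;
    }

    /** Percentage of blocks whose replicas all share a single rack. */
    static double singleRackPercent(List<List<String>> blocks) {
        if (blocks.isEmpty()) {
            return 0.0;
        }
        int singleRack = 0;
        for (List<String> racks : blocks) {
            if (!spansMultipleRacks(racks)) {
                singleRack++;
            }
        }
        return 100.0 * singleRack / blocks.size();
    }

    public static void main(String[] args) {
        // Each inner list holds the racks of one block's replicas (made-up data).
        List<List<String>> blocks = Arrays.asList(
                Arrays.asList("/rack1", "/rack1", "/rack1"),
                Arrays.asList("/rack1", "/rack2", "/rack2"),
                Arrays.asList("/rack2", "/rack3", "/rack1"),
                Arrays.asList("/rack3", "/rack3", "/rack3"));
        System.out.println("single-rack blocks: " + singleRackPercent(blocks) + "%");
    }
}
```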

slide-42
SLIDE 42

How Do Issues Relate to Systems?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: security, networking, storage, file system, memory, cache]

One quarter of networking issues cause resource wastage.

Example: a socket leak when reading a block

    Peer peer = newTcpPeer(dnAddr);
  - return newBlockReader(…);
  + try {
  +   reader = newBlockReader(…);
  +   return reader;
  + } catch (IOException ex) {
  +   throw ex;
  + } finally {
  +   if (reader == null) closeQuietly(peer);
  + }
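
The pattern in the patch above (close the connection whenever the reader could not be created) can be tried out in isolation. The sketch below uses stand-in classes FakePeer and FakeReader, which are hypothetical and only mimic the roles of the peer and the block reader; it is not the HDFS code.

```java
import java.io.Closeable;
import java.io.IOException;

// Stand-in for a network connection; records whether it was closed.
final class FakePeer implements Closeable {
    boolean closed;

    @Override
    public void close() {
        closed = true;
    }
}

// Stand-in for a reader that takes ownership of the peer on success.
final class FakeReader {
    final FakePeer peer;

    FakeReader(FakePeer peer, boolean fail) throws IOException {
        if (fail) {
            throw new IOException("simulated failure while creating the reader");
        }
        this.peer = peer;
    }
}

final class LeakFreeOpen {
    /** Creates a reader; if creation fails, the peer is closed before the exception escapes. */
    static FakeReader open(FakePeer peer, boolean fail) throws IOException {
        FakeReader reader = null;
        try {
            reader = new FakeReader(peer, fail);
            return reader;
        } finally {
            if (reader == null) {
                peer.close(); // creation failed: release the connection
            }
        }
    }

    public static void main(String[] args) {
        FakePeer peer = new FakePeer();
        try {
            open(peer, true);
        } catch (IOException ex) {
            System.out.println("open failed, peer closed: " + peer.closed);
        }
    }
}
```

On success the peer stays open because the reader now owns it; only the failure path closes it.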

slide-43
SLIDE 43

How Do Issues Relate to Programming?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: typo, lock, interface, maintenance]

  • Half of the programming issues relate to code maintenance.
  • Mainly caused by interface changes.

slide-46
SLIDE 46

How Do Issues Relate to Programming?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: typo, lock, interface, maintenance]

5.6% of programming issues are caused by typos!

An fsimage cannot be accessed due to:

  - elif [ "COMMAND" = "oiv_legacy" ] ; then
  + elif [ "$COMMAND" = "oiv_legacy" ] ; then

slide-47
SLIDE 47

How Do Issues Relate to Tools?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: configuration, debugging, documents, testing]

Logs are misleading: incorrect, incomplete, indistinct output.

Example: when accessing a non-existent file via WebHDFS, a FileNotFoundException is expected, but the logs show something else (log excerpt not preserved).

slide-50
SLIDE 50

How Do Issues Relate to Tools?

[Figure: Percentage (%) of HDFS and MapReduce issues by category: configuration, debugging, documents, testing]

A majority of configuration issues are related to system performance.

59% of the 219 configuration parameters in MapReduce are performance related.

slide-51
SLIDE 51

Conclusion

1. Correlations Between Issues
   About half of the issues are independent; 26-33% of the issues have similar causes, etc.

2. Correlations With System Characteristics (Tools, Programming, Systems)
   More effort is required to achieve highly reliable distributed systems.

slide-52
SLIDE 52

Jian Huang jian.huang@gatech.edu

Xuechen Zhang† Karsten Schwan †

Thanks! Q&A