Why do big data and cloud systems slow down and stop? Shan Lu



slide-1
SLIDE 1

Why do big data and cloud systems slow down and stop?

Shan Lu

slide-2
SLIDE 2

Why do big data and cloud systems slow down and stop?

What are?

slide-3
SLIDE 3

Big data & cloud systems

3

slide-4
SLIDE 4

Big data & cloud systems

  • DB-backed web applications
  • Cloud services

4

slide-5
SLIDE 5

DB-backed web applications

5

DBMS Application server

HTTP request

Database query

slide-6
SLIDE 6

Performance is critical for web applications

  • Low latency is critical
    ○ A 1-second delay in page load means 11% fewer page views, 16% less customer satisfaction, and a 7% loss in profit
    ○ Nearly half of the users expect a site to load in less than 2 seconds
  • Low latency is challenging given the data size

slide-7
SLIDE 7

Cloud services

7

slide-8
SLIDE 8

8

slide-9
SLIDE 9

9

slide-10
SLIDE 10

Reliability is critical for cloud services

10

slide-11
SLIDE 11

Reliability is critical for cloud services

11

slide-12
SLIDE 12

Outline

  • What slows down (big data) web applications? [ICSE’18]
    ○ What can we do about it? [CIKM’17, FSE’18, ICSE’19, CIDR’20]
  • What stops cloud systems? [HotOS’19]
    ○ What can we do about it? [ASPLOS’16, ASPLOS’17, ASPLOS’18, PLDI’19, SOSP’19]

DBMS

1000+ bugs found 1000+ bugs found

slide-13
SLIDE 13

What Slowed Down Database-Backed Web Applications

hyperloop.cs.uchicago.edu
Shan Lu

  • View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19
  • How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18
  • PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18

slide-14
SLIDE 14

Common Web-app Architecture

14

DBMS Application server

HTTP request

Database query

slide-15
SLIDE 15

user

Controller Model

DBMS Application server

Common Web-app Architecture

3

HTTP request

class BlogsController
  def index
    user_id = 1
    @myblogs = Blog.retrieve(user_id)
  end
end

class Blog
  def self.retrieve(user_id)
    Blog.where(uid: user_id)
  end
end

Query Translator: SELECT * FROM blogs WHERE uid = id

http://www.xxx.com/blogs/index

slide-16
SLIDE 16

class BlogsController
  def index
    user_id = 1
    @myblogs = Blog.retrieve(user_id)
  end
end

user

Controller View Model

DBMS

Common Web-app Architecture

3

HTTP request

Table blogs: uid, contents

Query Translator

http://www.xxx.com/blogs/index

app/views/blogs/index.html.erb:

<% @myblogs.each do |blog| %>
  <%= blog.content %><br/>
<% end %>

Rendered page (http://blogs/index …): 1001 unread blogs
Arriving at Zurich | Stopping by Bern | Love love Berner Oberland | Love Berner Oberland | Back to Lausanne | One day at Luzern

Application server

slide-17
SLIDE 17

Model

DBMS

Potential sources of inefficiencies

Table blogs: uid, contents

Object Relational Mapping Framework

class Blog
end

Blog.where(uid: user_id) → SELECT * FROM blogs WHERE uid = id

slide-18
SLIDE 18

Model

DBMS

Potential sources of inefficiencies

Table blogs: uid, contents

Object Relational Mapping Framework

class Blog
end

Blog.where(uid: user_id) → SELECT * FROM blogs WHERE uid = id

MVC Design Pattern

Controller View

app/views/blogs/index.html.erb:

<% @myblogs.each do |blog| %>
  <%= blog.content %><br/>
<% end %>

slide-19
SLIDE 19

Outline

  • How severe is the problem? Profile 12 apps from 6 common categories: 64 issues in 40 pages
  • What are the common inefficiency patterns? Build performance-bug taxonomy: 9 anti-patterns
  • How to solve the problem? Design automated bug detection & fixing: 1000+ bugs

slide-20
SLIDE 20

Outline

  • Profile 12 apps from 6 common categories: 64 issues in 40 pages
  • Build performance-bug taxonomy
  • Design automated bug detection & fixing

slide-21
SLIDE 21

Profiling methodology

21

Top 2 Apps in 6 popular categories

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

Synthesize DB content based on real-world website statistics

slide-22
SLIDE 22

Profiling End-to-end Page Time

22

  • 20,000 records
  • 11 apps have pages > 2s
  • 6 apps have pages > 3s
  • 40 problematic pages
  • Server takes most time

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-23
SLIDE 23

Why is it slow?

23

There are inefficiency bugs!

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-24
SLIDE 24

Why is it slow?

  • We manually fix the 64 issues we found across 39 pages

24

[Figure: LoC changed and speedup for the fixes; annotations: 60%, 80%]

There are bugs!

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-25
SLIDE 25

Outline

  • Profile 12 apps from 6 common categories
  • Build performance-bug taxonomy: 9 anti-patterns
  • Design automated bug detection & fixing

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-26
SLIDE 26

Common Performance Anti-patterns

26

  • 64 performance issues from profiling
  • 140 performance issues from bug tracking system
  • 9 anti-patterns

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-27
SLIDE 27

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

Common Performance Anti-patterns

  1. ORM API Misuse: 106 issues across 12 apps
  2. Database Design: 41 issues across 10 apps
  3. Application Design Tradeoff: 47 issues across 12 apps

slide-28
SLIDE 28

ORM API Misuse

  • UD: Unnecessary Data Retrieval, 9 issues across 4 apps
  • IR: Inefficient Rendering, 5 issues across 4 apps
  • ID: Inefficient Data Access, 44 issues across 11 apps
  • IC: Inefficient Computation, 26 issues across 8 apps
  • UC: Unnecessary Computation, 22 issues across 10 apps

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-29
SLIDE 29

ORM API Misuse

  • UD: Unnecessary Data Retrieval, 9 issues across 4 apps
  • IR: Inefficient Rendering, 5 issues across 4 apps
  • ID: Inefficient Data Access, 44 issues across 11 apps
  • IC: Inefficient Computation, 26 issues across 8 apps
  • UC: Unnecessary Computation, 22 issues across 10 apps

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-30
SLIDE 30

ORM API Misuse: inefficient computation

  • inefficient: project.issues.any?      SELECT COUNT(*) FROM issues WHERE project_id = ?
  • inefficient: project.issues.count>0   SELECT COUNT(*) FROM issues WHERE project_id = ?
  • efficient:   project.issues.exists?   SELECT 1 AS ONE FROM issues WHERE project_id = ? LIMIT 1

2X speedup

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.
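The gap between a COUNT(*) query and a LIMIT 1 existence probe can be illustrated without a database. The following is only a sketch in plain Ruby: `FakeIssues`, `count_for`, and `exists_for?` are hypothetical names, and the in-memory array merely stands in for the `issues` table. It is not ActiveRecord and does not model a real query planner.

```ruby
# Hypothetical in-memory stand-in for an `issues` table (not ActiveRecord).
class FakeIssues
  attr_reader :rows_scanned

  def initialize(rows)
    @rows = rows
    @rows_scanned = 0
  end

  # Mimics SELECT COUNT(*) ... WHERE project_id = ?: every row is visited.
  def count_for(project_id)
    @rows.count do |row|
      @rows_scanned += 1
      row[:project_id] == project_id
    end
  end

  # Mimics SELECT 1 ... WHERE project_id = ? LIMIT 1: stops at the first match.
  def exists_for?(project_id)
    @rows.any? do |row|
      @rows_scanned += 1
      row[:project_id] == project_id
    end
  end
end

rows = Array.new(10_000) { |i| { id: i, project_id: i % 10 } }

counting = FakeIssues.new(rows)
puts "count > 0: #{counting.count_for(3) > 0} after #{counting.rows_scanned} rows"

probing = FakeIssues.new(rows)
puts "exists?:   #{probing.exists_for?(3)} after #{probing.rows_scanned} rows"
```

Both calls answer the same yes/no question; the difference is how much data must be examined to answer it.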

slide-31
SLIDE 31

ORM API Misuse: unnecessary computation

values.each do |value|
  u.issues.include? value
end

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-32
SLIDE 32

ORM API Misuse: unnecessary computation

+ rans = u.issues
  values.each do |value|
-   u.issues.include? value
+   rans.include? value
  end

20X speedup

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.
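The effect of the fix can be sketched by counting simulated queries. This is a simplified illustration, not the authors' benchmark: `FakeUser`, `matches_in_loop`, and `matches_hoisted` are hypothetical names, and every call to `issues` is assumed to cost one database round trip.

```ruby
# Hypothetical stand-in for a model whose association access issues a query.
class FakeUser
  attr_reader :query_count

  def initialize(issue_ids)
    @issue_ids = issue_ids
    @query_count = 0
  end

  # Each call models one round trip: SELECT * FROM issues WHERE user_id = ?
  def issues
    @query_count += 1
    @issue_ids.dup
  end
end

# Before the fix: the same query is re-issued on every iteration.
def matches_in_loop(user, values)
  values.select { |value| user.issues.include?(value) }
end

# After the fix: the loop-invariant query is issued once, outside the loop.
def matches_hoisted(user, values)
  issues = user.issues
  values.select { |value| issues.include?(value) }
end

slow_user = FakeUser.new([1, 2, 3])
matches_in_loop(slow_user, [2, 5, 3, 9])
puts "in loop: #{slow_user.query_count} queries"

fast_user = FakeUser.new([1, 2, 3])
matches_hoisted(fast_user, [2, 5, 3, 9])
puts "hoisted: #{fast_user.query_count} queries"
```

The two methods return the same result; only the number of simulated queries differs, and it grows with the length of `values` in the unfixed version.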

slide-33
SLIDE 33

ORM API misuses that affect memory consumption

  • map(&:id) vs. pluck(:id)
  • pluck(:size).sum vs. sum(:size)
  • pluck + pluck vs. SQL UNION

33

slide-34
SLIDE 34

How to tackle API Misuses?

  • Why can't existing compilers handle this?
  • Can we extend the compiler to
    ○ Understand ORM APIs and queries?
    ○ Detect the problem?
    ○ Solve the problem?

PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18

slide-35
SLIDE 35

Database-aware PDG

(a) Ruby code

values.reject do |val|
  u.issues.include? val
end

v1 = u
v2 = values
v2.do |val|
  v3 = v1.issues
  v3.include? val
end

(b) PDG (legend: query node, data edge, control edge)

  Copy: v1 = u
  Copy: v2 = values
  val = v2[]
  Call: v3 = v1.issues (query node; SQL: SELECT * from issues WHERE user_id=?)
  Call: v3.include? val

slide-36
SLIDE 36

Detect and Fix

Loop-invariant query

PDG (legend: query node, data edge, control edge)

  Copy: v1 = u
  Copy: v2 = values
  val = v2[]
  Call: v3 = v1.issues
  Call: v3.include? val

PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18
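One way to picture the loop-invariant check on such a graph is sketched below. It is a deliberately simplified illustration and not PowerStation's implementation: the `PdgNode` structure, its fields, and `loop_invariant_queries` are all hypothetical. The sketch flags a query node inside a loop when none of the variables it reads is defined inside that loop; it ignores side effects such as writes to the queried table, which a real analysis would have to consider.

```ruby
# Hypothetical, simplified PDG node. `defines` is the variable a node writes,
# `uses` lists the variables it reads (incoming data edges), `in_loop` models
# a control edge from the loop header, and `query` marks a query node.
PdgNode = Struct.new(:label, :defines, :uses, :in_loop, :query, keyword_init: true)

# The example from the slide, encoded by hand.
def slide_example_pdg
  [
    PdgNode.new(label: 'Copy: v1 = u', defines: 'v1', uses: ['u'], in_loop: false, query: false),
    PdgNode.new(label: 'Copy: v2 = values', defines: 'v2', uses: ['values'], in_loop: false, query: false),
    PdgNode.new(label: 'val = v2[]', defines: 'val', uses: ['v2'], in_loop: true, query: false),
    PdgNode.new(label: 'Call: v3 = v1.issues', defines: 'v3', uses: ['v1'], in_loop: true, query: true),
    PdgNode.new(label: 'Call: v3.include? val', defines: nil, uses: ['v3', 'val'], in_loop: true, query: false)
  ]
end

# A query node in the loop is reported when it reads nothing defined in the loop.
def loop_invariant_queries(nodes)
  defined_in_loop = nodes.select(&:in_loop).map(&:defines).compact
  nodes.select do |node|
    node.query && node.in_loop && (node.uses & defined_in_loop).empty?
  end
end

puts loop_invariant_queries(slide_example_pdg).map(&:label)
```

Here the only input of `v3 = v1.issues` is `v1`, which is defined before the loop, so the query is reported as a candidate for hoisting.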

slide-37
SLIDE 37

PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18

PowerStation (Integrated with RubyMine)

[Screenshot: PowerStation panel in RubyMine. Scope options: Whole App, Single Action. Issue categories: LI, IA, CS, IR, RD, DS. Issue List entry: LI, blogs_controller.rb, line 4, with a FIX button. Message: "run_query is a loop invariant query. Fix: move it out of the loop".]

slide-38
SLIDE 38

PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18

Try our PowerStation!

38

  • 12 real world apps
  • 1221 inefficiencies found
slide-39
SLIDE 39

Common Performance Anti-patterns

  1. ORM API Misuse: 106 issues across 12 apps
  2. Database Design: 41 issues across 10 apps
  3. Application Design Tradeoff: 47 issues across 12 apps

slide-40
SLIDE 40

Database Design Problem

  • Missing fields (8 issues across 5 apps): fields derivable from other fields and not persistently stored
    ○ Example: id, longitude, latitude → location
  • Missing index (33 issues across 10 apps)

2X

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.
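Why a missing index hurts can be sketched with an in-memory table. This is an illustration under stated assumptions, not a model of any particular DBMS: `FakeTable`, `add_index`, and `where` are hypothetical names, and a Ruby Hash plays the role of the index.

```ruby
# Hypothetical in-memory table; a Hash stands in for a database index.
class FakeTable
  attr_reader :rows_examined

  def initialize(rows)
    @rows = rows
    @index = nil
    @indexed_column = nil
    @rows_examined = 0
  end

  # Build a lookup structure on one column, similar in spirit to CREATE INDEX.
  def add_index(column)
    @indexed_column = column
    @index = @rows.group_by { |row| row[column] }
  end

  # With an index on the column, only matching rows are touched;
  # without one, every row in the table is examined.
  def where(column, value)
    if @index && @indexed_column == column
      hits = @index.fetch(value, [])
      @rows_examined += hits.size
      hits
    else
      @rows.select do |row|
        @rows_examined += 1
        row[column] == value
      end
    end
  end
end

rows = Array.new(1_000) { |i| { id: i, user_id: i % 100 } }

unindexed = FakeTable.new(rows)
unindexed.where(:user_id, 7)
puts "without index: #{unindexed.rows_examined} rows examined"

indexed = FakeTable.new(rows)
indexed.add_index(:user_id)
indexed.where(:user_id, 7)
puts "with index:    #{indexed.rows_examined} rows examined"
```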

slide-41
SLIDE 41

Common Performance Anti-patterns

  1. ORM API Misuse: 106 issues across 12 apps
  2. Database Design: 41 issues across 10 apps
  3. Application Design Tradeoff: 47 issues across 12 apps

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

slide-42
SLIDE 42

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

  • App. design tradeoff: display design

[Mock-ups of http://blogs/index …, each with the header "1001 unread blogs"]

  • First mock-up: Arriving at Zurich, Stopping by Bern, Love love Berner Oberland, Love Berner Oberland, Back to Lausanne, One day at Luzern
  • Second mock-up, with pager (< < 1 2 3 … > >): Arriving at Zurich, Stopping by Bern, Love love Berner Oberland, Love love Berner Oberland, One day at Luzern

slide-43
SLIDE 43

How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18.

  • App. design tradeoff: functionality design

[Mock-ups of http://blogs/index …, both with pager (< < 1 2 3 … > >) and listing: Arriving at Zurich, Stopping by Bern, Love love Berner Oberland, Love Berner Oberland, One day at Luzern]

  • First mock-up header: 1001 unread blogs
  • Second mock-up header: More than 20 unread blogs

slide-44
SLIDE 44

Application Design Tradeoff

Application functionality tradeoff (21 issues in 10 apps)

>1.5s (performance vs. functionality)

Whether to show this guideline:

SELECT count(*) FROM moderations JOIN stories
WHERE stories.user_id = @user.id AND moderations.created_at > 5.days.ago

slide-45
SLIDE 45

How to tackle application design tradeoffs?

  • Can we do automated optimization?
  • Help developers make informed decisions by providing
    ○ Cost information
    ○ Alternative display/functionality options

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19

slide-46
SLIDE 46

app/controllers/blogs_controller.rb:

def index
  @blogs = Blog.all
  render "index"
end

app/views/blogs/index.html.erb:

<% @blogs.each do |blog| %>
  <%= blog.content %><br/>
<% end %>

Rendered page (http://blogs/index …): 1001 unread blogs
Arriving at Zurich | Stopping by Bern | Love love Berner Oberland | Love Berner Oberland | Back to Lausanne | One day at Luzern

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19

slide-47
SLIDE 47

app/controllers/blogs_controller.rb:

def index
  @blogs = Blog.all
  render "index"
end

app/views/blogs/index.html.erb:

<% @blogs.each do |blog| %>
  <%= blog.content %><br/>
<% end %>

Rendered page (http://blogs/index …): 1001 unread blogs
Arriving at Zurich | Stopping by Bern | One day at Luzern | Love love Berner Oberland | Love Berner Oberland | Back to Lausanne

Option: pagination

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19

slide-48
SLIDE 48

app/controllers/blogs_controller.rb:

def index
  @blogs = Blog.all.paginate(…)
  render "index"
end

app/views/blogs/index.html.erb:

<% @blogs.each do |blog| %>
  <%= blog.content %><br/>
<% end %>
<%= will_paginate @blogs %>

Rendered page (http://blogs/index …): 1001 unread blogs
Arriving at Zurich | Stopping by Bern | One day at Luzern | Love love Berner Oberland | Love Berner Oberland
Pager: < < 1 2 3 … > >

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19
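What pagination buys can be sketched in plain Ruby. The helpers below are hypothetical and are not the will_paginate implementation; `paginate_rows` simply returns the slice that a LIMIT/OFFSET query would, so each page load touches at most `per_page` rows instead of all of them.

```ruby
# Hypothetical helper: return one page of rows, the way
# LIMIT per_page OFFSET (page - 1) * per_page would.
def paginate_rows(rows, page:, per_page:)
  offset = (page - 1) * per_page
  rows[offset, per_page] || []
end

# Number of pages needed to show `total` rows (ceiling division).
def page_count(total, per_page)
  (total + per_page - 1) / per_page
end

blogs = (1..1001).map { |i| "blog #{i}" }
puts paginate_rows(blogs, page: 1, per_page: 20).size
puts page_count(blogs.size, 20)
```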

slide-49
SLIDE 49

app/controllers/blogs_controller.rb:

def index
  @blognum = Blog.count
  render "index"
end

app/views/blogs/index.html.erb:

There are <%= @blognum %> blogs

Rendered page (http://blogs/index …): 1001 unread blogs
Arriving at Zurich | Stopping by Bern | One day at Luzern | Love love Berner Oberland | Love Berner Oberland
Pager: < < 1 2 3 … > >

Options: remove, approximation, async loading

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19

slide-50
SLIDE 50

app/controllers/blogs_controller.rb:

def index
  @blognum = Blog.limit(21).count
  render "index"
end

app/views/blogs/index.html.erb:

There are <%= @blognum > 20 ? 'more than 20' : @blognum %> blogs

Rendered page (http://blogs/index …): more than 20 unread blogs
Arriving at Zurich | Stopping by Bern | One day at Luzern | Love love Berner Oberland | Love Berner Oberland
Pager: < < 1 2 3 … > >

Options: remove, async loading

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19
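The approximation trades an exact count for a bounded amount of work. A minimal sketch in plain Ruby follows; `capped_count` and `unread_label` are hypothetical helpers, and the lazy enumerator only imitates the way a LIMIT lets the database stop early.

```ruby
# Hypothetical helper: count matching rows, but stop after `cap + 1` matches.
def capped_count(rows, cap)
  rows.lazy.select { |row| yield(row) }.first(cap + 1).size
end

# Turn the capped count into the text shown on the page.
def unread_label(count, cap)
  count > cap ? "more than #{cap} unread blogs" : "#{count} unread blogs"
end

blogs = Array.new(1001) { |i| { id: i, unread: true } }
examined = 0
count = capped_count(blogs, 20) do |blog|
  examined += 1
  blog[:unread]
end

puts unread_label(count, 20)
puts "rows examined: #{examined}"
```

The page can no longer say "1001", which is the functionality given up in exchange for examining a bounded number of rows.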

slide-51
SLIDE 51

app/controllers/blogs_controller.rb:

def index
  @blognum = Blog.count
  render "index"
end

app/views/blogs/index.html.erb:

<%= @blognum %> unread blogs

Rendered page (http://blogs/index …):
Arriving at Zurich | Stopping by Bern | One day at Luzern | Love love Berner Oberland | Love Berner Oberland
Pager: < < 1 2 3 … > >

View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19

slide-52
SLIDE 52

Try our Panorama!

52

  • 12 real world apps
  • 149 performance-enhancing opportunities identified for 119 costly HTML tags
  • 4.5X average page-load time speedup
  • User study agrees!
slide-53
SLIDE 53

Slowdowns in web applications

  • Real-world database-backed applications perform poorly
  • Data-related performance anti-patterns exist
  • Automatic tools are built to detect and fix performance issues

hyperloop.cs.uchicago.edu

  • View-Centric Performance Optimization for Database-Backed Web Applications. ICSE’19
  • How not to structure your database-backed web applications: a study of performance bugs in the wild. ICSE’18
  • PowerStation: Automatically detecting and fixing inefficiencies of database-backed web applications in IDE. FSE’18

Junwen Yang

slide-54
SLIDE 54

What stopped cloud services?

  • Efficient and Scalable Thread-Safety Violation Detection: Finding thousands of concurrency bugs during testing. SOSP’19
  • DFix: Automatically Fixing Timing Bugs in Distributed Systems. PLDI’19
  • FCatch: Automatically detecting time-of-fault bugs in cloud systems. ASPLOS’18
  • DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems. ASPLOS’17
  • TaxDC: A Comprehensive Taxonomy of Non-Deterministic Concurrency Bugs in Cloud Distributed Systems. ASPLOS’16
  • What Bugs Cause Production Cloud Incidents? HotOS’19

slide-55
SLIDE 55

Need to study real-world cloud incidents

55

Cause Handling

slide-56
SLIDE 56

Existing studies for cloud incidents

56

Cause

Others Hardware Software Unknown Unknown

Handling

[1] Gunawi. What bugs live in the cloud? In SoCC’14
[2] Gunawi. Why does the cloud stop computing? In SoCC’16
[3] Yuan. Simple test can prevent most critical failures. In OSDI’14
[4] Huang. Gray failure. In HotOS’17
[5] Leesatapornwongsa. Scalability bugs. In HotOS’17
[6] Leesatapornwongsa. TaxDC. In ASPLOS’16

Data source constraints!

slide-57
SLIDE 57

Our work

57

Cause Handling

6-month high-severity incidents in Microsoft Azure services

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-58
SLIDE 58

One more background …

58

Cause Handling

What causes incidents in non-cloud software?
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs, Memory bugs, Semantic bugs

6-month high-severity incidents in Microsoft Azure services

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-59
SLIDE 59

Our findings

59

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs, Memory bugs, Semantic bugs

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-60
SLIDE 60

Our findings

60

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs, Memory bugs, Semantic bugs

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-61
SLIDE 61

Our findings

61

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs, Memory bugs, Semantic bugs

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-62
SLIDE 62

Our findings

62

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs, Resource (memory) leaks, Semantic bugs

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-63
SLIDE 63

Our findings

63

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs (50% persistent races), Resource (memory) leaks, Semantic bugs, ……

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-64
SLIDE 64

Our findings

64

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs (50% persistent races), Resource (memory) leaks, Semantic bugs, Fault-handle bugs, Data-format bugs

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-65
SLIDE 65

Our findings

65

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs (50% persistent races), Resource (memory) leaks, Semantic bugs, Fault-handle bugs, Data-format bugs
  • Handling: >50% through mitigation without patches

What Bugs Cause Production Cloud Incidents? HotOS’19

slide-66
SLIDE 66

What can we do?

66

Cause Handling

6-month high-severity incidents in Microsoft Azure services
  • Causes: Software, Hardware, Others
  • Software bugs: Concurrency bugs (50% persistent races), Resource (memory) leaks, Semantic bugs, Fault-handle bugs, Data-format bugs
  • Handling: >50% through mitigation without patches

What Bugs Cause Production Cloud Incidents? HotOS’19

github.com/microsoft/TSVD

slide-67
SLIDE 67

Conclusions

  • Software bugs widely exist in big data & cloud systems
  • Software bugs are taking on new forms in big data & cloud systems
    ○ Memory data ↔ Persistent data
  • A lot of bug fighting can be done and remains to be done
  • We are making our bug set and tools open source!

67

Junwen Yang Guangpu Li

slide-68
SLIDE 68

Thanks!

68