
Introduction Methodology Analysis process Results Conclusions Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis Piotr Dobrowolski MIMUW/Distributed Systems October 24, 2012 Piotr Dobrowolski


slide-1
SLIDE 1

Introduction Methodology Analysis process Results Conclusions

Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis

Piotr Dobrowolski

MIMUW/Distributed Systems

October 24, 2012

Piotr Dobrowolski MIMUW/Distributed Systems Design Implications

slide-2
SLIDE 2

Agenda and authors

Presentation agenda

Where it happened
Motivation and background
Methodology
Analysis process
Results
Storage system design implications
Conclusions

slide-6
SLIDE 6

Agenda and authors

Authors of the paper

Yanpei Chen (University of California, Berkeley)
Kiran Srinivasan (NetApp Inc.)
Garth Goodson (NetApp Inc.)
Randy Katz (University of California, Berkeley)

The work was done at NetApp, which mostly builds enterprise storage systems, and at Berkeley's AMPLab (amplab.cs.berkeley.edu).

slide-10
SLIDE 10

What is this paper about?

In a few words, the paper offers:

A different view of the analysis process for (distributed) enterprise storage,
A search for insights
that can be provided as early as possible,
namely at the design stage of such a storage system.

slide-15
SLIDE 15

What is this paper about?

Yet another storage analysis? What's new?

There are plenty of papers describing analyses of storage systems,
The paper compares itself against them,
In short, none of them uses multi-layered and multi-dimensional analysis,
This (multi-*) is the most interesting thing in the paper,
It is described shortly.

slide-18
SLIDE 18

What is this paper about?

Past approaches

Past approaches made many assumptions about the storage system,
This analysis tries to make fewer assumptions about system behavior,
It also tries to be less tied to technology trends.

slide-25
SLIDE 25

Motivation

Data access patterns

From the client side:

How do applications access data?
How do users access data?

On the server side, they describe:

How files are accessed,
How directory subtrees are accessed.

For example:

A storage system for streaming video supports different access patterns than a document repository.

slide-26
SLIDE 26

Motivation

Key

The better the access pattern is understood, the better the storage system design.

slide-34
SLIDE 34

Background & Needs

Analysis at different layers

A storage system is divided into:

Clients, which create and view files via applications,
Servers, which store the content.

Clients operate at the user and application layers.
Client and server behavior together define the layers,
for example: user sessions, application instances, files on the server, directories, ...
(def) Access unit = layer,
An access pattern is a set of access-unit behaviors.

slide-40
SLIDE 40

Background & Needs

At different dimensions

Each access unit has certain inherent characteristics,
Let's call each of them a feature of the access unit,
For example, for an application, the read size in bytes is a feature; the number of unique files accessed is another,
Each feature represents an independent mathematical dimension,
The dimensions together describe the access unit.
Dimension = feature = characteristic.
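As a rough illustration of "dimension = feature", one access unit can be treated as a point whose coordinates are its feature values. The sketch below is only an assumption about how this could look in code: the class name `AppInstance` and its field names are made up for the example and do not come from the paper.

```python
from dataclasses import dataclass, astuple


@dataclass
class AppInstance:
    """One application-instance access unit (field names are illustrative)."""
    read_bytes: int     # feature 1: total bytes read
    unique_files: int   # feature 2: number of unique files accessed

    def as_point(self) -> tuple:
        # Each feature becomes one coordinate (dimension) of the point.
        return tuple(float(v) for v in astuple(self))


point = AppInstance(read_bytes=4096, unique_files=3).as_point()
print(point)
```

Adding another feature adds another field, and therefore another dimension of the point.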

slide-48
SLIDE 48

Background & Needs

Example of multi-dimensional insight

Files with > 70% sequential reads or sequential writes have no repeated reads or overwrites.

This involves 4 dimensions:

Read sequentiality
Write sequentiality
Repeated reads
Overwrites

Measuring one dimension at a time is easier, but this insight requires capturing all four dimensions at the same time.
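A minimal sketch of how such a four-dimensional claim could be checked against per-file features. The records and their values are invented for illustration, and the key names (`seq_read`, `seq_write`, `repeated_reads`, `overwrites`) are assumptions, not the paper's feature names.

```python
# Hypothetical per-file feature records; all values are made up.
files = [
    {"seq_read": 0.95, "seq_write": 0.00, "repeated_reads": 0, "overwrites": 0},
    {"seq_read": 0.10, "seq_write": 0.20, "repeated_reads": 5, "overwrites": 2},
    {"seq_read": 0.00, "seq_write": 0.80, "repeated_reads": 0, "overwrites": 0},
]


def is_sequential(f, threshold=0.70):
    """True if more than `threshold` of the reads or of the writes are sequential."""
    return f["seq_read"] > threshold or f["seq_write"] > threshold


def insight_holds(records):
    """Check that every sequential file has no repeated reads and no overwrites."""
    return all(
        f["repeated_reads"] == 0 and f["overwrites"] == 0
        for f in records
        if is_sequential(f)
    )


print(insight_holds(files))
```

The check needs all four features of the same file at once, which is why per-dimension histograms alone would not reveal it.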

slide-49
SLIDE 49

Background & Needs

In short

The need for multi-layered and multi-dimensional insights motivates their methodology.

slide-54
SLIDE 54

Traces

Scale #1

The analysis was performed on Common Internet File System (CIFS) traces,
CIFS is better known as Server Message Block (SMB),
or through its open-source implementation, Samba,
Microsoft uses it in Windows, where it is also known as “Microsoft Windows Network”,
CIFS makes it possible to identify the layers from a trace.

slide-58
SLIDE 58

Traces

Scale #2

Data were collected over 3 months in 2007,
in 2 different datasets:

corporate
engineering

slide-68
SLIDE 68

Traces

Scale #3

Corporate trace:

1000 employees in marketing, finance, etc.
3 TB active storage, Windows applications
509,076 user sessions, 138,723 application instances
1,155,099 files, 117,640 directories

Engineering trace:

500 employees in various engineering roles
19 TB active storage, Windows and Linux applications
232,033 user sessions, 741,319 application instances
1,809,571 files, 161,858 directories

slide-72
SLIDE 72

Algorithm, centroids

k-means clustering

Partition n observations into k clusters,
Each observation belongs to the cluster with the nearest mean,
Finding the optimal partition is an NP-hard problem,
But there are good heuristics (they find a local optimum).

slide-75
SLIDE 75

Algorithm, centroids

k-means

What we want is

$$\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - c_i \rVert^2$$

where $c_i$ is the mean of the points in cluster $S_i$.
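To make the objective concrete, here is a small sketch that evaluates it for a given partition: for every cluster it computes the mean and adds up the squared distances of the cluster's points to that mean. The function name `objective` and the two example partitions are assumptions made for this illustration.

```python
def objective(clusters):
    """Sum over clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for points in clusters:
        dims = len(points[0])
        # c_i: the mean of the points in this cluster, per dimension.
        mean = [sum(p[d] for p in points) / len(points) for d in range(dims)]
        total += sum((p[d] - mean[d]) ** 2 for p in points for d in range(dims))
    return total


# The same four points, partitioned in two different ways.
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(objective(good), objective(bad))
```

The partition that groups nearby points gives the smaller value, which is what the minimization over S is looking for.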

slide-76
SLIDE 76

Algorithm, centroids

k-means heuristic step by step (captions of the slide figures):

Observations in 2 dimensions
Choose k = 2 “means” randomly
Divide the space by nearest mean
Connect each observation to its center
Compute the new means
Once again: divide the space and compute the new means
Repeat until the means stabilize
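The steps illustrated above correspond to the standard heuristic, often called Lloyd's algorithm. The following is a minimal sketch of it in plain Python, not the implementation used in the paper; the function names, the fixed random seed and the sample points are assumptions for the example.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)           # choose k initial "means" randomly
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                    # divide the space by nearest mean
            nearest = min(range(k), key=lambda i: dist2(p, means[i]))
            clusters[nearest].append(p)
        # Compute the new means; an empty cluster keeps its previous mean.
        new_means = [mean(c) if c else means[i] for i, c in enumerate(clusters)]
        if new_means == means:              # the means have stabilized
            break
        means = new_means
    return means, clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
means, clusters = kmeans(points, k=2)
print(sorted(means))
```

Because the starting means are random, the heuristic only guarantees a local optimum; on well-separated data such as this toy set it finds the two obvious groups.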

slide-89
SLIDE 89

Overview

Analysis step by step

Collect network storage system traces,
Define the access units (this needs domain knowledge about the storage system),
Extract multiple instances of each access unit, with their feature values,
Input those values into k-means,
Interpret the k-means output and derive access patterns,
Translate the access patterns into design insights.

slide-90
SLIDE 90

Overview

The flow looks like this.

[Figure: analysis flow]

slide-95
SLIDE 95

Identify access patterns via k-means

How it works

For each access unit, extract its instances from the trace,
e.g. session instances, application instances,
For every instance, compute all numerical feature values (e.g. the number of opened files),
This gives a data array with one row per instance and one column per feature,
The k-means algorithm groups the rows into clusters, and each cluster is described as a data access pattern.
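A sketch of the extraction step, assuming a simplified trace format. The record layout (session id, operation, path, bytes), the two chosen features and the function name `session_features` are all hypothetical; real CIFS traces carry far more fields.

```python
from collections import defaultdict

# Hypothetical trace records: (session_id, operation, file_path, bytes).
trace = [
    ("s1", "open", "/a.doc", 0),
    ("s1", "read", "/a.doc", 4096),
    ("s1", "open", "/b.doc", 0),
    ("s2", "open", "/c.mp4", 0),
    ("s2", "read", "/c.mp4", 65536),
    ("s2", "read", "/c.mp4", 65536),
]


def session_features(records):
    """Return {session_id: [files_opened, read_bytes]}, one row per session instance."""
    opened = defaultdict(set)
    read_bytes = defaultdict(int)
    for session, op, path, nbytes in records:
        if op == "open":
            opened[session].add(path)
        elif op == "read":
            read_bytes[session] += nbytes
    sessions = sorted(set(opened) | set(read_bytes))
    return {s: [len(opened[s]), read_bytes[s]] for s in sessions}


rows = session_features(trace)
print(rows)
```

The values of `rows` form the data array that would then be handed to k-means, one row per session instance.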

slide-100
SLIDE 100

Identify access patterns via k-means

How to determine the number of clusters

  • In the heuristic k-means algorithm, k is fixed in advance.
  • Reminder: k corresponds to the number of access patterns we identify.
  • Optimization: search for the best k by running the algorithm for more than one value of k,
  • using metrics that describe cluster "quality".

slide-101
SLIDE 101

Identify access patterns via k-means

How many clusters? Enough to explain the variance.
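One common way to make "enough to explain the variance" concrete is to compute, for each candidate k, the fraction of total variance that the clustering accounts for (one minus the within-cluster sum of squares over the total sum of squares) and take the smallest k that passes a threshold. The sketch below assumes that reading; the 0.9 threshold, the function names and the toy labelings are hypothetical and do not come from the slides.

```python
def explained_variance(rows, labels):
    """Fraction of total variance explained by a clustering: 1 - WSS / TSS."""
    dims = range(len(rows[0]))
    mean = [sum(r[d] for r in rows) / len(rows) for d in dims]
    total_ss = sum((r[d] - mean[d]) ** 2 for r in rows for d in dims)
    within_ss = 0.0
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centre = [sum(m[d] for m in members) / len(members) for d in dims]
        within_ss += sum((m[d] - centre[d]) ** 2 for m in members for d in dims)
    return 1.0 - within_ss / total_ss if total_ss else 1.0

def smallest_sufficient_k(rows, labelings, threshold=0.9):
    """Smallest k whose labeling explains at least `threshold` of the variance."""
    for k in sorted(labelings):
        if explained_variance(rows, labelings[k]) >= threshold:
            return k
    return max(labelings)  # nothing passed: fall back to the largest k tried

# Toy one-dimensional data with cluster labels for k = 1, 2, 3
# (hand-written here; in practice they would come from k-means runs).
rows = [[1.0], [2.0], [10.0], [11.0]]
labelings = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = smallest_sufficient_k(rows, labelings)
```

In this toy data two clusters already explain almost all of the variance, so a third cluster adds little.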

slide-102
SLIDE 102

Identify access patterns via k-means

Generating the results

  • Add information to the clusters (the access patterns output by k-means):
  • aggregation by time (session start/stop, weeks),
  • file names,
  • subtree accesses,
  • human-understandable labels.

slide-107
SLIDE 107

Data output

Features for each session, client side

  • Duration
  • Total metadata requests
  • Overwrite ratio
  • Directories accessed
  • Total IO size
  • Avg. time between IO requests
  • Tree connects
  • Application instances seen
  • Read:write ratio by bytes
  • Read sequentiality
  • Unique trees accessed
  • Total IO requests
  • ...

slide-108
SLIDE 108

Data output

Features for each session, server side

  • Total IO size
  • Total metadata requests
  • Read:write ratio by bytes
  • Avg. time between IO requests
  • Total IO requests by bytes
  • Read sequentiality
  • Read:write ratio by requests
  • Write sequentiality
  • Repeated read ratio
  • Overwrite ratio
  • Tree connects
  • Unique trees accessed
  • File opens
  • Unique files opened
  • Directories accessed
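A few of the listed features can be derived from raw trace records as in the sketch below. The `Request` record layout, the field names and the exact definition of sequentiality (a read is sequential if it starts where the previous read of the same file ended) are assumptions made for this example; the slides do not define them. The read:write ratio is reported as a read fraction so that write-free sessions do not divide by zero.

```python
from collections import namedtuple

# Hypothetical trace record; field names are assumptions for this sketch.
Request = namedtuple("Request", "time op path offset size")

def session_features(requests):
    """Compute a handful of per-session features from IO requests."""
    io = sorted((r for r in requests if r.op in ("read", "write")),
                key=lambda r: r.time)
    reads = [r for r in io if r.op == "read"]
    read_bytes = sum(r.size for r in reads)
    total_bytes = sum(r.size for r in io)

    # Count reads that continue exactly where the last read of that file ended.
    next_offset = {}
    sequential = 0
    for r in reads:
        if next_offset.get(r.path) == r.offset:
            sequential += 1
        next_offset[r.path] = r.offset + r.size

    gaps = [b.time - a.time for a, b in zip(io, io[1:])]
    return {
        "total_io_bytes": total_bytes,
        "total_io_requests": len(io),
        "read_fraction_by_bytes": read_bytes / total_bytes if total_bytes else 0.0,
        "read_sequentiality": sequential / len(reads) if reads else 0.0,
        "avg_time_between_io": sum(gaps) / len(gaps) if gaps else 0.0,
        "unique_files": len({r.path for r in io}),
    }

requests = [
    Request(0.0, "read", "a.txt", 0, 4096),
    Request(1.0, "read", "a.txt", 4096, 4096),
    Request(3.0, "write", "b.txt", 0, 8192),
    Request(4.0, "read", "a.txt", 0, 4096),
]
features = session_features(requests)
```

Each such dictionary would become one row of the data array that is fed to k-means.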

slide-109
SLIDE 109

Data output

Session access patterns, clients, corporate

slide-110
SLIDE 110

Data output

Session access patterns, clients, engineering

slide-111
SLIDE 111

Data output

Session access patterns, server, corporate

slide-112
SLIDE 112

Data output

Session access patterns, server, engineering

slide-113
SLIDE 113

Client side observations

Observation 1 - sessions and IO ratio

Sessions with IO sizes greater than 128KB are either read-only or write-only, except for the full-day work sessions.

Implication - Clients can consolidate sessions efficiently based only on the read-write ratio.
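As a rough illustration of this implication, a client could place each session into a consolidation group by looking only at its read fraction. The group names, the 1% tolerance and the function name below are hypothetical; only the 128KB boundary comes from the observation, and treating KB as 1024 bytes is an assumption.

```python
def consolidation_group(total_io_bytes, read_bytes, large_io=128 * 1024, tol=0.01):
    """Pick a consolidation group for a session from its read fraction alone."""
    if total_io_bytes <= large_io:
        return "small-io"
    read_fraction = read_bytes / total_io_bytes
    if read_fraction >= 1.0 - tol:
        return "read-only"
    if read_fraction <= tol:
        return "write-only"
    return "mixed"  # e.g. the full-day work sessions

MB = 1024 * 1024
groups = [
    consolidation_group(10 * MB, 10 * MB),  # large, all reads
    consolidation_group(1 * MB, 0),         # large, all writes
    consolidation_group(64 * 1024, 1000),   # below the 128KB boundary
    consolidation_group(10 * MB, 5 * MB),   # large, half reads
]
```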

slide-115
SLIDE 115

Client side observations

Observation 2 - full day work

The full-day work, content-viewing, and content-generating sessions all do ≈ 10MB of IO.

Implication - Client caches can already fit an entire day's IO.

slide-117
SLIDE 117

Client side observations

Number of sessions that start and end at a particular time

slide-118
SLIDE 118

Client side observations

Observation 3

The number of human-generated sessions and supporting sessions peaks on Monday and decreases steadily to 80% of the peak on Friday.

Implication - Servers get an extra "day" for background tasks by running them at appropriate times during weekdays.

slide-120
SLIDE 120

Client side observations

Observation 4

The small content viewing application and content update application instances have < 4KB total reads per file open and access a few unique files many times.

Implication - Clients should always cache the first few KB of IO per file per application.
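A minimal sketch of what "cache the first few KB per file per application" could look like on a client is given below. The class name, the 4096-byte default and the keying by an (application, path) pair are assumptions for illustration; invalidation and eviction are deliberately left out.

```python
class PrefixCache:
    """Keeps the first `prefix_bytes` of each (application, file) pair."""

    def __init__(self, prefix_bytes=4096):
        self.prefix_bytes = prefix_bytes
        self._store = {}

    def fill(self, app, path, data):
        """Remember the head of a file after it was fetched from the server."""
        self._store[(app, path)] = data[: self.prefix_bytes]

    def read(self, app, path, offset, size):
        """Return cached bytes if the whole range lies in the prefix, else None."""
        head = self._store.get((app, path))
        if head is None or offset + size > len(head):
            return None  # miss: the caller has to go to the server
        return head[offset : offset + size]

cache = PrefixCache()
cache.fill("viewer", "/docs/a.txt", b"x" * 10000)
hit = cache.read("viewer", "/docs/a.txt", 0, 1024)
miss_past_prefix = cache.read("viewer", "/docs/a.txt", 4000, 200)
miss_other_app = cache.read("editor", "/docs/a.txt", 0, 1024)
```

Because the observed instances read less than 4KB per open and reopen the same few files, most of their reads would be served by such a prefix.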

slide-122
SLIDE 122

Client side observations

Observation 5

The small content viewing application and content update application instances have < 4KB total reads per file open and access a few unique files many times.

Implication - Clients should always cache the first few KB of IO per file per application.

slide-124
SLIDE 124

Client side observations

File extensions analysis, corporate

slide-125
SLIDE 125

Client side observations

File extensions analysis, engineering

slide-126
SLIDE 126

Client side observations

Observation 6

Engineering applications with > 50% sequential reads and > 50% sequential writes are doing code compile tasks.

Implication - Servers can identify compile tasks by the presence of both sequential reads and writes; the server has to cache the output of these tasks.
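The detection rule in this implication is a simple predicate over two features. The sketch below assumes sequentiality is available as a fraction between 0 and 1 per application instance; the function name and the sample instances are hypothetical.

```python
def looks_like_compile_task(read_sequentiality, write_sequentiality, threshold=0.5):
    """True when both reads and writes are more than `threshold` sequential."""
    return read_sequentiality > threshold and write_sequentiality > threshold

# Hypothetical application instances: (name, read seq., write seq.)
instances = [
    ("instance-a", 0.80, 0.75),  # both sequential: likely a compile task
    ("instance-b", 0.90, 0.10),  # sequential reads only
    ("instance-c", 0.05, 0.60),  # sequential writes only
]
compile_like = [name for name, r, w in instances if looks_like_compile_task(r, w)]
```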

slide-128
SLIDE 128

Server side observations

Observation 7

For files with > 70% sequential reads or sequential writes, the repeated read and overwrite ratios are close to zero.

Implication - Servers should delegate sequentially accessed files to clients to improve IO performance.

slide-130
SLIDE 130

Server side observations

Observation 8

In the engineering trace, only the edit code and compile output files have a high % of repeated reads.

Implication - Servers should delegate repeatedly read files to clients.

slide-132
SLIDE 132

Server side observations

Observation 9

Almost all files are active (have opens, IO, and metadata access) for only 1-2 hours over the entire trace period, as indicated by the typical opens/read/write activity of all access patterns.

Implication - Servers can use file idle time to compress or deduplicate data to increase storage capacity.
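A server could act on this by tracking the last activity time of each file and handing files that have gone quiet to a background compression or deduplication job. The sketch below only selects the candidates; the two-hour idle threshold, the function name and the sample timestamps are assumptions, chosen loosely from the 1-2 hour activity window in the observation.

```python
def idle_files(last_access, now, idle_after_seconds=2 * 3600):
    """Paths whose most recent activity is at least `idle_after_seconds` old."""
    return sorted(
        path for path, t in last_access.items() if now - t >= idle_after_seconds
    )

# Hypothetical last-activity timestamps, in seconds since the trace start.
last_access = {
    "/proj/a.c": 1000,
    "/proj/b.o": 9000,
    "/docs/report.doc": 500,
}
candidates = idle_files(last_access, now=10000)
```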

slide-134
SLIDE 134

Server side observations

Observation 10

Deepest subtree access patterns help storage server designers develop per-directory policies. The client cacheable subtrees and temporary subtrees aggregate files with repeated reads or overwrites.

Implication - Servers can delegate repeated read and overwrite directories entirely to clients, tradeoffs permitting.

slide-137
SLIDE 137

Server side observations

Access patterns over time, file access by users

slide-138
SLIDE 138

Server side observations

Access patterns over time, application instance access

slide-139
SLIDE 139

Conclusions and future work

Trends

  • Increasing scale, heterogeneity, and consolidation,
  • Highly targeted optimizations are wanted,
  • A number of insights that inform future designs.

slide-142
SLIDE 142

Conclusions and future work

Future work

  • Make the analysis more dynamic,
  • Add data content patterns (file extensions alone are not very useful),
  • On-line analysis,
  • Very fast, dynamic feedback.

slide-146
SLIDE 146

That's all, folks. Q&A

Questions?