Towards a Middleware for Configuring Large-scale Storage Infrastructures

SLIDE 1

MGC Workshop, 1st December 2009

Towards a Middleware for Configuring Large-scale Storage Infrastructures

Ramani Routray, Rui Zhang

IBM Research Almaden

Douglas Willcocks, Peter Pietzuch

Imperial College London

David M. Eyers

University of Cambridge

slide-2
SLIDE 2

Motivation

  • The Cloud is flourishing
  • Cloud providers need large-scale storage infrastructure
  • Managing and evolving this infrastructure is challenging

– Need better tools to analyse configuration best practices

  • Would like to use Machine Learning (ML) techniques

– Need lots of training data
– Needs complex models
– Failure-related data are scarce per deployment

SLIDE 3

Contributions

  • Introduce notion of SAN configuration middleware
  • Facilitate collecting data for machine learning

– Whole community works together

  • Facilitate validation of SAN configurations

– Do proposed configuration changes violate best practice rules?
– Rules provided both by human domain experts and automated machine learning techniques

  • High level abstraction to unify underlying heterogeneity

SLIDE 4

Talk outline

  • Background

– The Cloud
– SANs
– Evolution of storage systems

  • SAN management middleware
  • The best practice repository of configurations
  • Machine Learning results
  • Conclusion

SLIDE 5

Background

  • The Cloud supports customers’ elastic demands

– E.g. Amazon S3 and EC2

  • There is no standard form for Cloud providers

– Highly heterogeneous infrastructure
– Multiple vendors add to the mix

  • Cloud providers have a choice of storage infrastructure

– i.e. the infrastructure they run in their data centres
– SAN, NAS, DAS, OS-level approaches
– Many use Storage Area Networks (SANs): our research focus

  • Managing the storage life-cycle is a monumental task

SLIDE 6

Storage Area Networks (SANs)

  • SANs provide block-based storage as a service
  • Usually they operate over a dedicated network

– Often use Fibre Channel (FC) fabrics rather than Ethernet
– Hosts usually have one or more Host Bus Adapters (HBAs)

  • SANs frequently provide redundant data paths

– Zones specify access control between sets of devices

  • Management products exist to abstract over SAN parts:

– IBM TotalStorage Productivity Center, Microsoft System Center, HP Systems Insight Manager, EMC ControlCenter, etc.

SLIDE 7

SAN overview

[Figure: SAN overview diagram. Labelled components: Application; Server; Virtual Machine; Virtualized IP Network; Virtualized FC Network; Storage Virtualization Appliance; Storage Controller; Disk; Systems Management Suite (Configuration, Connectivity, Performance, Events, Analytics).]

SLIDE 8

Cloud storage needs evolve

  • Cloud providers are often under many types of pressure

– e.g. time, quality of service, financial, ...
– Infrastructure can require expansion at short notice

  • Typically initial deployment done with expert consultants

– However, evolving their infrastructure is done in-house

  • In-house technical staff do not have collected experience of best practices

– Changes can introduce subtle inefficiencies and instabilities

  • Downtime is unacceptably expensive!

SLIDE 9

Best practices are a valuable tool

  • SAN consultants accumulate lots of experience

– See many different customers’ configurations
– Develop intuition as to safe designs

  • Best practice repositories are crucial

– IBM SAN Central team reduced the time to diagnose many problems by orders of magnitude (literally!)
– Best practice repositories are expensive to build manually

  • Plethora of devices; myriad configurations:

– Techniques to collect problem reports must improve
– All users of SANs can benefit from sharing best practices without giving away business secrets

SLIDE 10

Example best practices

  • No single points of failure along any data path
  • Do not mix incompatible software

– e.g. Linux and Windows using same volume causes corruption

  • Do not mix incompatible hardware

– e.g. Cisco switches with firmware v2.0.5 cause timeouts with host bus adapter X.

  • Do not put tape drives and disk arrays in the same zone
  • Avoid bad designs: no redundancy, incorrect zoning, etc.
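
A rule such as "do not put tape drives and disk arrays in the same zone" can be checked mechanically once zone membership is known. The following is a minimal Python sketch, not the authors' implementation: the zone dictionary layout, the function name `find_mixed_zones` and the device type labels `"tape"` and `"disk_array"` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: zone name -> list of (device name, device type).
Zones = Dict[str, List[Tuple[str, str]]]


def find_mixed_zones(zones: Zones) -> List[str]:
    """Return the names of zones that contain both a tape drive and a disk array."""
    violations = []
    for name, members in zones.items():
        kinds = {kind for _device, kind in members}
        if "tape" in kinds and "disk_array" in kinds:
            violations.append(name)
    return sorted(violations)


example_zones: Zones = {
    "zone_a": [("host1", "host"), ("array1", "disk_array")],
    "zone_b": [("host2", "host"), ("tape1", "tape"), ("array2", "disk_array")],
}
print(find_mixed_zones(example_zones))  # ['zone_b']
```

Other rules on this slide (for example, single points of failure along a data path) need topology information and would not fit this simple membership test.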

SLIDE 11

Proposed environment

[Figure: proposed environment diagram. Labelled elements and annotations:]

  • Cloud Consumers
  • Cloud Providers (Compute, Storage, Application, Service): Azaleos, Nirvanix, Force.com, Amazon S3, Amazon EC2
  • Internal Management Application
  • Best Practice Repository
  • Consumers’ requests trigger the reconfiguration actions for the cloud provider (e.g. provisioning, migration, ...)
  • Periodically upload configuration/stats OR upload best practice violations with snapshots OR upload problem tickets with snapshots
  • Customers download best practices OR customers validate configurations online

SLIDE 12

Case study

  • Cloud provider facilitates online shopping service

– Client’s seasonal sales drive increases shoppers by 50%
– Impact on data centre is across network fabric, HTTP servers, database servers, storage systems, etc.

  • Cloud provider has three types of human administrators:

– application admins, network admins and storage admins

  • Changing conditions cause notifications

– “notify me when file system capacity reaches 80%”
– Agree schedule for changes between administrators

  • Our middleware hooks in to validate proposed changes

SLIDE 13

Engineering our approach

  • Sites augment their SAN with configuration middleware
  • Middleware interacts with SAN best practice repository

– Repository is centralised for now: easier for clients to trust

  • Best practice repository uses Machine Learning to validate proposed configuration changes

– Our previous research demonstrates the applicability of ML

  • Validation can be either:

– Reactive: SAN data collected asynchronously
– Proactive: validation checks performed synchronously
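
The difference between the two validation modes can be sketched in a few lines of Python. This is an illustrative outline only: the `Middleware` class, its method names and the `has_redundancy` rule are hypothetical and are not taken from the talk.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Config = Dict[str, str]
Rule = Callable[[Config], bool]  # returns True when the configuration is acceptable


def validate(config: Config, rules: List[Rule]) -> bool:
    """Check one configuration against every rule."""
    return all(rule(config) for rule in rules)


class Middleware:
    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.pending: Deque[Config] = deque()  # snapshots awaiting reactive review

    def propose_change(self, config: Config) -> bool:
        """Proactive: validate synchronously, before the change is applied."""
        return validate(config, self.rules)

    def record_snapshot(self, config: Config) -> None:
        """Reactive: store the snapshot now and validate it later."""
        self.pending.append(config)

    def review_pending(self) -> List[Config]:
        """Validate queued snapshots and return those that violate a rule."""
        violations = []
        while self.pending:
            config = self.pending.popleft()
            if not validate(config, self.rules):
                violations.append(config)
        return violations


# Hypothetical rule: the configuration must declare redundant paths.
def has_redundancy(config: Config) -> bool:
    return config.get("redundant_paths") == "yes"
```

In the proactive case the caller gets an answer before touching the SAN; in the reactive case violations are only discovered when `review_pending` runs.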

SLIDE 14

SAN configuration middleware

[Figure: middleware architecture diagram. Labelled components: Client organisation; Central repository; Management applications (SAN config viewer, SAN planner, Configuration troubleshooter); SAN configuration middleware; Configuration log; SAN (Fibre Channel network, Systems with HBAs (servers), Storage subsystems); Config database; SAN best practice repository; Machine learning; Periodic data mining. Labelled flows: Configuration polling; Reconfiguration request; Desensitized reconfiguration request; Reconfiguration review response; Reconfiguration action; Best practice updates.]

SLIDE 15

Best practice representation

  • Declarative description of best practices

– Base descriptions on the CIM/SMI-S profiles
– Ongoing work to expand coverage

  • Abstract configuration parameters

– Protect sensitive parameter values

  • Relate problem tickets with best practice rules

– Also with the regions of the SAN infrastructure that are affected

  • Simple to represent exclusions:

– Types of devices, versions of OSes

  • Paper has more detail regarding the proposed formalism
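
One possible way to protect sensitive parameter values while keeping them comparable is to replace them with keyed digests before upload. The slides do not say how desensitization is done, so the sketch below is purely an assumption: the function `desensitize`, the `site_key` parameter and the example field names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Set


def desensitize(config: Dict[str, str], sensitive_keys: Set[str], site_key: bytes) -> Dict[str, str]:
    """Replace sensitive values with keyed digests; other values pass through."""
    result = {}
    for name, value in config.items():
        if name in sensitive_keys:
            digest = hmac.new(site_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
            result[name] = digest[:16]  # shortened token, for readability only
        else:
            result[name] = value
    return result


snapshot = {"host_name": "db01.example.com", "os": "linux"}
print(desensitize(snapshot, {"host_name"}, b"per-site secret"))
```

Because equal inputs map to equal tokens under the same key, rules that only compare values for equality or difference can still be evaluated on the desensitized data.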

SLIDE 16

Updating the repository

  • Managing a large best practice repository is challenging
  • Introduce Machine Learning (ML)

– Propose best practice rules – Help clustering and ranking of source data

  • Having a flow of configuration snapshots will increase knowledge about the precision of the repository

– Best practice rules can be updated progressively
– Remove rules when snapshots are declared to be out-of-date

  • Repository updates can be pushed to Cloud providers

SLIDE 17

Machine Learning Techniques

  • Our previous research work has validated Decision Tree learning for certain types of best practice determination

  • Now moving to ILP and related approaches

– Aim to better encode SAN structural interrelationships

  • Experts at Imperial College:

– HR: constructive hypotheses formed
– ProGolem: a more typical ILP system
– Douglas has been doing the heavy lifting!

  • Early results are promising on small-scale examples

SLIDE 18

Best practice ML building blocks

  • Cartesian:

– Avoid (or ensure) that attribute A has a specific value V

  • Connectivity:

– Avoid (or ensure) that any instance of an entity E1 is associated with at least (or at most) N instances of an entity E2

  • Exclusion:

– Ensure uniqueness of instantiations from attribute value set

  • Many-to-one:

– For attribute A value-set VS, ensure that each value V1 of VS is equal to any other value V2 of VS

  • One-to-one:

– For a given set of values VS for an attribute A, ensure that each value V1 of VS is different from any other value V2 of VS
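
Four of these building blocks translate directly into small predicates. The Python sketch below is one reading of the slide, not the formalism from the paper: the function names and data layouts are assumptions, values are assumed to be hashable, and the Exclusion block is omitted because the slide does not define it precisely enough. The "avoid" form of each rule is the negation of the "ensure" form shown here.

```python
from typing import Any, Dict, Iterable, List


def cartesian(entity: Dict[str, Any], attribute: str, value: Any) -> bool:
    """True when the entity's attribute has exactly the given value."""
    return entity.get(attribute) == value


def connectivity(links: Dict[str, List[str]], at_least: int = 0,
                 at_most: float = float("inf")) -> bool:
    """True when every E1 instance is linked to between at_least and at_most E2 instances."""
    return all(at_least <= len(targets) <= at_most for targets in links.values())


def many_to_one(values: Iterable[Any]) -> bool:
    """True when all values in the set are equal."""
    return len(set(values)) <= 1


def one_to_one(values: Iterable[Any]) -> bool:
    """True when all values are pairwise different."""
    items = list(values)
    return len(set(items)) == len(items)
```

For instance, `connectivity({"host1": ["port_a", "port_b"]}, at_least=2)` expresses "every host is associated with at least two ports", with the host and port names being made up for the example.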

SLIDE 19

Tested HR on synthetic data

  • Tested HR’s ability to find rules using synthetic data

– Doctored the configurations to include specific violations
– Used small datasets with fewer than 20 entities

  • Example test case:

– If a zone contains at least 2 hosts with distinctly different operating systems, then the zone can be considered to contain "heterogeneous operating systems".
– If a zone containing "heterogeneous operating systems" also contains a host that uses an HBA with subsystem model "subsystem_model_v100", then some form of erroneous behaviour occurs, and the SAN configuration can be considered to be "misconfigured".

SLIDE 20

Representing the hypothesis

bad_san_configuration(S) :-
    san_configuration(S),
    fabric(F),
    san_configuration_has_fabric(S, F),
    zone(Z),
    fabric_has_zone(F, Z),
    % Three hosts in the same zone (H3 may coincide with H1 or H2).
    host(H1),
    host(H2),
    host(H3),
    zone_has_host(Z, H1),
    zone_has_host(Z, H2),
    zone_has_host(Z, H3),
    % H1 and H2 run different operating systems.
    host_has_operating_system(H1, O1),
    host_has_operating_system(H2, O2),
    O1 \= O2,
    % H3 uses an HBA with the problematic subsystem model.
    hba(A),
    host_has_hba(H3, A),
    hba_has_subsystem_model(A, subsystem_model_v100).

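
For readers less familiar with Prolog, the same hypothesis can be restated procedurally. The Python sketch below is an illustrative translation, not output from HR: the nested dictionary layout (fabric, then zone, then list of hosts) and the field names `os` and `hba_models` are assumptions, and each host is assumed to run exactly one operating system.

```python
from typing import Any, Dict, List

# Hypothetical host record: {"os": <str>, "hba_models": [<str>, ...]}
Host = Dict[str, Any]


def zone_is_bad(hosts: List[Host], bad_model: str = "subsystem_model_v100") -> bool:
    """Mirror the clause body for a single zone."""
    operating_systems = {host["os"] for host in hosts}
    heterogeneous = len(operating_systems) >= 2  # some H1, H2 with different OSes
    has_bad_hba = any(bad_model in host["hba_models"] for host in hosts)  # some H3
    return heterogeneous and has_bad_hba


def bad_san_configuration(fabrics: Dict[str, Dict[str, List[Host]]]) -> bool:
    """True when any zone in any fabric satisfies the rule."""
    return any(
        zone_is_bad(hosts)
        for zones in fabrics.values()
        for hosts in zones.values()
    )


san = {
    "fabric1": {
        "zone1": [
            {"os": "linux", "hba_models": []},
            {"os": "windows", "hba_models": ["subsystem_model_v100"]},
        ]
    }
}
print(bad_san_configuration(san))  # True
```
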
SLIDE 21

HR can generate correct results

  • HR found the correct result

– In the process emitting ~400 hypotheses

  • Simon Colton (HR author) tweaked the test case to complete in about 3 s
  • However, the configuration dataset was artificially small

– Probably 100x smaller than real world configurations

  • Scaling up datasets has run into complexity problems

– Investigating whether the cause is implementation or algorithmic complexity

SLIDE 22

Future work

  • Scale up ML to larger data sets

– Further performance comparisons between ML techniques

  • Determine other data centre management issues that can employ a similar approach

– Power optimisation
– Service Level Agreement compliance tracking

  • Is it feasible to build a cross-site, cross-vendor infrastructure?

– What do you think?
– How does it blend in with the cloud model?

SLIDE 23

Conclusion

  • Cloud applications are supported by storage infrastructure in data centres
  • The complexity of large-scale storage systems makes them difficult to manage
  • Middleware aims to securely share configuration information between clients for the greater good

– Protect clients’ sensitive configuration information
– Improve efficiency (e.g. performance, reliability, energy use)
– Catch potentially damaging problems early
