Introduction to Autonomic Computing Johan Tordsson Department of - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction to Autonomic Computing

Johan Tordsson Department of Computing Science

www.cloudresearch.org

slide-2
SLIDE 2

About me

  • MSc (Civ.Ing) Computer Science (2004)
  • PhD Umeå, Grid computing (2009)
  • Postdoc in Madrid Spain (2009), OpenNebula
  • Architect etc. in misc. EC projects (2009-2013)
  • Associate professor (2014 - now)
  • Research

– Autonomic cloud and data center management
– How to make clouds run themselves faster/better/cheaper?

  • Spare time job:

– CTO & co-founder of Elastisys (UMU cloud research spinoff)
– Evangelizing that computers (will) beat humans at IT operations

slide-3
SLIDE 3

Outline

  • Why

– do we need autonomic computing?

  • What

– are autonomic systems?

  • How

– to build these autonomic systems?

  • When

– will they happen?

  • Who

– will build them?

slide-4
SLIDE 4

Motivation: software complexity

slide-5
SLIDE 5

Motivation: scale

  • Enormous buildings with servers, storage equipment, networking, and cooling
  • A factory for IT services

slide-6
SLIDE 6

Motivation: faults

Question: what is the probability of a hard drive failure?

  • In my laptop? It will happen every few years, hopefully not right now…
  • In a large supercomputer or data center? With more than 100k nodes, it will happen during this talk!
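
A back-of-the-envelope sketch of the question above, in Python. It assumes independent drives with a constant failure rate, so the number of failures in a time window is roughly Poisson distributed. The 2% annual failure rate, the ten drives per node, and the one-hour talk are illustrative assumptions, not figures from the slides.

```python
import math


def prob_at_least_one_failure(num_drives, annual_failure_rate, hours):
    """Probability that at least one of num_drives fails within `hours`.

    Assumes independent drives with a constant failure rate (Poisson model).
    """
    hours_per_year = 365 * 24
    expected_failures = num_drives * annual_failure_rate * hours / hours_per_year
    return 1.0 - math.exp(-expected_failures)


# Hypothetical inputs: 2% annual failure rate, a one-hour talk.
laptop = prob_at_least_one_failure(1, 0.02, 1)
# 100k nodes with an assumed 10 drives each.
data_center = prob_at_least_one_failure(100_000 * 10, 0.02, 1)
print(f"laptop: {laptop:.2e}, data center: {data_center:.2f}")
```

Under these assumptions a failure during the talk is negligible for a single laptop and likely for the data center; different rates or drive counts shift the numbers but not the contrast.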

slide-7
SLIDE 7

Motivation: costs

  • Question: How many servers can be handled by a system administrator?
  • Very old question…
  • Some numbers:

– 10: very complex systems
– ~300: standard large-scale organization
– Several 1000s: virtualized data center
– 26k (Facebook 2013)

  • Higher-level management and better abstractions are needed

– Alternative: exponential increase in need for systems management

slide-8
SLIDE 8

Autonomic option

  • Autonomic computing

– Named after autonomic nervous system
– Systems manage themselves according to admin goals
– Self-governing operation of entire system, not just parts of it
– New components integrate effortlessly, as a new cell establishes itself in the body

slide-9
SLIDE 9

Autonomic Computing

  • IBM initiative in early 2000s
  • Landmark paper published 2003 in IEEE Computer by Kephart and Chess @ IBM
  • Active research field since, during 2003-2013:

– 200 conferences/workshops
– 8000+ papers

  • Lots of funding

– EC FP6, FP7, H2020
– WASP…

  • Industry uptake

– Many big IT vendors & startups

  • Key point

– Self-management of IT systems

slide-10
SLIDE 10

Self-management (1/3)

  • Self-management

– Changing components
– External conditions
– Hardware/software failures

  • Ex. component upgrade

– Continually check for component upgrades
– Download and install
– Reconfigure itself
– Run a regression test
– When it detects errors, revert to the older version
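
The upgrade example above can be sketched in a few lines of Python. This is only an illustration: `Component`, `self_upgrade`, and the `regression_test` callback are hypothetical names, and download, installation, and reconfiguration are reduced to setting a version string.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Component:
    version: str
    history: List[str] = field(default_factory=list)


def self_upgrade(component: Component, available_version: str,
                 regression_test: Callable[[Component], bool]) -> bool:
    """Install available_version and revert if the regression test fails.

    Returns True if the upgrade was kept, False otherwise.
    """
    if available_version == component.version:
        return False  # already up to date, nothing to do
    previous = component.version
    # Download, install, and reconfigure (stubbed as a version change).
    component.version = available_version
    component.history.append(f"installed {available_version}")
    if regression_test(component):
        return True
    # Errors detected: revert to the older version.
    component.version = previous
    component.history.append(f"reverted to {previous}")
    return False
```

A real element would run this check continually and would also need to handle a failed revert, which the sketch ignores.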

slide-11
SLIDE 11

Self-management (2/3)

  • Four aspects of self-management

– Self-configuration

  • Configure themselves automatically
  • High-level policies (what is desired, not how)

– Self-optimization

  • Continually seek ways to improve their operation
  • Hundreds of tunable parameters

– Self-healing

  • Handle faults and errors
  • Analyze information from logs and monitors

– Self-protection

  • Malicious attacks
  • Cascading failures
  • Admin mistakes
slide-12
SLIDE 12

Self-management (3/3)

  • Autonomic computing achievable without self-awareness?

– Without hard artificial intelligence

  • (Hollywood) Misconception: machines will take over all human tasks

– AI could be a “real danger” (S. Hawking)
– Unemployment?

  • Actual idea: machines will free people to manage systems at a higher level

[Images: HAL 9000 (2001), Terminator]

slide-13
SLIDE 13

Autonomic elements

  • Fundamental atom of the architecture

– Managed element(s)

  • Server, database, storage system, etc.

– Autonomic manager

  • Responsible for:

– Providing its service
– Managing behavior according to goals
– Interacting with other autonomic elements

[Figure: an autonomic manager (Monitor, Analyze, Plan, Execute around shared Knowledge) on top of a managed element]

slide-14
SLIDE 14

Autonomic element details

  • Sensors: monitor environment
  • Effectors: tune managed element
  • MAPE loop:

– Process for self-management of autonomic element

[Figure: autonomic manager (Monitor, Analyze, Plan, Execute, Knowledge) connected to the managed element through sensors and effectors]

slide-15
SLIDE 15

The MAPE loop

  • 1. Monitor:

– Collect information about state of system
– Lots of metrics available
– Which ones to gather?
– How often to monitor?

  • 4. Execute

– Turn the “knobs” of the managed element
– Interactions between knobs?

  • Unknown, even to human operators
  • At Google, 238 knobs in each managed entity
slide-16
SLIDE 16

The MAPE loop (cont.)

  • 2. Analyze

– Estimate current state based on monitoring data
– Commonly use model of the world for this

  • “All models are wrong, but some are useful”
  • What part of system to model? How?
  • Correlations?
  • 3. Plan

– Select action(s), i.e., which knobs to turn?
– Can be formulated as optimization problem
– Reactive vs. predictive/proactive methods

  • Knowledge management

– Update model dynamically (monitoring)
– Evaluate effects of actions (execution)
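
The four MAPE steps and the shared knowledge can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not a reference design: the managed element has a single sensor (latency) and a single knob (number of servers), the planning rule is a simple reactive one, and all names are hypothetical.

```python
class AutonomicManager:
    """Minimal MAPE loop around a managed element with one sensor and one knob."""

    def __init__(self, read_sensor, set_knob, target_latency_ms):
        self.read_sensor = read_sensor  # sensor: servers -> latency in ms
        self.set_knob = set_knob        # effector: applies a server count
        self.knowledge = {"target": target_latency_ms, "servers": 1, "history": []}

    def monitor(self):
        latency = self.read_sensor(self.knowledge["servers"])
        self.knowledge["history"].append(latency)  # keep data for later analysis
        return latency

    def analyze(self, latency):
        # Positive error: too slow. Negative error: faster than required.
        return latency - self.knowledge["target"]

    def plan(self, error):
        servers = self.knowledge["servers"]
        if error > 0:
            return servers + 1
        # Scale in only when latency is below half the target.
        if error < -0.5 * self.knowledge["target"] and servers > 1:
            return servers - 1
        return servers

    def execute(self, servers):
        self.knowledge["servers"] = servers
        self.set_knob(servers)

    def step(self):
        self.execute(self.plan(self.analyze(self.monitor())))


# Toy managed element (assumption): latency falls as 400 ms / servers.
applied = []
manager = AutonomicManager(lambda n: 400.0 / n, applied.append, target_latency_ms=100.0)
for _ in range(10):
    manager.step()
```

The gap between the scale-out and scale-in thresholds is a deliberate choice to avoid oscillating around the target.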

slide-17
SLIDE 17

Engineering challenges (1/3)

  • Life cycle of an autonomic element

– Design, test, and verification

  • Testing autonomic elements is a challenge

– Installation and configuration

  • Element registers itself in a directory service

– Monitoring and problem determination

  • Elements will continually monitor themselves
  • Adaptation, optimization, reconfiguration

– Upgrading
– Uninstallation or replacement

slide-18
SLIDE 18

Engineering challenges (2/3)

  • Relationships among autonomic elements

– Specification

  • Set of output/input services of autonomic elements
  • Expressed in a standard format
  • Description syntax and semantics

– Location

  • Find input services that autonomic element needs

– Negotiation
– Provision
– Operation

  • Autonomic manager oversees the operation

– Termination

slide-19
SLIDE 19

Engineering challenges (3/3)

  • System-wide issues

– Authentication, encryption, signing
– Autonomic elements can identify themselves
– Autonomic system must be robust against insidious forms of attack

  • Goal specification

– Humans provide the goals and constraints
– Ensure that goals are specified correctly in the first place
– Autonomic systems need to protect themselves from bad input goals:

  • Inconsistent, implausible, dangerous, or unrealizable
slide-20
SLIDE 20

Specifying goals (1/3)

  • Rules

– Often simple condition-action pairs

  • If something happens, do this
  • If something else happens, do that

– Can use more complex languages to express states, context, etc.
– Explicit enumeration tedious
– Very limited ability to express complex actions
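
Condition-action rules of the kind described above can be written as plain pairs in Python. The metric names, thresholds, and action names below are made up for illustration.

```python
def evaluate_rules(rules, state):
    """Return the actions of all rules whose condition holds for `state`."""
    return [action for condition, action in rules if condition(state)]


# Each rule is a (condition, action) pair: "if something happens, do this".
rules = [
    (lambda s: s["cpu"] > 0.8, "add_server"),
    (lambda s: s["cpu"] < 0.2 and s["servers"] > 1, "remove_server"),
    (lambda s: s["disk_free"] < 0.1, "alert_operator"),
]

actions = evaluate_rules(rules, {"cpu": 0.9, "servers": 2, "disk_free": 0.05})
```

Even this small rule set shows why explicit enumeration gets tedious: every new state variable multiplies the conditions that have to be written down.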

slide-21
SLIDE 21

Specifying goals (2/3)

  • Utility functions

– Mathematical expressions
– Maps system state to scalar value
– Represents high-level objectives
– What parts of system state to include?
– What should function look like?
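
As an illustration of mapping system state to a scalar, here is one possible utility function in Python. The choice of state variables (request rate, server count, latency) and all coefficients are assumptions made for the example; picking them is exactly the open question the slide raises.

```python
def utility(state, revenue_per_request=0.001, cost_per_server=0.05,
            sla_ms=200.0, penalty=1.0):
    """Map a system state to a scalar: revenue minus server cost minus SLA penalty."""
    revenue = revenue_per_request * state["requests_per_s"]
    cost = cost_per_server * state["servers"]
    # Relative amount by which latency exceeds the SLA (0 if within the SLA).
    violation = max(0.0, state["latency_ms"] - sla_ms) / sla_ms
    return revenue - cost - penalty * violation


def best_state(candidates):
    """Pick the candidate configuration with the highest utility."""
    return max(candidates, key=utility)


candidates = [
    {"requests_per_s": 1000, "servers": 2, "latency_ms": 400.0},
    {"requests_per_s": 1000, "servers": 4, "latency_ms": 180.0},
    {"requests_per_s": 1000, "servers": 10, "latency_ms": 90.0},
]
chosen = best_state(candidates)
```

With these made-up numbers the middle configuration wins: it meets the SLA without paying for servers it does not need.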

slide-22
SLIDE 22

Specifying goals (3/3)

  • Policies

– (higher-level) descriptions of goals and constraints for operation
– How to map to lower-level behavior?
– Composition of multiple policies
– What high-level language to use?

  • Turing-complete?
  • No widely used languages available today
  • Human operators are used to explicit steering

– Not used to indirect goal specification

slide-23
SLIDE 23

Autonomic management techniques - requirements

  • Robustness

– Avoid oscillations or behavioral changes

  • Scalability

– Internet-scale: millions of servers and networks, even more autonomic agents (50 billion devices?)

  • Adaptive to changing workloads

– Some methods reliable for certain load patterns, but unstable once the load or system dynamics change

  • Performance

– Need to make decisions fast enough to react in time
– Optimal solutions vs. approximations

  • Simplicity

– Key to adoption
– Complex models vs. model-free?
– Learning phase required before deployment?

slide-24
SLIDE 24

Autonomic management - sample techniques

  • Heuristic frameworks

– Fast and simple, rules of thumb

  • Control theory

– Used to steer, e.g., industrial plants, embedded systems, etc.
– Discretization for data packet flows (queuing theory)

  • Machine learning

– Evolve behavior based on empirical (monitor) data
– Examples: Neural networks, genetic algorithms, reinforcement learning

slide-25
SLIDE 25

Heuristics

  • Rules of thumb

– Often lack theoretic background

  • Often used to handle very complex (NP-hard) problems

– Scalable, find solutions fast
– Greedy

  • Local decisions that make sense right here/now
  • May not result in optimal solution

– Hill climbing

  • Steer search (manage system in this case) towards steepest slope

– Often no upper bound

  • Not possible to know distance from optimal solution

– ”The O-word…”
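
A minimal hill-climbing sketch in Python illustrates both the speed of greedy local search and why it may miss the optimum. The objective below is a made-up table with a local optimum at x = 2 and the global optimum at x = 8.

```python
def hill_climb(score, start, neighbors, max_steps=1000):
    """Greedy local search: move to the best neighbor until none improves."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score, default=current)
        if score(best) <= score(current):
            return current  # local optimum, not necessarily the global one
        current = best
    return current


# Hypothetical objective over x = 0..10.
SCORES = {0: 0, 1: 1, 2: 3, 3: 2, 4: 1, 5: 2, 6: 4, 7: 6, 8: 9, 9: 5, 10: 0}


def score(x):
    return SCORES[x]


def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 10]


from_left = hill_climb(score, 0, neighbors)    # gets stuck at the local optimum
from_middle = hill_climb(score, 5, neighbors)  # reaches the global optimum
```

The result depends entirely on the starting point, and the search itself gives no indication of how far the returned solution is from the best one.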

slide-26
SLIDE 26

Control theory

  • Mathematical models to monitor and steer dynamic systems

– Real-time allocation of CPU, memory, etc.

  • Some simple examples:

– Proportional control

  • Adjust control signal in proportion to the error

– PID (Proportional Integral Derivative) control:

  • Integral: adjustment w.r.t. error over time
  • Derivative: adjustment w.r.t. error trend
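
A textbook discrete-time PID controller can be sketched in Python as follows. The gains and the toy plant (a value that simply accumulates the control signal) are assumptions chosen so the loop settles; they are not tuned for any real system.

```python
class PIDController:
    """Discrete PID controller: output = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt  # integral: accumulated error over time
        # Derivative: error trend (zero on the first call, no history yet).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy plant (assumption): the allocated capacity changes by the control signal.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=50.0)
capacity = 0.0
for _ in range(200):
    capacity += pid.update(capacity)
```

Setting `ki` and `kd` to zero gives the pure proportional controller from the first example.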
slide-27
SLIDE 27

Neural networks & Deep learning

  • Mimic the brain’s neuron systems
  • Input/hidden/output layers of neurons:

– Neurons in hidden layer: activation functions map input signal to output signal
– Weights feeding the activation functions tuned based on error in output layer (errors are propagated back for tuning)

  • Often used to capture multi-dimensional problems that are hard to model with other techniques
  • Hard to train (need representative training data)
  • Hard to understand cause/effect (hidden layers)

slide-28
SLIDE 28

Genetic algorithms

  • Inspired by natural evolution
  • Ingredients:

– Population with genetic representations (behaviors) for candidate solutions (can be hard)
– Inheritance, crossover, and mutation operators
– A rating function to compare solutions and select

  • Termination?

– Only compares to previous generation
– Optimal solution?

  • Adaptable to dynamic environments?
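
The ingredients above fit into a short Python sketch. It assumes bit-string genomes, one-point crossover, truncation selection, and elitism, and it terminates after a fixed number of generations; these are common choices for illustration, not the only ones. `genome_length` must be at least 2.

```python
import random


def genetic_algorithm(fitness, genome_length, rng, population_size=20,
                      generations=30, mutation_rate=0.05):
    """Evolve bit strings; returns (best genome, best fitness per generation)."""
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # rate candidate solutions
        best_per_generation.append(fitness(population[0]))
        parents = population[: population_size // 2]  # selection
        children = [population[0][:]]                 # elitism: keep the best
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_length)     # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness), best_per_generation


# Hypothetical rating function: count the ones in the genome.
best, history = genetic_algorithm(sum, genome_length=16, rng=random.Random(0))
```

Because of elitism the best fitness never decreases between generations, but nothing in the loop can tell whether the final genome is optimal, which is the termination question raised above.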

slide-29
SLIDE 29

Reinforcement learning

  • Previous methods use a model (internal representation) of the world

  • Reinforcement learning (can be) model free
  • System learns dynamically to

– select the best action for a given state
– based on reward (reinforcement) function

  • How to:

– Assign value to actions?
– Balance exploration (learning) vs. exploitation (benefit from good, known actions)

  • What if environment is too dynamic?

– Most states have not been seen before?
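
One common model-free method is tabular Q-learning with epsilon-greedy action selection, sketched below in Python. The slides do not name a specific algorithm, so this is an assumed example; the state and action names are hypothetical.

```python
import random
from collections import defaultdict


class QLearner:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, rng, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.rng = rng
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def choose(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)  # explore
        # Exploit: pick the action with the highest estimated value.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        # Move the estimate a fraction alpha towards the observed target.
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


learner = QLearner(["do_nothing", "scale_up"], random.Random(0), epsilon=0.0)
learner.learn("overloaded", "scale_up", reward=1.0, next_state="ok")
```

The table only has entries for states that were actually visited, so if most states have never been seen before the learner has nothing to exploit.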

slide-30
SLIDE 30

Autonomic element(s)

  • A single autonomic element seems doable
  • But many autonomic elements?
  • Multi-agent systems as inspiration

– Behaviors and goals of the systems
– Pattern and type of interactions among agents

  • How to achieve high-level goals in a decentralized way?

– Understand, control, and exploit emergent behavior
– Convergence?

slide-31
SLIDE 31

Autonomic elements interaction

  • Relationships

– Dynamic, short-lived
– Formed by agreement?

  • May be negotiated

– Full spectrum

  • Peer-to-peer
  • Hierarchical

– Subject to policies

  • Compare single-element policies

slide-32
SLIDE 32

Interacting control/optimization loops

[Figure: transaction requests increase demand on Server 1 and Server 2, which depend on DB Service, File System, Storage Service 1, and Storage Service 2; elements increase their service in response]

  • Feedback control & optimization of single autonomic elements
  • Done for 1-2 variables
  • What happens when feedback loops interact?

slide-33
SLIDE 33

Interacting control/optimization loops

[Figure: same system (Transaction Requests, Server 1, Server 2, DB Service, File System, Storage Service 1, Storage Service 2); demand increases, with the message “Capacity limit reached: Get more storage”]

slide-34
SLIDE 34

Interacting control/optimization loops

[Figure: same system, with the messages “Demand not being met: Find alternate supplier” and “Getting more storage”]

slide-35
SLIDE 35

Interacting control/optimization loops

[Figure: same system, with the messages “Ready to give you that extra service” and “Sorry; already found an alternative”]

slide-36
SLIDE 36

Negotiation and resource allocation

[Figure: Transaction Requests, Server 1, Server 2, DB 1, File System 1, Storage 1, and Storage 2 exchanging the messages below]

  • Request(QueryService, Queries = 800/sec, Type = 2, RT = 5 sec)
  • Request(QueryService, Queries = 400/sec, Type = 5, RT = 3 sec)
  • Request(TableSpace, Size = 3 GBytes, Reads = 2000/sec, Writes = 100/sec)
  • Request(LogicalVolume, Size = 12 GBytes, Reads = 500/sec, Writes = 500/sec)
  • Counterpropose(TableSpace, Size = 3 GBytes, Reads = 1600/sec, Writes = 100/sec)
  • Counterpropose(QueryService, Queries = 320/sec, Type = 5, RT = 4 sec)

Should all requests be met? Compute costs and benefits, propagate them down

Forms of negotiation:

  • Bilateral
  • Multilateral
  • Auction
  • Supply chain
  • Competitive/cooperative

Learning

  • During negotiation
  • Strategy evolution
  • Collective behavior?
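
The TableSpace exchange on this slide (2000 reads/sec requested, 1600 reads/sec counterproposed) can be mimicked with a small Python sketch. The message classes and the rule "offer whatever fits within current capacity" are assumptions for illustration; the slides do not specify how a provider computes its counterproposal.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableSpaceRequest:
    size_gbytes: float
    reads_per_s: int
    writes_per_s: int


@dataclass(frozen=True)
class Accept:
    offer: TableSpaceRequest


@dataclass(frozen=True)
class Counterpropose:
    offer: TableSpaceRequest


def respond(request, max_reads_per_s, max_writes_per_s):
    """Accept the request if it fits the provider's capacity, else counterpropose."""
    offer = replace(
        request,
        reads_per_s=min(request.reads_per_s, max_reads_per_s),
        writes_per_s=min(request.writes_per_s, max_writes_per_s),
    )
    return Accept(offer) if offer == request else Counterpropose(offer)


# Hypothetical provider capacity: 1600 reads/sec and 500 writes/sec.
request = TableSpaceRequest(size_gbytes=3, reads_per_s=2000, writes_per_s=100)
reply = respond(request, max_reads_per_s=1600, max_writes_per_s=500)
```

This covers only one bilateral round; the requester would still have to decide whether the reduced offer is acceptable or whether to look for another supplier.
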
slide-37
SLIDE 37

Autonomic Computing Adaptation?

  • Fully autonomic computing

– Evolve as increasingly sophisticated autonomic managers are added to existing managed elements

  • Autonomic elements will function at many levels

– At the lower levels

  • Limited range of internal behaviors
  • Hard-coded behaviors

– At the higher levels

  • Increased dynamism and flexibility
  • Goal-oriented behaviors
  • Hard-wired relationships will evolve into flexible relationships that are established via negotiation

slide-38
SLIDE 38

Adaptation (cont.)

  • 1. Collect and aggregate information

– Support decisions by human administrators

  • 2. Advisors suggesting possible actions by humans
  • 3. Autonomic systems entrusted with lower-level decisions
  • 4. Over time, less frequent and more high-level decisions by operator

– Carried out by numerous autonomic actions at lower level

slide-39
SLIDE 39

Autonomic computing – a developer perspective

  • Delegation of human operator responsibility

– Trust

  • A breakdown of how the MAPE loop can break down:

– Monitoring: Delayed? Missing? Incorrect?
– Analyze & Plan: model is wrong!
– Execute:

  • What if actuators (knobs) do not act as expected?
  • The underlying system is likely (autonomically) trying to counteract actuators
  • And your autonomic system is being steered by a higher-level one

slide-40
SLIDE 40

Developer perspective (cont.)

  • In autonomics, so much more can go wrong

– All computer systems fail
– Autonomic systems actively steer other systems, i.e., can actively make other systems fail

  • “Intelligent” actions can be harmful
  • Cascading failures
  • #1 feature: turn it off
  • #2 feature: add an “I don’t understand” mode
  • “What can go wrong?”

– If your automated system cannot handle odd inputs/configs/etc., you should not build it…
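
The two features above (an off switch and an “I don’t understand” mode) can be sketched as a guard in front of the planner. This Python example is an assumed illustration: it treats missing, NaN, or out-of-range metrics as input the system does not understand, and the metric names and valid range are hypothetical.

```python
def guarded_decision(enabled, metrics, plan, valid_range=(0.0, 1.0)):
    """Return ("act", action), ("off", None), or ("dont_understand", metric name)."""
    if not enabled:
        return ("off", None)  # feature #1: the operator turned automation off
    low, high = valid_range
    for name, value in metrics.items():
        # Missing, NaN (value != value), or out-of-range readings are not trusted.
        if value is None or value != value or not (low <= value <= high):
            return ("dont_understand", name)  # feature #2: refuse to act
    return ("act", plan(metrics))


def simple_plan(metrics):
    return "add_server" if metrics["cpu"] > 0.8 else "no_op"


decision = guarded_decision(True, {"cpu": 0.9}, simple_plan)
```

In the "dont_understand" case the sensible follow-up is to hand control back to a human rather than to keep steering on bad input.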

slide-41
SLIDE 41

Autonomic Computing Research Trends

slide-42
SLIDE 42

(Selected) Research trends

  • Cyber-physical systems

– Datacenters: building + hardware + software

  • Interacting autonomic systems

– Hierarchical & distributed
– Understanding and controlling these

  • Multi-criteria

– Multiple goals (cost, energy, performance, …)
– Multiple stakeholders

  • Datacenter owners, Application owners, end-users
slide-43
SLIDE 43

Even more research trends

  • Data-driven, predictive, & proactive

– Feedback control not enough

  • Self-aware systems

– Self-reflective, self-predictive, self-adaptive
– Context, correlations, and online models

  • Need for benchmarks

– Not only performance, but other self-* aspects


slide-44
SLIDE 44

Summary

  • Autonomic computing needed for management of complex systems such as clouds
  • Systems manage (config, repair, optimize, protect) themselves according to admin goals

– Achievable w/o solving hard AI problem

  • Many different techniques for autonomic management
  • Goal-specification can be hard
  • Interacting autonomic elements complicate matters
  • Great care needed to build autonomic systems
  • Many unsolved research questions
slide-45
SLIDE 45

Thanks! Questions?

slide-48
SLIDE 48

Capacity planning is hard!