Asynchronous intrusion recovery for interconnected web services


slide-1
SLIDE 1

Asynchronous intrusion recovery for interconnected web services

Ramesh Chandra, Taesoo Kim, Nickolai Zeldovich

MIT CSAIL

slide-2
SLIDE 2

Today's web services are highly interconnected

  • Many web services provide APIs to other sites
  • Many websites integrate those APIs:

— Authentication: Facebook Connect, Google+ ...
— Data sharing: Dropbox ...
— Business process management: Salesforce ...
— ...

slide-3
SLIDE 3

Example: online shopping mall

Customer Relationship Management (CRM)

...

slide-4
SLIDE 4

Example: online shopping mall

...

CRM Financial Force (Accounting Service) Adobe Echo Sign (E-Signature Service) Bill.ON (Invoices and Billing Service)

slide-5
SLIDE 5

Example: online shopping mall

Allow Facebook users to buy our products without registration

...

CRM Facebook Twitter Financial Force (Accounting Service) Adobe Echo Sign (E-Signature Service) Bill.ON (Invoices and Billing Service)

slide-6
SLIDE 6

Example: online shopping mall

Allow Facebook users to buy our products without registration

...

CRM Facebook Twitter Financial Force (Accounting Service) Adobe Echo Sign (E-Signature Service) Bill.ON (Invoices and Billing Service) Address in Facebook

slide-7
SLIDE 7

Attack in one service can spread between services

Financial Force (Accounting Service) Adobe Echo Sign (E-Signature Service) Bill.ON (Invoices and Billing Service) Facebook Twitter

...

CRM Address modified by Attacker Ship purchased products to ...

slide-8
SLIDE 8

Bugs in web services are commonplace

  • Facebook (Mar 29th 2013):

— Attackers can intercept full permission access tokens

slide-9
SLIDE 9

Bugs in web services are commonplace

  • Facebook (Mar 29th 2013):

— Attackers can intercept full permission access tokens

  • Many web services have similar bugs

Twitter (Aug 20th 2013)

Instagram (May 2nd 2013)

Microsoft Yammer (Aug 4th 2013)

slide-10
SLIDE 10

Goal

  • Recovering integrity in interconnected services

— Repair the state of affected services as if the attack never occurred

  • State-of-the-art: manual recovery

— Admin doesn't trust other sites for recovery
— Requires manual interaction (e.g., emailing the other admin)

slide-11
SLIDE 11

General plan for automatic recovery

  • Use rollback-and-replay to recover integrity on a single machine

— Prior work: Retro [OSDI '10], Warp [SOSP '11]

  • Extend rollback-and-replay to many web services!
slide-12
SLIDE 12

Challenges

  • Rollback-and-replay requires global coordinator

— Each service cannot decide what to do for repair

  • All services must be available during recovery

— We want to repair some services even if others are down
— Consistency problem: some services are not repaired yet

slide-13
SLIDE 13

Contributions

  • 1. Repair protocol between services
  • No central coordinator
  • Each service controls its repair
  • 2. Asynchronous repair
  • Proceed with repair even when some services are unavailable
  • Consistency in partially repaired state

Enable automatic intrusion recovery in distributed web services

slide-14
SLIDE 14

Running example of an attack

Financial Force (Accounting Service) Adobe Echo Sign (E-Signature Service) Bill.ON (Invoices and Billing Service) Facebook Twitter

...

CRM Address modified by Attacker Ship purchased products to ...

slide-15
SLIDE 15

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

slide-16
SLIDE 16

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

Victim Attacker

http://bit.ly/1xoTn

slide-17
SLIDE 17

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

Victim Attacker

http://bit.ly/1xoTn

slide-18
SLIDE 18

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

Victim Attacker

http://bit.ly/1xoTn

slide-19
SLIDE 19

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

Victim Attacker

Modify address

http://bit.ly/1xoTn

slide-20
SLIDE 20

Running example of an attack

Bill.ON (Invoices and Billing Service) Facebook

...

CRM

Victim Attacker

Address modified by Attacker Modify address

http://bit.ly/1xoTn

slide-21
SLIDE 21

Timeline of the attack

Attacker Victim Facebook Shopping Mall Bill.ON

slide-22
SLIDE 22

Timeline of the attack

Attacker Victim Facebook Shopping Mall Bill.ON

Time

slide-23
SLIDE 23

Timeline of the attack

Attacker Victim Facebook Shopping Mall Bill.ON

Time

slide-24
SLIDE 24

Timeline of the attack

Attacker Victim Facebook Shopping Mall Bill.ON

Time

slide-25
SLIDE 25

Goal: attack did not take place

Attacker Facebook Shopping Mall Bill.ON

Time

Victim

slide-26
SLIDE 26

Goal: attack did not take place

Attacker Facebook Shopping Mall Bill.ON

Time

Victim

slide-27
SLIDE 27

Overview of system execution

  • Normal execution:

— Record enough information for rollback-and-replay

  • Repair:

— Identify an attack to initiate repair
— Repair local state: rollback and replay recorded requests
— Propagate repair whenever local repair affects others

slide-28
SLIDE 28

Overview of system execution

  • Normal execution:

— Record enough information for rollback-and-replay

  • Repair:

— Identify an attack to initiate repair
— Repair local state: rollback and replay recorded requests
— Propagate repair whenever local repair affects others
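
The record, rollback, and replay steps can be sketched for a single service as follows. This is a minimal illustration rather than Aire's implementation: the `Service` class, its `handle` and `repair` methods, and the address-book style requests are all hypothetical names chosen for the example.

```python
import copy


class Service:
    """Toy service that logs requests so it can roll back and replay (hypothetical)."""

    def __init__(self):
        self.state = {}      # current application state
        self.log = []        # (tag, request) pairs in execution order
        self.snapshots = {}  # tag -> copy of the state taken just before that request

    def _apply(self, request):
        key, value = request
        self.state[key] = value

    def handle(self, tag, request):
        # Normal execution: record enough information for a later repair.
        self.snapshots[tag] = copy.deepcopy(self.state)
        self.log.append((tag, request))
        self._apply(request)

    def repair(self, attack_tag):
        # Roll back to the state just before the attack request...
        idx = next(i for i, (t, _) in enumerate(self.log) if t == attack_tag)
        self.state = copy.deepcopy(self.snapshots.pop(attack_tag))
        later = self.log[idx + 1:]
        self.log = self.log[:idx]
        # ...then replay every later request, leaving out the attack itself.
        for tag, request in later:
            self.handle(tag, request)


svc = Service()
svc.handle(1, ("address", "victim's home"))
svc.handle(2, ("address", "attacker's drop point"))  # the attack
svc.handle(3, ("phone", "555-0100"))
svc.repair(attack_tag=2)
```

After `repair`, the state holds the victim's original address and the later, legitimate phone update, as if request 2 had never run.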

slide-29
SLIDE 29

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Identify an attack for repair

slide-30
SLIDE 30

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Rollback state before the attack occurred

slide-31
SLIDE 31

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Error Rollback state before the attack occurred

slide-32
SLIDE 32

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Error Error Rollback state before the attack occurred

slide-33
SLIDE 33

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Error Error Rollback state before the attack occurred Original address

slide-34
SLIDE 34

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time

Error Error Rollback state before the attack occurred Original address

slide-35
SLIDE 35

Strawman: repair with global coordinator using rollback-and-replay

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error

Remove access token Restore victim's address

slide-36
SLIDE 36

Problems in Strawman design

  • P1. All services must be available

→ Support asynchronous repair with speculation

  • P2. Require global coordinator

→ Define repair APIs between services

slide-37
SLIDE 37

Problems in Strawman design

  • P1. All services must be available

→ Support asynchronous repair with speculation

  • P2. Require global coordinator

→ Define repair APIs between services

slide-38
SLIDE 38

Challenge: cooperating with unavailable web services

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error Error Error

Offline Unavailable

Wait for other services to come up?

slide-39
SLIDE 39

Solution: asynchronous repair

  • Asynchronously deliver repair requests
  • Speculatively proceed with local repair using past responses (or timeout responses)
  • Expose repaired state after local repair
  • Intuition: why does asynchronous repair work?

— Many web services are designed for independent operation and are prepared to handle failures of other services
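
One way to picture asynchronous delivery is a per-peer queue of outgoing repair messages that is retried until the peer comes back. The sketch below assumes a hypothetical `RepairQueue` helper and a simulated peer; the slides do not describe how Aire's repair queues are implemented.

```python
from collections import deque


class RepairQueue:
    """Outgoing repair messages for one remote service (hypothetical helper)."""

    def __init__(self, send):
        # send: callable(message); assumed to raise ConnectionError if the peer is down
        self.send = send
        self.pending = deque()

    def enqueue(self, message):
        self.pending.append(message)

    def flush(self):
        """Try to deliver queued messages in order; stop at the first failure."""
        delivered = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break  # peer still unavailable; retry on a later flush
            self.pending.popleft()
            delivered += 1
        return delivered


# Simulated remote peer that is offline at first.
peer = {"up": False, "received": []}


def send_to_peer(message):
    if not peer["up"]:
        raise ConnectionError("peer unavailable")
    peer["received"].append(message)


queue = RepairQueue(send_to_peer)
queue.enqueue(("replace_response", 7, "original address"))
first_try = queue.flush()   # peer is down: message stays queued
peer["up"] = True
second_try = queue.flush()  # peer is back: message is delivered
```

The local service does not block on `flush`; it can finish its own repair and expose the repaired state while the message waits in the queue.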
slide-40
SLIDE 40

Example: asynchronous repair

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error Error Error

Repair queues

slide-41
SLIDE 41

Example: asynchronous repair

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error Error Error

Repair queues

Asynchronously deliver new response Speculatively proceed with past request

slide-42
SLIDE 42

Example: asynchronous repair

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error Error Error

Repair queues

Asynchronously deliver new response Speculatively proceed with past request

slide-43
SLIDE 43

Example: asynchronous repair

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error Error Error

Repair queues

Asynchronously deliver new response Speculatively proceed with past request

slide-44
SLIDE 44

Example: exposing state after local repair

Attacker Victim Facebook Shopping Mall Bill.ON

Time

...

Another web service

Two services are still repairing

slide-45
SLIDE 45

What if speculation fails?

  • If a service responds differently:

— Restart local repair with the new response
— In fact, this is no different from initiating a new repair

  • Asynchronous repair will eventually converge to the correctly repaired state
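
The handling of a failed speculation can be sketched as below. The function names, the request format, and the single-field state are assumptions made for illustration only.

```python
def local_repair(requests, remote_responses):
    """Replay recorded requests using whatever remote response is currently known."""
    state = {}
    for tag, field in requests:
        state[field] = remote_responses[tag]
    return state


def on_remote_response(requests, remote_responses, tag, new_response, state):
    """Handle a (possibly late) response from a remote service.

    Returns (state, restarted).
    """
    if remote_responses[tag] == new_response:
        return state, False               # speculation held; nothing to redo
    remote_responses[tag] = new_response  # speculation failed: record the new response
    return local_repair(requests, remote_responses), True


requests = [(7, "shipping_address")]
past = {7: "attacker's address"}      # response recorded during the attack
state = local_repair(requests, past)  # speculative repair with the past response
state, restarted = on_remote_response(requests, past, 7, "victim's address", state)
state, restarted_again = on_remote_response(requests, past, 7, "victim's address", state)
```

The second call with the same response leaves the state alone, which mirrors the point that a failed speculation is handled just like a newly initiated repair.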

slide-46
SLIDE 46

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

slide-47
SLIDE 47

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

Following request depends on previous request
slide-48
SLIDE 48

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

slide-49
SLIDE 49

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

Mall Ready for shipping to:

Message:

Respond with different result

slide-50
SLIDE 50

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

slide-51
SLIDE 51

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

slide-52
SLIDE 52

Example: speculation failure

Facebook Shopping Mall

  • k

Mall Ready for shipping to:

Message:

Asynchronous repair makes forward progress in time-line graph, so it will converge to the correctly repaired state at the end

slide-53
SLIDE 53

Consistency problem: partially repaired state

Attacker Victim Facebook Shopping Mall Bill.ON

Time

...

Another web service

Two services are still repairing Repaired state

?

slide-54
SLIDE 54

Consistency problem: partially repaired state

  • Exposing partially repaired state might cause the global state to diverge

— But this is not a new problem introduced by our recovery mechanism
— Most web services already cope with this problem

slide-55
SLIDE 55

Exposing partially repaired state might violate service invariants

  • Service invariants: guarantees made by the service provider (e.g., locking service: while a lock is held, no concurrent access)
  • In theory: yes (for arbitrary tightly coupled systems)
  • In practice: no

— RESTful APIs usually provide consistency per API
— Web services are by nature loosely coupled

slide-56
SLIDE 56

Consistency: partially repaired state

  • Services and clients already deal with concurrency
  • Repair of a service is modeled as:

— Being performed by a concurrent repair actor client
— That uses the service's regular API calls

  • So partially repaired state can be treated as state resulting from yet another concurrent operation

slide-57
SLIDE 57

Problems in Strawman design

  • P1. All services must be available

→ Support asynchronous repair with speculation

  • P2. Require global coordinator

→ Define repair APIs between services

slide-58
SLIDE 58

How to propagate repair requests without global coordinator?

Attacker Victim Facebook Shopping Mall Bill.ON

Time Error Error

How to ask services to initiate repair?

slide-59
SLIDE 59

Requesting repair with APIs (RPC)

replace_response(#7, )

From: Facebook To: Shopping Mall

Facebook

From: Facebook To: Shopping Mall

Tagged: #7 API: modify the previous response New response (repaired) Tag

Shopping Mall

  • Tag each request in normal exec.
  • Each service runs repair controller
slide-60
SLIDE 60

Repair APIs (RPC)

  • No centralized coordinator; each server invokes the following repair APIs to recover from the attack

— replace_response(tag, resp): replace past response
— replace_request(tag, req): replace past request
— delete(tag): delete past request
— create(req, before, after): execute new requests in the past

slide-61
SLIDE 61

Repair APIs (RPC)

  • No centralized coordinator; each server invokes the following repair APIs to recover from the attack

— replace_response(tag, resp): replace past response
— replace_request(tag, req): replace past request
— delete(tag): delete past request
— create(req, before, after): execute new requests in the past

If a service supports these four APIs, it can participate in decentralized recovery
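
To make the four operations concrete, here is a toy in-memory model of a tagged request log that supports them. The `RepairLog` class is hypothetical, and it only edits the log; a real repair controller would also re-execute the affected requests. The slides do not spell out the meaning of `before` and `after`, so the sketch assumes they are the tags of the requests that precede and follow the new one.

```python
class RepairLog:
    """Ordered log of tagged requests with the four repair operations (toy model)."""

    def __init__(self):
        self.entries = []  # dicts {"tag", "req", "resp"} in execution order
        self.next_tag = 1

    def record(self, req, resp):
        """Normal execution: tag the request and remember its response."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "req": req, "resp": resp})
        return tag

    def _index(self, tag):
        return next(i for i, e in enumerate(self.entries) if e["tag"] == tag)

    def replace_response(self, tag, resp):
        self.entries[self._index(tag)]["resp"] = resp

    def replace_request(self, tag, req):
        self.entries[self._index(tag)]["req"] = req

    def delete(self, tag):
        del self.entries[self._index(tag)]

    def create(self, req, before, after):
        # Assumption: the new request runs after `before` and ahead of `after`.
        i, j = self._index(before), self._index(after)
        if i >= j:
            raise ValueError("`before` must precede `after` in the log")
        tag = self.next_tag
        self.next_tag += 1
        self.entries.insert(j, {"tag": tag, "req": req, "resp": None})
        return tag


log = RepairLog()
t1 = log.record("login", "ok")
t2 = log.record("get_address", "attacker's address")
t3 = log.record("checkout", "shipped")
log.replace_response(t2, "victim's address")
log.delete(t3)
t4 = log.create("verify_user", before=t1, after=t2)
```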

slide-62
SLIDE 62

Authentication of repair APIs

  • Too application specific

— (e.g. Email service: sender can delete recipient's emails?)

  • Delegate authentication to original web services

— Implement application specific policy (e.g. ask admin for confirmation of repair)
— Assign a credential to repair requests
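
Delegation can be pictured as the repair controller calling a policy function supplied by the application before it applies any repair. The sketch below is an assumption about how such a hook might look; the function names, the credential strings, and the "same principal" rule are illustrative and are not taken from Aire.

```python
def same_principal_policy(credential, original_request):
    """Example policy: allow a repair only from whoever issued the original request."""
    return credential == original_request["issuer"]


def handle_repair(op, tag, credential, log, policy):
    """Apply a repair operation only if the application's policy authorizes it."""
    original = log[tag]
    if not policy(credential, original):
        return "rejected"
    if op == "delete":
        del log[tag]
    return "applied"


log = {7: {"issuer": "alice-token", "req": "send email to bob"}}
r1 = handle_repair("delete", 7, "bob-token", log, same_principal_policy)
r2 = handle_repair("delete", 7, "alice-token", log, same_principal_policy)
```

A different application could pass a policy that queues the repair for an administrator's confirmation instead of deciding immediately.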

slide-63
SLIDE 63

Summary of design

  • 1. Asynchronous repair
  • Proceed with repair despite offline or unavailable services
  • Consistency in partially repaired state
  • 2. Repair APIs between services
  • No central coordinator
  • Each service controls its repair
  • Delegate authentication
slide-64
SLIDE 64

Implementation

  • Prototype implementation: Aire

— Extend Django web framework

— Supports existing Django apps with few modifications

  • Support Askbot, Django-OAuth, and Dpaste
  • e.g., Askbot's authentication policy: 55 LoC

— Total: 5700 lines of Python code

slide-65
SLIDE 65

Evaluation questions

  • Can Aire support real web services?
  • Can Aire recover from distributed attacks?
  • What are the runtime overheads of Aire?
slide-66
SLIDE 66

Aire supports real web services

OAuth Provider Dpaste Askbot

slide-67
SLIDE 67

Aire supports real web services

OAuth Provider Dpaste Askbot

slide-68
SLIDE 68

Aire supports real web services

OAuth Provider Dpaste Askbot

slide-69
SLIDE 69

Aire supports real web services

OAuth Provider Dpaste Askbot

...

slide-70
SLIDE 70

Aire supports real web services

Share link: http://dpaste.com/4324

OAuth Provider Dpaste Askbot Append a link to Dpaste

...

Post code

slide-71
SLIDE 71

Aire supports real web services

Share link: http://dpaste.com/4324

OAuth Provider Dpaste Askbot Append a link to Dpaste

...

Post code

Askbot + OAuth + Dpaste = 183K LoC! Aire can support large Django web applications

slide-72
SLIDE 72

Aire enables automatic recovery

Share link: http://dpaste.com/4324

Dpaste

...

Askbot OAuth Provider Post code Append a link to Dpaste

slide-73
SLIDE 73

Aire enables automatic recovery

Share link: http://dpaste.com/4324

Dpaste

...

Askbot OAuth Provider Post code Append a link to Dpaste

slide-74
SLIDE 74

Aire enables automatic recovery

Share link: http://dpaste.com/4324

Dpaste

...

Askbot OAuth Provider Post code Append a link to Dpaste

slide-75
SLIDE 75

Aire enables automatic recovery

  • Askbot, OAuth, and Dpaste are correctly recovered

— Even when Dpaste is temporarily unavailable
— Even when Dpaste goes offline

  • More examples in paper:

— Intrusion recovery (synthetic)
— Mistakes in ACL settings
— Misconfigured versioning spreadsheet

slide-76
SLIDE 76

Aire has moderate runtime overheads

  • 19-30% throughput reduction
  • 5-9KB/req storage overheads

→ Moderate overheads for websites that care more about integrity than performance

Workload   Req/s without Aire   Req/s with Aire   Logs/req with Aire
Reading    21.58                17.58             5.52 KB
Writing    23.26                16.20             9.24 KB
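
The quoted 19-30% throughput reduction can be checked against the measured request rates. The short calculation below uses only the numbers from this slide; the variable and function names are illustrative.

```python
# Requests per second from the slide: (without Aire, with Aire).
throughput = {"Reading": (21.58, 17.58), "Writing": (23.26, 16.20)}


def reduction_percent(without_aire, with_aire):
    """Throughput lost when Aire is enabled, as a percentage of the baseline."""
    return (without_aire - with_aire) / without_aire * 100


reductions = {name: reduction_percent(*pair) for name, pair in throughput.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% fewer requests per second")
```

The reading workload lands a little under 19% and the writing workload a little over 30%, consistent with the rounded range on the slide.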

slide-77
SLIDE 77

Aire's repair is efficient

  • Experiment setting:

— Attacker logs in as a victim user and writes a post
— 100 legitimate users post 5 questions and navigate
— All users are affected by the attack (read attacker's post)

Columns: Askbot, OAuth, DPaste
Repaired reqs: 105 / 2196, 2 / 9, 1 / 496
Remote repair reqs: 1, 1
Local repair time: 84.06 sec, 0.10 sec, 3.91 sec
Normal exec. time: 177.58 sec, 0.01 sec, 0.02 sec

slide-78
SLIDE 78

Aire's repair is efficient

  • Experiment setting:

— Attacker logs in as a victim user and writes a post
— 100 legitimate users post 5 questions and navigate
— All users are affected by the attack (read attacker's post)

Columns: Askbot, OAuth, DPaste
Repaired reqs: 105 / 2196, 2 / 9, 1 / 496
Remote repair reqs: 1, 1
Local repair time: 84.06 sec, 0.10 sec, 3.91 sec
Normal exec. time: 177.58 sec, 0.01 sec, 0.02 sec

Repair in Askbot propagates to OAuth and Dpaste

slide-79
SLIDE 79

Aire's repair is efficient

  • Experiment setting:

— Attacker logs in as a victim user and writes a post
— 100 legitimate users post 5 questions and navigate
— All users are affected by the attack (read attacker's post)

Columns: Askbot, OAuth, DPaste
Repaired reqs: 105 / 2196, 2 / 9, 1 / 496
Remote repair reqs: 1, 1
Local repair time: 84.06 sec, 0.10 sec, 3.91 sec
Normal exec. time: 177.58 sec, 0.01 sec, 0.02 sec

5% of requests are repaired

slide-80
SLIDE 80

Aire's repair is efficient

  • Experiment setting:

— Attacker logs in as a victim user and writes a post
— 100 legitimate users post 5 questions and navigate
— All users are affected by the attack (read attacker's post)

Columns: Askbot, OAuth, DPaste
Repaired reqs: 105 / 2196, 2 / 9, 1 / 496
Remote repair reqs: 1, 1
Local repair time: 84.06 sec, 0.10 sec, 3.91 sec
Normal exec. time: 177.58 sec, 0.01 sec, 0.02 sec

Total repair takes about half as long as normal execution, although replaying a single request during repair is about 10x slower
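
These ratios can be reproduced from the Askbot column of the table. The calculation below is a rough check, and it assumes that the per-request cost can be approximated by dividing each total time by the number of requests involved; the slides do not state how the 10x figure was derived.

```python
repaired, total = 105, 2196                # Askbot requests repaired / total
repair_time, normal_time = 84.06, 177.58   # seconds, Askbot column

fraction_repaired = repaired / total       # share of requests re-executed
speedup = normal_time / repair_time        # whole repair vs. whole normal run

# Assumption: average cost per request = total time / number of requests.
per_request_repair = repair_time / repaired
per_request_normal = normal_time / total
slowdown = per_request_repair / per_request_normal
```

Under that assumption, a bit under 5% of requests are repaired, the whole repair is roughly twice as fast as the original run, and one replayed request costs roughly ten times a normal one.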

slide-81
SLIDE 81

Related work

  • Intrusion recovery with selective re-execution:

— Retro [OSDI'10], Warp [SOSP'11]

→ Use them as building blocks for asynchronous repair

  • Intrusion recovery in distributed systems:

— Heat-ray [SOSP'09], Polygraph [EuroSys'09], Dare [APsys'12]

→ Automatic recovery in loosely coupled web services

slide-82
SLIDE 82

Summary

  • Aire recovers integrity of distributed web services

— Define a repair protocol
— Support asynchronous and decentralized repair
— Propose partial repair consistency