
Seer: Leveraging Big Data to Navigate the Increasing Complexity of Cloud Debugging

Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. Cornell University. HotCloud, July 9th, 2018.


  1. Seer: Leveraging Big Data to Navigate the Increasing Complexity of Cloud Debugging. Yu Gan, Meghna Pancholi, Dailun Cheng, Siyuan Hu, Yuan He, and Christina Delimitrou. Cornell University. HotCloud, July 9th, 2018

  2. Executive Summary
     - Microservices put more pressure on performance predictability
       - Microservice dependencies → propagate and amplify QoS violations
       - Finding the culprit of a QoS violation is difficult
       - Post-QoS violation, returning to nominal operation is hard
     - Anticipating QoS violations and identifying culprits
     - Seer: Data-driven Performance Debugging for Microservices
       - Combines lightweight RPC-level distributed tracing with hardware monitoring
       - Leverages scalable deep learning to signal QoS violations with enough slack to apply corrective action

  3. From Monoliths to Microservices

  4. Motivation
     - Advantages of microservices:
       - Ease and speed of code development and deployment
       - Security, error isolation
       - PL/framework heterogeneity
     - Challenges of microservices:
       - Change server design assumptions
       - Complicate resource management → dependencies
       - Amplify tail-at-scale effects
       - More sensitive to performance unpredictability
       - No representative end-to-end apps with microservices

  5. An End-to-End Suite for Cloud & IoT Microservices
     - 4 end-to-end applications using popular open-source microservices → ~30-40 microservices per app
       - Social Network
       - Movie Reviewing/Renting/Streaming
       - E-commerce
       - Drone control service
     - Programming languages and frameworks:
       - Node.js, Python, C/C++, Java/JavaScript, Scala, PHP, and Go
       - Nginx, memcached, MongoDB, CockroachDB, Mahout, Xapian
       - Apache Thrift RPC, RESTful APIs
       - Docker containers
       - Lightweight RPC-level distributed tracing

  6. Resource Management Implications
     [Figure panels: Netflix, Twitter, Movie Streaming, Amazon]
     - Challenges of microservices:
       - Dependencies complicate resource management
       - Dependencies change over time → difficult for users to express
       - Amplify tail-at-scale effects

  7. The Need for Proactive Performance Debugging
     - Detecting QoS violations after they occur:
       - Unpredictable performance propagates through system
       - Long time until return to nominal operation
       - Does not scale

  8. Performance Implications
     [Figure panels: Queue, CPU, Mem, Net, Disk]

  10. Seer: Data-Driven Performance Debugging
      - Leverage the massive amount of traces collected over time
        1. Apply online, practical data mining techniques that identify the culprit of an upcoming QoS violation
        2. Use per-server hardware monitoring to determine the cause of the QoS violation
        3. Take corrective action to prevent the QoS violation from occurring
      - Need to predict 100s of msec to a few sec into the future
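
The three steps on this slide can be read as one iteration of a control loop. The sketch below is only an illustration of that structure, not Seer's implementation: `seer_step`, `toy_predictor`, the queue-depth limit of 20, and the hard-coded "disk" diagnosis are all hypothetical stand-ins.

```python
from typing import Callable, Optional


def seer_step(
    trace_window: dict,
    predict_culprit: Callable[[dict], Optional[str]],
    diagnose_cause: Callable[[str], str],
    apply_action: Callable[[str, str], None],
) -> Optional[tuple]:
    """One iteration of the predict -> diagnose -> act loop.

    Returns (culprit, cause) when an action was taken, else None.
    """
    culprit = predict_culprit(trace_window)  # step 1: who will violate QoS?
    if culprit is None:
        return None
    cause = diagnose_cause(culprit)          # step 2: which resource is to blame?
    apply_action(culprit, cause)             # step 3: act before the violation
    return culprit, cause


def toy_predictor(window):
    # Hypothetical rule: flag the service with the deepest queue
    # if it exceeds a made-up limit of 20 outstanding requests.
    service, depth = max(window.items(), key=lambda kv: kv[1])
    return service if depth > 20 else None


actions_taken = []
result = seer_step(
    {"nginx": 3, "mongodb": 42},
    toy_predictor,
    lambda service: "disk",  # hypothetical diagnosis
    lambda service, cause: actions_taken.append((service, cause)),
)
```

Passing the three stages in as callables keeps the loop independent of how prediction, diagnosis, and actuation are actually done.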

  11. Tracing Framework
      - RPC-level tracing […]
      - Based on Apache Thrift
      - Timestamp start-end for each microservice
      - Store in centralized DB (Cassandra)
      - Record all requests → no sampling
      - Overhead: <0.1% in throughput and <0.2% in tail latency […]
      [Figure: tracing pipeline. Labels: Client, http latency, uService K, uService K+1, zTracer, Tracing Collector, Cassandra, QueryEngine, WebUI, Gantt charts, microservices; per-hop timing: TCP RX, TCP proc, App proc, TCP TX, RPC time TX, RPC time RX]
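
The start/end timestamping described on this slide can be sketched as follows. This is a minimal illustration and not the authors' tracer: the `Span` and `TraceCollector` names are hypothetical, an in-memory list stands in for the centralized Cassandra store, and every request is recorded (no sampling).

```python
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """One RPC handled by one microservice (hypothetical record layout)."""
    trace_id: str
    service: str
    start_ns: int
    end_ns: int

    @property
    def duration_ns(self) -> int:
        return self.end_ns - self.start_ns


@dataclass
class TraceCollector:
    """Stand-in for the centralized trace store; keeps spans in memory."""
    spans: list = field(default_factory=list)

    def record(self, trace_id, service, handler, *args):
        # Timestamp the start and end of the RPC around the handler call.
        start = time.monotonic_ns()
        try:
            return handler(*args)
        finally:
            # No sampling: every request produces a span.
            self.spans.append(Span(trace_id, service, start, time.monotonic_ns()))

    def per_service_latency(self, trace_id):
        """Return {service: total duration in ns} for one end-to-end request."""
        totals = {}
        for span in self.spans:
            if span.trace_id == trace_id:
                totals[span.service] = totals.get(span.service, 0) + span.duration_ns
        return totals


collector = TraceCollector()
collector.record("req-1", "frontend", lambda: sum(range(1000)))
collector.record("req-1", "cache", lambda key: key.upper(), "user:42")
latencies = collector.per_service_latency("req-1")
```

Grouping spans by trace ID is what allows a per-request breakdown of where time was spent across microservices.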

  12. Deep Learning to the Rescue
      - Why?
        - Architecture-agnostic
        - Adjusts to changes in dependencies over time
        - High accuracy, good scalability
        - Inference within the required window

  13. DNN Configuration
      - Input signal:
        - Container utilization
        - Latency
        - Queue depth
      - Output signal:
        - Which microservice will cause a QoS violation in the near future?
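
To make the input and output shapes concrete, here is a deliberately tiny sketch: per-microservice features (utilization, latency, queue depth) go in, and a probability per microservice comes out. It uses a single shared linear layer with a softmax instead of a deep network, and the weights, bias, and service snapshot are hypothetical values chosen for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_culprit(features, weights, bias):
    """features: {service: (cpu_util, latency_ms, queue_depth)}.

    Scores each microservice with a shared linear layer, then normalizes
    across services. Returns (most likely culprit, probabilities).
    """
    services = sorted(features)
    scores = [
        sum(w * x for w, x in zip(weights, features[s])) + bias
        for s in services
    ]
    probs = dict(zip(services, softmax(scores)))
    return max(probs, key=probs.get), probs


# Hypothetical weights; a real model would learn these from traces.
WEIGHTS = (2.0, 0.05, 0.1)
BIAS = -1.0

snapshot = {
    "nginx": (0.35, 4.0, 2),
    "memcached": (0.40, 1.0, 1),
    "mongodb": (0.92, 30.0, 25),
}
culprit, probs = predict_culprit(snapshot, WEIGHTS, BIAS)
```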

  15. DNN Configuration
      - Training once: slow (hours to days)
        - Across load levels, load distributions, request types
        - Distributed queue traces, annotated with QoS violations
        - Weight/bias inference with SGD
        - Retraining in the background
      - Inference continuously: streaming trace data
      - 93% accuracy in signaling upcoming QoS violations
      - 91% accuracy in attributing QoS violation to correct microservice
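
As a rough illustration of learning weights and a bias with SGD from traces annotated with QoS violations, the sketch below trains a logistic-regression classifier. This is a much simpler model swapped in for the deep network on the slide; the synthetic `TRACES`, the learning rate, and the epoch count are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme inputs.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(samples, epochs=200, lr=0.1, seed=0):
    """samples: list of (feature_vector, label) with label in {0, 1}.

    Plain per-sample SGD on the logistic loss; returns (weights, bias).
    """
    rng = random.Random(seed)
    dim = len(samples[0][0])
    weights, bias = [0.0] * dim, 0.0
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y  # gradient of the logistic loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """True if the model predicts a QoS violation for feature vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Synthetic annotated traces: (normalized queue depth, CPU utilization) -> violation?
TRACES = [
    ((0.1, 0.2), 0), ((0.2, 0.3), 0), ((0.3, 0.2), 0), ((0.2, 0.4), 0),
    ((0.8, 0.9), 1), ((0.9, 0.7), 1), ((0.7, 0.8), 1), ((0.9, 0.9), 1),
]
weights, bias = train_sgd(TRACES)
accuracy = sum(predict(weights, bias, x) == bool(y) for x, y in TRACES) / len(TRACES)
```

Background retraining, as mentioned on the slide, would amount to rerunning `train_sgd` on fresh traces while the previous weights keep serving predictions.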

  16. DNN Configuration
      - Accuracy stable or increasing with cluster size
      - Challenges:
        - In large clusters, inference is too slow to prevent QoS violations
        - Offload on TPUs, 10-100x improvement; 10ms for 90th percentile inference
        - Fast enough for most corrective actions to take effect (net bw partitioning, RAPL, cache partitioning, scale-up/out, etc.)
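
The timing argument on this slide is a budget check: tail inference latency plus the time a corrective action needs to take effect must fit inside the prediction horizon. A minimal sketch of that check follows; the sample latencies, action times, and the 100 ms horizon are hypothetical numbers, and the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct in (0, 100])."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def leaves_enough_slack(inference_ms, action_ms, horizon_ms, pct=90):
    """True if tail inference latency plus the action's settling time
    still fits inside the prediction horizon."""
    return percentile(inference_ms, pct) + action_ms <= horizon_ms


# Hypothetical measurements, in milliseconds.
inference_samples = [6, 7, 7, 8, 8, 9, 9, 10, 10, 14]
p90 = percentile(inference_samples, 90)
ok_fast_action = leaves_enough_slack(inference_samples, action_ms=50, horizon_ms=100)
ok_slow_action = leaves_enough_slack(inference_samples, action_ms=500, horizon_ms=100)
```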

  17. Experimental Setup
      - 40 dedicated servers
      - ~1000 single-concerned containers
      - Machine utilization 80-85%
      - Inject interference to cause QoS violation
        - Using microbenchmarks (CPU, cache, memory, network, disk I/O)

  18. Restoring QoS
      - Identify cause of QoS violation
        - Private cluster: performance counters and utilization monitors
        - Public cluster: contentious microbenchmarks
      - Adjust resource allocation
        - RAPL (fine-grain DVFS) and scale-up for CPU contention
        - Cache partitioning (CAT) for cache contention
        - Memory capacity partitioning for memory contention
        - Network bandwidth partitioning (HTB) for net contention
        - Storage bandwidth partitioning for I/O contention
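
The cause-to-action mapping on this slide can be written as a lookup table. The sketch below is illustrative only: the action strings are descriptive labels copied from the slide rather than calls into RAPL, CAT, or HTB, and the utilization threshold of 0.9 and the "most utilized resource" rule are assumptions.

```python
from enum import Enum


class Resource(Enum):
    CPU = "cpu"
    CACHE = "cache"
    MEMORY = "memory"
    NETWORK = "network"
    DISK = "disk"


# Mapping taken from the slide; the strings are descriptive labels only.
CORRECTIVE_ACTIONS = {
    Resource.CPU: "RAPL (fine-grain DVFS) and scale-up",
    Resource.CACHE: "cache partitioning (CAT)",
    Resource.MEMORY: "memory capacity partitioning",
    Resource.NETWORK: "network bandwidth partitioning (HTB)",
    Resource.DISK: "storage bandwidth partitioning",
}


def pick_saturated_resource(utilization, threshold=0.9):
    """Return the most utilized resource at or above threshold, or None."""
    resource, value = max(utilization.items(), key=lambda kv: kv[1])
    return resource if value >= threshold else None


def plan_action(utilization):
    """Map per-resource utilization to (resource, action label), or None."""
    resource = pick_saturated_resource(utilization)
    if resource is None:
        return None
    return resource, CORRECTIVE_ACTIONS[resource]


plan = plan_action({
    Resource.CPU: 0.55,
    Resource.CACHE: 0.40,
    Resource.MEMORY: 0.62,
    Resource.NETWORK: 0.97,
    Resource.DISK: 0.30,
})
```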

  19. Restoring QoS
      - Post-detection, baseline system → dropped requests
      - Post-detection, Seer → maintain nominal performance

  20. Demo
      [Figure panels: Queue, CPU, Mem, Net, Disk]

  22. Challenges Ahead
      - Serverless microservices, IoT swarms
      - Security implications of data-driven approaches
      - Fall-back mechanisms when ML goes wrong
      - Not a single-layer solution → predictability needs vertical approaches

      Thank you!
