  1. Big Data for Data Science Cloud Computing event.cwi.nl/lsde

  2. Cloud computing
     • What?
       – Computing resources as a metered service ("pay as you go")
       – Ability to dynamically provision virtual machines
     • Why?
       – Cost: capital vs. operating expenses
       – Scalability: "infinite" capacity
       – Elasticity: scale up or down on demand
     • Does it make sense?
       – Benefits to cloud users
       – Business case for cloud providers
     www.cwi.nl/~boncz/bigdatacourse www.cwi.nl/~boncz/bads event.cwi.nl/lsde

  3. Enabling technology: virtualisation
     [Diagram: traditional stack (applications on one OS on hardware) vs. virtualised stack (multiple guest OSes, each running its own applications, on a hypervisor on shared hardware)]

  4. Everything as a service
     • Utility computing = Infrastructure as a Service (IaaS)
       – Why buy machines when you can rent cycles?
       – Examples: Amazon's EC2, Rackspace
     • Platform as a Service (PaaS)
       – Give me a nice API and take care of the maintenance and upgrades
       – Example: Google App Engine
     • Software as a Service (SaaS)
       – Just run it for me!
       – Examples: Gmail, Salesforce

  5. Several Historical Trends (1/3)
     • Shared Utility Computing
       – 1960s – MULTICS – concept of a shared computing utility
       – 1970s – IBM mainframes – rent by the CPU-hour (fast/slow switch)
     • Data Center Co-location
       – 1990s–2000s – Rent machines for months or years, keep them close to the network access point, and pay a flat rate. Avoid running your own building with utilities!
     • Pay as You Go
       – Early 2000s – Submit jobs to a remote service provider, where they run on the raw hardware: Sun Cloud ($1/CPU-hour, Solaris + SGE), IBM Deep Computing Capacity on Demand (50 cents/hour)

  6. Several Historical Trends (2/3)
     • Virtualization
       – 1960s – IBM CP/CMS and later VM/370 – used to split mainframes into logical partitions
       – 1998 – VMware – first practical implementation on x86, but at a significant performance cost
       – 2003 – Xen paravirtualization recovers much of the performance, but the guest kernel must assist
       – Late 2000s – Intel and AMD add hardware support for virtualization

  7. Several Historical Trends (3/3)
     • Minicomputers (1960–1990)
       – IBM AS/400, DEC VAX
     • The age of the x86 PC (1990–2010)
       – IBM PC, Windows (1–7)
       – Linux takes the server market (2000–)
       – Hardware innovation focused on gaming/video (GPUs) and laptops
     • Mobile and server markets separate (2010–)
       – Ultramobile (tablet, phone) → ARM
       – Server → still x86, but with much more influence on hardware design
       – Parallel processing galore (a software challenge!)
       – Large utility computing providers build their own hardware: Amazon SSD cards (Fusion-io), Google network routers

  8. Big picture overview
     • Client requests are handled in the first tier by
       – PHP or ASP pages
       – Associated logic
     • These lightweight services are fast and very nimble
     • Much use of caching: the second tier
     [Diagram: users send requests to first-tier services, which are backed by a second tier of cache shards in front of an index and a database]

  9. In the outer tiers, replication is key
     • We need to replicate
       – Processing
         • Each client has what seems to be a private, dedicated server (for a little while)
       – Data
         • As much as possible!
         • The server has copies of the data it needs to respond to client requests without any delay at all
       – Control information
         • The entire system is managed in an agreed-upon way by a decentralised cloud management infrastructure

  10. What about the shards?
     • The caching components running in tier two are central to the responsiveness of tier-one services
     • The basic idea is to always use cached data if at all possible
       – So the inner services (here, a database and a search index stored in a set of files) are shielded from the online load
       – We need to replicate data within our cache to spread load and provide fault tolerance
       – But not everything needs to be fully replicated
       – Hence we often use shards with just a few replicas
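The "shards with just a few replicas" idea can be sketched as a toy key-to-shard mapping. This is a hypothetical scheme for illustration; production caches typically use consistent hashing so that adding a shard does not remap most keys.

```python
import hashlib

def shards_for(key: str, n_shards: int, n_replicas: int = 2) -> list[int]:
    """Map a key to a primary shard plus (n_replicas - 1) extra copies.

    Each key lives on a few consecutive shards, so reads can be spread
    across the replicas and one shard failure does not lose the entry --
    without fully replicating the whole cache everywhere.
    """
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % n_shards
    return [(primary + i) % n_shards for i in range(n_replicas)]

# The key is cached on just two of the eight shards, not on all of them.
print(shards_for("user:42", n_shards=8))
```

A read for "user:42" may go to either listed shard; a write must reach both before it is safe to serve from either.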

  11. Read vs. write
     • Parallelisation works fine, so long as we are reading
     • If we break a large read request into multiple read requests for sub-components to be run in parallel, how long do we need to wait?
       – Answer: as long as the slowest read
     • How about breaking up a large write request?
       – Duh… we still wait until the slowest write finishes
     • But what if these are not sub-components, but alternative copies of the same resource?
       – Also known as replicas
       – We wait the same time, but when do we make the individual writes visible?
     • Replication solves one problem but introduces another

  12. More on updating replicas in parallel
     • Several issues now arise
       – Are all the replicas applying updates in the same order?
         • Might not matter, unless the same data item is being changed
         • But then we clearly do need some agreement on order
       – What if the leader replies to the end user but then crashes, and it turns out that the updates were lost in the network?
         • Data center networks are surprisingly lossy at times
         • Also, bursts of updates can queue up
     • Such issues result in inconsistency
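The ordering problem is easy to reproduce: if two replicas receive the same two updates to the same item but the network delivers them in different orders, the replicas diverge. A minimal sketch (hypothetical "stock" counter):

```python
# Two updates to the same item, e.g. from two concurrent user actions.
updates = [("set", "stock", 5), ("set", "stock", 3)]

replica_a: dict[str, int] = {}
replica_b: dict[str, int] = {}

for _op, key, val in updates:            # replica A: original delivery order
    replica_a[key] = val
for _op, key, val in reversed(updates):  # replica B: reversed delivery order
    replica_b[key] = val

# Last write wins on each replica, so they now disagree about "stock".
print(replica_a, replica_b)
```

With updates to *different* keys the final states would match regardless of order, which is why agreement on ordering is only needed when the same item is touched.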

  13. Eric Brewer's CAP theorem
     • In a famous 2000 keynote talk at ACM PODC, Eric Brewer proposed that
       – "You can have just two from Consistency, Availability and Partition Tolerance"
     • He argues that data centres need very fast response, hence availability is paramount
     • And they should be responsive even if a transient fault makes it hard to reach some service
     • So they should use cached data to respond faster, even if the cached entry cannot be validated and might be stale!
     • Conclusion: weaken consistency for faster response
     • We will revisit this as we go along

  14. Is inconsistency a bad thing?
     • How much consistency is really needed in the first tier of the cloud?
       – Think about YouTube videos. Would consistency be an issue here?
       – What about the Amazon "number of units available" counters? Will people notice if those are a bit off?
         • Probably not, unless you are buying the last unit
         • And even then, you might be inclined to say "oh, bad luck"

  15. CASE STUDY: AMAZON WEB SERVICES

  16. Amazon AWS
     • Grew out of Amazon's need to rapidly provision and configure machines of standard configurations for its own business
     • Early 2000s – Both private and shared data centers began using virtualization to perform "server consolidation"
     • 2003 – Internal memo by Chris Pinkham describing an "infrastructure service for the world"
     • 2006 – S3 first deployed in the spring, EC2 in the fall
     • 2008 – Elastic Block Store available
     • 2009 – Relational Database Service
     • 2012 – DynamoDB

  17. Terminology
     • Instance = one running virtual machine
     • Instance Type = hardware configuration: cores, memory, disk
     • Instance Store Volume = temporary disk associated with an instance
     • Image (AMI) = stored bits which can be turned into instances
     • Key Pair = credentials used to access the VM from the command line
     • Region = geographic location, price, laws, network locality
     • Availability Zone = subdivision of a region that is fault-independent

  18. Amazon AWS

  19. EC2 Architecture
     [Diagram: EC2 instances with private IPs sit behind a firewall; a public IP faces the Internet, with SSH access for the user; instances are launched from an AMI, attach EBS volumes, and EBS snapshots are stored in S3]

  20. [image-only slide]

  21. EC2 Pricing Model
     • Free Usage Tier
     • On-Demand Instances
       – Start and stop instances whenever you like; costs are rounded up to the nearest hour (worst price)
     • Reserved Instances
       – Pay up front for one or three years in advance (best price)
       – Unused instances can be sold on a secondary market
     • Spot Instances
       – Specify the price you are willing to pay, and instances get started and stopped without any warning as the market changes (kind of like Condor!)
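The on-demand vs. reserved trade-off comes down to utilisation. A back-of-the-envelope comparison, using hypothetical hourly rates (real AWS prices vary by region, instance type, and term):

```python
import math

# Hypothetical rates for illustration only.
ON_DEMAND_RATE = 0.10  # $/hour, billed per hour rounded up
RESERVED_RATE = 0.06   # effective $/hour once the upfront payment is amortised

def on_demand_cost(hours_used: float) -> float:
    """On-demand: pay only for hours actually used, rounded up."""
    return ON_DEMAND_RATE * math.ceil(hours_used)

def reserved_cost(hours_in_term: int) -> float:
    """Reserved: pay for every hour of the term, busy or idle."""
    return RESERVED_RATE * hours_in_term

year = 365 * 24  # 8760 hours
print(f"on-demand, 50% busy: ${on_demand_cost(year * 0.5):.2f}")
print(f"reserved, full year: ${reserved_cost(year):.2f}")
```

With these rates, reserved only pays off above 60% utilisation (0.06 / 0.10); below that, on-demand is cheaper despite the worse hourly price.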

  22. Free Usage Tier
     • 750 hours of EC2 running Linux, RHEL, or SLES t2.micro instance usage
     • 750 hours of EC2 running Microsoft Windows Server t2.micro instance usage
     • 750 hours of Elastic Load Balancing plus 15 GB data processing
     • 30 GB of Amazon Elastic Block Storage in any combination of General Purpose (SSD) or Magnetic, plus 2 million I/Os (with Magnetic) and 1 GB of snapshot storage
     • 15 GB of bandwidth out aggregated across all AWS services
     • 1 GB of Regional Data Transfer
