Big Data overview, issues, challenges and opportunities, C. Onime


  1. Big Data overview, issues, challenges and opportunities C. Onime (onime@ictp.it)

  2. Outline • Interactive session – Introduction to Big-Data – Issues/challenges – Taxonomy classifications • Conclusion – Opportunities and future

  3. Pre-exercise • Before providing a formal definition, let’s try to answer the questions: – What exactly is Big-Data? – Can you identify it?

  4. Definition(s) • The term Big-Data is used for data that is “massive” in one or more of the following areas: – Volume: quantity – Velocity: generated at high speed – Variety: widespread across diverse sources and types – Variability: constantly changing meaning – Veracity: making data accurate (removing bad data) – Visualization: presenting and conveying meaning – Value: applying findings and taking action

  5. Big-Data examples • Astronomical image data from a telescope exceeds 1 TB/day • Environmental monitoring • Government: Census, National Health Records/Systems, etc. • Industry: Amazon, Google, eBay...

  6. Worldwide storage [figure]

  7. Another forecast • 0.076 ZB = 76 EB • 76 EB = 76,000 PB (76M TB) • Current estimate is that 82% of global IP traffic will be video by 2020
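
The unit relationships on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) prefixes where each step is a factor of 1000; the `convert` helper and the `PREFIXES` list are hypothetical names introduced here for illustration.

```python
# Decimal (SI) storage prefixes: each step up the list is a factor of 1000.
# Assumption: the slide uses SI units, not binary (KiB, MiB, ...) units.
PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def convert(value: float, src: str, dst: str) -> float:
    """Convert a storage quantity between two SI units (hypothetical helper)."""
    steps = PREFIXES.index(src) - PREFIXES.index(dst)
    return value * 1000.0 ** steps


if __name__ == "__main__":
    print(convert(0.076, "ZB", "EB"))  # zettabytes to exabytes
    print(convert(76, "EB", "PB"))     # exabytes to petabytes
    print(convert(76, "EB", "TB"))     # exabytes to terabytes
```

Under these assumptions, 76 EB corresponds to 76,000 PB, or 76 million TB.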

  8. Preamble • So what is driving Big Data? – Mainly industry related paradigms & applications • Data mining, Business Intelligence, Knowledge Management and now Big Data Management

  9. Data Mining • A process of analyzing data from different perspectives and summarizing it into useful information, [...] which allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified.

  10. Business Intelligence • A process of finding, gathering, aggregating and analyzing information for decision-making. It makes use of a set of technologies that allow the acquisition and analysis of data to improve company decision making and workflows.

  11. Knowledge Management • A business process that formalizes the management and use of an enterprise’s intellectual assets. KM promotes a collaborative and integrative approach to the creation, capture, organization, access and use of information assets, including the tacit, un-captured knowledge of people. • A systematic process of finding, selecting, organizing, distilling and presenting information in a way that improves an employee’s comprehension in a specific area of interest and helps an organization gain insight and understanding from its own experience.

  12. Big Data Management [figure]

  13. Other drivers • Scientific Research – High Performance Computing (LHC, SKA, Genomics) • Improvements in hardware technology – Heading towards nano-circuits, clocking resolutions, etc. • Improvements in computing platforms – Networks: always connected devices, capacity; Clouds: anytime, anywhere on-demand metered access to resources • Every user is now a provider/consumer – Social networking

  14. Issues and challenges • Perspectives – backgrounds, use cases • Taxonomies, ontologies, schemas, workflow • Bits – raw data formats and storage methods • Cycles – algorithms and analysis • Infrastructure (screws) to support Big Data – From presentation by Michael Cooper & Peter Mell of NIST

  15. Perspectives [figure]

  16. Six dimensional Taxonomy • Big Data – Data mapping – Compute infrastructure – Storage infrastructure – Analytics – Visualisation – Security & privacy

  17. Data Mapping examples • Axes: data structure (unstructured, semi-structured, structured) against latency (batch, near-real-time, real-time) • Visual media (video scene detection, image understanding) • Network security (ID, malware/virus attacks) • Sensor data (ID, long term trends, weather) • Social networking (trend analysis, query processing) • Retail (sentiment & behaviour analysis) • Financial (high speed training) • Large scale science (HEP, Genomics)

  18. Compute infrastructure • Batch – Hadoop MapReduce • Bulk synchronous parallel – Hama – Giraph – Pregel • Streaming – S4 – Storm – Spark

  19. Overview of Hadoop MapReduce [figure]

  20. Hadoop 2.0 Ecosystem [figure]

  21. Storm Cluster • Master node (Nimbus) • Zookeeper framework • Worker node – Supervisor – Worker process – Executor – Task

  22. Storm Basics • Tuple – Key-value pairs • Streams – Sequence of tuples • Spout – Source of streams • Bolt – Processing element – (filters, join, transform, etc.)

  23. Storm topology • Graph of computation – Network of spouts and bolts – Parallel & cyclic execution • Groupings – Shuffle, all, global, fields • Example: – Twitter analytics: spout, bolts: parse, count, rank, report
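
The spout and bolt roles from the Storm slides can be illustrated with a small pipeline. This is a minimal sketch in plain Python and does not use the Apache Storm API; every function name and the sample messages are hypothetical, and the stages run sequentially in one process rather than in parallel across worker nodes.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

# Plain-Python illustration of the spout/bolt idea; this is NOT the Storm API.
# All names below are hypothetical and chosen only for this example.


def message_spout(messages: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Spout: the source of the stream, emitting one tuple per message."""
    for text in messages:
        yield {"text": text}


def parse_bolt(stream: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Bolt: turn each message tuple into one tuple per hashtag."""
    for tup in stream:
        for word in tup["text"].split():
            if word.startswith("#"):
                yield {"tag": word.lower()}


def count_bolt(stream: Iterable[Dict[str, str]]) -> Counter:
    """Bolt: count how often each hashtag occurs in the stream."""
    counts: Counter = Counter()
    for tup in stream:
        counts[tup["tag"]] += 1
    return counts


def rank_bolt(counts: Counter, top_n: int = 2) -> List[Tuple[str, int]]:
    """Bolt: rank hashtags by count, most frequent first."""
    return counts.most_common(top_n)


def run_topology(messages: Iterable[str], top_n: int = 2) -> List[Tuple[str, int]]:
    """Wire spout -> parse -> count -> rank and return the report."""
    return rank_bolt(count_bolt(parse_bolt(message_spout(messages))), top_n)


if __name__ == "__main__":
    sample = ["#BigData is big", "storm and #bigdata", "#Storm basics"]
    print(run_topology(sample))
```

In a real cluster, a grouping such as fields grouping would decide which parallel instance of the counting bolt receives each hashtag; the sketch sidesteps that by using a single counter.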

  24. Storage infrastructure • Relational (SQL) – Examples (Oracle, MySQL, PostgreSQL, etc.) • NoSQL – Document oriented: examples (MongoDB, CouchDB, Couchbase) – Key-value stores: in memory (Memcached, Redis, Aerospike), Dynamo inspired (Cassandra, Riak, Voldemort) – Big-Table: examples (HBase, Cassandra) – Graph oriented: examples (Giraph, Neo4j, OrientDB) • NewSQL – In memory: examples (H-Store, VoltDB)

  25. Infrastructure mapping • Axes: data structure (unstructured, semi-structured, structured) against latency (batch, near-real-time, real-time) • NoSQL (MongoDB, CouchDB, Cassandra) • SQL (MySQL, PostgreSQL, SQLite) • Neo4j • Storm, Kinesis • Shark, Spark • VoltDB • Titan • Redis • Aerospike

  26. Storage complexity/size [figure]

  27. Analytics • Machine learning algorithm – Supervised: regression (polynomials, MARS), classification (decision trees, Naïve Bayes, support vector machines) – Un-supervised: clustering (K-means, Gaussian mixtures), reduction (principal component analysis) – Semi-supervised: active, co-training – Reinforcement: Markov decision process, Q-Learning

  28. Comparison of Data analysis paradigms (Statistics ↔ Machine learning) • Model ↔ Network, graphs • Data point ↔ Examples/instances • Response ↔ Label • Parameters ↔ Weights • Covariate ↔ Feature • Fitting/Estimation ↔ Learning • Test set performance ↔ Generalization • Regression/Classification ↔ Supervised learning • Density estimation, clustering ↔ Unsupervised learning

  29. Visualisation • Spatial layout – Charts / plots: line/bar charts, scatter plots – Trees / graphs: tree maps, arc diagrams • Abstract or summary – Binning: data cubes, histograms – Clustering: hierarchical aggregation • Interactive or real-time – MS Pivot, Deep zoom viewer, Tableau – AR systems / Mixed reality tools

  30. Mixed Reality Environments $F_{NS} = \int (S + W)$ where

  31. VR and AR • Virtual Reality (VR), e.g. CAVE – Computer generated virtual environment – Creates a completely virtual environment that is without real objects – Portable: headsets, wearable devices – Custom and typically not cost effective • Augmented Reality (AR) – Real-time integration of computer generated information into a 3D world – Blends into real world and supports real objects – Mobile: commodity devices such as smartphones and tablets – Cost effective

  32. Some Examples • VR Environments • AR Environments [figures]

  33. AR Cubicle • AR immersive cubicle around the user – 180° horizontal by 3 markers on walls – 90° vertical by marker on floor

  34. Security and privacy • Infrastructure – Secure computations – Best practices • Data privacy – Privacy preservation – Cryptography – Access control • Data management – Secure storage – Transaction logs and audits – Provenance • Integrity and reactive security – End-point security – Real-time monitoring

  35. Public Key Cryptography • Asymmetric cryptography – A pair of keys: one public and the other private – Useful for authentication and encryption – Depends mainly on the impracticability of computing the private key from its public counterpart – The public key may be freely exchanged over insecure channels, e.g. via public key servers – Computationally intensive mathematical algorithms
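
The key-pair idea can be made concrete with textbook RSA on very small numbers. This is a toy sketch under stated assumptions, not a usable cryptosystem: the primes 61 and 53 and the exponent 17 are illustrative choices, the helper names are hypothetical, and real deployments rely on vetted libraries, padding schemes, and keys of 2048 bits or more.

```python
from math import gcd

# Toy RSA with tiny textbook primes, for illustration only.
# Assumption: no padding and no security; it only shows how the key pair relates.


def make_keypair(p: int, q: int, e: int):
    """Return ((e, n), (d, n)) for primes p and q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime with (p - 1) * (q - 1)")
    d = pow(e, -1, phi)  # modular inverse, requires Python 3.8+
    return (e, n), (d, n)


def apply_key(value: int, key) -> int:
    """Modular exponentiation: what one key does, the other key undoes."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


if __name__ == "__main__":
    public, private = make_keypair(61, 53, 17)
    ciphertext = apply_key(65, public)             # encrypt with the public key
    print(apply_key(ciphertext, private))          # decrypt with the private key
    signature = apply_key(65, private)             # sign with the private key
    print(apply_key(signature, public))            # verify with the public key
```

Recovering the private exponent from the public key requires factoring the modulus, which is trivial here and impractical for realistically sized keys; that gap is what the scheme depends on.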

  36. Digital Certificates • Similar to a travel passport – Provides forgery resistant identifying information • Name of holder • Serial number • Expiration date • Copy of holder’s public key (used for encryption) • Digital signature of issuing authority (CA)
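
The certificate fields listed above can be sketched as a small data structure whose contents are signed by the issuing authority. This is an illustration only and not the X.509 format: the field names, the `issue` and `is_valid` helpers, and the toy CA key (textbook RSA built from two Mersenne primes) are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, replace
from datetime import date

# Illustrative certificate structure; field names are hypothetical, not X.509.
# Assumption: the CA uses a toy RSA key built from two Mersenne primes.
CA_P, CA_Q, CA_E = 2**61 - 1, 2**89 - 1, 65537
CA_N = CA_P * CA_Q
CA_D = pow(CA_E, -1, (CA_P - 1) * (CA_Q - 1))  # CA private exponent


@dataclass(frozen=True)
class Certificate:
    holder: str
    serial: int
    expires: date
    holder_public_key: tuple
    signature: int = 0

    def digest(self) -> int:
        """Hash the identifying fields (everything except the signature)."""
        body = "|".join([self.holder, str(self.serial),
                         self.expires.isoformat(), str(self.holder_public_key)])
        return int.from_bytes(hashlib.sha256(body.encode()).digest(), "big") % CA_N


def issue(cert: Certificate) -> Certificate:
    """CA side: sign the digest with the CA private key."""
    return replace(cert, signature=pow(cert.digest(), CA_D, CA_N))


def is_valid(cert: Certificate, today: date) -> bool:
    """Verifier side: check the CA signature and the expiration date."""
    signature_ok = pow(cert.signature, CA_E, CA_N) == cert.digest()
    return signature_ok and today <= cert.expires


if __name__ == "__main__":
    cert = issue(Certificate("Alice", 1001, date(2030, 1, 1), (17, 3233)))
    print(is_valid(cert, date(2025, 1, 1)))
    print(is_valid(replace(cert, holder="Mallory"), date(2025, 1, 1)))
```

Changing any identifying field after issuance changes the digest, so the stored signature no longer verifies; this is the forgery resistance the passport analogy refers to.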

  37. SSL Transport • Client → Server: client hello • Server → Client: hello reply + certificate (client checks it against its trusted certificates) • Client → Server: key exchange + certificate (server checks it against its trusted certificates) • Client OK, Server OK • Encrypted messages
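
The two sides of this exchange can be configured with Python's standard `ssl` module, which negotiates TLS, the successor to SSL. The sketch below only builds the contexts and opens no connection; the function names are hypothetical, and a real server would additionally need its own certificate and key files, which are not specified in the slides.

```python
import ssl

# Sketch of both handshake roles using the standard ssl module.
# Assumption: certificate and key files are supplied elsewhere by the deployer.


def client_context() -> ssl.SSLContext:
    """Client side: verify the server certificate against trusted CAs."""
    # The default client context requires a valid certificate and checks
    # that it matches the server hostname.
    return ssl.create_default_context()


def server_context(require_client_cert: bool = True) -> ssl.SSLContext:
    """Server side: optionally demand a certificate from the client too."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    if require_client_cert:
        ctx.verify_mode = ssl.CERT_REQUIRED
    # A real server would also call ctx.load_cert_chain(...) with its own
    # certificate and private key before accepting connections.
    return ctx


if __name__ == "__main__":
    print(client_context().verify_mode)
    print(server_context().verify_mode)
```

Once a context exists, `wrap_socket` on that context performs the handshake over an ordinary socket, after which application data is exchanged encrypted.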
