Big Data Analytics 6hp http://www.ida.liu.se/~patla00/courses/BDA

Teachers
- Lectures: Patrick Lambrix, Christoph Kessler, Jose Pena, Valentina Ivanova
- Labs: Zlatan Dragisic, Huanyu Li
- NSC: Rickard Armiento

Course literature
- Articles (on web)
- Lab descriptions (on web)

Data and Data Storage

Data and Data Storage
- Database / Data source
  - One (of several) ways to store data in electronic format
  - Used in everyday life: bank, hotel reservations, library search, shopping

Databases / Data sources
- Database management system (DBMS): a collection of programs to create and maintain a database
- Database system = database + DBMS

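Databases / Data sources
[Figure: information is captured in a model; queries go in and answers come out of the database system. The database system consists of the database management system (processing of queries/updates, access to stored data) and the physical database.]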

What information is stored?
- Model the information
  - Entity-Relationship model (ER)
  - Unified Modeling Language (UML)

What information is stored? - ER
- entities and attributes
- entity types
- key attributes
- relationships
- cardinality constraints
- EER: sub-types

1 tgctacccgc gcccgggctt ctggggtgtt ccccaaccac ggcccagccc tgccacaccc 61 cccgcccccg gcctccgcag ctcggcatgg gcgcgggggt gctcgtcctg ggcgcctccg 121 agcccggtaa cctgtcgtcg gccgcaccgc tccccgacgg cgcggccacc gcggcgcggc 181 tgctggtgcc cgcgtcgccg cccgcctcgt tgctgcctcc cgccagcgaa agccccgagc 241 cgctgtctca gcagtggaca gcgggcatgg gtctgctgat ggcgctcatc gtgctgctca 301 tcgtggcggg caatgtgctg gtgatcgtgg ccatcgccaa gacgccgcgg ctgcagacgc 361 tcaccaacct cttcatcatg tccctggcca gcgccgacct ggtcatgggg ctgctggtgg 421 tgccgttcgg ggccaccatc gtggtgtggg gccgctggga gtacggctcc ttcttctgcg 481 agctgtggac ctcagtggac gtgctgtgcg tgacggccag catcgagacc ctgtgtgtca 541 ttgccctgga ccgctacctc gccatcacct cgcccttccg ctaccagagc ctgctgacgc 601 gcgcgcgggc gcggggcctc gtgtgcaccg tgtgggccat ctcggccctg gtgtccttcc 661 tgcccatcct catgcactgg tggcgggcgg agagcgacga ggcgcgccgc tgctacaacg 721 accccaagtg ctgcgacttc gtcaccaacc gggcctacgc catcgcctcg tccgtagtct 781 ccttctacgt gcccctgtgc atcatggcct tcgtgtacct gcgggtgttc cgcgaggccc 841 agaagcaggt gaagaagatc gacagctgcg agcgccgttt cctcggcggc ccagcgcggc 901 cgccctcgcc ctcgccctcg cccgtccccg cgcccgcgcc gccgcccgga cccccgcgcc 961 ccgccgccgc cgccgccacc gccccgctgg ccaacgggcg tgcgggtaag cggcggccct 1021 cgcgcctcgt ggccctacgc gagcagaagg cgctcaagac gctgggcatc atcatgggcg 1081 tcttcacgct ctgctggctg cccttcttcc tggccaacgt ggtgaaggcc ttccaccgcg 1141 agctggtgcc cgaccgcctc ttcgtcttct tcaactggct gggctacgcc aactcggcct 1201 tcaaccccat catctactgc cgcagccccg acttccgcaa ggccttccag ggactgctct 1261 gctgcgcgcg cagggctgcc cgccggcgcc acgcgaccca cggagaccgg ccgcgcgcct 1321 cgggctgtct ggcccggccc ggacccccgc catcgcccgg ggccgcctcg gacgacgacg 1381 acgacgatgt cgtcggggcc acgccgcccg cgcgcctgct ggagccctgg gccggctgca 1441 acggcggggc ggcggcggac agcgactcga gcctggacga gccgtgccgc cccggcttcg 1501 cctcggaatc caaggtgtag ggcccggcgc ggggcgcgga ctccgggcac ggcttcccag 1561 gggaacgagg agatctgtgt ttacttaaga ccgatagcag gtgaactcga agcccacaat 1621 cctcgtctga atcatccgag gcaaagagaa aagccacgga ccgttgcaca aaaaggaaag 1681 tttgggaagg gatgggagag 
tggcttgctg atgttccttg ttg

DEFINITION  Homo sapiens adrenergic, beta-1-, receptor
ACCESSION   NM_000684
SOURCE ORGANISM  human
REFERENCE 1
  AUTHORS  Frielle, Collins, Daniel, Caron, Lefkowitz, Kobilka
  TITLE    Cloning of the cDNA for the human beta 1-adrenergic receptor
REFERENCE 2
  AUTHORS  Frielle, Kobilka, Lefkowitz, Caron
  TITLE    Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes

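Entity-relationship
[Figure: ER diagram]
- Entity type PROTEIN with attributes protein-id, accession, definition, source
- Entity type ARTICLE with attributes article-id, title, author
- Relationship Reference between PROTEIN (m) and ARTICLE (n)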

How is the information stored? (high level)
How is the information accessed? (user level)
Approaches, ordered by structure and precision:
- Text (IR)
- Semi-structured data
- Data models (DB)
- Rules + Facts (KB)

IR - formal characterization
Information retrieval model: (D,Q,F,R)
- D is a set of document representations
- Q is a set of queries
- F is a framework for modeling document representations, queries and their relationships
- R associates a real number to document-query pairs (ranking)

IR - Boolean model
         adrenergic  cloning  receptor
(1 1 0)  yes         yes      no        --> Doc1
(0 1 0)  no          yes      no        --> Doc2
Q1: cloning and (adrenergic or receptor) --> (1 1 0) or (1 1 1) or (0 1 1)
Result: Doc1
Q2: cloning and not adrenergic --> (0 1 0) or (0 1 1)
Result: Doc2
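
The two Boolean queries above can be checked with a minimal Python sketch. The term order (adrenergic, cloning, receptor) is taken from the slide; the helper names `has`, `q1`, `q2` and `boolean_retrieve` are hypothetical and only serve the illustration.

```python
# Term order as on the slide: (adrenergic, cloning, receptor).
TERMS = ("adrenergic", "cloning", "receptor")

# Binary document representations from the slide.
DOCS = {"Doc1": (1, 1, 0), "Doc2": (0, 1, 0)}


def has(doc_vector, term):
    """Return True if the document representation contains the term."""
    return bool(doc_vector[TERMS.index(term)])


def q1(v):
    # Q1: cloning and (adrenergic or receptor)
    return has(v, "cloning") and (has(v, "adrenergic") or has(v, "receptor"))


def q2(v):
    # Q2: cloning and not adrenergic
    return has(v, "cloning") and not has(v, "adrenergic")


def boolean_retrieve(query, docs=DOCS):
    """Return the names of all documents that satisfy the Boolean query."""
    return [name for name, vector in docs.items() if query(vector)]


print(boolean_retrieve(q1))
print(boolean_retrieve(q2))
```

In the Boolean model a document either matches or it does not, so the result is an unranked set.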

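IR - Vector model (simplified)
- Doc1 = (1,1,0)
- Doc2 = (0,1,0)
- Q = (1,1,1)
$\mathrm{sim}(d,q) = \frac{d \cdot q}{|d| \times |q|}$
[Figure: Doc1, Doc2 and Q drawn as vectors in a space whose axes are the terms cloning, adrenergic and receptor]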
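
As a rough illustration of the similarity measure, the sketch below computes the cosine similarity between the query and each document and ranks the documents by it. It assumes the same term order as on the Boolean model slide; the names `cosine_similarity` and `ranking` are hypothetical.

```python
import math


def cosine_similarity(d, q):
    """sim(d, q) = (d . q) / (|d| * |q|); defined as 0.0 if either vector is all zeros."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm_d = math.sqrt(sum(di * di for di in d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)


# Vectors from the slide.
DOCS = {"Doc1": (1, 1, 0), "Doc2": (0, 1, 0)}
QUERY = (1, 1, 1)

# Rank documents by decreasing similarity to the query.
ranking = sorted(DOCS, key=lambda name: cosine_similarity(DOCS[name], QUERY), reverse=True)

for name in ranking:
    print(name, round(cosine_similarity(DOCS[name], QUERY), 3))
```

Unlike the Boolean model, every document receives a score, so the answer is a ranked list rather than a set.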

Semi-structured data
[Figure: labeled graph rooted at Protein DB]
Protein DB
  PROTEIN
    DEFINITION: "Homo sapiens adrenergic, beta-1-, receptor"
    ACCESSION: NM_000684
    SOURCE: human
    REFERENCE
      TITLE: "Cloning of …"
      AUTHOR: Frielle
      AUTHOR: Collins
      AUTHOR: Daniel
      AUTHOR: Caron
      AUTHOR: Lefkowitz
      AUTHOR: Kobilka
    REFERENCE
      TITLE: "Human beta-1 …"
      AUTHOR: Frielle
      AUTHOR: Kobilka
      AUTHOR: Lefkowitz
      AUTHOR: Caron

Semi-structured data - Queries
select source
from PROTEINDB.protein P
where P.accession = "NM_000684";
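
One way to picture how such a query is evaluated is to treat the semi-structured data as nested dictionaries and lists and to follow the path `PROTEINDB.protein`. The sketch below is only an analogy in Python, not the query language used on the slide; the dictionary layout and the function name `select_source` are assumptions, and the values are those of the NM_000684 record shown earlier.

```python
# Semi-structured data modeled as nested dicts/lists (an assumed encoding).
PROTEINDB = {
    "protein": [
        {
            "accession": "NM_000684",
            "definition": "Homo sapiens adrenergic, beta-1-, receptor",
            "source": "human",
            "reference": [
                {
                    "title": "Cloning of the cDNA for the human beta 1-adrenergic receptor",
                    "author": ["Frielle", "Collins", "Daniel", "Caron", "Lefkowitz", "Kobilka"],
                },
                {
                    "title": "Human beta 1- and beta 2-adrenergic receptors: structurally and "
                             "functionally related receptors derived from distinct genes",
                    "author": ["Frielle", "Kobilka", "Lefkowitz", "Caron"],
                },
            ],
        }
    ]
}


def select_source(db, accession):
    """Mimic: select source from db.protein P where P.accession = accession."""
    # .get() is used because semi-structured objects may lack some attributes.
    return [
        p.get("source")
        for p in db.get("protein", [])
        if p.get("accession") == accession
    ]


print(select_source(PROTEINDB, "NM_000684"))
```

The use of `.get()` reflects the main difference from a relational table: objects need not all have the same attributes.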

Relational databases

PROTEIN
PROTEIN-ID  ACCESSION  DEFINITION                                  SOURCE
1           NM_000684  Homo sapiens adrenergic, beta-1-, receptor  human

REFERENCE
PROTEIN-ID  ARTICLE-ID
1           1
1           2

ARTICLE-AUTHOR
ARTICLE-ID  AUTHOR
1           Frielle
1           Collins
1           Daniel
1           Caron
1           Lefkowitz
1           Kobilka
2           Frielle
2           Kobilka
2           Lefkowitz
2           Caron

ARTICLE-TITLE
ARTICLE-ID  TITLE
1           Cloning of the cDNA for the human beta 1-adrenergic receptor
2           Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes

Relational databases - SQL
select source from protein where accession = 'NM_000684';

PROTEIN
PROTEIN-ID  ACCESSION  DEFINITION                                  SOURCE
1           NM_000684  Homo sapiens adrenergic, beta-1-, receptor  human
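
The query can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under some assumptions: hyphens in the slide's table and column names are replaced by underscores, the REFERENCE table is called `protein_reference` here, and only the tables needed for the two queries are created. The second query (a join that lists article titles) is an added illustration, not part of the slide.

```python
import sqlite3

# In-memory database holding a subset of the tables from the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    accession  TEXT,
    definition TEXT,
    source     TEXT
);
CREATE TABLE protein_reference (protein_id INTEGER, article_id INTEGER);
CREATE TABLE article_title (article_id INTEGER PRIMARY KEY, title TEXT);

INSERT INTO protein VALUES
    (1, 'NM_000684', 'Homo sapiens adrenergic, beta-1-, receptor', 'human');
INSERT INTO protein_reference VALUES (1, 1), (1, 2);
INSERT INTO article_title VALUES
    (1, 'Cloning of the cDNA for the human beta 1-adrenergic receptor'),
    (2, 'Human beta 1- and beta 2-adrenergic receptors: structurally and functionally related receptors derived from distinct genes');
""")

# The query from the slide; the accession is passed as a bound parameter.
source_rows = conn.execute(
    "SELECT source FROM protein WHERE accession = ?", ("NM_000684",)
).fetchall()

# Added illustration: join across the tables to list the referenced article titles.
titles = [
    row[0]
    for row in conn.execute(
        """
        SELECT t.title
        FROM protein p
        JOIN protein_reference r ON r.protein_id = p.protein_id
        JOIN article_title t ON t.article_id = r.article_id
        WHERE p.accession = ?
        ORDER BY t.article_id
        """,
        ("NM_000684",),
    )
]

print(source_rows)
print(titles)
```

Note that the accession is a string value, so in plain SQL it has to be written as a quoted literal ('NM_000684').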

Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - Advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, multimedia, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems
  - NoSQL databases

Knowledge bases
(F) source(NM_000684, Human)
(R) source(P?, Human) => source(P?, Mammal)
(R) source(P?, Mammal) => source(P?, Vertebrate)
Q: ?- source(NM_000684, Vertebrate)
A: yes
Q: ?- source(x?, Mammal)
A: x? = NM_000684
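
The fact, the two rules and both queries can be reproduced with a small Python sketch. It uses naive forward chaining (apply the rules until no new facts appear), which is an assumption made for the illustration; the slide does not say which inference strategy is used. The names `forward_chain`, `ask` and `find` are hypothetical.

```python
# (F) source(NM_000684, Human), stored as a (protein, class) pair.
FACTS = {("NM_000684", "Human")}

# (R) source(P?, premise) => source(P?, conclusion)
RULES = [("Human", "Mammal"), ("Mammal", "Vertebrate")]


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for protein, cls in list(derived):
                if cls == premise and (protein, conclusion) not in derived:
                    derived.add((protein, conclusion))
                    changed = True
    return derived


KB = forward_chain(FACTS, RULES)


def ask(protein, cls):
    """Ground query, e.g. ?- source(NM_000684, Vertebrate)."""
    return (protein, cls) in KB


def find(cls):
    """Query with a variable, e.g. ?- source(x?, Mammal)."""
    return sorted(protein for protein, c in KB if c == cls)


print(ask("NM_000684", "Vertebrate"))
print(find("Mammal"))
```

The answers are not stored explicitly; they follow from combining the stored fact with the rules, which is what distinguishes a knowledge base from the database examples.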

Interested in more?
- 732A57 Database Technology (relational databases)
- TDDD43 Advanced data models and databases (IR, semi-structured data, DB, KB)
- 732A47 Text mining (includes IR)

Analytics

Analytics
- Discovery, interpretation and communication of meaningful patterns in data

Analytics - IBM
- What is happening? Descriptive: discovery and explanation
- Why did it happen? Diagnostic: reporting, analysis, content analytics
- What could happen? Predictive: predictive analytics and modeling
- What action should I take? Prescriptive: decision management
- What did I learn, what is best? Cognitive

Analytics - Oracle
- Classification
- Regression
- Clustering
- Attribute importance
- Anomaly detection
- Feature extraction and creation
- Market basket analysis

Why Analytics?
- The Explosive Growth of Data
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!

Ex.: Market Analysis and Management
- Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
- Customer profiling: what types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different groups of customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (data central tendency and variation)
