Counting Types for Massive JSON Datasets
BDA 2017, Nancy (presented at DBPL 2017)
Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani

Counting types
- Type theory perspective
  - Can types count?
  - Should they?
- Database perspective
  - How to efficiently summarize the structure of large JSON datasets?
  - How precise is the summary?

The first problem
- Type inference for massive JSON datasets, BDA 2016 / EDBT 2017
- We infer this type from a collection of JSON objects:
  { title : Str ;
    text : [ Str ] + Null ;
    author : { address:T? ; affil:T? ;… } ? ;
    abstract : Str ? }
- How «optional» is the author?
- How frequently is a text Null?

Let us count
  { title : Str^1000 ;
    text : ([ Str^8000 ]^800 + Null^200)^1000 ;
    author : { add:T^300 ? ; affil:T^300 ? ;… }^800 ? ;
    abstract : Str^20 ? }^1000
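The counters answer the two questions of the previous slide by simple division. A minimal sketch in Python, with the counts copied from the type above; the variable names and the helper `ratio` are illustrative assumptions, not part of the authors' tool:

```python
from fractions import Fraction

# Counts read off the counting type above.
total_objects = 1000    # counter on the outer record
author_present = 800    # counter on the record typed by author
text_values = 1000      # counter on the union typed by text
text_null = 200         # counter on Null inside that union

def ratio(part: int, whole: int) -> Fraction:
    """Exact share of `whole` covered by `part`."""
    return Fraction(part, whole)

author_missing = 1 - ratio(author_present, total_objects)
null_share = ratio(text_null, text_values)

print(f"author is absent in {float(author_missing):.0%} of the objects")
print(f"text is Null in {float(null_share):.0%} of its occurrences")
```

Both shares come out at one fifth for these counts.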

The second problem
- How to capture correlation information?
- Candidate types, from concision to precision:
  { addr:T^300 ; aff:T^300 ; r:T^800 }^800   (concision)
  { addr:T^300 ; aff:T^300 ; r:T^300 }^300 + { r:T^500 }^500
  { addr:T^300 ; r:T^300 }^300 + { aff:T^300 ; r:T^500 }^500
  { addr:T^300 ; r:T^500 }^500 + { aff:T^300 ; r:T^300 }^300
  { addr:T^300 ; r:T^300 }^300 + { aff:T^300 ; r:T^300 }^300 + { r:T^200 }^200   (precision)
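One way to see why correlation is a separate problem: all five candidates carry the same per-field totals and differ only in which fields they report together. The Python sketch below checks this; the encoding of a union as a list of `(record_count, {field: field_count})` pairs is an assumption made for the example.

```python
from collections import Counter

# The five candidate types above, one list of union members each.
candidates = [
    [(800, {"addr": 300, "aff": 300, "r": 800})],
    [(300, {"addr": 300, "aff": 300, "r": 300}), (500, {"r": 500})],
    [(300, {"addr": 300, "r": 300}), (500, {"aff": 300, "r": 500})],
    [(500, {"addr": 300, "r": 500}), (300, {"aff": 300, "r": 300})],
    [(300, {"addr": 300, "r": 300}), (300, {"aff": 300, "r": 300}),
     (200, {"r": 200})],
]

def totals(union):
    """Sum the record counters and the field counters over a union."""
    records = 0
    fields = Counter()
    for count, record in union:
        records += count
        fields.update(record)   # adds the per-field counts
    return records, dict(fields)

summaries = [totals(u) for u in candidates]
for s in summaries:
    print(s)
```

Every line printed shows 800 records with addr 300, aff 300 and r 800, so the totals alone cannot tell the candidates apart; only the split into union members says whether addr and aff occur together.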

The type system
  B ::= Null^i | Num^i | Str^i | Bool^i
  R ::= { l : T, …, l : T }^i
  A ::= [ T ]^i
  S ::= B | R | A
  T ::= S | 0 | T + T
- Examples
  - Num^2 captures any multiset of two numbers
  - [ Num^4 ]^3 is a possible type for the multiset { [1], [1], [1,2] }_M
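The grammar maps naturally onto a small algebraic data type. The following Python sketch is one possible encoding, not the authors' Scala code; the class names, the `show` printer and the caret notation for counters are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Basic:                 # B ::= Null^i | Num^i | Str^i | Bool^i
    kind: str                # "Null", "Num", "Str" or "Bool"
    count: int

@dataclass(frozen=True)
class Record:                # R ::= { l : T, ..., l : T }^i
    fields: Tuple[Tuple[str, object], ...]   # (label, Union) pairs
    count: int

@dataclass(frozen=True)
class Array:                 # A ::= [ T ]^i
    body: object             # a Union
    count: int

@dataclass(frozen=True)
class Union:                 # T ::= S | 0 | T + T (0 is the empty tuple)
    members: Tuple[object, ...] = ()

def show(t) -> str:
    """Render a type with counters written after a caret."""
    if isinstance(t, Basic):
        return f"{t.kind}^{t.count}"
    if isinstance(t, Array):
        return f"[ {show(t.body)} ]^{t.count}"
    if isinstance(t, Record):
        inner = " ; ".join(f"{l} : {show(ft)}" for l, ft in t.fields)
        return f"{{ {inner} }}^{t.count}"
    return " + ".join(show(m) for m in t.members) or "0"

# Second example of the slide: three arrays holding four numbers in total.
example = Array(Union((Basic("Num", 4),)), 3)
print(show(example))         # [ Num^4 ]^3
```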

The type inference algorithm
- Parametric inference: both judgments take an equivalence E as a parameter
- Singleton: ⊢_E v : S, e.g. ⊢ 3 : Num^1
- Multiset: ⊢_E v1,…,vn :_M T, e.g. ⊢ [1], [1], [1,2] :_M [ Num^4 ]^3
- Different abstraction levels, from concision to precision:
  [1], [1], [1,2] :_M [ Num^4 ]^3   (concision)
  [1], [1], [1,2] :_M [ Num^2 ]^2 + [ Num^2 ]^1
  [1], [1], [1,2] :_M [ Num^1 ]^1 + [ Num^1 ]^1 + [ Num^2 ]^1   (precision)

The type inference algorithm
  [ 1, 2 ]  →  Num^1, Num^1  →  Reduce: Num^1 + Num^1  →  Num^2
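The two phases (type each value with counters set to 1, then reduce) can be sketched in a few lines of Python. This is a simplified illustration that always merges types of the same kind; the dictionary encoding and the names `infer`, `fuse` and `infer_all` are assumptions, and the authors' implementation is in Scala on Spark.

```python
from functools import reduce

def infer(v):
    """Type of a single JSON value, every counter set to 1."""
    if v is None:
        return {"Null": 1}
    if isinstance(v, bool):              # bool must be tested before numbers
        return {"Bool": 1}
    if isinstance(v, (int, float)):
        return {"Num": 1}
    if isinstance(v, str):
        return {"Str": 1}
    if isinstance(v, list):
        return {"Arr": (1, reduce(fuse, map(infer, v), {}))}
    if isinstance(v, dict):
        return {"Rec": (1, {k: infer(x) for k, x in v.items()})}
    raise TypeError(f"not a JSON value: {v!r}")

def fuse(t1, t2):
    """Merge two unions: members of the same kind are added together."""
    out = dict(t1)
    for kind, b in t2.items():
        if kind not in out:
            out[kind] = b
        elif kind == "Arr":
            (n1, body1), (n2, body2) = out[kind], b
            out[kind] = (n1 + n2, fuse(body1, body2))
        elif kind == "Rec":
            (n1, f1), (n2, f2) = out[kind], b
            labels = f1.keys() | f2.keys()
            out[kind] = (n1 + n2,
                         {l: fuse(f1.get(l, {}), f2.get(l, {})) for l in labels})
        else:
            out[kind] = out[kind] + b    # basic types: add the counters
    return out

def infer_all(values):
    """Type of a multiset of values; {} plays the role of the empty type 0."""
    return reduce(fuse, map(infer, values), {})

print(infer_all([1, 2]))                 # {'Num': 2}
print(infer_all([[1], [1], [1, 2]]))     # {'Arr': (3, {'Num': 4})}
```

In this sketch `fuse` only adds counters and unions label sets, so the order in which partial results are combined does not matter, which is the property a parallel reduce step relies on.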

Parametric reduction

Equivalences of practical use
- Kind: { addr:T^300 ; aff:T^300 ; r:T^800 }^800
- Label: { addr:T^300 ; r:T^300 }^300 + { aff:T^300 ; r:T^300 }^300 + { r:T^200 }^200
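The difference between the two equivalences can be pictured as a choice of grouping key: records are merged only when their keys agree. The Python sketch below reproduces the two types above from a synthetic collection of 800 records; the functions `key_kind`, `key_label` and `summarize` are hypothetical names, and field types are left abstract as the T of the slide.

```python
from collections import Counter, defaultdict

def key_kind(record):
    """Kind equivalence: every record is equivalent to every other record."""
    return "Rec"

def key_label(record):
    """Label equivalence: records are equivalent only with identical labels."""
    return frozenset(record)

def summarize(records, key):
    """Group records by `key`; count records and label occurrences per group."""
    groups = defaultdict(lambda: [0, Counter()])
    for rec in records:
        group = groups[key(rec)]
        group[0] += 1
        group[1].update(rec.keys())
    return {k: (n, dict(c)) for k, (n, c) in groups.items()}

# Synthetic data matching the counters of the slide.
records = (300 * [{"addr": "x", "r": 1}]
           + 300 * [{"aff": "y", "r": 1}]
           + 200 * [{"r": 1}])

by_kind = summarize(records, key_kind)
by_label = summarize(records, key_label)
print(by_kind)
print(len(by_label), "record types under label equivalence")
```

The kind key yields a single record type with counters 300, 300 and 800 out of 800; the label key yields three record types of 300, 300 and 200 records.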

Kind reduction: Twitter data
  { contributors : (Null^9,599,980 + [ Num^20 ]^20)^9,600,000 ? ;
    retweeted : Bool^9,600,000 ? ;
    retweeted_status : {…}^1,200,000 ? ;
    deleted : {…}^300,000 ? ; }^9,900,000

Label reduction: Twitter data
    { con: …^7,200,000 ; ret: Bool^7,200,000 ;… }^7,200,000
  + { con: …^1,200,000 ; ret: Bool^1,200,000 ;… }^1,200,000
  + { con: …^1,040,000 ; ret: Bool^1,040,000 ; r_s: {}^1,040,000 ;… }^1,040,000
  + { con: …^160,000 ; ret: Bool^160,000 ; r_s: {}^160,000 ;… }^160,000
  + { deleted: { }^300,000 ;… }^300,000
  + …
Kind reduction merges these record types into a single record type.

Experiments
- Scala implementation
- Spark cluster of 5+1 nodes, 64GB, 100 cores
- 3 real-life datasets:
  - GitHub (1m objects / 10GB / 14 sec)
  - Twitter (9.9m objects / 21GB / 53 sec)
  - NYTimes (1.2m objects / 21GB / 27 sec)
- Extraction of interesting features

Related work
- PL community: dependent types, probabilistic types
- DataGuide with statistics [Klettke et al. 2016]
- JavaScript library for MongoDB [Schmidt 2017]
- Approaches without counting information
- No parametric approach, so far

[Klettke et al. 2016] Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores, Technologie und Web (BTW)
[Schmidt 2017] mongodb-schema. (2017). https://github.com/mongodb-js/mongodb-schema

To sum up
- An algorithm to summarize JSON data:
  - Well-defined semantics
  - Parametric
  - Parallel
  - Yielding quantitative information
- What else may a counting type do?

Thank you!
