New Trends on Exploratory Methods for Data Analytics

Minimal Project Join REQ [Shen et al., 2014]
Main idea: Find the set of queries that approximately return a set of examples
Input: partial query table with columns A, B, C
  1: Mike, ThinkPad, Office
  2: Mary, iPad
  3: Bob, Dropbox
Output: Minimal PJ Queries Q'
• valid: every tuple is present in the query results
• minimal: any removal in the query tree gets to an invalid query
VLDB 2017 tutorial, D. Mottin, M. Lissandrini, T. Palpanas, Y. Velegrakis

Candidate Query Generation [Shen et al., 2014]
• Use the candidate network generation algorithm (Hristidis 2002)
1. Generate join tree $J$
2. Generate mapping $\phi$
3. Check minimal: every leaf node contains a column that is mapped by an input column
[Figure: candidate queries CQ1 to CQ5, each a join tree over tables such as Owner, Sales, Employee, Customer, Device, App, and ESR, with the input columns A, B, C mapped to table columns]

Validity verification [Shen et al., 2014]
• Naïve: check each candidate query individually to see whether it returns ALL examples
• Better: exploit substructures in candidate queries for pruning
• Best: adaptively select the substructures to have the min number of evaluations (NP-hard)
[Figure: a candidate query and its substructures Sub 1 and Sub 2. Sub 1 fails => the candidate query is invalid; Sub 1 fails => Sub 2 fails]

Minimal Project Join REQ [Psallidas et al., 2015]
Main idea: Allow missing rows/columns and rank the k best queries
Input: partial query table with columns A, B, C
  1: John, Smith, Xbox
  2: Jill, Hans, Surface
Output: Top-k PJ Queries (S4)
[Figure: candidate PJ queries joining Sales, Products, and Customers, projecting columns such as Name, First Name, Last Name, and City]

Ranking score [Psallidas et al., 2015]
Linear combination of row score and column score:
$score(Q) = \alpha \cdot score_{row}(Q) + (1 - \alpha) \cdot score_{col}(Q)$
• $\alpha = 1$ penalizes missing rows
• $\alpha = 0$ penalizes missing columns
[Figure: row scores and column scores of two candidate queries over Sales, Products, and Customers, computed against the example rows (John, Smith, Xbox) and (Jill, Hans, Surface)]

S4 Optimizations [Psallidas et al., 2015]
• Upper bound: the row score is always bounded by the column score (row containment is more restrictive); exploit inverted indexes on columns/rows
• Early termination: scan queries on decreasing upper bound; stop when the current upper bound score is less than the score of the k-th ranked evaluated query
• Caching: reuse common subparts in the candidate queries
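The early-termination idea can be illustrated with a minimal sketch. It assumes each candidate query already comes with an upper bound on its score and that a callable (here the hypothetical `exact_score`) computes the true score; neither name is from the S4 paper.

```python
import heapq


def top_k_with_early_stop(candidates, k, exact_score):
    """Return the k best (score, query_id) pairs and the number of exact evaluations.

    candidates: list of (upper_bound, query_id) pairs (hypothetical input format).
    exact_score: callable mapping a query_id to its exact score (assumption).
    """
    # Scan candidates on decreasing upper bound.
    ordered = sorted(candidates, key=lambda c: c[0], reverse=True)
    top = []  # min-heap holding the best k exact scores seen so far
    evaluated = 0
    for upper_bound, query_id in ordered:
        # Stop when the upper bound is below the k-th ranked evaluated query.
        if len(top) == k and upper_bound < top[0][0]:
            break
        score = exact_score(query_id)
        evaluated += 1
        if len(top) < k:
            heapq.heappush(top, (score, query_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, query_id))
    return sorted(top, reverse=True), evaluated
```

With four candidates and k = 2, the scan can stop after two exact evaluations if the third upper bound is already below the second-best exact score.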

Reverse engineering queries (REQ)
Categories: Exact, Approximate, One-shot, Interactive, Minimal, Top-k
• Query From Examples (QFE)
• Query by output - TALOS
• Discovering Queries based on Examples
• S4: Top-k Spreadsheet style
• Interactive inference of join queries
• REQ SPJ queries from examples
Lack of user models!

Examples for query suggestion: Blaeu [Sellam et al., 2016]
Main idea: Allow interactive navigation of the query space in a hierarchy
[Figure: Blaeu takes a query or its results and returns query navigations]

Examples for query suggestion: Blaeu [Sellam et al., 2016]
Given a result of an example query Q, explore the data through data maps (= partitions)
Output: Set of query refinements
Problem: User utility is unknown
$u: DB \to [-1, 1], \quad U(Q) = \sum_{t \in Q} u(t)$
• Cluster analysis for result exploration
• Zoom and projection operations
• User model: user utility
[Figure: query results plotted on Attribute 1 and Attribute 2]

Examples for query suggestion: Blaeu [Sellam et al., 2016]
Find the partition $\mathcal{C} = \{C_1, \dots, C_n\}$ of the results of Q such that there exists $C_i \in \mathcal{C}$ with $U(C_i) > U(Q)$, where the user utility is unknown:
$u: DB \to [-1, 1], \quad U(C) = \sum_{t \in C} u(t)$
Solution: interesting tuples are close to each other within a maximum separation threshold $\theta(\mathcal{C})$
• Detect clusters (k-medoid)
• Organize clusters (decision tree)
• Inference

Where we are
• Relational databases
• Machine learning
• Textual data
• Graphs and networks
• Challenges and Remarks

Examples for textual data
Few methods for textual data using examples: Snowball [Agichtein 2000], DIPRE [Brin 1999]
• Entity extraction [Hanafi 2017]
• Web table completion [Yakout 2012]
• Search by example: serendipitous search [Bordino 2013], using example queries [Zhu 2014]

Entity extraction by-example (SEER) [Hanafi et al., 2017]
Main idea: Create rules to extract wanted information from documents using examples
Output: Extraction rules
[Figure: SEER rule tree with scored primitives, e.g. P: Percentage = 1.0, D: {5, 6} = 0.4, D: {percent, %} = 0.4, R: [0-9]+ = 0.2]

Learning rules [Hanafi et al., 2017]
Example: 5 percent up
1. Enumerate possible primitives per example token
   5: P: Number, P: Integer, L: ‘5’, R: [0-9]+
   percent: L: ‘percent’, R: [A-Za-z]+
   token gap: T: 0-1
2. Assign scores to primitives
[Figure: the primitive types (token gap, regex, dictionary, literal, pre-builts) ordered with ≺ on a scale from 0 to 1; example for the token "Dubai": T: 0-1, L: ‘Dubai’, P: City]

Learning rules (cont’d) [Hanafi et al., 2017]
3. Generate rules
   Example: 5 percent
     Tree: P: Percentage = 1.0; L: ‘5’ = 0.4; R: [0-9]+ = 0.2; L: ‘percent’ = 0.4; R: [A-Za-z]+ = 0.2
     Rule: R: [0-9]+ = 0.2 followed by L: ‘percent’ = 0.4
   Example: 6%
     Tree: P: Percentage = 1.0; L: ‘6’ = 0.4; R: [0-9]+ = 0.2; L: ‘%’ = 0.4; R: symbols = 0.2
4. Merge
   Intersection of [ 5 percent, 6% ]: P: Percentage = 1.0; D: {5, 6} = 0.4; R: [0-9]+ = 0.2; D: {percent, %} = 0.4

Web tables completion (InfoGather) [Yakout et al., 2012]
Main idea: Complete tables using partial information about tuples
Incomplete table (Model, Brand): S80, A10, GX-1S, T1460, with the Brand values missing
Complete table (Model, Brand): S80 Benq; A10 Innostream; GX-1S Samsung; T1460 Benq
[Figure: InfoGather fills the missing Brand values from web tables with schemas (Part No, Mfg) and (Model, Brand), containing entries such as DSC W570 Sony, T1460 Benq, Optio E60 Pentax, S8100 Nikon, S80 Nikon, Easyshare CD44 Kodak]

Augmentation framework [Yakout et al., 2012]
Direct Match Approach (DMA)
• Traditional schema matching techniques using the attribute names and the values in the column
$S_{DMA}(T) = \begin{cases} \dfrac{|T \cap_K Q|}{\min(|Q|, |T|)} & \text{if } Q.A \approx T.B \\ 0 & \text{otherwise} \end{cases}$
[Figure: input table, web tables, and indirect matching between them]
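As a rough illustration of the DMA score, the sketch below treats a table as the set of its key values. The function name, the key-set representation, and the boolean `columns_match` flag standing in for the schema-matching test are assumptions made for this example.

```python
def dma_score(query_keys, table_keys, columns_match):
    """Overlap between query table and web table, normalized by the smaller one.

    query_keys, table_keys: iterables of key values (hypothetical representation).
    columns_match: True when the attribute names are judged similar (assumption).
    """
    q, t = set(query_keys), set(table_keys)
    if not columns_match or not q or not t:
        return 0.0
    return len(q & t) / min(len(q), len(t))
```

For a query table with models S80, A10, GX-1S, T1460 and a web table listing S80, T1460, and one other model, the score would be 2/3.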

Ranking tables using PageRank
Nodes → Web Tables; Edges → Tables Similarity; Topic weight → DMA score
• PageRank
• Personalized PageRank (PPR), with query table $u$ and adjacency matrix weights $\alpha_{w,v}$:
$\pi_u(v) = \epsilon\, \delta_u(v) + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w)\, \alpha_{w,v}$
• Topic Sensitive PageRank (TSP), with topic vector $\vec{\beta}$:
$\pi_{\vec{\beta}}(v) = \epsilon\, \beta_v + (1 - \epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_{\vec{\beta}}(w)\, \alpha_{w,v}$
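The Personalized PageRank recurrence can be approximated by repeated substitution. The following is a minimal sketch under stated assumptions: the graph is given as a dictionary of weighted arcs, the outgoing weights of every node sum to 1, and the names `edges`, `source`, and `eps` are chosen for this example only.

```python
def personalized_pagerank(edges, nodes, source, eps=0.15, iterations=200):
    """Iterate pi(v) = eps * [v == source] + (1 - eps) * sum_w pi(w) * alpha[w, v].

    edges: dict mapping (w, v) to the transition weight alpha[w, v] (assumption:
           outgoing weights of each node sum to 1).
    """
    pi = {v: (1.0 if v == source else 0.0) for v in nodes}
    for _ in range(iterations):
        # Restart mass goes to the source node (the query table).
        nxt = {v: (eps if v == source else 0.0) for v in nodes}
        # Propagate the remaining mass along weighted arcs.
        for (w, v), alpha in edges.items():
            nxt[v] += (1 - eps) * pi[w] * alpha
        pi = nxt
    return pi
```

Replacing the restart term by a topic vector entry per node would give the topic-sensitive variant.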

Serendipitous search [Bordino et al., 2013]
Main idea: Use related entities and query logs to find serendipitous searches
[Figure: the document content and query logs yield connected entities (Francisco Pizarro, Peru, America, Rafting, Amazon, Machu Picchu, ...); serendipitous search then returns searches related to the document, e.g. "rafting excursion down the urubamba river", "el dorado", "temple of sun", "indios quechuas", "map of peru", "sapa inca"]

Find queries using entity-query graph [Bordino et al., 2013]
Query-flow graph with entity nodes
Three types of arcs:
1. query to query
2. entity to query: frequency-based approach
3. entity to entity: the more queries entities share, the higher the probability
Idea: Run Personalized PageRank on entity-query graphs

Search by multiple examples [Zhu et al., 2014]
Main idea: Document examples are used to find topics
[Figure: search by examples over the topics Action Movies (Mission impossible, Die Hard, …) and Action Actors (Bruce Willis, Tom Cruise, …) returns related topics and documents such as Chuck Norris and Arnold Schwarzenegger]

Nearest neighbor approach [Zhu et al., 2014]
Main idea: The similarity is an aggregation over the distances between document $D_i$ and its nearest query example
[Figure: documents D1, D2, D3 and query examples A, B placed among topics Ta, Tb, Tc, with the centroid of the query examples]

Graphs
Fact Graph: Arnold Schwarzenegger actedIn Terminator; Terminator has Release 1984, Budget $6.4M, Length 1h 48m
Ontology Tree: Arnold Schwarzenegger isA Person; Arnold Schwarzenegger isA Actor; Actor subClassOf Person

Graphs
RDF (subject, predicate, object)
(Arnold_Schwarzenegger, isA, Person)
(Actor, subClassOf, Person)
(Arnold_Schwarzenegger, actedIn, Terminator)
[Figure: the same facts drawn as a Fact Graph and an Ontology Tree]


Exemplar Queries [Mottin et al., 2014]
Input: $Q_e$, an example element of interest (Nodes/Entities, Edges/Facts, Structures)
Output: set of elements in the desired result set
Exemplar Query Evaluation
• evaluate $Q_e$ in a database D, finding a sample S
• find the set of elements A similar to S given a similarity relation
• [OPTIONAL] return only the subset $A^R$ of elements that are relevant

SIMILARITY
• Nodes
  Connectivity: Mediator Nodes [Ruchansky’15], Clusters [Perozzi’14]
  Properties: Entity Search [Metzger’13, Sobczak’15]
• (Edge-)Labels
  Queries: Path Queries [Bonifati’15], SPARQL [Arenas’16]
• Structures
  Entity Tuples [Jayaram’15], Graph Structures [Mottin’14]
CHALLENGE: DISCOVER USER PREFERENCE
CHALLENGE: EFFICIENT SEARCH

The Minimum Wiener Connector Problem [Ruchansky et al., 2015]
Model: Unlabeled Undirected Graph
Query: A set of Nodes Q
Similarity: Shortest-Path distance
Output: A Set of Connector Nodes H that “explains” connections in Q
Connectors: Nodes with HIGH closeness to ALL the inputs
Similar to a Steiner-Tree but overall pairwise distances are optimized
Case: Infected Patients → Culprit/Other Infected
Case: Target Audience → Influencers

The Minimum Wiener Connector Problem [Ruchansky et al., 2015]
Model: Unlabeled Undirected Graph
Query: A set of Nodes Q
Similarity: Shortest-Path distance
Output: A Set of Connector Nodes H (NP-Hard)
Minimize the sum of pairwise shortest-path distances between nodes in the connector H, called the Wiener Index:
$\min_{H} \sum_{(u,v) \in H} d(u, v)$
where d(u, v) is the shortest-path distance; this is a tradeoff between size and average distance
Sometimes the best solution is NOT a tree
[Figure: two connectors, one with W = 1 + 2 + 1 = 4 and one with W = 1 + 1 + 1 = 3]
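To make the Wiener Index concrete, here is a small sketch that computes it with breadth-first search on an unweighted, connected graph. The adjacency-list representation and the function name are assumptions for this example; it only evaluates a given connector and does not search for the minimum one.

```python
from collections import deque
from itertools import combinations


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered node pairs.

    adjacency: dict mapping each node to a list of neighbors (connected, undirected).
    """
    def bfs(src):
        # Unweighted single-source shortest paths.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    all_dist = {u: bfs(u) for u in adjacency}
    return sum(all_dist[u][v] for u, v in combinations(sorted(adjacency), 2))
```

A three-node path yields 1 + 2 + 1 = 4 while a triangle on the same nodes yields 1 + 1 + 1 = 3, which is one way to read the slide's remark that the best connector is not always a tree.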

Approximate minimum Wiener Index [Ruchansky et al., 2015]
• All Pairwise Distances → Distances from a root r
• Measure distance in H → Precomputed distance in G
• Connector approximated with Edge-Weighted Steiner Tree
CHOOSE r and $\lambda \in [1, \log_{(1+\beta)} |V|]$
Edge weights: $w(u, v) = \lambda + \dfrac{\max\{ d_G(r, u), d_G(r, v) \}}{\lambda}$
Enumerate candidate solutions for r ∈ Q and λ, and keep the best

Focused Clustering and Outlier Detection [Perozzi et al., 2014]
Model: Unlabeled Undirected Graph with Node Attributes
Query: A set of Nodes Q
Similarity: Attribute Values & Connectivity (to be inferred)
Output: Clusters of Nodes: Dense & Coherent, + Cluster Outliers
Case: Target Users → Community with same interests
Case: Products → Co-purchased products with similar features
[Figure: graph whose nodes carry attribute values such as PhD, College, NYC, Paris, English, Greek, Dutch, Italian, French, Google, SAP, IBM]

Focused Clustering and Outlier Detection [Perozzi et al., 2014]
TASK: Infer “FOCUS”, the important attributes → attribute weights β
1. Set of similar pairs, PS (from Q)
2. Set of dissimilar pairs, PD (random sample)
3. Learn a distance metric between PS and PD (Distance Metric Learning, inverse Mahalanobis distance: Xing et al., 2002)
[Figure: example attribute weights β: 0.5 for the pair (PhD, PhD), 0.5 for (NYC, NYC), 0 for (French, English), 0 for (SAP, Google)]

Focused Clustering and Outlier Detection [Perozzi et al., 2014]
LOCAL TASK: Extract clusters on the focused graph (attribute weights β → edge weights)
1. Find Starting Set of Candidates
   1.a Drop low-weight edges
   1.b Extract Strongly Connected Components C1, C2, …
2. Grow Clusters around Candidate Seeds
   2.a Compute conductance of C: $\varphi^{(w)}(C, G)$
   2.b Select node to add to C′: best improvement to $\Delta\varphi^{(w)}(C, C')$ (greedy)
   2.c Prune underperforming nodes
3. Detect Outliers: high unweighted conductance
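Step 2.a relies on a weighted conductance. The sketch below uses the common textbook definition, cut weight divided by the smaller of the two volumes, which is an assumption here and may differ in detail from the definition used by Perozzi et al.; the edge-dictionary format is also chosen for this example.

```python
def weighted_conductance(weights, cluster):
    """Weighted conductance of `cluster` in an undirected graph.

    weights: dict mapping (u, v) to the edge weight, one entry per undirected edge.
    cluster: set of nodes.
    Uses cut / min(vol(cluster), vol(rest)), a standard definition (assumption).
    """
    cut = vol_in = vol_out = 0.0
    for (u, v), w in weights.items():
        inside = (u in cluster) + (v in cluster)  # 0, 1, or 2 endpoints inside
        if inside == 1:
            cut += w  # edge crosses the cluster boundary
        vol_in += w * inside
        vol_out += w * (2 - inside)
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 0.0
```

A low value indicates a cluster that is well separated from the rest of the graph under the learned edge weights.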


iQBEES: Entity Search by Example [Metzger et al., 2013, Sobczak et al., 2015]
Model: Knowledge Graph
Query: A set of Entities Q
Similarity: shared semantic properties
Output: A ranked set of similar entities
Case: Products → Find Similar Products
Case: Social Media → User recommendation

Maximal Aspects [Metzger et al., 2013, Sobczak et al., 2015]
Set of aspects (example): ?x type BodyBuilder; ?x type AmericanActor; ?x type GovernorCalifornia; ?x hasHeight 1.88m; ?x type Entity; ?x actedIn TheExpendables; ?x type ActionActor
Maximal: adding any aspect → E(A) = {Arnold}
• Include typical types: use the most specific type (e.g. ?x type AmericanActor)
• Prune generic aspects (e.g. ?x type Entity)
• Rank
• REPEATABLE: update Q


Learning Path Queries on Graphs [Bonifati et al., 2015]
Model: Edge Labeled Graph
Query: 2 sets of Entities $Q^+$, $Q^-$ (Positive, Negative)
Similarity: common path query (RegExp), e.g. (bus|tram)*Cinema
Output: A Set of Nodes satisfying some paths($Q^+$) but NOT paths($Q^-$)
MONADIC: only starting nodes, extensible to BINARY/N-ARY: path from X to Y
Case: Proteins → Similar interactions/co-expression
Case: Tasks Initiator → Similar Processes/Behaviours
[Figure: graph with Tram and Bus edges leading to Cinema nodes, with positive (+, ✓) and negative (-, X) example nodes]

Learnability of Path Queries [Bonifati et al., 2015]
Query: $Q^+$ and $Q^-$ (Positive & Negative examples)
Consistency Check: PSPACE-complete
Consistency: $\forall v \in Q^+.\ paths_G(v) \not\subseteq paths_G(Q^-)$
1. Selecting the Smallest Consistent Paths (SCP)
   Enumerate paths up to a fixed distance
   Infinite paths? Fix maximal length K, but… K = 2 ⅹ N K N +1
2. Generalize SCP
   a. Construct Prefix-Tree Acceptor (PTA)
   b. Generalize into DFA with Merge
   When to use Kleene star *? For paths of length N: C | (A﹒B﹒C) → (A﹒B)*﹒C

Reverse engineering SPARQL queries [Arenas et al., 2016]
Model: Knowledge Graph
Query: Set of ANSWERS
Similarity: common AND/OPT/FILTER query
Output: A SPARQL QUERY/RESULT
Case: Open Data → Query Unknown Schema
Case: Novice User → Avoid SPARQL
Example mappings over (?e1, ?e2): M1: Mexico, Spanish; M2: Haiti; M3: Jamaica, English

Reverse engineering SPARQL queries [Arenas et al., 2016]
Query: Set of Variable Mappings over (?X, ?Y, ?Z): M1: John; M2: Mary, mary@email.eu; M3: Lucy, Roses Street
• Enumerate all possible SPARQL queries satisfied by the mappings: INTRACTABLE
• Build tree-shaped SPARQL queries IMPLIED by the mappings

Reverse engineering SPARQL queries [Arenas et al., 2016]
Query: Set of Variable Mappings Ω
Greedy: keep just enough to cover all variables
[Figure: tree over the mappings M1 to M4 with nodes {M1,M2,M3,M4}, {M3,M4}, {M2,M4}, {M4}]


Exemplar Queries [Mottin et al., 2014]
Model: Knowledge Graph
Input: Example Structure
Similarity: Isomorphism/Simulation
Output: A set of Graphs
[Figure: knowledge graph with the sample S and the answers A1 and A2]

Computing exemplar queries [Mottin et al., 2014]
Complexity: NP-complete (subgraph isomorphism), $O(V^4)$ (simulation)
Pruning technique:
• Compute the neighbor labels of each node:
  $W_{n,a,i} = \{ n_1 \mid \ell(n_1, n_2) = a,\ n_2 \in N_{i-1}(n) \}$
• Prune nodes not matching the query nodes' neighborhood labels
• Apply iteratively on the query nodes
[Figure: labels at distance 1. Graph node v has neighborhood {(B,1)} and query node u has neighborhood {(A,1)}; the required labels are not contained (⊈), so there is no match]
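The pruning step can be sketched for distance 1 only. This is a simplified illustration, assuming the graph and the query are lists of (source, label, target) triples and ignoring edge direction; the entity names and labels in the usage note are hypothetical.

```python
def neighbor_labels(triples, node):
    """Edge labels incident to `node` (distance 1), ignoring direction."""
    return {label for s, label, t in triples if node in (s, t)}


def prune_candidates(graph, query, query_node):
    """Keep graph nodes whose distance-1 labels contain those of the query node."""
    required = neighbor_labels(query, query_node)
    nodes = {n for s, _, t in graph for n in (s, t)}
    # A node survives only if every label required by the query node is present.
    return {n for n in nodes if required <= neighbor_labels(graph, n)}
```

For a query node that needs both an `acquired` and a `foundedIn` edge, any graph node lacking one of the two labels is discarded before the expensive matching step.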

Computing exemplar queries [Mottin et al., 2014]
Complexity: NP-complete (subgraph isomorphism), $O(V^4)$ (simulation)
Approximation:
• Nodes close to the sample are more important
• Use Personalized PageRank with a weighted matrix
• Weight edges: frequency of the edge-label
[Figure: sample and answers A1, A2 in the graph]

Ranking results [Mottin et al., 2014]
$\rho(n_s, n) = \lambda\, S(n_s, n) + (1 - \lambda)\, v[n]$
Combination of two factors
1. Structural: similarity of two nodes in terms of neighbor relationships
2. Distance-based: the PageRank already computed
[Figure: user query and the sample S with answers A1, A2 over the entities CBS, Google, Yahoo!]

Graph query by example (GQBE) [Jayaram et al., 2015]
In GQBE the input is a set of (disconnected) entity mention tuples
Model: Knowledge Graph
Input: Entity Tuples
Similarity: Isomorphism
Output: A set of Tuples
Example: Q = (Google, S. Mateo); Results = (Yahoo, S. Clara), (CBS, New York)

GQBE: Maximum Query Graph [Jayaram et al., 2015]
1. Find the maximum query graph
   • Graph with M edges having the maximum weight
2. Answers subgraph-isomorphic to the query graph (NP-hard)
3. Return top-k
Answer score:
• Sum of query graph weights
• Similarity match between edges in the answer and the query (shared nodes take extra credit)
[Figure: weighted graph around the query tuple Q = (v1, v2), its maximum query graph, and an answer graph over (u1, u2)]

Multiple query tuples [Jayaram et al., 2015]
• The Maximum Query Graph is very large
• Find answers using a lattice of subgraphs of the Maximum Query Graph, obtained removing edges from the union graph
• Preserve the query connectivity
GQBE finds answers for multiple query tuples
1. Compute a re-weighted union graph of the individual query graphs

SIMILARITY
The methods above (Mediator Nodes [Ruchansky’15], Clusters [Perozzi’14], Entity Search [Metzger’13, Sobczak’15], Path Queries [Bonifati’15], SPARQL [Arenas’16], Entity Tuples [Jayaram’15], Graph Structures [Mottin’14]) do not include user feedback

Online exploration of datasets
Main idea: Learn the items to show online as more points are acquired
Two ways of learning: passive and active
[Figure: passive learning receives items and learns; active learning asks about chosen items and learns from the answers]

MindReader [Ishikawa et al., 1999]
Main idea: learn an implicit query from user examples and optional scores
Searching “mildly overweight” patients
• The doctor selects examples by browsing the patient database
• The examples have “oblique” correlation
• We can “guess” the implied query
[Figure: examples marked good and very good on Weight vs. Height axes, with the implied query q]

Learning an ellipsoid distance [Ishikawa et al., 1999]
Distance types: Euclidean, weighted Euclidean, generalized ellipsoid distance
Generalized ellipsoid distance with weighted distance matrix $M$ and implicit query $q$:
$D(x, q) = (x - q)^{T} M (x - q)$
$D(x, q) = \sum_{j}^{n} \sum_{k}^{n} m_{jk} (x_j - q_j)(x_k - q_k)$
Learn the query minimizing the penalty = weighted sum of distances between query point and sample vectors:
$\text{minimize } \sum_{i} (x_i - q)^{T} M (x_i - q) \quad \text{subject to } \det(M) = 1$
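The generalized ellipsoid distance is a plain quadratic form, which the following sketch evaluates without external libraries. The function name and the nested-list matrix format are assumptions for this example; learning M itself is not shown.

```python
def ellipsoid_distance(x, q, M):
    """Quadratic form (x - q)^T M (x - q) for a square matrix M given as nested lists."""
    diff = [xi - qi for xi, qi in zip(x, q)]
    n = len(diff)
    # Double sum over all pairs of dimensions, weighted by the matrix entries.
    return sum(M[j][k] * diff[j] * diff[k] for j in range(n) for k in range(n))
```

With the identity matrix this reduces to the squared Euclidean distance, a diagonal matrix gives the weighted Euclidean case, and non-zero off-diagonal entries tilt the ellipsoid to capture "oblique" correlations.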

Learning the distance [Ishikawa et al., 1999]
• Query point is moved towards “good” examples (Rocchio formula in IR)
Learning can be done online!
[Figure: Q0: query point; retrieved data; relevance judgments; Q1: new query point]

Active learning for online query systems [Vanchinathan et al., 2015]
Main idea: the system “queries” the user to understand her preferences
Learn unknown preferences and minimize the number of questions to the user
[Figure: loop in which the system gets an item, asks the user, and receives a preference]

Learning unknown preferences [Vanchinathan et al., 2015]
Problem: Find a set S (the intended user set) that maximizes the user preference within a budget (e.g., number of interactions)
$\arg\max_{S} \sum_{v \in S} pref(v) \quad \text{subject to } Cost(S) \le budget$
where pref(v) are the user preferences and Cost(S) is the cost for the set S

Background: Gaussian processes [Bishop et al., 2006]
Idea: Model the user preferences as a Gaussian Process
A Gaussian Process (GP) is an infinite set of variables, any finite subset of which is jointly Gaussian
Gaussian prior, specified only by mean and covariance:
$P(\mathbf{f} \mid \Sigma, \mu) = \dfrac{1}{\sqrt{|2\pi\Sigma|}} \exp\left(-\tfrac{1}{2} (\mathbf{f} - \mu)^{T} \Sigma^{-1} (\mathbf{f} - \mu)\right)$
Given observations $\{(x_i, y_i)\}_{i=1}^{n}$ over an unknown function f drawn from a Gaussian prior, the posterior is Gaussian:
$P(\mathbf{f} \mid \mathbf{y}) \propto \int d\mathbf{x}\, P(\mathbf{f}, \mathbf{x}, \mathbf{y})$

GP-Select [Vanchinathan et al., 2015]
Loop: learn posterior, ask user feedback
Trades off exploration and exploitation
• Exploration: select items with high variance
• Exploitation: select items with high value
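One common way to balance the two goals is an upper-confidence-bound rule that adds a multiple of the posterior standard deviation to the posterior mean. The sketch below is an illustration under that assumption, not the exact GP-Select selection rule; the `posterior` dictionary and the `beta` parameter are hypothetical.

```python
import math


def select_next(posterior, beta=2.0):
    """Pick the item maximizing mean + beta * standard deviation.

    posterior: dict mapping an item to (mean, variance) of its estimated preference.
    beta: exploration weight (assumption); beta = 0 is pure exploitation.
    """
    return max(
        posterior,
        key=lambda item: posterior[item][0] + beta * math.sqrt(posterior[item][1]),
    )
```

A larger `beta` favors uncertain items, a smaller one favors items already believed to be valuable.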

Active learning on graphs: which prior? [Ma et al., 2015]
Idea: Use the graph structure to infer the node classes
Use the graph Laplacian as prior: $L = D - A$, where A is the adjacency matrix and D is the degree matrix
Laplacian: higher probability of having the same class if two nodes are connected
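For reference, the Laplacian of a small graph can be built directly from its adjacency matrix. This is a minimal sketch using nested lists; the function name is chosen for this example, and how the Laplacian enters the prior is not shown.

```python
def graph_laplacian(A):
    """Return L = D - A, where D is the diagonal matrix of row sums (degrees) of A."""
    n = len(A)
    return [
        [(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]
```

For an undirected graph every row of the result sums to zero, and the off-diagonal entries are negative exactly where two nodes are connected.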

Explore-by-Example: AIDE [Dimitriadou et al., 2015]
[Figure: AIDE loop. The user gives relevance feedback on samples (relevant and irrelevant samples); data classification builds the user model; query formulation produces the data extraction query; space exploration issues sampling queries to obtain new samples]

The AIDE algorithm [Dimitriadou et al., 2015]
1. Divide the space into d-dimensional cubes
2. Find the sample points in the cubes (medoids)
3. Train the classifier
4. Refine the training by sampling from neighbors of misclassified points
5. Boundary refinement

Classification & Query Formulation [Dimitriadou et al., 2015]
Samples:
  Sample     Red     Green   Relevant
  Object A   13.67   12.34   Yes
  Object B   15.32   14.50   No
  ...        ...     ...     ...
  Object X   14.21   13.57   Yes
Decision Tree Classifier:
  red > 14.82: Irrelevant
  red <= 14.82:
    red < 13.55: Irrelevant
    red >= 13.55:
      green > 13.74: Irrelevant
      green <= 13.74: Relevant
Query: SELECT * FROM galaxy WHERE red <= 14.82 AND red >= 13.55 AND green <= 13.74
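The translation from a decision tree to a selection query can be sketched as a walk over all root-to-leaf paths that end in a relevant leaf. The nested-dictionary tree encoding and the function name are assumptions for this example, not the AIDE implementation.

```python
def tree_to_sql(tree, table):
    """Build a SELECT query from the paths of a decision tree that reach 'Relevant'.

    tree: nested dict mapping a condition string to a subtree or a leaf label
          (hypothetical encoding).
    """
    def paths(node, conds):
        if isinstance(node, str):  # leaf label
            return [conds] if node == "Relevant" else []
        found = []
        for condition, child in node.items():
            found.extend(paths(child, conds + [condition]))
        return found

    clauses = ["(" + " AND ".join(p) + ")" for p in paths(tree, [])]
    if not clauses:
        return f"SELECT * FROM {table} WHERE 1 = 0"  # no relevant region learned yet
    return f"SELECT * FROM {table} WHERE " + " OR ".join(clauses)


# The tree from the slide, written in the hypothetical encoding above.
galaxy_tree = {
    "red > 14.82": "Irrelevant",
    "red <= 14.82": {
        "red < 13.55": "Irrelevant",
        "red >= 13.55": {
            "green > 13.74": "Irrelevant",
            "green <= 13.74": "Relevant",
        },
    },
}
```

Calling `tree_to_sql(galaxy_tree, "galaxy")` produces a single conjunctive clause; a tree with several relevant leaves would produce a disjunction of such clauses.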

Misclassified Sample Exploitation [Dimitriadou et al., 2015]
[Figure: relevant (√) and irrelevant (x) samples with sampling areas around misclassified samples; axes: red wavelength, green wavelength]

Clustering-based Sampling [Dimitriadou et al., 2015]
Idea: Use a k-medoid approach to find sampling areas
[Figure: relevant (√) and irrelevant (x) samples grouped into clusters that serve as sampling areas; axes: red wavelength, green wavelength]

Example-based methods
Relational
• Query suggestion using examples
• Reverse engineering queries
Textual
• Entity extraction by example text
• Web table completion using examples
• Search by example
Graph
• Community-based node retrieval
• Entity Search
• Path and SPARQL queries
• Graph structures as examples

Example-based methods: takeaways (Relational, Textual, Graph)
• Complex search space
• Exploit locality
• Allows serendipitous search
• Entity attributes are expressive
• Exact and approximate
• Easier document finding
• Reverse engineering: good approximations
• Interactivity can improve the quality
• Speed up entity matching
• Limited to query inference
• Large result-sets require ranking

The use of examples
Examples can ease data exploration
• … reduce need for complex queries / simplify user input
• … require no schema knowledge
• … allow uncertainty in search conditions
• … require little data analytics expertise

Where should we invest time
• Approximate methods
• Machine learning
• User models
• Scalability

ADOPT HETEROGENEITY
Need for solutions that
• operate across different models
• operate on heterogeneous datastores
