Bootstrapping and Learning PDFA in Data Streams



  1. Bootstrapping and Learning PDFA in Data Streams. Borja Balle, Jorge Castro, Ricard Gavaldà. International Colloquium on Grammatical Inference, University of Maryland, September 2012. This work is partially supported by the PASCAL2 Network.

  2. Example Application: Web User Modeling
  (Figure: customers interacting with an online store's "wish list", producing a customer log that feeds the model.)
  - Process examples as fast as they arrive (10^5 per second or more): stream mining
  - Use a small amount of memory (must fit into the machine's main memory)
  - Detect changes in customer behavior and adapt the model accordingly
  Other applications: process mining, biological models (DNA and amino acid sequences)

  3. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  4. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  5. The Data Streams Algorithmic Model
  An algorithm receives an infinite stream x_1, x_2, ..., x_t, ... from some domain X and must:
  - Make only one pass over the data and process each item in time O(1)
  - At every time t use sublinear memory (e.g. O(log t), O(sqrt(t)))
  - Adapt to possible "changes" in the data
  It is a theoretically challenging model that is useful in applications:
  - Originated in the algorithmics community
  - Realistic for Data Mining and Machine Learning tasks in real time
  - A feasible way to deal with Big Data problems
  When studying learning problems with streaming data:
  - In the worst-case setting it resembles Gold's model (with algorithmic constraints)
  - But we consider a PAC-style scenario where:
    - the x_t are all independent and generated from a distribution D_t
    - the sequence of distributions D_1, D_2, ..., D_t, ... either changes very slowly or presents only abrupt but very rare changes
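As a warm-up, the one-pass, constant-memory regime can be illustrated with a toy example (not from the talk): maintaining a running count and mean of a numeric stream using O(1) state and O(1) work per item.

```python
# Toy illustration of the streaming constraints (not the paper's algorithm):
# a single pass over the stream, constant memory, constant time per item.
def running_mean(stream):
    count, mean = 0, 0.0
    for x in stream:                     # one pass, O(1) work per item
        count += 1
        mean += (x - mean) / count       # incremental mean update, O(1) state
    return count, mean

count, mean = running_mean(iter([2.0, 4.0, 6.0]))
```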

  6. Hypothesis Class: PDFA
  Probabilistic Deterministic Finite Automata = DFA + transition/stop probabilities
  (Figure: a 3-state PDFA over Σ = {a, b}.)
  Transition/stop probabilities:
    q   p_q(a)   p_q(b)   p_q(ξ)
    1   0.3      0.7      0.0
    2   0.5      0.5      0.0
    3   0.8      0.0      0.2
  Parameters:
  - n (number of states)
  - |Σ| (alphabet size)
  - L (expected length)
  - µ (distinguishability, in L_∞): µ = min_{q ≠ q'} max_{x ∈ Σ*} |D_q(x) − D_{q'}(x)|
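To make the class concrete, here is a minimal Python sketch (not from the talk) of how a PDFA assigns probability to a string: multiply the transition probabilities along the unique path and the stop probability of the final state. The symbol/stop probabilities follow the table above; the transition targets in `NEXT` are invented for illustration, since the automaton diagram did not survive extraction.

```python
# Minimal PDFA sketch. Symbol/stop probabilities are from the slide's table;
# the transition targets below are hypothetical (the diagram was lost).
SYMBOL_PROB = {  # state -> {symbol: transition probability}
    1: {"a": 0.3, "b": 0.7},
    2: {"a": 0.5, "b": 0.5},
    3: {"a": 0.8, "b": 0.0},
}
STOP_PROB = {1: 0.0, 2: 0.0, 3: 0.2}  # p_q(ξ)
NEXT = {  # (state, symbol) -> next state (assumed for illustration)
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 3, (2, "b"): 1,
    (3, "a"): 1, (3, "b"): 3,
}

def pdfa_probability(x, start=1):
    """Probability that the PDFA generates string x and then stops."""
    q, p = start, 1.0
    for a in x:
        p *= SYMBOL_PROB[q][a]   # probability of taking symbol a from state q
        q = NEXT[(q, a)]         # deterministic transition
    return p * STOP_PROB[q]      # must stop in the state reached
```

For example, `pdfa_probability("b")` is 0.7 (take b from state 1) times 0.2 (stop in state 3) = 0.14 under these assumed transitions.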

  7. State Merge/Split Algorithm
  The usual approach to PDFA learning [Carrasco-Oncina '94, Ron et al. '98, Clark-Thollard '04, Palmer-Goldberg '05, Castro-Gavaldà '08, etc.]: grow a tree of candidate states from the residual samples (S, a^{-1}S, b^{-1}S, a^{-1}a^{-1}S, ...) and use statistical tests to decide whether two candidates are similar (merge them) or distinct (keep them split).
  (Figure: successive merge/split decisions on the tree of residual samples.)

  8. Description of the Algorithm
  (Figure: system architecture. The stream feeds a Learner, a Change Detector, and an Adapter; the Learner maintains the Hypothesis, which a Predictor uses to issue predictions.)
  Learner module:
    initialize H with safe q_λ;
    foreach σ ∈ Σ do
      add a candidate q_σ to H;
      schedule insignificance and similarity tests for q_σ;
    foreach string x_t in the stream do
      foreach decomposition x_t = wz, with w, z ∈ Σ* do
        if q_w is defined then add z to Ŝ_w;
        if q_w is a candidate and |Ŝ_w| is large enough then call SimilarityTest(q_w, δ);
      foreach candidate q_w do
        if it is time to test insignificance of q_w then
          if |Ŝ_w| is too small then declare q_w insignificant;
          else schedule another insignificance test for q_w;
      if H has more than n safes or there are no candidates left then return H;
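The suffix-routing inner loop (every decomposition x = wz adds the suffix z to the multiset of the state reached by the prefix w) can be sketched as follows. This is a simplified illustration: plain dicts stand in for the hypothesis transitions and for the per-state sketches.

```python
# Sketch of the learner's inner loop: route each suffix z of x = wz to the
# multiset S_w of the state reached by prefix w. Plain nested dicts stand in
# for the SpaceSaving sketches used by the real algorithm.
from collections import defaultdict

def route_suffixes(x, transitions, start, sketches):
    q = start
    for i in range(len(x) + 1):
        sketches[q][x[i:]] += 1              # add suffix z = x[i:] to S_{q_w}
        if i == len(x):
            break
        q = transitions.get((q, x[i]))       # follow the prefix one symbol
        if q is None:                        # prefix leaves the hypothesis
            break
    return sketches
```

For instance, with transitions {(0, "a"): 1, (1, "b"): 0} and start state 0, routing "ab" adds "ab" to state 0, "b" to state 1, and "" to state 0.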

  9. Sample Sketches for Similarity Testing
  Note: instead of keeping a sample S_w for each state q_w, the algorithm keeps a sketch Ŝ_w of each sample.
  A sketch using memory O(1/µ) should be enough:
  - Given samples S, S' from distributions D, D'
  - The algorithm wants to test L_∞(D, D') = 0 or L_∞(D, D') ≥ µ
  - In the second case, if |D(x) − D'(x)| ≥ µ then either D(x) ≥ µ or D'(x) ≥ µ
  - It is enough to find all strings with D(x), D'(x) = Ω(µ), of which there are O(1/µ)
  In our algorithm, each sketch uses a SpaceSaving data structure [Metwally et al. '05]:
  - Uses memory O(1/µ)
  - Finds every string whose probability is Ω(µ) (frequent strings)
  - Approximates their probabilities with enough accuracy
  - Easier to implement than sketches based on hash functions
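A minimal Python rendition of the SpaceSaving structure described above, where `k` plays the role of the O(1/µ) counter budget: when a new item arrives and all counters are taken, the minimum counter is evicted and its count is inherited.

```python
# Compact SpaceSaving sketch [Metwally et al. '05]: with k counters it finds
# every item of frequency roughly above 1/k, overestimating counts by at most
# the smallest stored count.
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counts = {}                 # item -> estimated count

    def insert(self, item):
        if item in self.counts:
            self.counts[item] += 1       # monitored item: just increment
        elif len(self.counts) < self.k:
            self.counts[item] = 1        # free counter available
        else:                            # evict the minimum, inherit its count
            victim = min(self.counts, key=self.counts.get)
            c = self.counts.pop(victim)
            self.counts[item] = c + 1

    def estimate(self, item):
        return self.counts.get(item, 0)
```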

  10. Properties of the Algorithm
  Streaming-specific features:
  - Adaptive test scheduling (decide as soon as possible)
  - Similarity test based on a Vapnik-Chervonenkis bound (slow similarity detection)
  - Bootstrapped confidence intervals in tests (faster convergence)
  Complexity bounds (with any reasonable test):
  - Time per example: O(L) (expected, amortized)
  - The learner reads O(n² |Σ|² / (εµ²)) examples (in expectation)
  - Memory usage: O(n |Σ| L / µ) (roughly O(sqrt(t)))

  11. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  12. Testing Similarity between Probability Distributions
  Goal: decide whether L_∞(D, D') = 0 or L_∞(D, D') ≥ µ from samples S, S'.
  Statistical test based on the empirical L_∞ (the "default"):
  - Let µ* = L_∞(D, D') and compute µ̂ = L_∞(S, S')
  - Compute ∆_l, ∆_u such that µ̂ − ∆_l ≤ µ* ≤ µ̂ + ∆_u holds w.h.p.
  - If µ̂ − ∆_l > 0, decide D ≠ D'
  - If µ̂ + ∆_u < µ, decide D = D'
  - Else, wait for more examples
  Problem: asymmetry. Deciding dissimilarity is easier than deciding similarity:
  - When D ≠ D', the test decides correctly w.h.p. once |S|, |S'| ≈ 1/µ*²
  - When D = D', it decides correctly w.h.p. once |S|, |S'| ≈ 1/µ²
  In the latter case we are always competing against the worst case L_∞(D, D') = µ.
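The default test above can be sketched in a few lines. The confidence width used here is a simple Hoeffding-style term (an assumption for illustration) standing in for the VC-based width used in the talk; the three-way decision logic is as on the slide.

```python
# Sketch of the baseline similarity test: compare the empirical L_inf distance
# against a confidence width. The width is a Hoeffding-style stand-in
# (assumed for illustration) for the Vapnik-Chervonenkis bound in the talk.
from collections import Counter
from math import log, sqrt

def linf_test(S, Sp, mu, delta):
    m, mp = len(S), len(Sp)
    f, fp = Counter(S), Counter(Sp)
    mu_hat = max(abs(f[x] / m - fp[x] / mp) for x in set(S) | set(Sp))
    width = sqrt(log(2 / delta) / (2 * min(m, mp)))  # assumed width Delta
    if mu_hat - width > 0:
        return "distinct"       # w.h.p. L_inf(D, D') > 0
    if mu_hat + width < mu:
        return "equal"          # w.h.p. L_inf(D, D') < mu
    return "undecided"          # wait for more examples
```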

  13. Enter the Bootstrap
  - In the test just described there is another worst-case assumption: the confidence interval µ* ≤ µ̂ + ∆_u must hold for any D and D'
  - But it may be the case that for some D, certifying that S, S' ∼ D come from the same distribution is easier
  - The bootstrap is widely used in statistics for computing distribution-dependent confidence intervals (among many other things)
  Basic idea:
  - Suppose we have r different samples S^(1), ..., S^(r) ∼ D
  - Compute distances µ̂_i = L_∞(S^(i), S'^(i))
  - Use them to compute a histogram of the distribution of µ̂
  (Figure: histogram with a (1 − δ)% interval [µ̂ − ∆_l, µ̂ + ∆_u].)

  14. Enter the Bootstrap
  - In the test just described there is another worst-case assumption: the confidence interval µ* ≤ µ̂ + ∆_u must hold for any D and D'
  - But it may be the case that for some D, certifying that S, S' ∼ D come from the same distribution is easier
  - The bootstrap is widely used in statistics for computing distribution-dependent confidence intervals (among many other things)
  Bootstrapped confidence intervals:
  - Given a sample S, obtain samples S̃^(1), ..., S̃^(r) by sampling from S uniformly with replacement
  - Compute distances µ̃_i = L_∞(S̃^(i), S̃'^(i))
  - Sort the estimates increasingly: µ̃_1 ≤ ... ≤ µ̃_r
  - Say that µ* ≤ µ̃_⌈(1−δ)r⌉ with probability ≥ 1 − δ
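The bootstrapped upper bound on µ* described above can be sketched as follows: resample both samples with replacement r times, compute the r distances, sort them, and take the ⌈(1 − δ)r⌉-th order statistic. The helper `linf` (empirical L_∞) is a name introduced here for illustration.

```python
# Sketch of the bootstrapped upper confidence bound on mu*: resample with
# replacement, sort the r empirical distances, take an order statistic.
import random
from collections import Counter
from math import ceil

def linf(S, Sp):
    """Empirical L_inf distance between two samples (illustrative helper)."""
    f, fp = Counter(S), Counter(Sp)
    return max(abs(f[x] / len(S) - fp[x] / len(Sp)) for x in set(S) | set(Sp))

def bootstrap_upper_bound(S, Sp, r, delta, seed=0):
    rng = random.Random(seed)
    dists = sorted(
        linf(rng.choices(S, k=len(S)), rng.choices(Sp, k=len(Sp)))
        for _ in range(r)                       # r bootstrap resamples
    )
    return dists[ceil((1 - delta) * r) - 1]     # ceil((1-delta)r)-th estimate
```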

  15. Bootstrapped Confidence Intervals in Data Streams
  Question: do you need to store the full sample to do bootstrap resampling?
  Answer: no, if you can test from sketched data.
  The bootstrap sketch:
  - Keep r copies of the sketch you use for testing (e.g. SpaceSaving)
  - For each item x_t in the stream, randomly insert copies of x_t into each of the r sketches
  - Comparing each pair S̃^(i), S̃'^(j) yields r² approximations µ̃_{i,j}
  - Choosing r involves a trade-off between accuracy and memory
  (Figure: each incoming item x is randomly assigned copies across the r sketches.)
  In theory one can prove a bound (asymptotically) comparable to Vapnik-Chervonenkis. In practice, assuming µ* ≤ µ̃_⌈(1−δ)r²⌉ gives an accurate and statistically efficient similarity test.
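One way to realize the bootstrap sketch online is sketched below. Two assumptions beyond the slide: a Poisson(1) multiplicity per sketch copy (a common way to emulate resampling with replacement in one pass) and plain Counters standing in for SpaceSaving sketches.

```python
# Sketch of the streaming bootstrap: keep r sketch copies and insert each
# stream item into each copy with a random multiplicity. Poisson(1) counts
# (an assumption; the slide only says "randomly insert copies") emulate
# resampling with replacement; Counters stand in for SpaceSaving sketches.
import math
import random
from collections import Counter

def poisson1(rng):
    """Knuth's sampler for a Poisson(1) random variate."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

class BootstrapSketch:
    def __init__(self, r, seed=0):
        self.rng = random.Random(seed)
        self.sketches = [Counter() for _ in range(r)]   # r sketch copies

    def insert(self, item):
        for sk in self.sketches:
            sk[item] += poisson1(self.rng)   # random multiplicity per copy
```

Comparing all pairs of copies from two such sketches then yields the r² distance estimates mentioned above.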

  16. Experimental Results for the Learner
  - Prototype written in C++ and Boost, run on this laptop
  - Evaluated on the Reber grammar (a typical grammatical inference benchmark)
  - |Σ| = 5, n = 6, µ = 0.2, L ≈ 8
  - Compared VC-based and bootstrap-based (r = 10) tests

              Examples   Memory (MiB)   Time/item (ms)
  Hoeffding   57617      6.1            0.05
  Bootstrap   23844      53.7           1.2

  17. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion
