Bootstrapping and Learning PDFA in Data Streams



  1. Bootstrapping and Learning PDFA in Data Streams. Borja Balle, Jorge Castro, Ricard Gavaldà. International Colloquium on Grammatical Inference, University of Maryland, September 2012. This work is partially supported by the PASCAL2 Network.

  2. Example Application: Web User Modeling
  (Figure: customers interacting with an online store's "wish list", producing a customer log that feeds the model.)
  - Process examples as fast as they arrive (10^5 per second or more): stream mining
  - Use a small amount of memory (must fit into the machine's main memory)
  - Detect changes in customer behavior and adapt the model accordingly
  Other applications: process mining, biological models (DNA and amino acid sequences)

  3. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  4. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  5. The Data Streams Algorithmic Model
  An algorithm receives an infinite stream x_1, x_2, ..., x_t, ... from some domain X and must:
  - Make only one pass over the data and process each item in time O(1)
  - At every time t use sublinear memory (e.g. O(log t), O(sqrt(t)))
  - Adapt to possible "changes" in the data
  It is a theoretically challenging model that is useful in applications:
  - Originated in the algorithmics community
  - Realistic for Data Mining and Machine Learning tasks in real time
  - A feasible way to deal with Big Data problems
  When studying learning problems with streaming data:
  - In the worst-case setting it resembles Gold's model (with algorithmic constraints)
  - But we consider a PAC-style scenario where:
    - the x_t are all independent and generated from a distribution D_t
    - the sequence of distributions D_1, D_2, ..., D_t, ... either changes very slowly or presents only abrupt but very rare changes
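As a warm-up, the one-pass, constant-memory regime can be illustrated with a toy example (not from the talk): maintaining a running count and mean of a numeric stream using O(1) state and O(1) work per item.

```python
# Toy illustration of the streaming constraints (not the paper's algorithm):
# a single pass over the stream, constant memory, constant time per item.
def running_mean(stream):
    count, mean = 0, 0.0
    for x in stream:                     # one pass, O(1) work per item
        count += 1
        mean += (x - mean) / count       # incremental mean update, O(1) state
    return count, mean

count, mean = running_mean(iter([2.0, 4.0, 6.0]))
```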

  6. Hypothesis Class: PDFA
  Probabilistic Deterministic Finite Automata = DFA + transition/stop probabilities
  (Figure: a 3-state PDFA over Σ = {a, b}.)
  Transition/stop probabilities:
    q   p_q(a)   p_q(b)   p_q(ξ)
    1   0.3      0.7      0.0
    2   0.5      0.5      0.0
    3   0.8      0.0      0.2
  Parameters:
  - n (number of states)
  - |Σ| (alphabet size)
  - L (expected length)
  - µ (distinguishability, in L_∞): µ = min_{q ≠ q'} max_{x ∈ Σ*} |D_q(x) − D_{q'}(x)|
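To make the class concrete, here is a minimal Python sketch (not from the talk) of how a PDFA assigns probability to a string: multiply the transition probabilities along the unique path and the stop probability of the final state. The symbol/stop probabilities follow the table above; the transition targets in `NEXT` are invented for illustration, since the automaton diagram did not survive extraction.

```python
# Minimal PDFA sketch. Symbol/stop probabilities are from the slide's table;
# the transition targets below are hypothetical (the diagram was lost).
SYMBOL_PROB = {  # state -> {symbol: transition probability}
    1: {"a": 0.3, "b": 0.7},
    2: {"a": 0.5, "b": 0.5},
    3: {"a": 0.8, "b": 0.0},
}
STOP_PROB = {1: 0.0, 2: 0.0, 3: 0.2}  # p_q(ξ)
NEXT = {  # (state, symbol) -> next state (assumed for illustration)
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 3, (2, "b"): 1,
    (3, "a"): 1, (3, "b"): 3,
}

def pdfa_probability(x, start=1):
    """Probability that the PDFA generates string x and then stops."""
    q, p = start, 1.0
    for a in x:
        p *= SYMBOL_PROB[q][a]   # probability of taking symbol a from state q
        q = NEXT[(q, a)]         # deterministic transition
    return p * STOP_PROB[q]      # must stop in the state reached
```

For example, `pdfa_probability("b")` is 0.7 (take b from state 1) times 0.2 (stop in state 3) = 0.14 under these assumed transitions.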

  7. State Merge/Split Algorithm
  The usual approach to PDFA learning [Carrasco-Oncina '94, Ron et al. '98, Clark-Thollard '04, Palmer-Goldberg '05, Castro-Gavaldà '08, etc.]: grow a tree of candidate states from the residual samples (S, a^{-1}S, b^{-1}S, a^{-1}a^{-1}S, ...) and use statistical tests to decide whether two candidates are similar (merge them) or distinct (keep them split).
  (Figure: successive merge/split decisions on the tree of residual samples.)

  8. Description of the Algorithm
  (Figure: system architecture. The stream feeds a Learner, a Change Detector, and an Adapter; the Learner maintains the Hypothesis, which a Predictor uses to issue predictions.)
  Learner module:
    initialize H with safe q_λ;
    foreach σ ∈ Σ do
      add a candidate q_σ to H;
      schedule insignificance and similarity tests for q_σ;
    foreach string x_t in the stream do
      foreach decomposition x_t = wz, with w, z ∈ Σ* do
        if q_w is defined then add z to Ŝ_w;
        if q_w is a candidate and |Ŝ_w| is large enough then call SimilarityTest(q_w, δ);
      foreach candidate q_w do
        if it is time to test insignificance of q_w then
          if |Ŝ_w| is too small then declare q_w insignificant;
          else schedule another insignificance test for q_w;
      if H has more than n safes or there are no candidates left then return H;
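The suffix-routing inner loop (every decomposition x = wz adds the suffix z to the multiset of the state reached by the prefix w) can be sketched as follows. This is a simplified illustration: plain dicts stand in for the hypothesis transitions and for the per-state sketches.

```python
# Sketch of the learner's inner loop: route each suffix z of x = wz to the
# multiset S_w of the state reached by prefix w. Plain nested dicts stand in
# for the SpaceSaving sketches used by the real algorithm.
from collections import defaultdict

def route_suffixes(x, transitions, start, sketches):
    q = start
    for i in range(len(x) + 1):
        sketches[q][x[i:]] += 1              # add suffix z = x[i:] to S_{q_w}
        if i == len(x):
            break
        q = transitions.get((q, x[i]))       # follow the prefix one symbol
        if q is None:                        # prefix leaves the hypothesis
            break
    return sketches
```

For instance, with transitions {(0, "a"): 1, (1, "b"): 0} and start state 0, routing "ab" adds "ab" to state 0, "b" to state 1, and "" to state 0.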

  9. Sample Sketches for Similarity Testing
  Note: instead of keeping a sample S_w for each state q_w, the algorithm keeps a sketch Ŝ_w of each sample.
  A sketch using memory O(1/µ) should be enough:
  - Given samples S, S' from distributions D, D'
  - The algorithm wants to test L_∞(D, D') = 0 or L_∞(D, D') ≥ µ
  - In the second case, if |D(x) − D'(x)| ≥ µ then either D(x) ≥ µ or D'(x) ≥ µ
  - It is enough to find all strings with D(x), D'(x) = Ω(µ), of which there are O(1/µ)
  In our algorithm, each sketch uses a SpaceSaving data structure [Metwally et al. '05]:
  - Uses memory O(1/µ)
  - Finds every string whose probability is Ω(µ) (frequent strings)
  - Approximates their probabilities with enough accuracy
  - Easier to implement than sketches based on hash functions
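A minimal Python rendition of the SpaceSaving structure described above, where `k` plays the role of the O(1/µ) counter budget: when a new item arrives and all counters are taken, the minimum counter is evicted and its count is inherited.

```python
# Compact SpaceSaving sketch [Metwally et al. '05]: with k counters it finds
# every item of frequency roughly above 1/k, overestimating counts by at most
# the smallest stored count.
class SpaceSaving:
    def __init__(self, k):
        self.k = k
        self.counts = {}                 # item -> estimated count

    def insert(self, item):
        if item in self.counts:
            self.counts[item] += 1       # monitored item: just increment
        elif len(self.counts) < self.k:
            self.counts[item] = 1        # free counter available
        else:                            # evict the minimum, inherit its count
            victim = min(self.counts, key=self.counts.get)
            c = self.counts.pop(victim)
            self.counts[item] = c + 1

    def estimate(self, item):
        return self.counts.get(item, 0)
```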

  10. Properties of the Algorithm
  Streaming-specific features:
  - Adaptive test scheduling (decide as soon as possible)
  - Similarity test based on a Vapnik-Chervonenkis bound (slow similarity detection)
  - Bootstrapped confidence intervals in tests (faster convergence)
  Complexity bounds (with any reasonable test):
  - Time per example: O(L) (expected, amortized)
  - The learner reads O(n² |Σ|² / (εµ²)) examples (in expectation)
  - Memory usage: O(n |Σ| L / µ) (roughly O(sqrt(t)))

  11. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion

  12. Testing Similarity between Probability Distributions
  Goal: decide whether L_∞(D, D') = 0 or L_∞(D, D') ≥ µ from samples S, S'.
  Statistical test based on the empirical L_∞ (the "default"):
  - Let µ* = L_∞(D, D') and compute µ̂ = L_∞(S, S')
  - Compute ∆_l, ∆_u such that µ̂ − ∆_l ≤ µ* ≤ µ̂ + ∆_u holds w.h.p.
  - If µ̂ − ∆_l > 0, decide D ≠ D'
  - If µ̂ + ∆_u < µ, decide D = D'
  - Else, wait for more examples
  Problem: asymmetry. Deciding dissimilarity is easier than deciding similarity:
  - When D ≠ D', the test decides correctly w.h.p. once |S|, |S'| ≈ 1/µ*²
  - When D = D', it decides correctly w.h.p. once |S|, |S'| ≈ 1/µ²
  In the latter case we are always competing against the worst case L_∞(D, D') = µ.
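The default test above can be sketched in a few lines. The confidence width used here is a simple Hoeffding-style term (an assumption for illustration) standing in for the VC-based width used in the talk; the three-way decision logic is as on the slide.

```python
# Sketch of the baseline similarity test: compare the empirical L_inf distance
# against a confidence width. The width is a Hoeffding-style stand-in
# (assumed for illustration) for the Vapnik-Chervonenkis bound in the talk.
from collections import Counter
from math import log, sqrt

def linf_test(S, Sp, mu, delta):
    m, mp = len(S), len(Sp)
    f, fp = Counter(S), Counter(Sp)
    mu_hat = max(abs(f[x] / m - fp[x] / mp) for x in set(S) | set(Sp))
    width = sqrt(log(2 / delta) / (2 * min(m, mp)))  # assumed width Delta
    if mu_hat - width > 0:
        return "distinct"       # w.h.p. L_inf(D, D') > 0
    if mu_hat + width < mu:
        return "equal"          # w.h.p. L_inf(D, D') < mu
    return "undecided"          # wait for more examples
```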

  13. Enter the Bootstrap
  - In the test just described there is another worst-case assumption: the confidence interval µ* ≤ µ̂ + ∆_u must hold for any D and D'
  - But it may be the case that for some D, certifying that S, S' ∼ D come from the same distribution is easier
  - The bootstrap is widely used in statistics for computing distribution-dependent confidence intervals (among many other things)
  Basic idea:
  - Suppose we have r different samples S^(1), ..., S^(r) ∼ D
  - Compute distances µ̂_i = L_∞(S^(i), S'^(i))
  - Use them to compute a histogram of the distribution of µ̂
  (Figure: histogram with a (1 − δ)% interval [µ̂ − ∆_l, µ̂ + ∆_u].)

  14. Enter the Bootstrap
  - In the test just described there is another worst-case assumption: the confidence interval µ* ≤ µ̂ + ∆_u must hold for any D and D'
  - But it may be the case that for some D, certifying that S, S' ∼ D come from the same distribution is easier
  - The bootstrap is widely used in statistics for computing distribution-dependent confidence intervals (among many other things)
  Bootstrapped confidence intervals:
  - Given a sample S, obtain samples S̃^(1), ..., S̃^(r) by sampling from S uniformly with replacement
  - Compute distances µ̃_i = L_∞(S̃^(i), S̃'^(i))
  - Sort the estimates increasingly: µ̃_1 ≤ ... ≤ µ̃_r
  - Say that µ* ≤ µ̃_⌈(1−δ)r⌉ with probability ≥ 1 − δ
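The bootstrapped upper bound on µ* described above can be sketched as follows: resample both samples with replacement r times, compute the r distances, sort them, and take the ⌈(1 − δ)r⌉-th order statistic. The helper `linf` (empirical L_∞) is a name introduced here for illustration.

```python
# Sketch of the bootstrapped upper confidence bound on mu*: resample with
# replacement, sort the r empirical distances, take an order statistic.
import random
from collections import Counter
from math import ceil

def linf(S, Sp):
    """Empirical L_inf distance between two samples (illustrative helper)."""
    f, fp = Counter(S), Counter(Sp)
    return max(abs(f[x] / len(S) - fp[x] / len(Sp)) for x in set(S) | set(Sp))

def bootstrap_upper_bound(S, Sp, r, delta, seed=0):
    rng = random.Random(seed)
    dists = sorted(
        linf(rng.choices(S, k=len(S)), rng.choices(Sp, k=len(Sp)))
        for _ in range(r)                       # r bootstrap resamples
    )
    return dists[ceil((1 - delta) * r) - 1]     # ceil((1-delta)r)-th estimate
```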

  15. Bootstrapped Confidence Intervals in Data Streams
  Question: do you need to store the full sample to do bootstrap resampling?
  Answer: no, if you can test from sketched data.
  The bootstrap sketch:
  - Keep r copies of the sketch you use for testing (e.g. SpaceSaving)
  - For each item x_t in the stream, randomly insert copies of x_t into each of the r sketches
  - Comparing each pair S̃^(i), S̃'^(j) yields r² approximations µ̃_{i,j}
  - Choosing r involves a trade-off between accuracy and memory
  (Figure: each incoming item x is randomly assigned copies across the r sketches.)
  In theory one can prove a bound (asymptotically) comparable to Vapnik-Chervonenkis. In practice, assuming µ* ≤ µ̃_⌈(1−δ)r²⌉ gives an accurate and statistically efficient similarity test.
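One way to realize the bootstrap sketch online is sketched below. Two assumptions beyond the slide: a Poisson(1) multiplicity per sketch copy (a common way to emulate resampling with replacement in one pass) and plain Counters standing in for SpaceSaving sketches.

```python
# Sketch of the streaming bootstrap: keep r sketch copies and insert each
# stream item into each copy with a random multiplicity. Poisson(1) counts
# (an assumption; the slide only says "randomly insert copies") emulate
# resampling with replacement; Counters stand in for SpaceSaving sketches.
import math
import random
from collections import Counter

def poisson1(rng):
    """Knuth's sampler for a Poisson(1) random variate."""
    limit, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p < limit:
            return k
        k += 1

class BootstrapSketch:
    def __init__(self, r, seed=0):
        self.rng = random.Random(seed)
        self.sketches = [Counter() for _ in range(r)]   # r sketch copies

    def insert(self, item):
        for sk in self.sketches:
            sk[item] += poisson1(self.rng)   # random multiplicity per copy
```

Comparing all pairs of copies from two such sketches then yields the r² distance estimates mentioned above.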

  16. Experimental Results for the Learner
  - Prototype written in C++ and Boost, run on this laptop
  - Evaluated on the Reber grammar (a typical grammatical inference benchmark)
  - |Σ| = 5, n = 6, µ = 0.2, L ≈ 8
  - Compared VC-based and bootstrap-based (r = 10) tests

              Examples   Memory (MiB)   Time/item (ms)
  Hoeffding   57617      6.1            0.05
  Bootstrap   23844      53.7           1.2

  17. Outline
  - Learning PDFA from Data Streams
  - Testing Similarity in Data Streams with the Bootstrap
  - Adapting to Changes in the Target
  - Conclusion
