SLIDE 1
LEARNING REGRESSION TREES from Time-Changing Data Streams
Bla Sovdat
August 27, 2014
THE STREAM MODEL
Data arrives in the form of examples (tuples)
Example:
(adult, female, 3.141, 0.577)
(child, male, 2.1728, 0.1123)
(child, female,
SLIDE 2
SLIDE 3
Requirements of the data stream environment:
1) Process one example at a time, inspect it only once
2) Use a limited amount of memory
3) Work in a limited amount of time
4) Be ready to predict at any time
Typical use of data stream learner:
a) The learner receives a new example from the stream (1)
b) The learner processes the example (2, 3)
c) The learner is ready for the next example (4)
Different evaluation techniques
DATA STREAM ENVIRONMENT
Data stream prediction cycle
Albert Bifet and Richard Kirkby. Data Stream Mining: A Practical Approach. 2009.
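The prediction cycle above can be sketched in a few lines. This is only an illustration: `MeanLearner` and `prequentialMse` are hypothetical names, the learner is a trivial running-mean baseline, and interleaved test-then-train is assumed as one of the possible evaluation techniques.

```javascript
// Hypothetical baseline learner: predicts the running mean of the targets.
class MeanLearner {
  constructor() {
    this.n = 0;      // number of examples seen so far
    this.sum = 0.0;  // running sum of target values
  }
  // Requirement 4: be ready to predict at any time.
  predict(_attributes) {
    return this.n === 0 ? 0.0 : this.sum / this.n;
  }
  // Requirements 1-3: one example at a time, constant memory and time.
  process(_attributes, target) {
    this.n += 1;
    this.sum += target;
  }
}

// Interleaved test-then-train: predict on each example before learning from it.
function prequentialMse(learner, stream) {
  let squaredError = 0.0;
  let count = 0;
  for (const [attributes, target] of stream) {
    const prediction = learner.predict(attributes); // test first ...
    squaredError += (prediction - target) ** 2;
    learner.process(attributes, target);            // ... then train
    count += 1;
  }
  return count === 0 ? 0.0 : squaredError / count;
}

// Made-up toy stream of (attributes, target) pairs.
const stream = [
  [["adult", "female"], 2.0],
  [["child", "male"], 4.0],
  [["child", "female"], 6.0],
];
const learner = new MeanLearner();
const mse = prequentialMse(learner, stream);
console.log(mse, learner.predict([]));
```

Each example is inspected exactly once, and the learner keeps only two numbers, so memory does not grow with the stream.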
SLIDE 4
A regression tree represents a mapping from the attribute space to the real numbers
Examples are tuples of attribute values
Each attribute $A$ has a range of possible values:
1) Discrete (also “categorical”) attribute: sex with range {male, female}
2) Numeric attribute: temperature with range R (reals)
The target attribute is a real number
Concrete example:
INTERMEZZO: DECISION TREES
Example: ((male, first, adult), no)
Tom Mitchell. Machine Learning. McGraw Hill. 1997.
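As a rough illustration of a tree as a mapping from attribute values to real numbers, the sketch below sorts an example down a small hand-built tree. The tree layout (`attribute`, `children`, `value`) and the leaf values are made up for this example and are not taken from the slides.

```javascript
// Hypothetical representation: an inner node tests one discrete attribute,
// a leaf stores a real-valued prediction. Leaf values are invented.
const tree = {
  attribute: "sex",
  children: {
    female: { value: 0.9 },
    male: {
      attribute: "age",
      children: {
        adult: { value: 0.2 },
        child: { value: 0.6 },
      },
    },
  },
};

// Sort an example down the tree and return the value of the leaf it reaches.
function predict(node, example) {
  while (node.children !== undefined) {
    node = node.children[example[node.attribute]];
    if (node === undefined) {
      throw new Error("attribute value not covered by the tree");
    }
  }
  return node.value;
}

console.log(predict(tree, { sex: "male", class: "first", age: "adult" }));
```

Attributes the tree never tests (here `class`) are simply ignored on the way down.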
SLIDE 5
Famous batch learner for regression trees
Start with a set of examples $S = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, i.e., the training set
Each example is of the form $(\mathbf{x}, y)$, where $\mathbf{x} = (v_1, v_2, \ldots, v_a)$
Pick the attribute $A$ that maximizes standard deviation reduction (SDR)
Partition the set $S$ according to the attribute $A$
Recursively apply the procedure on each subset
INTERMEZZO: CART
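$\mathrm{sd}(S) = \sqrt{\frac{1}{|S|} \sum_{(\mathbf{x}, y) \in S} (y - \bar{y})^2}$
$\mathrm{sdr}(A) = \mathrm{sd}(S) - \sum_{i=1}^{d} \frac{|S_i|}{|S|} \, \mathrm{sd}(S_i)$
$\bar{y} = \frac{1}{|S|} \sum_{(\mathbf{x}, y) \in S} y$
$S_i = \{(\mathbf{x}, y) \in S \mid A(\mathbf{x}) = a_i\}$
$A = \arg\max_{A} \mathrm{sdr}(A)$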
L. Breiman, J. Friedman, C.J. Stone, R.A. Olshen. Classification and Regression Trees. CRC Press. 1984.
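The split criterion above can be tried out on a toy data set. The sketch below is a minimal batch version for discrete attributes only; the function names `sd`, `sdr` and `bestAttribute` and the four examples are assumptions made for illustration.

```javascript
// Standard deviation of the targets y in a set S of examples { x, y }.
function sd(S) {
  if (S.length === 0) return 0.0;
  const mean = S.reduce((acc, e) => acc + e.y, 0) / S.length;
  const variance = S.reduce((acc, e) => acc + (e.y - mean) ** 2, 0) / S.length;
  return Math.sqrt(variance);
}

// Standard deviation reduction from partitioning S on one discrete attribute.
function sdr(S, attribute) {
  const parts = new Map();
  for (const e of S) {
    const v = e.x[attribute];
    if (!parts.has(v)) parts.set(v, []);
    parts.get(v).push(e);
  }
  let weighted = 0.0;
  for (const Si of parts.values()) {
    weighted += (Si.length / S.length) * sd(Si);
  }
  return sd(S) - weighted;
}

// Attribute with the largest standard deviation reduction.
function bestAttribute(S, attributes) {
  let best = null;
  let bestSdr = -Infinity;
  for (const a of attributes) {
    const r = sdr(S, a);
    if (r > bestSdr) {
      best = a;
      bestSdr = r;
    }
  }
  return best;
}

// Made-up training set: the target depends on sex only.
const S = [
  { x: { sex: "male", age: "adult" }, y: 1.0 },
  { x: { sex: "male", age: "child" }, y: 1.0 },
  { x: { sex: "female", age: "adult" }, y: 3.0 },
  { x: { sex: "female", age: "child" }, y: 3.0 },
];
console.log(sdr(S, "sex"), sdr(S, "age"), bestAttribute(S, ["sex", "age"]));
```

Splitting on `sex` removes all spread in the targets, while splitting on `age` removes none, so `sex` is selected.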
SLIDE 6
Let’s adapt CART to a streaming setting
Data is not available in advance, and we only see a (small) sample of the stream
When and on what attribute to split?
What attribute is “the best” relative to the whole stream?
Idea: Apply the Hoeffding bound to confidently decide when to split
A PROBLEM
SLIDE 7
Well-known result from probability, also known as the additive Chernoff bound, proved by Wassily Hoeffding
Many applications in theoretical computer science (randomized algorithms, etc.) and machine learning (PAC bounds, Hoeffding trees, etc.)
Theorem (Hoeffding, 1963). Let $X = X_1 + X_2 + \cdots + X_n$ be a sum of independent bounded random variables, with $a \le X_i \le b$, and let $R = b - a$. Then
$\Pr[X - E[X] \ge \varepsilon n] \le \exp\left(-\frac{2 n \varepsilon^2}{R^2}\right)$
The result is independent of the distribution
SIMPLIFIED HOEFFDING BOUND
Randomized Quicksort does at most $48 n \log n$ comparisons “with high probability”
Rajeev Motwani, Prabhakar Raghavan. Randomized Algorithms. Cambridge University Press. 1995.
Wassily Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association. 1963.
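To get a feel for how quickly the bound shrinks, the right-hand side of the theorem can be evaluated directly. The helper name `hoeffdingTailBound` and the chosen values of the sample size, deviation and range are assumptions for illustration.

```javascript
// Right-hand side of the Hoeffding bound: exp(-2 * n * eps^2 / R^2),
// for n variables with range R and per-variable deviation eps.
function hoeffdingTailBound(n, eps, R) {
  return Math.exp((-2 * n * eps * eps) / (R * R));
}

// The bound decays exponentially in the number of observations.
for (const n of [100, 1000, 10000]) {
  console.log(n, hoeffdingTailBound(n, 0.05, 1.0));
}
```

With range 1 and deviation 0.05, a thousand observations already push the bound below one percent.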
SLIDE 8
Let $A_1$ and $A_2$ be the best and the second-best attributes (i.e., the attributes with the highest SDRs)
Let $S_{A_1}$ and $S_{A_2}$ be the estimated standard deviation reductions, computed from $n$ examples, for attributes $A_1$ and $A_2$
If $S_{A_2} / S_{A_1} < 1 - \varepsilon$, then $\mathrm{sdr}(A_2)/\mathrm{sdr}(A_1) < 1$ with probability at least $1 - \delta$, where
$\varepsilon = \sqrt{\frac{\ln(1/\delta)}{2n}}$
To see this, solve $\exp(-2 n \varepsilon^2) \le \delta$ for $\varepsilon$ (the ratio lies in $[0, 1]$, so $R = 1$)
Note that $\mathrm{sdr}(A_2)/\mathrm{sdr}(A_1) < 1$ means $A_1$ is better than $A_2$: assuming SDRs are positive, $\mathrm{sdr}(A_2)/\mathrm{sdr}(A_1) < 1$ iff $\mathrm{sdr}(A_1) > \mathrm{sdr}(A_2)$
This is all we need to scale up the CART learner: each leaf accumulates examples until it is confident it has found the “truly best” attribute
APPLYING THE HOEFFDING BOUND
Elena Ikonomovska. Algorithms for Learning Regression Trees and Ensembles on Evolving Data Streams. PhD thesis. 2012.
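The step "solve for the tolerance" can be written out explicitly. The derivation below is a sketch that writes ε for the tolerance, δ for the allowed failure probability and n for the number of examples seen in the leaf, and assumes a range of 1 because the SDR ratio lies between 0 and 1.

```latex
\begin{align*}
\exp(-2 n \varepsilon^2) \le \delta
  &\iff -2 n \varepsilon^2 \le \ln \delta \\
  &\iff \varepsilon^2 \ge \frac{\ln(1/\delta)}{2n} \\
  &\iff \varepsilon \ge \sqrt{\frac{\ln(1/\delta)}{2n}}
\end{align*}
```

For instance, with δ = 10^{-6} and n = 300 the smallest admissible tolerance is roughly 0.15, so the second-best attribute must reach less than about 85% of the best SDR before the leaf may split.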
SLIDE 9
Learning:
1) Start with an empty leaf (the root node)
2) Sort a newly arrived example into a leaf
3) Update statistics, compute SDRs, and compute $\varepsilon$
4) Accumulate examples in the leaf until $S_{A_2} / S_{A_1} < 1 - \varepsilon$
a) Split the leaf: create 𝑡 new leaf nodes
Predicting:
1) Sort example down the tree, into a leaf
2) Predict the average $\bar{y}$ of the examples in the leaf
FAST INCREMENTAL MODEL TREES
The big picture
$\varepsilon = \sqrt{\frac{\ln(1/\delta)}{2n}}$
Elena Ikonomovska. Algorithms for Learning Regression Trees and Ensembles on Evolving Data Streams. PhD thesis. 2012.
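The learning steps above can be sketched for a single leaf with discrete attributes. This is a simplified illustration, not the FIMT implementation: the class names `Stats` and `Leaf`, the method `trySplit` and the noise-free toy stream are all assumptions, and grace periods, tie breaking and numeric attributes are left out.

```javascript
// Sufficient statistics for the standard deviation: count, sum, sum of squares.
class Stats {
  constructor() { this.n = 0; this.sum = 0; this.sumSq = 0; }
  add(y) { this.n += 1; this.sum += y; this.sumSq += y * y; }
  sd() {
    if (this.n === 0) return 0;
    const mean = this.sum / this.n;
    return Math.sqrt(Math.max(0, this.sumSq / this.n - mean * mean));
  }
}

// Tolerance from the Hoeffding bound for n examples and confidence 1 - delta.
function hoeffdingEpsilon(delta, n) {
  return Math.sqrt(Math.log(1 / delta) / (2 * n));
}

class Leaf {
  constructor(attributes) {
    this.total = new Stats();
    // attribute name -> (attribute value -> Stats of the targets)
    this.perValue = new Map(attributes.map((a) => [a, new Map()]));
  }
  // Step 3: update statistics with one example (x: attribute values, y: target).
  update(x, y) {
    this.total.add(y);
    for (const [a, byValue] of this.perValue) {
      if (!byValue.has(x[a])) byValue.set(x[a], new Stats());
      byValue.get(x[a]).add(y);
    }
  }
  sdr(a) {
    let weighted = 0;
    for (const s of this.perValue.get(a).values()) {
      weighted += (s.n / this.total.n) * s.sd();
    }
    return this.total.sd() - weighted;
  }
  // Step 4: return the attribute to split on, or null if not yet confident.
  trySplit(delta) {
    const ranked = [...this.perValue.keys()]
      .map((a) => ({ a, sdr: this.sdr(a) }))
      .sort((p, q) => q.sdr - p.sdr);
    if (ranked.length < 2 || ranked[0].sdr <= 0) return null;
    const eps = hoeffdingEpsilon(delta, this.total.n);
    return ranked[1].sdr / ranked[0].sdr < 1 - eps ? ranked[0].a : null;
  }
}

// Made-up stream: the target depends on sex only, age is irrelevant.
const leaf = new Leaf(["sex", "age"]);
const decisions = [];
for (let i = 0; i < 400; i++) {
  const x = { sex: i % 2 === 0 ? "male" : "female", age: i % 4 < 2 ? "adult" : "child" };
  const y = x.sex === "male" ? 1.0 : 3.0;
  leaf.update(x, y);
  decisions.push(leaf.trySplit(1e-6));
}
console.log(decisions[0], decisions[399]);
```

The leaf stores only counts, sums and sums of squares, so each example can be discarded as soon as it has been processed.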
SLIDE 10
Handling numeric attributes (histogram, BST, etc.)
Stopping criteria (tree size, thresholds, etc.)
Fitting a linear model in leaves (unthresholded perceptron)
Handling concept drift (with Page-Hinkley test)
EXTENSIONS OF THE FIMT LEARNER
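As an illustration of the drift-handling extension, here is a minimal sketch of the Page-Hinkley test for an increase in the mean of a monitored signal, such as the absolute prediction error. The class name `PageHinkley` is hypothetical, and the values 0.005 and 50.0 are borrowed from the `phAlpha` and `phLambda` settings in the QMiner example in these slides, assuming they play the usual roles of tolerated change and alarm threshold.

```javascript
// Page-Hinkley test: signals when the mean of the observed values increases.
class PageHinkley {
  constructor(alpha, lambda) {
    this.alpha = alpha;   // tolerated magnitude of change
    this.lambda = lambda; // alarm threshold
    this.n = 0;
    this.mean = 0;        // running mean of the signal
    this.cumulative = 0;  // m_T: cumulative sum of (x_t - mean_t - alpha)
    this.minimum = 0;     // M_T: smallest cumulative sum seen so far
  }
  // Feed one observation; returns true if a change is signalled.
  update(x) {
    this.n += 1;
    this.mean += (x - this.mean) / this.n;
    this.cumulative += x - this.mean - this.alpha;
    this.minimum = Math.min(this.minimum, this.cumulative);
    return this.cumulative - this.minimum > this.lambda;
  }
}

// Made-up error signal whose level jumps from 1.0 to 2.0 at t = 1000.
const ph = new PageHinkley(0.005, 50.0);
let alarmAt = -1;
for (let t = 0; t < 2000 && alarmAt < 0; t++) {
  const error = t < 1000 ? 1.0 : 2.0;
  if (ph.update(error)) alarmAt = t;
}
console.log(alarmAt);
```

A larger threshold makes false alarms less likely but delays detection; the test needs only constant memory per monitored signal.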
SLIDE 11
Syntactically (almost) no difference between regression and classification
A variant of the FIMT-DD learner is available in QMiner
The learner is exposed via the QMiner JavaScript API
Pass algorithm parameters and data stream specification in JSON format
Several stopping and splitting criteria
Change detection mechanism, using the Page-Hinkley test
Can export the model anytime (XML and DOT formats supported)
Usage examples available on GitHub
The algorithm expects three (learning) or two (predicting) parameters:
1) vector of discrete attribute values
2) vector of numeric attribute values
3) target variable value (not needed for prediction)
REGRESSION TREES IN QMINER
SLIDE 12
REGRESSION TREES IN QMINER
// algorithm parameters
var algorithmParams = {
    "gracePeriod": 300,
    "splitConfidence": 1e-6,
    "tieBreaking": 0.005,
    "driftCheck": 1000,
    "windowSize": 100000,
    "conceptDriftP": false,
    "maxNodes": 15,
    "regLeafModel": "mean",
    "sdrThreshold": 0.1,
    "sdThreshold": 0.01,
    "phAlpha": 0.005,
    "phLambda": 50.0,
    "phInit": 100
};
// describe the data stream
var streamConfig = {
    "dataFormat": ["A", "B", "Y"],
    "A": { "type": "discrete", "values": ["t", "f"] },
    "B": { "type": "discrete", "values": ["t", "f"] },
    "Y": { "type": "numeric" }
};
// create a new learner
var ht = analytics.newHoeffdingTree(streamConfig, algorithmParams);
// process the stream
while (!streamData.eof) {
    /* parse example */
    ht.process(vec_discrete, vec_numeric, target);
}
// use the model
var val = ht.predict(["t", "f"], []);
// export the model
ht.exportModel({ "file": "./sandbox/ht/model.gv", "type": "DOT" });
SLIDE 13
The algorithm is pretty fast: tens of thousands of examples per second
Scales poorly with the number of attributes (quadratic in $a$)
When using information gain as attribute selection criterion, needs $O(n a^2 c)$ time
Numeric attribute discretization is expensive (both space and time)
Would love to get feedback from people
From now on: Change the algorithm as needed
REGRESSION TREES IN QMINER
SLIDE 14