  1. CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

  2. Input features:
     - N features: X_1, X_2, ..., X_N
     - Each X_j has a domain D_j
       - Categorical: D_j = {red, blue}
       - Numerical: D_j = (0, 10)
     - Y is the output variable with domain D_Y:
       - Categorical: Classification
       - Numerical: Regression
     - Task: given an input data vector x_i, predict y_i
     [Figure: example decision tree with a numerical split X_1 < v_1, a categorical split X_2 in {v_2, v_3}, and a leaf C predicting Y = 0.42]

  3. Decision trees:
     - Split the data at each internal node
     - Each leaf node makes a prediction
     Lecture today:
     - Binary splits: X_j < v
     - Numerical attributes
     - Regression
     [Figure: tree with root A testing X_1 < v_1; leaf C predicts Y = 0.42; node B tests X_2 < v_2; nodes D and E test X_3 < v_4 and X_2 < v_5, with leaves F, G, H, I]
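     To make this structure concrete, here is a minimal sketch (all names hypothetical, not from the lecture) of a tree whose internal nodes hold binary numerical splits X_j < v and whose leaves hold a regression prediction:

```python
# Hypothetical sketch of the tree structure this lecture assumes:
# internal nodes hold a binary split X_j < v; leaves hold a numeric prediction.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None       # index j of the attribute X_j tested here
    value: Optional[float] = None       # threshold v in the test X_j < v
    left: Optional["Node"] = None       # subtree where X_j < v holds
    right: Optional["Node"] = None      # subtree where X_j >= v
    prediction: Optional[float] = None  # set only at leaves (regression output)

    def is_leaf(self) -> bool:
        return self.prediction is not None
```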

  4. - Input: example x_i
     - Output: predicted y_i'
     - "Drop" x_i down the tree until it hits a leaf node
     - Predict the value stored in the leaf that x_i hits
     [Figure: the same tree; x_i follows the path from root A through the split tests down to a leaf]
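     A sketch of this prediction procedure, reusing the hypothetical Node class above:

```python
def predict(node: Node, x: list[float]) -> float:
    """'Drop' example x down the tree until it hits a leaf, then return
    the prediction stored in that leaf."""
    while not node.is_leaf():
        node = node.left if x[node.feature] < node.value else node.right
    return node.prediction
```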

  5. Training dataset D*, |D*| = 100 examples
     [Figure: the tree annotated with the number of examples traversing each edge: |D| = 90 and |D| = 10 out of the root split X_1 < v_1; |D| = 45 and |D| = 45 below X_2 < v_2; |D| = 25 and |D| = 30 below X_3 < v_4; |D| = 20 and |D| = 15 below X_2 < v_5]

  6. Imagine we are currently at some node G
     - Let D_G be the data that reaches G
     - There is a decision we have to make: do we continue building the tree?
       - If so, which variable and which value do we use for a split?
       - If not, how do we make a prediction? We need to build a "predictor node"

  7. Alternative view:
     [Figure: training points labeled + and - plotted in the (X_1, X_2) plane; axis-aligned splits partition the plane into rectangular regions]

  8. Requires at least a single pass over the data!

  9. How to split? Pick the attribute & value that optimizes some criterion
     - Classification: Information Gain
       - IG(Y|X) = H(Y) − H(Y|X)
       - Entropy: H(Y) = −∑_j p_j log p_j
       - Conditional entropy: H(Y|X) = ∑_{j=1}^{m} P(X = v_j) · H(Y | X = v_j)
       - Suppose X takes m values (v_1 ... v_m); H(Y|X=v) is the entropy of Y among the records in which X has value v
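     As a sanity check of these formulas, a small sketch computing entropy and information gain from empirical counts (plain Python; a base-2 logarithm is assumed):

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """H(Y) = -sum_j p_j log2 p_j over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values) -> float:
    """IG(Y|X) = H(Y) - H(Y|X), where H(Y|X) = sum_v P(X=v) * H(Y|X=v)."""
    n = len(labels)
    by_value = {}
    for y, v in zip(labels, attr_values):
        by_value.setdefault(v, []).append(y)
    h_cond = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - h_cond
```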

  10. How to split? Pick the attribute & value that optimizes some criterion
      - Regression: find the split (X_i, v) that creates D, D_L, D_R (parent, left, and right child datasets) and maximizes:
        |D| · Var(D) − (|D_L| · Var(D_L) + |D_R| · Var(D_R))
      - For ordered domains, sort X_i and consider a split between each pair of adjacent values
      - For categorical X_i, find the best split based on subsets (Breiman's algorithm)
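      A sketch of this split search for a single numerical attribute (a naive O(n^2) version for clarity; keeping running sums of y and y^2 would make the scan linear after sorting):

```python
def variance(ys) -> float:
    """Population variance of a list of target values."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Sort by attribute value, consider a split between each pair of
    adjacent distinct values, and maximize
    |D|*Var(D) - (|D_L|*Var(D_L) + |D_R|*Var(D_R))."""
    pairs = sorted(zip(xs, ys))
    ys_sorted = [y for _, y in pairs]
    total = len(pairs) * variance(ys_sorted)
    best_v, best_gain = None, float("-inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values: no threshold separates them
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left, right = ys_sorted[:i], ys_sorted[i:]
        gain = total - len(left) * variance(left) - len(right) * variance(right)
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain
```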

  11. When to stop?
      - 1) When the leaf is "pure", e.g., Var(y_i) < ε
      - 2) When the number of examples in the leaf is too small, e.g., |D| ≤ 10
      How to predict?
      - Regression: the average y_i of the examples in the leaf
      - Classification: the most common y_i in the leaf
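      A sketch of both decisions, reusing variance() from the split-search sketch above (the thresholds are the example values from the slide, not fixed by the algorithm):

```python
from collections import Counter

def should_stop(ys, eps=1e-3, min_examples=10) -> bool:
    """Stop when the leaf has too few examples or is (nearly) pure."""
    return len(ys) <= min_examples or variance(ys) < eps

def make_leaf(ys, task="regression"):
    """Predictor stored at a leaf: mean y for regression, mode for classification."""
    if task == "regression":
        return sum(ys) / len(ys)
    return Counter(ys).most_common(1)[0][0]
```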


  13. Given a large dataset with hundreds of attributes, build a decision tree!
      General considerations:
      - The tree is small (can be kept in memory): shallow (~10 levels)
      - The dataset is too large to keep in memory
      - The dataset is too big to scan over on a single machine
      - MapReduce to the rescue!
      [Figure: a tree grown by repeated FindBestSplit calls, one per internal node]


  15. PLANET: Parallel Learner for Assembling Numerous Ensemble Trees [Panda et al., VLDB '09]
      - A sequence of MapReduce jobs that builds a decision tree
      Setting:
      - Hundreds of numerical (discrete & continuous) attributes
      - Target (class) is numerical: regression
      - Splits are binary: X_j < v
      - The decision tree is small enough for each mapper to keep in memory
      - The data is too large to keep in memory

  16. [Architecture diagram: the Master coordinates FindBestSplit and InMemoryGrow MapReduce jobs; shared state includes the model, attribute metadata, intermediate results, and the input data]

  17. - The mapper loads the model and information about which attribute splits to consider
      - Each mapper sees a subset of the data D*
      - The mapper "drops" each datapoint down the tree to find the appropriate leaf node L
      - For each leaf node L it keeps statistics about:
        1) the data reaching L
        2) the data in the left/right subtree under split S
      - The reducer aggregates the statistics (1) and (2) and determines the best split for each node (see the sketch below)
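      Schematically (this is not the Hadoop API; find_leaf and emit are hypothetical helpers), the mapper can emit (count, sum, sum of squares) per node and per candidate split side, since the variance criterion on slide 10 is computable from those three numbers alone:

```python
def map_fn(x, y, model, candidate_splits):
    """For one datapoint (x, y): emit sufficient statistics (n, sum_y, sum_y2)
    for the leaf it reaches and for each candidate split side."""
    leaf = find_leaf(model, x)            # hypothetical: drop x down the tree
    emit((leaf, None), (1, y, y * y))     # stats for all data reaching the leaf
    for (j, v) in candidate_splits[leaf]:
        side = "L" if x[j] < v else "R"
        emit((leaf, (j, v, side)), (1, y, y * y))

def reduce_fn(key, values):
    """Aggregate per-key statistics; the master then picks, for each leaf,
    the split maximizing |D|*Var(D) - |D_L|*Var(D_L) - |D_R|*Var(D_R)."""
    n = sum(v[0] for v in values)
    s = sum(v[1] for v in values)
    s2 = sum(v[2] for v in values)
    emit(key, (n, s, s2))  # Var = s2/n - (s/n)**2, so these stats suffice
```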

  18. Master
      - Monitors everything (runs multiple MapReduce jobs)
      MapReduce Initialization
      - For each attribute, identify the values to be considered for splits
      MapReduce FindBestSplit
      - MapReduce job to find the best split when there is too much data to fit in memory (the hardest part)
      MapReduce InMemoryBuild
      - Similar to FindBestSplit (but for small data)
      - Grows an entire subtree once the data fits in memory
      Model file
      - A file describing the state of the model

  19. Identifies all the attribute values which need to be considered for splits
      - Splits for numerical attributes: X_j < v
        - We would like to consider every possible value v ∈ D
        - Compute an approximate equi-depth histogram on D*
        - Idea: select buckets such that the counts per bucket are equal
        [Figure: equi-depth histogram over domain values 1-20; bucket boundaries fall where cumulative counts are equal]
      - Use the boundary points of the histogram as potential splits
      - Generates the "attribute metadata" to be loaded in memory by other tasks

  20. Goal: an equal number of elements per bucket (B buckets total)
      - Construct by first sorting and then taking B−1 equally spaced splits:
        1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12 14 16 16 18 19 20 20 20
      - Faster construction: sample & take equally spaced splits in the sample
        - Gives nearly equal buckets (see the sketch below)
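      A sketch of the sampling-based construction (reservoir sampling is one way to draw the sample in a single pass; the lecture does not specify the sampling method):

```python
import random

def approx_equidepth_boundaries(stream, num_buckets, sample_size=10000):
    """Approximate equi-depth histogram: sample the attribute values in one
    pass, sort the sample, and take B-1 equally spaced sample values as
    bucket boundaries (used as candidate split points)."""
    sample = []
    for i, x in enumerate(stream):          # reservoir sampling (Algorithm R)
        if len(sample) < sample_size:
            sample.append(x)
        elif random.random() < sample_size / (i + 1):
            sample[random.randrange(sample_size)] = x
    sample.sort()
    step = len(sample) / num_buckets
    return [sample[int(k * step)] for k in range(1, num_buckets)]
```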

  21. Controls the entire process. Determines the state of the tree and grows it:
      - Decides if nodes should be split
      - If there is little data entering a node, runs an InMemoryBuild MapReduce job to grow the entire subtree
      - For larger nodes, launches MapReduce FindBestSplit to find candidates for the best split
      - Collects results from the MapReduce jobs and chooses the best split for a node
      - Updates the model

  22. The Master keeps two node queues:
      - MapReduceQueue (MRQ): nodes for which the data D is too large to fit in memory
      - InMemoryQueue (InMemQ): nodes for which the data D in the node fits in memory
      The tree will be built in levels, epoch by epoch

  23. Two MapReduce jobs:
      - FindBestSplit: processes nodes from the MRQ
        - For a given set of nodes S, computes a candidate good split predicate for each node in S
      - InMemoryBuild: processes nodes from the InMemQ
        - For a given set of nodes S, completes tree induction at nodes in S using the InMemoryBuild algorithm
      Start by executing FindBestSplit on the full data D* (a sketch of this scheduling loop follows)
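      A sketch of the Master's scheduling loop under these assumptions (run_find_best_split, run_in_memory_build, and node.split are hypothetical stand-ins for the MapReduce jobs and the model update; stopping criteria from slide 11 are omitted for brevity):

```python
def master_loop(root, n_root, memory_threshold):
    """Grow the tree epoch by epoch: nodes whose data is too large go through
    FindBestSplit MapReduce jobs; small nodes are completed in memory."""
    mrq, inmemq = [(root, n_root)], []   # queues of (node, #examples reaching it)
    while mrq or inmemq:
        if inmemq:
            run_in_memory_build(inmemq)  # hypothetical: finishes whole subtrees
            inmemq = []
        if mrq:
            # Hypothetical job: returns best split and child sizes per node.
            results = run_find_best_split([node for node, _ in mrq])
            next_mrq = []
            for node, _ in mrq:
                j, v, n_left, n_right = results[node]
                node.split(j, v)         # hypothetical model update
                for child, n in ((node.left, n_left), (node.right, n_right)):
                    target = inmemq if n <= memory_threshold else next_mrq
                    target.append((child, n))
            mrq = next_mrq
```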
