SLIDE 1

XGBOOST: A SCALABLE TREE BOOSTING SYSTEM

Advisor: Jia-Ling Koh
Speaker: Yin-Hsiang Liao
2018/04/17, from KDD 2016

SLIDE 2

Outline

• Introduction
• Method
• Experiment
• Conclusion

SLIDE 3

Introduction

• Regression tree: CART (splits chosen with the Gini index).
• Boosting: an ensemble method; an iterative procedure that adaptively changes the distribution of training examples. Example: AdaBoost.

SLIDE 4

Introduction

The most important factor behind XGBoost: scalability. It handles billions of examples.

SLIDE 5

Introduction

A practical choice:
• 17 out of 29 winning solutions on Kaggle in 2015 used XGBoost.
• All top-10 teams in KDDCup 2015 used XGBoost.
• T-Brain: used by the top-3 teams.
• Applications: ad click-through-rate prediction, malware classification, customer behavior prediction, etc.

SLIDE 6

Method

Tree ensemble model. The prediction is the sum of $K$ trees:
$\hat{y}_i = \phi(\mathbf{x}_i) = \sum_{k=1}^{K} f_k(\mathbf{x}_i), \quad f_k \in \mathcal{F}$
where each $f_k$ corresponds to a tree structure $q$ (mapping an example to a leaf index) and the leaf weights $w \in \mathbb{R}^T$ of a tree, i.e. $f(\mathbf{x}) = w_{q(\mathbf{x})}$.
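A minimal sketch of this prediction rule in Python (the Tree class below is a hypothetical illustration, not the library's data structure):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Tree:
        """A regression tree: internal nodes split on a feature; leaves hold weights."""
        feature: Optional[int] = None     # feature index to split on (None for a leaf)
        threshold: float = 0.0            # split threshold
        left: Optional["Tree"] = None
        right: Optional["Tree"] = None
        weight: float = 0.0               # leaf weight w_{q(x)}

        def predict(self, x):
            if self.feature is None:      # leaf: return its weight
                return self.weight
            child = self.left if x[self.feature] < self.threshold else self.right
            return child.predict(x)

    def ensemble_predict(trees, x):
        """y_hat = sum over k of f_k(x): add up the leaf weights from all K trees."""
        return sum(t.predict(x) for t in trees)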

SLIDE 7

Method

Regularized objective function:
$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2$
where $l$ is a differentiable convex loss function and $\Omega$ penalizes model complexity: $T$ is the number of leaves, $w$ the weights on the leaves.

Objective function
SLIDE 8

Method

Gradient tree boosting: the model is trained in an additive manner, since the ensemble includes functions as parameters and the usual optimization methods in Euclidean space cannot be applied directly.

Objective function
SLIDE 9

Method

Additive training (Boosting)
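In formulas (as in the paper), step $t$ adds the tree $f_t$ that most improves the objective:
$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i), \qquad \mathcal{L}^{(t)} = \sum_i l\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\big) + \Omega(f_t)$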

Objective function
SLIDE 10

Method

Taylor expansion:
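Expanding the loss to second order around $\hat{y}_i^{(t-1)}$ (as in the paper):
$\mathcal{L}^{(t)} \simeq \sum_i \big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(\mathbf{x}_i) + \tfrac{1}{2} h_i f_t^2(\mathbf{x}_i) \big] + \Omega(f_t)$
where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to $\hat{y}_i^{(t-1)}$.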

Objective function
SLIDE 11

Method

Grouping the expanded objective by leaves (formula below):
$I_j = \{\, i \mid q(\mathbf{x}_i) = j \,\}$ : the instance set of leaf $j$ (the $\mathbf{x}_i$ that land in leaf $j$)
$T$ : the number of leaves
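Dropping the constant terms and summing leaf by leaf:
$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \Big[ \big(\textstyle\sum_{i \in I_j} g_i\big) w_j + \tfrac{1}{2} \big(\textstyle\sum_{i \in I_j} h_i + \lambda\big) w_j^2 \Big] + \gamma T$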

Objective function
SLIDE 12

Method

For a fixed tree q, the optimal weight is:
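Setting the derivative with respect to each $w_j$ to zero:
$w_j^{*} = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$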

Objective function
SLIDE 13

Method

For a fixed tree structure $q$, the optimal weight of each leaf, and the corresponding optimal objective value (a quality score for the structure $q$), are:
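$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$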

Objective function
SLIDE 14

Method

Now, whenever the tree structure is known, we can compute its optimal value. The problem becomes: which tree is best? Since we cannot enumerate all structures, we grow the tree greedily, scoring each candidate split by the loss reduction of turning a parent leaf into a left subtree and a right subtree (formula below). The larger the reduction the better; it might even be negative. Greedy strategy.
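The loss reduction of a split, as given in the paper (left-subtree, right-subtree, and parent terms, with $I = I_L \cup I_R$):
$\mathcal{L}_{split} = \frac{1}{2}\left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$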

Objective function
SLIDE 15

Method

Preventing overfitting further:

• Shrinkage: scale each newly added tree by a factor $\eta$ (see below).
• Subsampling (column): each tree is grown on a random subset of the features.
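$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(\mathbf{x}_i)$ : the learning rate $\eta$ reduces the influence of each individual tree and leaves room for future trees to improve the model.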

Objective function
SLIDE 16

Method

• Basic exact greedy algorithm
• Approximate algorithm: global and local variants

Split Finding
SLIDE 17

Method

Basic Exact Greedy Algorithm:

Split Finding

When to stop?
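A sketch of exact greedy split finding for one node, in Python: scan every feature in sorted order and keep the split with the largest gain (the function and argument names are illustrative, not the library's implementation; g and h are the gradient statistics from slide 10, and the gain is the formula from slide 14):

    import numpy as np

    def best_split(X, g, h, lam=1.0, gamma=0.0):
        """Enumerate every split on every feature; return (gain, feature, threshold)."""
        n, m = X.shape
        G, H = g.sum(), h.sum()                        # parent statistics
        parent_score = G * G / (H + lam)
        best = (-np.inf, None, None)
        for k in range(m):
            order = np.argsort(X[:, k])                # sort instances by feature k
            GL = HL = 0.0
            for idx in range(n - 1):
                i = order[idx]
                GL += g[i]; HL += h[i]                 # grow the left statistics
                if X[order[idx], k] == X[order[idx + 1], k]:
                    continue                           # can only split between distinct values
                GR, HR = G - GL, H - HL
                gain = 0.5 * (GL * GL / (HL + lam)
                              + GR * GR / (HR + lam)
                              - parent_score) - gamma
                if gain > best[0]:
                    thr = 0.5 * (X[order[idx], k] + X[order[idx + 1], k])
                    best = (gain, k, thr)
        return best

    # One natural stopping rule: stop splitting when the best gain is <= 0,
    # since gamma makes unhelpful splits have negative loss reduction.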

SLIDE 18

Method

The exact greedy algorithm is good, since it enumerates all possible splits, but when the data cannot fit in memory, thrashing slows the whole system down. Hence the approximations:

Split Finding
SLIDE 19

Method

Local vs. global proposals:
• Global: candidate split points are proposed once, before building the tree; fewer proposal steps, but more candidate points are needed for the same accuracy.
• Local: candidates are re-proposed after each split, refining them as the tree grows.

Split Finding
SLIDE 20

Method

Weighted quantile sketch: candidate split points are chosen so that each interval carries the same total second-order weight $h_i$, i.e., the same "impact" on the objective function.
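This works because the second-order objective can be rewritten as a squared loss weighted by $h_i$. The rank function from the paper, over the pairs $D_k = \{(x_{ik}, h_i)\}$ of feature values and second-order gradients:
$r_k(z) = \frac{1}{\sum_{(x,h) \in D_k} h} \sum_{(x,h) \in D_k,\ x < z} h$
Candidates $\{s_{k1}, \dots, s_{kl}\}$ must satisfy $|r_k(s_{k,j}) - r_k(s_{k,j+1})| < \epsilon$, giving roughly $1/\epsilon$ candidate points per feature.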

Split Finding
SLIDE 21

Method

Sparsity-aware split finding. Possible reasons for sparsity:
• missing values
• frequent zero entries
• artifacts of feature engineering (like one-hot encoding)
Solution: a default direction in each node for the non-present entries.

Split Finding
SLIDE 22

Method

Split Finding

Sort criteria: entries with missing values go last, and only the non-missing entries are enumerated. Learn the best default direction for each feature by trying both directions and keeping the one with the higher gain.

SLIDE 23

Method

Non-presence is treated as a missing value, and the algorithm only visits the present entries. About 50x faster than the naive version on the Allstate dataset.
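A sketch of the default-direction idea for a single sparse feature column, in Python (the names and the score helper are illustrative; values and rows hold only the present entries):

    import numpy as np

    def score(GL, HL, GR, HR, lam=1.0):
        """Sum of the left and right leaf quality scores."""
        return GL * GL / (HL + lam) + GR * GR / (HR + lam)

    def best_split_sparse(values, rows, g, h, lam=1.0):
        G, H = g.sum(), h.sum()                    # totals over all instances
        Gp, Hp = g[rows].sum(), h[rows].sum()      # totals over present entries
        Gm, Hm = G - Gp, H - Hp                    # totals over missing entries
        best, best_dir = -np.inf, None
        GL = HL = 0.0
        for pos in np.argsort(values)[:-1]:        # scan present entries in value order
            GL += g[rows[pos]]; HL += h[rows[pos]]
            # missing entries sent right: left side holds only present entries seen so far
            s_right = score(GL, HL, G - GL, H - HL, lam)
            # missing entries sent left: add the missing totals to the left side
            s_left = score(GL + Gm, HL + Hm, Gp - GL, Hp - HL, lam)
            if max(s_right, s_left) > best:
                best = max(s_right, s_left)
                best_dir = "right" if s_right >= s_left else "left"
        return best, best_dir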

Split Finding
SLIDE 24

Method

The most time-consuming part of tree learning is sorting the data. XGBoost sorts only once, before training, and stores the result in an in-memory unit: the block.

System Design
SLIDE 25

Method

Each block stores the data in compressed sparse column (CSC) format, with each column sorted by feature value. Different blocks can be distributed across machines, or stored on disk in the out-of-core setting.
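A small illustration of the per-column layout using SciPy's CSC container (a simplification of the actual block structure): the sort is done once, and every later split search just scans each column linearly.

    import numpy as np
    from scipy.sparse import csc_matrix

    X = csc_matrix(np.array([[1.0, 0.0],
                             [3.0, 2.0],
                             [2.0, 0.0]]))

    # For each column: its present (row, value) pairs, pre-sorted by value.
    sorted_columns = []
    for k in range(X.shape[1]):
        start, end = X.indptr[k], X.indptr[k + 1]
        rows, vals = X.indices[start:end], X.data[start:end]
        order = np.argsort(vals)
        sorted_columns.append((rows[order], vals[order]))

    print(sorted_columns[0])   # rows [0, 2, 1], values [1., 2., 3.]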

System Design
SLIDE 26

Method

The block structure helps split finding, but it causes non-contiguous memory access: gradient statistics are fetched by row index in the order given by the sorted feature values. Solution: allocate an internal buffer in each thread, prefetch the gradient statistics into it, and accumulate in a mini-batch manner (cache-aware access).

System Design
SLIDE 27

Method

Block size matters (it bounds the maximum number of examples per block). Blocks that are too small result in a small workload for each thread and inefficient parallelization; blocks that are too large lead to cache misses. Balance! (The paper settles on 2^16 examples per block.)

System Design
SLIDE 28

Method

Out-of-core computation:
• Block compression: each column is compressed on disk and decompressed on the fly by an independent thread. Ex: [0, 2, 2, 0, 1, 2]
• Block sharding: the data is sharded across multiple disks, and a prefetch thread is assigned to each disk.

System Design
SLIDE 29

Experiment

The open-source package: github.com/dmlc/xgboost
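A minimal usage sketch with the Python package (the hyperparameter values here are arbitrary illustrations):

    import numpy as np
    import xgboost as xgb

    # Toy regression data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

    dtrain = xgb.DMatrix(X, label=y)           # XGBoost's internal data structure
    params = {
        "objective": "reg:squarederror",       # differentiable convex loss l
        "eta": 0.3,                            # shrinkage (learning rate)
        "lambda": 1.0,                         # L2 penalty on leaf weights
        "gamma": 0.0,                          # penalty per leaf
        "colsample_bytree": 0.8,               # column subsampling
    }
    bst = xgb.train(params, dtrain, num_boost_round=20)
    preds = bst.predict(xgb.DMatrix(X))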

SLIDE 30

Experiment

Classification: R's GBM expands only one branch of a tree, while the other two systems (XGBoost and scikit-learn) expand the full tree.

SLIDE 31

Experiment

Learning to rank: compared against pGBRT, the best previously published system for this task. Note that pGBRT only supports the approximate algorithm.

SLIDE 32

Experiment

Out-of-core experiment: block compression gives about a 3x speedup, and sharding onto two disks gives a further 2x speedup.

SLIDE 33

Conclusion

The most important feature: scalability! Lessons from building XGBoost: sparsity-aware split finding, the weighted quantile sketch, cache-aware access, and parallelization.


Fin.