Adaptive Incremental Learning for Statistical Relational Models - PowerPoint PPT Presentation



SLIDE 1

Adaptive Incremental Learning for Statistical Relational Models Using Gradient-Based Boosting

Yulong Gu and Paolo Missier

Presenter: Yulong Gu School of Computing, Newcastle University UK

SLIDE 2

Outline

  • Background
  • Relational Functional Gradient Boosting (RFGB)
  • Top-down Induction of first-order logical decision trees (TILDE)
  • Concept-Adapting Very Fast Decision Tree (CVFDT)
  • Hoeffding Relational Regression Tree (HRRT)
  • Rule Stability Metric for CVFDT
  • Relational Incremental Boosting (RIB)
  • Relational Boosted Forest (RBF)


SLIDE 3

Problem

Supervised learning with a dataset that is:

  • Incomplete – contains missing values
  • Imbalanced – negative instances far outnumber positive instances
  • Large-scale – updating the model is more cost-efficient than rebuilding it
  • Evolving – subject to concept drift
  • Multi-relational – objects are connected in meaningful ways


SLIDE 4

Solution System Design

Data Properties:

  • 1. Multi-relational
  • 2. Imbalanced
  • 3. Large-scale
  • 4. Evolving
  • 5. Incomplete

[Diagram: a data-driven Statistical Relational Model (Relational Dependency Network, Markov Logic Network) built on the Relational Functional Gradient Boosting framework, combined with the Relational Soft Margin Approach, Structural Expectation Maximization, and Adaptive Incremental Learning to address these data properties.]


SLIDE 5

Relational Functional Gradient Boosting

[Diagram: Education predicates (Study Hard, Go to College, Academic Awards) and Career/Startup predicates (Work at fast food joint (Y), Profit more than N, Start a Startup Company); structure and parameters are learned as Relational Regression Trees, which are then combined by boosting.]

Want to build a statistical relational model out of these predicates?

  • Learn an RRT for each predicate, encoding both dependencies and parameters
  • Learn multiple weak models rather than a single complex model


Natarajan, S. (2012). RFGB. Machine Learning
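As a concrete aside (not from the slides), the functional gradient that RFGB fits each new regression tree to is the difference between the true label indicator and the current predicted probability, I(y) − P(y). A minimal Python sketch, with all names my own:

```python
import math

def sigmoid(psi):
    return 1.0 / (1.0 + math.exp(-psi))

def functional_gradients(psi_values, labels):
    # Pointwise functional gradient of the log-likelihood:
    # Delta = I(y = true) - P(y = true | current model)
    return [(1.0 if y else 0.0) - sigmoid(psi)
            for psi, y in zip(psi_values, labels)]

# One boosting round fits the next regression tree to these residuals.
grads = functional_gradients([0.0, 0.0], [True, False])
# At psi = 0 the model predicts P = 0.5, so the gradients are +0.5 and -0.5.
```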

SLIDE 6

Hoeffding Relational Regression Tree (HRRT)

How to incrementally learn a Relational Regression Tree?

How to learn a relational regression tree? TILDE:

  • allows conjunctions of predicates
  • extensions allow conjunctions of recursive and aggregated predicates

How to learn a regression tree incrementally? CVFDT:

  • learns the predicate at a node from a fraction of the streaming data
  • concept-adapting

TILDE + CVFDT = HRRT

[Diagram: an HRRT over Work at fast food joint, Go to College, and Distinction, with leaf regression values −0.2 and 0.5, maintained over a sliding window.]

Positive example: person(Eric), workatFFJ(Eric), college(Eric), distinction(Eric)

  • Sufficient statistics updated (distinction + 1, startup + 0, …)
  • Fork and calculate the regression value
  • Split only when the Hoeffding bound is satisfied


Blockeel, H., & De Raedt, L. (1998). TILDE. Artificial Intelligence; Hulten, G. (2001). CVFDT. KDD. [Figure: CVFDT splitting strategy]

SLIDE 7

Hoeffding Relational Regression Tree (HRRT)

Hoeffding Bound:

  • With a desired confidence, the upper bound on the difference between the true mean and the observed mean of a random variable depends on the number of observations.
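The bound in the bullet above can be sketched numerically. This is a minimal illustration (function names are mine, and the score range R is assumed known), using the standard Hoeffding form ε = sqrt(R² ln(1/δ) / 2n):

```python
import math

def hoeffding_bound(value_range, delta, n):
    # With probability 1 - delta, the true mean of a random variable whose
    # values span `value_range` lies within this epsilon of the mean
    # observed over n samples.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_score, second_score, value_range, delta, n):
    # Split only when the observed gap between the best and second-best
    # candidate tests exceeds the Hoeffding bound.
    return (best_score - second_score) > hoeffding_bound(value_range, delta, n)
```

After 100 examples at 99% confidence the bound is roughly 0.152, so a score gap of 0.3 justifies a split while a gap of 0.05 does not.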

[Diagram: the HRRT from the previous slide; the positive example person(Eric), workatFFJ(Eric), college(Eric), distinction(Eric) updates the sufficient statistics (distinction + 1, startup + 0, …) at the Distinction node, whose candidate leaves hold regression values −0.3 and 0.1.]

Example: after the update of the sufficient statistics, the node has seen 100 examples. With 99% certainty, the difference between the true Gain(distinction) − Gain(startup) and the observed one is less than the pre-defined ε, so the Hoeffding bound is satisfied and the node splits.


SLIDE 8

Hoeffding Relational Regression Tree (HRRT)

[Diagram: CVFDT with an alternative subtree, and after substitution; over the sliding window, the tree over Work at fast food joint, Go to College, Start a Startup Company, and Study Hard is replaced by a smaller tree over Work at fast food joint and Start a Startup Company once the alternative wins.]

How does CVFDT adapt to concept drift?

  • Maintain a set of alternative subtrees for each node, each with a different predicate than the original one
  • Periodically re-check the Hoeffding bound at each node; if it fails, add a new subtree with the currently best predicate to the node's subtree set
  • Once one of the subtrees outperforms the original one, the winning subtree replaces the original subtree, which is discarded entirely
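The three steps above can be sketched as a toy class. This is a loose illustration with hypothetical names, not the actual CVFDT data structures:

```python
class Node:
    # Toy stand-in for one CVFDT node: `error` is assumed to be maintained
    # externally by evaluating the node's subtree on sliding-window data.
    def __init__(self, predicate):
        self.predicate = predicate
        self.error = 0.0
        self.alternatives = []

    def spawn_alternative(self, predicate):
        # Called when the periodic Hoeffding-bound re-check fails here.
        alt = Node(predicate)
        alt.error = float("inf")  # must prove itself on new data first
        self.alternatives.append(alt)

    def maybe_substitute(self):
        # Once an alternative outperforms this node, the winner replaces it
        # and the original subtree is discarded entirely.
        better = [a for a in self.alternatives if a.error < self.error]
        if better:
            return min(better, key=lambda a: a.error)
        return self
```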


SLIDE 9

Hoeffding Relational Regression Tree (HRRT)

[Diagram: the same CVFDT alternative-subtree substitution as on the previous slide.]

Why is CVFDT not good enough?

  • Less responsive – a new concept needs many counter-examples to invalidate the old concept
  • Larger prediction variance – old concepts are discarded entirely based on a relatively small amount of data
  • Hard to maintain and analyse – one single complex model


Kolter, J. (2007). DWM. J. Mach. Learn. Res.

SLIDE 10

Ensemble Methods for Relational Adaptive Incremental Learning

Boosting: P(Y = True | Pa(Y)) = βA + γD, with β = γ = 1

Ensemble Methods for Concept Drift:

  • Boosting, Bagging, Weighted Majority...
  • Train multiple weak models to represent conflicting rules.
  • Each weak model contributes to the final prediction.

[Diagram: Weak Model 1 with weight β (leaves A, B, C over Go to College and Study Hard) and Weak Model 2 with weight γ (leaves D, E over Start a Startup Company), both predicting Work at fast food joint (Y).]

Weighted Majority: P(Y = True | Pa(Y)) = βA + γD
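A minimal sketch of the weighted combination (names are mine; the deck later squashes the summed regression values through a sigmoid to obtain a probability, so this sketch does the same):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_predict(leaf_values, weights):
    # Weighted sum of the regression values returned by each weak model,
    # squashed into a probability: P(Y = True | Pa(Y)) = Sig(sum_i w_i * v_i).
    return sigmoid(sum(w * v for w, v in zip(weights, leaf_values)))

# Boosting corresponds to beta = gamma = 1; here weak model 1 returns leaf
# value A = 0.3 and weak model 2 returns leaf value D = -0.1.
p = ensemble_predict([0.3, -0.1], [1.0, 1.0])
```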


SLIDE 11

Rule Stability Metric

Definition 1.

  • Define the Rule Stability of a model as n, the size of the smallest change in sample D that may cause a new rule r′ to become superior to the working rule r. In the following equation, D′ is D after the change:

    Learner: Diff(D, D′) = n, r → r′ (1)

When we apply Rule Stability to a tree trained with HRRT, we can prove that:

  • With confidence 1 − ε, the size of the smallest change that may cause r′ to become superior to r is:

    Tolerance = ΔH̄(Xa, Xb) − ϑ (2)

  • ΔH̄(Xa, Xb) is the average difference between the scores of tests Xa and Xb evaluated by the splitting function H(Xi), and ϑ is the parameter obtained from the Hoeffding inequality given n and a desired confidence ε.
  • Tolerance measures the rule stability of an inner node, and we define

    TreeTol = Σ over inner nodes of Tolerance(node) (3)

    as the stability of the tree.
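Equations (2) and (3) can be sketched numerically. A minimal illustration assuming the score range is 1 and using the standard Hoeffding form for ϑ (function names are mine):

```python
import math

def hoeffding_parameter(n, confidence):
    # vartheta from the Hoeffding inequality given n observations and a
    # desired confidence, with the score range taken to be 1.
    return math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2.0 * n))

def node_tolerance(delta_h_bar, n, confidence):
    # Eq. (2): mean score gap between the winning test and the runner-up,
    # minus the Hoeffding parameter.
    return delta_h_bar - hoeffding_parameter(n, confidence)

def tree_tolerance(node_tolerances):
    # Eq. (3): the tree's stability is the sum over its inner nodes.
    return sum(node_tolerances)
```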


SLIDE 12

Established Rules

Combine HRRT and Rule Stability to enable ensemble methods to handle concept drift. When is a weak model good enough to represent the current rules?

  • It passes the rule stability check with the current sliding-window data, and
  • It has been boosted using the current sliding-window data

[Diagram: an initial HRRT passes the Rule Stability Check; functional gradient ascent turns training examples into functional gradient examples, and boosting learns a Functional Gradient Tree from the functional gradient function.]

Established Rules:

We boost an initial HRRT once it is stable (it passes the Rule Stability Check), so that the objective functional is best optimised for the current sliding-window data, and the stable rules are transformed into established rules.

SLIDE 13

Relational Incremental Boosting

P(Y = True | Pa(Y)) = A0 + A1 + ⋯ + An, where Ai is the leaf regression value contributed by the i-th tree


[Diagram: three functional gradient trees with leaves A0/B0/C0 (Go to College, Distinction), A1/B1/C1 (Go to College, Failed), and An/Bn/Cn (Start a Startup Company, Profit more than N), all predicting Work at fast food joint (Y).]

Pipeline: data stream d0 → initial HRRT t0 → passes the Rule Stability Check (RC) → boost: t0 → b0; functional gradient of b0 + data stream d1 → Functional Gradient Tree t1 → passes RC → boost: b0 + t1 → b1; … ; functional gradient of bn−1 + data stream dn → Functional Gradient Tree tn → boost: b0 + b1 + … + tn → bn.

SLIDE 14

Relational Incremental Boosting

Discard poorly performing FGTs over time.

Evaluation Centre for RIB:

  • Monitors global performance and each FGT's contribution to the error
  • Under no concept drift (strong consistency with the training data over time), reduces complexity by setting S to False

SLIDE 15

Relational Incremental Boosting Example

[Diagram: Time Line with three conflicting trees added at Time Points 1, 2, and 3, each predicting Work at fast food joint (Y); their leaf regression values (−0.2/0.8 for Go to College and Distinction, −1.6/1.0 for Go to College and Failed, −1.8/0.8 and −0.5 for Start a Startup Company and Profit more than N) are summed along the timeline.]

The decomposability of ensemble methods allows direct event analysis of time series from the real-time, incrementally learned model.

Assume P(Y | Pa(Y)) = Sig(x), where Sig is a sigmoid function, x is the regression value, and Y in the following examples is the predicate 'Work at fast food joint'.

Scenario at Time Point 1: College and Distinction make it less likely to work at a fast food joint, because fast food joints pay less competitively:

P(Y = True | college, distinction) = Sig(−0.2)
P(Y = True | college, failed) = Sig(0.8)

Scenario at Time Point 2: College and Failed make it less likely to work at a fast food joint, because fast food joints pay extremely well over this period:

P(Y = True | college, distinction) = Sig(−0.2 + 1.0) = Sig(0.8)
P(Y = True | college, failed) = Sig(0.8 − 1.6) = Sig(−0.8)

Scenario at Time Point 3: owning a start-up with profit more than N makes it less likely to work at a fast food joint, due to a tightening job market:

P(Y = True | college, distinction) = Sig(−0.2 + 1.0 − 0.5) = Sig(0.3)
P(Y = True | college, failed) = Sig(0.8 − 1.6 − 0.5) = Sig(−1.3)
P(Y = True | startup, profitmorethanN) = Sig(0.5 + 0.5 − 1.8) = Sig(−0.8)
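The running sums above can be checked directly. A minimal Python rendition of the college/distinction path (values taken from the slide):

```python
import math

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

# Regression values accumulated along the college/distinction path as new
# functional gradient trees are added over time.
p_t1 = sig(-0.2)              # Time Point 1
p_t2 = sig(-0.2 + 1.0)        # Time Point 2: second tree flips the trend
p_t3 = sig(-0.2 + 1.0 - 0.5)  # Time Point 3: third tree moderates it
```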

SLIDE 16

Relational Boosted Forest

P(Y = True | Pa(Y)) = W⊤ · A, where W = {w0, w1, …, wn} are the assigned weights and A = {A0, A1, …, An} are the trees' regression values

[Diagram: the same three boosted trees as in RIB, each predicting Work at fast food joint (Y).]

Pipeline: data stream d0 → HRRT t0 → passes RC → boost: t0 → b0, assigned weight w0; data stream d1 → Functional Gradient Tree t1 → passes RC → boost: t0 + t1 → b1, assigned weight w1; … ; data stream dn → Functional Gradient Tree tn → boost: t0 + t1 + … + tn → bn, assigned weight wn.

SLIDE 17

Relational Boosted Forest

  • The weight-update strategy is inspired by Dynamic Weighted Majority (DWM).
  • The weights are initialised to 1.
  • When a boosted tree makes a mistake in a prediction, the evaluation centre decreases its weight by a certain proportion.
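The bullets above can be sketched as follows; a loose DWM-inspired illustration (the value of beta and the discard threshold are illustrative, not from the slides):

```python
def update_weights(weights, made_mistake, beta=0.5, threshold=0.05):
    # Multiply the weight of every boosted tree that erred by beta,
    # renormalise so the weights sum to 1, then discard any tree whose
    # normalised weight has fallen below the threshold.
    w = [wi * (beta if m else 1.0) for wi, m in zip(weights, made_mistake)]
    total = sum(w)
    w = [wi / total for wi in w]
    return [wi for wi in w if wi >= threshold]

# Three trees start at weight 1; the third one makes a mistake, so its
# weight halves before normalisation: weights become 0.4, 0.4, 0.2.
w = update_weights([1.0, 1.0, 1.0], [False, False, True])
```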


Kolter, J. (2007). DWM. J. Mach. Learn. Res.

Discard trees whose normalised weights fall below a pre-defined threshold.

Evaluation Centre for RBF:

  • Monitors global performance
  • Under no concept drift (strong consistency with the training data over time), reduces complexity by setting S to False

SLIDE 18

Conclusion

  • We have proposed three adaptive incremental learning algorithms:
      • Hoeffding Relational Regression Tree (HRRT)
      • Relational Incremental Boosting (RIB)
      • Relational Boosted Forest (RBF)
  • All three algorithms can incrementally and adaptively learn the parameters and structure simultaneously for SRL models such as MLNs and RDNs.
  • RIB and RBF extend classical ensemble methods to the relational setting to handle concept drift.
  • All three algorithms are compatible with existing RFGB-based algorithms such as Structural EM and the Soft Margin Approach. Combining these extensions allows us to learn a model from an incomplete, imbalanced, large-scale, and evolving multi-relational dataset in an incremental manner.


SLIDE 19

References

  • [1] Neville, J., & Jensen, D. (2007). Relational dependency networks. Journal of Machine Learning Research.
  • [2] Natarajan, S., Khot, T., Kersting, K., Gutmann, B., & Shavlik, J. (2012). Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning.
  • [3] Yang, S., Khot, T., Kersting, K., Kunapuli, G., Hauser, K., & Natarajan, S. (2015). Learning from imbalanced data in relational domains: A soft margin approach. ICDM.
  • [4] Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research.
  • [5] Khot, T., Natarajan, S., Kersting, K., & Shavlik, J. (2015). Gradient-based boosting for statistical relational learning: The Markov logic network and missing data cases. Machine Learning.
  • [6] Blockeel, H., & De Raedt, L. (1998). Top-down induction of first-order logical decision trees. Artificial Intelligence.
  • [7] Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD).
  • [8] Huynh, T. N., & Mooney, R. J. (2011). Online structure learning for Markov logic networks. ECML PKDD.
  • [9] Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. KDD.
  • [10] Li, R.-H., & Belford, G. G. (2002). Instability of decision tree classification algorithms. KDD.
  • [11] Kolter, J., & Maloof, M. (2007). Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8, 2755–2790.
