

SLIDE 1

A Pruned Problem Transformation Method for Multi-label Classification

Jesse Read

jmr30@cs.waikato.ac.nz

University of Waikato

A Pruned Problem Transformation Method for Multi-label Classification – p. 1/2

SLIDE 2

Outline

  • Single-label classification
  • Multi-label classification
  • Problem Transformation
  • Binary Method
  • Combination Method
  • PPT: A Pruned Problem Transformation method
  • Experiments I
  • PPT-ext: PPT extended
  • Experiments II
  • Summary

SLIDE 12

Single-label (Multi-class) Classification

Set of documents D. Set of labels L.
For each d ∈ D, select a label l ∈ L.
Single-label representation: (d, l)

e.g. L = {Sport, Environment, Science, Politics}:

Document (d ∈ D)                                             Label (l ∈ L)
“NZ scientists help discover solar system in our galaxy...”  Science
“Antarctic food chain in danger...”                          Science
“Top sports stars fuelling success...”                       Sport
“Steeled for ironman...”                                     Sport
“Greens claim report doctored...”                            Politics
“Revealed: Polluting impact of humans on the oceans...”      Environment
“Union muzzled while awaiting poll watchdog’s ruling...”     Politics
“Technology pushes sporting boundaries...”                   Science

SLIDE 22

Multi-label Classification

Set of documents D. Set of labels L.
For each d ∈ D, select a label subset S ⊆ L.
Multi-label representation: (d, S)

e.g. L = {Sport, Environment, Science, Politics}:

Document (d ∈ D)                                             Label (S ⊆ L)
“NZ scientists help discover solar system in our galaxy...”  {Science}
“Antarctic food chain in danger...”                          {Science, Environment}
“Top sports stars fuelling success...”                       {Sport}
“Steeled for ironman...”                                     {Sport}
“Greens claim report doctored...”                            {Politics, Environment}
“Revealed: Polluting impact of humans on the oceans...”      {Environment, Science}
“Union muzzled while awaiting poll watchdog’s ruling...”     {Politics}
“Technology pushes sporting boundaries...”                   {Sport, Science}

SLIDE 23

Applications of ML Classification

Using Machine Learning to train from manually multi-labelled documents, and learn to automatically classify new documents with multi-labels (AKA ‘tags’).

  • News articles
  • Encyclopedia articles
  • Academic papers (categories, key words)
  • Emails
  • Internet forum posts
  • Web pages (as bookmarks, web directories)
  • RSS feeds
  • Biological applications (genes, etc.)

SLIDE 24

Problem Transformation

Single-label classification: Analyse a document, make a classification.
Multi-label classification: Analyse a document, ...?

  • Solution 1: Make several (single-label) decisions
  • Solution 2: Make one (single-label) decision involving multiple labels

This involves transforming a multi-label problem into one or more single-label problems (and back again), i.e. Problem Transformation.

SLIDE 26

Solution 1. Binary Method

Several single-label classifiers make several binary decisions (a label is relevant, or ¬relevant (1/0)).

Label set: L = {Sport, Environment, Science, Politics}

Multi-label D_train; (d, S ⊆ L):

d1, {Sports, Politics}
d2, {Science, Politics}
d3, {Sports}
d4, {Environment, Science}

Single-label D_train; (d, l ∈ {0, 1}):

C_Sport    C_Envir.   C_Science  C_Politics
(d1, 1)    (d1, 0)    (d1, 0)    (d1, 1)
(d2, 0)    (d2, 0)    (d2, 1)    (d2, 1)
(d3, 1)    (d3, 0)    (d3, 0)    (d3, 0)
(d4, 0)    (d4, 1)    (d4, 1)    (d4, 0)

dx = “Revealed: Polluting Impact of Humans on the Oceans”

Single-label Test; (d, l ∈ {0, 1}):

C_Sport    C_Envir.   C_Science  C_Politics
(dx, 0)    (dx, 1)    (dx, 1)    (dx, 0)

Multi-label Test; (d, S ⊆ L):

dx, {Environment, Science}

Assumes that all labels are independent
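
The transformation above can be sketched in a few lines of Python. This is an illustration only, not the author's implementation: the function names `to_binary_problems` and `from_binary_predictions` are hypothetical, no classifier is trained, and the decisions for dx are simply the values listed on the slide. Label names follow the spelling used in the training data ("Sports").

```python
LABELS = ["Sports", "Environment", "Science", "Politics"]

# Multi-label training set from the slide: (document, label subset).
D_TRAIN = [
    ("d1", {"Sports", "Politics"}),
    ("d2", {"Science", "Politics"}),
    ("d3", {"Sports"}),
    ("d4", {"Environment", "Science"}),
]


def to_binary_problems(examples, labels):
    """Build one single-label (0/1) training set per label."""
    return {
        label: [(doc, int(label in subset)) for doc, subset in examples]
        for label in labels
    }


def from_binary_predictions(decisions):
    """Turn per-label 0/1 decisions back into a label subset."""
    return {label for label, relevant in decisions.items() if relevant}


binary_train = to_binary_problems(D_TRAIN, LABELS)

# Decisions for dx as listed on the slide (not produced by a model here).
dx_decisions = {"Sports": 0, "Environment": 1, "Science": 1, "Politics": 0}
dx_labels = from_binary_predictions(dx_decisions)
```

Each per-label training set is built without reference to the others, which is where the independence assumption enters.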

SLIDE 33

Solution 2. Combination Method

One decision involves multiple labels. Each label combination becomes an atomic label.

Label set: L = {Sport, Environment, Science, Politics}

Multi-label D_train; (d, S ⊆ L):

d1, {Sports, Politics}
d2, {Science, Politics}
d3, {Sports}
d4, {Environment, Science}

Single-label D_train; (d, l ∈ distinct(l ∈ SL D_train)):

d1, Sports_Politics
d2, Science_Politics
d3, Sports
d4, Environment_Science

dx = “Revealed: Polluting Impact of Humans on the Oceans”

Single-label Test; (d, l ∈ distinct(l ∈ SL D_train)):

dx, Environment_Science

Multi-label Test; (d, S ⊆ L):

dx, {Environment, Science}

  • May generate many labels from a few examples
  • Can only predict combinations seen in the training set
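
A minimal Python sketch of this transformation, under stated assumptions: the helper names `to_atomic` and `from_atomic` are hypothetical, labels are joined in the order of the label set so that the names match the slide, and label names are assumed not to contain the separator. No classifier is trained; the prediction for dx is the one given on the slide.

```python
LABELS = ["Sports", "Environment", "Science", "Politics"]
SEP = "_"  # separator used on the slide, e.g. Sports_Politics

D_TRAIN = [
    ("d1", {"Sports", "Politics"}),
    ("d2", {"Science", "Politics"}),
    ("d3", {"Sports"}),
    ("d4", {"Environment", "Science"}),
]


def to_atomic(subset):
    """Join a label subset into one atomic label, in label-set order."""
    return SEP.join(label for label in LABELS if label in subset)


def from_atomic(atomic):
    """Split an atomic label back into a label subset."""
    return set(atomic.split(SEP))


single_train = [(doc, to_atomic(subset)) for doc, subset in D_TRAIN]

# The single-label classifier can only choose among these classes,
# which is why unseen combinations can never be predicted.
classes = sorted({atomic for _, atomic in single_train})

dx_labels = from_atomic("Environment_Science")
```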

SLIDE 41

Initial Conclusions

The Combination Method does best, because it incorporates information about the relationships between labels, e.g.:

  • label X may only ever occur by itself
  • labels X and Y may occur together often
  • labels X and Y may never occur together

But, it...

  • often generates too many labels
  • becomes overwhelmed by so many labels

How can we improve?

  • 90% of label combinations are found in only 10% of the data
  • concentrate on the key label combinations!

SLIDE 42

Pruned Problem Transformation (PPT)

Prune away all examples with infrequent label subsets.

e.g. 10 examples, 6 combinations:

Doc.  Labels (S ⊆ L)
d1    {Sports, Science}
d2    {Environment, Science, Politics}
d3    {Sports}
d4    {Environment, Science}
d5    {Science}
d6    {Sports}
d7    {Environment, Science}
d8    {Politics}
d9    {Politics}
d10   {Science}

The label subsets of d1 and d2 each occur only once, so these two examples are pruned:

Doc.  Labels (S ⊆ L)
d1    {Sports, Science}
d2    {Environment, Science, Politics}

Lost 20% of data. Can we save any of that data?

  • Yes. By splitting up S into more frequent subsets of S

Doc.  Labels (S ⊆ L)                     Split into
d1    {Sports, Science}                  {Sports} and {Science}
d2    {Environment, Science, Politics}   {Environment, Science} and {Politics}

Result: 12 examples, 4 combinations:

Doc.  Labels (S ⊆ L)
d1    {Sports}
d1    {Science}
d2    {Environment, Science}
d2    {Politics}
d3    {Sports}
d4    {Environment, Science}
d5    {Science}
d6    {Sports}
d7    {Environment, Science}
d8    {Politics}
d9    {Politics}
d10   {Science}
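
The pruning and splitting steps can be sketched as follows. This is one possible reading, not the author's algorithm: the slides do not say how a pruned subset is decomposed, so the greedy choice of the largest non-overlapping frequent subsets is an assumption made to reproduce the example, and `prune_and_split` and `min_count` are hypothetical names.

```python
from collections import Counter

EXAMPLES = [
    ("d1", frozenset({"Sports", "Science"})),
    ("d2", frozenset({"Environment", "Science", "Politics"})),
    ("d3", frozenset({"Sports"})),
    ("d4", frozenset({"Environment", "Science"})),
    ("d5", frozenset({"Science"})),
    ("d6", frozenset({"Sports"})),
    ("d7", frozenset({"Environment", "Science"})),
    ("d8", frozenset({"Politics"})),
    ("d9", frozenset({"Politics"})),
    ("d10", frozenset({"Science"})),
]


def prune_and_split(examples, min_count=2):
    """Keep frequent label subsets; split infrequent ones into frequent parts."""
    counts = Counter(labels for _, labels in examples)
    frequent = {s for s, n in counts.items() if n >= min_count}
    result = []
    for doc, labels in examples:
        if labels in frequent:
            result.append((doc, labels))
            continue
        # Assumed strategy: largest frequent subsets first, no overlap.
        candidates = sorted(
            (s for s in frequent if s <= labels),
            key=lambda s: (-len(s), -counts[s], sorted(s)),
        )
        remaining = set(labels)
        for s in candidates:
            if s <= remaining:
                result.append((doc, s))
                remaining -= s
        # An example with no frequent subset at all is simply dropped.
    return result


pruned = prune_and_split(EXAMPLES)
combination_counts = Counter(labels for _, labels in pruned)
```

On the slide's data this yields 12 examples over 4 combinations, each occurring three times.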

SLIDE 51

Experiments I. Accuracy.

[Figure: four plots of accuracy (y-axis) against pruning value 1–5 (x-axis) for CM, BM and PPT. Panels and y-axis ranges: Medical Dataset (75–78), Scene Dataset (56–72), Yeast Dataset (50–55.5), Enron Dataset (28.8–30.6).]

SLIDE 52

Experiments I. Build Time.

[Figure: four plots of build time (y-axis) against pruning value 1–5 (x-axis) for CM, BM and PPT. Panels and y-axis ranges: Medical Dataset (20–200), Scene Dataset (20–75), Yeast Dataset (20–180), Enron Dataset (1000–7000).]

SLIDE 53

PPT: Initial Conclusions

  • Fast
  • Superior to BM and CM for some pruning range
  • ... except Enron, where labelling is very irregular (44% as many distinct label combinations as total examples)
  • PPT can’t form new combinations
  • the Binary Method can (and does better because of this)
  • The Binary Method combines several single labels to create a multi-label prediction
  • Can we combine multiple labels to create new multi-label predictions?

SLIDE 56

PPT Extended (PPT-ext)

Yes. Given a test example dx (about Sports and Science)...

Look at a posterior probability for each possible existing combination:

Combination (S)        P(S|dx)
{Sports, Politics}     0.2
{Science, Politics}    0.2
{Sports}               0.3
{Enviro., Science}     0.3

We can sum these probabilities for each label:

Label      Score
Sports     0.5
Science    0.5
Politics   0.4
Enviro.    0.3

Using a threshold of ≥ 0.5 gives us: {Sports, Science}
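
The scoring step can be sketched in Python as below. The function names `label_scores` and `predict` are hypothetical, the posteriors are the values from the slide rather than the output of a trained model, and the small tolerance `eps` is an added assumption to keep the ≥ 0.5 comparison robust to floating-point rounding.

```python
from collections import defaultdict

# P(S | dx) for each label combination seen in training (from the slide).
POSTERIORS = {
    frozenset({"Sports", "Politics"}): 0.2,
    frozenset({"Science", "Politics"}): 0.2,
    frozenset({"Sports"}): 0.3,
    frozenset({"Environment", "Science"}): 0.3,
}


def label_scores(posteriors):
    """Sum the posterior of every combination that contains each label."""
    scores = defaultdict(float)
    for combination, probability in posteriors.items():
        for label in combination:
            scores[label] += probability
    return dict(scores)


def predict(posteriors, threshold=0.5, eps=1e-9):
    """Return every label whose summed score reaches the threshold."""
    scores = label_scores(posteriors)
    return {label for label, score in scores.items() if score >= threshold - eps}


scores = label_scores(POSTERIORS)
prediction = predict(POSTERIORS)
```

The predicted set {Sports, Science} is not one of the four training combinations, which is the point of the extension.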

SLIDE 57

Experiments II. Accuracy

[Figure: accuracy (y-axis, range 28–36) against p value 1–5 (x-axis) for CM, BM, PPT and PPT-ext.]

Enron Dataset. Accuracy (no change to build time!)

SLIDE 59

Summary

  • Multi-label Classification via Problem Transformation
  • Two standard approaches: CM, and BM
  • CM: relationships between labels are important, but too many label combinations cause problems (and it can’t form new combinations)
  • PPT: focus on key relationships
  • PPT-ext: able to form new multi-label combinations
  • Experiments: PPT and PPT-ext superior to CM and BM

The End. Questions? / Comments?

SLIDE 60

Appendix 1. Datasets

          |D|     |L|    LCard(D)   PDist(D)
Medical   978     45     1.25       0.096
Scene     2407    6      1.07       0.006
Yeast     2417    14     4.24       0.082
Enron     1702    53     3.38       0.442

LCard(D) = average number of labels per document
PDist(D) = number of distinct label combinations as a proportion of the number of documents
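
Both statistics are straightforward to compute. The sketch below uses hypothetical function names and, as toy data, the ten label sets from the PPT example rather than any of the four datasets in the table. It reads PDist as distinct label combinations divided by documents, which matches the remark that Enron has 44% as many distinct label combinations as total examples.

```python
LABEL_SETS = [
    {"Sports", "Science"},
    {"Environment", "Science", "Politics"},
    {"Sports"},
    {"Environment", "Science"},
    {"Science"},
    {"Sports"},
    {"Environment", "Science"},
    {"Politics"},
    {"Politics"},
    {"Science"},
]


def label_cardinality(label_sets):
    """LCard(D): average number of labels per document."""
    return sum(len(s) for s in label_sets) / len(label_sets)


def distinct_proportion(label_sets):
    """PDist(D): distinct label combinations divided by documents."""
    return len({frozenset(s) for s in label_sets}) / len(label_sets)


lcard = label_cardinality(LABEL_SETS)
pdist = distinct_proportion(LABEL_SETS)
```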

SLIDE 61

Appendix 2. Combination Popularity

SLIDE 62

Appendix 3. Evaluation

Accuracy:

$$\frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|S_i \cap Y_i|}{|S_i \cup Y_i|}$$

Hamming loss:

$$\frac{1}{|D|} \sum_{i=1}^{|D|} \frac{|Y_i \,\Delta\, S_i|}{|L|}$$

(Δ = symmetrical difference)

F1:

$$\frac{1}{|D|} \sum_{i=1}^{|D|} \frac{2 \cdot p \cdot r}{p + r}$$

(p, r = precision, recall of Y_i from S_i)

E.g.:

Y = 0100100010 (predicted)
S = 0100101000 (actual)

Measure        Calculation                      Value   Best
Accuracy       2/4                              0.50    1.00
Hamming loss   2/10                             0.20    0.00
F1             (2 ∗ 2/3 ∗ 2/3) / (2/3 + 2/3)    0.67    1.00
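
The three measures can be checked against the worked example with a short Python sketch. The function names are hypothetical, and the handling of empty sets (accuracy 1.0 when both sets are empty, F1 0.0 when precision and recall are both zero) is an assumption, since the slide does not cover those cases.

```python
def to_set(bits):
    """Indices of the 1s in a 0/1 string, as a set of label positions."""
    return {i for i, bit in enumerate(bits) if bit == "1"}


def accuracy(pairs):
    """Mean of |S ∩ Y| / |S ∪ Y| over all (Y, S) pairs."""
    total = 0.0
    for y, s in pairs:
        union = y | s
        total += len(y & s) / len(union) if union else 1.0
    return total / len(pairs)


def hamming_loss(pairs, n_labels):
    """Mean of |Y Δ S| / |L| over all (Y, S) pairs."""
    return sum(len(y ^ s) for y, s in pairs) / (n_labels * len(pairs))


def f1(pairs):
    """Mean of 2pr / (p + r), with p, r the precision and recall of Y from S."""
    total = 0.0
    for y, s in pairs:
        hits = len(y & s)
        p = hits / len(y) if y else 0.0
        r = hits / len(s) if s else 0.0
        total += 2 * p * r / (p + r) if p + r else 0.0
    return total / len(pairs)


Y = to_set("0100100010")  # predicted
S = to_set("0100101000")  # actual
PAIRS = [(Y, S)]

acc = accuracy(PAIRS)
ham = hamming_loss(PAIRS, n_labels=10)
f1_score = f1(PAIRS)
```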

SLIDE 63

Appendix 4. Experiments III

              Medical                   Enron
              RAKEL        PPT          RAKEL        PPT
F1            0.776        0.789        0.457        0.503
Ham. Loss     0.012        0.011        0.067        0.074
Accuracy      0.743        0.776        0.323        0.353
McNemar’s     p = 0.295                 p = 0.000
Build Time    190s         15s          3163s        195s
RAKEL par.    m = 20, k = 27, t = .5    m = 80, k = 21, t = .5
PPT par.      p = 1, −NA                p = 5, −NA, −J, t = .21

SLIDE 64

Appendix 5. Graph View

Figure 1: A multi-label dataset. Each node is a label. Each edge represents at least 1 co-occurrence of the two labels it connects.

What if we ignore very infrequent co-occurrences between labels?

SLIDE 65

Appendix 5. Graph View

Figure 2: A multi-label dataset. Each node is a label. Each edge represents at least 2 co-occurrences of the two labels it connects (covers 97% of 978 examples).

SLIDE 66

Appendix 5. Graph View

Figure 3: A multi-label dataset. Each node is a label. Each edge represents at least 3 co-occurrences of the two labels it connects (covers 92% of 978 examples).

These are the key label relationships.
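
The graph view can be approximated by counting how often each pair of labels co-occurs and keeping only the edges above a minimum count. The sketch below is illustrative: `cooccurrence_edges` and `min_count` are hypothetical names, and the toy label sets come from the PPT example, not from the dataset drawn in the figures.

```python
from collections import Counter
from itertools import combinations

LABEL_SETS = [
    {"Sports", "Science"},
    {"Environment", "Science", "Politics"},
    {"Sports"},
    {"Environment", "Science"},
    {"Science"},
    {"Sports"},
    {"Environment", "Science"},
    {"Politics"},
    {"Politics"},
    {"Science"},
]


def cooccurrence_edges(label_sets, min_count=1):
    """Map each label pair to its co-occurrence count, dropping rare pairs."""
    counts = Counter()
    for labels in label_sets:
        # Sort so that each unordered pair always gets the same key.
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


all_edges = cooccurrence_edges(LABEL_SETS, min_count=1)
key_edges = cooccurrence_edges(LABEL_SETS, min_count=2)
```

Raising `min_count` removes the one-off pairings and leaves only the frequently co-occurring labels, mirroring the progression from Figure 1 to Figure 3.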
