Stuff I did in the Spring while not Replying to Email (aka “advances in structured prediction”)

SLIDE 1

Stuff I did in the Spring while not Replying to Email

(aka “advances in structured prediction”)

Hal Daumé III | University of Maryland | me@hal3.name | @haldaume3

SLIDE 2

Examples of structured (joint) prediction

The monster ate a big sandwich

SLIDE 3

Sequence labeling

x = the monster ate the sandwich
y = Dt  Nn      Vb  Dt  Nn

x = Yesterday I   traveled to Lille
y = -         PER -        -  LOC

image credit: Richard Padgett
SLIDE 4

Natural language parsing

[Figure: dependency parse of "NLP algorithms use a kitchen sink of features", mapping INPUT (the sentence) to OUTPUT (a tree with arcs labeled subject, object, n-mod, p-mod, and [root])]

SLIDE 5

(Bipartite) matching

image credit: Ben Taskar; Liz Jurrus

SLIDE 6

Machine translation

SLIDE 7

Segmentation

image credit: Daniel Muñoz

SLIDE 8

Protein secondary structure prediction

SLIDE 9

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks

Isn't this kinda narrow?

SLIDE 10

My experience, 6 months in industry

➢ Standard adage: academia = freedom, industry = time
➢ Number of responsibilities vs. number of bosses
➢ Aspects I didn't anticipate:
  ➢ Breadth (academia) versus depth (industry)
  ➢ Collaborating through students versus directly
  ➢ Security through tenure versus security through $
➢ At the end of the day: who are your colleagues, and what do you have to do to pay the piper?

Major caveat: this is comparing a top-ranked CS dept to a top industry lab, at a time when there's tons of money in this area (more in industry).
SLIDE 11

Joint prediction via learning to search

[Figure: part-of-speech tagging and dependency parsing of "NLP algorithms use a kitchen sink of features"; POS tags NN NNS VBP DT NN NN IN NNS; dependency tree rooted at *ROOT*]

SLIDE 12

Joint prediction via learning to search

Joint Prediction Haiku:
A joint prediction
Across a single input
Loss measured jointly

SLIDE 13

Back to the original problem...

• How to optimize a discrete, joint loss?
• Input: x ∈ X
• Truth: y ∈ Y(x)
• Outputs: Y(x)
• Predicted: ŷ ∈ Y(x)
• Loss: ℓ(y, ŷ)
• Data: (x, y) ~ D

Example output space for x = "I can can a can":
Pro Md Vb Dt Nn / Pro Md Vb Dt Vb / Pro Md Vb Dt Md / Pro Md Nn Dt Nn / Pro Md Nn Dt Vb / Pro Md Nn Dt Md / Pro Md Md Dt Nn / Pro Md Md Dt Vb
SLIDE 14

Back to the original problem...

• How to optimize a discrete, joint loss?
• Input: x ∈ X
• Truth: y ∈ Y(x)
• Outputs: Y(x)
• Predicted: ŷ ∈ Y(x)
• Loss: ℓ(y, ŷ)
• Data: (x, y) ~ D

Goal: find h ∈ H such that h(x) ∈ Y(x), minimizing E_{(x,y)~D}[ ℓ(y, h(x)) ], based on N samples (x_n, y_n) ~ D
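
Spelled out as one display, with the empirical estimate over the N samples made explicit (the approximation on the right is a standard restatement added here for clarity; it is not on the slide):

    \[
      h^{*} \;=\; \operatorname*{argmin}_{h \in \mathcal{H}}\;
        \mathbb{E}_{(x,y) \sim \mathcal{D}}\bigl[\,\ell(y, h(x))\,\bigr]
      \;\approx\;
      \operatorname*{argmin}_{h \in \mathcal{H}}\;
        \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(y_n, h(x_n)\bigr),
      \qquad (x_n, y_n) \sim \mathcal{D}.
    \]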

SLIDE 15

Search spaces

• When y decomposes in an ordered manner, a sequential decision making process emerges

[Figure: tagging "I can can ..." left to right; each word is a decision point over candidate tags {Pro, Md, Vb, Dt, Nn}, alternating decision → action → decision → action]

SLIDE 16

Search spaces

• When y decomposes in an ordered manner, a sequential decision making process emerges
• The end state e encodes an output ŷ = ŷ(e) from which ℓ(y, ŷ) can be computed (at training time)

SLIDE 17

Policies

• A policy maps observations to actions: π(obs) = a
• The observation can include: input x, timestep t, partial trajectory τ, ... anything else
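
To make "a policy maps observations to actions" concrete before the Mario example, here is a minimal C++ sketch with a linear scorer. Observation, Policy, and the feature layout are hypothetical stand-ins invented for this sketch, not any library's API:

    // Minimal sketch of a policy as a scoring function over actions.
    #include <cstddef>
    #include <vector>

    struct Observation {          // what pi gets to look at:
        std::vector<float> x;     //   features of the input x
        std::size_t t;            //   current timestep
        std::vector<int> tau;     //   actions taken so far (partial trajectory)
    };

    struct Policy {
        std::size_t num_actions;
        std::vector<float> w;     // one weight block per action (linear scorer)

        // pi(o) = argmax_a  w_a . phi(o); here phi(o) is just o.x for brevity
        int act(const Observation& o) const {
            int best = 0;
            float best_score = -1e30f;
            for (std::size_t a = 0; a < num_actions; ++a) {
                float s = 0;
                for (std::size_t i = 0; i < o.x.size(); ++i)
                    s += w[a * o.x.size() + i] * o.x[i];
                if (s > best_score) { best_score = s; best = static_cast<int>(a); }
            }
            return best;
        }
    };

Learning fills in w; the next slides are about where the training examples for it come from.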

SLIDE 18

An analogy from playing Mario

High level goal: watch an expert play and learn to mimic her behavior

Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell); from the Mario AI competition 2009
Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}

SLIDE 19

Training (expert)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 20

Warm-up: Supervised learning

1. Collect trajectories from expert πref
2. Store as dataset D = { ( o, πref(o,y) ) | o ~ πref }
3. Train classifier π on D
➢ Let π play the game!
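
A sketch of this warm-up loop in C++ follows. Env, Expert, and the 1-nearest-neighbour Classifier are toy stand-ins invented for the sketch (the real setting uses Mario observations and a real learner); the point is the shape of the data collection:

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Obs = std::vector<float>;
    using Dataset = std::vector<std::pair<Obs, int>>;   // pairs (o, pi_ref(o))

    // Toy environment and expert, just so the loop below actually runs.
    struct Env {
        int t = 0;
        Obs reset() { t = 0; return Obs{0.0f}; }
        Obs step(int /*a*/) { ++t; return Obs{float(t)}; }
        bool done() const { return t >= 10; }
    };
    struct Expert { int act(const Obs& o) const { return o[0] > 5.0f ? 1 : 0; } };

    // Stand-in classifier: 1-nearest-neighbour over the collected pairs.
    struct Classifier {
        Dataset mem;
        int act(const Obs& o) const {
            int best_a = 0; float best_d = 1e30f;
            for (const auto& [x, a] : mem) {
                float d = 0;
                for (std::size_t i = 0; i < o.size(); ++i)
                    d += (o[i] - x[i]) * (o[i] - x[i]);
                if (d < best_d) { best_d = d; best_a = a; }
            }
            return best_a;
        }
    };

    // 1. collect trajectories from the expert; 2. store (o, pi_ref(o)) pairs;
    // 3. train pi on that dataset. Then pi plays on its own.
    Classifier behavior_clone(Env& env, const Expert& expert, int episodes) {
        Dataset D;
        for (int e = 0; e < episodes; ++e) {
            Obs o = env.reset();
            while (!env.done()) {        // o ~ pi_ref: the expert drives
                int a = expert.act(o);
                D.push_back({o, a});     // record what the expert did here
                o = env.step(a);
            }
        }
        return Classifier{D};            // "training" = memorise, in this sketch
    }

Slide 22's failure mode is already visible here: D only contains states the expert visits, so π is untrained everywhere else.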
SLIDE 21

Test-time execution (sup. learning)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 22

What's the (biggest) failure mode?


The expert never gets stuck next to pipes ⇒ Classifier doesn't learn to recover!

SLIDE 23

Warm-up II: Imitation learning

1. Collect trajectories from expert πref
2. Dataset D0 = { ( o, πref(o,y) ) | o ~ πref }
3. Train π1 on D0
4. Collect new trajectories from π1
   ➢ But let the expert steer!
5. Dataset D1 = { ( o, πref(o,y) ) | o ~ π1 }
6. Train π2 on D0 ∪ D1

In general:
• Dn = { ( o, πref(o,y) ) | o ~ πn }
• Train πn+1 on ∪i≤n Di

If N = T log T, then L(πn) ≤ T εN + O(1) for some n, where εN is the classification error achieved on the aggregated data.
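
The same toy scaffolding extends to a sketch of this aggregation loop (a DAgger-style procedure; Env, Expert, and the 1-NN Policy are again invented stand-ins, and which πn you keep should really be chosen on held-out data):

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Obs = std::vector<float>;
    using Dataset = std::vector<std::pair<Obs, int>>;

    struct Env {                         // same toy environment as before
        int t = 0;
        Obs reset() { t = 0; return Obs{0.0f}; }
        Obs step(int /*a*/) { ++t; return Obs{float(t)}; }
        bool done() const { return t >= 10; }
    };
    struct Expert { int act(const Obs& o) const { return o[0] > 5.0f ? 1 : 0; } };

    struct Policy {                      // 1-nearest-neighbour stand-in learner
        Dataset mem;
        int act(const Obs& o) const {
            int best_a = 0; float best_d = 1e30f;
            for (const auto& [x, a] : mem) {
                float d = 0;
                for (std::size_t i = 0; i < o.size(); ++i)
                    d += (o[i] - x[i]) * (o[i] - x[i]);
                if (d < best_d) { best_d = d; best_a = a; }
            }
            return best_a;
        }
    };

    Policy dagger(Env& env, const Expert& expert, int rounds) {
        Dataset D;                        // aggregated: union of D_0 .. D_n
        Policy pi;                        // pi_1 comes out of round 0 below
        for (int n = 0; n < rounds; ++n) {
            Obs o = env.reset();
            while (!env.done()) {
                D.push_back({o, expert.act(o)});  // label = pi_ref(o): expert steers
                int a = (n == 0) ? expert.act(o)  // round 0: expert generates states
                                 : pi.act(o);     // later rounds: o ~ pi_n
                o = env.step(a);
            }
            pi = Policy{D};               // train pi_{n+1} on all data so far
        }
        return pi;                        // "for some n": pick the best on dev data
    }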

SLIDE 24

Test-time execution (DAgger)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 25

What's the biggest failure mode?

Classifier only sees right versus not-right

  • No notion of better or worse
  • No partial credit
  • Must have a single target answer

SLIDE 26

Learning to search: AggraVaTe

1. Let the learned policy π drive for t timesteps, arriving at observation o
2. For each possible action a:
   • Take action a, then let the expert πref drive the rest
   • Record the overall loss, c_a
3. Update π based on the cost-sensitive example: ( o, 〈c_1, c_2, ..., c_K〉 )
4. Goto (1)
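
A compact sketch of one such update, using a value-semantic toy Env so each one-step deviation can roll out from a snapshot of the state. All the types here are invented for illustration, not the paper's implementation:

    #include <vector>

    using Obs = std::vector<float>;

    struct Env {                       // toy environment, copyable by value
        int t = 0, mistakes = 0;
        Obs obs() const { return Obs{float(t)}; }
        void step(int a) { if (a != 0) ++mistakes; ++t; }  // action 0 is "correct"
        bool done() const { return t >= 10; }
        float loss() const { return float(mistakes); }     // joint loss so far
    };

    struct Expert { int act(const Obs&) const { return 0; } };  // pi_ref
    struct Policy { int act(const Obs&) const { return 0; } };  // learned pi

    struct CSExample { Obs o; std::vector<float> costs; };      // (o, <c_1..c_K>)

    CSExample aggravate_example(Env env, const Policy& pi, const Expert& expert,
                                int t, int K) {
        for (int s = 0; s < t && !env.done(); ++s)   // 1. pi drives for t steps
            env.step(pi.act(env.obs()));
        CSExample ex{env.obs(), {}};
        for (int a = 0; a < K; ++a) {                // 2. for each action a:
            Env rollout = env;                       //    snapshot the state,
            rollout.step(a);                         //    take action a, then
            while (!rollout.done())                  //    expert drives the rest
                rollout.step(expert.act(rollout.obs()));
            ex.costs.push_back(rollout.loss());      //    record overall loss c_a
        }
        return ex;                                   // 3. update pi on (o, <c_a>)
    }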

SLIDE 27

Training time versus test accuracy

[Figure only]

SLIDE 28

Training time versus test accuracy

[Figure only]

SLIDE 29

Test time speed

[Figure only]

SLIDE 30

State of the art accuracy in...

Part of speech tagging (1 million words):
    wc (baseline: time just to read the data):  3.2 seconds
    US:        6 lines of code, 10 seconds to train
    CRFsgd:    1068 lines, 30 minutes
    CRF++:     777 lines, hours

Named entity recognition (200 thousand words):
    wc:        0.8 seconds
    US:        30 lines of code, 5 seconds to train
    CRFsgd:    1 minute
    CRF++:     10 minutes
    SVMstruct: 876 lines, 30 minutes (suboptimal accuracy)

SLIDE 31

The Magic

• You write some greedy "test-time" code
  • In your favorite imperative language (C++/Python)
  • It makes arbitrary calls to a Predict function
  • And you add some minor decoration
• We will automatically:
  • Perform learning
  • Generate non-deterministic (beam) search
  • Run faster than specialized learning software

SLIDE 32

How to train?

1. Generate an initial trajectory using a rollin policy
2. For each state R on that trajectory:
   a) For each possible action a (one-step deviations):
      i.   Take that action
      ii.  Complete this trajectory using a rollout policy
      iii. Obtain a final loss
   b) Generate a cost-sensitive classification example: ( Φ(R), 〈c_a〉_{a∈A} )

[Figure: rollin trajectory S → R → E; one-step deviations at R are each completed by the rollout policy, yielding losses such as 0, .2, and .8]
SLIDE 33

The magic in practice

What you write (pseudocode):

    run(vector<example> ec)
        for i = 0 .. ec.size
            y_true = get_example_label(ec[i])
            y_pred = Predict(ec[i], y_true)
            Loss( # of y_true != y_pred )

Loss: how bad was the entire sequence of predictions (at training time)?
The second argument to Predict is a "hint" about the correct decision, used only at training time.

What it really is (I'm really not hiding anything...):

    void run(search& sch, vector<example*> ec) {
        for (size_t i = 0; i < ec.size(); i++) {
            uint32_t y_true = get_example_label(ec[i]);
            uint32_t y_pred = sch.predict(ec[i], y_true);
            sch.loss(y_true != y_pred);
            if (sch.output().good())
                sch.output() << y_pred << ' ';
        }
    }
SLIDE 34

The illusion of control

• Execute run O(T×A) times, modifying Predict each time:

    for each time step myT = 1 .. T:
        for each possible action myA = 1 .. A:
            define Predict(...) = myA if t == myT, else π
            run your code in full
            set cost_myA = result of Loss
        make a classification example on x_myT with 〈cost_a〉

• Recall the code being re-run:

    run(vector<example> ec)
        for i = 0 .. ec.size
            y_true = get_example_label(ec[i])
            y_pred = Predict(ec[i], y_true)
            Loss( # of y_true != y_pred )
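
Here is a toy rendition of that re-execution trick in C++: run() is written once, and a harness re-runs it T×A times with Predict overridden at a single timestep. This mimics the control flow described on the slide, not VW's actual internals, and the fixed y_true is a placeholder:

    #include <functional>
    #include <vector>

    struct Searcher {
        std::function<int(int /*t*/)> predict;   // injected behaviour
        float loss = 0;
        void add_loss(float l) { loss += l; }
    };

    // "Your code": a T-step run that calls Predict once per step.
    void run(Searcher& sch, int T) {
        for (int t = 0; t < T; ++t) {
            int y_pred = sch.predict(t);
            sch.add_loss(y_pred != /*y_true=*/0 ? 1.0f : 0.0f);
        }
    }

    // For each timestep myT and action myA: re-run with Predict forced to myA
    // at myT (the learned policy pi everywhere else); read off cost_a = Loss.
    std::vector<std::vector<float>>
    one_step_deviation_costs(int T, int A, const std::function<int(int)>& pi) {
        std::vector<std::vector<float>> cost(T, std::vector<float>(A));
        for (int myT = 0; myT < T; ++myT)
            for (int myA = 0; myA < A; ++myA) {
                Searcher sch;
                sch.predict = [&](int t) { return t == myT ? myA : pi(t); };
                run(sch, T);                 // run your code in full
                cost[myT][myA] = sch.loss;   // one entry of <cost_a> for x_myT
            }
        return cost;
    }

The point of the trick: user code stays a straight-line program, and all exploration over (myT, myA) lives in the harness. Each row cost[myT] becomes a cost-sensitive example, closing the loop with slide 32.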
SLIDE 35

Entity/relation identification

SLIDE 36

Dependency parsing

SLIDE 37

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks
SLIDES 38-69

[No text could be extracted from these slides]
SLIDE 70

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks

➢ Observation: rollouts at all time steps are not equally useful
➢ Solution: importance-weighted active learning to select where to roll out versus skip
➢ Hacky heuristic: 5× speedup, slightly increased accuracy
➢ Training RNNs with LOLS yields drastic increases in performance on non-adversarial synthetic data

SLIDE 71

Distant supervision

➢ Learning with a human in the loop
➢ Repeat forever:
  ➢ Information need
  ➢ Machine makes complex prediction
  ➢ Human is happy or unhappy, provides extra feedback
  ➢ Machine learns
  ➢ Human learns
➢ How to handle the last step?
SLIDE 72

Alekh Agarwal, Kai-Wei Chang, Akshay Krishnamurthy, John Langford, Alina Beygelzimer, Paul Mineiro, Stéphane Ross, He He

SLIDE 73
• Novel programming paradigm for integrating ML into software
• State of the art results on many tasks, very quickly, with little code
• New problems, new algorithms:
  • Positive results (a notion of local optimality, and regret guarantees)
  • Negative results (hardness of exact local optimality)
• Lots of places to go from here...

Thanks! Questions?