  1. Stuff I did in the Spring while not Replying to Email (aka “advances in structured prediction”) Hal Daumé III | University of Maryland | me@hal3.name | @haldaume3

  2. Examples of structured prediction (joint prediction) — example sentence: "The monster ate a big sandwich"

  3. Sequence labeling
     x = the monster ate the sandwich
     y = Dt  Nn  Vb  Dt  Nn
     x = Yesterday I traveled to Lille
     y = -  PER  -  -  LOC
     [Image: the monster eating a big sandwich; image credit: Richard Padgett]

  4. Natural language parsing
     INPUT: NLP algorithms use a kitchen sink of features
     OUTPUT: [Figure: dependency parse of the input, with arcs labeled [root], subject, object, n-mod, and p-mod]

  5. (Bipartite) matching (image credit: Ben Taskar; Liz Jurrus)

  6. Machine translation

  7. Segmentation (image credit: Daniel Muñoz)

  8. Protein secondary structure prediction

  9. Outline ("Isn't this kinda narrow?")
     ➢ Background: learning to search
     ➢ Stuff I did in the Spring
       ➢ Imperative DSL/library for learning to search
       ➢ SOTA examples for tagging, parsing, relation extraction, etc.
       ➢ Learning to search under bandit feedback
       ➢ Hardness results for learning to search
       ➢ Active learning for accelerating learning to search
     ➢ Stuff I'm trying to do now
       ➢ Distant supervision
       ➢ Mashups with recurrent neural networks

  10. My experience, 6 months in industry
     ➢ Standard adage: academia = freedom, industry = time
     ➢ Number of responsibilities vs. number of bosses
     ➢ Aspects I didn't anticipate:
       ➢ Breadth (academia) versus depth (industry)
       ➢ Collaborating through students versus directly
       ➢ Security through tenure versus security through $
     ➢ At the end of the day: who are your colleagues, and what do you have to do to pay the piper?
     Major caveat: this compares a top-ranked CS dept to a top industry lab, at a time when there's tons of money in this area (more in industry).

  11. Joint prediction via learning to search
     Part-of-speech tagging:
       NN  NNS        VBP DT NN      NN   IN NNS
       NLP algorithms use a  kitchen sink of features
     Dependency parsing:
       [Figure: dependency tree over "*ROOT* NLP algorithms use a kitchen sink of features"]

  12. Joint prediction via learning to search
     Joint Prediction Haiku:
       A joint prediction
       Across a single input
       Loss measured jointly
     [Figure: dependency tree over "*ROOT* NLP algorithms use a kitchen sink of features"]

  13. Back to the original problem...
     ● How to optimize a discrete, joint loss?
     ● Input: x ∈ X, e.g. "I can can a can"
     ● Truth: y ∈ Y(x), e.g. Pro Md Vb Dt Nn
     ● Outputs: Y(x), the candidate taggings: Pro Md Md Dt Vb, Pro Md Md Dt Nn, Pro Md Nn Dt Md, Pro Md Nn Dt Vb, Pro Md Nn Dt Nn, Pro Md Vb Dt Md, Pro Md Vb Dt Vb, ...
     ● Predicted: ŷ ∈ Y(x)
     ● Loss: loss(y, ŷ)
     ● Data: (x, y) ~ D
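
To see why the output space is the problem, here is a tiny sketch (the per-word tag inventories are made up for brevity) that enumerates Y(x) for the example sentence:

    from itertools import product

    # Made-up per-word candidate tags for "I can can a can"; in reality
    # every word could take any tag, which only makes Y(x) bigger.
    candidates = [
        ["Pro"],               # I
        ["Md"],                # can
        ["Md", "Nn", "Vb"],    # can
        ["Dt"],                # a
        ["Md", "Nn", "Vb"],    # can
    ]

    # Y(x) is the cross product of the per-word choices: it grows
    # exponentially in sentence length, so we cannot simply score
    # every candidate output and pick the best.
    Y_x = list(product(*candidates))
    for y_hat in Y_x:
        print(" ".join(y_hat))
    print("|Y(x)| =", len(Y_x))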

  14. Back to the original problem...
     ● How to optimize a discrete, joint loss?
     ● Goal: find h ∈ H minimizing E_{(x,y)~D}[ loss(y, h(x)) ], based on N samples (x_n, y_n) ~ D
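
Spelled out (a restatement of the slide's objective, with the standard empirical approximation made explicit):

    \[
    h^{*} \;=\; \arg\min_{h \in H} \; \mathbb{E}_{(x,y)\sim D}\!\left[\mathrm{loss}\big(y,\, h(x)\big)\right]
    \;\approx\; \arg\min_{h \in H} \; \frac{1}{N} \sum_{n=1}^{N} \mathrm{loss}\big(y_{n},\, h(x_{n})\big),
    \qquad (x_{n}, y_{n}) \sim D .
    \]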

  15. Search spaces
     ● When y decomposes in an ordered manner, a sequential decision-making process emerges
     [Figure: tagging "I can can ..." word by word; each decision point offers the actions Pro Md Vb Dt Nn, one action is taken, and the process moves to the next word]

  16. Search spaces
     ● When y decomposes in an ordered manner, a sequential decision-making process emerges
     ● The end state e encodes an output ŷ = ŷ(e), from which loss(y, ŷ) can be computed (at training time)
     [Figure: the same word-by-word search space, with a complete path ending in state e]
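
A minimal sketch of this view for sequence labeling (illustrative code, not the talk's library): the state is the prefix of tags chosen so far, each action appends one tag, and the end state encodes ŷ so a joint loss such as Hamming loss can be computed at training time.

    TAGS = ["Pro", "Md", "Vb", "Dt", "Nn"]    # the action set

    def step(state, action):
        # Taking an action appends one tag to the partial output.
        return state + [action]

    def is_end(state, x):
        # The search ends once every word has been tagged.
        return len(state) == len(x)

    def hamming_loss(y, y_hat):
        # The joint loss, computable from the end state at training time.
        return sum(yi != pi for yi, pi in zip(y, y_hat))

    x = "I can can a can".split()
    y = ["Pro", "Md", "Vb", "Dt", "Nn"]       # the truth

    state = []                                # start state: nothing decided yet
    while not is_end(state, x):
        state = step(state, "Pro")            # a deliberately bad fixed policy
    y_hat = state                             # the end state encodes y_hat = y_hat(e)
    print(y_hat, "loss =", hamming_loss(y, y_hat))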

  17. Policies
     ● A policy maps observations to actions: π(o) = a
     ● The observation o can include: the input x, the timestep t, the partial trajectory τ, ... anything else
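
As a sketch (all names illustrative), a policy is nothing more than a classifier whose input is whatever the observation carries:

    from dataclasses import dataclass, field

    @dataclass
    class Observation:
        x: list                                   # the full input
        t: int                                    # current timestep
        tau: list = field(default_factory=list)   # partial trajectory so far
        # ... anything else can be folded in as features

    class Policy:
        # A policy maps observations to actions; learning a policy
        # is therefore just learning a classifier.
        def __call__(self, obs: Observation) -> str:
            raise NotImplementedError

    class FirstTagPolicy(Policy):
        # Trivially bad fixed policy, for illustration only.
        def __call__(self, obs: Observation) -> str:
            return "Pro"

    obs = Observation(x="I can can a can".split(), t=0)
    print(FirstTagPolicy()(obs))   # -> Pro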

  18. An analogy from playing Mario (from the Mario AI competition 2009)
     Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell)
     Output: Jump ∈ {0,1}, Right ∈ {0,1}, Left ∈ {0,1}, Speed ∈ {0,1}
     High-level goal: watch an expert play and learn to mimic her behavior
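
A rough sketch of such a featurization (the 14 cell features and 4-frame window are from the slide; the 22×22 grid size is my assumption, chosen so the count lands near 27K):

    import numpy as np

    N_CELL_TYPES = 14        # slide: 14 binary features for every cell
    N_FRAMES = 4             # slide: features from the last 4 observations
    GRID_H = GRID_W = 22     # assumed receptive-field size, not on the slide

    def featurize(frames):
        # frames: the last 4 grids, each cell an int cell-type in [0, 14).
        feats = []
        for grid in frames[-N_FRAMES:]:
            onehot = np.zeros((GRID_H, GRID_W, N_CELL_TYPES), dtype=np.uint8)
            for i in range(GRID_H):
                for j in range(GRID_W):
                    onehot[i, j, grid[i, j]] = 1   # one binary feature per cell type
            feats.append(onehot.ravel())
        return np.concatenate(feats)

    frames = [np.random.randint(0, N_CELL_TYPES, (GRID_H, GRID_W))
              for _ in range(N_FRAMES)]
    print(featurize(frames).shape)   # (27104,) -- "27K+" under these assumptions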

  19. Training (expert) [Video; credit: Stéphane Ross, Geoff Gordon and Drew Bagnell]

  20. Warm-up: Supervised learning
     1. Collect trajectories from expert π_ref
     2. Store as dataset D = { (o, π_ref(o,y)) | o ~ π_ref }
     3. Train classifier π on D
     ● Let π play the game!
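
In code, this warm-up is just behavior cloning. A minimal sketch (expert, run_episode, and the choice of classifier are my assumptions; observations are assumed to be numeric feature vectors):

    from sklearn.linear_model import LogisticRegression

    def supervised_warmup(expert, run_episode, n_episodes=100):
        # Behavior cloning. Assumed interfaces: expert(obs) returns the
        # expert's action; run_episode(policy) plays one game with `policy`
        # and yields the observations it saw along the way.
        D = []
        for _ in range(n_episodes):               # 1. collect expert trajectories
            for obs in run_episode(expert):       #    o ~ pi_ref
                D.append((obs, expert(obs)))      # 2. store (o, pi_ref(o))
        X, Y = zip(*D)
        pi = LogisticRegression(max_iter=1000).fit(X, Y)   # 3. train classifier
        return pi                                 # now let pi play the game!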

  21. Test-time execution (supervised learning) [Video; credit: Stéphane Ross, Geoff Gordon and Drew Bagnell]

  22. What's the (biggest) failure mode?
     ● The expert never gets stuck next to pipes
     ● ⇒ the classifier doesn't learn to recover!

  23. Warm-up II: Imitation learning
     1. Collect trajectories from expert π_ref
     2. Dataset D_0 = { (o, π_ref(o,y)) | o ~ π_ref }
     3. Train π_1 on D_0
     4. Collect new trajectories from π_1
        ➢ But let the expert steer!
     5. Dataset D_1 = { (o, π_ref(o,y)) | o ~ π_1 }
     6. Train π_2 on D_0 ∪ D_1
     ● In general:
       ● D_n = { (o, π_ref(o,y)) | o ~ π_n }
       ● Train π_{n+1} on ∪_{i≤n} D_i
     ● Guarantee: if N = T log T, then L(π_n) < T·ε_N + O(1) for some n (where ε_N is the base classifier's error given N samples)
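
The full loop is DAgger-style dataset aggregation. A compact sketch, under the same assumed interfaces as the warm-up above:

    def dagger(expert, run_episode, train, n_rounds=5, n_episodes=20):
        # Iterative imitation: the *current* policy chooses which states
        # are visited, but the *expert* supplies the labels.
        # train(D) fits a classifier on (observation, action) pairs.
        D = []
        policy = expert          # round 0 trajectories come from the expert
        learned = []
        for n in range(n_rounds):
            for _ in range(n_episodes):
                for obs in run_episode(policy):    # o ~ pi_n
                    D.append((obs, expert(obs)))   # labeled by pi_ref
            policy = train(D)    # train pi_{n+1} on D_0 ∪ ... ∪ D_n
            learned.append(policy)
        return learned           # the guarantee holds for *some* pi_n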

  24. Test-time execution (DAgger) [Video; credit: Stéphane Ross, Geoff Gordon and Drew Bagnell]

  25. What's the biggest failure mode?
     ● The classifier only sees right versus not-right
     ● No notion of better or worse
     ● No partial credit
     ● Must have a single target answer

  26. Learning to search: AggraVaTe
     1. Let the learned policy π drive for t timesteps to obs. o
     2. For each possible action a:
        ● Take action a, and let the expert π_ref drive the rest
        ● Record the overall loss, c_a
     3. Update π based on the example ( o, ⟨c_1, c_2, ..., c_K⟩ )
     4. Goto (1)
     [Figure: example per-action cost vector at o, with values such as 0, .4, 1, 0, 0]
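
A minimal sketch of one AggraVaTe round (the environment hooks env.reset/step/clone/done/loss are assumptions for illustration, not the talk's actual API):

    import random

    def aggravate_round(pi, expert, env, actions, T):
        # env.step(a) advances and returns the next observation, env.clone()
        # snapshots the simulator, env.loss() is the overall loss at the end.
        t = random.randrange(T)            # pick a deviation timestep
        obs = env.reset()
        for _ in range(t):                 # 1. let the learned policy drive to o
            if env.done():
                break
            obs = env.step(pi(obs))
        costs = []
        for a in actions:                  # 2. for each possible action a:
            branch = env.clone()
            o = branch.step(a)             #    take action a, then ...
            while not branch.done():
                o = branch.step(expert(o)) #    ... let the expert drive the rest
            costs.append(branch.loss())    #    record the overall loss c_a
        return obs, costs                  # 3. cost-sensitive example (o, <c_1..c_K>)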

  27. Training time versus test accuracy

  28. Training time versus test accuracy

  29. Test time speed

  30. State-of-the-art accuracy in...
     ● Part-of-speech tagging (1 million words)
       ● wc: 3.2 seconds
       ● US: 6 lines of code, 10 seconds to train
       ● CRFsgd: 1068 lines, 30 minutes
       ● CRF++: 777 lines, hours
     ● Named entity recognition (200 thousand words)
       ● wc: 0.8 seconds
       ● US: 30 lines of code, 5 seconds to train
       ● CRFsgd: 1 minute
       ● CRF++: 10 minutes
       ● SVMstruct: 876 lines, 30 minutes (suboptimal accuracy)

  31. The Magic
     ● You write some greedy "test-time" code
       ● In your favorite imperative language (C++/Python)
       ● It makes arbitrary calls to a Predict function
     ● And you add some minor decoration
     ● We will automatically:
       ● Perform learning
       ● Generate non-deterministic (beam) search
       ● Run faster than specialized learning software

  32. How to train?
     [Figure: rollin trajectory from S through R to an end state E with loss 0; one-step deviations at R, completed by rollouts, reach end states with losses .2 and .8]
     1. Generate an initial trajectory using a rollin policy
     2. For each state R on that trajectory:
        a) For each possible action a (one-step deviations):
           i. Take that action
           ii. Complete this trajectory using a rollout policy
           iii. Obtain a final loss
        b) Generate a cost-sensitive classification example: ( Φ(R), ⟨c_a⟩_{a∈A} )
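
For concreteness, the same recipe as a sketch (using the same assumed env interface as the AggraVaTe sketch above; features(R) computes Φ(R)). Rollin and rollout are passed in as policies, so each may be the learned policy, the expert, or a mixture:

    def one_step_deviations(rollin, rollout, env, actions, features):
        # Cost-sensitive examples from every state on one rollin trajectory.
        trajectory = []
        obs = env.reset()
        while not env.done():              # 1. generate the rollin trajectory
            trajectory.append((env.clone(), obs))
            obs = env.step(rollin(obs))
        examples = []
        for snapshot, R in trajectory:     # 2. for each state R on it:
            costs = []
            for a in actions:              # 2a. each one-step deviation
                b = snapshot.clone()
                o = b.step(a)              #   i. take that action
                while not b.done():
                    o = b.step(rollout(o)) #   ii. complete with the rollout policy
                costs.append(b.loss())     #   iii. obtain a final loss
            examples.append((features(R), costs))  # 2b. (Phi(R), <c_a> for a in A)
        return examples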

  33. The magic in practice
     ● The pseudocode you write:

         run(vector<example> ec)
           for i = 0 .. ec.size
             y_true = get_example_label(ec[i])
             y_pred = Predict(ec[i], y_true)   // y_true: a "hint" about the correct decision, only at training time
             Loss( # of y_true != y_pred )     // how bad was the entire sequence of predictions (at training time)

     ● The real thing (I'm really not hiding anything...):

         void run(search& sch, vector<example*> ec) {
           for (size_t i=0; i<ec.size(); i++) {
             uint32_t y_true = get_example_label(ec[i]);
             uint32_t y_pred = sch.predict(ec[i], y_true);
             sch.loss( y_true != y_pred );
             if (sch.output().good())
               sch.output() << y_pred << ' ';
           }
         }

  34. The illusion of control
     ● Execute run O(T × A) times, modifying Predict:

         for each time step myT = 1 .. T:
           for each possible action myA = 1 .. A:
             define Predict(...) = myA if t == myT, π otherwise
             run your code in full
             set cost_myA = result of Loss
           make a classification example on x_myT with ⟨cost_a⟩

     ● Your code, unchanged:

         run(vector<example> ec)
           for i = 0 .. ec.size
             y_true = get_example_label(ec[i])
             y_pred = Predict(ec[i], y_true)
             Loss( # of y_true != y_pred )
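
The re-execution trick is easy to mimic in miniature. A hedged Python sketch (not the library's actual mechanism; run, pi, and the returned example format are all illustrative): Predict is swapped for a closure that forces action myA at timestep myT and defers to the current policy everywhere else.

    def train_pass(run, T, actions, pi):
        # run(predict) is the user's test-time code written against a
        # Predict callback; it is assumed to return its accumulated Loss.
        examples = []
        for myT in range(T):
            costs = []
            for myA in actions:
                t = [0]                    # timestep counter shared with the closure
                def predict(example, y_true):
                    a = myA if t[0] == myT else pi(example)  # force one deviation
                    t[0] += 1
                    return a
                costs.append(run(predict)) # run the user's code in full
            examples.append((myT, costs))  # classification example on x_myT
        return examples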

  35. Entity/relation identification

  36. Dependency parsing

  37. Outline
     ➢ Background: learning to search
     ➢ Stuff I did in the Spring
       ➢ Imperative DSL/library for learning to search
       ➢ SOTA examples for tagging, parsing, relation extraction, etc.
       ➢ Learning to search under bandit feedback
       ➢ Hardness results for learning to search
       ➢ Active learning for accelerating learning to search
     ➢ Stuff I'm trying to do now
       ➢ Distant supervision
       ➢ Mashups with recurrent neural networks

  38.–47. [Slides 38–47 contain no extractable text (figures only).]
