Stuff I did in the Spring while not Replying to Email (aka “advances in structured prediction”)

SLIDE 1

Stuff I did in the Spring while not Replying to Email

(aka “advances in structured prediction”)

Hal Daumé III | University of Maryland | me@hal3.name | @haldaume3

SLIDE 2

Examples of structured (joint) prediction

The monster ate a big sandwich

SLIDE 3

Sequence labeling

x = the monster ate the sandwich
y = Dt  Nn      Vb  Dt  Nn

x = Yesterday I   traveled to Lille
y = -         PER -        -  LOC

image credit: Richard Padgett
SLIDE 4

Natural language parsing

[Figure: dependency parse of "NLP algorithms use a kitchen sink of features", mapping INPUT (the sentence) to OUTPUT (a tree with arcs labeled subject, object, n-mod, p-mod, and [root])]

SLIDE 5

(Bipartite) matching

image credit: Ben Taskar; Liz Jurrus

SLIDE 6

Machine translation

SLIDE 7

Segmentation

image credit: Daniel Muñoz

SLIDE 8

Protein secondary structure prediction

SLIDE 9

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks

Isn't this kinda narrow?

SLIDE 10

My experience, 6 months in industry

➢ Standard adage: academia = freedom, industry = time
➢ Number of responsibilities vs. number of bosses
➢ Aspects I didn't anticipate:
  ➢ Breadth (academia) versus depth (industry)
  ➢ Collaborating through students versus directly
  ➢ Security through tenure versus security through $
➢ At the end of the day: who are your colleagues, and what do you have to do to pay the piper?

Major caveat: this is comparing a top-ranked CS dept to a top industry lab, at a time when there's tons of money in this area (more in industry).
SLIDE 11

Joint prediction via learning to search

[Figure: part-of-speech tagging and dependency parsing of "NLP algorithms use a kitchen sink of features"; POS tags NN NNS VBP DT NN NN IN NNS; dependency tree rooted at *ROOT*]

SLIDE 12

Joint prediction via learning to search

Joint Prediction Haiku:
A joint prediction
Across a single input
Loss measured jointly

SLIDE 13

Back to the original problem...

• How to optimize a discrete, joint loss?
• Input: x ∈ X
• Truth: y ∈ Y(x)
• Outputs: Y(x)
• Predicted: ŷ ∈ Y(x)
• Loss: ℓ(y, ŷ)
• Data: (x, y) ~ D

Example output space for x = "I can can a can":
Pro Md Vb Dt Nn / Pro Md Vb Dt Vb / Pro Md Vb Dt Md / Pro Md Nn Dt Nn / Pro Md Nn Dt Vb / Pro Md Nn Dt Md / Pro Md Md Dt Nn / Pro Md Md Dt Vb
SLIDE 14

Back to the original problem...

• How to optimize a discrete, joint loss?
• Input: x ∈ X
• Truth: y ∈ Y(x)
• Outputs: Y(x)
• Predicted: ŷ ∈ Y(x)
• Loss: ℓ(y, ŷ)
• Data: (x, y) ~ D

Goal: find h ∈ H such that h(x) ∈ Y(x), minimizing E_{(x,y)~D}[ ℓ(y, h(x)) ], based on N samples (x_n, y_n) ~ D
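
Spelled out as one display, with the empirical estimate over the N samples made explicit (the approximation on the right is a standard restatement added here for clarity; it is not on the slide):

    \[
      h^{*} \;=\; \operatorname*{argmin}_{h \in \mathcal{H}}\;
        \mathbb{E}_{(x,y) \sim \mathcal{D}}\bigl[\,\ell(y, h(x))\,\bigr]
      \;\approx\;
      \operatorname*{argmin}_{h \in \mathcal{H}}\;
        \frac{1}{N} \sum_{n=1}^{N} \ell\bigl(y_n, h(x_n)\bigr),
      \qquad (x_n, y_n) \sim \mathcal{D}.
    \]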

SLIDE 15

Search spaces

• When y decomposes in an ordered manner, a sequential decision making process emerges

[Figure: tagging "I can can ..." left to right; each word is a decision point over candidate tags {Pro, Md, Vb, Dt, Nn}, alternating decision → action → decision → action]

SLIDE 16

Search spaces

• When y decomposes in an ordered manner, a sequential decision making process emerges
• The end state e encodes an output ŷ = ŷ(e) from which ℓ(y, ŷ) can be computed (at training time)

SLIDE 17

Policies

• A policy maps observations to actions: π(obs) = a
• The observation can include: input x, timestep t, partial trajectory τ, ... anything else
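
To make "a policy maps observations to actions" concrete before the Mario example, here is a minimal C++ sketch with a linear scorer. Observation, Policy, and the feature layout are hypothetical stand-ins invented for this sketch, not any library's API:

    // Minimal sketch of a policy as a scoring function over actions.
    #include <cstddef>
    #include <vector>

    struct Observation {          // what pi gets to look at:
        std::vector<float> x;     //   features of the input x
        std::size_t t;            //   current timestep
        std::vector<int> tau;     //   actions taken so far (partial trajectory)
    };

    struct Policy {
        std::size_t num_actions;
        std::vector<float> w;     // one weight block per action (linear scorer)

        // pi(o) = argmax_a  w_a . phi(o); here phi(o) is just o.x for brevity
        int act(const Observation& o) const {
            int best = 0;
            float best_score = -1e30f;
            for (std::size_t a = 0; a < num_actions; ++a) {
                float s = 0;
                for (std::size_t i = 0; i < o.x.size(); ++i)
                    s += w[a * o.x.size() + i] * o.x[i];
                if (s > best_score) { best_score = s; best = static_cast<int>(a); }
            }
            return best;
        }
    };

Learning fills in w; the next slides are about where the training examples for it come from.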

SLIDE 18

An analogy from playing Mario

High level goal: watch an expert play and learn to mimic her behavior

Input: 27K+ binary features extracted from the last 4 observations (14 binary features for every cell); from the Mario AI competition 2009
Output: Jump in {0,1}, Right in {0,1}, Left in {0,1}, Speed in {0,1}

SLIDE 19

Training (expert)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 20

Warm-up: Supervised learning

1. Collect trajectories from expert πref
2. Store as dataset D = { ( o, πref(o,y) ) | o ~ πref }
3. Train classifier π on D
➢ Let π play the game!
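
A sketch of this warm-up loop in C++ follows. Env, Expert, and the 1-nearest-neighbour Classifier are toy stand-ins invented for the sketch (the real setting uses Mario observations and a real learner); the point is the shape of the data collection:

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Obs = std::vector<float>;
    using Dataset = std::vector<std::pair<Obs, int>>;   // pairs (o, pi_ref(o))

    // Toy environment and expert, just so the loop below actually runs.
    struct Env {
        int t = 0;
        Obs reset() { t = 0; return Obs{0.0f}; }
        Obs step(int /*a*/) { ++t; return Obs{float(t)}; }
        bool done() const { return t >= 10; }
    };
    struct Expert { int act(const Obs& o) const { return o[0] > 5.0f ? 1 : 0; } };

    // Stand-in classifier: 1-nearest-neighbour over the collected pairs.
    struct Classifier {
        Dataset mem;
        int act(const Obs& o) const {
            int best_a = 0; float best_d = 1e30f;
            for (const auto& [x, a] : mem) {
                float d = 0;
                for (std::size_t i = 0; i < o.size(); ++i)
                    d += (o[i] - x[i]) * (o[i] - x[i]);
                if (d < best_d) { best_d = d; best_a = a; }
            }
            return best_a;
        }
    };

    // 1. collect trajectories from the expert; 2. store (o, pi_ref(o)) pairs;
    // 3. train pi on that dataset. Then pi plays on its own.
    Classifier behavior_clone(Env& env, const Expert& expert, int episodes) {
        Dataset D;
        for (int e = 0; e < episodes; ++e) {
            Obs o = env.reset();
            while (!env.done()) {        // o ~ pi_ref: the expert drives
                int a = expert.act(o);
                D.push_back({o, a});     // record what the expert did here
                o = env.step(a);
            }
        }
        return Classifier{D};            // "training" = memorise, in this sketch
    }

Slide 22's failure mode is already visible here: D only contains states the expert visits, so π is untrained everywhere else.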
SLIDE 21

Test-time execution (sup. learning)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 22

What's the (biggest) failure mode?


The expert never gets stuck next to pipes ⇒ Classifier doesn't learn to recover!

SLIDE 23

Warm-up II: Imitation learning

1. Collect trajectories from expert πref
2. Dataset D0 = { ( o, πref(o,y) ) | o ~ πref }
3. Train π1 on D0
4. Collect new trajectories from π1
   ➢ But let the expert steer!
5. Dataset D1 = { ( o, πref(o,y) ) | o ~ π1 }
6. Train π2 on D0 ∪ D1

In general:
• Dn = { ( o, πref(o,y) ) | o ~ πn }
• Train πn+1 on ∪i≤n Di

If N = T log T, then L(πn) ≤ T εN + O(1) for some n, where εN is the classification error achieved on the aggregated data.
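
The same toy scaffolding extends to a sketch of this aggregation loop (a DAgger-style procedure; Env, Expert, and the 1-NN Policy are again invented stand-ins, and which πn you keep should really be chosen on held-out data):

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Obs = std::vector<float>;
    using Dataset = std::vector<std::pair<Obs, int>>;

    struct Env {                         // same toy environment as before
        int t = 0;
        Obs reset() { t = 0; return Obs{0.0f}; }
        Obs step(int /*a*/) { ++t; return Obs{float(t)}; }
        bool done() const { return t >= 10; }
    };
    struct Expert { int act(const Obs& o) const { return o[0] > 5.0f ? 1 : 0; } };

    struct Policy {                      // 1-nearest-neighbour stand-in learner
        Dataset mem;
        int act(const Obs& o) const {
            int best_a = 0; float best_d = 1e30f;
            for (const auto& [x, a] : mem) {
                float d = 0;
                for (std::size_t i = 0; i < o.size(); ++i)
                    d += (o[i] - x[i]) * (o[i] - x[i]);
                if (d < best_d) { best_d = d; best_a = a; }
            }
            return best_a;
        }
    };

    Policy dagger(Env& env, const Expert& expert, int rounds) {
        Dataset D;                        // aggregated: union of D_0 .. D_n
        Policy pi;                        // pi_1 comes out of round 0 below
        for (int n = 0; n < rounds; ++n) {
            Obs o = env.reset();
            while (!env.done()) {
                D.push_back({o, expert.act(o)});  // label = pi_ref(o): expert steers
                int a = (n == 0) ? expert.act(o)  // round 0: expert generates states
                                 : pi.act(o);     // later rounds: o ~ pi_n
                o = env.step(a);
            }
            pi = Policy{D};               // train pi_{n+1} on all data so far
        }
        return pi;                        // "for some n": pick the best on dev data
    }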

SLIDE 24

Test-time execution (DAgger)

Video credit: Stéphane Ross, Geoff Gordon and Drew Bagnell
SLIDE 25

What's the biggest failure mode?

Classifier only sees right versus not-right

  • No notion of better or worse
  • No partial credit
  • Must have a single target answer

SLIDE 26

Learning to search: AggraVaTe

1. Let the learned policy π drive for t timesteps, arriving at observation o
2. For each possible action a:
   • Take action a, then let the expert πref drive the rest
   • Record the overall loss, c_a
3. Update π based on the cost-sensitive example: ( o, 〈c_1, c_2, ..., c_K〉 )
4. Goto (1)
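
A compact sketch of one such update, using a value-semantic toy Env so each one-step deviation can roll out from a snapshot of the state. All the types here are invented for illustration, not the paper's implementation:

    #include <vector>

    using Obs = std::vector<float>;

    struct Env {                       // toy environment, copyable by value
        int t = 0, mistakes = 0;
        Obs obs() const { return Obs{float(t)}; }
        void step(int a) { if (a != 0) ++mistakes; ++t; }  // action 0 is "correct"
        bool done() const { return t >= 10; }
        float loss() const { return float(mistakes); }     // joint loss so far
    };

    struct Expert { int act(const Obs&) const { return 0; } };  // pi_ref
    struct Policy { int act(const Obs&) const { return 0; } };  // learned pi

    struct CSExample { Obs o; std::vector<float> costs; };      // (o, <c_1..c_K>)

    CSExample aggravate_example(Env env, const Policy& pi, const Expert& expert,
                                int t, int K) {
        for (int s = 0; s < t && !env.done(); ++s)   // 1. pi drives for t steps
            env.step(pi.act(env.obs()));
        CSExample ex{env.obs(), {}};
        for (int a = 0; a < K; ++a) {                // 2. for each action a:
            Env rollout = env;                       //    snapshot the state,
            rollout.step(a);                         //    take action a, then
            while (!rollout.done())                  //    expert drives the rest
                rollout.step(expert.act(rollout.obs()));
            ex.costs.push_back(rollout.loss());      //    record overall loss c_a
        }
        return ex;                                   // 3. update pi on (o, <c_a>)
    }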

SLIDE 27

Training time versus test accuracy

[Figure only]

SLIDE 28

Training time versus test accuracy

[Figure only]

SLIDE 29

Test time speed

[Figure only]

SLIDE 30

State of the art accuracy in...

Part of speech tagging (1 million words):
    wc (baseline: time just to read the data):  3.2 seconds
    US:        6 lines of code, 10 seconds to train
    CRFsgd:    1068 lines, 30 minutes
    CRF++:     777 lines, hours

Named entity recognition (200 thousand words):
    wc:        0.8 seconds
    US:        30 lines of code, 5 seconds to train
    CRFsgd:    1 minute
    CRF++:     10 minutes
    SVMstruct: 876 lines, 30 minutes (suboptimal accuracy)

SLIDE 31

The Magic

• You write some greedy "test-time" code
  • In your favorite imperative language (C++/Python)
  • It makes arbitrary calls to a Predict function
  • And you add some minor decoration
• We will automatically:
  • Perform learning
  • Generate non-deterministic (beam) search
  • Run faster than specialized learning software

SLIDE 32

How to train?

1. Generate an initial trajectory using a rollin policy
2. For each state R on that trajectory:
   a) For each possible action a (one-step deviations):
      i.   Take that action
      ii.  Complete this trajectory using a rollout policy
      iii. Obtain a final loss
   b) Generate a cost-sensitive classification example: ( Φ(R), 〈c_a〉_{a∈A} )

[Figure: rollin trajectory S → R → E; one-step deviations at R are each completed by the rollout policy, yielding losses such as 0, .2, and .8]
SLIDE 33

The magic in practice

What you write (pseudocode):

    run(vector<example> ec)
        for i = 0 .. ec.size
            y_true = get_example_label(ec[i])
            y_pred = Predict(ec[i], y_true)
            Loss( # of y_true != y_pred )

Loss: how bad was the entire sequence of predictions (at training time)?
The second argument to Predict is a "hint" about the correct decision, used only at training time.

What it really is (I'm really not hiding anything...):

    void run(search& sch, vector<example*> ec) {
        for (size_t i = 0; i < ec.size(); i++) {
            uint32_t y_true = get_example_label(ec[i]);
            uint32_t y_pred = sch.predict(ec[i], y_true);
            sch.loss(y_true != y_pred);
            if (sch.output().good())
                sch.output() << y_pred << ' ';
        }
    }
SLIDE 34

The illusion of control

• Execute run O(T×A) times, modifying Predict each time:

    for each time step myT = 1 .. T:
        for each possible action myA = 1 .. A:
            define Predict(...) = myA if t == myT, else π
            run your code in full
            set cost_myA = result of Loss
        make a classification example on x_myT with 〈cost_a〉

• Recall the code being re-run:

    run(vector<example> ec)
        for i = 0 .. ec.size
            y_true = get_example_label(ec[i])
            y_pred = Predict(ec[i], y_true)
            Loss( # of y_true != y_pred )
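
Here is a toy rendition of that re-execution trick in C++: run() is written once, and a harness re-runs it T×A times with Predict overridden at a single timestep. This mimics the control flow described on the slide, not VW's actual internals, and the fixed y_true is a placeholder:

    #include <functional>
    #include <vector>

    struct Searcher {
        std::function<int(int /*t*/)> predict;   // injected behaviour
        float loss = 0;
        void add_loss(float l) { loss += l; }
    };

    // "Your code": a T-step run that calls Predict once per step.
    void run(Searcher& sch, int T) {
        for (int t = 0; t < T; ++t) {
            int y_pred = sch.predict(t);
            sch.add_loss(y_pred != /*y_true=*/0 ? 1.0f : 0.0f);
        }
    }

    // For each timestep myT and action myA: re-run with Predict forced to myA
    // at myT (the learned policy pi everywhere else); read off cost_a = Loss.
    std::vector<std::vector<float>>
    one_step_deviation_costs(int T, int A, const std::function<int(int)>& pi) {
        std::vector<std::vector<float>> cost(T, std::vector<float>(A));
        for (int myT = 0; myT < T; ++myT)
            for (int myA = 0; myA < A; ++myA) {
                Searcher sch;
                sch.predict = [&](int t) { return t == myT ? myA : pi(t); };
                run(sch, T);                 // run your code in full
                cost[myT][myA] = sch.loss;   // one entry of <cost_a> for x_myT
            }
        return cost;
    }

The point of the trick: user code stays a straight-line program, and all exploration over (myT, myA) lives in the harness. Each row cost[myT] becomes a cost-sensitive example, closing the loop with slide 32.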
SLIDE 35

Entity/relation identification

SLIDE 36

Dependency parsing

SLIDE 37

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks
SLIDES 38-69

[No text could be extracted from these slides]
SLIDE 70

Outline

➢ Background: learning to search
➢ Stuff I did in the Spring
  ➢ Imperative DSL/library for learning to search
  ➢ SOTA examples for tagging, parsing, relation extraction, etc.
  ➢ Learning to search under bandit feedback
  ➢ Hardness results for learning to search
  ➢ Active learning for accelerating learning to search
➢ Stuff I'm trying to do now
  ➢ Distant supervision
  ➢ Mashups with recurrent neural networks

➢ Observation: rollouts at all time steps are not equally useful
➢ Solution: importance-weighted active learning to select where to roll out versus skip
➢ Hacky heuristic: 5× speedup, slightly increased accuracy
➢ Training RNNs with LOLS yields drastic increases in performance on non-adversarial synthetic data

SLIDE 71

Distant supervision

➢ Learning with a human in the loop
➢ Repeat forever:
  ➢ Information need
  ➢ Machine makes complex prediction
  ➢ Human is happy or unhappy, provides extra feedback
  ➢ Machine learns
  ➢ Human learns
➢ How to handle the last step?
SLIDE 72

Alekh Agarwal, Kai-Wei Chang, Akshay Krishnamurthy, John Langford, Alina Beygelzimer, Paul Mineiro, Stéphane Ross, He He

SLIDE 73
• Novel programming paradigm for integrating ML into software
• State of the art results on many tasks, very quickly, with little code
• New problems, new algorithms:
  • Positive results (a notion of local optimality, and regret guarantees)
  • Negative results (hardness of exact local optimality)
• Lots of places to go from here...

Thanks! Questions?