Distributed Asynchronous Online Learning for Natural Language Processing
Kevin Gimpel, Dipanjan Das, and Noah A. Smith
Introduction

Two recent lines of research in speeding up large learning problems:
- Parallel/distributed computing
- Online (and mini-batch) learning algorithms: stochastic gradient descent, perceptron, MIRA, stepwise EM

How can we bring together the benefits of parallel computing and online learning?
Introduction

We use asynchronous algorithms (Nedic, Bertsekas, and Borkar, 2001; Langford, Smola, and Zinkevich, 2009).

We apply them to structured prediction tasks:
- Supervised learning
- Unsupervised learning, with both convex and non-convex objectives

Asynchronous learning speeds convergence and works best with small mini-batches.
Problem Setting

- Iterative learning
- Moderate to large numbers of training examples
- Expensive inference procedures for each example
- For concreteness, we start with gradient-based optimization
- Single machine with multiple processors
- Exploit shared memory for parameters, lexicons, feature caches, etc.
- Maintain one master copy of model parameters
Single-Processor Batch Learning

Dataset: D    Processors: P1    Parameters: θ

Repeat:
- Calculate the gradient on the full dataset D using the current parameters θ_t
- Update θ_t using that gradient to obtain θ_{t+1}

[Figure: timeline showing processor P1 alternating between gradient computation over D and parameter updates]
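A minimal sketch of this loop, assuming a hypothetical grad(theta, examples) function that returns the objective's gradient on a set of examples and a fixed step size eta (neither is specified on the slide):

```python
# Sketch of single-processor batch learning: one full pass over the data
# per update. `grad` and `eta` are hypothetical stand-ins for the
# task-specific objective gradient and learning rate.
def batch_learn(theta, data, grad, eta=0.1, iterations=50):
    for _ in range(iterations):
        g = grad(theta, data)                              # gradient at theta_t on all of D
        theta = [w + eta * gw for w, gw in zip(theta, g)]  # ascent step to obtain theta_{t+1}
    return theta
```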
Parallel Batch Learning

Dataset: D = D1 ∪ D2 ∪ D3    Processors: P1, P2, P3    Parameters: θ

Repeat:
- Divide the data into parts and compute the gradient on the parts in parallel, all using the current parameters θ_t
- One processor combines the part gradients and updates the parameters to obtain θ_{t+1}

[Figure: timeline showing P1, P2, P3 computing part gradients in parallel, followed by a single parameter update per iteration]
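A sketch of this fork-join pattern, reusing the hypothetical grad and eta from the previous sketch; per-part gradients are computed in parallel and summed before a single update:

```python
# Sketch of parallel batch learning: split D into parts, compute part
# gradients in parallel, then one process sums them and updates theta.
from concurrent.futures import ProcessPoolExecutor

def parallel_batch_learn(theta, data, grad, eta=0.1, iterations=50, workers=3):
    parts = [data[i::workers] for i in range(workers)]     # D = D1 ∪ D2 ∪ D3
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # All parts use the same (current) parameters theta_t.
            grads = list(pool.map(grad, [theta] * workers, parts))
            total = [sum(gs) for gs in zip(*grads)]         # combine part gradients
            theta = [w + eta * gw for w, gw in zip(theta, total)]
    return theta
```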
Parallel Synchronous Mini-Batch Learning
Finkel, Kleeman, and Manning (2008)

Mini-batches: B = B1 ∪ B2 ∪ B3    Processors: P1, P2, P3    Parameters: θ

Same architecture, just more frequent updates: each mini-batch is split across the processors, and the parameters are updated after every mini-batch.

[Figure: timeline showing per-mini-batch parallel gradient computation followed by frequent parameter updates]
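A sketch of the synchronous mini-batch variant under the same assumptions: the fork-join step is applied to each mini-batch rather than to the full dataset:

```python
# Sketch of synchronous mini-batch learning: same fork-join architecture as
# parallel batch learning, applied per mini-batch so updates are frequent.
from concurrent.futures import ThreadPoolExecutor

def sync_minibatch_learn(theta, minibatches, grad, eta=0.1, workers=3):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in minibatches:                           # B = B1 ∪ B2 ∪ B3
            parts = [batch[i::workers] for i in range(workers)]
            grads = list(pool.map(grad, [theta] * workers, parts))
            total = [sum(gs) for gs in zip(*grads)]
            # Implicit barrier above: all workers finish before the update.
            theta = [w + eta * gw for w, gw in zip(theta, total)]
    return theta
```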
Parallel Asynchronous Mini-Batch Learning
Nedic, Bertsekas, and Borkar (2001)

Mini-batches: B1, B2, B3, ...    Processors: P1, P2, P3    Parameters: θ

- Each processor repeatedly takes a mini-batch, computes a gradient, and applies an update without waiting for the others
- Gradients are computed using stale parameters
- Increased processor utilization: the only idle time is caused by the lock for updating the parameters

[Figure: timeline showing P1, P2, P3 each processing mini-batches and updating θ independently, with no synchronization barriers]
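A sketch of the asynchronous variant, again with hypothetical grad and eta: each worker reads a possibly stale snapshot of the parameters, and the lock is held only while applying the update.

```python
# Sketch of asynchronous mini-batch learning in the spirit of Nedic et al. (2001):
# workers pull mini-batches from a shared queue, compute gradients on possibly
# stale parameters, and lock only to apply their updates.
import queue
import threading

def async_minibatch_learn(theta, minibatches, grad, eta=0.1, workers=4):
    work = queue.Queue()
    for b in minibatches:
        work.put(b)
    lock = threading.Lock()

    def worker():
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return
            snapshot = list(theta)        # parameters may be stale by the time the gradient is done
            g = grad(snapshot, batch)
            with lock:                    # the only idle time: the parameter update
                for i, gw in enumerate(g):
                    theta[i] += eta * gw

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```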
Theoretical Results

How does the use of stale parameters affect convergence?

Convergence results exist for convex optimization using stochastic gradient descent:
- Convergence is guaranteed when the maximum delay is bounded (Nedic, Bertsekas, and Borkar, 2001)
- Convergence rates are linear in the maximum delay (Langford, Smola, and Zinkevich, 2009)
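One compact way to write the delayed update analyzed in these papers; the delay notation τ(t) is introduced here for illustration and is not on the slide:

```latex
% Asynchronous (delayed) stochastic gradient update: the gradient applied at
% step t was computed with parameters from step t - \tau(t).
\theta_{t+1} = \theta_t - \eta_t \, \nabla f_{B_t}\!\bigl(\theta_{t-\tau(t)}\bigr),
\qquad \tau(t) \le \tau_{\max}.
```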
Experiments

- Named-Entity Recognition: CRF, Stochastic Gradient Descent, convex (Y), |D| = 15k, |θ| = 1.3M, m = 4
- Word Alignment: IBM Model 1, Stepwise EM, convex (Y), |D| = 300k, |θ| = 14.2M, m = 10k
- Unsupervised Part-of-Speech Tagging: HMM, Stepwise EM, non-convex (N), |D| = 42k, |θ| = 2M, m = 4

m = mini-batch size

To compare algorithms, we use wall clock time (with a dedicated 4-processor machine).
Experiments

Named-Entity Recognition: CRF, Stochastic Gradient Descent, convex (Y), |D| = 15k, |θ| = 1.3M, m = 4

- CoNLL 2003 English data
- Label each token with an entity type (person, location, organization, or miscellaneous) or non-entity
- We show convergence in F1 on development data
Asynchronous Updating Speeds Convergence

[Figure: F1 on development data vs. wall clock time (hours), comparing Asynchronous (4 processors), Synchronous (4 processors), and Single-processor]

All use a mini-batch size of 4.
Comparison with Ideal Speed-up

[Figure: F1 vs. wall clock time (hours), comparing Asynchronous (4 processors) with an ideal speed-up curve]
Why Does Asynchronous Converge Faster?

- Processors are kept in near-constant use
- Synchronous SGD leads to idle processors, hence a need for load-balancing
[Figure: two panels of F1 vs. wall clock time (hours); one compares Synchronous with 4 processors, 2 processors, and Single-processor; the other compares Asynchronous with 4 processors, 2 processors, and Single-processor]

Clearer improvement for asynchronous algorithms when increasing the number of processors.
Artificial Delays

[Figure: F1 vs. wall clock time (hours) for Asynchronous with no delay, µ = 5, µ = 10, and µ = 20, and Single-processor with no delay]

- After completing a mini-batch, 25% chance of delaying
- Delay (in seconds) sampled from N(µ, µ/2)
- Avg. time per mini-batch = 0.62 s
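A small sketch of how this delay protocol could be reproduced; the N(µ, µ/2) form is reconstructed from the garbled slide text and should be treated as an assumption:

```python
# Sketch of the artificial-delay protocol: after each mini-batch there is a
# 25% chance of sleeping. The N(mu, mu/2) delay distribution is an assumption
# reconstructed from the slide, truncated at zero seconds.
import random
import time

def maybe_delay(mu):
    if random.random() < 0.25:
        delay = max(0.0, random.gauss(mu, mu / 2))  # delay in seconds
        time.sleep(delay)
```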
Experiments

Word Alignment: IBM Model 1, Stepwise EM, convex (Y), |D| = 300k, |θ| = 14.2M, m = 10k

Given parallel sentences, draw links between words:
  konnten sie es übersetzen ?
  could you translate it ?

We show convergence in log-likelihood (convergence in AER is similar).
Stepwise EM
(Sato and Ishii, 2000; Cappe and Moulines, 2009)

- Similar to stochastic gradient descent in the space of sufficient statistics, with a particular scaling of the update
- More efficient than incremental EM (Neal and Hinton, 1998)
- Found to converge much faster than batch EM (Liang and Klein, 2009)
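A common way to write the stepwise EM update (following Cappe and Moulines, 2009, and Liang and Klein, 2009); the symbols s, s̄_k, η_k, and α are introduced here for illustration and are not taken from the slide:

```latex
% Stepwise EM: interpolate the running sufficient statistics s toward the
% expected statistics collected on mini-batch B_k, then re-estimate theta.
s \leftarrow (1 - \eta_k)\, s + \eta_k\, \bar{s}_k(B_k),
\qquad \eta_k = (k + 2)^{-\alpha},\ \alpha \in (0.5, 1],
\qquad \theta \leftarrow \arg\max_{\theta'} \ \ell(\theta'; s).
```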
Word Alignment Results

[Figure: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor)]

For stepwise EM, mini-batch size = 10,000.

Asynchronous is no faster than synchronous!
Comparing Mini-Batch Sizes

[Figure: log-likelihood vs. wall clock time (minutes) for Asynch. and Synch. stepwise EM with m = 10,000, m = 1,000, and m = 100]

- Asynchronous is faster when using small mini-batches
- Error from asynchronous updating
Comparison with Ideal Speed-up

[Figure: log-likelihood vs. wall clock time (minutes), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), Batch EM (1 processor), and an ideal speed-up curve]

For stepwise EM, mini-batch size = 10,000.
MapReduce?

We also ran these algorithms on a large MapReduce cluster (M45 from Yahoo!).

Batch EM:
- Each iteration is one MapReduce job, using 24 mappers and 1 reducer

Asynchronous Stepwise EM:
- 4 mini-batches processed simultaneously, each run as a MapReduce job
- Each uses 6 mappers and 1 reducer
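A rough sketch of how such a driver might keep four mini-batch E-step jobs in flight at once; submit_e_step_job and stepwise_update are hypothetical helpers standing in for the actual job submission and update code, not real Hadoop or M45 APIs:

```python
# Sketch of the asynchronous driver on a MapReduce cluster: up to 4 mini-batch
# E-step jobs run concurrently; as each finishes, its sufficient statistics are
# folded into the model with a stepwise update (the model it used may be stale).
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import islice

def async_stepwise_em_driver(model, minibatches, submit_e_step_job, stepwise_update,
                             in_flight=4):
    batches = iter(minibatches)
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        futures = {pool.submit(submit_e_step_job, model, b)
                   for b in islice(batches, in_flight)}
        while futures:
            done = next(as_completed(futures))   # first job to finish
            futures.remove(done)
            stats = done.result()                # sufficient statistics from that job
            model = stepwise_update(model, stats)
            for b in islice(batches, 1):         # launch the next mini-batch, if any
                futures.add(pool.submit(submit_e_step_job, model, b))
    return model
```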
MapReduce?

[Figure: log-likelihood vs. wall clock time (minutes), two panels; one shows the shared-memory results (Asynch. Stepwise EM with 4 processors, Synch. Stepwise EM with 4 and 1 processors, Batch EM with 1 processor), the other shows the MapReduce results (Asynch. Stepwise EM and Batch EM)]
Experiments

Unsupervised Part-of-Speech Tagging: HMM, Stepwise EM, non-convex (N), |D| = 42k, |θ| = 2M, m = 4

- Bigram HMM with 45 states
- We plot convergence in likelihood and many-to-1 accuracy
Part-of-Speech Tagging Results

[Figure: two panels vs. wall clock time (hours); one shows log-likelihood (×10^6), the other many-to-1 accuracy (%), comparing Asynch. Stepwise EM (4 processors), Synch. Stepwise EM (4 processors), Synch. Stepwise EM (1 processor), and Batch EM (1 processor)]

Mini-batch size = 4 for stepwise EM.
Comparison with Ideal

[Figure: log-likelihood (×10^6) and accuracy (%) vs. wall clock time (hours), comparing Asynch. Stepwise EM (4 processors) with an ideal speed-up curve]

Asynchronous better than ideal?
Conclusions and Future Work

- Asynchronous algorithms speed convergence and do not introduce additional error
- Effective for unsupervised learning and non-convex objectives
- If your problem works well with small mini-batches, try this!

Future work:
- Theoretical results for the non-convex case
- Explore effects of increasing the number of processors
- New architectures (maintain multiple copies of θ)