Theoretical Analysis of Domain Adaptation Current state of the art Shai Ben-David September 14, 2012

Domain Adaptation
Most statistical learning guarantees are based on assuming that the learning environment is unchanged throughout the learning process. Formally, it is common to assume that both the training and the test examples are generated i.i.d. by the same fixed probability distribution. This is unrealistic for many ML applications.

Learning when Training and Test distributions differ
Examples:
◮ Spam filters: train on email arriving at one address, test on a different mailbox.
◮ Natural Language Processing tasks: train on some content domains, test on others.
There is rather little theoretical understanding so far.

Why care about theoretical understanding?
◮ Know when to use (and when not to use) algorithmic paradigms.
◮ Have some performance guarantees.
◮ Help choose an appropriate algorithmic approach (based on prior knowledge about the task at hand).
◮ The joy of understanding...

Example: Domain adaptation for POS tagging
Structural Correspondence Learning (Blitzer, McDonald, Pereira 2005):
1. Choose a set of pivot words (determiners, prepositions, connectors and frequently occurring verbs).
2. Represent every word in a text as a vector of its correlations with each of the pivot words.
3. Train a linear separator on the images of the training data coming from one domain and use it for tagging on the other.
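Step 2 above can be illustrated with a toy sketch. This is not the authors' implementation: the function name `pivot_representation`, the example sentences, and the use of a sentence-level co-occurrence rate as a stand-in for "correlation" are all assumptions made here for illustration.

```python
def pivot_representation(word, sentences, pivots):
    """Map a word to a vector with one coordinate per pivot word.

    Each coordinate is the fraction of sentences containing `word` that also
    contain the pivot. This co-occurrence rate is a simple stand-in for the
    "correlation" mentioned in the slides (an assumption, not the SCL method).
    """
    containing = [s for s in sentences if word in s]
    if not containing:
        # Unseen word: no evidence about any pivot.
        return [0.0] * len(pivots)
    return [sum(p in s for s in containing) / len(containing) for p in pivots]


# Hypothetical tokenized corpus and pivot set.
sentences = [
    ["the", "dog", "runs"],
    ["the", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
]
pivots = ["the", "a"]

dog_vec = pivot_representation("dog", sentences, pivots)
cat_vec = pivot_representation("cat", sentences, pivots)
```

Because the pivots occur frequently in both domains, words from either domain land in the same low-dimensional space, which is what lets a separator trained on one domain be applied to the other.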

Abstraction and analysis (BD, Blitzer, Crammer, Pereira 2005)
◮ Embed the original attribute space into some joint feature space in which:
1. The two tasks look similar.
2. The source task can still be well classified.
◮ Then, treat the images of points from both distributions as if they were coming from a single distribution.

Formalism
Domain: $X$
Label set: $\{0, 1\}$
Source distribution: $P_S$ over $X \times \{0, 1\}$
Target distribution: $P_T$ over $X \times \{0, 1\}$
A DA-learner gets a labeled sample $S$ from the source and a (large) unlabeled sample $T$ from the target, and outputs a label predictor $h : X \to \{0, 1\}$.
Goal: learn a predictor with small target error,
$$\mathrm{Err}_{P_T}(h) := \Pr_{(x, y) \sim P_T}[h(x) \neq y] \le \epsilon.$$

The error bound supporting that paradigm
[BD, Blitzer, Crammer, Pereira 2006] [Mansour, Mohri, Rostamizadeh 2009]
For all $h \in H$:
$$\mathrm{Err}_T(h) \le \mathrm{Err}_S(h) + A + \lambda,$$
where $\mathrm{Err}_S$ and $\mathrm{Err}_T$ denote the errors with respect to $P_S$ and $P_T$, $A$ is an additive measure of discrepancy between the marginals, and $\lambda$ is a measure of the discrepancy between the labels, both depending on $H$. Namely,
$$A = d_{H \Delta H}(P_T, P_S) \overset{\mathrm{def}}{=} \sup \{ |P_T(h \Delta h') - P_S(h \Delta h')| : h, h' \in H \}$$
and
$$\lambda = \inf \{ \mathrm{Err}_T(h) + \mathrm{Err}_S(h) : h \in H \}.$$
(The Mansour et al. result uses a variation of this, $\mathrm{Err}_T(h_S) + \mathrm{Err}_S(h_T)$, where $h_S$ and $h_T$ are minimum error classifiers in $H$ for $P_S$ and $P_T$, respectively.)
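To make the quantity $A$ concrete, here is a minimal sketch of its empirical counterpart, computed from two unlabeled samples. It assumes a small finite hypothesis class of one-dimensional thresholds (a hypothetical choice, not from the slides) and reads $h \Delta h'$ as the set of points on which $h$ and $h'$ disagree. The function names and sample values are illustrative only.

```python
import itertools


def threshold(t):
    """Hypothetical hypothesis class: h_t(x) = 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0


def disagreement_mass(h1, h2, sample):
    """Empirical probability of the region where h1 and h2 disagree."""
    return sum(h1(x) != h2(x) for x in sample) / len(sample)


def empirical_d_hdh(hypotheses, source_sample, target_sample):
    """Largest gap between target and source mass over all disagreement regions.

    Pairs with h1 == h2 contribute 0, so iterating over distinct pairs
    (with a default of 0.0) gives the same maximum.
    """
    return max(
        (
            abs(
                disagreement_mass(h1, h2, target_sample)
                - disagreement_mass(h1, h2, source_sample)
            )
            for h1, h2 in itertools.combinations(hypotheses, 2)
        ),
        default=0.0,
    )


hypotheses = [threshold(t) for t in (0.0, 1.0, 2.0, 3.0)]
source_sample = [0.5, 0.5, 1.5, 2.5]  # unlabeled points drawn from the source
target_sample = [1.5, 2.5, 2.5, 2.5]  # unlabeled points drawn from the target

d = empirical_d_hdh(hypotheses, source_sample, target_sample)
```

Note that only unlabeled data is needed here, which is why this term can be estimated from the target sample available to a DA-learner, whereas $\lambda$ involves target labels and cannot.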

From the bound to an algorithm
The bounds imply error guarantees for any algorithm that learns well with respect to the source task. One example is the simple empirical risk minimization paradigm ERM($H$), provided that $H$ has limited capacity (say, finite VC-dimension).
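A minimal sketch of ERM($H$) trained on source data only, assuming a finite class of one-dimensional threshold predictors. The class, the sample values, and the labeled target set (used here purely for evaluation, which a DA-learner would not have) are hypothetical.

```python
def predict(t, x):
    """Threshold predictor h_t(x) = 1 if x >= t, else 0."""
    return 1 if x >= t else 0


def empirical_error(t, labeled_sample):
    """Fraction of labeled points (x, y) that h_t misclassifies."""
    return sum(predict(t, x) != y for x, y in labeled_sample) / len(labeled_sample)


def erm(thresholds, labeled_sample):
    """Return a threshold with minimum empirical error on the sample."""
    return min(thresholds, key=lambda t: empirical_error(t, labeled_sample))


thresholds = [0.0, 1.0, 2.0, 3.0]
source_labeled = [(0.2, 0), (0.8, 0), (1.4, 1), (2.6, 1)]
# Labeled target points, for evaluation only in this illustration.
target_labeled = [(1.2, 0), (1.8, 0), (2.4, 1), (2.9, 1)]

best_t = erm(thresholds, source_labeled)
source_err = empirical_error(best_t, source_labeled)
target_err = empirical_error(best_t, target_labeled)
```

In this toy data the source and target label the region between 1 and 2 differently, so the source-optimal threshold does poorly on the target. That is the situation in which the $\lambda$ term of the bound is large and the guarantee becomes weak.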

Overview
Three aspects determining a DA framework:
1. The type of training samples available to the learner.
2. The assumptions on the relationship between the source (training) and target (test) data-generating distributions.
3. The prior knowledge about the task that the learner has.
Two types of algorithms:
1. Conservative: Learn the source task and apply the result to the target.
2. Adaptive: Adapt the output classifier based on target information.

The training samples available to the learner
Types of "proxy data":
◮ Labeled data from a different distribution (the source distribution).
◮ (Lots of) unlabeled data from the target distribution.
Questions:
◮ Can we learn solely with source-generated labeled data?
◮ Can target-generated unlabeled data be beneficial or even necessary?
◮ How can we utilize the proxy data if we are also given (a little) labeled data from the target distribution?

Relatedness assumptions
Relatedness of the unlabeled marginal distributions (both measures are taken with respect to some family of domain subsets):
◮ Multiplicative measure of distance (the ratio between the source and target probabilities of domain subsets).
◮ Additive measure of distance (the difference between the source and target probabilities of domain subsets, like the $d_{H \Delta H}$ above).
Relatedness of the labeling functions:
◮ Absolute (like the covariate shift assumption).
◮ Relative to a hypothesis class (like the $\lambda$ parameter above).

Prior knowledge
Prior knowledge about either the source task or the target task. For example:
◮ Realizability by some class of predictors.
◮ Good approximation by some class.
◮ Good kernel.
What are the differences between source and target prior knowledge?

The downside of conservative algorithms
They can thus be viewed as indicating "When is domain adaptation not needed?" (the algorithm is just learning with respect to the source-generated training data).

Adaptive algorithms
A common adaptive paradigm is importance reweighting. Namely, reweight the source-generated labeled training sample so that it looks as if it was generated by the target task. This is a rather common paradigm in practice. However, for a theoretical justification of this paradigm, we need some further assumptions.
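A minimal sketch of importance reweighting on a small discrete domain, where the weight of a point is estimated as the ratio of its empirical target frequency to its empirical source frequency. The function names, the count-based estimator, and the data are assumptions for illustration. The reweighted error tracks the target error only when labels behave the same way in both tasks, which is the kind of further assumption the slide alludes to.

```python
from collections import Counter


def importance_weights(source_points, target_points):
    """Estimate w(x) = P_T(x) / P_S(x) from empirical frequencies.

    Weights are defined only for points seen in the source sample; target
    mass on points never seen in the source cannot be recovered this way.
    """
    s_counts = Counter(source_points)
    t_counts = Counter(target_points)
    n_s, n_t = len(source_points), len(target_points)
    return {x: (t_counts[x] / n_t) / (s_counts[x] / n_s) for x in s_counts}


def weighted_error(h, labeled_source, weights):
    """Importance-weighted empirical error of h on the labeled source sample."""
    return sum(weights[x] for x, y in labeled_source if h(x) != y) / len(labeled_source)


# Hypothetical data: the source mostly sees x = 0, the target mostly sees x = 1.
labeled_source = [(0, 0), (0, 0), (0, 0), (1, 1)]
unlabeled_target = [0, 1, 1, 1]

weights = importance_weights([x for x, _ in labeled_source], unlabeled_target)


def always_zero(x):
    """A predictor that outputs label 0 everywhere."""
    return 0


unweighted = sum(always_zero(x) != y for x, y in labeled_source) / len(labeled_source)
reweighted = weighted_error(always_zero, labeled_source, weights)
```

Here the unweighted source error understates how badly the constant predictor would do on the target, while the reweighted error accounts for the target placing most of its mass on the point it gets wrong.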

Relatedness assumptions for the labeling: Covariate shift
The covariate shift assumption: the labeling function is the same for the source and target tasks. (This is reasonable for some DA tasks, such as part-of-speech tagging, but may fail in others.)
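Under covariate shift, a one-line change of measure shows why reweighting source examples can recover the target error. The notation is introduced here and is not from the slides: $f$ is the shared labeling function, $D_S$ and $D_T$ are the marginals of $P_S$ and $P_T$ over $X$, and $w$ is their density ratio. The derivation additionally assumes that $D_T$ puts no mass where $D_S$ has none, so that $w$ is well defined.

```latex
\begin{align*}
\mathrm{Err}_T(h)
  &= \mathbb{E}_{x \sim D_T}\bigl[\mathbf{1}[h(x) \neq f(x)]\bigr] \\
  &= \mathbb{E}_{x \sim D_S}\bigl[w(x)\,\mathbf{1}[h(x) \neq f(x)]\bigr],
  \qquad w(x) = \frac{dD_T}{dD_S}(x).
\end{align*}
```

If the labeling functions differ between the tasks, the first line no longer uses the same $f$ as the source sample, and the identity fails.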
