SLIDE 1

6. Approximation and fitting
Prof. Ying Cui
Department of Electrical Engineering, Shanghai Jiao Tong University
2018

SLIDE 2

Outline

◮ Norm approximation
◮ Least-norm problems
◮ Regularized approximation
◮ Robust approximation

SLIDE 3

Basic norm approximation problem

    min_x ‖Ax − b‖

where A ∈ R^{m×n} with m ≥ n and independent columns and b ∈ R^m are given, and ‖·‖ is a norm on R^m

◮ a solvable convex problem
◮ optimal value is zero iff b ∈ R(A) = {Ax | x ∈ R^n}
◮ solution is A^{−1}b when m = n
◮ optimal value is nonzero, and the problem is more interesting and useful, if b ∉ R(A)
◮ an optimal point x* = argmin_x ‖Ax − b‖ is called an approximate solution of Ax ≈ b in the norm ‖·‖
◮ r = Ax − b is called the residual for the problem (see the sketch below)
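
A minimal cvxpy sketch of the basic problem, assuming cvxpy is available and using synthetic data; changing the second argument of cp.norm switches among the norms used throughout this deck:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))   # m = 100 >= n = 10; independent columns w.p. 1
b = rng.standard_normal(100)

x = cp.Variable(10)
for p in (1, 2, np.inf):
    prob = cp.Problem(cp.Minimize(cp.norm(A @ x - b, p)))
    prob.solve()
    # x.value is the approximate solution of Ax ~ b in the chosen norm
    print(p, prob.value)
```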

SLIDE 4

Interpretations

approximation interpretation:
◮ fit/approximate the vector b by a linear combination of the columns of A, as closely as possible, with deviation measured in ‖·‖
◮ Ax = x_1 a_1 + · · · + x_n a_n (a_1, · · · , a_n ∈ R^m: columns of A)
◮ Ax* is the best approximation of b
◮ the approximation problem is also called a regression problem
◮ a_1, · · · , a_n are called regressors
◮ x*_1 a_1 + · · · + x*_n a_n is called the regression of b

estimation interpretation:
◮ estimate a parameter vector x based on an imperfect linear vector measurement b
◮ consider a linear measurement model y = Ax + v, where y ∈ R^m is a vector measurement, x ∈ R^n is a vector of parameters to be estimated, and v ∈ R^m is a measurement error that is unknown but presumed to be small in ‖·‖
◮ x* is the best guess of x, given y = b

SLIDE 5

Interpretations

geometric interpretation:
◮ find the projection of the point b onto the subspace R(A) in ‖·‖, i.e., the point in R(A) closest to b, i.e., an optimal point of

    min_u ‖u − b‖  s.t. u ∈ R(A)

◮ Ax* is the point in R(A) closest to b

design interpretation:
◮ choose a vector of design variables x that achieves, as closely as possible, the target/desired results b
◮ the residual vector r = Ax − b can be interpreted as the deviation between the actual results Ax and the target/desired results b
◮ x* is the design that best achieves the desired results b

SLIDE 6

Examples

least-squares approximation (‖·‖_2): equivalent problem (QP, obtained by squaring the objective):

    min_x ‖Ax − b‖_2² = r_1² + · · · + r_m²

◮ the objective function f(x) = x^T A^T A x − 2x^T A^T b + b^T b is convex quadratic
◮ a point x is optimal iff it satisfies ∇f(x) = 2A^T A x − 2A^T b = 0, i.e., the normal equations A^T A x = A^T b, which always have a solution
◮ unique solution x* = (A^T A)^{−1} A^T b if the columns of A are independent (i.e., rank A = n); see the numpy sketch below
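
A numpy sketch of the least-squares solution (the function name solve_ls is illustrative, not from the slides); np.linalg.lstsq uses an SVD rather than forming A^T A explicitly, which is the numerically preferred route:

```python
import numpy as np

def solve_ls(A, b):
    """Least-squares approximate solution, x* = (A^T A)^{-1} A^T b
    when A has independent columns."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)
x_star = solve_ls(A, b)
# optimality condition: A^T (A x* - b) = 0
assert np.allclose(A.T @ (A @ x_star - b), 0, atol=1e-8)
```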

SLIDE 7

Examples

Chebyshev or minimax approximation (‖·‖_∞):

    min_x ‖Ax − b‖_∞ = max{|r_1|, · · · , |r_m|}

◮ equivalent problem (LP):

    min_{x∈R^n, t∈R} t  s.t. −t1 ⪯ Ax − b ⪯ t1

Sum of absolute residuals approximation (‖·‖_1):

    min_x ‖Ax − b‖_1 = |r_1| + · · · + |r_m|

◮ equivalent problem (LP):

    min_{x∈R^n, t∈R^m} 1^T t  s.t. −t ⪯ Ax − b ⪯ t

(a sketch of the Chebyshev LP follows)
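
A sketch of the Chebyshev LP with scipy's linprog, stacking z = (x, t) as one variable vector; this is the standard epigraph trick, and the function name is my own:

```python
import numpy as np
from scipy.optimize import linprog

def chebyshev_approx(A, b):
    """Solve min_x ||Ax - b||_inf as an LP over z = (x, t)."""
    m, n = A.shape
    c = np.r_[np.zeros(n), 1.0]          # objective: minimize t
    ones = np.ones((m, 1))
    # Ax - b <= t*1  and  -(Ax - b) <= t*1, rewritten as A_ub z <= b_ub
    A_ub = np.block([[A, -ones], [-A, -ones]])
    b_ub = np.r_[b, -b]
    bounds = [(None, None)] * n + [(0, None)]   # x free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n], res.x[n]           # (x*, optimal value ||Ax*-b||_inf)
```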

SLIDE 8

Penalty function approximation

    min_{x∈R^n, r∈R^m} φ(r_1) + · · · + φ(r_m)  s.t. r = Ax − b

◮ the (residual) penalty function φ : R → R is a measure of dislike of a residual, and is assumed to be convex
◮ in many cases, φ is symmetric, nonnegative, and satisfies φ(0) = 0
◮ interpretation: minimize the total penalty incurred by the residuals of the approximation Ax of b
◮ extension of the equivalent problem of l_p-norm (1 ≤ p < ∞) approximation

    min_x ‖Ax − b‖_p^p = |r_1|^p + · · · + |r_m|^p

with a separable and symmetric function of the residuals as objective function

SLIDE 9

Common penalty functions

◮ l_p-norm penalty function: φ(u) = |u|^p (1 ≤ p < ∞)
  ◮ quadratic penalty function: φ(u) = u²
  ◮ absolute value penalty function: φ(u) = |u|
◮ deadzone-linear penalty function with deadzone width a > 0:

    φ(u) = max{0, |u| − a} = { 0,        |u| ≤ a
                             { |u| − a,  |u| > a

◮ log-barrier penalty function with limit a > 0:

    φ(u) = { −a² log(1 − (u/a)²),  |u| < a
           { ∞,                    |u| ≥ a

(numpy implementations of these penalties are sketched below)
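
For concreteness, a numpy sketch of the four penalties, vectorized over a residual array (naming is my own):

```python
import numpy as np

def quad(u):
    return u ** 2

def absval(u):
    return np.abs(u)

def deadzone(u, a=0.5):
    """Zero penalty inside |u| <= a, linear growth outside."""
    return np.maximum(0.0, np.abs(u) - a)

def log_barrier(u, a=1.0):
    """-a^2 log(1 - (u/a)^2) for |u| < a, +inf otherwise."""
    out = np.full_like(u, np.inf, dtype=float)
    inside = np.abs(u) < a
    # log1p(-(u/a)^2) = log(1 - (u/a)^2)
    out[inside] = -a**2 * np.log1p(-(u[inside] / a) ** 2)
    return out

u = np.linspace(-0.95, 0.95, 5)
print(quad(u), absval(u), deadzone(u), log_barrier(u), sep="\n")
```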

SLIDE 10

Example

histogram of residual amplitudes for four penalty functions: φ(u) = |u|, φ(u) = u², φ(u) = max{0, |u| − 0.5}, φ(u) = −log(1 − u²)
◮ |u|: many zero or very small residuals, more large ones
◮ u²: many modest residuals, relatively fewer large ones
◮ deadzone-linear: many residuals right at the edge of the 'free' zone
◮ log barrier: residual distribution similar to that of the quadratic, except no residuals larger than 1
◮ the shape of the penalty function has a large effect on the distribution of residuals

SLIDE 11

Sensitivity to outliers or large errors

◮ in an estimation or regression context, an outlier is a measurement y_i = a_i^T x + v_i with a relatively large noise v_i
◮ often associated with a flawed measurement or faulty data
◮ ideally, guess which measurements are outliers, and either remove them or greatly lower their weight; this cannot be done by assigning zero penalty to very large residuals, since the fit could then make all residuals large to achieve a total penalty of zero
◮ sensitivity to outliers depends on the (relative) value of the penalty function for large residuals
◮ the least sensitive convex penalty functions are those that grow linearly, i.e., like |u|, for large u; such penalties are called robust (against outliers)

SLIDE 12

Robust convex penalty functions

◮ absolute value penalty function: φ(u) = |u|
◮ Huber penalty function (with parameter M > 0):

    φ_hub(u) = { u²,            |u| ≤ M
               { M(2|u| − M),   |u| > M

◮ example: use an affine function f(t) = α + βt to fit 42 points (t_i, y_i) (circles) with two obvious outliers
◮ left: Huber penalty for M = 1
◮ right: the fit using the quadratic penalty (dashed) is rotated away from the non-outlier data, toward the outliers; the fit using the Huber penalty (solid) gives a far better fit to the non-outlier data (a sketch of the Huber fit follows)
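
A minimal sketch of the affine Huber fit in cvxpy (cp.huber matches φ_hub above); the data here is synthetic, with two planted outliers standing in for the slide's example:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(-1, 1, 42)
y = 1.0 + 2.0 * t + 0.1 * rng.standard_normal(42)
y[[5, 30]] += 5.0                                  # two obvious outliers

alpha, beta = cp.Variable(), cp.Variable()
resid = alpha + beta * t - y
# Huber penalty with parameter M = 1
cp.Problem(cp.Minimize(cp.sum(cp.huber(resid, M=1)))).solve()
print("Huber fit:", alpha.value, beta.value)

# quadratic-penalty (least-squares) fit, pulled toward the outliers
a2, b2 = cp.Variable(), cp.Variable()
cp.Problem(cp.Minimize(cp.sum_squares(a2 + b2 * t - y))).solve()
print("LS fit:   ", a2.value, b2.value)
```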

SLIDE 13

Approximation with constraints

add constraints to the basic norm approximation problem
◮ in an approximation problem, constraints can be used to ensure that the approximation Ax of b satisfies certain properties
◮ in an estimation problem, constraints arise from prior knowledge of the vector x to be estimated, or from prior knowledge of the estimation error v
◮ in a geometric problem, constraints arise in determining the projection of a point b onto a set more complicated than a subspace, e.g., a cone or polyhedron

SLIDE 14

Examples

Nonnegativity constraints:

    min_x ‖Ax − b‖  s.t. x ⪰ 0

◮ approximate b using a conic combination of the columns of A
◮ estimate an x known to be nonnegative, e.g., powers, rates
◮ determine the projection of b onto the cone generated by the columns of A

Variable bounds:

    min_x ‖Ax − b‖  s.t. l ⪯ x ⪯ u

◮ estimate x with prior knowledge of an interval for each variable
◮ determine the projection of b onto the image of a box under the linear mapping induced by A
(see the cvxpy sketch below)
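
A cvxpy sketch of these constrained variants, plus the probability-simplex variant from the next slide; the constraint syntax is standard cvxpy, the data synthetic:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A, b = rng.standard_normal((30, 5)), rng.standard_normal(30)
x = cp.Variable(5)
obj = cp.Minimize(cp.norm(A @ x - b, 2))

# nonnegativity: conic combination of the columns of A
cp.Problem(obj, [x >= 0]).solve()

# variable bounds l <= x <= u
l, u = -np.ones(5), np.ones(5)
cp.Problem(obj, [l <= x, x <= u]).solve()

# probability simplex: convex combination of the columns of A
cp.Problem(obj, [x >= 0, cp.sum(x) == 1]).solve()
print(x.value)
```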

SLIDE 15

Examples

Probability distribution:

    min_x ‖Ax − b‖  s.t. x ⪰ 0, 1^T x = 1

◮ approximate b using a convex combination of the columns of A
◮ estimate proportions or relative frequencies

Norm ball constraint:

    min_x ‖Ax − b‖  s.t. ‖x − x_0‖ ≤ d

◮ estimate x with a prior guess x_0 and maximum plausible deviation d
◮ approximate b using a linear combination of the columns of A within the trust region ‖x − x_0‖ ≤ d

SLIDE 16

Least-norm problems

    min_x ‖x‖  s.t. Ax = b

where A ∈ R^{m×n} with m ≤ n and independent rows, b ∈ R^m, and ‖·‖ is a norm on R^n

◮ a solvable convex problem
◮ the only feasible point is A^{−1}b when m = n
◮ the problem is interesting only when m < n (Ax = b is underdetermined)
◮ an optimal point x* is called a least-norm solution of Ax = b in the norm ‖·‖
◮ reformulation as a norm approximation problem: min_u ‖x_0 + Zu‖, where x_0 is any solution of Ax = b and the columns of Z ∈ R^{n×(n−m)} form a basis for the nullspace of A, so that x_0 + Zu, u ∈ R^{n−m}, parameterizes the general solution of Ax = b
◮ extension: least-penalty problem

    min_x φ(x_1) + · · · + φ(x_n)  s.t. Ax = b

where φ : R → R is convex, nonnegative, and satisfies φ(0) = 0
(see the cvxpy sketch below)
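
A minimal cvxpy sketch of the least-norm problem; swapping the p argument of cp.norm changes the norm, and the nonzero counts hint at the sparsity behavior discussed on slide 18:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
A, b = rng.standard_normal((5, 20)), rng.standard_normal(5)  # m = 5 < n = 20

x = cp.Variable(20)
for p in (1, 2, np.inf):
    cp.Problem(cp.Minimize(cp.norm(x, p)), [A @ x == b]).solve()
    print(p, "nonzeros:", int(np.sum(np.abs(x.value) > 1e-6)))
```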

SLIDE 17

Interpretations

control or design interpretation:
◮ x gives n design variables (inputs), b gives m required results (outputs), and Ax = b represents m requirements on the design
◮ the design is underspecified, with n − m degrees of freedom (as m < n)
◮ choose the smallest ('most efficient') design (measured by the norm ‖·‖) that satisfies the requirements

estimation interpretation:
◮ x gives n parameters, and b gives m perfect measurements
◮ the measurements do not completely determine the parameters (as m < n), and the prior information is that the parameters are small (measured by the norm ‖·‖)
◮ choose the smallest ('most plausible') estimate consistent with the measurements

geometric interpretation:
◮ find the point in the affine set {x | Ax = b} with minimum distance (measured by the norm ‖·‖) to 0

SLIDE 18

Examples

least-squares solution of linear equations (‖·‖_2):
◮ equivalent problem (strictly convex):

    min_x x^T x  s.t. Ax = b

◮ can be solved via the KKT conditions:

    Ax* = b, 2x* + A^T ν* = 0  ⟹  x* = A^T(AA^T)^{−1} b, ν* = −2(AA^T)^{−1} b

Sparse solutions via least l_1-norm (‖·‖_1):
◮ equivalent problem (LP):

    min_{x,y∈R^n} 1^T y  s.t. −y ⪯ x ⪯ y, Ax = b

◮ tends to produce sparse solutions of Ax = b (see the sketch below)
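
A sketch contrasting the closed-form least l_2-norm solution with the l_1 problem (cvxpy handles the LP reformulation internally); data is synthetic:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
A, b = rng.standard_normal((10, 50)), rng.standard_normal(10)

# least l2-norm solution: x* = A^T (A A^T)^{-1} b
x_l2 = A.T @ np.linalg.solve(A @ A.T, b)

# least l1-norm solution
x = cp.Variable(50)
cp.Problem(cp.Minimize(cp.norm(x, 1)), [A @ x == b]).solve()

print("l2 nonzeros:", int(np.sum(np.abs(x_l2) > 1e-6)))      # typically dense
print("l1 nonzeros:", int(np.sum(np.abs(x.value) > 1e-6)))   # typically sparse
```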

SLIDE 19

Regularized approximation

Bi-criterion formulation (basic form): find x that is small and also makes the residual Ax − b small

    min_x (w.r.t. R²₊) (‖Ax − b‖, ‖x‖)

where A ∈ R^{m×n}, and the two norms (on R^m and R^n) can be different

◮ a (convex) vector optimization problem with two objectives
◮ first norm: measures the size of the residual; second norm: measures the size of x
◮ Pareto optimal points form the optimal trade-off curve between the two objectives
◮ one endpoint: (‖b‖, 0), attained by x = 0; the other endpoint: (min_x ‖Ax − b‖, min_{x∈C} ‖x‖), where C = argmin_x ‖Ax − b‖

SLIDE 20

Regularized approximation

scalarization is a common method for a multi-criterion problem; two forms of scalarized problems:
◮ minimize the weighted sum of objectives:

    min_x ‖Ax − b‖ + γ‖x‖

  ◮ the solution for γ > 0 traces out the optimal trade-off curve
◮ minimize the weighted sum of squared objectives:

    min_x ‖Ax − b‖² + δ‖x‖²

  ◮ the solution for δ > 0 traces out the optimal trade-off curve

Tikhonov regularization (‖·‖_2):

    min_x ‖Ax − b‖_2² + δ‖x‖_2² = x^T(A^T A + δI)x − 2b^T Ax + b^T b

◮ a convex quadratic problem (as A^T A + δI ≻ 0 for any δ > 0)
◮ analytical solution x* = (A^T A + δI)^{−1} A^T b (sketch below)
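
A numpy sketch of the Tikhonov solution; sweeping δ traces out the trade-off between residual size and ‖x‖ (names illustrative):

```python
import numpy as np

def tikhonov(A, b, delta):
    """x* = (A^T A + delta*I)^{-1} A^T b, delta > 0."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + delta * np.eye(n), A.T @ b)

rng = np.random.default_rng(5)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
for delta in (1e-3, 1e-1, 1e1):
    x = tikhonov(A, b, delta)
    print(delta, np.linalg.norm(A @ x - b), np.linalg.norm(x))
```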

SLIDE 21

Interpretations

◮ estimation interpretation: linear measurement model y = Ax + v, with prior knowledge that x is small
◮ design interpretation: the design cost ‖x‖ should be small
◮ approximation interpretation: the linear model y = Ax is only valid for small x
◮ robust approximation interpretation: variation in A causes large variation in Ax for large x, so a small x is less sensitive to errors in A

SLIDE 22

Example

Optimal input design: consider a linear dynamical system

    y(t) = Σ_{τ=0}^{t} h(τ) u(t − τ),  t = 0, 1, · · · , N

◮ input sequence u(0), u(1), · · · , u(N) ∈ R
◮ output sequence y(0), y(1), · · · , y(N) ∈ R
◮ impulse response h(0), h(1), · · · , h(N) ∈ R

choose the input sequence u to achieve three goals:
◮ output tracking (desired target y_des): J_track = (1/(N+1)) Σ_{t=0}^{N} (y(t) − y_des(t))²
◮ small input: J_mag = (1/(N+1)) Σ_{t=0}^{N} u(t)²
◮ small input variations: J_der = (1/N) Σ_{t=0}^{N−1} (u(t+1) − u(t))²

regularized least-squares formulation: min_u J_track + δ J_der + η J_mag
◮ for fixed δ, η > 0, a least-squares problem in u
◮ vary δ, η > 0 to trade off the three objectives (see the sketch below)
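
A numpy/scipy sketch of this example; the convolution is written as a lower-triangular Toeplitz matrix H so that y = Hu, and the target y_des here is a placeholder of my own (the slide's actual target is not given):

```python
import numpy as np
from scipy.linalg import toeplitz

N = 200
t = np.arange(N + 1)
h = (1 / 9) * 0.9**t * (1 - 0.4 * np.cos(2 * t))  # impulse response from slide 23
H = toeplitz(h, np.zeros(N + 1))                  # y = H u (causal convolution)
y_des = np.sign(np.sin(0.05 * t))                 # placeholder target signal

D = np.diff(np.eye(N + 1), axis=0)                # first-difference matrix for J_der

def input_design(delta, eta):
    """Minimize J_track + delta*J_der + eta*J_mag: a regularized LS problem.
    Setting the gradient to zero gives the linear system solved below."""
    M = H.T @ H / (N + 1) + delta * D.T @ D / N + eta * np.eye(N + 1) / (N + 1)
    return np.linalg.solve(M, H.T @ y_des / (N + 1))

u = input_design(delta=0.3, eta=0.05)
print("tracking MSE:", np.mean((H @ u - y_des) ** 2))
```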

SLIDE 23

Example

optimal inputs (left) and resulting outputs (right) for three values of δ (input variation) and η (input magnitude)
◮ N = 200, h(t) = (1/9)(0.9)^t (1 − 0.4 cos(2t)), dashed line: y_des
◮ top (δ = 0, η = 0.005): tracking is good, but the input required is large and rapidly varying
◮ middle (δ = 0, η = 0.05): the input is smaller, at the cost of a larger tracking error
◮ bottom (δ = 0.3, η = 0.05): the input variation is substantially reduced, with not much increase in the output tracking error

SLIDE 24

Signal reconstruction, de-noising, and smoothing

given a corrupted signal x_cor, form an estimate x̂ of the original signal x, i.e., remove the noise from x_cor; this often amounts to a smoothing operation on x_cor

    min_{x̂} (w.r.t. R²₊) (‖x̂ − x_cor‖_2, φ(x̂))

◮ x ∈ R^n is the unknown signal
◮ x_cor = x + v is the (known) corrupted version of x, with additive noise v
◮ the variable x̂ (reconstructed signal) is the estimate of x
◮ φ : R^n → R is a regularization function or smoothing objective, measuring the roughness or lack of smoothness of the estimate x̂
  ◮ quadratic smoothing function: φ_quad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)²
  ◮ total variation smoothing function: φ_tv(x̂) = Σ_{i=1}^{n−1} |x̂_{i+1} − x̂_i|
◮ a convex bi-criterion problem: Pareto optimal points can be found by scalarization, solving a (scalar) convex problem (see the cvxpy sketch below)
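
A cvxpy sketch of scalarized smoothing on synthetic data; cp.tv implements exactly the total-variation sum above, the quadratic version is written with cp.diff, and mu is the scalarization weight (my notation):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = np.cumsum(rng.standard_normal(n)) / 10     # slowly varying "original" signal
x_cor = x + 0.2 * rng.standard_normal(n)       # corrupted version

xhat = cp.Variable(n)
mu = 5.0   # larger mu => smoother estimate

# quadratic smoothing: phi_quad = sum of squared first differences
cp.Problem(cp.Minimize(
    cp.sum_squares(xhat - x_cor) + mu * cp.sum_squares(cp.diff(xhat))
)).solve()
x_quad = xhat.value

# total variation reconstruction: preserves occasional sharp edges
cp.Problem(cp.Minimize(
    cp.sum_squares(xhat - x_cor) + mu * cp.tv(xhat)
)).solve()
x_tv = xhat.value
```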

SLIDE 25

Examples

quadratic smoothing: φ_quad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)² works well for a very smooth original signal and rapidly varying noise

(a) original signal x and corrupted signal x_cor; (b) three solutions on the optimal trade-off curve ‖x̂ − x_cor‖_2 vs. φ_quad(x̂)

◮ large ‖x̂ − x_cor‖_2 (top): too much smoothing
◮ ‖x̂ − x_cor‖_2 = 3 (middle): best reconstruction
◮ small ‖x̂ − x_cor‖_2 (bottom): too little smoothing

SLIDE 26

Examples

quadratic smoothing: φ_quad(x̂) = Σ_{i=1}^{n−1} (x̂_{i+1} − x̂_i)² is not suitable for an original signal with rapid variations

(c) original signal x and corrupted signal x_cor; (d) three solutions on the optimal trade-off curve ‖x̂ − x_cor‖_2 vs. φ_quad(x̂)

◮ first two: rapid variations in the signal are smoothed out along with the noise
◮ third: steep edges in the signal are preserved, but significant noise is left

SLIDE 27

Examples

total variation reconstruction: φ_tv(x̂) = Σ_{i=1}^{n−1} |x̂_{i+1} − x̂_i| removes most of the noise while preserving occasional rapid variations in the signal

(e) original signal x and corrupted signal x_cor; (f) three solutions on the optimal trade-off curve ‖x̂ − x_cor‖_2 vs. φ_tv(x̂)

◮ φ_tv(x̂) = 5 (top): eliminates some slow variations in the signal
◮ φ_tv(x̂) = 8 (middle): best reconstruction
◮ φ_tv(x̂) = 10 (bottom): does not remove enough noise

SLIDE 28

Robust approximation

two approaches for minimizing ‖Ax − b‖ with uncertain A:
◮ stochastic robust approximation: assume A is random, and consider

    min_x E ‖Ax − b‖

◮ worst-case robust approximation: assume A lies in a (nonempty and bounded) set A of possible values, and consider

    min_x sup_{A∈A} ‖Ax − b‖

both are always convex problems, but tractable only in special cases: certain norms ‖·‖, distributions, and sets A

SLIDE 29

Stochastic robust least-squares problem

A = Ā + U, where U is random with E U = 0 and E U^T U = P

    min_x E ‖Ax − b‖_2²

◮ explicit expression for the objective:

    E ‖Ax − b‖_2² = E (Āx − b + Ux)^T (Āx − b + Ux) = ‖Āx − b‖_2² + E x^T U^T U x = ‖Āx − b‖_2² + x^T P x

◮ equivalent problem (a regularized least-squares problem):

    min_x ‖Āx − b‖_2² + ‖P^{1/2} x‖_2²

◮ when A is subject to variation, balance making ‖Āx − b‖ small against the desire for a small x (to keep the variation in Ax small)
◮ optimal solution: x* = (Ā^T Ā + P)^{−1} Ā^T b (sketch below)
◮ when P = δI (U_ij: zero-mean, uncorrelated random variables with variance δ/m), we recover the Tikhonov regularization problem

    min_x ‖Āx − b‖_2² + δ‖x‖_2²
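
A numpy sketch checking the closed form against a Monte Carlo estimate of the stochastic objective, under the slide's assumption P = δI (i.e., i.i.d. entries of U with variance δ/m):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, delta = 50, 8, 0.5
A_bar = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# closed-form solution x* = (A^T A + P)^{-1} A^T b with P = delta*I
x_star = np.linalg.solve(A_bar.T @ A_bar + delta * np.eye(n), A_bar.T @ b)

# Monte Carlo: E||(A_bar + U)x - b||^2 should match ||A_bar x - b||^2 + delta*||x||^2
mc = np.mean([
    np.sum(((A_bar + np.sqrt(delta / m) * rng.standard_normal((m, n))) @ x_star - b) ** 2)
    for _ in range(2000)
])
print(mc, np.sum((A_bar @ x_star - b) ** 2) + delta * np.sum(x_star ** 2))
```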

SLIDE 30

Worst-case robust least-squares problem

A ∈ A = {Ā + u_1 A_1 + · · · + u_p A_p | ‖u‖_2 ≤ 1}, where ‖·‖_2 is the Euclidean norm on R^p, and the p + 1 matrices Ā, A_1, · · · , A_p ∈ R^{m×n} are given

    min_x sup_{A∈A} ‖Ax − b‖_2² = min_x sup_{‖u‖_2≤1} ‖P(x)u + q(x)‖_2²

where P(x) = [A_1x  A_2x  · · ·  A_px] ∈ R^{m×p} and q(x) = Āx − b ∈ R^m

◮ strong duality holds (for given x) between

    max_u ‖P(x)u + q(x)‖_2²  s.t. ‖u‖_2² ≤ 1

and

    min_{t,λ} t + λ  s.t. [ I       P(x)  q(x) ]
                          [ P(x)^T  λI    0    ]  ⪰ 0
                          [ q(x)^T  0     t    ]

◮ equivalent problem (SDP):

    min_{x,t,λ} t + λ  s.t. [ I       P(x)  q(x) ]
                            [ P(x)^T  λI    0    ]  ⪰ 0
                            [ q(x)^T  0     t    ]

(a cvxpy sketch follows)
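
A sketch of this SDP in cvxpy, assuming an SDP-capable solver such as the bundled SCS; the block matrix is assembled with cp.bmat, and the data is synthetic:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(8)
m, n, p = 10, 5, 2
A_bar = rng.standard_normal((m, n))
As = [0.1 * rng.standard_normal((m, n)) for _ in range(p)]
b = rng.standard_normal(m)

x = cp.Variable(n)
t, lam = cp.Variable(), cp.Variable()

Px = cp.hstack([cp.reshape(Ai @ x, (m, 1)) for Ai in As])  # P(x), m x p
q = cp.reshape(A_bar @ x - b, (m, 1))                      # q(x), m x 1

# the LMI from the slide, as one symmetric block matrix
M = cp.bmat([
    [np.eye(m), Px,               q],
    [Px.T,      lam * np.eye(p),  np.zeros((p, 1))],
    [q.T,       np.zeros((1, p)), cp.reshape(t, (1, 1))],
])
cp.Problem(cp.Minimize(t + lam), [M >> 0]).solve()
print("worst-case robust x:", x.value)
```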

SLIDE 31

Comparison example

A(u) = A_0 + uA_1 with u ∈ [−1, 1]
◮ nominal optimal: x_nom = argmin_x ‖A_0x − b‖_2²
◮ stochastic robust optimal: x_stoch = argmin_x E ‖A(u)x − b‖_2², with u uniformly distributed on [−1, 1]
◮ worst-case robust optimal: x_wc = argmin_x sup_{u∈[−1,1]} ‖A(u)x − b‖_2²

residual r(u) = ‖A(u)x − b‖_2 versus u for x = x_nom, x_stoch, x_wc:
◮ x_nom: smallest residual at u = 0, but quite sensitive to parameter variation
◮ x_wc: a larger residual at u = 0, but not sensitive to parameter variation
◮ x_stoch: in between
(a sketch computing the three solutions follows)
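
A cvxpy sketch of the three solutions on synthetic data. Two facts carry the reformulations: for u uniform on [−1, 1], E[u] = 0 and E[u²] = 1/3, which gives the stochastic objective in closed form; and the objective is convex quadratic in u, so the sup over [−1, 1] is attained at u = ±1, giving the max-of-two form:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(9)
A0 = rng.standard_normal((20, 10))
A1 = 0.5 * rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def solve(objective_fn):
    x = cp.Variable(10)
    cp.Problem(cp.Minimize(objective_fn(x))).solve()
    return x.value

# nominal: ||A0 x - b||^2
x_nom = solve(lambda x: cp.sum_squares(A0 @ x - b))

# stochastic: E||A(u)x - b||^2 = ||A0 x - b||^2 + (1/3)||A1 x||^2
x_stoch = solve(lambda x: cp.sum_squares(A0 @ x - b) + cp.sum_squares(A1 @ x) / 3)

# worst case: sup over u in [-1, 1] attained at the endpoints u = +/-1
x_wc = solve(lambda x: cp.maximum(cp.sum_squares((A0 + A1) @ x - b),
                                  cp.sum_squares((A0 - A1) @ x - b)))
```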