Information Theory in an Industrial Research Lab – Marcelo J. Weinberger – PowerPoint PPT Presentation



SLIDE 1

Information Theory in an Industrial Research Lab

Marcelo J. Weinberger

Information Theory Research Group, Hewlett-Packard Laboratories – Advanced Studies, Palo Alto, California, USA
with contributions from the ITR group
Purdue University – November 19, 2007

SLIDE 2

It's all about models, bounds, and algorithms

Information Theory (Shannon, 1948): the mathematical theory of

  • measures of information
  • the fundamentals of data representation (codes), for compactness and for secure, reliable communication/storage over a possibly noisy channel

A formal framework for areas of engineering and science in which the notion of "information" is relevant.

Components:

  • data models
  • fundamental bounds
  • codes and efficient encoding/decoding algorithms

Engineering problems addressed:

  • data compression
  • error control coding
  • cryptography

These are enabling technologies with many practical applications in computing, imaging, storage, multimedia, communications...

SLIDE 3

Information Theory research in industry

Mission: research the mathematical foundations and practical applications of information theory, generating intellectual property and technology for "XXX Company" through the advancement of scientific knowledge in these areas.

Applying the theory and working on its applications makes obvious sense for "XXX Company" research labs; but why invest in advancing the theory?

  • some simple answers apply to any basic research area: long-term investment, prestige, visibility, giving back to society...
  • this talk is about a different type of answer: differentiating technology vs. enabling technology

Main claim: working on the theory helps develop the analytical tools needed to envision innovative, technology-differentiating ideas.

SLIDE 4

Case studies

  • JPEG-LS: from universal context modeling to a lossless image compression standard
  • DUDE (Discrete Universal DEnoiser): from a formal setting for universal denoising to actual image denoising algorithms
  • Error-correcting codes in nanotechnology: the advantages of interdisciplinary research
  • 2-D information theory: looking into the future of storage devices

[Diagrams: input 010010... → compress → store, transmit → decompress → output 010010...; demux addressing schematic]

SLIDE 5

Work paradigm

Start from a (fairly abstract) practical problem, and work both on the theory and on practical solutions, e.g.:

  • image compression
  • 2D channel coding
  • ECC for nano
  • denoising

Motivation:

  • scientific interest
  • vision of benefit to XXX

Exchanges with academia and the scientific community: ideas, papers, participation, talent, visibility. Exchanges with industry and standards bodies: patents, technology, consulting, standards, visibility, new challenges.

SLIDE 6

Universal Modeling and Coding

Traditional Shannon theory assumes that a (probabilistic) model of the data is available, and aims at compressing the data optimally w.r.t. the model.

Kraft's inequality: every uniquely decipherable (UD) code with length function L(s), where s is a string of length n over a finite alphabet A, satisfies

  ∑_{s ∈ A^n} 2^{−L(s)} ≤ 1

⇒ a code defines a probability distribution P(s) = 2^{−L(s)} over A^n.

Conversely, given a distribution P(·) (a model), there exists a UD code that assigns ⎡−log P(s)⎤ bits to s (Shannon code). Hence P(·) serves as a model to encode s, and every code has an associated model.

A model is a probabilistic tool to "understand" and predict the behavior of the data.
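As a small illustration (a hypothetical code of my own, not from the talk), the sketch below checks Kraft's inequality for a prefix code and derives Shannon code lengths ⎡−log P(s)⎤ from a distribution:

```python
import math

# A prefix code (prefix codes are uniquely decipherable).
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Kraft's inequality: the sum of 2^{-L(s)} over codewords is at most 1.
kraft_sum = sum(2 ** -len(w) for w in code.values())
assert kraft_sum <= 1  # here it is exactly 1: the code is "complete"

# Conversely, a model P(.) yields Shannon code lengths ceil(-log2 P(s)).
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
shannon_len = {s: math.ceil(-math.log2(p)) for s, p in P.items()}
# These lengths satisfy Kraft's inequality, so a UD code with them exists.
assert sum(2 ** -l for l in shannon_len.values()) <= 1
```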

SLIDE 7

Universal Modeling and Coding (cont.)

Given a model P(·) on n-tuples, arithmetic coding provides an effective means to sequentially assign a code word of length close to −log P(s) to s.

If s = x1 x2 … xn, the "ideal code length" for symbol x_t is −log p(x_t | x1 x2 … x_{t−1}); the model can vary arbitrarily and "adapt" to the data.

CODING SYSTEM = MODEL + CODING UNIT

Two separate problems: design a model, and use it to encode. We will view data compression as a problem of assigning probabilities to data.
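A minimal sketch (my own illustration, using an assumed add-one/Laplace estimator as the adaptive model) of sequential probability assignment and the resulting ideal code length:

```python
import math

def ideal_code_length(s, alphabet_size=2):
    """Sum of -log2 p(x_t | x_1..x_{t-1}) under an adaptive add-one model."""
    counts = [1] * alphabet_size          # Laplace: start every count at 1
    total_bits = 0.0
    for x in s:
        p = counts[x] / sum(counts)       # sequential probability of x_t
        total_bits += -math.log2(p)       # ideal code length contribution
        counts[x] += 1                    # the model "adapts" to the data
    return total_bits

# A skewed binary string of length 32 costs far fewer than 32 bits:
bits = ideal_code_length([0] * 30 + [1] * 2)
```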

SLIDE 8

Coding with Model Classes

Universal data compression deals with the optimal description of data in the absence of a given model; in most practical applications, the model is not given to us.

How do we make the concept of "optimality" meaningful? There is always a code that assigns just 1 bit to the given data!

The answer: model classes. We want a "universal" code to perform as well as the best model in a given class 𝒞 for any string s, where the best competing model changes from string to string. Universality makes sense only w.r.t. a model class.

A code with length function L(x^n) is pointwise universal w.r.t. a class 𝒞 if, as n → ∞,

  R(L, x^n) = (1/n) [ L(x^n) − min_{C ∈ 𝒞} L_C(x^n) ] → 0

where L_C(x^n) denotes the code length with model C.

SLIDE 9

How to Choose a Model Class?

Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose a class! Some possible criteria:

  • complexity: existence of efficient algorithms
  • prior knowledge on the data

We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0. In this sense, prior knowledge is of paramount importance: don't learn what you already know!

Ultimately, the choice of model class is an art.

SLIDE 10

Parametric Model Classes

A useful limitation of the model class is to assume a parameter space of dimension d:

  𝒞 = { P_θ , θ ∈ Θ^d }

Examples:

  • Bernoulli: d = 1; general i.i.d. model: d = α − 1 (α = |A|)
  • FSM model with k states: d = k(α − 1)
  • memoryless geometric distribution on the integers i ≥ 0: P(i) = θ^i (1 − θ), d = 1

A straightforward method: the two-part code [Rissanen '84]: first describe the best parameter θ̂ (the "model cost", which grows with d), then encode the data with ⎡−log p(x^n | θ̂)⎤ bits (the probability of the data under θ̂ also grows with d).

Trade-off: the dimension of the parameter space plays a fundamental role in modeling problems.
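A toy numerical sketch (my own, assuming the standard (d/2) log n model cost for describing the parameter to 1/√n precision) of the two-part code length for the Bernoulli class (d = 1):

```python
import math

def two_part_code_length(bits):
    """Two-part code for the Bernoulli class: model cost + data cost."""
    n = len(bits)
    k = sum(bits)
    # Part 1: describe the ML parameter theta = k/n to 1/sqrt(n) precision,
    # i.e. (d/2) log2 n bits with d = 1 (the "model cost").
    model_cost = 0.5 * math.log2(n)
    # Part 2: code length of the data under the best parameter.
    theta = k / n
    data_cost = 0.0
    if 0 < theta < 1:
        data_cost = -k * math.log2(theta) - (n - k) * math.log2(1 - theta)
    return model_cost + data_cost

# 1000 coin flips with 200 ones: data cost is n*H(0.2), roughly 722 bits,
# plus about 5 bits of model cost.
length = two_part_code_length([1] * 200 + [0] * 800)
```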

SLIDE 11

Fundamental Lower Bound

A criterion for measuring the optimality of a universal model is provided by Rissanen's lower bound [Rissanen '84]: for every P(·), any ε > 0, and sufficiently large n,

  (1/n) E_θ[ −log P(x^n) ] ≥ H(θ) + (1 − ε) (d log n) / (2n)

for all parameter values θ except for a set whose volume → 0 as n → ∞, provided a "good" estimator of θ exists.

Conclusion: the number of parameters affects the achievable convergence rate of a universal code length to the entropy.

SLIDE 12

Contexts and Tree Models

A more efficient parametrization of a Markov process [Weinberger/Lempel/Ziv '92, Weinberger/Rissanen/Feder '95]:

Any suffix of a sequence x^t is called a context in which the next symbol x_{t+1} occurs. For a finite-memory source P, the conditioning states s(x^t) are contexts that satisfy

  P(a | x^t) = P(a | s(x^t)),  ∀ a ∈ A, x^t ∈ A*

Number of parameters: α − 1 per leaf of the tree.

There exist efficient universal schemes in the class of tree models of any size [Weinberger/Rissanen/Feder '95, Willems/Shtarkov/Tjalkens '95, Martín/Seroussi/Weinberger '04].

[Figure: a binary sequence … 0 0 1 0 with the context of the next input bit highlighted]

SLIDE 13

Lossless Image Compression (the real thing…)

Some applications of lossless image compression:

  • images meant for further analysis and processing (as opposed to just human perception)
  • images where loss might have legal implications
  • images obtained at great cost
  • applications with intensive editing and repeated compression/decompression cycles
  • applications where the desired quality of the rendered image is unknown at acquisition time

International standard: JPEG-LS (1998)

SLIDE 14

Universality vs. Prior Knowledge

Applying universal algorithms for tree models directly to real images yields poor results:

  • some structural symmetries typical of images are not captured by the model
  • a universal model has an associated "learning cost": why learn something we already know?

Modeling approach: limit the model class by use of "prior knowledge". For example:

  • images tend to be a combination of smooth regions and edges
  • predictive coding was successfully used for years: it encodes the difference between a pixel and a predicted value of it
  • prediction errors tend to follow a Laplacian distribution ⇒ AR model + Laplacian, where both the center and the decay are context dependent
  • Prediction = fixed prediction + adaptive correction

SLIDE 15

Models for Images

In practice, contexts are formed out of a finite subset of the past sequence (a causal template of previous samples a, b, c, d around the current sample x).

Conditional probability model for prediction errors: the two-sided geometric distribution (TSGD), a "discrete Laplacian":

  P(e) = c θ^{|e + s|},  θ ∈ (0,1),  s ∈ [0,1)

The shift s is constrained to [0,1) by an integer-valued adaptive correction (bias cancellation) on the fixed predictor.

[Figure: TSGD P(e), a two-sided geometric decay centered at −s]
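A quick numeric sketch of the TSGD (my own illustration): summing the two geometric tails gives the normalization constant c = (1 − θ)/(θ^s + θ^{1−s}), and the resulting pmf sums to 1 over the integers:

```python
def tsgd(e, theta=0.7, s=0.3):
    """Two-sided geometric distribution P(e) = c * theta^|e+s|."""
    c = (1 - theta) / (theta ** s + theta ** (1 - s))  # normalization
    return c * theta ** abs(e + s)

# Numerically, the pmf sums to 1 over a wide range of integers, and the
# shift s skews it: the mode is at e = 0, with P(-1) > P(1).
total = sum(tsgd(e) for e in range(-200, 201))
```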

SLIDE 16

Complexity Constraints

Are sophisticated models worth the price in complexity?

  • Algorithm Context and CTW are linear-time algorithms for tree sources of limited depth, but quite expensive in practice
  • even arithmetic coding is not something that practitioners will easily buy in many applications!

Is high complexity required to approach the best possible compression? The idea in JPEG-LS: apply judicious modeling to reduce complexity, rather than to improve compression. The modeling/coding separation paradigm is less neat without complex models or arithmetic coding.

Optimal prefix codes for TSGDs: [Merhav/Seroussi/Weinberger '00]

SLIDE 17

The LOCO-I algorithm

JPEG-LS is based on the LOCO-I algorithm: LOw COmplexity LOssless COmpression of Images. Basic components:

  • fixed + adaptive prediction
  • conditioning contexts based on quantized gradients
  • two-parameter conditional probability model (TSGD)
  • low-complexity adaptive coding matched to the model (variants of Golomb codes)
  • run-length coding in flat areas, to address the drawback of symbol-by-symbol coding
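To make the low-complexity coding unit concrete, here is a sketch (my own, not the standard's exact bitstream) of a Golomb code with a power-of-2 parameter (a Rice code), the family that JPEG-LS draws its codes from, together with the usual interleaving of signed prediction errors into nonnegative integers:

```python
def map_error(e):
    """Interleave signed errors 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return 2 * e if e >= 0 else -2 * e - 1

def rice_encode(n, k):
    """Golomb code with m = 2^k: quotient in unary, remainder in k bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

def rice_decode(bits, k):
    q = bits.index("0")                  # unary quotient ends at first 0
    r = int(bits[q + 1 : q + 1 + k] or "0", 2)
    return (q << k) | r

# Round-trip check for a few mapped errors with k = 2:
for e in range(-5, 6):
    n = map_error(e)
    assert rice_decode(rice_encode(n, 2), 2) == n
```

Small errors get short codewords, and the single parameter k can be adapted per context at very low cost.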

SLIDE 18

Discrete Universal DEnoising (DUDE)

Setting: a discrete source emits x1, x2, …, xn; a discrete memoryless channel (noise) delivers z1, z2, …, zn; the denoiser outputs x̂1, x̂2, …, x̂n.

The goal: upon observing z1, …, zn, choose x̂1, …, x̂n to optimize some fidelity criterion (e.g.: minimize the number of symbol errors, squared distance, etc.).

A natural extension of work on prediction/compression. Applications: image and video denoising, text correction, financial data denoising, DNA sequence analysis, wireless communications…

SLIDE 19

The DUDE algorithm

A denoising scheme that is:

  • universal (no prior knowledge or assumptions on X)
  • asymptotically optimal (provably approaches the performance of the best scheme that has full knowledge of the statistics of X)
  • practical (low complexity)

Theoretical impact: universal denoising is possible and can be accomplished in linear time (2006 Communications Society/Information Theory Society Best Paper Award).

Practical impact: over 20 patents filed/granted.

Contributors: Ordentlich, Seroussi, Weinberger (ITR – HP Labs), Weissman (Stanford/ITR), Verdú (Princeton).

SLIDE 20

DUDE: how it's done

Pass 1:

  • gather statistics on symbol occurrences per context pattern
  • estimate the noiseless symbol distribution given the context pattern and the noisy sample (the posterior distribution)

Pass 2: denoise each symbol based on the estimated posterior.

  • who do you believe: what you see, or what the global statistics tell you?
  • the precise decision formula is proven asymptotically optimal
  • the context template size must be carefully chosen

[Figure: data sequence x1, …, xn passes through the noisy channel to give z1, …, zn; the sample z_i is denoised using its surrounding "context" samples]
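A minimal 1D sketch of the two-pass rule (my own condensed implementation of the published decision formula, assuming a binary symmetric channel with known crossover probability `delta`, Hamming loss, and one-sided context length `k`):

```python
from collections import defaultdict

def dude_binary(z, delta, k=1):
    """Two-pass DUDE for a BSC(delta), Hamming loss, context radius k."""
    # Channel matrix and its inverse (det = 1 - 2*delta).
    Pi = [[1 - delta, delta], [delta, 1 - delta]]
    inv = 1.0 / (1.0 - 2.0 * delta)
    Pinv = [[(1 - delta) * inv, -delta * inv],
            [-delta * inv, (1 - delta) * inv]]

    def context(i):
        return (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))

    # Pass 1: count noisy symbol occurrences per context pattern.
    counts = defaultdict(lambda: [0, 0])
    for i in range(k, len(z) - k):
        counts[context(i)][z[i]] += 1

    # Pass 2: pick the reconstruction x_hat minimizing the estimated loss
    #   m(ctx)^T Pinv (loss_column(x_hat) * channel_column(z_i)).
    out = list(z)                        # boundary samples are left as-is
    for i in range(k, len(z) - k):
        m = counts[context(i)]
        costs = []
        for x_hat in (0, 1):
            v = [(x != x_hat) * Pi[x][z[i]] for x in (0, 1)]
            w0 = Pinv[0][0] * v[0] + Pinv[0][1] * v[1]
            w1 = Pinv[1][0] * v[0] + Pinv[1][1] * v[1]
            costs.append(m[0] * w0 + m[1] * w1)
        out[i] = 0 if costs[0] <= costs[1] else 1
    return out

# Isolated 1's in a sea of 0's are flipped back by the global statistics:
z = [0] * 50
z[10], z[25] = 1, 1
assert dude_binary(z, delta=0.2) == [0] * 50
```

Both passes are a single scan with a hash of count vectors, which is where the linear-time claim comes from.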

SLIDE 21

Application 1: Image denoising

Noisy image ("salt and pepper" noise): error rate = 30%, PSNR = 10.7 dB. DUDE-denoised: PSNR = 38.3 dB.

Best previous result in the literature: PSNR = 35.6 dB at error rate = 30% (Chan, Ho & Nikolova, IEEE IP, Oct. '05).

SLIDE 22

The main challenge in image denoising

Key component of DUDE: model the conditional distribution P(Z_i | context of Z_i) and infer P(X_i | Z_i and context of Z_i) from it.

Main issue: a large alphabet means a large number of model parameters, hence a high learning cost.

We leveraged the "semi-universal" approach from image compression: rely on prior knowledge (except that here the data is noisy and the models are non-causal). Main tools:

  • prediction
  • contexts based on quantized data
  • parameterized distributions

Again, the "holy grail" is to incorporate a "safe amount" of prior knowledge to reduce the richness of the class.

Results: state of the art for "salt-and-pepper" noise removal; competitive for Gaussian and "real world" noise removal, but still room for improvement.

SLIDE 23

Application 2: Denoiser-enhanced ECC

Suitable for wireless communications. Leaves the overall system "as is", but enhances the receiver by denoising the signal prior to error-correction (ECC) decoding. This allows the design of a "better receiver" that recovers signals other receivers would reject as undecodable.

  • denoising handles source redundancy (natural)
  • ECC decoding handles code redundancy (structured)

[Figure: a received noisy codeword outside the decodable region of the regular ECC is pulled back into it by denoising; the non-enhanced receiver gets no reception, while the DUDE-enhanced receiver decodes]

SLIDE 24

DUDE-Enhanced Decoding

[Block diagram: transmitter side, an unknown data source emits x1 x2 ... xk, which the channel encoder maps to p1 p2 ... pm; the discrete memoryless channel delivers r1 r2 ... rm to the receiver; at the receiver, the data z1 z2 ... zk pass through the universal denoiser before the channel decoder, which outputs x̂1 x̂2 ... x̂k]

SLIDE 25

What is "2D Information Theory"?

Analysis and design of communication systems involving 2D signals and channels.

[Diagram: a message W is encoded into an n×n array {x_{i,j}}, sent over a 2D channel that outputs the array {y_{i,j}}, and decoded into W']

Emphasis on nontrivial 2D channels that do not decompose into 1D "tracks". Examples:

  • inter-symbol interference (ISI): y_{i,j} = x_{i,j} + x_{i−1,j} + x_{i,j−1} + x_{i−1,j−1} + n_{i,j}
  • constrained channel: x_{i,j} ∈ {0,1}, no two adjacent 1's in any row or column (inherently 2D)

SLIDE 26

Motivation

Existing storage media (while physically 2D) are "converted" to 1D tracks separated by buffer space. Example: DVD+RW [from DVD+RW alliance white paper].

Next-generation systems will seek to make use of the buffer space for storage:

  • they will have to deal with 2D interference and constraints
  • channels will be inherently 2D

SLIDE 27

Our research activity

  • 2D ISI channels: computational complexity of "optimal" detection [Ordentlich and Roth, 2006]
  • 2D constrained channels: new coding techniques and lower bounds on capacity [Ordentlich and Roth, 2000 and 2007]

SLIDE 28

2D constrained channels

The 2D channel only permits arrays {x_{i,j}} satisfying certain constraints. Examples (binary arrays):

  • DC-free: # of 0's = # of 1's in every row and column
  • run-length limited (RLL): at least d and no more than k 0's between any two 1's in any row and column
  • no isolated bits (NIB): a bit that differs from all four of its neighbors (a 0 surrounded by 1's, or a 1 surrounded by 0's) is not allowed anywhere in the array

It is not yet clear which constraints will be most relevant in practice; research focuses on developing general tools and analysis.

SLIDE 29

Research problems and contributions

Determining the capacity C: the asymptotic number of bits per channel symbol (rate) that can be encoded into the constraint,

  C = lim_{n→∞} (1/n²) log₂ (# of legal n×n arrays)

Encoding/decoding algorithms: low complexity and high rates; they imply lower bounds on capacity.

A new coding approach: 2D approximate enumerative coding

  • fixed rate
  • achieves the lowest redundancy for the 2D DC-free constraint
  • achieves the highest provable rate for at least one 2D RLL constraint (likely others), and has the highest rate among (tractable) fixed-rate encoders in other cases
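As a toy illustration (my own, not the enumerative coding of the talk), the number of legal arrays for the "no two adjacent 1's" constraint can be counted exactly for small sizes by dynamic programming over rows, giving numerical estimates of C:

```python
import math

def count_legal(rows, cols):
    """Count binary rows x cols arrays with no two adjacent 1's
    (horizontally or vertically): the 'hard square' constraint."""
    # Row patterns with no two horizontally adjacent 1's.
    masks = [m for m in range(1 << cols) if m & (m << 1) == 0]
    counts = {m: 1 for m in masks}        # one-row arrays
    for _ in range(rows - 1):
        nxt = {}
        for m in masks:                   # extend by one compatible row
            nxt[m] = sum(c for p, c in counts.items() if p & m == 0)
        counts = nxt
    return sum(counts.values())

# The per-symbol rate (1/n^2) log2(count) for a small array is already
# near the known capacity of this constraint (about 0.5879 bits/symbol).
n = 8
rate = math.log2(count_legal(n, n)) / n ** 2
```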

SLIDE 30

Error-correcting codes in nanotechnology

Manufacturing perfect electronic circuits is expensive: as feature size decreases, the cost of perfection may become prohibitive well before quantum phenomena dominate feature properties. In the future, circuits will need to function perfectly even with a significant percentage of imperfect components.

In nano-electronics, a major roadblock is the interconnect between the outside world and nano-scale resources. We will discuss the application of error-correcting codes to the design of defect-tolerant micro/nano demultiplexers (demuxes).

SLIDE 31

Hybrid circuits

[Diagram: a micro-scale encoder block drives a demux that addresses a nano crossbar memory; conventional wires are mixed with nanowires]

SLIDE 32

Two examples

Collaboration between the Information Theory Research group (Seroussi, Roth, Vontobel) and Quantum Science Research [Scientific American, Nov. '05]:

  • diode-based demuxes: a peculiar error probability formula; a notion of coding gain for manufacturing cost
  • resistor-based demuxes: a new combinatorial constraint on codes; construction of optimal constant-weight codes for the constraint

SLIDE 33

Addressing nano-wires with diode logic

Conventional wires carry the address; a unique nano-wire is set to low (selected).

[Diagram: address lines S driving nano-wires labeled 00, 01, 10, 11; the addressed wire is pulled low, the rest stay high]

SLIDE 34

Effect of an open-connection defect

With an open connection, a given address may select more than one nanowire. Example: nanowires 11 and 01 are both selected with address 01.

[Diagram: demux with an open connection; address 01 pulls both nanowires 01 and 11 low]

SLIDE 35

Adding redundant address lines: error-correcting codes

A redundant address line is added, so addresses are over-specified:

  00 → 000  01 → 011  10 → 101  11 → 110

Overall parity check: minimum distance d ≥ 2.

[Diagram: demux with three address lines and nanowires addressed 000, 011, 101, 110]
SLIDE 36

Over-specified addressing withstands an open connection

Wire 110 is not selected, since the extra line "pulls it up" ⇒ 011 is uniquely selected.

In general, we use an (n, M, d) code C, where M is the number of nanowires and n the number of encoded address lines. We use an encoder for C, but no decoder is needed. Given M, circuit area is linear in n. The requirements are similar to usual ECC: maximize d with minimal n. With current processes, parameters are fairly small, e.g. M ≈ 100–1000, n ≈ 10–20.

SLIDE 37

Defect model and failure modes

Assume a diode is not successfully formed with probability p (an open connection; the process can be biased so that this is the dominant defect mode). Defects are statistically independent.

Two nanowire failure modes:

  • a nanowire is destructive if it has enough open connections that it is selected when another nanowire is addressed
  • a nanowire is a victim if addressing it causes a destructive nanowire to be selected; both nanowires are disabled

SLIDE 38

Addressable memory per unit of area

[Plot: normalized addressable memory per unit area (0–1.0) vs. fraction of open connections (0–0.40), for uncoded, [8,7,2], [11,7,3] and [12,7,4] systems; area cost increases with code length]

Normalized addressable memory for 128×128 cross-point nanowire blocks: the [12,7,4]-coded system achieves density 0.40 at a defect rate of 22.5%, while the same density in an uncoded system requires a 3% defect rate. The area cost of the supplemental address lines is taken into account: the coding gain is real $$$!