Information Theory in an Industrial Research Lab
Marcelo J. Weinberger
Information Theory Research Group, Hewlett-Packard Laboratories Advanced Studies, Palo Alto, California, USA
with contributions from the ITR group
Purdue University
Information Theory (Shannon, 1948)
it's all about models, bounds, and algorithms
The mathematical theory:
- measures of information
- fundamentals of data representation (codes) for:
  - compactness
  - secure, reliable communication/storage over a possibly noisy channel
A formal framework for areas of engineering and science for which the notion of "information" is relevant
Components:
- data models
- fundamental bounds
- codes, efficient encoding/decoding algorithms
Engineering problems addressed:
- data compression
- error control coding
- cryptography
enabling technologies with many practical applications in: computing, imaging, storage, multimedia, communications...
Information Theory research in the industry
Mission
Research the mathematical foundations and practical applications of information theory, generating intellectual property and technology for "XXX Company" through the advancement of scientific knowledge in these areas
Applying the theory and working on the applications makes obvious sense for "XXX Company" research labs; but why invest in advancing the theory?
some simple answers, which apply to any basic research area: long-term investment, prestige, visibility, giving back to society...
this talk will be about a different type of answer: differentiating technology vs. enabling technology
Main claim:
working on the theory helps develop the analytical tools that are needed to envision innovative, technology-differentiating ideas
Case studies
- JPEG-LS: from universal context modeling to a lossless image compression standard
- DUDE (Discrete Universal DEnoiser): from a formal setting for universal denoising to actual image denoising algorithms
- Error-correcting codes in nanotechnology: the advantages of interdisciplinary research
- 2-D information theory: looking into the future of storage devices
[Diagram: Input (010010...) → compress → store, transmit → decompress → Output (010010...)]
Work paradigm
[Diagram: start by identifying a (fairly abstract) practical problem — e.g., image compression, 2D channel coding, ECC for nano, denoising — motivated by scientific interest and a vision of benefit to XXX; work on the theory and on practical solutions in parallel; exchange with academia and the scientific community (ideas, papers, participation, visibility, talent) and with industry (patents, technology, consulting, standards, visibility, new challenges)]
Universal Modeling and Coding
Traditional Shannon theory assumes that a (probabilistic) model of the data is available, and aims at compressing the data optimally w.r.t. the model
Kraft's inequality: every uniquely decipherable (UD) code with length function L(s), over strings s of length n over a finite alphabet A, satisfies
  ∑_{s ∈ A^n} 2^{−L(s)} ≤ 1
⇒ a code defines a probability distribution P(s) = 2^{−L(s)} over A^n
Conversely, given a distribution P(·) (a model), there exists a UD code that assigns ⌈−log P(s)⌉ bits to s (Shannon code)
Hence, P(·) serves as a model to encode s, and every code has an associated model
a model is a probabilistic tool to "understand" and predict the behavior of the data
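To make both directions of the code/model correspondence concrete, here is a minimal Python sketch (illustrative only; the example model is made up):

```python
import math

# Kraft's inequality: a uniquely decipherable code with lengths L(s)
# must satisfy sum_s 2^(-L(s)) <= 1.
def kraft_sum(lengths):
    return sum(2.0 ** -L for L in lengths)

# Shannon code: given a model P, assign ceil(-log2 P(s)) bits to s.
def shannon_lengths(probs):
    return [math.ceil(-math.log2(p)) for p in probs]

P = [0.5, 0.25, 0.125, 0.125]   # a model over four strings
L = shannon_lengths(P)          # -> [1, 2, 3, 3]
assert kraft_sum(L) <= 1.0      # the lengths are realizable
Q = [2.0 ** -l for l in L]      # the model associated with the code
print(L, Q)
```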
Universal Modeling and Coding (cont.)
Given a model P(·) on n-tuples, arithmetic coding provides an effective means to sequentially assign a code word of length close to −log P(s) to s
if s = x1 x2 … xn, the "ideal code length" for symbol xt is −log p(xt | x1 x2 … xt−1); the model can vary arbitrarily and "adapt" to the data
CODING SYSTEM = MODEL + CODING UNIT
two separate problems: design a model and use it to encode
We will view data compression as a problem of assigning probabilities to data
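A minimal sketch of this sequential view (illustrative; the KT estimator below is a standard adaptive model, chosen here for concreteness, not specified in the slides): the model updates as it scans the string, and the ideal code length accumulates −log₂ of each conditional probability.

```python
import math

# Sequential probability assignment over a binary alphabet: the total
# ideal code length sum_t -log2 p(x_t | x_1 ... x_{t-1}) is what
# arithmetic coding approaches, up to a small constant.
def ideal_code_length(bits):
    counts = [0.5, 0.5]               # KT (Krichevsky-Trofimov) estimator
    total = 0.0
    for x in bits:
        p = counts[x] / (counts[0] + counts[1])
        total += -math.log2(p)        # ideal code length for symbol x_t
        counts[x] += 1                # the model adapts to the data
    return total

s = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(f"{ideal_code_length(s):.2f} bits for {len(s)} symbols")
```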
Coding with Model Classes
Universal data compression deals with the optimal description of data in the absence of a given model
in most practical applications, the model is not given to us
How do we make the concept of “optimality” meaningful?
there is always a code that assigns just 1 bit to the given data!
The answer: model classes
We want a "universal" code to perform as well as the best model in a given class 𝒞 for any string s, where the best competing model changes from string to string
universality makes sense only w.r.t. a model class
A code with length function L(x^n) is pointwise universal w.r.t. a class 𝒞 if, as n → ∞,
  R(L, x^n) = (1/n) [ L(x^n) − min_{C ∈ 𝒞} L_C(x^n) ] → 0
where L_C denotes the code length with model C
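A hedged illustration of pointwise redundancy against the i.i.d. (Bernoulli) class, assuming a KT-based sequential code as the universal code (an assumption; any universal code for the class would do): the best competing model is the Bernoulli parameter at the empirical frequency, whose code length is n times the empirical entropy.

```python
import math

def kt_code_length(bits):
    c = [0.5, 0.5]                    # KT estimator, as in the previous sketch
    total = 0.0
    for x in bits:
        total += -math.log2(c[x] / (c[0] + c[1]))
        c[x] += 1
    return total

def best_iid_code_length(bits):
    # best model in the class: Bernoulli(p) at the empirical frequency
    n, k = len(bits), sum(bits)
    if k in (0, n):
        return 0.0
    p = k / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

s = [1, 0, 0, 1, 0, 0, 0, 1] * 16     # n = 128
R = (kt_code_length(s) - best_iid_code_length(s)) / len(s)
print(f"redundancy = {R:.4f} bits/symbol")   # shrinks roughly like (log n)/(2n)
```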
How to Choose a Model Class?
Universal coding tells us how to encode optimally w.r.t. a class; it doesn't tell us how to choose a class! Some possible criteria:
- complexity: existence of efficient algorithms
- prior knowledge on the data
We will see that the bigger the class, the slower the best possible convergence rate of the redundancy to 0
in this sense, prior knowledge is of paramount importance: don’t learn what you already know!
Ultimately, the choice of model class is an art
Parametric Model Classes
A useful limitation to the model class is to assume
  𝒞 = { Pθ , θ ∈ Θ^d }, a parameter space of dimension d
Examples:
- Bernoulli: d = 1; general i.i.d. model: d = α − 1 (α = |A|)
- FSM model with k states: d = k(α − 1)
- memoryless geometric distribution on the integers i ≥ 0: P(i) = θ^i (1 − θ), d = 1
A straightforward method: two-part code [Rissanen '84]
  encode the best θ̂ first, then the data: ⌈−log p(x^n | θ̂)⌉ + (bits for θ̂)
  "model cost" (the bits for θ̂): grows with d; probability of the data under θ̂: grows with d
Trade-off: the dimension of the parameter space plays a fundamental role in modeling problems
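A sketch of the two-part idea for the Bernoulli class (d = 1). The √n-level quantization grid for θ̂, costing about (1/2) log n bits, is an assumption chosen to balance the two parts; it is not a detail from the talk.

```python
import math

def two_part_code_length(bits):
    n, k = len(bits), sum(bits)
    levels = max(2, int(round(math.sqrt(n))))    # grid of candidate thetas
    theta = (k + 0.5) / (n + 1)                  # smoothed ML estimate
    q = round(theta * levels) / levels           # quantized parameter
    q = min(max(q, 1 / (2 * levels)), 1 - 1 / (2 * levels))
    model_cost = math.ceil(math.log2(levels))    # bits to describe theta-hat
    data_cost = -(k * math.log2(q) + (n - k) * math.log2(1 - q))
    return model_cost + math.ceil(data_cost)     # two parts: parameter + data

s = [1, 0, 0, 0, 1, 0, 0, 0] * 32
print(two_part_code_length(s), "bits")
```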
Fundamental Lower Bound
A criterion for measuring the optimality of a universal model is provided by Rissanen's lower bound [Rissanen '84]:
  (1/n) Eθ[ −log P(x^n) ] ≥ Hθ + (d log n / 2n) (1 − ε)
for every P(·), any ε > 0, and sufficiently large n, for all parameter values θ except for a set whose volume → 0 as n → ∞, provided a "good" estimator of θ exists
Conclusion: the number of parameters affects the achievable convergence rate of a universal code length to the entropy
Contexts and Tree Models
More efficient parametrization of a Markov process [Weinberger/Lempel/Ziv '92, Weinberger/Rissanen/Feder '95]
Any suffix of a sequence x^t is called a context in which the next symbol x_{t+1} occurs
For a finite-memory source P, the conditioning states s(x^t) are contexts that satisfy
  P(a | x^t) = P(a | s(x^t)), i.e., P(a | u s(x^t)) = P(a | s(x^t)) ∀ a ∈ A, u ∈ A*
# of parameters: α − 1 per leaf of the tree
There exist efficient universal schemes in the class of tree models of any size [Weinberger/Rissanen/Feder '95, Willems/Shtarkov/Tjalkens '95, Martín/Seroussi/Weinberger '04]
[Diagram: a binary sequence … 1 1 1 … 0 0 1 0, with the context of the next input bit highlighted on a context tree]
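A toy version of context modeling (a fixed-depth, complete tree rather than the adaptively grown trees of the cited schemes; the depth sweep is illustrative): gather counts per context and compute the adaptive code length.

```python
import math

# Fixed-depth context model over a binary alphabet: each depth-k context
# carries alpha - 1 = 1 free parameter, estimated adaptively (KT).
def context_code_length(bits, k):
    counts = {}                        # context (k-tuple) -> [n0, n1]
    total = 0.0
    for t in range(k, len(bits)):
        ctx = tuple(bits[t - k:t])     # suffix of the past = context
        c = counts.setdefault(ctx, [0.5, 0.5])
        x = bits[t]
        total += -math.log2(c[x] / (c[0] + c[1]))
        c[x] += 1
    return total

s = [1, 1, 0, 1] * 64
for k in (0, 1, 2, 3):
    print(k, f"{context_code_length(s, k):.1f} bits")
# deeper contexts fit better, but each added level multiplies the number
# of parameters (and hence the model cost) by alpha
```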
Lossless Image Compression (the real thing…)
Some applications of lossless image compression:
- images meant for further analysis and processing (as opposed to just human perception)
- images where loss might have legal implications
- images obtained at great cost
- applications with intensive editing and repeated compression/decompression cycles
- applications where the desired quality of the rendered image is unknown at time of acquisition
International standard: JPEG-LS (1998)
Universality vs. Prior Knowledge
Application of universal algorithms for tree models directly to real images yields poor results
- some structural symmetries typical of images are not captured by the model
- a universal model has an associated "learning cost": why learn something we already know?
Modeling approach: limit model class by use of “prior knowledge”
- for example, images tend to be a combination of smooth regions and edges
- predictive coding was successfully used for years: it encodes the difference between a pixel and a predicted value of it
- prediction errors tend to follow a Laplacian distribution ⇒ AR model + Laplacian, where both the center and the decay are context-dependent
- Prediction = fixed prediction + adaptive correction
Models for Images
In practice, contexts are formed out of a finite subset of the past sequence
[Diagram: causal template — neighbors c, a, b, d around the current sample x]
Conditional probability model for prediction errors: two-sided geometric distribution (TSGD)
  P(e) = c θ^{|e+s|},  θ ∈ (0,1),  s ∈ [0,1)
a "discrete Laplacian"; the shift s is constrained to [0,1) by integer-valued adaptive correction (bias cancellation) on the fixed predictor
[Plot: TSGD P(e), two-sided geometric in e, centered at −s]
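The fixed part of "Prediction = fixed prediction + adaptive correction" in LOCO-I/JPEG-LS is the well-known median edge detector (MED); a short sketch (the sample values in the examples are made up):

```python
# MED predictor: with a = left, b = above, c = above-left neighbors of the
# current sample x, it picks min(a, b) or max(a, b) when the template
# suggests an edge at x, and the planar interpolation a + b - c otherwise.
def med_predict(a, b, c):
    if c >= max(a, b):
        return min(a, b)
    if c <= min(a, b):
        return max(a, b)
    return a + b - c                     # planar (no-edge) case

print(med_predict(a=50, b=200, c=210))   # edge configuration: predicts 50
print(med_predict(a=100, b=110, c=105))  # smooth region: predicts 105
```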
Complexity Constraints
Are sophisticated models worth the price in complexity?
- Algorithm Context and CTW are linear-time algorithms for tree sources of limited depth, but quite expensive in practice
- even arithmetic coding is not something that a practitioner will easily buy in many applications!
Is high complexity required to approach the best possible compression?
The idea in JPEG-LS: apply judicious modeling to reduce complexity, rather than to improve compression
- the modeling/coding separation paradigm is less neat without complex models or arithmetic coding
- optimal prefix codes for TSGDs: [Merhav/Seroussi/Weinberger '00]
The LOCO-I algorithm
JPEG-LS is based on the LOCO-I algorithm: LOw COmplexity LOssless COmpression of Images
Basic components:
- fixed + adaptive prediction
- conditioning contexts based on quantized gradients
- two-parameter conditional probability model (TSGD)
- low-complexity adaptive coding matched to the model (variants of Golomb codes)
- run-length coding in flat areas to address the drawback of symbol-by-symbol coding
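A sketch of the Golomb-Rice variant (power-of-2 Golomb codes) with a static parameter k; JPEG-LS adapts k per context, which is not shown here:

```python
# Prediction errors are first interleaved onto the nonnegative integers,
# then coded in unary (quotient) plus k raw bits (remainder).
def rice_map(e):
    # ..., -2, -1, 0, 1, 2, ...  ->  3, 1, 0, 2, 4, ...
    return 2 * e if e >= 0 else -2 * e - 1

def rice_encode(v, k):
    q, r = v >> k, v & ((1 << k) - 1)
    bits = "1" * q + "0"                 # unary quotient + terminator
    return bits + format(r, f"0{k}b") if k > 0 else bits

for e in (-2, -1, 0, 1, 2):
    print(e, "->", rice_encode(rice_map(e), k=1))
```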
The goal: upon observing the noisy sequence z1 z2 … zn, choose x̂1 x̂2 … x̂n to optimize some fidelity criterion (e.g.: minimize number of symbol errors, squared distance, etc.)
A natural extension of work on prediction/compression
Applications: image and video denoising, text correction, financial data denoising, DNA sequence analysis, wireless communications…
[Diagram: discrete source (x1 x2 … xn) → discrete memoryless channel (noise) → (z1 z2 … zn) → denoiser → (x̂1 x̂2 … x̂n)]
Discrete Universal DEnoising (DUDE)
A denoising scheme that is
- universal (no prior knowledge or assumption on X)
- asymptotically optimal (provably approaches the performance of the best scheme that has full knowledge of the statistics of X)
- practical (low complexity)
Theoretical impact: universal denoising is possible and can be accomplished in linear time
(2006 Communications Society/Information Theory Society Best Paper Award)
Practical impact: over 20 patents filed/granted
Contributors: Ordentlich, Seroussi, Weinberger (ITR – HP Labs), Weissman (Stanford/ITR), Verdú (Princeton)
The DUDE algorithm
DUDE: how it’s done
pass 1:
- gather statistics on symbol occurrences per context pattern
- estimate the noiseless symbol distribution given context pattern and noisy sample (posterior distribution)
pass 2: denoise each symbol, based on the estimated posterior
- who do you believe? what you see, or what the global stats tell you?
- precise decision formula proven asymptotically optimal
- context template size must be carefully chosen
[Diagram: data sequence x1 x2 … xn → noisy channel → z1 z2 … zn; the sample zi being denoised together with its context samples]
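A minimal sketch of the two-pass procedure for a binary symmetric channel (the crossover probability delta and the context radius r are assumed known/chosen; the flip threshold below results from inverting the channel on the empirical counts under Hamming loss):

```python
import random
from collections import Counter

# Flip z_i iff  m[1 - z] / m[z]  >  ((1-d)^2 + d^2) / (2 d (1-d)),
# where m counts center symbols per two-sided context of radius r.
def dude_bsc(z, delta, r=2):
    thresh = ((1 - delta) ** 2 + delta ** 2) / (2 * delta * (1 - delta))
    m = Counter()
    for i in range(r, len(z) - r):                 # pass 1: gather statistics
        ctx = (tuple(z[i - r:i]), tuple(z[i + 1:i + r + 1]))
        m[ctx, z[i]] += 1
    out = list(z)                                  # boundary samples kept as received
    for i in range(r, len(z) - r):                 # pass 2: denoise
        ctx = (tuple(z[i - r:i]), tuple(z[i + 1:i + r + 1]))
        if m[ctx, 1 - z[i]] > thresh * m[ctx, z[i]]:
            out[i] = 1 - z[i]
    return out

random.seed(1)
x = [0] * 200 + [1] * 200                          # piecewise-constant source
z = [b ^ (random.random() < 0.1) for b in x]       # BSC with delta = 0.1
print(sum(a != b for a, b in zip(x, dude_bsc(z, 0.1))), "errors after DUDE")
```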
Application 1: Image denoising
Best previous result in the literature: PSNR = 35.6 dB @ error rate = 30% (Chan, Ho & Nikolova, IEEE IP Oct '05)
[Images: original corrupted by "salt and pepper" noise at error rate = 30% (PSNR = 10.7 dB) vs. DUDE-denoised (PSNR = 38.3 dB)]
The main challenge in image denoising
Key component of DUDE: model the conditional distribution P(Zi | context of Zi) and infer P(Xi | Zi and context of Zi) from it
Main issue: large alphabet ⇒ large number of model parameters ⇒ high learning cost
Leveraged the "semi-universal" approach from image compression: rely on prior knowledge (except that here the data is noisy and the models are non-causal). Main tools:
- prediction
- contexts based on quantized data
- parameterized distributions
Again, the "holy grail" is to incorporate a "safe amount" of prior knowledge to reduce the richness of the class
State-of-the-art for "salt-and-pepper" noise removal
Competitive for Gaussian and "real world" noise removal, but still room for improvement
Application 2: Denoiser-enhanced ECC
Suitable for wireless communications
Leaves the overall system "as-is", but enhances the receiver by denoising the signal prior to error-correction (ECC) decoding
Allows designing a "better receiver" that will recover signals other receivers would reject as undecodable
[Diagram: denoising (which exploits source redundancy, natural) moves the received noisy codeword back into the decodable region of the regular ECC (which handles code redundancy, structured); the non-enhanced receiver gets no reception where the DUDE-enhanced one decodes]
DUDE Enhanced Decoding
[Block diagram: unknown data source → x1 x2 … xk → channel encoder (transmitter) appends parity p1 p2 … pm → discrete memoryless channel → noisy data z1 z2 … zk and noisy parity r1 r2 … rm → universal denoiser (applied to z1 z2 … zk) → channel decoder (receiver) → received data x̂1 x̂2 … x̂k]
What is “2D Information Theory”?
Analysis and design of communication systems involving 2D signals and channels
[Diagram: W → encoder → n×n array (x_{i,j}) → 2D channel → n×n array (y_{i,j}) → decoder → W′]
Emphasis on nontrivial 2D channels that do not decompose into 1D "tracks". Examples:
- inter-symbol interference (ISI): y_{i,j} = x_{i,j} + x_{i−1,j} + x_{i,j−1} + x_{i−1,j−1} + n_{i,j}
- constrained channel: x_{i,j} ∈ {0,1}, no two adjacent 1's in any row or column (inherently 2D)
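A tiny sketch of the ISI example as a 2D convolution (the 2×2 all-ones interference mask matches the formula above as reconstructed here; the noise level is illustrative):

```python
import numpy as np

# y[i,j] = x[i,j] + x[i-1,j] + x[i,j-1] + x[i-1,j-1] + n[i,j]
rng = np.random.default_rng(0)
x = rng.integers(0, 2, (8, 8)).astype(float)
xp = np.pad(x, ((1, 0), (1, 0)))             # zero boundary for i-1, j-1
y = xp[1:, 1:] + xp[:-1, 1:] + xp[1:, :-1] + xp[:-1, :-1]
y += 0.1 * rng.standard_normal((8, 8))       # additive noise n[i,j]
# each output mixes four inputs across rows AND columns, so the channel
# does not decompose into independent 1D tracks
print(y.round(2))
```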
Motivation
Existing storage media (while 2D) are "converted" to 1D tracks separated by buffer space
Example: DVD+RW [from DVD+RW alliance white paper]
Next-generation systems will seek to make use of the buffer space for storage
- will have to deal with 2D interference and constraints
- channels will be inherently 2D
Our research activity
- 2D ISI channels: computational complexity of "optimal" detection [Ordentlich and Roth, 2006]
- 2D constrained channels: new coding techniques and lower bounds on capacity [Ordentlich and Roth, 2000 and 2007]
2D constrained channels
The 2D channel only permits arrays {x_{i,j}} satisfying certain constraints. Examples (binary arrays):
- DC-free: # of 0's = # of 1's in every row and column
- run-length limited (RLL): at least d and no more than k 0's between any two 1's in any row and column
- no isolated bits (NIB): a bit that differs from all of its neighbors is not allowed anywhere in the array
Not yet clear which constraints will be most relevant in practice
- research focused on developing general tools and analysis
Research problems and contributions
Determining capacity C: the asymptotic number of bits per channel symbol (rate) that can be encoded into the constraint,
  C = lim_{n→∞} (1/n²) log₂ (# of legal n×n arrays)
Encoding/decoding algorithms:
- low complexity, high rates
- imply lower bounds on capacity
A new coding approach: 2D approximate enumerative coding
- fixed rate
- achieves lowest redundancy for the 2D DC-free constraint
- achieves highest provable rate for at least one 2D RLL constraint (likely others), and has the highest rate among (tractable) fixed-rate encoders in other cases
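The limit can be approximated by exact counting for small n; a sketch for the "no two adjacent 1's" constraint, using a row-by-row dynamic program over the valid row patterns:

```python
import math
from itertools import product

def count_legal(n):
    # rows with no two horizontally adjacent 1's
    rows = [r for r in product((0, 1), repeat=n)
            if not any(a and b for a, b in zip(r, r[1:]))]
    counts = {r: 1 for r in rows}          # ways to end with each top row
    for _ in range(n - 1):
        new = {}
        for r, c in counts.items():
            for s in rows:
                # no two vertically adjacent 1's between consecutive rows
                if not any(a and b for a, b in zip(r, s)):
                    new[s] = new.get(s, 0) + c
        counts = new
    return sum(counts.values())

for n in (2, 4, 6, 8):
    c = count_legal(n)
    print(n, f"{math.log2(c) / n ** 2:.4f}")   # approaches C ≈ 0.588 as n grows
```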
Error-correcting codes in nanotechnology
Manufacturing perfect electronic circuits is expensive
as feature size decreases, cost of perfection may become prohibitive well before quantum phenomena dominate feature properties
In the future, circuits will need to function perfectly even with a significant percentage of imperfect components
In nano-electronics, a major roadblock is the interconnect between the outside world and nano-scale resources
We will discuss the application of error-correcting codes to the design of defect-tolerant micro/nano demultiplexers (demuxes)
[Diagram: hybrid circuits — a micro-scale encoder block driving a demux that addresses the nanowires of a nano crossbar memory; mixed nanowires and conventional wires]
Two examples
Collaboration between the Information Theory Research group (Seroussi, Roth, Vontobel) and Quantum Science Research
Diode-based demuxes:
- a peculiar error probability formula
- a notion of coding gain for manufacturing cost
Resistor-based demuxes:
- a new combinatorial constraint on codes
- construction of optimal constant-weight codes for the constraint
[Scientific American, Nov. '05]
Addressing nano-wires with diode logic
[Diagram: conventional address lines (driven low/high) crossing the nano-wires; for each address, a unique nano-wire is set to low (selected)]
Effect of open-connection defect
With an open connection, a given address may select more than one nanowire
Example: 11 and 01 are both selected with address 01
[Diagram: demux with an open connection; address 01 pulls both nanowires 01 and 11 low]
Adding redundant address lines: Error-correcting codes
A redundant address line is added: addresses are over-specified
  00 → 000, 01 → 011, 10 → 101, 11 → 110
Overall parity check: d ≥ 2
[Diagram: demux with three address lines and nanowires 110, 101, 011, 000]
Over-specified addressing withstands open connection
Wire 110 is not selected, since the extra line "pulls it up" ⇒ 011 is uniquely selected
In general, we will use an (n, M, d) code C, where M is the number of nanowires and n the number of encoded address lines
We use an encoder for C, but no decoder is needed
Given M, circuit area is linear in n
The requirements are similar to usual ECC: maximize d with minimal n
With current processes, parameters are fairly small, e.g., M ≈ 100–1000, n ≈ 10–20
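A sketch of the over-specified addressing map; the parity encoder reproduces the 2-bit example from the slide, and the distance check is generic:

```python
from itertools import combinations, product

# Append an overall parity bit so every pair of encoded addresses differs
# in at least d = 2 positions, making a single open connection survivable.
def encode(addr):
    return addr + (sum(addr) % 2,)

addresses = list(product((0, 1), repeat=2))
codewords = [encode(a) for a in addresses]   # 00->000, 01->011, 10->101, 11->110
dmin = min(sum(a != b for a, b in zip(u, v))
           for u, v in combinations(codewords, 2))
print(codewords, "d =", dmin)                # d = 2
```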
Defect model and failure modes
Assume a diode is not successfully formed with probability p (open connection; the process can be biased so that this is the dominant defect mode)
Defects are statistically independent
Two nanowire failure modes:
- a nanowire is destructive if it has enough open connections that it is selected when another nanowire is addressed
- a nanowire is a victim if addressing it causes a destructive nanowire to be selected
Both nanowires are disabled
Addressable memory per unit of area
[Plot: addressable memory per unit area (normalized, 0.0–1.0) vs. fraction of open connections (0.00–0.40), comparing uncoded addressing with [8,7,2], [11,7,3], and [12,7,4] codes]
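A Monte Carlo sketch of how such a curve could be estimated. The spurious-selection rule below — wire c is mis-selected under the address of wire c' iff all of c's diodes at positions where the two codewords differ are open — is a simplified reading of the diode-logic slides, not the talk's exact circuit analysis; the code used here is a small parity-extended example, not one of the codes in the plot.

```python
import random
from itertools import combinations, product

def surviving_fraction(codewords, p, trials=1000):
    n = len(codewords[0])
    total_alive = 0
    for _ in range(trials):
        # diode status, one diode per (wire, address line) pair: True = open
        open_ = [[random.random() < p for _ in range(n)] for _ in codewords]
        disabled = set()
        for j, k in combinations(range(len(codewords)), 2):
            diff = [i for i in range(n) if codewords[j][i] != codewords[k][i]]
            if all(open_[j][i] for i in diff):   # j destructive, k victim
                disabled |= {j, k}
            if all(open_[k][i] for i in diff):   # k destructive, j victim
                disabled |= {j, k}
        total_alive += len(codewords) - len(disabled)
    return total_alive / (trials * len(codewords))

# 3-bit addresses extended with an overall parity bit: n = 4, M = 8, d = 2
code = [c + (sum(c) % 2,) for c in product((0, 1), repeat=3)]
for p in (0.02, 0.05, 0.10):
    print(f"p = {p:.2f}: {surviving_fraction(code, p):.3f} of memory addressable")
```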