Universal Similarity, Paul Vitanyi, CWI and University of Amsterdam



slide-1
SLIDE 1

Universal Similarity Paul Vitanyi

CWI and University of Amsterdam

slide-2
SLIDE 2

The Problem:

Given: Literal objects (binary files) 1, 2, 3, 4, 5

Determine: “Similarity” Distance Matrix (distances between every pair)

Applications: Clustering, Classification, Evolutionary trees of Internet documents, computer programs, chain letters, genomes, languages, texts, music pieces, OCR, ...

slide-3
SLIDE 3

Andrey Nikolaevich Kolmogorov (1903-1987, Tambov, Russia)

  • Measure Theory
  • Probability
  • Analysis
  • Intuitionistic Logic
  • Cohomology
  • Dynamical Systems
  • Hydrodynamics
  • Kolmogorov complexity

slide-4
SLIDE 4

TOOL:

Information Distance (Li, Vitanyi, 96; Bennett, Gacs, Li, Vitanyi, Zurek, 98)

D(x,y) = min { |p| : p(x)=y & p(y)=x }

where p is a binary program for a Universal Computer (Lisp, Java, C, Universal Turing Machine).

Theorem

(i) D(x,y) = max {K(x|y), K(y|x)}

where K(x|y) is the Kolmogorov complexity of x given y, defined as the length of the shortest binary program that outputs x on input y.

(ii) D(x,y) ≤ D’(x,y) for any computable distance D’ satisfying $\sum_y 2^{-D'(x,y)} \le 1$ for every x.

(iii) D(x,y) is a metric.

slide-5
SLIDE 5

However:

[Figure: two pairs of objects, x, y and x’, y’.] D(x,y) = D(x’,y’), but x and y are much more similar than x’ and y’.

So, we normalize:

d(x,y) = D(x,y) / max {K(x), K(y)}

Normalized Information Distance (NID)

The “Similarity metric”

slide-6
SLIDE 6

Properties NID:

Theorem:

  • 0 ≤ d(x,y) ≤ 1
  • d(x,y) is a metric: symmetric, triangle inequality, d(x,x)=0

Drawback: NID(x,y) = d(x,y) is noncomputable, since K(.) is!

slide-7
SLIDE 7

In Practice:

Replace NID(x,y) by the Normalized Compression Distance (NCD):

NCD(x,y) = ( Z(xy) - min{Z(x), Z(y)} ) / max{Z(x), Z(y)}

where Z(x) is the length (#bits) of the compressed version of x using compressor Z (gzip, bzip2, PPMZ, ...).

This NCD is actually about the same formula as NID, but rewritten using “Z” instead of “K”.

Li Badger Chen Kwong Kearney Zhang 01; Li Vitanyi 01/02; Li Chen Li Ma Vitanyi 04
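The NCD formula on this slide can be tried directly with a stock compressor. The following is a minimal sketch, not the authors' implementation: it assumes Python's standard `zlib` module (the DEFLATE algorithm used by gzip) as the compressor Z, and the function name `ncd` and the sample strings are illustrative only. Sizes are measured in bytes rather than bits, which leaves the ratio unchanged.

```python
import zlib
from typing import Callable


def ncd(x: bytes, y: bytes,
        compress: Callable[[bytes], bytes] = zlib.compress) -> float:
    """Normalized Compression Distance between two byte strings.

    Z(.) is approximated by the length of the compressor's output.
    """
    zx = len(compress(x))
    zy = len(compress(y))
    zxy = len(compress(x + y))  # Z(xy): compress the concatenation
    return (zxy - min(zx, zy)) / max(zx, zy)


# Illustrative inputs (made up for this sketch).
text_a = b"the quick brown fox jumps over the lazy dog. " * 50
text_b = b"the quick brown fox jumps over the lazy cat. " * 50

print("NCD(a, b) =", ncd(text_a, text_b))
```

With a real compressor the value may fall slightly outside [0, 1], and NCD(x,x) is small but usually not exactly 0, because Z only approximates K.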

slide-8
SLIDE 8

Family of compression-based similarities

The NCD is actually a family of similarity measures, parametrized with the compressor, e.g., gzip, bzip2, PPMZ, ... (forget the crippled compressors like compress, awk, ...)

slide-9
SLIDE 9

Application: Clustering of Natural Data

Unusual:

  • We don’t know the number of clusters
  • We don’t have a criterion to distinguish clusters

Therefore, we hierarchically cluster to let the data decide these issues naturally.
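To make "hierarchical clustering from a distance matrix" concrete, here is a small sketch. It deliberately swaps in plain average-linkage agglomerative clustering, which is simpler than the tree-fitting method behind the S(T) scores reported later in the talk. The function name `average_linkage` and the toy distance matrix are assumptions made for illustration, not data from the slides.

```python
from typing import List, Sequence, Tuple, Union

# A tree is either a leaf label or a pair of subtrees.
Tree = Union[str, tuple]


def average_linkage(labels: Sequence[str],
                    dist: Sequence[Sequence[float]]) -> Tree:
    """Build a binary tree by repeatedly merging the two closest clusters."""
    # Each cluster is (subtree, indices of its members in `labels`).
    clusters: List[Tuple[Tree, List[int]]] = [
        (label, [i]) for i, label in enumerate(labels)
    ]

    def cluster_distance(a: List[int], b: List[int]) -> float:
        # Average pairwise distance between members of the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = cluster_distance(clusters[p][1], clusters[q][1])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        merged = ((clusters[p][0], clusters[q][0]),
                  clusters[p][1] + clusters[q][1])
        del clusters[q]  # q > p, so delete q first to keep p valid
        del clusters[p]
        clusters.append(merged)
    return clusters[0][0]


# Toy symmetric distance matrix (made-up numbers).
toy_labels = ["a", "b", "c", "d"]
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
print(average_linkage(toy_labels, toy_dist))  # (('a', 'b'), ('c', 'd'))
```

No number of clusters is passed in: the nesting of the resulting tree shows the groups, which is the point the slide makes.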

slide-10
SLIDE 10

Applications:

First One: Phylogeny of Species

Eutherian Orders: Ferungulates, Primates, Rodents (Outgroup: Platypus, Wallaroo)

Hasegawa et al 98 concatenates selected proteins and gets different groupings depending on the proteins used.

We use whole mtDNA, approximate K(.) by GenCompress to determine the NCD matrix, and get only one tree.

slide-11
SLIDE 11

Who is our closer relative?

slide-12
SLIDE 12

Evolutionary Tree of Mammals:

Li Badger Chen Kwong Kearney Zhang 01 Li Vitanyi 01/02 Li Chen Li Ma Vitanyi 04

slide-13
SLIDE 13

Embedding NCD Matrix in dendrogram (hierarchical clustering) for this Large Phylogeny (no errors it seems)

Therian hypothesis versus Marsupionta hypothesis

Mammals: Eutheria, Metatheria, Prototheria. Which pair is closest?

Cilibrasi, Vitanyi 2005

slide-14
SLIDE 14

NCD Matrix 24 Species (mtDNA).

Diagonal elements about 0. Distances between primates ca 0.6.
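The remark that the diagonal is "about 0" rather than exactly 0 is easy to reproduce. The sketch below builds a full NCD matrix for a list of byte strings. It is an illustration under stated assumptions: `zlib` stands in for the GenCompress compressor used in the talk, and the names `z` and `ncd_matrix` and the short sample sequences are hypothetical.

```python
import zlib
from typing import List, Sequence


def z(data: bytes) -> int:
    """Compressed length in bytes, standing in for Z(.)."""
    return len(zlib.compress(data, 9))


def ncd_matrix(objects: Sequence[bytes]) -> List[List[float]]:
    """Pairwise NCD matrix; entry [i][j] uses the concatenation i then j."""
    sizes = [z(obj) for obj in objects]
    n = len(objects)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            zxy = z(objects[i] + objects[j])
            low, high = sorted((sizes[i], sizes[j]))
            matrix[i][j] = (zxy - low) / high
    return matrix


# Made-up sample sequences, only to show the call.
samples = [b"ACGTTGCA" * 200, b"ACGTTGCT" * 200, b"GGGCCCAT" * 200]
for row in ncd_matrix(samples):
    print(["%.3f" % value for value in row])
```

The diagonal entry is (Z(xx) - Z(x)) / Z(x). A compressor needs a few extra bytes to say "repeat x", so the value is small but positive. The matrix is also only approximately symmetric, since Z(xy) and Z(yx) can differ slightly.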

slide-15
SLIDE 15

Identifying SARS Virus: S(T)=0.988

AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2.

slide-16
SLIDE 16

Clustering : Phylogeny of 15 languages: Native American, Native African, Native European Languages

slide-17
SLIDE 17

Applications Everywhere

Genomics and the language tree are just examples; also used with (e.g.):

  • MIDI music files (music clustering)
  • Plagiarism detection
  • Phylogeny of chain letters
  • SARS virus classification
  • Computer worms and Internet traffic (attacks) analysis
  • Literature
  • OCR
  • Astronomy: radio telescope time sequences
  • Spam detection
  • Time sequences (all databases used in all major data-mining conferences of the last 10 years): superior over all methods in anomaly detection and on heterogeneous data

Cilibrasi, Vitanyi, de Wolf, 2003/2004; Cilibrasi, Vitanyi, 2005.

slide-18
SLIDE 18

Russian Authors (in original Cyrillic) S(T)=0.949

I.S. Turgenev, 1818--1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky 1821--1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy 1828--1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol 1809--1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891--1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]
slide-19
SLIDE 19

Same Russian Texts in English Translation; S(T)=0.953

Files start to cluster according to translators!

I.S. Turgenev, 1818--1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821--1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828--1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol 1809--1852 [Dead Souls (C.J. Hogarth), Taras Bulba ($\approx$ G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled ($\approx$ I.F. Hapgood)]; M. Bulgakov 1891--1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]

slide-20
SLIDE 20

12 Classical Pieces (Bach, Debussy, Chopin): no errors
slide-21
SLIDE 21

Optical Character Recognition: Data Handwritten Digits from NIST Data Base

slide-22
SLIDE 22

Optical Character Recognition: Clustering: S(T)=0.901

slide-23
SLIDE 23

Heterogeneous Data; Clustering perfect with S(T)=0.95.

Clustering of radically different data. No features known. Only our parameter-free method can do this!!

slide-24
SLIDE 24

You can use it too!

CompLearn Toolkit: http://www.complearn.org

“x” and “y” are literal objects (files).

But what if we do not have the object as a file?

What about abstract objects like “home”, “red”, “Socrates”, “chair”, ...?

Or names for literal objects?

slide-25
SLIDE 25

Non-Literal Objects

Googling for Meaning

Google distribution:

g(x) = (Google page count for “x”) / (# pages indexed)

Cilibrasi, Vitanyi, 2004/2007.

slide-26
SLIDE 26

Google Compressor

Google code length:

G(x) = log 1 / g(x)

This is the Shannon-Fano code length that has minimum expected code word length w.r.t. g(x).

Hence we can view Google as a Google Compressor.

slide-27
SLIDE 27

Normalized Google Distance (NGD)

NGD(x,y) = ( G(x,y) - min{G(x), G(y)} ) / max{G(x), G(y)}

Same formula as NCD, using Z = G (Google compressor).

Use the Google counts and the CompLearn Toolkit to apply NGD.
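The NGD can also be written directly in terms of page counts, which is the more convenient form for computation. The short derivation below uses two symbols that are not on the slides and are introduced only here: f(x) for the page count of “x” and N for the number of pages indexed, so that g(x) = f(x)/N. It also assumes that G(x,y) is defined from the joint page count f(x,y) in the same way as G(x).

```latex
\begin{align*}
G(x) &= \log\frac{1}{g(x)} = \log N - \log f(x),
\qquad
G(x,y) = \log N - \log f(x,y), \\
\min\{G(x),G(y)\} &= \log N - \max\{\log f(x), \log f(y)\}, \\
\max\{G(x),G(y)\} &= \log N - \min\{\log f(x), \log f(y)\}, \\
\mathrm{NGD}(x,y)
  &= \frac{G(x,y) - \min\{G(x),G(y)\}}{\max\{G(x),G(y)\}}
   = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}
          {\log N - \min\{\log f(x), \log f(y)\}} .
\end{align*}
```

The base of the logarithm cancels in the ratio, so any base gives the same NGD.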

slide-28
SLIDE 28

Example

  • “horse”: #hits = 46,700,000
  • “rider”: #hits = 12,200,000
  • “horse” “rider”: #hits = 2,630,000
  • #pages indexed: 8,058,044,651

NGD(horse,rider) = 0.443

Theoretically and empirically: scale-invariant
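The value on this slide can be checked with a few lines of arithmetic. This is a sketch that plugs the four counts from the slide into the NGD formula with G(x) = log 1/g(x). The function name `ngd` and its parameter names are illustrative, not part of the CompLearn Toolkit.

```python
import math


def ngd(hits_x: int, hits_y: int, hits_xy: int, pages_indexed: int) -> float:
    """Normalized Google Distance from raw page counts."""
    # Google code lengths G(.) = log(1 / g(.)), with g = hits / pages indexed.
    gx = math.log(pages_indexed / hits_x)
    gy = math.log(pages_indexed / hits_y)
    gxy = math.log(pages_indexed / hits_xy)
    return (gxy - min(gx, gy)) / max(gx, gy)


value = ngd(
    hits_x=46_700_000,            # "horse"
    hits_y=12_200_000,            # "rider"
    hits_xy=2_630_000,            # "horse" "rider"
    pages_indexed=8_058_044_651,
)
print(round(value, 3))  # 0.443, matching the slide
```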

slide-29
SLIDE 29

Colors and Numbers—The Names! Hierarchical Clustering

colors numbers

slide-30
SLIDE 30

Hierarchical Clustering of 17th Century Dutch Painters, Paintings given by name, without painter’s name.

Hendrickje slapend, Portrait of Maria Trip, Portrait of Johannes Wtenbogaert, The Stone Bridge, The Prophetess Anna, Leiden Baker Arend Oostwaert, Keyzerswaert, Two Men Playing Backgammon, Woman at her Toilet, Prince's Day, The Merry Family, Maria Rey, Consul Titus Manlius Torquatus, Swartenhont, Venus and Adonis

slide-31
SLIDE 31

Mathematicians

slide-32
SLIDE 32

H5N1 (Birdflu) virus mutations

slide-33
SLIDE 33

Next: Binary Classification

Here we use the NGD for a Support Vector Machine (SVM) binary classification learner (we could also use a neural network).

Setup: Anchor terms, positive/negative examples, Test set → Accuracy

slide-34
SLIDE 34

Using NGD in SVM (Support Vector Machines) to learn concepts (binary classification)

Example: Emergencies

slide-35
SLIDE 35

Example: Classifying Prime Numbers

Actually, 91 = 7 × 13 is not a prime. So accuracy is 17/19 = 89.47%

slide-36
SLIDE 36

Example: Electrical Terms

slide-37
SLIDE 37

Example: Religious Terms

slide-38
SLIDE 38

Comparison with WordNet Semantics

http://www.cogsci.princeton.edu/~wn

NGD-SVM Classifier on 100 randomly selected WordNet Categories

  • Randomly selected positive, negative and test sets
  • Histogram gives accuracy with respect to the knowledge entered by PhD experts in the WordNet Database
  • Mean accuracy is 0.8725
  • Standard deviation is 0.1169
  • Accuracy almost always > 75%, automatically
slide-39
SLIDE 39

Translation Using NGD

Problem: Translation:

slide-40
SLIDE 40

Selected Bibliography

  • D. Benedetto, E. Caglioti, and V. Loreto, Language trees and zipping, Physical Review Letters, 88:4(2002), 048702.
  • C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek, Information distance, IEEE Trans. Inform. Th., 44:4(1998), 1407--1423.
  • C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories, Scientific American, June 2003, 76--81.
  • X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared information and program plagiarism detection, IEEE Trans. Inform. Th., 50:7(2004), 1545--1551.
  • R. Cilibrasi, The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/ .
  • R. Cilibrasi, P.M.B. Vitanyi, R. de Wolf, Algorithmic clustering of music based on string compression, Computer Music Journal, 28:4(2004), 49--67.
  • R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression, IEEE Trans. Inform. Th., 51:4(2005), 1523--1545.
  • R. Cilibrasi, P.M.B. Vitanyi, Automatic meaning discovery using Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).
  • E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Toward parameter-free data mining, In: Proc. 10th ACM SIGKDD Intn'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22--25, 2004, 206--215.
  • M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149--154.
  • M. Li and P.M.B. Vitanyi, Reversibility and adiabatic computation: trading time and space for energy, Proc. Royal Society of London, Series A, 452(1996), 769--789.
  • M. Li and P.M.B. Vitanyi, Algorithmic complexity, pp. 376--382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
  • M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitanyi, The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250--3264.
  • M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 2nd Edition, 1997.
  • A. Londei, V. Loreto, M.O. Belardinelli, Music style and authorship categorization by informative compressors, Proc. 5th Triennial Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8-13, 2003, Hannover, Germany, pp. 200--203.
  • S. Wehner, Analyzing network traffic and worms using compression, Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/