

SLIDE 1

The Challenges Previous Measures Our Ideas Structural Information Gene Map Resistance Theory New Searching

Structural Information Theory: Principles for Distinguishing Order From Disorder

Angsheng Li

Institute of Software Chinese Academy of Sciences

The 3rd Workshop on Big Data and Computational Intelligence, Beijing, 29-31 July 2016

SLIDE 2

Outline

  • 1. The challenges
  • 2. Previous measures
  • 3. Our ideas
  • 4. Structural information
  • 5. Three-dimensional gene map
  • 6. Resistance
  • 7. Theory
  • 8. Next-generation search engine
SLIDE 3

Shannon’s Information

Shannon, 1948: Given a distribution p = (p1, p2, · · · , pn), the Shannon information is

H(p) = − ∑_{i=1}^{n} pi · log2 pi. (1)

Here pi is the probability that item i is chosen, and − log2 pi is the “self-information" of item i.

  • Shannon’s information measures the uncertainty of a probability distribution.
  • This metric and the associated notion of noise form the foundation of information theory and of information-theoretic studies across the sciences.
  • Shannon’s metric provides the foundation for the current generation of information technology.
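Equation (1) is easy to compute directly. A minimal Python sketch (the function name is ours):

```python
import math

def shannon_entropy(p):
    """H(p) = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin is maximally uncertain among two-outcome distributions:
print(shannon_entropy([0.5, 0.5]))   # 1.0 bit
# A biased coin carries less self-information on average:
print(shannon_entropy([0.9, 0.1]))   # ≈ 0.469 bits
```
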

SLIDE 4

Shannon’s Question, 1953

Shannon, 1953:

  • 1. Is there a structural theory of information that supports communication network analysis?
  • 2. What is the optimal communication network?

Shannon noticed that his theory fails to support communication networks. The reason is as follows. Given a communication network G,

  • 1. (De-structuring) Let p be a distribution computed from G, such as the degree distribution, the distance distribution, and so on. This discards the interesting properties of G.
  • 2. Define H(p) to be the information of G.

The number H(p) tells us nothing about the interactions and communications occurring in G. The question is hence: how do we measure the information embedded in a physical system?

SLIDE 5

Physical systems

Given a physical system G, the information embedded in G should determine and decode the essential structure of G. For example, the essential structures of a car and a boat should be different, and each should be determined by the information embedded in the respective object.

Question: What is the essential structure of a physical system?

SLIDE 6

Evolving Network

Given a network G that evolved in nature by two mechanisms:

  • 1. The rules, regulations and laws governing the objects
  • 2. Perturbations by noises and random variations

In this case, the information embedded in G should determine and decode the structure of G that is formed by the rules, regulations and laws, excluding the noises and random variations that occurred in G.
SLIDE 7

Noisy Data

Given structured noisy data G, the information embedded in G should determine and decode the structure T of G that excludes the noises occurring in G. Questions:

  • 1. What is the principle for data mining?
  • 2. What is the principle for establishing the theory of structures and algorithms of big data?

SLIDE 8

Dynamical Complexity of a Network

Given a network G, the dynamical complexity of G should be the measure of complexity of the interactions, operations and communications occurring in G. This is different from the static complexity such as the number of nodes, the number of edges etc. What is the measure of dynamical complexity of a network?

SLIDE 9

Natural Structure and Natural Rank

In Nature and Society, individuals form natural structures and follow some natural ranking. This is different from the current-generation search engine based on PageRank.

  • What is the natural rank?
  • What is the next generation search engine?
SLIDE 10

Analysis of Biological Structures

  • The coding of genetic information by DNA
  • The folded structure of proteins.
  • Explaining and predicting the structures
SLIDE 11

Principles of big data

  • What is the principle of structuring of unstructured data?
  • What is the principle for extracting the order from structured noisy data?

SLIDE 12

Structural Information Is the Key

The solution to all the challenges mentioned above depends on a well-defined measure of structural information.
SLIDE 13

Brooks Comments, 2003

  • We have no theory that gives us a metric for the information embedded in structures.
  • The missing metric is the most fundamental gap in information science and computer science. Brooks listed the quantification of structural information as the first of the three great challenges for half-century-old computer science.

SLIDE 14

Hartmanis Comments

In 2008, Juris Hartmanis commented that Shannon’s definition fails to analyse structures, and suggested to me the question of giving a new definition of information.

SLIDE 15

Rashevsky, 1955

Given a connected graph G of n vertices, for every vertex i, let ni be the number of vertices in the orbit of vertex i (under automorphisms of G). Suppose that there are k orbits with numbers of vertices n1, n2, · · · , nk. Then let p = (n1/n, n2/n, · · · , nk/n). Define the entropy of G to be the Shannon entropy of p.

SLIDE 16

Local Entropy Measures

Raychaudhury et al., 1984: Given a connected graph G, for vertices i, j, let d(i, j) be the distance between vertex i and vertex j. Define the entropy of G to be the Shannon entropy of the distribution of the distances {d(i, j)}.

SLIDE 17

Gibbs Entropy

This measures the number of bits needed to determine a graph generated from some model.

SLIDE 18

Shannon’s Entropy for Graph Models

It measures the number of bits needed to describe the graph that is generated from a model.

SLIDE 19

Von Neumann Entropy

This is defined by the spectrum of the Laplacian of the graph, that is, by the distribution of the eigenvalues of the Laplacian. It is claimed to measure the complexity of quantum systems.

SLIDE 20

Hierarchical Thesis

  • The natural structure of a physical system is a hierarchical structure
  • The natural structure of a network evolving in Nature and Society is a hierarchical structure
  • The true structure of structured noisy data is a hierarchical structure

SLIDE 21

Decoding the Truth

SLIDE 22

Decoding ECC

Figure: Decoding error correcting code.

SLIDE 23

One-Dimensional Structural Information

Definition

(One-dimensional structural information) Given a connected graph G = (V, E) with n nodes and m edges, for each node i ∈ {1, 2, · · · , n}, let di be the degree of i in G, and let pi = di/2m. We define the one-dimensional structural information, or positioning entropy, of G by using the entropy function H as follows:

H1(G) = H(p) = H(d1/2m, . . . , dn/2m) = − ∑_{i=1}^{n} (di/2m) · log2 (di/2m). (2)
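To make Equation (2) concrete, here is a minimal Python sketch (names ours) computing the positioning entropy of an unweighted graph given as an edge list:

```python
import math

def h1(edges):
    """One-dimensional structural information (positioning entropy):
    the Shannon entropy of the degree distribution p_i = d_i / 2m."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    return -sum((d / two_m) * math.log2(d / two_m) for d in deg.values())

# Triangle: every node has degree 2, so every p_i = 1/3 and H1 = log2(3).
print(h1([(0, 1), (1, 2), (2, 0)]))  # ≈ 1.585
```
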
SLIDE 24

Intuition of H1(G)

  • H1(G) is the Shannon information for graphs.
  • It is the number of bits required to determine the code of the node reached by a random walk in G.

SLIDE 25

Static vs Dynamic

  • One-dimensional structural information fails to distinguish static from dynamic.

This is the fundamental weakness of both the one-dimensional structural information and the Shannon information metric.

SLIDE 26

Structural Information by Partition

Definition

(Structural information of networks by a partition) Given a graph G = (V, E), suppose that P = {X1, X2, · · · , XL} is a partition of V. We define the structural information of G by P as follows:

HP(G) := ∑_{j=1}^{L} (Vj/2m) · H(d_1^(j)/Vj, . . . , d_{nj}^(j)/Vj) − ∑_{j=1}^{L} (gj/2m) · log2 (Vj/2m)
       = − ∑_{j=1}^{L} (Vj/2m) ∑_{i=1}^{nj} (d_i^(j)/Vj) · log2 (d_i^(j)/Vj) − ∑_{j=1}^{L} (gj/2m) · log2 (Vj/2m), (3)

where L is the number of modules in P, nj is the number of nodes in Xj, d_i^(j) is the degree of the i-th node of Xj, Vj is the volume of Xj (the sum of the degrees of the nodes in Xj), and gj is the number of edges with exactly one endpoint in Xj.
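Equation (3) can be evaluated directly from an edge list. A minimal Python sketch for unweighted graphs (names ours):

```python
import math

def hp(edges, partition):
    """Structural information H_P(G) of an unweighted graph under a
    partition P, per Equation (3): intra-module positioning entropy plus
    the module-code cost weighted by the boundary edges g_j."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    total = 0.0
    for module in partition:
        vol = sum(deg[u] for u in module)                       # V_j
        cut = sum(1 for u, v in edges
                  if (u in module) != (v in module))            # g_j
        total -= sum((deg[u] / two_m) * math.log2(deg[u] / vol)
                     for u in module)
        total -= (cut / two_m) * math.log2(vol / two_m)
    return total

# Two triangles joined by a bridge: grouping by triangle costs fewer
# bits than treating every node separately.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(hp(edges, [{0, 1, 2}, {3, 4, 5}]))
```

Note that the all-singletons partition and the one-module partition both reduce to the positioning entropy H1(G), while a good community partition gives a strictly smaller value.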

SLIDE 27

Understanding HP(G)

  • 1. Vj/2m: the probability that a random walk in G arrives at Xj.
  • 2. − ∑_{i=1}^{nj} (d_i^(j)/Vj) · log2 (d_i^(j)/Vj): the positioning information within Xj.
  • 3. gj/2m: the probability that a random walk goes into Xj from nodes outside Xj.
  • 4. − log2 (Vj/2m): the self-information of Xj.
  • 5. HP(G): the number of bits required to determine the two-dimensional code of the node v reached by a random walk.

SLIDE 28

Examples

  • Local number
  • Area codes
SLIDE 29

Two-dimensional Structural Information

Definition

(Two-dimensional structural information of networks) Let G be a connected graph.
(1) Define the two-dimensional structural information of G as follows:

H2(G) = min_P {HP(G)}, (4)

where P runs over all the partitions of G.
(2) We say that a partition P of the vertices of G is a natural structure of G if:

HP(G) = H2(G). (5)
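For toy graphs, the minimisation in Equation (4) can be carried out by brute force over all set partitions. A minimal Python sketch (names ours; the enumeration is Bell-number sized, so this is only for illustration):

```python
import math

def hp(edges, partition):
    """H_P(G) per Equation (3), for an unweighted graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    total = 0.0
    for module in partition:
        vol = sum(deg[u] for u in module)
        cut = sum(1 for u, v in edges if (u in module) != (v in module))
        total -= sum((deg[u] / two_m) * math.log2(deg[u] / vol) for u in module)
        total -= (cut / two_m) * math.log2(vol / two_m)
    return total

def set_partitions(items):
    """Yield all set partitions of a list (exponential; toy sizes only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i, block in enumerate(smaller):
            yield smaller[:i] + [block | {first}] + smaller[i + 1:]
        yield [{first}] + smaller

def h2(edges):
    """Two-dimensional structural information: brute-force Equation (4)."""
    nodes = sorted({u for e in edges for u in e})
    return min(hp(edges, p) for p in set_partitions(nodes))

# Two triangles joined by a bridge: the minimiser is the pair of triangles,
# i.e. the natural structure of the graph.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(h2(edges))
```
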

SLIDE 30

Partitioning Tree

Definition

(Partitioning tree of graphs) Let G = (V, E) be an undirected and connected network. We define a partitioning tree T of G as a tree with the following properties:
(1) For the root node, denoted λ, we define the set Tλ = V.
(2) For every node α ∈ T , the immediate successors of α are α⌢j for j from 1 to a natural number N, ordered from left to right as j increases. Therefore α⌢i is to the left of α⌢j, written α⌢i <L α⌢j, if and only if i < j.
(3) For every α ∈ T , there is a subset Tα ⊆ V associated with α. For nodes α and β, we use β ⊂ α to denote that β is an initial segment of α. For every node α ≠ λ, we use α− to denote the longest initial segment of α, that is, the longest β such that β ⊂ α.

SLIDE 31

Partitioning Tree - II

(4) For every i, {Tα | h(α) = i} is a partition of V, where h(α) is the height of α (the height of the root node λ is 0, and for every node α ≠ λ, h(α) = h(α−) + 1).
(5) For every α, Tα is the union of the Tβ over all β with β− = α; thus Tα = ∪_{β− = α} Tβ.
(6) For every leaf node α of T , Tα is a singleton; thus Tα contains a single node of V.

SLIDE 32

Structural Information by Partitioning Tree

Definition

(Structural information of a graph by a partitioning tree) For an undirected and connected network G = (V, E), suppose that T is a partitioning tree of G. We define the structural information of G by T as follows:

(1) For every α ∈ T with α ≠ λ, define

HT (G; α) = − (gα/2m) · log2 (Vα/Vα−), (6)

where gα is the number of edges from nodes in Tα to nodes outside Tα, and Vβ is the volume of the set Tβ, namely the sum of the degrees of all the nodes in Tβ.

SLIDE 33

Definition

(2) We define the structural information of G by the partitioning tree T as follows:

HT (G) = ∑_{α∈T , α≠λ} HT (G; α). (7)
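Equations (6)-(7) can be evaluated mechanically once a partitioning tree is fixed. A minimal Python sketch (representation and names ours: a tree is a nested list whose leaves are singleton vertex sets, with the root left implicit):

```python
import math

def ht(edges, tree):
    """H_T(G) per Equations (6)-(7): sum over tree nodes alpha != lambda
    of -(g_alpha / 2m) * log2(V_alpha / V_alpha-)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)

    def leaves(t):
        # The vertex set T_alpha associated with a tree node.
        return t if isinstance(t, set) else {x for c in t for x in leaves(c)}

    def walk(t, parent_vol):
        T = leaves(t)
        vol = sum(deg[u] for u in T)                         # V_alpha
        cut = sum(1 for u, v in edges if (u in T) != (v in T))  # g_alpha
        h = -(cut / two_m) * math.log2(vol / parent_vol)
        if not isinstance(t, set):
            h += sum(walk(c, vol) for c in t)
        return h

    return sum(walk(child, two_m) for child in tree)

# Height-2 tree over two bridged triangles; the value equals H_P(G) for
# the partition into the two triangles.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
tree = [[{0}, {1}, {2}], [{3}, {4}, {5}]]
print(ht(edges, tree))  # ≈ 1.70
```
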

SLIDE 34

K-dimensional Structural Information

Definition

(K-dimensional structural information) Let G = (V, E) be a connected network.
(1) We define the K-dimensional structural information of G as follows:

HK(G) = min_T {HT (G)}, (8)

where T ranges over all the partitioning trees of G of height K.
(2) Given a K-level partitioning tree T of G, we say that T is the K-dimensional knowledge tree of G if:

HT (G) = HK(G). (9)

SLIDE 35

Static vs Dynamic

For K > 1, the K-dimensional structural information is dynamic, and not equal to the static definition of Shannon’s information.

SLIDE 36

Cell Sample Network

Suppose that v1, v2, · · · , vn are n samples of cells and that g1, g2, · · · , gN are N genes. For every pair (i, j), let a(i, j) be the expression profile of gene gi in sample vj. Then, for every j from 1 to n, the vector (a(1, j), a(2, j), · · · , a(N, j)) represents the gene expression profiles of the sample vj, denoted Pj. For every pair (j, j′), let Wj,j′ be the Pearson correlation coefficient between Pj and Pj′, the gene expression profiles of samples vj and vj′, respectively. A cell sample network G = (V, E) is constructed on the basis of the gene expression profiles by the following algorithm, denoted G. Algorithm G works with a fixed natural number k, and proceeds as follows: (1) The vertices of G are the cell samples v1, v2, · · · , vn; that is, let V = {v1, v2, · · · , vn}; and

SLIDE 37

(2) For every j, suppose that u1, u2, · · · , uk are the cell samples such that W(vj, u1), W(vj, u2), · · · , W(vj, uk) are the highest k weights among the weights W(vj, u) for all of the samples u, where W(vj, u) is the Pearson correlation coefficient between the gene expression profiles of samples vj and u. For every i from 1 to k, create an edge (vj, ui) with weight W(vj, ui). This constructs the weighted graph G = (V, E).
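Steps (1)-(2) of Algorithm G can be sketched in a few lines of Python (names ours; assumes non-constant profiles so the Pearson coefficient is defined):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; assumes non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def knn_graph(profiles, k):
    """Algorithm G (sketch): connect each sample to the k samples with the
    highest Pearson correlation; edges keep the correlation as weight."""
    n = len(profiles)
    edges = {}
    for j in range(n):
        ranked = sorted(((pearson(profiles[j], profiles[u]), u)
                         for u in range(n) if u != j), reverse=True)
        for weight, u in ranked[:k]:
            edges[frozenset((j, u))] = weight  # undirected; duplicates collapse
    return edges

# Four toy samples over three genes; with k = 1 each sample links to the
# sample whose profile is a positive rescaling of its own.
profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1], [6, 4, 2]]
print(knn_graph(profiles, 1))
```
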

SLIDE 38

Structuring of Gene Expression Profiles

Algorithm C proceeds as follows: (1) (Noise amplifying) Fix a noise amplifier σ. Let W̄ be the average weight over all the pairs of cell samples. Let M = σ · W̄ be the modifier. Let H be the weighted graph of the cell samples such that for every pair (i, j) of cell samples, there is a weight W ′(i, j) = W(i, j) + M. This step amplifies the noise for all the weights. The role of this step is two-fold: if the weight W(i, j) between cell samples i and j is nontrivially high, then the modified weight W ′(i, j) = W(i, j) + M is approximately the original weight W(i, j), since the modifier M is small; and if the weight W(i, j) is trivial or noisy, then the modified weight W ′(i, j) = W(i, j) + M is significantly amplified, which allows our algorithm to better separate the noisy or trivial weights from the highly nontrivial weights.

slide-39
SLIDE 39

The Challenges Previous Measures Our Ideas Structural Information Gene Map Resistance Theory New Searching

(2) For every k, let Hk be the weighted graph obtained from H as follows: – The modifier M is kept for every edge. – For every cell sample i, keep the weighted edges of the top k weights, and delete all the other weights. (3) For each k, let H(k) be the one-dimensional structure entropy of the weighted graph Hk. We say that k is a stable point, if both H(k − 1) > H(k) and H(k + 1) > H(k) hold. (4) (Minimisation of non-determinism or uncertainty) Define k to be the k′ that achieves the least one-dimensional structure entropy among all the stable points. That is, k is a stable point, and H(k) is the least among the H(k′) for all the stable points k′. This step ensures that the chosen k generates a network structure with minimum uncertainty or non-determinism.
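Steps (3)-(4) reduce to picking the best local minimum of the entropy curve H(k). A minimal sketch (names and the sample curve are ours; assumes H is given for consecutive values of k):

```python
def stable_k(entropy):
    """Steps (3)-(4) of Algorithm C: among interior k where H(k) is a
    local minimum (H(k-1) > H(k) < H(k+1)), return the k with the least
    one-dimensional structure entropy.  `entropy` maps consecutive
    integers k to H(k)."""
    ks = sorted(entropy)
    stable = [k for k in ks[1:-1]
              if entropy[k - 1] > entropy[k] < entropy[k + 1]]
    return min(stable, key=lambda k: entropy[k])

# Hypothetical entropy curve with stable points at k = 3 and k = 6;
# k = 6 wins because H(6) = 3.5 is the smallest among stable points.
H = {1: 5.0, 2: 4.2, 3: 3.9, 4: 4.1, 5: 4.0, 6: 3.5, 7: 3.8}
print(stable_k(H))  # 6
```
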

slide-40
SLIDE 40

The Challenges Previous Measures Our Ideas Structural Information Gene Map Resistance Theory New Searching

One-dimensional Structural Information Minimisation Principle

  • One-dimensional structural entropy minimisation is the principle for structuring of unstructured data.

SLIDE 41

Lymphomas Real

Figure: Gene map of true types of the lymphomas (9 sample modules; genes indexed up to 4026; level scale 0-1).

SLIDE 42

Lymphomas: Two-dimensional Structural Information

Figure: Gene map of types of the lymphomas found by E2 (11 sample modules; genes indexed up to 4026; level scale 0-1).

SLIDE 43

Lymphomas: Three-dimensional Structural Information

Figure: Gene map of types of the lymphomas found by E3 (34 submodules, 1.1 through 13.3; genes indexed up to 4026; level scale 0-1).

SLIDE 44

Clinical Data Analysis

(1) The DLBCL samples in each of the submodules 2.2, 3.1, 3.3, 4.1, 4.3, 5.1, 6.1, 6.2 and 8.1 are similar to one another in survival times, survival indicators and IPI scores. (2) However, the DLBCL samples in submodules 3.2, 7.1, 7.2, 8.2 and 8.3 are divergent in survival times, survival indicators and IPI scores. (3) The overall survival times, survival ratios and IPI scores in most of the submodules are distinguishable. Therefore, many of the submodules of the DLBCL samples identified by E3 are interpretable by the similarity of survival times, survival indicators and IPI scores for the cell samples within the same submodule, and distinguishable by overall survival times, survival ratios and IPI scores for different submodules.

SLIDE 45

Resistance

Definition

Given a connected network G = (V, E), let P be a partition of G. We define the resistance of G given by P as follows:

RP(G) = − ∑_{j=1}^{L} ((Vj − gj)/2m) · log2 (Vj/2m), (10)

where Vj is the volume of the j-th module Xj of P, and gj is the number of edges from Xj to nodes outside Xj.

In Equation (10), consider the j-th term −((Vj − gj)/2m) · log2 (Vj/2m). The factor (Vj − gj)/2m = ((Vj − gj)/Vj) · (Vj/2m) is the probability that a random walk goes to the j-th module Xj and fails to escape from it, and − log2 (Vj/2m) is the number of bits needed to determine the code of the j-th module.
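Equation (10) is straightforward to evaluate. A minimal Python sketch for unweighted graphs (names ours):

```python
import math

def resistance(edges, partition):
    """R_P(G) = -sum_j ((V_j - g_j)/2m) * log2(V_j / 2m), Equation (10)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    r = 0.0
    for module in partition:
        vol = sum(deg[u] for u in module)                       # V_j
        cut = sum(1 for u, v in edges
                  if (u in module) != (v in module))            # g_j
        r -= ((vol - cut) / two_m) * math.log2(vol / two_m)
    return r

# Two triangles joined by a bridge, split into the two triangles:
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(resistance(edges, [{0, 1, 2}, {3, 4, 5}]))  # 6/7 ≈ 0.857
```

For the all-singletons partition the resistance vanishes, since each module is escaped with certainty (Vj = gj).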

SLIDE 46

Resistance Law

For the resistance of a graph G by P, we have the following resistance principle: Let G = (V, E) be a connected graph. Suppose that P is a partition of V, with the same notations as in the definitions of H1(G) and HP(G). Then the positioning entropy H1(G) and the structure entropy HP(G) of G by the given P satisfy the following properties:

(1) (Additivity of H1(G)) The positioning entropy of G satisfies:

H1(G) = − ∑_{j=1}^{L} (Vj/2m) ∑_{i=1}^{nj} (d_i^(j)/Vj) · log2 (d_i^(j)/Vj) − ∑_{j=1}^{L} (Vj/2m) · log2 (Vj/2m). (11)

SLIDE 47

Resistance law - II

(2) (Local resistance law of networks)

RP(G) = − ∑_{j=1}^{L} ((Vj − gj)/2m) · log2 (Vj/2m) = H1(G) − HP(G). (12)

(3) Assume that for each j, Vj ≤ m, where m = |E|. Then

RP(G) = − ∑_{j=1}^{L} (1 − Φ(Xj)) · (Vj/2m) · log2 (Vj/2m) = H1(G) − HP(G), (13)

where Φ(Xj) is the conductance of Xj in G.

SLIDE 48

Resistance law - III

Now we are ready to define the resistance of a graph G as follows:

R(G) = max_P {RP(G)}, (14)

where P runs over all the partitions of G. By the definition of the resistance of G, the local resistance law in (2) above, and the definition of the two-dimensional structure entropy, we have the following:

Global resistance law of networks: For a network G, we have

R(G) = H1(G) − H2(G). (15)

SLIDE 49

Security Index and Resistor Graphs

Given a graph G, we define the security index of G to be the normalised resistance of G, that is, θ(G) = 1 − H2(G)/H1(G). We say that G is an (n, d, ρ)-resistor graph if G has n vertices, least degree d, and θ(G) ≥ ρ.

Questions
  • To establish the theory of resistor graphs.
  • What is the theory of communication networks?
SLIDE 50

One-dimensional Structural Information - I

Theorem

(Lower bound of positioning entropy of simple graphs) Let G = (V, E) be an undirected, connected, and simple graph with m edges, i.e., |E| = m. Then:

H1(G) ≥ (1/2) · (log2 m − 1).

SLIDE 51

One-dimensional Structural Information - II

Theorem

(Lower bound of positioning entropy of graphs with balanced weights) Let G = (V, E) be a connected graph with weight function w. Let m = |E| be the number of edges. If the ratio of the maximum weight to the minimum weight is at most m^ε, that is, max_{e∈G}{w(e)} / min_{e∈G}{w(e)} ≤ m^ε for some constant ε < 1, then:

H1(G) ≥ (1/2) · [(1 − ε) log2 m − 1].

SLIDE 52

Locality

Theorem

(Locality Theorem) Given a connected graph G, let P be the partition of the nodes of G in which each module contains a single node of V, and let Q be the partition of G containing only one module, the whole set V. Then we have

HP(G) = HQ(G). (16)

SLIDE 53

Separation

Theorem

(Separation theorem) Let G = (V, E) be a connected graph. Suppose that P is a partition of V, and X and Y are two modules of P. Let Z = X ∪ Y, and let Q be the partition consisting of Z and all the modules of P other than X and Y. If there is no edge between the nodes in X and the nodes in Y, then we have:

HP(G) ≤ HQ(G). (17)

SLIDE 54

Basic Principle

Theorem

(Structural information principle) For any graph G, the structural information of G satisfies:

H2(G) ≥ Φ(G) · H1(G), (18)

where Φ(G) is the conductance of G and H1(G) is the positioning entropy of G.

SLIDE 55

Lower Bounds - I

For simple graphs, we have

Theorem

(Lower bounds of two-dimensional structural information of simple graphs) Let G = (V, E) be an undirected, connected and simple graph with number of edges |E| = m. Then the two-dimensional structural information of G satisfies H2(G) = Ω(log2 log2 m). (19)

SLIDE 56

Lower Bounds - II

For the graphs with balanced weights, we have

Theorem

(Lower bound of two-dimensional structural information of graphs with balanced weights) Let G = (V, E) be a connected graph with weight function w. Let m = |E| be the number of edges. If the ratio of the maximum weight to the minimum weight is at most (log2 m)^ε, that is, max_{e∈G}{w(e)} / min_{e∈G}{w(e)} ≤ (log2 m)^ε for some constant ε < 1, then the structural information of G satisfies H2(G) = Ω(log2 log2 m). (20)

SLIDE 57

Trees

Theorem

(Upper bound of structural information of trees) Let T be a complete binary tree of depth H, and thus of size n = 2^H − 1. Then the structural information of T satisfies H2(T) ≤ log2 log2 n + 4 + o(1). (21)

SLIDE 58

Grids

Theorem

(Upper bound of two-dimensional structural information of grid graphs) Let G = (V, E) be an n × n grid graph. Then the two-dimensional structural information of G satisfies H2(G) ≤ 2 log2 log2 n + O(1). (22)

SLIDE 59

Information Theoretical Characterisation of Expanders

For expander graphs, we have

Theorem

(Expanders) Let {Gn} be a family of expanders, each of which is either a simple graph or a graph with balanced weights on the edges. Then for each G = Gn, we have

H2(G) = Ω(log n). (23)

New direction: We could define expanders by H2(G) = Ω(log2 n), giving a new class and an information-theoretical characterisation of expanders.

SLIDE 60

Phase Transition in a Small World

Theorem

(Phase transition theorem of two-dimensional structural information of networks of the small-world model) Let G be a network generated from the small-world model with parameter r ≥ 0. Then the two-dimensional structural information has a sharp phase transition at the point r = 2. That is, (1) if r ≥ 2, then with probability 1 − o(1), H2(G) = O(log log n); (2) if r < 2, then with probability 1 − o(1), H2(G) = Ω(log n).

New directions: More phase transition results are possible.

SLIDE 61

Black Hole Principle

Definition

Given a weighted graph G, we say that G consists of black holes if G consists of a number of highly dense modules, each of which contains only a few vertices.

Theorem

Given a weighted graph G, G consists of black holes if and only if H2(G) = o(log log n). (24)

SLIDE 62

Black Hole - I

Theorem

(Black hole theorem - necessity) Let G = (V, E) be a connected weighted graph of size n = |V| and weight function w : E → R+.
(1) If there is a subset S ⊆ V of size s and volume vol(S) = ρ · vol(G) for some 0 < ρ ≤ 1, then both the positioning entropy H1(G) and the structural information H2(G) of G are at most H(1 − ρ, ρ) + (1 − ρ) log2(n − s) + ρ log2 s.
(2) If s = log^{o(1)} n and ρ ≥ 1 − 1/log n, then H2(G) ≤ H1(G) = o(log log n).

SLIDE 63

Black Hole - II

Theorem

(Black hole theorem - sufficiency) Let G = (V, E) be a connected graph of size n = |V| and volume vol(G). If H2(G) = o(log log n), then we have the following conclusions. (1) If H1(G) = o(log n), then there is a subset S ⊆ V in G whose size is n^{o(1)} and whose volume is (1 − o(1)) · vol(G). (2) Otherwise, there is a subset S ⊆ V in G whose volume is vol(S) ≥ ρ · vol(G) for some constant 0 < ρ < 1, and each node in S belongs to a subset of size log^{o(1)} n and conductance O(1/log^{1−o(1)} n) (understood as a black hole; that is, S is composed of black holes). For the complement S̄ of S, either its volume is o(vol(G)), in which case the complement of S consists only of “tiny dusts" and is trivial, or there is a subset U ⊆ S̄ with size |U| = n^{o(1)}, volume vol(U) = (1 − o(1)) · vol(S̄) and conductance Φ(U) = o(1), in which case U corresponds to a black hole.

SLIDE 64

Small Community Phenomenon - I

Theorem

(Small community phenomenon – necessity) Let G = (V, E) be a connected and balanced graph of size n = |V|. Then both (1) and (2) below hold:
(1) If there is a set of modules A satisfying
  (i) vol(A) = (1 − o(1)) · vol(G), where vol(A) is the sum of the weighted degrees of all the nodes in the modules in A;
  (ii) for each module X ∈ A, its size |X| = n^{o(1)};
  (iii) for each module X ∈ A, its conductance Φ(X) = o(1),
then the two-dimensional structural information of G is H2(G) = o(log n).
(2) If there is a set of modules A satisfying
  (i) vol(A) = (1 − O(log log n / log n)) · vol(G);
  (ii) for each module X ∈ A, |X| = log^{O(1)} n;
  (iii) for each module X ∈ A, Φ(X) = O(log log n / log n),
then H2(G) = O(log log n).

SLIDE 65

Small Community Phenomenon - II

Theorem

(Small community phenomenon – sufficiency) Let G = (V, E) be a graph with m = |E| edges and volume vol(G), without isolated nodes. Let w : E → R+ be a weight function satisfying max_{e∈G}{w(e)} / min_{e∈G}{w(e)} ≤ W for some constant W ≥ 1. If H2(G) ≤ c log2 log2 m for some constant 0 < c ≤ 1 and sufficiently large m, then for any ε > 0 and sufficiently large m, there is a set of modules of nodes, denoted A, satisfying (1) vol(A) ≥ (1 − 2ε) · vol(G); (2) for each module X ∈ A, |X| ≤ log^{3c/ε} m; (3) for each module X ∈ A, Φ(X) ≤ 2ε/(1 − ε).

SLIDE 66

Small Community Phenomenon

Theorem

(Small community phenomenon) A graph G has the small community phenomenon if and only if H2(G) = O(log log n). (25)

SLIDE 67

Locally Listing Rank - Natural Rank (NR)

Given a network G, the locally listing rank finds a short ordered list of vertices from any personalised input query vertex v. The algorithm proceeds as follows:

  • 1. Given input query vertex v.
  • 2. Let X = {v}, and let P be the partition consisting of X and the singletons {y} for all y ∉ X.
  • 3. Let z be the y ∉ X such that ∆ = HP(G) − HQ(G) is maximised, where Q is the partition obtained from P by merging y into X. Add z to X.

The algorithm outputs a short ordering for any personalised query input. There are a number of variations of the algorithm, each of which has remarkably better performance than the existing algorithms based on PageRank.
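The greedy step can be sketched directly from the definition of HP(G). A toy Python sketch (names ours; it recomputes HP from scratch at every step, so it is only meant for small graphs):

```python
import math

def hp(edges, partition):
    """H_P(G) per Equation (3), for an unweighted graph."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    two_m = 2 * len(edges)
    total = 0.0
    for module in partition:
        vol = sum(deg[u] for u in module)
        cut = sum(1 for u, v in edges if (u in module) != (v in module))
        total -= sum((deg[u] / two_m) * math.log2(deg[u] / vol) for u in module)
        total -= (cut / two_m) * math.log2(vol / two_m)
    return total

def natural_rank(edges, v, length):
    """Greedy locally-listing-rank sketch: grow X from the query vertex v,
    at each step absorbing the outside vertex z that maximises the drop
    Delta = H_P(G) - H_Q(G) when z is merged into X."""
    nodes = {u for e in edges for u in e}
    X, order = {v}, [v]
    while len(order) < length and X != nodes:
        def delta(y):
            P = [set(X)] + [{u} for u in nodes - X]
            Q = [X | {y}] + [{u} for u in nodes - X - {y}]
            return hp(edges, P) - hp(edges, Q)
        z = max(nodes - X, key=delta)
        X.add(z)
        order.append(z)
    return order

# Two triangles joined by a bridge: querying vertex 0 first lists the
# vertices of its own triangle.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(natural_rank(edges, 0, 3))  # [0, 1, 2]
```
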

SLIDE 68

New Directions

  • 1. Algorithmic theory of structural information
  • 2. What are the optimal communication networks?
  • 3. What are the principles for knowledge discovery from noisy data, and for structuring of unstructured data?
  • 4. To establish the structural information theory
  • 5. To establish the information-theoretical theory of graphs
  • 6. To establish the next-generation search engine
SLIDE 69

References

  • 1. A. Li, Y. Pan, Structural Information and Dynamical Complexity of Networks, IEEE Transactions on Information Theory, Vol. 62, No. 6, pp. 3290-3339, 2016.
  • 2. A. Li, X. Yin and Y. Pan, Three-dimensional gene map of cancer cell types: Structural entropy minimisation principle for defining tumour subtypes, Scientific Reports, 6: 20412, 2016.
  • 3. F. P. Brooks, Jr., Three great challenges for half-century-old computer science, Journal of the ACM, 50(1), pp. 25-26, 2003.
  • 4. C. Shannon, The lattice of information, IEEE Transactions on Information Theory, Vol. 1, No. 1, pp. 105-107, 1953.