SLIDE 1

HashGraph: Semantic Hashing using external knowledge base.

  • C. Gravier¹, J. Subercaze¹

¹ Satin team, LT2C laboratory

Université Jean Monnet, Télécom Saint-Étienne, France

  • C. Gravier, J. Subercaze (Universities of)

HashGraph: Semantic Hashing using external knowledge base. 1 / 43

SLIDE 2

Preamble

Outline

  • 1. Preamble
  • 2. Semantic Hashing: Introduction, Existing solutions
  • 3. HashGraph: User profile as graph of terms, Graph to binary footprint, Evaluation
  • 4. HashGraph and HashWordnet: On hashing node values, Using an external is-a taxonomy
  • 5. Demos

SLIDE 3

Preamble

References

This presentation is based on:

◮ [BambaCIKM12]: Bamba P., Subercaze J., Gravier C., Benmira N., Fontaine J., The Twitaholic Next Door, Proc. of the 21st ACM International Conference on Information and Knowledge Management (CIKM’12), pp. 2275–2278, Maui, Hawai’i, USA, October 30th, 2012

◮ [SubercazeWI13]: Subercaze J., Gravier C., HashGraph: an expressive and scalable Twitter users profile for recommendation, 2013 IEEE/WIC/ACM International Conference on Web Intelligence (WI’13), Atlanta, USA, November 17th–20th, 2013

... with a different agenda, additional information, and further thoughts.

SLIDE 4

Preamble

Who are we?

◮ Christophe Gravier

    ◮ Associate Professor in Computer Science
    ◮ Working at Télécom Saint-Étienne (Université Jean Monnet)

◮ Julien Subercaze

    ◮ Researcher in Computer Science
    ◮ Working at Télécom Saint-Étienne (Université Jean Monnet)

◮ Contacts:

    ◮ mail: {julien.subercaze,christophe.gravier}@univ-st-etienne.fr
    ◮ homepage: http://satin-ppl.telecom-st-etienne.fr/cgravier/ and http://satin-ppl.telecom-st-etienne.fr/jsubercaze/
    ◮ twitter: @chgravier and @JulienSubercaze

SLIDE 7

Semantic Hashing Introduction

Hashing techniques for Information Retrieval

◮ Methods for embedding high-dimensional data into a similarity-preserving low-dimensional Hamming space [Kim and Choi, 2011].

◮ Usually the hash space is an “absolute partitioning of the space of document representation” [Stein and Potthast, 2007].

Figure: Hashing for information retrieval (from [Stein and Potthast, 2007])

Historically, the goal is to learn a function h_φ that partitions the Hamming space so that two documents whose similarity in the original space is at least a threshold θ are assigned to the same bucket in the Hamming space.

SLIDE 9

Semantic Hashing Introduction

Semantic hashing

Similarity Search

◮ In similarity search, a document is used as the query.
◮ This is fundamentally different from the standard keyword search paradigm, e.g., in TREC [Zhang et al., 2010].

Semantic Hashing

Semantic hashing is about providing the function(s) h_φ that build an index in the Hamming space for fast similarity search.

SLIDE 13

Semantic Hashing Introduction

kNN and ε-kNN problems

◮ We use a document q as a query: hash it to identify its bucket, then use the bucket value to address the two problems below¹:

  • 1. kNN search: find the k nearest documents to hash(q) in the Hamming space (aka top-K search).

  • 2. ε-kNN search: find all documents p such that d(q,p) ≤ (1+ε) × d(q,P), where d(q,P) is the distance from q to its closest point in P (Hamming ball of size ε).

Remark on Perfect Semantic Hashing

It is possible to provide a perfect hashing scheme [Linial et al., 1995], but at a prohibitive code length cost. All semantic hashing schemes try to provide either an approximation (that is, hashing with semantic-relatedness preservation guarantees) or a heuristic.

¹ as coined by the founding paper on Semantic Hashing [Gionis et al., 1999]
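
The two query types can be sketched on a toy index as follows. This is a minimal illustration, not the authors' implementation: the 8-bit codes, the document ids and the linear scan are assumptions made for readability, and the second function uses the bound d(q,p) ≤ (1+ε) × d(q,P).

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def knn(query: int, codes: dict, k: int) -> list:
    """Return the k document ids whose codes are closest to the query code."""
    return sorted(codes, key=lambda doc: (hamming(query, codes[doc]), doc))[:k]


def eps_nn(query: int, codes: dict, eps: float) -> list:
    """Return every document p with d(q, p) <= (1 + eps) * d(q, P)."""
    nearest = min(hamming(query, c) for c in codes.values())
    radius = (1 + eps) * nearest
    return sorted(doc for doc, c in codes.items() if hamming(query, c) <= radius)


# Hypothetical 8-bit codes for four documents.
codes = {"d1": 0b10110100, "d2": 0b10110111, "d3": 0b01001011, "d4": 0b10100100}
q = 0b10110101
top2 = knn(q, codes, k=2)          # the two closest documents
near = eps_nn(q, codes, eps=1.0)   # everything within twice the best distance
```

A real index would look up the query bucket and its neighbouring buckets rather than scan every code.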

SLIDE 19

Semantic Hashing Introduction

A ”good” Semantic Hashing function ?

  • 1. Entropy maximizing [Baluja and Covell, 2008]. Large coverage of the set of 2^l binary strings of length l.

  • 2. Complexity. Obviously, a “good” semantic hashing scheme should exhibit a computational complexity as low as possible.

  • 3. Monotonicity. The quality of the embedding should improve as more bits are dedicated to the binary code.

  • 4. Independence from dimensions [Stein and Potthast, 2007]. As most approaches rely on embedding a high-dimensional space of dimension d into a Hamming space of dimension d′, the semantic hashing strategy should scale well w.r.t. the increase of d.

  • 5. Semantic preserving [Zhang et al., 2010]. Minimize the difference between the semantic similarity of documents in the original space and the Hamming distance of their binary strings.
SLIDE 20

Semantic Hashing Introduction

Applications of Semantic Hashing

◮ It’s a trade-off: you gain a massive speedup at the cost of some precision/recall.

◮ It is therefore interesting in applications with large datasets (textual, images, ...).

◮ The main application so far is near-duplicate detection. Examples include:

    ◮ Webpage crawls [Manku et al., 2007],
    ◮ Social networks (microposts),
    ◮ Book collections,
    ◮ ...

SLIDE 21

Semantic Hashing Existing solutions

Main existing solutions

◮ Locality-sensitive hash function [Gionis et al., 1999]: an LSH function is a combination of k functions h ∈ H, each of which uses a random vector obtained by an independent and identically distributed random choice.

◮ Fuzzy fingerprint [Stein, 2005]: after learning the a priori probabilities of term prefixes in the corpus, the noticeable deviations from them for the item to be hashed are used in the hash function.

◮ Self-Taught Hashing [Zhang et al., 2010]: for l-bit binary codes, train l classifiers to predict each of the l bits for any query document.

◮ Spec hashing [Lin et al., 2010]: using a sparse affinity matrix of items in the original space, train l classifiers that minimize the Kullback-Leibler divergence.
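
To make the first bullet concrete, here is a sketch of one common LSH family, the sign of random projections. It is an assumption chosen for illustration and may differ from the family used in the cited paper; `make_lsh`, the Gaussian draws and the toy vectors are all hypothetical.

```python
import random


def make_lsh(dim: int, k: int, seed: int = 0):
    """Draw k i.i.d. Gaussian vectors and return a function mapping a vector to k bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

    def h(vec):
        # Each of the k elementary functions outputs the sign of one projection.
        return tuple(
            1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return h


h = make_lsh(dim=4, k=16)
doc = [0.9, 0.1, 0.0, 0.3]
scaled = [2 * x for x in doc]   # same direction, so the same signature
opposite = [-x for x in doc]    # opposite direction flips every bit
```

Vectors pointing in similar directions agree on most bits, which is what makes the concatenated bits usable as a bucket key.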

SLIDE 22

Semantic Hashing Existing solutions

Remarks on existing solutions

◮ Learning-based approaches and/or near-duplicate oriented.

◮ Pros:

    ◮ Near-duplicate detection (this is the learning objective),
    ◮ Entropy maximizing for some of them (mainly [Lin et al., 2010, Zhang et al., 2010]),
    ◮ Demonstrate the usefulness of this research for practical problems.

◮ Cons:

    ◮ Data dependency (cold start, generalization, ...),
    ◮ Online learning is difficult for many of these solutions,
    ◮ Term and term-frequency matching is concept-oblivious,
    ◮ Few proposals ([Zhang et al., 2010, Weiss et al., 2008, Lin et al., 2010]) consider providing a similarity-distance-preserving scheme (in addition to increasing the collision chances for similar items).

SLIDE 24

HashGraph

Our contribution is HashGraph

◮ We present HashGraph, a Semantic Hashing scheme that addresses some of the previous limitations.

◮ It will be illustrated on a use case where the documents are microposts, but remember that these can be any documents (as we will showcase in the demos).

◮ We chose microposts as the first application of HashGraph because we were inspired by [Wilson et al., 2009], who showed that content and user interactions prevail over the social graph.

◮ However, the computation time of content-based and semantic-based models prevents scalability, therefore calling for Semantic Hashing!

SLIDE 25

HashGraph

Two-step approach

HashGraph for microposts is a two-step approach:

  • 1. Create a user model using the textual content of his/her microposts.

    ◮ Each user document is the corpus of his/her microposts.

  • 2. Hash the user model into a binary footprint for fast kNN and ε-kNN queries.

SLIDE 27

HashGraph User profile : graph of terms

Computing user model

User profile as graph of terms

Inspired by [2], which is very successful in keyphrase extraction, we adapted the approach to generate a user profile as a graph of terms based on the user-generated content.

Process

We consider a tweet as the unit in terms of performative speech act.

  • 1. Porter stemming, n-grams, noun and adjective filtering
  • 2. Compute term co-occurrences
  • 3. Build the graph from the Jensen-Shannon (JS) divergence
SLIDE 29

HashGraph User profile : graph of terms

Example I

Let’s consider the following co-occurrence table

        a    b    c    d    e    Total
  a     -    3    4    2    1    10
  b     3    -    -    -    2    5
  c     4    -    -    4    -    8
  d     2    -    4    -    6    12
  e     1    2    -    6    -    9

Co-Occurrence Frequency (row a)

                a    b    c    d    e    Total
  frequency     -    3    4    2    1    10
  probability   -    0.3  0.4  0.2  0.1  1

SLIDE 31

HashGraph User profile : graph of terms

Example II

Term distance

To determine whether two terms are close, we compute the Jensen-Shannon divergence (similarity measure) of their co-occurrence distributions.

Resulting graph

Figure: complete graph over the terms a, b, c, d, e, with edge weights 0.86, 0.26, 0.72, 0.44, 0.77, 0.21, 0.40, 0.45, 0.81, 0.69.
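
The computation behind this step can be sketched from the co-occurrence table of Example I. The sketch makes assumptions: empty cells are treated as 0, the natural logarithm is used, and the slides do not detail how the edge weights of the figure are derived from the divergence, so the values produced here need not match it.

```python
import math

# Co-occurrence counts from Example I (empty cells taken as 0).
terms = ["a", "b", "c", "d", "e"]
counts = {
    "a": [0, 3, 4, 2, 1],
    "b": [3, 0, 0, 0, 2],
    "c": [4, 0, 0, 4, 0],
    "d": [2, 0, 4, 0, 6],
    "e": [1, 2, 0, 6, 0],
}


def distribution(row):
    """Normalise a row of counts into a probability distribution."""
    total = sum(row)
    return [c / total for c in row]


def kl(p, q):
    """Kullback-Leibler divergence, skipping the zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with natural logarithm, between 0 and log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


dists = {t: distribution(row) for t, row in counts.items()}
js_ab = js_divergence(dists["a"], dists["b"])
```

Identical distributions give 0 and distributions with disjoint support give log(2), which is why a threshold can be expressed as a fraction of log(2).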

SLIDE 33

HashGraph User profile : graph of terms

Example III

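Unrelated terms

[2] showed that 0.95 × log(2) is a good threshold for separating related from unrelated terms.

Resulting graph

Figure: graph over the terms a, b, c, d, e after thresholding, with remaining edge weights 0.86, 0.72, 0.77, 0.81, 0.69.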
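
A small sketch of the pruning step, under two assumptions inferred from the example rather than stated on the slide: log is the natural logarithm (so the threshold is about 0.658), and edges whose weight reaches the threshold are the ones kept, since the five surviving weights are exactly those above 0.658. The assignment of weights to term pairs is hypothetical, because the extracted text no longer says which edge carries which weight.

```python
import math

THRESHOLD = 0.95 * math.log(2)  # about 0.658 with the natural logarithm


def prune(edges, threshold=THRESHOLD):
    """Keep only the edges whose weight reaches the threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}


# Hypothetical edge assignment for the ten weights of the unpruned graph.
edges = {
    ("a", "b"): 0.86, ("a", "c"): 0.26, ("a", "d"): 0.72, ("a", "e"): 0.44,
    ("b", "c"): 0.77, ("b", "d"): 0.21, ("b", "e"): 0.40, ("c", "d"): 0.45,
    ("c", "e"): 0.81, ("d", "e"): 0.69,
}
kept = prune(edges)
```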

SLIDE 35

HashGraph User profile : graph of terms

Summary of first part : User profile as graph of terms

Figure: Step 1 pipeline. From a Twitter username (@user) and a number of friends k, the tweet querier fetches the user's tweets through the Twitter API. After text preprocessing (stopwords, POS), the co-occurencer computes the co-occurrence matrix of the terms a, b, c, d found in the cleaned tweets, and the graph builder (Jensen-Shannon divergence threshold s = 0.3) produces the in-memory graph model.

We finished the first step, which was to create a user profile (a weighted graph of terms in our case). The next step is about hashing this profile while preserving “user-relatedness” based on the textual content of their microposts.

SLIDE 37

HashGraph Graph to binary footprint

Hashing a graph

Existing method to hash a graph

SimHash [3] is a method for hashing a document split into weighted tokens.

Example

  Token   Weight   Hash
  a       3        1 0 1 1 0 1
  b       2        0 1 1 0 0 1
  c       1        1 0 0 1 1 1

SLIDE 40

HashGraph Graph to binary footprint

Hashing a graph

Example: set each bit value to +/- weight, sum the values per column, then take the sign to get the final hash

  Token   Weight   Hash
  a       3         3  -3   3   3  -3   3
  b       2        -2   2   2  -2  -2   2
  c       1         1  -1  -1   1   1   1
  total             2  -2   4   2  -4   6
  hash              1   0   1   1   0   1
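
The worked example can be replayed with a few lines of Python. This is a sketch of the SimHash steps as shown on the slides, using the token hashes a = 101101, b = 011001, c = 100111 from the example; the function name and data layout are illustrative choices.

```python
def simhash(tokens):
    """tokens: list of (weight, bits) pairs, all bit lists of the same length."""
    width = len(tokens[0][1])
    totals = [0] * width
    for weight, bits in tokens:
        for i, bit in enumerate(bits):
            # A 1 bit votes +weight, a 0 bit votes -weight.
            totals[i] += weight if bit else -weight
    # Positive column sums become 1, the others 0.
    return totals, [1 if t > 0 else 0 for t in totals]


tokens = [
    (3, [1, 0, 1, 1, 0, 1]),  # token a
    (2, [0, 1, 1, 0, 0, 1]),  # token b
    (1, [1, 0, 0, 1, 1, 1]),  # token c
]
totals, footprint = simhash(tokens)
# totals    -> [2, -2, 4, 2, -4, 6]
# footprint -> [1, 0, 1, 1, 0, 1]
```

Heavier tokens dominate the vote in each column, so two documents sharing their most important tokens end up with footprints that differ in few bits.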

SLIDE 41

HashGraph Graph to binary footprint

SimHash for Top-K computation

Very fast computation

Pairwise comparison of the user's hash against the database relies on the Hamming distance. It is very fast with current processor intrinsics (XOR and POPCNT). Further optimisations are possible using bit-shift indexing.

SLIDE 44

HashGraph Graph to binary footprint

Hashing our graph

◮ Our graph includes weighted edges and no weights on vertices.

◮ But:

    ◮ SimHash can hash any weighted graph information, e.g. vertices.
    ◮ We can introduce a weight on the vertices using the freshness of the information: the more recently a term was used, the higher its weight.

◮ Finally, we can just apply SimHash to our graph (the user profile), which results in a binary footprint associated with a user.

SLIDE 45

HashGraph Graph to binary footprint

Global overview

Figure: end-to-end pipeline. Step 1: from a Twitter username (@user) and a number of friends k, the tweet querier fetches the user's tweets through the Twitter API; text preprocessing (stopwords, POS), the co-occurencer and the graph builder (Jensen-Shannon divergence threshold s = 0.3) produce the in-memory graph model over the terms a, b, c, d of the cleaned tweets. Step 2: each node gets an N-bit hash (N depending on the method), SimHash() applied to the node values gives the hash of the graph, and ArgMax returns the top-k users closest to @user among the precomputed hashes of the possible-friends dataset (25,000 users, 1 million tweets), i.e. R = {user} with |R| ≤ k. Results are shown as “You may want to follow: @user1, @user2, ..., @userk” and are queried through an in-house API.
SLIDE 48

HashGraph Evaluation

Dataset

Crawling Twitter

First crawl to set up limits. Start with up to 1K tweets for 5 users.

Distribution

◮ Average tweets per user ≈ 120
◮ Standard deviation of tweets per user ≈ 212
◮ Average interval between two tweets ≈ 6 days
◮ Standard deviation of the interval between two tweets ≈ 240 days

Final crawl

Around 1 million tweets for 25K users, stored in a three-machine Cassandra cluster.

SLIDE 49

HashGraph Evaluation

Experimental Setup

Hashing the tokens

To hash the nodes of the user's graph, we used several methods:

◮ MD5
◮ SHA-256
◮ SHA-512

Measures

◮ Precision: 1 − RMSE vs Lucene's TF/IDF
◮ Precomputation time
◮ Computation time
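
The node-hashing step combined with SimHash can be sketched with Python's standard `hashlib`. This is an illustration under assumptions, not the evaluated implementation: the profile is reduced to a hypothetical term-to-weight mapping, and `term_bits` and `footprint` are names introduced here.

```python
import hashlib


def term_bits(term: str, method: str = "md5") -> list:
    """Hash a term with hashlib and return its digest as a list of bits."""
    digest = hashlib.new(method, term.encode("utf-8")).digest()
    return [(byte >> shift) & 1 for byte in digest for shift in range(7, -1, -1)]


def footprint(weighted_terms: dict, method: str = "md5") -> int:
    """SimHash over the hashed terms, returned as an int (most significant bit first)."""
    totals = None
    for term, weight in weighted_terms.items():
        bits = term_bits(term, method)
        if totals is None:
            totals = [0.0] * len(bits)
        for i, bit in enumerate(bits):
            totals[i] += weight if bit else -weight
    value = 0
    for t in totals:
        value = (value << 1) | (1 if t > 0 else 0)
    return value


# Hypothetical profile: term -> weight (for instance a freshness score).
profile = {"hash": 0.9, "graph": 0.7, "twitter": 0.4}
sizes = {m: len(term_bits("hash", m)) for m in ("md5", "sha256", "sha512")}
fp = footprint(profile)
```

The choice of method only changes the footprint length (128, 256 or 512 bits), which is the variable compared in the measurements.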

SLIDE 50

HashGraph Evaluation

Experimental results : Precision

Figure: precision (y-axis, 0.4 to 0.6) against the number of frequent terms in the TF/IDF (x-axis, 10^3 to 10^5) for HashGraphMD5, HashGraphSHA256 and HashGraphSHA512.

SLIDE 51

HashGraph Evaluation

Experimental results : Precomputation time

Figure: Precomputation time. Computation time in ms (axis from 10^4.5 to 10^5) for TF/IDF 1K, TF/IDF 50K, HashGraphMD5, HashGraphSHA256 and HashGraphSHA512.

SLIDE 52

HashGraph Evaluation

Experimental results : Computation time

Figure: computation time in ms (y-axis, 10^2 to 10^5) against the number of comparisons (x-axis, 0.2 to 1.2 × 10^7) for TF/IDF1K, TF/IDF10K, HashGraphMD5, HashGraphSHA-256 and HashGraphSHA-512.

SLIDE 55

HashGraph and HashWordnet On hashing node values

A key step left unspoken

◮ “To hash the nodes of the user’s graph, we used several methods”.

◮ This means that this is a term-frequency-based approach, like most (all?) semantic hashing schemes.

◮ The primary objective of HashGraph is to go beyond this: concept-sensitive search in sublinear time!

◮ Basically, for a pair of sentences like:

    ◮ “The doctor is presenting his research at the seminar”
    ◮ “The scientist offers a talk on computer science at the group conference”

◮ We expect the hash function to preserve the semantic relatedness even though no terms match between the two sentences; in the HashGraph version presented so far, this is not the case.

SLIDE 58

HashGraph and HashWordnet On hashing node values

How ?

◮ For now, the nodes in the graph (the user profile) are labelled using a cryptographic hash function (MD5, SHA-256, ...) applied to the string value of the node's term.

◮ However, two nodes whose strings are conceptually related should have hash values that are close in Hamming distance.

    ◮ doctor should have a hash value close to researcher
    ◮ research should have a hash value close to computer science
    ◮ ...

◮ Under this assumption, the SimHash algorithm we use will favour relatedness of the resulting footprints for nodes having similar hash values (hence similar meaning!).

◮ In order to do this, we rely on an external is-a taxonomy...

SLIDE 60

HashGraph and HashWordnet Using an exertnal is-a taxonomy

External is-a taxonomy (example)

Things
    NotLiving
        Vehicules
            Planes
            Boats
            Trucks
            Cars
        Furnitures
    Living
        animals
        humans

The proposal is as follows:

◮ Associate with each node a binary string that is close to the binary string of its parent node,

◮ Replace each node in the user profile not by its SHA/MD5/... hash value but by the hash value of the matching term in the taxonomy.

SLIDE 62

HashGraph and HashWordnet Using an exertnal is-a taxonomy

Strategy to hash the is-a taxonomy

◮ Building a strategy to hash each node in the is-a taxonomy is not trivial (it could be a talk by itself!).

◮ Let us assume that we choose to hash the taxonomy using Gray codes and a breadth-first traversal of the graph.

Things (00000000)
  NotLiving (00100000)
    Vehicles (00101000)
      Planes (00101100)
      Boats (00101011)
      Trucks (00101010)
      Cars (00101001)
    Furniture (00110000)
  Living (01000000)
    animals (01000001)
    humans (01000010)
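One possible way to realise such a strategy is sketched below. It is not the authors' exact scheme and does not reproduce the codes shown on the slide. It assumes that each taxonomy depth owns a fixed bit field (`BITS_PER_LEVEL`, a hypothetical parameter) and that a child copies its parent's code, then writes the Gray code of its sibling rank into the field for its own depth.

```python
from collections import deque

# Toy is-a taxonomy from the example, stored as parent -> children.
TAXONOMY = {
    "Things": ["NotLiving", "Living"],
    "NotLiving": ["Vehicles", "Furniture"],
    "Vehicles": ["Planes", "Boats", "Trucks", "Cars"],
    "Living": ["animals", "humans"],
}

BITS_PER_LEVEL = 3  # assumption: up to 7 children per node
MAX_DEPTH = 3       # assumption: number of levels below the root


def gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def hash_taxonomy(root: str) -> dict[str, int]:
    """Assign a code to every node with a breadth-first traversal."""
    codes = {root: 0}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        # Bit offset of the field owned by the children of this node.
        shift = BITS_PER_LEVEL * (MAX_DEPTH - depth - 1)
        for rank, child in enumerate(TAXONOMY.get(node, []), start=1):
            codes[child] = codes[node] | (gray(rank) << shift)
            queue.append((child, depth + 1))
    return codes


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


codes = hash_taxonomy("Things")
for name, code in codes.items():
    print(f"{name:10s} {code:09b}")
```

With this layout a child never differs from its parent by more than `BITS_PER_LEVEL` bits, and nodes in the same subtree share the bits of their common ancestors, which is the property the proposal asks for.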



slide-64
SLIDE 64

HashGraph and HashWordnet Using an external is-a taxonomy

Using HashWordnet in HashGraph

◮ We apply this strategy to the WordNet taxonomy (limited to noun synsets).

◮ The list of all terms (actually bags of terms), each associated with a single hash value by the process described previously, is called hashwordnet.

◮ Then, when substituting the string value of a node in the user profile with a hash value, we use the matching hash value in hashwordnet.

◮ SimHash does the rest. . .
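The substitution step can be sketched as follows, under simplifying assumptions: the user profile is reduced to a dictionary of weighted terms rather than the full graph of terms, and `HASHWORDNET` is a hypothetical toy lookup table that reuses the 8-bit codes of the example taxonomy. The `simhash` function is the standard weighted-vote construction, not necessarily the exact variant used in HashGraph.

```python
N_BITS = 8

# Hypothetical toy lookup: term -> taxonomy code (values from the example taxonomy).
HASHWORDNET = {
    "Planes": 0b00101100,
    "Boats": 0b00101011,
    "Trucks": 0b00101010,
    "Cars": 0b00101001,
    "animals": 0b01000001,
    "humans": 0b01000010,
}


def simhash(weighted_codes, n_bits=N_BITS):
    """Each code votes +weight on its 1 bits and -weight on its 0 bits;
    a bit of the footprint is set when its total vote is positive."""
    totals = [0.0] * n_bits
    for code, weight in weighted_codes:
        for bit in range(n_bits):
            totals[bit] += weight if (code >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(n_bits) if totals[bit] > 0)


def footprint(profile):
    """Replace each known term by its taxonomy code, then apply SimHash.
    Terms missing from the lookup table are ignored."""
    return simhash(
        (HASHWORDNET[term], weight)
        for term, weight in profile.items()
        if term in HASHWORDNET
    )


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


alice = footprint({"Cars": 3, "Trucks": 1})
bob = footprint({"Boats": 2, "Planes": 1})
carol = footprint({"animals": 2, "humans": 2})
print(hamming(alice, bob), hamming(alice, carol))
```

The two vehicle-oriented profiles share no term, yet their footprints end up closer to each other than to the profile built from living things, because the codes of their terms are already close.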


slide-65
SLIDE 65

HashGraph and HashWordnet Using an external is-a taxonomy

Advantages

◮ Goes beyond term matching: concept-sensitive hashing.

◮ No machine learning step involved: no cold start, no assumption about the distribution of items, no dataset dependency.

◮ Can be tuned to use different is-a taxonomies, e.g. domain taxonomies for better results!

◮ Online learning by default.

◮ Provides a metric in the Hamming space: interesting indexing for Top-k and ε-kNN problems.
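As an illustration of the last point, the sketch below answers a Top-k query and a fixed-radius query over a handful of footprints by brute force. The footprint values, the user identifiers, and the function names `top_k` and `within_radius` are made up for the example. A real deployment would more likely use a dedicated Hamming-space index than a linear scan.

```python
import heapq


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def top_k(query: int, footprints: dict[str, int], k: int) -> list[str]:
    """Return the k users whose footprints are closest to the query
    (ties broken by user identifier)."""
    nearest = heapq.nsmallest(
        k, footprints.items(), key=lambda item: (hamming(query, item[1]), item[0])
    )
    return [user for user, _ in nearest]


def within_radius(query: int, footprints: dict[str, int], eps: int) -> list[str]:
    """Return every user whose footprint is at Hamming distance <= eps."""
    return sorted(u for u, fp in footprints.items() if hamming(query, fp) <= eps)


# Made-up 8-bit footprints for four users.
FOOTPRINTS = {
    "u1": 0b00101001,
    "u2": 0b00101011,
    "u3": 0b01000000,
    "u4": 0b00101101,
}

print(top_k(0b00101000, FOOTPRINTS, k=2))
print(within_radius(0b00101000, FOOTPRINTS, eps=2))
```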


slide-66
SLIDE 66

HashGraph and HashWordnet Using an external is-a taxonomy

Current shortcomings and pending works

◮ WordNet provides several nodes for the same term (a term can have different meanings).

◮ Need to provide formal guarantees on the approximation.

◮ One top-level hashed taxonomy per language.

◮ Implementation limited to nouns after the POS tagging process for concept-sensitivity.



slide-68
SLIDE 68

Demos

Some live demos

  • 1. HashWordnet: online application for testing on the SemEval dataset.
  • 2. FIRE: content-based ebook recommendation.

slide-69
SLIDE 69

Thank you !

slide-70
SLIDE 70

Demos

Baluja, S. and Covell, M. (2008). Learning to hash: forgiving hash functions and applications. Data Mining and Knowledge Discovery, 17(3):402–430.

Gionis, A., Indyk, P., Motwani, R., et al. (1999). Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529.

Kim, S. and Choi, S. (2011). Semi-supervised discriminant hashing. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 1122–1127.

Lin, R.-S., Ross, D. A., and Yagnik, J. (2010). Spec hashing: Similarity preserving algorithm for entropy-based coding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 848–854. IEEE.


slide-71
SLIDE 71

Demos

Linial, N., London, E., and Rabinovich, Y. (1995). The geometry of graphs and some of its algorithmic applications. Combinatorica, 15(2):215–245.

Manku, G. S., Jain, A., and Das Sarma, A. (2007). Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web, pages 141–150. ACM.

Stein, B. (2005). Fuzzy-fingerprints for text-based information retrieval. In Proceedings of the 5th international conference on knowledge management (I-KNOW 05), Graz, Journal of Universal Computer Science, pages 572–579.


slide-72
SLIDE 72

Demos

Stein, B. and Potthast, M. (2007). Applying hash-based indexing in text-based information retrieval. In Proceedings of the 7th Dutch-Belgian Information Retrieval Workshop (DIR 07), pages 29–35.

Weiss, Y., Torralba, A., and Fergus, R. (2008). Spectral hashing. In NIPS, volume 9, page 6.

Wilson, C., Boe, B., Sala, A., Puttaswamy, K. P., and others (2009). User interactions in social networks and their implications. In EuroSys’09, pages 205–218.

Zhang, D., Wang, J., Cai, D., and Lu, J. (2010). Self-taught hashing for fast similarity search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 18–25. ACM.
