Some graph optimization problems in data mining
University of Chicago, October 16, 2012
- P. Van Dooren, CESAME, Univ. catholique Louvain
based on work in collaboration with the group on
Lambiotte et al Phys Rev, 2008 Call density over 6 months Leuven
Lambiotte et al Phys Rev, 2008 Call density over 6 months Brussels
Ref: Melchior, Eng. Thesis, UCL
Detecting dishonest participants in auction systems ( )
Given a bipartite graph with n raters and m objects and votes
Example: graph matrix form = X (votes) Characterize the reputation f of the raters and r of the objects
r1 r2 r3
1 1 2 3 5
r1 r2 r3
4.2 4.5 2.8 3.4 3.3 4.9 f ? f1 = 4.6 f2 = 4.2 f3 = 3 r ? Belief divergence = Variance
4.2 4.5 2.8 3.4 3.3 4.9 f ? r ? Belief divergence = Variance f1 = 5 f2 = 4.8 f3 = 1.4
Assume that every rater evaluates all objects with a vote in [0,1], and that X and f > 0 are the voting matrix and the raters’ reputation
The object’s reputation vector r is the weighted sum of the votes
The rater’s reputation f depends on the discrepancy with the other votes
There is a unique pair of vectors r and f satisfying these formulas when d Inf
Ref: De Kerchove-VD, SIAM News 08
These two formulas define the following iteration, where the voting matrix could be dynamic and then changes at each
Theorem If d > m, the iteration converges towards the unique fixed point that gives the reputations r of the objects and f(r) of the raters.
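The slide formulas for r and f did not survive in this transcript, so the following is only a minimal sketch under stated assumptions: r is taken as the f-weighted average of the votes, r = X^T f / sum(f), and f_i = d - ||x_i - r||^2, where x_i is the i-th row of X. With votes in [0,1] and m objects, this choice keeps f > 0 whenever d > m. The function name, the default d = m + 1 and the stopping rule are hypothetical.

```python
import numpy as np

def iterative_filtering(X, d=None, iters=100, tol=1e-12):
    """Sketch of a reputation iteration (assumed formulas, see lead-in).

    X : (n, m) array of votes in [0, 1], one row per rater.
    Returns (r, f): object reputations and rater reputations.
    """
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    if d is None:
        d = m + 1.0  # assumption: any d > m keeps f strictly positive
    f = np.ones(n)                 # start by trusting all raters equally
    r = X.T @ f / f.sum()          # plain mean of the votes
    for _ in range(iters):
        # assumed rater update: penalize squared distance to r
        f = d - np.sum((X - r) ** 2, axis=1)
        # assumed object update: f-weighted average of the votes
        r_new = X.T @ f / f.sum()
        converged = np.linalg.norm(r_new - r) < tol
        r = r_new
        if converged:
            break
    return r, f

# Three raters who roughly agree, plus one who votes against them.
X = np.array([[0.8, 0.6, 0.2],
              [0.9, 0.5, 0.3],
              [0.7, 0.6, 0.2],
              [0.0, 1.0, 1.0]])
r, f = iterative_filtering(X)
print("object reputations:", r)
print("rater reputations: ", f)
```

In this toy input the dissenting rater ends up with the smallest weight, so r moves away from the plain mean toward the votes of the other three.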
If d > m, the fixed point of our iteration corresponds to the minimum
E.g. for m=2, the energy function looks like (for d>2 and for d=1.5)
and one iteration step corresponds to the steepest descent (with a particular step size) and this converges monotonically to r* since we have
||rk+1-rk||2
Data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies. Each user has rated at least 20 movies. The data was collected through the MovieLens web site (movielens.umn.edu) during a seven-month period.
237 spammers (always scoring 1, except for their unique best friend, which receives the maximum: 5) are added (+25%).
The mean (Left) is less robust than our iteration (Middle), which also gives good results for the raters’ reputations (Right).
Convergence of spammer separation after steps 1, 2 and Inf
For A and B adjacency matrices of the two graphs, S solves
ρS = A S B^T + A^T S B
This matrix can be obtained as the fixed point of the power method (linear)
Ref: Blondel et al, SIAM Rev., ‘04
Element S_54 says how similar node 5 of A is to node 4 of B
Element S_43 says how similar node 4 of A is to node 3 of B
Two nodes are similar if their parents and children are similar
Such a recursive definition leads to an eigenvector equation
The (normalized) sequence
Z_{k+1} = (A Z_k B^T + A^T Z_k B) / ||A Z_k B^T + A^T Z_k B||_F
has two fixed points Z_even and Z_odd for every Z_0 > 0
Similarity matrix S = lim_{k→∞} Z_{2k}, Z_0 = 1
S_{i,j} is the similarity score between V_i(A) and V_j(B)
With z_k = vec(Z_k), this is equivalent to the power method
z_{k+1} = (B ⊗ A + B^T ⊗ A^T) z_k / ||(B ⊗ A + B^T ⊗ A^T) z_k||_2
which is the power method on M = B ⊗ A + B^T ⊗ A^T
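The normalized iteration above can be sketched in a few lines of NumPy. This is an illustration, not the authors' code: the function names and the fixed iteration count are hypothetical, Z_0 is the all-ones matrix, and an even number of steps is taken to approximate the even limit. The helper `kron_step` applies one step of the equivalent vectorized form, using column-major vec so that vec(A Z B^T) = (B ⊗ A) vec(Z).

```python
import numpy as np

def similarity_step(Z, A, B):
    """One normalized step Z -> (A Z B^T + A^T Z B) / ||.||_F."""
    W = A @ Z @ B.T + A.T @ Z @ B
    return W / np.linalg.norm(W, "fro")  # assumes W is not the zero matrix

def similarity_matrix(A, B, pairs=100):
    """Approximate S = lim Z_{2k}, starting from the all-ones matrix."""
    Z = np.ones((A.shape[0], B.shape[0]))
    for _ in range(2 * pairs):           # even number of steps
        Z = similarity_step(Z, A, B)
    return Z

def kron_step(z, A, B):
    """Same step on z = vec(Z), with M = B (x) A + B^T (x) A^T."""
    M = np.kron(B, A) + np.kron(B.T, A.T)
    w = M @ z
    return w / np.linalg.norm(w)

# Small example: a 3-node graph A compared with a 2-node graph B.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
B = np.array([[0, 1],
              [0, 0]], dtype=float)
S = similarity_matrix(A, B)
print(S)  # S[i, j]: similarity between node i of A and node j of B
```

The dense Kronecker matrix is built here only to check the equivalence on small graphs. For large sparse graphs one would apply the matrix form of the step directly, which never forms M.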
S satisfies ρS = A S B^T + A^T S B, ρ = ||A S B^T + A^T S B||_F
It is the nonnegative fixed point S of largest 1-norm
It solves the optimization problem max ⟨A S B^T + A^T S B, S⟩ subject to ||S||_F = 1
Extension of Kleinberg’s HITS method
Linear convergence (power method for sparse M)
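One way to see the link with Kleinberg's method, which the slide does not spell out, is to take B as the two-node graph 1 → 2. The update then splits into h ← A a for the first column of S and a ← A^T h for the second, which are the hub and authority updates. The sketch below checks this numerically against the leading singular vectors of A; the function name and the example graph are hypothetical.

```python
import numpy as np

def hub_authority_scores(A, pairs=100):
    """Similarity iteration with B = (1 -> 2); columns are hubs, authorities."""
    B = np.array([[0.0, 1.0],
                  [0.0, 0.0]])                # assumed structure graph 1 -> 2
    Z = np.ones((A.shape[0], 2))
    for _ in range(2 * pairs):
        W = A @ Z @ B.T + A.T @ Z @ B         # equals [A a, A^T h] for Z = [h, a]
        Z = W / np.linalg.norm(W, "fro")
    return Z[:, 0], Z[:, 1]

A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
hubs, auths = hub_authority_scores(A)

# Hub and authority directions are the leading left and right singular
# vectors of A (compared up to sign and scale).
U, sing, Vh = np.linalg.svd(A)
print(hubs / np.linalg.norm(hubs), np.abs(U[:, 0]))
print(auths / np.linalg.norm(auths), np.abs(Vh[0]))
```

The comparison relies on the largest singular value of A being simple, which holds for this example graph.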
Edge (u,v) if v appears in the definition of u: 1,398,424 edges
Average of 12 edges per node
Ref: Blondel et al, SIAM Rev., ‘04
The neighborhood graph is the subset of vertices used for finding synonyms: it contains “all” parents and children of the node (figure: neighborhood graph of “likely”)
“Central” uses this sub-graph to rank synonyms automatically
Rank each node in the graph with the similarity to node c in
Ref: Blondel et al, SIAM Rev., ‘04
          Vectors   Central     ArcRank        Wordnet     Microsoft
1         vanish    vanish      epidemic       vanish      vanish
2         wear      pass        disappearing   go away     cease to exist
3         die       die         port           end         fade away
4         sail      wear        dissipate      finish      die out
5         faint     faint       cease          terminate   go
6         light     fade        eat            cease       evaporate
7         port      sail        gradually      -           wane
8         absorb    light       instrumental   -           expire
9         appear    dissipate   darkness       -           withdraw
10        cease     cease       efface         -           pass away
Mark      3.6       6.3         1.2            7.5         8.6
Std Dev   1.8       1.7         1.2            1.4         1.3
Vectors, Central and ArcRank are automatic; Wordnet and Microsoft Word are manual
          Vectors       Central    ArcRank       Wordnet        Microsoft
1         juice         cane       granulation   sweetening     darling
2         starch        starch     shrub         sweetener      baby
3         cane          sucrose    sucrose       carbohydrate   honey
4         milk          milk       preserve      saccharide     dear
5         molasses      sweet      honeyed       -              love
6         sucrose       dextrose   property      saccarify      dearest
7         wax           molasses   sorghum       sweeten        beloved
8         root          juice      grocer        dulcify        precious
9         crystalline   glucose    acetate       edulcorate     pet
10        confection    lactose    saccharine    dulcorate      babe
Mark      3.9           6.3        4.3           6.2            4.7
Std Dev   2.0           2.4        2.3           2.9            2.7
||S||_F = 1, U^T U = V^T V = I_k
The fixed point of ρS = A S B^T + A^T S B, ρ = ||A S B^T + A^T S B||_F corresponds to
max ⟨A S B^T + A^T S B, S⟩ subject to ||S||_F = 1
The fixed point of U Σ V^T = Π_opt(A U V^T B^T + A^T U V^T B) corresponds to
max ⟨A U V^T B^T + A^T U V^T B, U V^T⟩ subject to U^T U = V^T V = I_k
This is not an eigenvalue problem anymore but can be computed using iterative techniques with a linear complexity per step
max ⟨A U V^T B^T + A^T U V^T B, U V^T⟩ subject to U^T U = V^T V = I_k
is also equivalent to
max ⟨U^T A U, V^T B V⟩ subject to U^T U = V^T V = I_k
U^T A U and V^T B V can be viewed as k×k “Rayleigh quotients”
Linearly converging iteration (truncated SVD):
U_{k+1} Σ_{k+1} V_{k+1}^T + U_⊥ Σ_⊥ V_⊥^T = A U_k V_k^T B^T + A^T U_k V_k^T B + s U_k V_k^T
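The truncated-SVD iteration can be sketched as follows, reading the update as: form the right-hand side, take its SVD, and keep the k leading singular vectors as the new U and V. This is a sketch under assumptions rather than the authors' implementation: the random orthonormal start, the shift value s = 1.0, the fixed iteration count and all function names are hypothetical. The dense SVD is also only for illustration and does not reflect the linear per-step complexity mentioned in the slides.

```python
import numpy as np

def low_rank_similarity(A, B, k, s=1.0, iters=200, seed=0):
    """Sketch: iterate U, V with orthonormal columns via a truncated SVD."""
    rng = np.random.default_rng(seed)
    # assumed initialization: random orthonormal bases from a QR factorization
    U, _ = np.linalg.qr(rng.standard_normal((A.shape[0], k)))
    V, _ = np.linalg.qr(rng.standard_normal((B.shape[0], k)))
    for _ in range(iters):
        UVt = U @ V.T
        M = A @ UVt @ B.T + A.T @ UVt @ B + s * UVt
        W, _, Zh = np.linalg.svd(M, full_matrices=False)
        U, V = W[:, :k], Zh[:k, :].T      # keep the k leading singular vectors
    return U, V

def rayleigh_objective(A, B, U, V):
    """<U^T A U, V^T B V> with the trace (Frobenius) inner product."""
    return float(np.sum((U.T @ A @ U) * (V.T @ B @ V)))

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
B = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
U, V = low_rank_similarity(A, B, k=2)
print(rayleigh_objective(A, B, U, V))
```

For any U and V, the first objective equals twice ⟨U^T A U, V^T B V⟩, so the two maximization problems share the same maximizers.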
Graphs with similar structure
Correlation is nearly optimal
Fraikin, Nesterov, VD, LAA 07