Interactive Clustering Barna Saha Clustering Learning over Noisy - - PowerPoint PPT Presentation



SLIDE 1

Interactive Clustering

Barna Saha

SLIDE 2

Clustering

SLIDE 3

Learning over Noisy Data

Learn a classifier or find clusters over noisy/uncertain data.

Noise comes from using similarity functions: add an edge between two images if they represent the same monument. The resulting clusters could be erroneous.

SLIDE 4
Learning over Noisy Data

  • Learn a classifier or do clustering over noisy/uncertain data

Noise comes from inherent data errors or missing attributes: a clustering of the collaboration network obtained from DBLP could be erroneous.

SLIDE 5

Further Applications

  • Linking Census Records
  • Public Health
  • Web search
  • Comparison shopping
  • Spam Detection
  • Machine Reading
  • IP Aliasing
  • ……..
SLIDE 6

Query complexity of optimal strategy?

SLIDE 7

Query complexity of optimal strategy?

Davidson, Khanna, Milo, Roy, 2014

SLIDE 8

Faulty Oracle

SLIDE 9

Faulty Oracle

Repeat the same question. Assuming p = q, repeat each question (say) $24 \log n/(1-2p)^2$ times.
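The repetition strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `noisy_answer` is a hypothetical oracle that flips the true answer independently with probability p on every call (so resampling helps), the logarithm is taken as natural, and the repeated answers are combined by majority vote, which the slide does not spell out.

```python
import math
import random

def repetitions(n, p):
    """Repeats per question: 24 log n / (1-2p)^2, rounded up (natural log assumed)."""
    return math.ceil(24 * math.log(n) / (1 - 2 * p) ** 2)

def noisy_answer(truth, p, rng):
    """Hypothetical faulty oracle: returns the wrong answer with probability p."""
    return truth if rng.random() >= p else not truth

def majority_query(truth, n, p, rng):
    """Ask the same question repeatedly and return the majority answer."""
    t = repetitions(n, p)
    yes = sum(noisy_answer(truth, p, rng) for _ in range(t))
    return 2 * yes > t

rng = random.Random(0)
n, p = 1000, 0.3
print("repeats per question:", repetitions(n, p))
print("recovered answers:", majority_query(True, n, p, rng), majority_query(False, n, p, rng))
```

The number of repeats grows as p approaches 1/2, because the factor (1-2p)^2 in the denominator shrinks toward zero.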

SLIDE 10

Faulty Oracle

SLIDE 11

Faulty Oracle: No Resampling

  • Find seed nodes for each cluster
  • If we can find $24 \log n/(1-2p)^2$ seed nodes from each cluster, then we are done! [Why?]
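One plausible reading of the [Why?] is that querying a new node against every seed of a cluster yields that many independent noisy answers to the same membership question, so a majority vote over the seeds plays the role that repetition played before. The sketch below illustrates this reading; `make_oracle` and `assign` are hypothetical helpers, the toy ground truth has two equal clusters, and the oracle caches its answer per pair so that asking again gives no new information.

```python
import math
import random

def make_oracle(cluster_of, p, seed=0):
    """Hypothetical faulty oracle without resampling: each pair gets one
    noisy answer, which is cached and returned on every repeat."""
    rng = random.Random(seed)
    cache = {}
    def same_cluster(u, v):
        key = (min(u, v), max(u, v))
        if key not in cache:
            truth = cluster_of[u] == cluster_of[v]
            cache[key] = truth if rng.random() >= p else not truth
        return cache[key]
    return same_cluster

def assign(node, seeds, oracle):
    """Return the cluster whose seeds answer 'yes' most often (and more than half the time)."""
    best, best_frac = None, 0.5
    for label, members in seeds.items():
        frac = sum(oracle(node, s) for s in members) / len(members)
        if frac > best_frac:
            best, best_frac = label, frac
    return best

n, p = 1000, 0.2
cluster_of = [i % 2 for i in range(n)]               # toy ground truth: two clusters
t = math.ceil(24 * math.log(n) / (1 - 2 * p) ** 2)   # seeds needed per cluster
seeds = {c: [i for i in range(n) if cluster_of[i] == c][:t] for c in (0, 1)}
seeded = {s for members in seeds.values() for s in members}
oracle = make_oracle(cluster_of, p)
rest = [v for v in range(n) if v not in seeded]
errors = sum(assign(v, seeds, oracle) != cluster_of[v] for v in rest)
print(f"{len(rest)} nodes placed using {t} seeds per cluster, {errors} errors")
```

Each remaining node costs one query per seed, and no pair is ever asked twice.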

SLIDE 12

Faulty Oracle: How to find seed nodes?

  • Let $N = O(k^2 \log n/(1-2p)^4)$
  • Select N nodes and ask all possible pairwise queries among these nodes.
  • Run a correlation clustering algorithm on this small set of nodes
  • Each cluster returned by the correlation clustering that has size at least $24 \log n/(1-2p)^2$ acts as a seed
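The last two steps can be illustrated on a tiny hand-built instance. The sketch below is not the algorithm from the talk: it solves correlation clustering exactly by enumerating every partition, which is only feasible for a handful of nodes, and it uses a toy size threshold of 3 in place of $24 \log n/(1-2p)^2$. The helper names and the example answers are assumptions made for illustration.

```python
from itertools import combinations

def partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part

def disagreements(part, answers):
    """Count queried pairs whose answer contradicts the partition."""
    label = {v: i for i, block in enumerate(part) for v in block}
    return sum((label[u] == label[v]) != ans for (u, v), ans in answers.items())

def correlation_clustering(nodes, answers):
    """Exact minimum-disagreement clustering by brute force (tiny inputs only)."""
    return min(partitions(nodes), key=lambda part: disagreements(part, answers))

def seed_clusters(nodes, answers, min_size):
    """Keep only the returned clusters that are large enough to act as seeds."""
    best = correlation_clustering(nodes, answers)
    return [sorted(block) for block in best if len(block) >= min_size]

nodes = list(range(6))
truth = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
answers = {(u, v): truth[u] == truth[v] for u, v in combinations(nodes, 2)}
answers[(0, 4)] = True          # one wrong oracle answer

best = correlation_clustering(nodes, answers)
print(sorted(sorted(b) for b in best), disagreements(best, answers))
print(seed_clusters(nodes, answers, min_size=3))
```

In this instance the single wrong answer cannot be explained away by any other partition, so the true clustering is the unique minimizer, and only the larger cluster passes the size filter.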

SLIDE 13

Faulty Oracle: How to find seed nodes?

  • Let $N = O(k^2 \log n/(1-2p)^4)$
  • Select N nodes and ask all possible pairwise queries among these nodes.
  • Run a correlation clustering algorithm on this small set of nodes
  • Each cluster returned by the correlation clustering that has size at least $24 \log n/(1-2p)^2$ acts as a seed

Some intuition on the analysis: If we know all the query results, correlation clustering gives the maximum likelihood estimator. Moreover, it is an instance of correlation clustering where errors are random, and we know how to solve it!
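The maximum likelihood claim can be checked with a short calculation. The derivation below is a sketch under the assumption that each of the M queried pairs is answered incorrectly independently with probability p < 1/2 (the p = q case); the symbols C, M and d(C) are introduced here for illustration and do not appear on the slides.

```latex
% Assumption: each of the M answers is wrong independently with probability p < 1/2.
% d(C) = number of queried pairs whose answer disagrees with the clustering C.
\begin{align*}
\Pr[\text{answers} \mid C] &= p^{\,d(C)} (1-p)^{\,M - d(C)} \\
\log \Pr[\text{answers} \mid C] &= M \log(1-p) - d(C)\,\log\frac{1-p}{p} \\
\arg\max_{C} \Pr[\text{answers} \mid C] &= \arg\min_{C} d(C)
\qquad \text{since } \log\tfrac{1-p}{p} > 0 \text{ for } p < \tfrac12 .
\end{align*}
```

Minimizing the number of disagreements d(C) is the correlation clustering objective, so under this noise model the most likely clustering is the one correlation clustering returns.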