interactive clustering

Interactive Clustering, Barna Saha - PowerPoint PPT Presentation



  1. Interactive Clustering Barna Saha

  2. Clustering

  3. Learning over Noisy Data Learn a classifier or find clusters over noisy/uncertain data. Noise comes from using similarity functions: add an edge between two images if they represent the same monument, and the resulting clusters could be erroneous.

  4. Learning over Noisy Data Learn a classifier or find clusters over noisy/uncertain data. Noise comes from inherent data errors/missing attributes: for example, a clustering of the collaboration network obtained from DBLP could be erroneous.

  5. Further Applications • Linking Census Records • Public Health • Web search • Comparison shopping • Spam Detection • Machine Reading • IP Aliasing • ……..

  6. Query complexity of optimal strategy?

  7. Query complexity of optimal strategy? Davidson, Khanna, Milo, Roy, 2014

  8. Faulty Oracle

  9. Faulty Oracle Repeat the same question. Assuming p = q, repeat each question (say) $24 \log n/(1-2p)^2$ times.
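The repetition idea on this slide can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes the repeated answers are combined by majority vote, that each repeat errs independently with probability p, and that "log" is the natural logarithm. The names `repetitions`, `majority_query`, and `make_noisy_oracle` are hypothetical.

```python
import math
import random

def repetitions(n, p):
    """Repeat count from the slide, 24 log n / (1 - 2p)^2 (natural log assumed)."""
    return math.ceil(24 * math.log(n) / (1 - 2 * p) ** 2)

def majority_query(oracle, u, v, repeats):
    """Ask the same question `repeats` times and return the majority answer."""
    yes = sum(1 for _ in range(repeats) if oracle(u, v))
    return 2 * yes > repeats

def make_noisy_oracle(truth, p, rng):
    """Hypothetical oracle: answers 'same cluster?' and lies independently with probability p."""
    def oracle(u, v):
        same = truth[u] == truth[v]
        return (not same) if rng.random() < p else same
    return oracle

# Toy ground truth: nodes 0 and 1 share a cluster, node 2 does not.
truth = {0: "a", 1: "a", 2: "b"}
oracle = make_noisy_oracle(truth, p=0.3, rng=random.Random(0))
r = repetitions(n=1000, p=0.3)
same_01 = majority_query(oracle, 0, 1, r)
same_02 = majority_query(oracle, 0, 2, r)
```

With this many repeats, a wrong majority is extremely unlikely, which is why resampling makes the faulty oracle behave almost like a perfect one at a multiplicative cost in queries.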

  10. Faulty Oracle

  11. Faulty Oracle: No Resampling • Find seed nodes for each cluster • If we can find $24 \log n/(1-2p)^2$ seed nodes from each cluster then we are done! [Why?]
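One plausible reading of the "[Why?]" is that, once enough seed nodes are known, every remaining node can be compared once against each seed and placed by majority vote, so no question is ever repeated. The sketch below illustrates that reading under stated assumptions; the function `assign_with_seeds` and the fixed set of wrong answers are hypothetical, and the toy seed sets are far smaller than the bound on the slide.

```python
def assign_with_seeds(node, seeds, oracle):
    """Place `node` in the cluster whose seeds mostly answer 'same cluster'.

    Each (node, seed) pair is asked exactly once. Returns None when no
    cluster gets a majority of 'yes' answers.
    """
    best_cluster, best_yes = None, -1
    for cluster_id, members in seeds.items():
        yes = sum(1 for s in members if oracle(node, s))
        if 2 * yes > len(members) and yes > best_yes:
            best_cluster, best_yes = cluster_id, yes
    return best_cluster

# Toy ground truth; node 10 belongs to cluster "A", node 11 to neither seeded cluster.
truth = {v: "A" for v in range(5)}
truth.update({v: "B" for v in range(5, 10)})
truth[10] = "A"
truth[11] = "C"

# Persistent errors: these two answers are wrong, and asking again would not help.
wrong = {(10, 0), (10, 5)}

def oracle(u, v):
    same = truth[u] == truth[v]
    return (not same) if (u, v) in wrong else same

seeds = {"A": [0, 1, 2, 3, 4], "B": [5, 6, 7, 8, 9]}
cluster_of_10 = assign_with_seeds(10, seeds, oracle)
cluster_of_11 = assign_with_seeds(11, seeds, oracle)
```

Here node 10 gets 4 of 5 "yes" answers from cluster A despite one wrong answer, and only one spurious "yes" from cluster B, so the majority is still correct.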

  12. Faulty Oracle: How to find seed nodes? • Let $N = O(k^2 \log n/(1-2p)^4)$ • Select N nodes and ask all possible pairwise queries among these nodes. • Run a correlation clustering algorithm on this small set of nodes • Each cluster returned by the correlation clustering that has size at least $24 \log n/(1-2p)^2$ acts as a seed
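The steps on this slide can be sketched on a toy instance. The slides do not say which correlation clustering algorithm is used, so this sketch swaps in an exhaustive search over all partitions that minimizes the number of disagreements with the answers; that is only feasible for a handful of nodes. The names `partitions` and `find_seeds`, the sample of 8 nodes, and the size threshold of 3 are illustrative assumptions, not the values from the slide.

```python
from itertools import combinations

def partitions(items):
    """Yield every partition of `items` as a list of blocks (exponential; tiny inputs only)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block, or into a new block of its own.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def find_seeds(sample, oracle, min_size):
    """Query all pairs once, cluster by minimum disagreements, keep large clusters."""
    sample = list(sample)
    # Ask every pairwise query exactly once and store the (possibly wrong) answers.
    answers = {(u, v): oracle(u, v) for u, v in combinations(sample, 2)}
    best, best_cost = None, None
    for part in partitions(sample):
        label = {v: i for i, block in enumerate(part) for v in block}
        # A disagreement is a pair whose answer contradicts the candidate clustering.
        cost = sum(1 for (u, v), same in answers.items()
                   if (label[u] == label[v]) != same)
        if best_cost is None or cost < best_cost:
            best, best_cost = part, cost
    return sorted(sorted(block) for block in best if len(block) >= min_size)

# Toy instance: two true clusters of four nodes, with one wrong answer on pair (0, 1).
truth = {v: v // 4 for v in range(8)}
wrong = {(0, 1)}

def oracle(u, v):
    same = truth[u] == truth[v]
    return (not same) if (u, v) in wrong else same

seeds = find_seeds(range(8), oracle, min_size=3)
```

In this instance the true clustering is the only partition with a single disagreement, so the search recovers both clusters even though one answer is wrong.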

  13. Faulty Oracle: How to find seed nodes? Some intuition on the analysis: If we know all the query results, correlation clustering gives the maximum likelihood estimator. Moreover, it is an instance of correlation clustering where errors are random, and we know how to solve it!
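The maximum likelihood claim can be made concrete with a short derivation. It is a sketch under two assumptions: each of the $M = \binom{N}{2}$ answers is wrong independently with the same probability $p < 1/2$ (the $p = q$ case from the earlier slide), and $d(\mathcal{C})$ is notation introduced here for the number of queried pairs whose answer disagrees with a candidate clustering $\mathcal{C}$.

```latex
\[
\Pr[\text{answers} \mid \mathcal{C}]
  = p^{\,d(\mathcal{C})}\,(1-p)^{\,M - d(\mathcal{C})}
  = (1-p)^{M} \left(\frac{p}{1-p}\right)^{d(\mathcal{C})}
\]
% Since p < 1/2, the ratio p/(1-p) is below 1, so the likelihood is
% largest exactly when the number of disagreements is smallest:
\[
\arg\max_{\mathcal{C}} \Pr[\text{answers} \mid \mathcal{C}]
  = \arg\min_{\mathcal{C}} d(\mathcal{C})
\]
```

Minimizing $d(\mathcal{C})$ is the minimum-disagreement form of correlation clustering, which is why the clustering it returns is the maximum likelihood estimate under these assumptions.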
