SLIDE 22 Semantic Hashing Existing solutions
Remarks on existing solutions
◮ Learning-based and/or near-duplicate-oriented approaches.
◮ Pros:
  ◮ Near-duplicate detection (this is the learning objective),
  ◮ Entropy maximising for some of them (mainly [Lin et al., 2010, Zhang et al., 2010]),
  ◮ Demonstrates the usefulness of this research for practical problems.
◮ Cons:
  ◮ Data dependency (cold start, generalization, ...),
  ◮ Online learning is difficult for many of these solutions,
  ◮ Term and term-frequency matching is concept-oblivious,
  ◮ Few proposals ([Zhang et al., 2010, Weiss et al., 2008, Lin et al., 2010]) consider providing a scheme that preserves similarity distance, in addition to increasing the chance of collisions for similar items.
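To make the last point concrete, here is a minimal sketch of what "preserving similarity distance" can mean for a hash. It uses random hyperplane hashing (a SimHash-style technique), chosen purely for illustration; it is not the method of any of the cited works, nor of HashGraph. The function names, vectors, and the 256-bit signature length are all assumptions made for this example.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normals) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def hamming(a, b):
    """Number of bit positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical feature vectors (e.g. term weights); values are illustrative only.
planes = make_hyperplanes(n_bits=256, dim=4)
doc = [1.0, 2.0, 0.0, 1.0]
near = [1.0, 2.1, 0.1, 0.9]    # small angle with doc
far = [-1.0, 0.0, 2.0, -1.0]   # large angle with doc

d_near = hamming(signature(doc, planes), signature(near, planes))
d_far = hamming(signature(doc, planes), signature(far, planes))
print(d_near, d_far)
```

With such a scheme the Hamming distance between two signatures grows with the angle between the original vectors, so the hash not only collides for near-duplicates but also ranks less similar items as farther apart.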
- C. Gravier, J. Subercaze (Universities of)
HashGraph: Semantic Hashing using external knowledge base. 12 / 43