CS535 Big Data 4/27/2020 Week 14-A Sangmi Lee Pallickara http://www.cs.colostate.edu/~cs535 Spring 2020 Colorado State University, page 1
CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 5: ALGORITHMIC TECHNIQUES FOR BIG DATA
Sangmi Lee Pallickara Computer Science, Colorado State University http://www.cs.colostate.edu/~cs535
CS535 Big Data | Computer Science | Colorado State University
FAQs
- Please check the announcement for the term project deadlines
Topics of Today's Class
- Part 1: Locality Sensitive Hashing for Minhash Signatures and The Theory of Locality
Sensitive Functions
- Part 2: LSH Families for Other Distance Measures
- Part 3: Geohash and Bloom filter
GEAR Session 5. Algorithmic Techniques for Big Data
Lecture 2. Locality Sensitive Hashing
Locality Sensitive Hashing for Minhash Signatures
Planning the computation
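Characteristic matrix and hash values:
Row (element)   S1   S2   S3   S4   x + 1 mod 5   3x + 1 mod 5
0               1    0    0    1    1             1
1               0    0    1    0    2             4
2               0    1    0    1    3             2
3               1    0    1    1    4             0
4               0    0    1    0    0             3
Minhash signature matrix:
      S1   S2   S3   S4
h1    1    3    0    1
h2    0    2    0    0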
- Creating DataFrames
- Generating Hash values
- Calculating signature
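The signature calculation in this example can be sketched in plain Python. This is only an illustration, not the Spark DataFrame code the slide plans: it assumes blank cells in the example table mean 0, so that S1 = {0, 3}, S2 = {2}, S3 = {1, 3, 4} and S4 = {0, 2, 3}, and the name minhash_signatures is hypothetical.

```python
# Each set is given by the rows (elements) it contains, read off the
# example's characteristic matrix (assumption: blank cells are 0).
sets = {
    "S1": {0, 3},
    "S2": {2},
    "S3": {1, 3, 4},
    "S4": {0, 2, 3},
}

# The two hash functions from the example: h1 = x + 1 mod 5, h2 = 3x + 1 mod 5.
hash_funcs = [
    lambda x: (x + 1) % 5,
    lambda x: (3 * x + 1) % 5,
]


def minhash_signatures(sets, hash_funcs, num_rows):
    """Return {set name: [min h1, min h2, ...]} over the rows in each set."""
    inf = float("inf")
    sig = {name: [inf] * len(hash_funcs) for name in sets}
    for row in range(num_rows):
        # Hash the row number once per hash function.
        hashed = [h(row) for h in hash_funcs]
        for name, members in sets.items():
            if row in members:
                # Keep the smallest hash value seen so far for this set.
                sig[name] = [min(old, new) for old, new in zip(sig[name], hashed)]
    return sig


signatures = minhash_signatures(sets, hash_funcs, num_rows=5)
for name, sig in signatures.items():
    print(name, sig)
```

Under these assumptions the loop prints S1 [1, 0], S2 [3, 2], S3 [0, 0] and S4 [1, 0], which are the h1 and h2 entries of the signature matrix in the example.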
General LSH Operations in Apache Spark
- Feature Transformation
- Add hashed values as a new column
- The hashed column can be used for dimensionality reduction; users specify the input and output
column names by setting inputCol and outputCol
- Supports multiple LSH hash tables
- Users can specify the number of hash tables by setting numHashTables
- Approximate Similarity Join
- Takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller
than a user-defined threshold
- Approximate Nearest Neighbor Search
- Takes a dataset (of feature vectors) and a key (a single feature vector), and approximately returns a
specified number of rows in the dataset that are closest to the key
- A distance column will be added to the output dataset to show the true distance between each output
row and the searched key
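The three operations above can be sketched conceptually in plain Python for sets under Jaccard distance. This is not Spark's API or implementation: the function names, the hash form ((1 + x) * a + b) mod PRIME, the value of PRIME, and the sample rows are all assumptions made for illustration. Each "hash table" here is one random min-hash function, and num_hash_tables plays the role of numHashTables.

```python
import random

PRIME = 2038074743  # a large prime; the exact value is an arbitrary choice here


def make_hash_tables(num_hash_tables, seed=0):
    """One random (a, b) pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hash_tables)]


def hash_values(features, tables):
    """Min-hash of a non-empty set of ints, one value per hash table."""
    return [min(((1 + x) * a + b) % PRIME for x in features) for a, b in tables]


def transform(rows, tables):
    """Feature transformation: add the hashed values as a new 'hashes' column."""
    return [{**row, "hashes": hash_values(row["features"], tables)} for row in rows]


def jaccard_distance(s, t):
    return 1.0 - len(s & t) / len(s | t)


def approx_similarity_join(rows_a, rows_b, tables, threshold):
    """Pairs that collide in at least one table and whose true distance < threshold."""
    buckets = {}
    for rb in transform(rows_b, tables):
        for i, h in enumerate(rb["hashes"]):
            buckets.setdefault((i, h), []).append(rb)
    seen, result = set(), []
    for ra in transform(rows_a, tables):
        for i, h in enumerate(ra["hashes"]):
            for rb in buckets.get((i, h), []):
                pair = (ra["id"], rb["id"])
                if pair in seen:
                    continue
                seen.add(pair)
                dist = jaccard_distance(ra["features"], rb["features"])
                if dist < threshold:
                    result.append((ra["id"], rb["id"], dist))
    return result


def approx_nearest_neighbors(rows, key, tables, k):
    """Up to k rows sharing a bucket with the key, ordered by true distance."""
    key_hashes = hash_values(key, tables)
    candidates = [r for r in transform(rows, tables)
                  if any(h == kh for h, kh in zip(r["hashes"], key_hashes))]
    scored = sorted((jaccard_distance(r["features"], key), r["id"]) for r in candidates)
    return [(row_id, dist) for dist, row_id in scored[:k]]


rows_a = [{"id": 0, "features": {0, 1, 2}},
          {"id": 1, "features": {2, 3, 4}},
          {"id": 2, "features": {0, 2, 4}}]
rows_b = [{"id": 3, "features": {1, 3, 5}},
          {"id": 4, "features": {2, 3, 5}},
          {"id": 5, "features": {0, 2, 4}}]

tables = make_hash_tables(num_hash_tables=5)
hashed = transform(rows_a, tables)
joined = approx_similarity_join(rows_a, rows_b, tables, threshold=0.6)
neighbors = approx_nearest_neighbors(rows_a, {0, 2, 4}, tables, k=2)
print(hashed[0]["hashes"])
print(joined)
print(neighbors)
```

Both searches are approximate because only rows that collide with the query in at least one hash table become candidates; the true distance is then computed on those candidates alone. Identical sets always collide, so the pair (2, 5) and the neighbor with id 2 are always found, while other close pairs may be missed. Under the idealized assumption of independent min-hash functions, two sets with Jaccard similarity s collide in at least one of b tables with probability 1 - (1 - s)^b, so raising the number of hash tables lowers the miss rate at the cost of more candidates.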