Movie & Actor
QI, Xiaoxu CHEN, Guanhao JIN, Yue
OVERVIEW
Goal: build a movie and actor portal that provides users with movie and actor data from multiple data sources. We can also search for the most popular movie (actor) in a specified year or of a specific type. More interestingly, we can select a crew for a certain type.
CONTENT
SPECIFICATION
✓ Data sources: (1) http://themoviedb.org/ (TMDB) (2) http://www.imdb.com/ (IMDB) (3) https://www.wikipedia.org/ (WIKI)
✓ Data file format: JSON & XML
✓ Database: MongoDB
✓ Programming language: Ruby
Fetching data
The crawler uses the movies on the front page as the URL seeds and a thread-safe Queue to store URLs. Multiple threads extract data from the current URL and push the new URLs found on that page back onto the queue.
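The queue-and-workers scheme described above can be sketched as follows. This is only a minimal illustration, not the project's crawler: the `PAGES` hash is a hypothetical in-memory stand-in for the real sites, and the `crawl` method, its `worker_count` parameter and the pending-counter shutdown are assumptions made so the example runs without network access.

```ruby
require 'set'

# Hypothetical in-memory "site": url => links found on that page.
PAGES = {
  '/'        => ['/movie/1', '/movie/2'],
  '/movie/1' => ['/actor/1', '/movie/2'],
  '/movie/2' => ['/actor/1'],
  '/actor/1' => []
}.freeze

def crawl(seeds, pages, worker_count: 4)
  queue   = Queue.new   # thread-safe queue of URLs still to visit
  seen    = Set.new     # URLs that were already enqueued once
  visited = []
  lock    = Mutex.new
  pending = 0           # URLs enqueued but not yet fully processed

  enqueue = lambda do |url|
    lock.synchronize do
      if seen.add?(url)  # add? returns nil when the URL is already known
        pending += 1
        queue << url
      end
    end
  end

  seeds.each { |url| enqueue.call(url) }
  queue.close if pending.zero?

  workers = Array.new(worker_count) do
    Thread.new do
      # pop returns nil once the queue is closed and empty
      while (url = queue.pop)
        # A real crawler would fetch and parse the page here.
        pages.fetch(url, []).each { |link| enqueue.call(link) }
        finished = lock.synchronize do
          visited << url
          pending -= 1
          pending.zero?
        end
        queue.close if finished  # no work left anywhere: release the workers
      end
    end
  end

  workers.each(&:join)
  visited
end
```

Calling `crawl(['/'], PAGES)` visits every reachable URL exactly once; the visiting order depends on thread scheduling.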
Entity Resolution - Attribute Alignment
movie

ATTRIBUTE            DATA TYPE
title                String
year                 Integer
rating               Float
directors            Array
casts                Hash
main_casts           Array
total_time           Integer
languages            Array
alias                Array
country              Array
genre                Array
writers              Array
filming_locations    Array
keywords             Array
match_id             Integer
db_name              String

actor

ATTRIBUTE            DATA TYPE
name                 String
birthday             Date
gender               String
place_of_birth       String
nationality          String
known_credits        Integer
adult_actor          Boolean
years_active         String
alias                Array
biography            String
known_for            Array
match_id             Integer
db_name              String
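As an illustration of attribute alignment, the sketch below maps a raw record from one source onto a few attributes of the movie schema above. The per-source field names in `FIELD_MAP` are hypothetical (the slides do not list the original field names), and `align_movie` is an illustrative helper, not the project's code.

```ruby
# Hypothetical source field names mapped to the aligned attribute names.
FIELD_MAP = {
  'tmdb' => { 'title' => 'title', 'release_year' => 'year',
              'vote_average' => 'rating', 'runtime' => 'total_time' },
  'imdb' => { 'name' => 'title', 'year' => 'year',
              'score' => 'rating', 'minutes' => 'total_time' }
}.freeze

# Converters that enforce the data types of the aligned schema.
CONVERTERS = {
  'title'      => ->(v) { v.to_s.strip },
  'year'       => ->(v) { Integer(v) },
  'rating'     => ->(v) { Float(v) },
  'total_time' => ->(v) { Integer(v) }
}.freeze

def align_movie(raw, db_name)
  aligned = { 'db_name' => db_name }
  FIELD_MAP.fetch(db_name).each do |source_key, attribute|
    next unless raw.key?(source_key)  # leave missing attributes unset
    aligned[attribute] = CONVERTERS.fetch(attribute).call(raw[source_key])
  end
  aligned
end
```

For example, `align_movie({ 'name' => ' Some Movie ', 'year' => '1999' }, 'imdb')` yields a record with a stripped String title and an Integer year, so records from different sources can be compared attribute by attribute.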
Entity Resolution – Methods
(i) Blocking movies: use the lowercase first character of the movie's title to split the data into sub-datasets.
(ii) Blocking actors: use the first letters of both the first name and the last name to form a key.
Example: the block with key "ab" contains Array Button, Barack Ajson and Adam W. Black.

(i) Decision tree
(ii) Pairwise matching score: Jaro-Winkler; Monge-Elkan; Jaccard coefficient; N-grams.
(iii) Transitivity, exclusivity and functional dependency.
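The two blocking rules can be sketched as below. The method names are illustrative. Sorting the two initials of an actor key is an assumption inferred from the example block "ab", which contains both "Array Button" and "Barack Ajson"; the slides do not state it explicitly.

```ruby
# Movie block key: lowercase first character of the title.
def movie_block_key(title)
  title.strip[0].to_s.downcase
end

# Actor block key: first letters of the first and last name.
# Assumption: the two letters are sorted, so the name order does not matter.
def actor_block_key(name)
  parts = name.strip.split(/\s+/)
  return '' if parts.empty?
  [parts.first[0], parts.last[0]].map(&:downcase).sort.join
end

# Group records into blocks using the key computed by the given block.
def block_by(records)
  records.group_by { |record| yield(record) }
end
```

With these helpers, `block_by(names) { |n| actor_block_key(n) }` returns a hash from block key to the names in that block, and pairwise matching only needs to run inside each block.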
Decision tree:
name similarity < T1: Not Match
T1 <= name similarity < T2: compute the distance between the 2 entries
T2 <= name similarity: Match
Further branch labels in the figure: same birthday, different birthday
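The threshold part of the decision tree can be sketched as follows. The Jaccard coefficient over character bigrams is used here as the name similarity because it is one of the scores listed in the slides; the threshold values `T1 = 0.3` and `T2 = 0.8` and the method names are assumptions, and the birthday branches are not modeled.

```ruby
require 'set'

# Character n-grams of a lowercased, whitespace-normalized string.
def ngrams(text, n = 2)
  s = text.downcase.strip.squeeze(' ')
  return Set[s] if s.length < n
  Set.new((0..s.length - n).map { |i| s[i, n] })
end

# Jaccard coefficient: size of the intersection over size of the union.
def jaccard(a, b)
  x = ngrams(a)
  y = ngrams(b)
  (x & y).size.to_f / (x | y).size
end

T1 = 0.3  # assumed lower threshold
T2 = 0.8  # assumed upper threshold

# First level of the decision tree on name similarity.
def classify(name_a, name_b, t1: T1, t2: T2)
  similarity = jaccard(name_a, name_b)
  return :not_match if similarity < t1
  return :match if similarity >= t2
  :compute_distance  # the undecided band needs the full record distance
end
```

Only pairs that land in the middle band need the more expensive distance over the remaining attributes.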
We can skip many pairwise match calculations by using transitivity and exclusivity. With transitivity, if A matches B and B matches C, then A matches C without computing that pair.
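A minimal sketch of how transitivity saves comparisons, using a union-find structure to track which records are already known to be the same entity. The `MatchClusters` class and the `resolve` method are illustrative names, and only transitivity is modeled here, not exclusivity.

```ruby
# Union-find over record ids: records in one cluster are the same entity.
class MatchClusters
  def initialize
    @parent = {}
  end

  def find(id)
    @parent[id] ||= id
    @parent[id] = find(@parent[id]) unless @parent[id] == id  # path compression
    @parent[id]
  end

  def union(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_a] = root_b unless root_a == root_b
  end

  def same?(a, b)
    find(a) == find(b)
  end
end

# Compare all pairs, skipping pairs already implied by earlier matches.
# The matcher block stands in for the expensive pairwise match calculation.
def resolve(ids, &matcher)
  clusters = MatchClusters.new
  comparisons = 0
  ids.combination(2) do |a, b|
    next if clusters.same?(a, b)  # implied by transitivity, no calculation needed
    comparisons += 1
    clusters.union(a, b) if matcher.call(a, b)
  end
  [clusters, comparisons]
end
```

With four records of which three refer to the same entity, this runs five match calculations instead of six; the saving grows with the size of the clusters.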
Data Fusion
Voting with the trustworthiness of the different data sources:
(a) Naive voting with source accuracy tmdb > imdb > wiki (actor.gender, actor.birthday, movie.year, etc.)
(b) Longest string (actor.name, movie.title, etc.)
(c) Union (array of strings) (actor.biography, movie.director, etc.)
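The three fusion rules might look like the sketch below. The exact voting scheme is not spelled out in the slides, so `fuse_by_vote` assumes a majority vote with ties broken by the source ranking tmdb > imdb > wiki; all method names are illustrative.

```ruby
# Source ranking from the slides, most trusted first.
SOURCE_RANK = %w[tmdb imdb wiki].freeze

# (a) Voting: most frequent value wins; ties go to the most trusted source.
# The tie-breaking rule is an assumption.
def fuse_by_vote(records, attribute)
  candidates = records.reject { |r| r[attribute].nil? }
  return nil if candidates.empty?
  winner = candidates.group_by { |r| r[attribute] }.max_by do |_value, supporters|
    best_rank = supporters.map { |r| SOURCE_RANK.index(r['db_name']) || SOURCE_RANK.size }.min
    [supporters.size, -best_rank]
  end
  winner.first
end

# (b) Longest string: keep the most detailed spelling.
def fuse_longest(records, attribute)
  records.map { |r| r[attribute] }.compact.max_by(&:length)
end

# (c) Union: merge the values of all sources without duplicates.
def fuse_union(records, attribute)
  records.flat_map { |r| Array(r[attribute]) }.uniq
end
```

Each rule takes the matched records of one entity (one per source) and returns the fused value for a single attribute.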
Data Portal
Via the data portal, users can view the data both before and after data fusion. Users can also select a crew for a specific genre. Finally, users can search for the top 10 most popular movies for a given genre and year.
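The top-10 search can be sketched over in-memory records that follow the movie schema. Ranking popularity by the `rating` attribute is an assumption, since the slides do not say how popularity is measured, and `top_movies` is an illustrative name; the actual portal would presumably query MongoDB instead.

```ruby
# Top movies for a genre and year, ranked by rating (assumed popularity measure).
def top_movies(movies, genre:, year:, limit: 10)
  movies
    .select { |m| m['year'] == year && Array(m['genre']).include?(genre) }
    .sort_by { |m| -m['rating'].to_f }  # highest rating first
    .first(limit)
end
```

For example, `top_movies(movies, genre: 'Comedy', year: 2001)` returns at most ten comedy records from 2001, best rated first.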
Problems Encountered
[...] year with the same director and cast, they are matched although they should not be (Scared Movie 2).
[...] birthday: 1960-05-01 & 1860-03-01), which enlarges the distance too much.
[...] so that some records may not be matched in ER even though they should be.
[...] show or award ceremony. We have not found a good way to solve this problem.