
Movie & Actor (QI, Xiaoxu; CHEN, Guanhao; JIN, Yue)



  1. Movie & Actor. QI, Xiaoxu; CHEN, Guanhao; JIN, Yue

  2. OVERVIEW ● Goal: build a movie and actor portal that provides users with movie and actor data from multiple data sources. ● Users can also search for the most popular movie (or actor) in a specified year or of a specific type (genre). More interestingly, they can select a crew for a certain type.

  3. CONTENT ● Specification ● Fetching Data ● Entity Resolution ● Data Fusion ● Data Portal ● Conclusion & Reference

  4. SPECIFICATION
     ✓ Data sources: (1) http://themoviedb.org/ (TMDB); (2) http://www.imdb.com/ (IMDB); (3) https://www.wikipedia.org/ (WIKI)
     ✓ Data file formats: JSON & XML
     ✓ Database: MongoDB
     ✓ Programming language: Ruby

  5. Fetching data 1

  6. Fetching data
     ● Crawling strategy:
       ● TMDB & WIKI: crawl all the data sequentially.
       ● IMDB: crawl with breadth-first search (BFS). The popular movies on the front page serve as the seed URLs, and a thread-safe Queue stores the URLs to visit. Multiple threads extract data from the current URL and push the new URLs found on that page back onto the queue.
     ● Raw data statistics:
       ● TMDB: 20,000+ movies & 20,000+ actors
       ● IMDB: 10,000+ movies & 11,000+ actors
       ● WIKI: 5,000+ movies & 7,000+ actors
     ● Raw data were stored in JSON or XML files.
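
The IMDB crawling strategy can be sketched roughly as below. This is a minimal illustration, not the authors' crawler: `bfs_crawl`, `fetch_links`, and the in-memory `LINKS` graph are hypothetical names, and the graph stands in for real HTTP fetching and HTML parsing. It only shows how a thread-safe `Queue`, a visited set, and several worker threads might cooperate.

```ruby
require "set"

# Hypothetical in-memory link graph standing in for real pages; a real
# crawler would fetch each URL over HTTP and parse the links from the HTML.
LINKS = {
  "/movie/1" => ["/actor/1", "/actor/2"],
  "/actor/1" => ["/movie/1", "/movie/2"],
  "/actor/2" => ["/movie/2"],
  "/movie/2" => ["/actor/1"]
}.freeze

# Breadth-first crawl from the seed URLs using a thread-safe Queue.
# fetch_links is any callable mapping a URL to the URLs found on that page.
def bfs_crawl(seeds, fetch_links, worker_count: 4)
  queue = Queue.new
  visited = Set.new
  lock = Mutex.new
  pending = 0 # URLs enqueued but not yet fully processed

  # Enqueue a URL only the first time it is seen.
  enqueue = lambda do |url|
    lock.synchronize do
      if visited.add?(url)
        pending += 1
        queue << url
      end
    end
  end

  seeds.each { |url| enqueue.call(url) }
  queue.close if pending.zero?

  workers = Array.new(worker_count) do
    Thread.new do
      # pop returns nil once the queue is closed and empty.
      while (url = queue.pop)
        fetch_links.call(url).each { |link| enqueue.call(link) }
        lock.synchronize do
          pending -= 1
          queue.close if pending.zero? # nothing left to crawl
        end
      end
    end
  end

  workers.each(&:join)
  visited
end
```

Calling `bfs_crawl(["/movie/1"], ->(url) { LINKS.fetch(url, []) })` returns the set of every URL reachable from the seed. The `pending` counter is one possible way to let the workers stop once the frontier is exhausted.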


  8. Entity Resolution 2

  9. Entity Resolution - Attribute Alignment

     Movie:

     | ATTRIBUTE | DATA TYPE |
     |---|---|
     | title | String |
     | year | Integer |
     | rating | Float |
     | directors | Array |
     | casts | Hash |
     | main_casts | Array |
     | total_time | Integer |
     | languages | Array |
     | alias | Array |
     | country | Array |
     | genre | Array |
     | writers | Array |
     | filming_locations | Array |
     | keywords | Array |
     | match_id | Integer |
     | db_name | String |

     Actor:

     | ATTRIBUTE | DATA TYPE |
     |---|---|
     | name | String |
     | birthday | Date |
     | gender | String |
     | place_of_birth | String |
     | nationality | String |
     | known_credits | Integer |
     | adult_actor | Boolean |
     | years_active | String |
     | alias | Array |
     | biography | String |
     | known_for | Array |
     | match_id | Integer |
     | db_name | String |

  10. Entity Resolution – Methods
      ● Clustering based on characters (blocking)
        (i) Blocking movies: use the lowercase first character of the movie’s title to partition the data into sub-datasets.
        (ii) Blocking actors: use the first letters of both the first name and the last name to form a key.
      ● Pairwise matching
        (i) Decision tree
        (ii) Pairwise matching scores: Jaro-Winkler; Monge-Elkan; Jaccard coefficient; n-grams.
        (iii) Transitivity, exclusivity, and functional dependency.
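
One of the listed pairwise scores, the Jaccard coefficient over character n-grams, might look like the sketch below. The function names `ngrams` and `jaccard`, the default n = 2, and the lowercase/trim normalization are assumptions for illustration. The slides do not say how the scores were implemented or combined, and Jaro-Winkler and Monge-Elkan are not shown here.

```ruby
require "set"

# Character n-grams of a string after lowercasing and trimming whitespace.
# Strings no longer than n are treated as a single gram.
def ngrams(text, n = 2)
  chars = text.downcase.strip.chars
  return Set[chars.join] if chars.length <= n

  chars.each_cons(n).map(&:join).to_set
end

# Jaccard coefficient: size of the intersection over size of the union
# of the two n-gram sets, a value between 0.0 and 1.0.
def jaccard(a, b, n = 2)
  x = ngrams(a, n)
  y = ngrams(b, n)
  (x & y).size.to_f / (x | y).size
end
```

For example, `jaccard("night", "nacht")` compares the bigram sets {ni, ig, gh, ht} and {na, ac, ch, ht}, which share one of seven distinct bigrams.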

  11. Blocking example: the key "ab" groups the names "Array Button", "Barack Ajson", and "Adam W. Black".
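
The blocking keys can be sketched as follows. The helper names are hypothetical, and sorting the two initials of an actor's name is an assumption inferred from the slide example, where "Barack Ajson" falls under the same key "ab" as "Array Button".

```ruby
# Movie blocking key: lowercase first character of the title.
def movie_block_key(title)
  title.strip[0].to_s.downcase
end

# Actor blocking key: first letters of the first and last name.
# Sorting the two letters is an assumption so that "Barack Ajson"
# and "Array Button" both map to "ab", as in the slide example.
def actor_block_key(name)
  parts = name.strip.split
  return "" if parts.empty?

  [parts.first[0], parts.last[0]].map(&:downcase).sort.join
end

# Partition names into blocks; only names inside the same block
# are later compared pairwise.
def block_actors(names)
  names.group_by { |name| actor_block_key(name) }
end
```

Blocking this way trades a little recall for far fewer comparisons: a misspelled first letter sends a record to the wrong block, where it is never compared with its true match.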


  13. [Figure: decision tree for pairwise matching. It branches on name similarity (name similarity < T1; T1 <= name similarity < T2; T2 <= name similarity) and on birthday (same / different). Leaves: Match, Not Match, Compute Distance between 2 entries.]


  15. We can skip many pairwise match calculations if we use transitivity and exclusivity.
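
How transitivity and exclusivity might save comparisons can be sketched with a union-find structure. The interpretation here is an assumption, since the slide does not define the terms: transitivity is read as "records already in the same cluster need no further comparison", and exclusivity as "a cluster holds at most one record per source". The class `MatchClusters` and its methods are hypothetical.

```ruby
# Union-find over record ids; each cluster remembers which sources it covers.
class MatchClusters
  # records: Hash mapping record id => source name (e.g. "tmdb")
  def initialize(records)
    @parent = {}
    @sources = {}
    records.each do |id, source|
      @parent[id] = id
      @sources[id] = [source]
    end
  end

  # Root of the cluster that contains id.
  def find(id)
    id = @parent[id] while @parent[id] != id
    id
  end

  # True when the pairwise score does not need to be computed.
  def skip?(a, b)
    ra = find(a)
    rb = find(b)
    return true if ra == rb # transitivity: already known to match

    # Exclusivity (assumed): clusters sharing a source cannot be merged.
    !(@sources[ra] & @sources[rb]).empty?
  end

  # Record that a and b were matched by the pairwise step.
  def merge(a, b)
    ra = find(a)
    rb = find(b)
    return if ra == rb

    @parent[rb] = ra
    @sources[ra] |= @sources[rb]
  end
end
```

The exclusivity shortcut is only as good as the deduplication of each source: if one source contains two copies of the same entity, the second copy can never join the cluster.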


  17. Data Fusion 3

  18. Data Fusion
      ● Methods & algorithms: voting that takes the trustworthiness of the different data sources into account
        (a) Naive voting, with source accuracy ranked tmdb > imdb > wiki (actor.gender, actor.birthday, movie.year, etc.)
        (b) Longest string (actor.name, movie.title, etc.)
        (c) Union (arrays of strings) (actor.biography, movie.director, etc.)
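
The three resolution functions could be sketched as below. The names `vote`, `longest_string`, and `union` are hypothetical, and the voting rule (majority wins, ties broken by the tmdb > imdb > wiki trust order) is an assumption, since the slide does not spell out how votes and source accuracy are combined.

```ruby
# Trust order from the slide: tmdb > imdb > wiki (lower rank = more trusted).
SOURCE_RANK = { "tmdb" => 0, "imdb" => 1, "wiki" => 2 }.freeze

# Naive voting over [source, value] pairs. Assumed rule: the value with the
# most supporters wins; ties go to the value backed by the most trusted source.
def vote(entries)
  present = entries.reject { |_source, value| value.nil? }
  return nil if present.empty?

  winner = present.group_by { |_source, value| value }.min_by do |_value, supporters|
    best_rank = supporters.map { |source, _| SOURCE_RANK.fetch(source) }.min
    [-supporters.size, best_rank]
  end
  winner.first
end

# Longest string among the candidate values (e.g. actor.name, movie.title).
def longest_string(values)
  values.compact.max_by(&:length)
end

# Union of array-valued attributes, without duplicates.
def union(arrays)
  arrays.compact.flatten.uniq
end
```

For instance, `vote([["wiki", 1961], ["imdb", 1960]])` is a one-to-one tie, so the imdb value is kept under the assumed tie-break.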

  19. Data portal 4

  20. Data Portal Via the data portal, users can view the data both before and after data integration. The interesting part of the portal is that users can build a movie crew for a specific genre. Finally, users can search for the top 10 most popular movies for a given genre and year.
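
The top-10 search might be expressed like this over fused movie records. It is an in-memory sketch only: the portal itself stores data in MongoDB, the function `top_movies` is hypothetical, and using `rating` as the popularity measure is an assumption. The keys `year`, `genre`, and `rating` follow the movie attributes listed in the attribute alignment slide.

```ruby
# Top movies for a genre and year, ordered by rating (highest first).
# movies: array of hashes with :title, :year, :genre (Array) and :rating.
def top_movies(movies, genre:, year:, limit: 10)
  movies
    .select { |m| m[:year] == year && m[:genre].include?(genre) }
    .sort_by { |m| -(m[:rating] || 0.0) } # unrated movies sort last
    .first(limit)
end
```

A call such as `top_movies(movies, genre: "Comedy", year: 2000)` returns at most ten records, and `limit:` can be lowered for shorter lists.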

  21. Problems Encountered
      ● If a movie has a sequel in the same or the following year with the same director and cast, the two entries match although they should not (e.g. Scary Movie 2).
      ● Some sources contain mistakes in crucial fields (e.g. birthday: 1960-05-01 vs. 1860-03-01), which inflate the distance too much.
      ● Duplicates within a single source cannot be fully eliminated, so some records that should match are not matched during ER.
      ● Some entries are not actually movies but TV shows or award ceremonies. We have not found a good way to solve this problem.

  22. References:
      1. ISO 639 Language Code List: https://www.loc.gov/standards/iso639-2/php/code_list.php
      2. Felix Naumann, "Similarity measures" [DPDC_12_Similarity]
      3. Jens Bleiholder and Felix Naumann, "Data Fusion", _ACM Computing Surveys, Vol. 41, No. 1, Article 1_

  23. Thank You!
