Movie & Actor QI, Xiaoxu CHEN, Guanhao JIN, Yue OVERVIEW - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Movie & Actor

QI, Xiaoxu CHEN, Guanhao JIN, Yue

slide-2
SLIDE 2

OVERVIEW

  • Goal: build a movie and actor portal that provides users with movie and actor data from multiple data sources.
  • We can also search for the most popular movie (actor) in a specified year or of a specific genre. More interestingly, we can select a crew for a given genre.

slide-3
SLIDE 3

CONTENT

  • Specification
  • Fetching Data
  • Entity Resolution
  • Data Fusion
  • Data Portal
  • Conclusion & Reference
slide-4
SLIDE 4

SPECIFICATION

  ✓ Data sources:
    (1) http://themoviedb.org/ (TMDB)
    (2) http://www.imdb.com/ (IMDB)
    (3) https://www.wikipedia.org/ (WIKI)
  ✓ Data file format: JSON & XML
  ✓ Database: MongoDB
  ✓ Programming language: Ruby

slide-5
SLIDE 5

1 Fetching data

slide-6
SLIDE 6

Fetching data

  • Crawling strategy:
    • TMDB & WIKI: crawl all the data sequentially.
    • IMDB: crawl with BFS. The popular movies on the front page serve as URL seeds, and a thread-safe Queue stores the URLs. Multiple threads extract data from the current URL and push the new URLs found on that page back onto the queue.

  • Raw data statistics:
    • TMDB: 20,000+ movies & 20,000+ actors
    • IMDB: 10,000+ movies & 11,000+ actors
    • WIKI: 5,000+ movies & 7,000+ actors
  • Raw data were stored in JSON or XML format files.
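
The BFS crawl described for IMDB can be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: the in-memory FAKE_PAGES table, the fetch_links helper, and the crawl method are all hypothetical stand-ins for real HTTP fetching and HTML parsing. It only shows how seed URLs, a thread-safe Queue, and several worker threads fit together.

```ruby
require "set"

# Hypothetical in-memory "site": each URL maps to the URLs linked from it.
# A real crawler would download and parse the page instead.
FAKE_PAGES = {
  "/front"   => ["/movie/1", "/movie/2"],
  "/movie/1" => ["/actor/1", "/movie/2"],
  "/movie/2" => ["/actor/1", "/actor/2"],
  "/actor/1" => ["/movie/1"],
  "/actor/2" => []
}.freeze

def fetch_links(url)
  FAKE_PAGES.fetch(url, [])
end

def crawl(seeds, worker_count: 4)
  queue   = Queue.new   # thread-safe FIFO from Ruby's core library
  lock    = Mutex.new   # guards seen, visited, and pending
  seen    = Set.new
  visited = []
  pending = 0           # URLs enqueued but not yet fully processed

  enqueue = lambda do |url|
    lock.synchronize do
      next unless seen.add?(url)   # add? returns nil for known URLs
      pending += 1
      queue << url
    end
  end

  seeds.each { |url| enqueue.call(url) }
  return visited if pending.zero?

  workers = Array.new(worker_count) do
    Thread.new do
      while (url = queue.pop) != :done
        fetch_links(url).each { |link| enqueue.call(link) }
        finished = lock.synchronize do
          visited << url
          pending -= 1
          pending.zero?
        end
        # The last URL is done: wake every worker with a stop marker.
        worker_count.times { queue << :done } if finished
      end
    end
  end
  workers.each(&:join)
  visited
end

visited = crawl(["/front"])
```

Because the queue is FIFO, pages are visited in approximately breadth-first order; with several workers the exact order can vary between runs. The pending counter is one possible way to detect that the frontier is exhausted so the workers can shut down.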
slide-7
SLIDE 7


slide-8
SLIDE 8
slide-9
SLIDE 9

2 Entity Resolution

slide-10
SLIDE 10

Entity Resolution - Attribute Alignment

Movie

ATTRIBUTE          DATA TYPE
title              String
year               Integer
rating             Float
directors          Array
casts              Hash
main_casts         Array
total_time         Integer
languages          Array
alias              Array
country            Array
genre              Array
writers            Array
filming_locations  Array
keywords           Array
match_id           Integer
db_name            String

Actor

ATTRIBUTE          DATA TYPE
name               String
birthday           Date
gender             String
place_of_birth     String
nationality        String
known_credits      Integer
adult_actor        Boolean
years_active       String
alias              Array
biography          String
known_for          Array
match_id           Integer
db_name            String
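
To illustrate what aligning a raw record to the actor schema might look like, here is a small sketch. It covers only a few of the attributes listed above. The source field names (birth_date, aka), the sample record, and the helpers coerce and align_actor are made up for the example; the deck does not show the actual mapping code.

```ruby
require "date"

# Subset of the aligned actor schema: attribute => type coercion.
ACTOR_SCHEMA = {
  "name"          => ->(v) { v.to_s.strip },
  "birthday"      => ->(v) { Date.parse(v.to_s) },
  "gender"        => ->(v) { v.to_s },
  "known_credits" => ->(v) { Integer(v) },
  "alias"         => ->(v) { Array(v) }
}.freeze

# Apply a converter, mapping missing or unparseable values to nil.
def coerce(value, converter)
  value.nil? ? nil : converter.call(value)
rescue ArgumentError, TypeError
  nil
end

# field_map (hypothetical) translates schema attributes to the
# field names used by one particular source.
def align_actor(raw, field_map, db_name)
  aligned = { "db_name" => db_name }
  ACTOR_SCHEMA.each do |attribute, converter|
    source_field = field_map.fetch(attribute, attribute)
    aligned[attribute] = coerce(raw[source_field], converter)
  end
  aligned
end

# Invented sample record, for illustration only.
raw = {
  "name"          => "  Jane Doe ",
  "birth_date"    => "1970-01-02",
  "aka"           => "J. Doe",
  "known_credits" => "12"
}
aligned = align_actor(raw, { "birthday" => "birth_date", "alias" => "aka" }, "tmdb")
```

Keeping db_name on every aligned record makes it possible to tell later, during entity resolution and fusion, which source a value came from.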

slide-11
SLIDE 11

Entity Resolution – Methods

  • Clustering based on characters
    (i) Blocking movies: use the lowercased first character of the movie’s title to split the data into sub-datasets.
    (ii) Blocking actors: use the first letters of both the first name and the last name to form a key.
  • Pairwise matching
    (i) Decision tree
    (ii) Pairwise matching scores: Jaro-Winkler; Monge-Elkan; Jaccard coefficient; n-grams.
    (iii) Transitivity, exclusivity, and functional dependency.
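
As an illustration of one of the pairwise scores named above, the following sketch computes a Jaccard coefficient over character n-grams. Combining the two measures this way, the choice of bigrams, and the helper names ngrams and jaccard are assumptions for the example; the deck does not say how the scores were configured or combined.

```ruby
require "set"

# Set of character n-grams of a normalized string (bigrams by default).
def ngrams(text, n = 2)
  chars = text.downcase.gsub(/\s+/, " ").strip
  return Set.new(chars.empty? ? [] : [chars]) if chars.length < n
  Set.new((0..chars.length - n).map { |i| chars[i, n] })
end

# Jaccard coefficient: |A ∩ B| / |A ∪ B| over the two n-gram sets.
def jaccard(a, b, n = 2)
  x = ngrams(a, n)
  y = ngrams(b, n)
  union = x | y
  return 1.0 if union.empty?   # two empty strings count as identical
  (x & y).size.to_f / union.size
end

same  = jaccard("Alien", "alien")   # identical after lowercasing
close = jaccard("night", "nacht")   # 1 shared bigram ("ht") out of 7
```

A score of 1.0 means the n-gram sets are identical and 0.0 means they share nothing, so the value can be compared directly against matching thresholds.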

slide-12
SLIDE 12

Blocking example: key "ab"

  • Array Button
  • Barack Ajson
  • Adam W. Black
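
The blocking keys can be sketched as below. The method names movie_block_key and actor_block_key are hypothetical. Sorting the two initials is an assumption inferred from the slide example, where "Barack Ajson" falls into the same "ab" block as "Adam W. Black"; the deck does not state this rule explicitly.

```ruby
# Movie key: lowercased first character of the title.
def movie_block_key(title)
  title.strip[0].to_s.downcase
end

# Actor key: first letters of the first and last name.
# ASSUMPTION: the initials are sorted so that name order does not matter.
def actor_block_key(name)
  parts = name.strip.split(/\s+/)
  return "" if parts.empty?
  [parts.first, parts.last].map { |part| part[0].downcase }.sort.join
end

names  = ["Array Button", "Barack Ajson", "Adam W. Black", "Carol Dane"]
blocks = names.group_by { |name| actor_block_key(name) }
```

Pairwise matching then only needs to compare records inside the same block, which is what keeps the number of comparisons manageable.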

slide-13
SLIDE 13

slide-14
SLIDE 14

Decision tree on name similarity:

  • name similarity < T1: Not Match
  • T1 <= name similarity < T2: Compute distance between the 2 entries
  • T2 <= name similarity: Match
  • Further branch labels in the figure: same birthday / different birthday
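
A minimal sketch of the two-threshold decision follows. T1 and T2 come from the slide, but the numeric values 0.6 and 0.9, the max_distance cutoff, and the method name match_decision are placeholders chosen for illustration. The birthday branches are not modeled here.

```ruby
# Returns :match or :no_match. The block computes the distance between
# the two entries and is only evaluated in the middle band.
# ASSUMPTION: t1, t2, and max_distance are illustrative values.
def match_decision(name_similarity, t1: 0.6, t2: 0.9, max_distance: 0.3)
  return :no_match if name_similarity < t1
  return :match    if name_similarity >= t2

  distance = yield
  distance <= max_distance ? :match : :no_match
end

clear_miss = match_decision(0.40) { raise "distance not needed" }
clear_hit  = match_decision(0.95) { raise "distance not needed" }
near       = match_decision(0.75) { 0.1 }
far        = match_decision(0.75) { 0.8 }
```

Passing the distance as a block means the more expensive comparison is skipped whenever the name similarity alone settles the question.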

slide-15
SLIDE 15
slide-16
SLIDE 16

We can skip many pairwise match calculations if we use transitivity and exclusivity.
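
One way to see how these two rules save comparisons is a union-find structure that also tracks which sources each cluster already covers. This is a sketch under an assumption the deck does not spell out: "exclusive" is read here as "a cluster contains at most one record per source". The Clusters class and its method names are hypothetical.

```ruby
class Clusters
  # records: { record_id => source_name }
  def initialize(records)
    @parent  = {}
    @sources = {}
    records.each do |id, source|
      @parent[id]  = id
      @sources[id] = [source]
    end
  end

  # Root of the cluster containing id (with path compression).
  def find(id)
    @parent[id] = find(@parent[id]) unless @parent[id] == id
    @parent[id]
  end

  # :same     -> already matched via transitivity, no comparison needed
  # :excluded -> the clusters share a source, so they cannot match
  # :unknown  -> a pairwise comparison is still required
  def status(a, b)
    root_a = find(a)
    root_b = find(b)
    return :same if root_a == root_b
    return :excluded unless (@sources[root_a] & @sources[root_b]).empty?
    :unknown
  end

  # Record a confirmed match between a and b.
  def merge(a, b)
    root_a = find(a)
    root_b = find(b)
    return if root_a == root_b
    @parent[root_b] = root_a
    @sources[root_a] |= @sources.delete(root_b)
  end
end

clusters = Clusters.new("t1" => :tmdb, "i1" => :imdb, "i2" => :imdb, "w1" => :wiki)
before = clusters.status("t1", "w1")
clusters.merge("t1", "i1")
clusters.merge("i1", "w1")
after_transitive = clusters.status("t1", "w1")
after_exclusive  = clusters.status("w1", "i2")
```

Only pairs whose status is :unknown need a similarity computation; the other two outcomes are decided from matches that were already found.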

slide-17
SLIDE 17

Entity Resolution – Methods

slide-18
SLIDE 18
slide-19
SLIDE 19

3 Data Fusion

slide-20
SLIDE 20

Data Fusion

  • Methods & algorithms

Voting weighted by the trustworthiness of the different data sources:

(a) Naive voting, with source accuracy tmdb > imdb > wiki (actor.gender, actor.birthday, movie.year, etc.)
(b) Longest string (actor.name, movie.title, etc.)
(c) Union of arrays of strings (actor.biography, movie.director, etc.)
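
The three fusion strategies could look roughly like this in Ruby. The exact voting rule is not given in the deck, so the version below is an interpretation: a plain majority vote whose ties are broken by the source ranking tmdb > imdb > wiki. The method names vote, longest_string, and union are hypothetical, and Hash#tally needs Ruby 2.7 or newer.

```ruby
# Lower rank means more trusted (ordering taken from the slide).
SOURCE_RANK = { "tmdb" => 0, "imdb" => 1, "wiki" => 2 }.freeze

# values: { source_name => value }
# ASSUMPTION: majority vote, ties resolved by the most trusted source.
def vote(values)
  present = values.reject { |_, value| value.nil? }
  return nil if present.empty?

  counts  = present.values.tally
  best    = counts.values.max
  winners = counts.select { |_, count| count == best }.keys
  present.sort_by { |source, _| SOURCE_RANK.fetch(source, SOURCE_RANK.size) }
         .find { |_, value| winners.include?(value) }
         .last
end

# Keep the longest non-nil string.
def longest_string(values)
  values.values.compact.max_by(&:length)
end

# Merge all values into one array without duplicates.
def union(values)
  values.values.compact.flatten.uniq
end

year      = vote("tmdb" => 1999, "imdb" => 2000, "wiki" => 2000)
tie_year  = vote("tmdb" => 1999, "imdb" => 2000)
title     = longest_string("tmdb" => "Alien", "imdb" => "Alien (1979)")
directors = union("tmdb" => ["A", "B"], "wiki" => ["B", "C"])
```

Each strategy takes the same source-to-value hash, so the choice of strategy per attribute can be kept in a simple lookup table.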

slide-21
SLIDE 21

4 Data portal

slide-22
SLIDE 22

Data Portal

  • Via the data portal, users can view the data both before and after data integration.
  • The interesting part of the portal is that users can build a movie crew for a specific genre.
  • Finally, users can search for the top 10 popular movies for a given genre and year.

slide-23
SLIDE 23

Problems Encountered

  • If a movie has a sequel in the same or the following year with the same director and cast, the two will match although they should not (e.g. Scary Movie 2).
  • Some sources contain mistakes in crucial fields (e.g. birthday: 1960-05-01 vs. 1860-03-01), which inflate the distance too much.
  • Duplicates within a single source cannot be fully eliminated, so some records that should match are not matched in ER.
  • Some entries are not actually movies but TV shows or award ceremonies. We have not found a good way to solve this problem.

slide-24
SLIDE 24
slide-25
SLIDE 25

References:

  1. ISO 639 Language Code List: https://www.loc.gov/standards/iso639-2/php/code_list.php
  2. Felix Naumann, "Similarity measures" [DPDC_12_Similarity]
  3. Jens Bleiholder and Felix Naumann, "Data Fusion", ACM Computing Surveys, Vol. 41, No. 1, Article 1

slide-26
SLIDE 26

Thank You!