MORPH-II Dataset 1. Introduction to the Data 2. Inconsistencies in - PowerPoint PPT Presentation

MORPH-II Dataset: Summary and Cleaning
Garrett Bingham & Ben Yip
University of North Carolina Wilmington
June 16, 2017

Table of contents

1. Introduction to the Data
2. Inconsistencies in the Data
3. Cleaning the Data
4. New Datasets
5. Dirty Data
6. Conclusion

Introduction to the Data

MORPH-II: An Overview

The MORPH-II dataset is composed of mugshots of people from 16 to 77 years of age, with an average of 4 images per person. It is the largest longitudinal face image dataset publicly available. The academic version (which we use) contains roughly 55,000 images taken over 5 years, while the commercial version has about 202,000 images spanning 8 years.

See: https://ebill.uncw.edu/C20231_ustores/web/store_main.jsp?STOREID=4

MORPH-II: Metadata

This release (morph_2008_nonCommercial.csv) contains 11 variables:

id_num: 6-digit subject identifier
picture_num: subject photo number
dob: date of birth (mm/dd/yyyy)
doa: date of arrest (mm/dd/yyyy)
race: (B, W, A, H, O)
gender: (M or F)
facial_hair: not recorded (NULL)
age: integer age (⌊doa − dob⌋)
age_diff: time since last arrest (days)
glasses: not recorded (NULL)
photo: image filename

MORPH-II: Demographic Makeup

The MORPH-II dataset is a collection of 55,134 mugshots, including many of repeat offenders (providing valuable longitudinal data). The table below summarizes the demographic composition of the dataset.

Table 1: Number of Images by Gender and Race

          Black    White    Asian   Hispanic   Other    Total
Male     36,832    7,961      141      1,667      44   46,645
Female    5,757    2,598       13        102      19    8,489
Total    42,589   10,559      154      1,769      63   55,134

Note: This table was taken from the original MORPH Non-Commercial Release Whitepaper. After cleaning, the total number of images is the same but individual values may be slightly different.

MORPH-II: Summary Info

[Figure 1: Barplot of Images per Subject. x-axis: Number of Images; y-axis: Number of Individuals. Average: 4.049 images per individual.]

[Figure 2: MORPH-II Age Distribution. x-axis: Age; y-axis: Frequency.]

Table 2: Number of Distinct Individuals

Male             11,459
Female            2,159
Sum vs. Total    13,618 vs. 13,617

Inconsistencies in the Data

Inconsistencies in the Data

Repeat offenders have multiple entries in the MORPH-II dataset. Some people are recorded with more than one gender, race, and/or birthdate. This causes problems when trying to use the images to predict demographics.

Table 3: MORPH-II Inconsistencies by Attribute

Attribute    Number of People
Gender                      1
Race                       33
Birthdate               1,779
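A check of this kind can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the rows are made-up toy values, the helper name find_inconsistencies is hypothetical, and only the column names (id_num, gender, race, dob) come from the metadata slide.

```python
from collections import defaultdict

# Toy rows with hypothetical values; column names follow the metadata slide.
rows = [
    {"id_num": "000001", "gender": "M", "race": "W", "dob": "01/15/1970"},
    {"id_num": "000001", "gender": "M", "race": "B", "dob": "01/15/1970"},
    {"id_num": "000002", "gender": "F", "race": "W", "dob": "03/02/1980"},
    {"id_num": "000002", "gender": "F", "race": "W", "dob": "03/02/1981"},
    {"id_num": "000003", "gender": "M", "race": "B", "dob": "07/09/1965"},
]


def find_inconsistencies(rows, attributes=("gender", "race", "dob")):
    """Map each attribute to the sorted ids that have more than one distinct value."""
    seen = {attr: defaultdict(set) for attr in attributes}
    for row in rows:
        for attr in attributes:
            seen[attr][row["id_num"]].add(row[attr])
    return {
        attr: sorted(pid for pid, values in per_person.items() if len(values) > 1)
        for attr, per_person in seen.items()
    }


inconsistent = find_inconsistencies(rows)
for attr, ids in inconsistent.items():
    # One line per attribute: name, number of affected people, their ids.
    print(attr, len(ids), ids)
```

Run against the real metadata file, the per-attribute counts would be expected to correspond to the figures in Table 3, assuming the same definition of "inconsistent" is used.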

Cleaning the Data

Cleaning the Data: Gender

[Figure: six mugshots with their recorded gender labels: (a) Female, (b) Male, (c) Female, (d) Female, (e) Female, (f) Female.]

Cleaning the Data: Race

[Figure: example mugshots of two people with their recorded race labels. Person 1: (1a) White, (1b) Black, (1c) White. Person 2: (2a) Asian, (2b) White, (2c) Black.]

Person 1 has 24 images classified as White and 1 image classified as Black.

Cleaning the Data: Race

Each of the 33 people with inconsistent race was evaluated on a case-by-case basis. A final decision was made according to one of the following criteria:

Simple Majority: All images for a given person were assigned the race that appeared at least 50% of the time.

Visual Estimation: Each person's images were inspected one at a time. We decided the race only if there was a wide consensus among our team members.

Other: For some people (e.g. those of mixed race) it was difficult to guess their race from the photos, and there was substantial variation in the original dataset. We set the race of all images to Other.
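The simple-majority criterion can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name majority_label is hypothetical, and returning None when no single label qualifies (for example in an exact 50/50 tie) is an assumption standing in for the manual review described above.

```python
from collections import Counter


def majority_label(labels, threshold=0.5):
    """Return the label covering at least `threshold` of the entries.

    Returns None when no label, or more than one label, reaches the
    threshold; such cases would go to manual review (an assumption here).
    """
    counts = Counter(labels)
    total = len(labels)
    winners = [label for label, n in counts.items() if n / total >= threshold]
    return winners[0] if len(winners) == 1 else None


# Person 1 from the slides: 24 images labeled White, 1 labeled Black.
person_1 = ["W"] * 24 + ["B"]
print(majority_label(person_1))          # W
print(majority_label(["A", "W", "B"]))   # None (no label reaches 50%)
```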

Cleaning the Data: Birthdate

As with the race data, we were able to use a simple majority for 1524 of the 1779 people with inconsistent birthdates. However, the remaining 255 people posed additional problems. For some of them, their birthdates were in a multiway tie. For others, there was no majority, or their birthdates differed by several years. This made it difficult to choose one birthdate over another.

Cleaning the Data: Birthdate

For each person whose birthdates differed by no more than one year, we calculated the mean birthdate and assigned this date to all images. The remaining images were set aside as Not For Training.
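The birthdate decision described on the last two slides can be sketched as a three-way rule. Several details are assumptions made for illustration and are not stated in the slides: "majority" is taken as a strict majority of a person's records, "one year" as 365 days, and the mean is taken over all records and rounded to a whole day. The function name resolve_birthdate and the method labels are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def resolve_birthdate(dobs, max_spread_days=365):
    """Return (method, birthdate_or_None) for one person's recorded birthdates."""
    counts = Counter(dobs)
    top, top_n = counts.most_common(1)[0]
    # Assumption: "simple majority" means more than half of the records agree.
    if top_n * 2 > len(dobs):
        return "simple_majority", top
    earliest, latest = min(dobs), max(dobs)
    # Assumption: "no more than one year" is measured as 365 days.
    if (latest - earliest).days <= max_spread_days:
        mean_offset = sum((d - earliest).days for d in dobs) / len(dobs)
        return "average_birthdate", earliest + timedelta(days=round(mean_offset))
    return "not_for_training", None


print(resolve_birthdate([date(1970, 1, 15)] * 3 + [date(1971, 1, 15)]))
print(resolve_birthdate([date(1980, 3, 2), date(1980, 3, 12)]))
print(resolve_birthdate([date(1960, 5, 1), date(1965, 5, 1)]))
```

The three calls exercise the three outcomes in turn: a majority birthdate, an averaged birthdate for a two-way tie ten days apart, and a hold-out for a tie five years apart.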

Cleaning the Data: Birthdate

Table 4: Cleaned Data Summary

Solution             Number of People   Number of Images
Simple Majority                  1524               1906
Average Birthdate                  70                230
Not For Training                  185                515
Total                            1779               2651

New Datasets

New Datasets

After being cleaned, the data was divided into 3 new files:

morphII_cleaned_v2: This file is the same as morph_2008_nonCommercial.csv, but with dob, race, and gender inconsistencies corrected.

morphII_go_for_age: Individuals with uncorrectable birthdates were removed from the above dataset. This leaves all the images with consistent age information that are ready for training and testing age estimation models.

morphII_holdout_for_age: These are the images (mentioned above) with uncorrectable birthdates.

New Variables

Each of the new datasets also has two additional variables:

corrected: indicator (0-8)
age_dec: decimal age (doa − dob)

About corrected: The corrected column contains an indicator variable that records whether, and how, an observation was modified. Unchanged observations are labeled 0, while those that were corrected or marked for hold-out take a value between 1 and 8 depending on what was done to them.
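A decimal age of this kind could be derived from the two date columns roughly as follows. The slides do not say how days are converted to years, so the 365.25 divisor is an assumption, as are the helper names; the integer age is shown as the floor of the decimal age, matching the ⌊doa − dob⌋ notation in the metadata.

```python
import math
from datetime import datetime

DAYS_PER_YEAR = 365.25  # assumption: the slides do not state the divisor


def parse_date(text):
    """Parse a date in the mm/dd/yyyy format used by the dob and doa columns."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def decimal_age(dob, doa):
    """Age in years at arrest, as a decimal number."""
    return (parse_date(doa) - parse_date(dob)).days / DAYS_PER_YEAR


def integer_age(dob, doa):
    """Integer age: the floor of the decimal age."""
    return math.floor(decimal_age(dob, doa))


age_dec = decimal_age("01/15/1970", "07/16/2000")
age = integer_age("01/15/1970", "07/16/2000")
print(round(age_dec, 2), age)
```

Note that flooring days divided by 365.25 can differ by a day or so from a calendar-based age near birthdays; a calendar comparison of month and day would avoid that.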

New Datasets: Updated Info

Table 5: Cleaned Data - Number of Images by Gender and Race

          Black    White    Asian   Hispanic   Other    Total
Male     36,821    7,958      140      1,661      64   46,644
Female    5,756    2,590       13         99      32    8,490
Total    42,577   10,548      153      1,760      96   55,134

Table 6: Net Change in Number of Images by Gender and Race

          Black    White    Asian   Hispanic   Other    Total
Male        -11       -3       -1         -6     +20       -1
Female       -1       -8       -0         -3     +13       +1
Total       -12      -11       -1         -9     +33       -0
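The net changes are simply the cleaned counts minus the original counts, which can be checked mechanically. The sketch below transcribes the per-cell image counts from the original and cleaned demographic tables; the variable and function names are illustrative only.

```python
RACES = ["Black", "White", "Asian", "Hispanic", "Other"]

# Image counts by gender and race, transcribed from the original (whitepaper)
# table and from the cleaned-data table.
original = {
    "Male": [36832, 7961, 141, 1667, 44],
    "Female": [5757, 2598, 13, 102, 19],
}
cleaned = {
    "Male": [36821, 7958, 140, 1661, 64],
    "Female": [5756, 2590, 13, 99, 32],
}


def net_change(before, after):
    """Per-cell difference (after minus before) for each gender row."""
    return {g: [a - b for b, a in zip(before[g], after[g])] for g in before}


change = net_change(original, cleaned)
for gender, deltas in change.items():
    print(gender, dict(zip(RACES, deltas)), "row total:", sum(deltas))

# Cleaning relabels images but does not add or remove any.
total_change = sum(sum(deltas) for deltas in change.values())
print("overall change:", total_change)
```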

New Datasets: Updated Info

Table 7: Original Data - Number of Distinct Individuals

          Black    White    Asian   Hispanic   Other    Total
Male       8838     2070       49        517      15    11489
Female     1494      634        6         30       5     2169
Total     10332     2704       55        547      20    13658

Table 8: Cleaned Data - Number of Distinct Individuals

          Black    White    Asian   Hispanic   Other    Total
Male       8829     2056       47        507      19    11458
Female     1491      628        4         28       8     2159
Total     10320     2684       51        535      27    13617

Dirty Data

Dirty Data: Examples of Research on Uncleaned MORPH-II

G. Guo and G. Mu. Simultaneous dimensionality reduction and human age estimation via kernel partial least squares regression. In CVPR 2011, pages 657–664, June 2011.

K. H. Liu, S. Yan, and C. C. J. Kuo. Age estimation via grouping and decision fusion. IEEE Transactions on Information Forensics and Security, 10(11):2408–2423, Nov 2015.

X. Wang, V. Ly, G. Lu, and C. Kambhamettu. Can we minimize the influence due to gender and race in age estimation? In 2013 12th International Conference on Machine Learning and Applications, volume 2, pages 309–314, Dec 2013.

D. H. P. Yassin, S. Hoque, and F. Deravi. Age sensitivity of face recognition algorithms. In 2013 Fourth International Conference on Emerging Security Technologies, pages 12–15, Sept 2013.

This is merely a sampling. Many other articles exist that used an uncleaned version of MORPH-II.

Dirty Data: Consequences of Using Uncleaned MORPH-II

There will likely not be an enormous impact on model performance for gender or race prediction, because the number of gender and race inconsistencies is small.

Age estimation models will see a drop in overall performance, manifested as a higher Mean Absolute Error (MAE).

For some people in the dataset, their birthdates vary enough that their age decreases with time. This will significantly affect models concerned with age progression.
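One way to surface the "age decreases with time" cases is to sort each subject's records by arrest date and look for a drop in recorded age. The sketch below uses made-up records and a hypothetical helper name; it illustrates the check and is not the authors' method.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (id_num, date of arrest, recorded integer age).
records = [
    ("000001", date(2003, 5, 1), 34),
    ("000001", date(2005, 6, 1), 31),  # younger at a later arrest
    ("000002", date(2004, 1, 10), 22),
    ("000002", date(2006, 2, 11), 24),
]


def subjects_with_decreasing_age(records):
    """Return sorted ids whose recorded age drops between consecutive arrests."""
    by_subject = defaultdict(list)
    for id_num, doa, age in records:
        by_subject[id_num].append((doa, age))
    flagged = []
    for id_num, visits in by_subject.items():
        visits.sort()  # chronological order by date of arrest
        ages = [age for _, age in visits]
        if any(later < earlier for earlier, later in zip(ages, ages[1:])):
            flagged.append(id_num)
    return sorted(flagged)


print(subjects_with_decreasing_age(records))  # ['000001']
```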

Conclusion

Conclusion: Clean Data Matters

Cleaning the data before doing research is vital. This preserves not only the accuracy of one's results but also their integrity. Many researchers build on previous results, which makes it even more important to ensure that one's own work is accurate.
