MDP Project #6 Designing Methods for Large-Scale Database Transcription and Digitization: The Robert E. MacLaury Color Categorization Archive
Our Team: Students
Our Team: Mentors
What Did We Do?
The Robert E. MacLaury Mesoamerican Color Survey • Conducted from 1978 to 1981 • 900 speakers • 116 languages
Raw Handwritten Datasheet
OPTICAL CHARACTER RECOGNITION APPROACH Yang Jiao and Ram Bhakta
Optical Character Recognition • Automatic identification • Recognize optically processed characters • Convert documents into editable and searchable data
OCR Areas (Eikvil, 1993)
Components of OCR (Eikvil, 1993)
Our Challenge
Our Challenge Pw
Approach • Text: Pw • Confidence level: 52
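One way a per-item confidence score like the one above could be used is to separate OCR output that can be accepted from output that needs human transcription. The sketch below is illustrative only: the slides do not say the team routed results this way, and the function name, the cell identifiers, the second sample reading, and the threshold of 60 are all assumptions.

```python
def route_by_confidence(ocr_results, threshold=60):
    """Split OCR readings into accepted and needs-review lists.

    ocr_results: iterable of (cell_id, text, confidence) tuples, where
    confidence is assumed to be on a 0-100 scale.
    threshold: hypothetical cutoff, not a value taken from the project.
    """
    accepted, needs_review = [], []
    for cell_id, text, confidence in ocr_results:
        if confidence >= threshold:
            accepted.append((cell_id, text))
        else:
            # Low-confidence readings would go to human transcribers.
            needs_review.append((cell_id, text))
    return accepted, needs_review


# "Pw" at confidence 52 is the reading shown on the slide; the second
# row is made up for illustration.
sample = [("cell_1", "Pw", 52), ("cell_2", "K", 91)]
accepted, needs_review = route_by_confidence(sample)
```

With this assumed threshold, the slide's reading of "Pw" at confidence 52 would be sent for human review rather than accepted automatically.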
Our Results [Figure legend: More likely, Less likely]
CROWDSOURCING APPROACH DESIGN & IMPLEMENTATION Stephanie Chang
Crowdsourcing
Amazon Mechanical Turk
What is Crowdsourcing?
http://crowdsource-mcswebsite.rhcloud.com/
Flow of Information: Raw Data → Transcription → Verification
CROWDSOURCING APPROACH EMPIRICAL ANALYSES Prutha Deshpande
Surowiecki, 2004
Problem of Data Aggregation (Lee et al., 2014)
Our Approach to Data Aggregation Cultural Consensus Theory (CCT) ● Family of computational models ● Informants share cultural knowledge - Ethnographic research application ● Correct answer not known a priori
Cultural Consensus Analyses Advantages ● Predict informant competency ● Estimate homogeneity of responses ● Estimate “correct” answers
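The idea of estimating informant competency and the "correct" answers together, without a known answer key, can be illustrated with a much simpler scheme than CCT. The sketch below is not the Bayesian CCT model used in the project: it is a plain iterative weighted vote, and the function name, item labels, worker labels, and sample transcriptions are all hypothetical.

```python
from collections import defaultdict


def aggregate(responses, n_iter=10):
    """Jointly estimate an answer key and per-worker agreement weights.

    responses: dict mapping item -> {worker: answer}.
    Returns (key, weight): the estimated answer per item and, per worker,
    the share of their answers that match the estimated key.
    """
    workers = {w for answers in responses.values() for w in answers}
    weight = {w: 1.0 for w in workers}  # start with equal trust
    key = {}
    for _ in range(n_iter):
        # Step 1: weighted vote per item, ties broken alphabetically.
        for item, answers in responses.items():
            score = defaultdict(float)
            for worker, answer in answers.items():
                score[answer] += weight[worker]
            key[item] = max(sorted(score), key=score.get)
        # Step 2: re-estimate each worker's weight from agreement with the key.
        for worker in workers:
            given = [(i, a[worker]) for i, a in responses.items() if worker in a]
            agree = sum(1 for i, ans in given if key[i] == ans)
            weight[worker] = agree / len(given)
    return key, weight


# Made-up transcriptions of three datasheet cells by three workers.
responses = {
    "t1": {"a": "Pw", "b": "Pw", "c": "Pu"},
    "t2": {"a": "K", "b": "K", "c": "K"},
    "t3": {"a": "Rj", "b": "Ri", "c": "Ri"},
}
key, weight = aggregate(responses)
```

Here worker "b" agrees with the estimated key on every item and so ends up with the highest weight. A full CCT model replaces these ad hoc weights with a probabilistic account of competency and response homogeneity.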
Does CCT work for our data?
Does CCT work for our data? Pilot Study ● 30 human subject pool participants ● Two implementations of CCT - Standard Bayesian CCT model - Alternative CCT model
Does CCT work for our data? Yes! We found CCT to be an appropriate model for aggregating our crowdsourced data.
CCT Statistical Output (Oravecz et al., 2014)
CCT Transcription Solutions
CCT Transcription Solutions → Archive
DATABASE AND TIKI-WIKI Nathan Benjamin and Zhimin Xiang
Why Build a Database? • Accessibility • Why not simply put photocopies of every page from the archive online?
Why Build a Database? A Relational Database: • Define data type relationships • Exclude superfluous results • Remain extensible
The New Problem Relational databases are difficult to traverse without advanced knowledge, both structurally and syntactically. For example:
SELECT * FROM Experiment INNER JOIN Language ON Experiment.language# = Language.language# WHERE Language.langName = 'Korean' ORDER BY Experiment.ExperimentID;
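To make the join in that query concrete, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The table and column names come from the slide's query, but the column types, the sample rows, and the helper name `experiments_for_language` are assumptions. The project's actual schema and database engine are not stated in the slides. The `language#` column is quoted because SQLite does not accept `#` in an unquoted identifier.

```python
import sqlite3


def experiments_for_language(conn, lang_name):
    """Return (ExperimentID, langName) rows for one language, ordered by ID."""
    query = """
        SELECT Experiment.ExperimentID, Language.langName
        FROM Experiment
        INNER JOIN Language
            ON Experiment."language#" = Language."language#"
        WHERE Language.langName = ?
        ORDER BY Experiment.ExperimentID
    """
    # A bound parameter avoids quoting mistakes in the language name.
    return conn.execute(query, (lang_name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Language (
        "language#" INTEGER PRIMARY KEY,
        langName TEXT
    );
    CREATE TABLE Experiment (
        ExperimentID INTEGER PRIMARY KEY,
        "language#" INTEGER REFERENCES Language("language#")
    );
    -- Made-up rows for illustration only.
    INSERT INTO Language VALUES (1, 'Korean'), (2, 'English');
    INSERT INTO Experiment VALUES (10, 1), (11, 2), (12, 1);
""")
rows = experiments_for_language(conn, "Korean")
```

Wrapping the query in a function is the kind of step a front end can hide from users, so that they pick a language name rather than write the join themselves.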
The Solution: We surveyed nine separate web frameworks and content management systems.
The Solution: We chose Tiki-Wiki:
● CMS (Content Management System)
  - Version control
  - User control
  - File access control
● Wiki
  - Searchable
  - Provides extensible structure for explanation of project and associated data
  - Allows for public web access
● Open source
  - Provides flexibility
  - Access to online mapping databases
This allowed us to add features and functionality that would inject momentum into research in this area.
In Conclusion Color Categorization Archive (ColCat)
Thank you! Any questions or suggestions?
Coming Soon to: ColCat.Calit2.uci.edu
Contact us at: colcat@calit2.uci.edu
Students: N. Benjamin, H. Bhakta, S. Chang, P. Deshpande, Y. Jiao, Z. Xiang
Mentors: S. Gago, PhD; I. Harris, PhD; K. Jameson, PhD; S. Tauber, PhD
Support for the archive project provided by:
● The Multidisciplinary Design Program, 2014-2015
● The University of California Pacific Rim Research Program, 2010-2015 (K. A. Jameson, PI)
● The National Science Foundation, 2014-2017 (#SMA-1416907, K. A. Jameson, PI)