Dan Goldberg GIS Research Laboratory Department of Computer Science - PowerPoint PPT Presentation

Presented to: Faculty of the Computer Science Department of the University of Southern California 04-01-2010 Dan Goldberg GIS Research Laboratory Department of Computer Science University of Southern California https://webgis.usc.edu 1

(Very) Brief Background
Locational descriptions: USC GIS Research Laboratory; 3620 South Vermont Ave, Los Angeles, CA; Kaprielian Hall, Room 444, Los Angeles, CA 90089-0255
Geographic representations
Spatio-Temporal Analyses 2

Motivations • Error introduction/propagation in epidemiological research
[Figure: relative error magnitude propagation. Stages shown: address data, geocode, calculation of spatially referenced values, spatial analysis, conclusions (hot spots). Errors shown: incomplete/incorrect address data, inaccurate location, incorrect exposure assignment, invalid association, misguided actions.]

Motivations • Exposure misclassification from inaccurate geocoding
[Figure labels: point source, distribution area, zip code 1, zip code 2, address range geocode, zip centroid geocode, misclassified exposed, misclassified unexposed.] 4

Motivations • Accessibility mischaracterization from inaccurate geocoding
[Figure labels: zip code 1, zip code 2, address range geocode, zip centroid geocode, true shortest path, false shortest path.]
The error from geocoding can be larger than the distance traveled. 5

Motivations • All geocodes with same “quality” do not have the same accuracy or certainty
NAACCR 2: Parcel Centroid. Variants: Bounding box, Geometric, Weighted
NAACCR 3: Street Address. Variants: Address range, Uniform lot, Actual lot
• Qualities of the feature interpolation matter 6

Motivations • All geocodes with same “quality” do not have the same accuracy or certainty
[Figure: ZIP code areas shown at different map scales: 90089 (~1:10,000 scale), 90011 (~1:60,000 scale), 90275 (~1:300,000 scale).]
• Qualities of the reference features matter 7

Motivations – 3620 S. Vermont Ave, Los Angeles CA 90089-0255
GEOCODE: 34.021906,-118.290385
Accuracy = ??
Match rate of geocoder used = ??
Spatial uncertainty of this geocode = ??
Reference data used to produce this geocode = ??
Interpolation assumptions used to produce this geocode = ??
Average spatial uncertainty for other geocodes in the area = ?? 8

Theoretical and Technical Contributions
1. A theoretical and practical framework for developing, testing, and evaluating geocoding techniques.
2. A derivation of the sources and scales of potential spatial error and uncertainty.
3. A spatially-varying neighborhood metric to dynamically score nearby candidate reference features.
4. A method to combine multiple layers of reference features using uncertainty-, gravitationally-, and topologically-based approaches to derive the most likely candidate region.
5. A rule- and neighborhood-based tie-breaking strategy that deduces correct candidate selection using the relationships between, and the regions surrounding, ambiguous candidate reference features. 9

A Theoretical Framework for Geocoding Research How can we model the geocoding process to facilitate an extensible system for describing and reducing spatial uncertainty and error? 10

Theoretical Framework
Input Data: 3620 South Vermont Avenue
Normalization/Standardization Algorithms: Transform input to match reference data format. Result: 3620 S VERMONT AVE
Matching Algorithms: Find a matching geographic feature in reference data. Query against Reference Data: SELECT FromX, FromY, ToX, ToY FROM SOURCE WHERE (Start <= 3620 AND End >= 3620) AND (Pre = 'S') AND (Name = 'VERMONT') AND (Suffix = 'AVE')
Interpolation Algorithms: Use matched geographic feature to derive output
Output Data: Output Point = (20% * X, 20% * Y) 11
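
The matching step of this framework can be sketched in Python as a filter over reference street segments. This is a minimal illustration, not the presenter's implementation: the `StreetSegment` type, the `SOURCE` rows, and all coordinates are made-up assumptions, and the range test is written so that the address number must fall between the segment's start and end numbers.

```python
from dataclasses import dataclass


@dataclass
class StreetSegment:
    """Hypothetical reference feature; field names loosely follow the slide's query."""
    start: int            # low end of the address range
    end: int              # high end of the address range
    pre: str              # pre-directional, e.g. "S"
    name: str             # street name
    suffix: str           # street type, e.g. "AVE"
    from_xy: tuple        # (x, y) of the segment start; made-up values
    to_xy: tuple          # (x, y) of the segment end; made-up values


# Made-up reference rows, for illustration only.
SOURCE = [
    StreetSegment(3500, 3598, "S", "VERMONT", "AVE", (0.0, 0.0), (0.0, 100.0)),
    StreetSegment(3600, 3698, "S", "VERMONT", "AVE", (0.0, 100.0), (0.0, 200.0)),
    StreetSegment(3600, 3698, "S", "WESTERN", "AVE", (50.0, 100.0), (50.0, 200.0)),
]


def match(number, pre, name, suffix, source=SOURCE):
    """Return every segment whose address range contains `number`
    and whose street attributes equal the normalized input."""
    return [
        seg for seg in source
        if seg.start <= number <= seg.end
        and seg.pre == pre
        and seg.name == name
        and seg.suffix == suffix
    ]


# Normalized input from the slide: 3620 S VERMONT AVE
candidates = match(3620, "S", "VERMONT", "AVE")
```

With these sample rows, only the 3600-3698 block of S VERMONT AVE is returned; an address number outside every range returns an empty list, which is the "no match" case.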

Component: Input Data (Error Contribution)
Many different types, forms, and formats:
- Street Addresses: 3620 South Vermont Ave
- Postal Codes: Los Angeles, CA 90089-0255
- Named Places: USC Kaprielian Hall
- Intersections: Vermont & 36th Place
- Relative Descriptions: b/w Bakersfield & Shafter
Different levels of information/certainty:
- Street Addresses: Somewhere on street
- Postal Codes: Somewhere on postal route
- Named Places: Absolute location
- Intersections: Somewhere near intersection
- Relative Descriptions: Somewhere near locations
Incompleteness and inaccuracy (examples as shown on the slide): 3260 S Vermont ___; 3620 _ Vermont Ave; ____ _ Vermont Ave; 3620 S Verment Ave; 362_ S Vermont ___; 3260 _ Vermont St 12

Component: Input Data Cleaning (Error Contribution)
- Parsing – Separating components of the address
  Token-Based: relies on formatting
- Normalization – Identifying components of the address
  Substitution-Based: relies on the token ordering
  Context-Based: relies on position and schema knowledge
  Probability-Based: relies on likelihood of occurrence
- Standardization – Formatting components of the address
  Schema mapping: must exist for all reference sources
Examples (Street Address | City | Zip):
3620 South Vermont Ave | Los Angeles | 90089
90089 St Los Angeles St | Los Angeles | 90089
23 E South St | South Los Angeles | 90089 13
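
A minimal Python sketch of token-based parsing followed by substitution-based normalization is given here, assuming a tiny hypothetical `SUBSTITUTIONS` table (a production geocoder would use a far larger one, and this is not the presenter's code). It also shows why the slide lists context-based and probability-based methods: blind substitution damages an address such as 23 E South St, where "South" is the street name.

```python
# Hypothetical substitution table, for illustration only.
SUBSTITUTIONS = {
    "SOUTH": "S", "NORTH": "N", "EAST": "E", "WEST": "W",
    "AVENUE": "AVE", "STREET": "ST",
}


def parse(address):
    """Token-based parsing: relies purely on formatting (whitespace, commas, periods)."""
    cleaned = address.replace(",", " ").replace(".", " ")
    return cleaned.upper().split()


def normalize(tokens):
    """Substitution-based normalization: replace each token that has a known
    abbreviation, without looking at its position or neighbors."""
    return [SUBSTITUTIONS.get(token, token) for token in tokens]


# Works as intended for the slide's running example.
good = " ".join(normalize(parse("3620 South Vermont Avenue")))

# Fails when a directional word is really the street name:
# "SOUTH" is abbreviated even though it is the name of the street.
bad = normalize(parse("23 E South St"))
```

Here `good` is "3620 S VERMONT AVE", while `bad` becomes ["23", "E", "S", "ST"], losing the street name. A context-based normalizer would use token position or a schema to decide that the third token is a name rather than a directional.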

Component: Matching Algorithms (Error Contribution)
- Multiple Match Types – Feature selected from reference set
  Exact: A single perfect match
  Non-exact: A single non-perfect match
  Exact ambiguous: Multiple perfect matches
  Non-exact ambiguous: Multiple non-perfect matches
  None: No matches
- Multiple Matching Methods – Ways of selecting features
  Deterministic: Rule-based, iterative
  Probabilistic: Likelihood-based, attribute weighting
- Multiple Fuzzifying Techniques – Alter input data
  Word Stemming: Porter Stemmer
  Phonetic Algorithms: Soundex
  Attribute Relaxation: Remove attributes and retry match
- Multiple Scoring Methods – Compute a candidate score
  Relative attribute weighting
  Match-Unmatch weighting 14
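
Deterministic matching with attribute relaxation, together with the match types listed on this slide, can be sketched as follows. The `REFERENCE` rows, the relaxation order, and the rule that any match found after dropping an attribute counts as "non-exact" are all assumptions made for illustration.

```python
# Made-up reference features, for illustration only.
REFERENCE = [
    {"pre": "S", "name": "VERMONT", "suffix": "AVE"},
    {"pre": "N", "name": "VERMONT", "suffix": "AVE"},
    {"pre": "S", "name": "WESTERN", "suffix": "AVE"},
]


def match_with_relaxation(query, relax_order=("suffix", "pre")):
    """Try a full-attribute match first; on failure, drop one attribute at a
    time (in `relax_order`, a hypothetical choice) and retry.

    Returns (match_type, hits, dropped_attributes).
    """
    attrs = ["name", "pre", "suffix"]
    dropped = []
    for attr_to_drop in (None, *relax_order):
        if attr_to_drop is not None:
            attrs.remove(attr_to_drop)
            dropped.append(attr_to_drop)
        hits = [ref for ref in REFERENCE
                if all(ref[a] == query[a] for a in attrs)]
        if hits:
            # Assumption: a match found after relaxation is "non-exact".
            kind = "exact" if not dropped else "non-exact"
            if len(hits) > 1:
                kind += " ambiguous"
            return kind, hits, dropped
    return "none", [], dropped


exact = match_with_relaxation({"pre": "S", "name": "VERMONT", "suffix": "AVE"})
relaxed = match_with_relaxation({"pre": "S", "name": "VERMONT", "suffix": "ST"})
ambiguous = match_with_relaxation({"pre": "E", "name": "VERMONT", "suffix": "ST"})
missing = match_with_relaxation({"pre": "S", "name": "FIGUEROA", "suffix": "ST"})
```

Returning the list of dropped attributes alongside the hits keeps a record of which decisions were made during matching, so the caller can tell a clean match from one that needed relaxation.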

Component: Reference Data (Error Contribution)
- Multiple Data Types
  Point-based: ZCTA and Place Centroids
  Linear-based: Street Centerlines
  Areal Unit-based: Parcels, ZCTA and Place Boundaries
- Wide spectrum of accuracies/completeness
  Commercial vs. Public
  - Attribute accuracy – spatial and non-spatial
  - Attribute completeness – spatial and non-spatial
  - Feature complexity – simple vs. polylines
  Local Scale vs. National Scale
  - Census Place Boundaries vs. Local Neighborhoods
- Wide spectrum of cost/availability
  Free vs. Costly: TIGER/Line vs. TeleAtlas
  Available vs. Not: Address points – CA vs. N. Carolina
[Figure: low resolution reference street vs. high resolution reference street.] 15

Component: Interpolation Algorithms (Error Contribution)
- Many methods of interpolation
  Depend on reference feature type
  Depend on info available (assumptions)
[Figure: interpolation geometry, with axes X and Y and offsets X*d and Y*d.] 16
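
One common interpolation method for linear reference features is address-range interpolation, sketched here in Python. It assumes that address numbers are spread evenly along the matched street segment, which is exactly the kind of assumption the slide warns about; the function name, the 3600-3700 range, and the coordinates are hypothetical.

```python
def interpolate(number, start, end, from_xy, to_xy):
    """Place `number` along a street segment in proportion to its position
    in the address range [start, end].

    Assumes addresses are evenly distributed along the segment.
    """
    low, high = min(start, end), max(start, end)
    if not low <= number <= high:
        raise ValueError("address number is outside the segment's range")
    if end == start:
        fraction = 0.5  # degenerate range: fall back to the segment midpoint
    else:
        fraction = (number - start) / (end - start)
    x = from_xy[0] + fraction * (to_xy[0] - from_xy[0])
    y = from_xy[1] + fraction * (to_xy[1] - from_xy[1])
    return x, y


# 3620 on a hypothetical 3600-3700 block lies 20% of the way along the segment.
x, y = interpolate(3620, 3600, 3700, (0.0, 0.0), (100.0, 50.0))
```

For this made-up segment the result is approximately (20, 10), that is, 20% of the way from the start point to the end point. A parcel-based method would instead return a point inside the matched parcel, so the same input address can yield different outputs depending on the reference feature type.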

Component: Interpolation Algorithms (Error Contribution)
- Lack of Process Transparency
  - Nothing reported about the decisions made or alternatives
- Output Data Type: Only Geographic Coordinates
  - Lose data required for determining true accuracy
- Output Accuracy: Feature Match Type + Probability
  - Nothing that indicates direction
  - Nothing that indicates distance
  - Nothing that indicates certainty area or surface 17

A Spatially-Varying Block-Distance Candidate Scoring Approach Can nearby candidate reference features be used to overcome inaccuracies and incompleteness in reference data sources? 18

Spatially-Varying Block-Distance Feature Scoring - Motivation
Problems:
1) Address ranges in reference data files are often inaccurate
2) Leads to false negative non-matches
3) Results in reversion to lower level geographic matches
Example: 9800 View Ave, Seattle WA 98117. The address range doesn’t exist, so the match reverts to ZIP 98117. 19

Spatially-Varying Block-Distance Feature Scoring - Intuition
A better approach:
1) Proportionally weight the closest reference features by their distance away in number of blocks
2) Choose the reference feature with the highest score within the search radius threshold (max number of blocks away)
Intuitions:
1) If we exclude the address number from the matching algorithm, we will have a large candidate set of all streets in the region with the correct name and regional attributes (ZIP, city), differing only by their address ranges
2) We can score them based on how many blocks they are from the input address
Example: the 9300-9400 block of View Ave is ~4 blocks away from 9800 View Ave 20
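
The block-distance idea can be sketched in Python under two loudly stated assumptions that are not in the slides: each block spans 100 address numbers, and the score falls off linearly with the number of blocks. The function names and the candidate ranges other than 9300-9400 are hypothetical.

```python
def blocks_away(number, start, end, block_size=100):
    """Distance, in blocks, from `number` to the nearest end of the
    candidate address range (0 if the range contains the number).
    Assumes `block_size` address numbers per block."""
    if start <= number <= end:
        return 0.0
    nearest = start if number < start else end
    return abs(number - nearest) / block_size


def score(number, start, end, max_blocks=5, block_size=100):
    """Linear score in (0, 1]; None if the candidate lies beyond the
    search radius of `max_blocks` blocks."""
    distance = blocks_away(number, start, end, block_size)
    if distance > max_blocks:
        return None
    return 1.0 - distance / (max_blocks + 1)


def best_candidate(number, candidates, max_blocks=5, block_size=100):
    """Pick the (start, end) range with the highest score, or None if no
    candidate falls within the search radius."""
    best, best_score = None, 0.0
    for start, end in candidates:
        s = score(number, start, end, max_blocks, block_size)
        if s is not None and s > best_score:
            best, best_score = (start, end), s
    return best


# Candidate ranges for View Ave after excluding the address number from matching.
view_ave = [(9000, 9100), (9300, 9400), (8000, 8100)]
chosen = best_candidate(9800, view_ave)
```

With these inputs the 9300-9400 block is 4 blocks from 9800 and is the only candidate inside a 5-block radius, so it is chosen instead of reverting to the ZIP-level match. Shrinking `max_blocks` to 3 leaves no candidate in range, and the function returns None.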
