Effective Scalable and Integrative Geocoding for Massive Address Datasets
Department of Computer Science, Stony Brook University Sina Rashidian, Xinyu Dong, Amogh Avadhni, Prachi Poddar, Fusheng Wang
November 2017
Introduction
Background Open Data Sources Integrative Geocoding Results
– Accessibility of large-scale open data
– Health data is widely accessible through government open data initiatives, geo-crowdsourcing, and social media
– Low-resolution spatial data (e.g., county or zip code) is more common
– Lack of high-resolution spatial data
– Lack of efficient and scalable methods
– Patients’ privacy in public health studies
– Protected Health Information (PHI)
– Health Insurance Portability and Accountability Act (HIPAA)
– Geocoding 20M records costs about 10K USD
– Daily or monthly transaction limits: 1M per month or 100K per day
– Geocoding 20M records at 100K per day → 200 days!
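The 200-day figure follows directly from the quota arithmetic. A minimal sketch of that calculation, assuming the quota is the only bottleneck (the function name is illustrative, not from the slides):

```python
import math

def periods_needed(total_records: int, quota_per_period: int) -> int:
    """Whole quota periods (days or months) needed to submit every record."""
    return math.ceil(total_records / quota_per_period)

# 20M records under the two limits quoted on the slide
days_at_daily_quota = periods_needed(20_000_000, 100_000)        # 100K per day
months_at_monthly_quota = periods_needed(20_000_000, 1_000_000)  # 1M per month
print(days_at_daily_quota, "days;", months_at_monthly_quota, "months")
```

Under the monthly limit the same workload takes 20 months, which is even slower than the daily limit.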
– Free of charge → suitable for academic use
– Scalable and fast → supports a high volume of input data
– Accurate and robust → results must be reliable
– Local → respects data sensitivity, such as patients’ privacy
– Lack of a free, complete, accurate, and reliable reference
– Free data sources can be noisy and incomplete
– Different data sources do not share the same set of features
– When there are multiple possible answers, which one is better?
Background
– Parsing
– Searching
techniques
[Figure: classic geocoding pipeline. Parsing: Raw Address → Tokenizing → Tokenized Address → Cleaning → Clean Tokens. Searching: Clean Tokens → Build Query → DataBase → Is valid? → Answer.]
Raw address: 12-11 North Stony Brook Road, Stony Brook, NY, 11794
Tokenized address: 12, 11, North, Stony Brook, Rd, Stony Brook, NY, 11794
Clean tokens: 12, 11, n, stony brook, rd, stony brook, ny, 11794
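The parsing stage can be sketched in a few lines. This is an illustrative approximation, not the presenters' parser: the abbreviation tables, the regular expression, and the function names are assumptions chosen so that the slide's example address produces the tokens shown.

```python
import re

# Hypothetical lookup tables; a real parser would need far more entries.
STREET_TYPES = {"road": "Rd", "street": "St", "avenue": "Ave", "lane": "Ln"}
DIRECTIONS = {"north": "n", "south": "s", "east": "e", "west": "w"}

def tokenize(raw_address):
    """Split a comma-separated address into number, direction, name, type, and the rest."""
    fields = [f.strip() for f in raw_address.split(",") if f.strip()]
    if not fields:
        return []
    street, rest = fields[0], fields[1:]
    tokens = []
    # A leading house number may be hyphenated, e.g. "12-11".
    match = re.match(r"^(\d+)(?:-(\d+))?\s+(.*)$", street)
    if match:
        tokens.extend(n for n in match.group(1, 2) if n)
        words = match.group(3).split()
    else:
        words = street.split()
    if words and words[0].lower() in DIRECTIONS:
        tokens.append(words.pop(0))
    street_type = None
    if len(words) > 1 and words[-1].lower() in STREET_TYPES:
        street_type = STREET_TYPES[words.pop().lower()]
    if words:
        tokens.append(" ".join(words))
    if street_type:
        tokens.append(street_type)
    return tokens + rest

def clean(tokens):
    """Lowercase every token and abbreviate direction words."""
    return [DIRECTIONS.get(t.lower(), t.lower()) for t in tokens]

tokens = tokenize("12-11 North Stony Brook Road, Stony Brook, NY, 11794")
print(tokens)
print(clean(tokens))
```

The cleaning step here would also abbreviate a city that happens to be named "North"; a fuller parser would clean each field according to its role.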
Open Data Sources
– Topologically Integrated Geographic Encoding and Referencing (TIGER)
– Cons: missing city information; based on address ranges → interpolation error
– Tax Parcels
– New York Street and Address Maintenance Program (SAM)
– OpenStreetMap (OSM)
– OpenAddresses
– Cons: incompleteness, partial information, messiness
Integrative Geocoding
– Parsing
– Oriented Searching
– Intelligent Selection
processing task
geocoder (EaserGeocoder)
dataset’s characteristics
– E.g., TIGER does not have city names
– Increasing efficiency
– More accurate results
– Finding the nearest matches instead of an exact match
– Expanding the scope iteratively
[Figure: oriented searching flowchart, repeated for each data source (TIGER, OpenAddresses, Tax Parcels, SAM). Boxes: Generate Specific Query, Search Engine, Candidate Result, Relax Query. Decision: "Acceptable?" (shown as "Close enough?" for SAM).]
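The search-then-relax loop in the flowchart can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory REFERENCE table, its coordinates, the exact-match `search` function, and the order in which fields are dropped are all hypothetical, since the slides do not specify how queries are relaxed.

```python
# Toy reference table standing in for one data source (all values hypothetical).
REFERENCE = [
    {"number": "12", "street": "stony brook rd", "city": "stony brook",
     "zip": "11794", "lat": 40.9, "lon": -73.1},
]

def search(query):
    """Return the first reference record matching every field in the query."""
    for record in REFERENCE:
        if all(record.get(field) == value for field, value in query.items()):
            return record
    return None

def oriented_search(address, source_fields, relax_order, is_acceptable):
    """Query one source, dropping one field at a time until a candidate is accepted."""
    # Source-specific query: keep only the fields this source actually provides.
    query = {f: v for f, v in address.items() if f in source_fields}
    for field_to_drop in [None] + list(relax_order):
        if field_to_drop is not None:
            query.pop(field_to_drop, None)  # relax the query by one field
        if not query:
            break  # nothing left to search on
        candidate = search(query)
        if candidate is not None and is_acceptable(address, candidate):
            return candidate
    return None

def same_street(address, candidate):
    return address["street"] == candidate["street"]

# The zip code is wrong, so the first query fails and the relaxed one succeeds.
address = {"number": "12", "street": "stony brook rd",
           "city": "stony brook", "zip": "11790"}
print(oriented_search(address, {"number", "street", "zip"}, ["zip", "number"], same_street))
```

Passing `source_fields` without "city" mirrors the TIGER case, where the reference has no city name to match against.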
– Perfect match from TIGER, spatial error ~450m
– Partial match from OpenAddresses, spatial error ~350m

– Perfect match from TIGER, spatial error ~50m
– Partial match from OpenAddresses, spatial error ~30km!
answer
– Learning small predictive models
– Decision trees
– Learning a model for predicting the best one
– Classify based on an acceptance threshold
– Some candidates are more correct!
– Considering spatial error
– 3 classes between 0 and the threshold
– Choosing the nearest class
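One possible reading of the class scheme above, sketched under loud assumptions: the slides do not say how the three classes are bounded, so equal-width bins are assumed here, an extra "rejected" class is assumed for errors at or beyond the threshold, and "nearest class" is taken to mean the class closest to zero error. The function names and the 100 m threshold are illustrative only.

```python
def error_class(spatial_error_m, threshold_m, n_classes=3):
    """Map a spatial error to a class label.

    Labels 0..n_classes-1 cover [0, threshold) in equal-width bins (an assumption);
    label n_classes means the candidate is beyond the acceptance threshold.
    """
    if spatial_error_m < 0:
        raise ValueError("spatial error cannot be negative")
    if spatial_error_m >= threshold_m:
        return n_classes
    bin_width = threshold_m / n_classes
    return min(int(spatial_error_m // bin_width), n_classes - 1)

def pick_best(predicted_classes, n_classes=3):
    """Pick the source whose predicted class is nearest to zero error, if any is acceptable."""
    acceptable = {s: c for s, c in predicted_classes.items() if c < n_classes}
    if not acceptable:
        return None
    return min(acceptable, key=acceptable.get)

# Training labels would come from known errors; here the threshold is hypothetical.
print([error_class(e, 100) for e in (10, 50, 99, 150)])
print(pick_best({"TIGER": 2, "SAM": 0, "OpenAddresses": 3}))
```

With labels like these, a model trained per source predicts a class for each candidate, and the selection step reduces to comparing predicted classes.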
Results
Type            Name            Google %  Here %  MapQuest %  Consolidation %
Commercial      Google          -         -       96.55       99.10
Commercial      Here            94.88     -       -           99.26
Commercial      MapQuest        94.52     97.16   -           -
Commercial      GeoServices     94.68     95.30   94.71       96.43
Non-Commercial  EaserGeocoder   96.46     97.12   95.82       97.93
Non-Commercial  Nominatim       54.74     54.15   54.08       55.68
Non-Commercial  Geonames        82.65     83.20   83.53       84.40
Non-Commercial  DataSciToolkit  89.05     89.50   89.52       90.71
Type            Name            Google (m)  Here (m)  MapQuest (m)  Consolidation (m)
Commercial      Google          -           -         55.52         18.37
Commercial      Here            24.29       -         -             17.67
Commercial      MapQuest        55.85       55.20     -             -
Commercial      GeoServices     35.75       31.08     51.16         29.02
Non-Commercial  EaserGeocoder   31.08       26.85     53.30         26.90
Non-Commercial  Nominatim       65.55       63.88     57.98         58.09
Non-Commercial  Geonames        93.38       91.78     75.32         83.58
Non-Commercial  DataSciToolkit  87.42       85.47     71.36         77.39
Number of Addresses = 18,890
http://bmidb.cs.stonybrook.edu/easergeocoder/
– Introduction: Problem, Motivation, Goal
– Background: Classic Geocoding Model
– Open Data Sources: Linear-Based, Point-Based, Community Contributed
– Integrative Geocoding: Integrative Geocoding Model, Oriented Searching, Intelligent Answer Selection
– Results: Accuracy, Spatial Error, Spatial Accuracy Variation, Scalability
– Maximizing coverage and accuracy
– Paying no cost for data
– Unchanged, except for preliminary data cleaning and standardization
– Extracting candidates (the most similar matches) from the sources
– By using machine learning techniques
estimating the location
– Density
– Estimation error
geocoding systems
Topologically Integrated Geographic Encoding and Referencing (TIGER)
▪ Does not contain exact building locations → addresses are placed by linear interpolation
▪ Parcel homogeneity
▪ Offset from beginning and end
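Linear interpolation over an address range can be sketched as below. This is a simplified illustration, not the TIGER tooling itself: it treats the street segment as a straight line between two endpoints, assumes house numbers are spread evenly along it (the parcel homogeneity assumption), and ignores any offset from the segment ends. The coordinates and the function name are hypothetical.

```python
def interpolate_address(house_number, range_start, range_end, start_point, end_point):
    """Estimate (lat, lon) for a house number on a segment with a known address range.

    start_point is where range_start sits and end_point is where range_end sits.
    Even spacing of house numbers along the segment is assumed.
    """
    if range_start == range_end:
        fraction = 0.5  # single-number range: fall back to the segment midpoint
    else:
        fraction = (house_number - range_start) / (range_end - range_start)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("house number is outside the segment's address range")
    lat = start_point[0] + fraction * (end_point[0] - start_point[0])
    lon = start_point[1] + fraction * (end_point[1] - start_point[1])
    return (lat, lon)

# Number 50 on a segment covering 0..100 lands at the segment midpoint.
print(interpolate_address(50, 0, 100, (40.000, -73.000), (40.002, -73.004)))
```

When real parcels are uneven, or buildings sit away from the segment ends, the estimate drifts from the true location, which is the interpolation error the slides attribute to TIGER.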
Tax Parcels
purpose of providing a precise unit of taxation in support of taxpayers
map
– Missing many building addresses
– Partial information for many records
SAM
Maintenance Program
Generation 911 emergency systems
– Limited coverage: 51 of 62 counties in New York
OpenAddresses (OA)
address locations around the world
worldwide
OpenStreetMap (OSM)
extensive collaborative contribution from 2 million users
▪ Missing building addresses
▪ Partial information for many records
▪ Formatting problems; needs massive cleaning
▪ Not 100% reliable
– The proportion of input data that a geocoding system was capable of successfully geocoding
– Lack of a standard definition of successful geocoding or an acceptable result
– Average distance between true locations and the computed geocoded locations
– Distribution of positional errors
– Processing time for fixed number of inputs
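The first two measures above can be sketched in code. This is an illustration under assumptions: distances use the haversine formula on a spherical Earth, a failed geocode is represented as None, and the function names are not from the slides.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))

def match_rate(results):
    """Proportion of inputs for which the geocoder returned any location."""
    return sum(r is not None for r in results) / len(results)

def mean_spatial_error_m(results, true_locations):
    """Average distance between true and computed locations, over geocoded inputs only."""
    errors = [haversine_m(r, t) for r, t in zip(results, true_locations) if r is not None]
    return sum(errors) / len(errors) if errors else float("nan")

# Hypothetical results for three inputs; the second one failed to geocode.
results = [(40.0, -73.0), None, (41.0, -73.0)]
truth = [(40.0, -73.0), (40.5, -73.5), (41.0, -73.0)]
print(match_rate(results), mean_spatial_error_m(results, truth))
```

The distribution of positional errors can be studied from the same per-address `errors` list, for example by taking percentiles instead of the mean.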
– Street network map (SNM) data source versus parcel data
– Rural versus suburban versus urban
Positional error of 99% of addresses, in meters:
            SNM    Parcel
Urban       379    62
Suburban    1219   177
Rural       5706   582
– Unavailable for public access
– Not containing every building
City Name
County Name
Zip Code
15/75 – House Number
60/75 – Street Address
  5/92 – Prefix
  6/92 – PreType
  70/92 – Street Name
  6/92 – SufType
  5/92 – Suffix
underlying reference dataset
– Linear-Based
geolocation points
– Polygon-Based
buildings are given
– Point-Based
single geolocation
– Interpolation algorithms
– Parcel homogeneity
– Offset
– Parcel Centroids
[Figure labels: TIGER, SAM, Parcel, OpenAddresses; 51.4m, 39.8m, 40.08m; 23 Hammond Lane, 39 Hammond Lane, 3 Hammond Lane]
– Text-oriented search engine
– Automatic indexing for efficient searching
– Intelligent caching
– Limited text search capabilities
– Generating lots of SQL queries
– Some queries are modified to find a match
– Some queries do not have any responses
physician addresses in New York State, provided by The Centers for Medicare and Medicaid Services (CMS)
– Publicly available
– Up-to-date and validated
– Population considered implicitly
data sources, generating four candidates
70,000 Physicians’ Addresses in NYS
– Building Numbers
– Street, City Name and County Name
– Zip Code
– Base Number, Prefix and Postfix
[Figure: decision tree for answer selection, with Yes/No branches. Node conditions shown: normalized edit distance between main tokens < 0.121324; zip codes are same; building number difference < 51.5; county names are same; building number difference < 5.5; reference zip code is available. Other values shown: 0.08608, 0.013467, 0.10384, 0.12309, 0.027925.]
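The tree's first feature, a normalized edit distance, can be computed as sketched below. The slides do not state which normalization was used; dividing the Levenshtein distance by the length of the longer string is an assumption made here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + substitution_cost))
        previous = current
    return previous[-1]

def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

# One inserted letter over 12 characters stays under the 0.121324 split shown in the figure.
print(normalized_edit_distance("stony brook", "stoney brook"))
```

A score near 0 means the input street name and the reference street name are almost identical, which is why the tree tests this feature against small thresholds.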
– Coordinator thread as master
– Slave threads responsible for geocoding addresses
– Set up a database server on each node
– Avoid communication overhead
– Pre-load into the Hadoop Distributed File System
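The coordinator-and-workers arrangement can be illustrated within a single process. This sketch covers only the threading part, using Python's standard thread pool as a stand-in; the per-node database servers and the Hadoop Distributed File System are outside its scope, and `geocode_one` is a hypothetical placeholder for a real lookup.

```python
from concurrent.futures import ThreadPoolExecutor

def geocode_one(address):
    """Placeholder for a real lookup against a local reference database."""
    return {"address": address, "location": None}

def geocode_batch(addresses, worker=geocode_one, n_workers=4):
    """Coordinator: hand addresses to worker threads and collect results in input order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, addresses))

results = geocode_batch(["12 stony brook rd", "3 hammond ln", "39 hammond ln"])
print([r["address"] for r in results])
```

Because each worker queries a local database, no address data has to leave the node, which matches the slide's goal of avoiding communication overhead.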
– Google, Here, MapQuest, and Texas A&M (GeoServices)
– Using commercial maps, costing thousands of dollars for each state
– Nominatim, Geonames, and DataScienceToolkit (DST)
2.20GHz) and 88 virtual cores, 5TB hard drive at 7200rpm and 128GB memory
Name           Geolocation Level  Accuracy (%)  Spatial Error (m)
TIGER          Street Level       91.51         91.20
Tax Parcels    Rooftop            71.93         35.08
OA             Rooftop            65.96         32.01
SAM            Rooftop            69.72         35.72
EaserGeocoder  Combined           96.46         31.08