Massive Address Datasets Department of Computer Science, Stony Brook - - PowerPoint PPT Presentation



SLIDE 1

Effective Scalable and Integrative Geocoding for Massive Address Datasets

Department of Computer Science, Stony Brook University Sina Rashidian, Xinyu Dong, Amogh Avadhni, Prachi Poddar, Fusheng Wang

November 2017

SLIDE 2

INTRODUCTION

Introduction

Background Open Data Sources Integrative Geocoding Results

SLIDE 3

Motivations

  • Spatial big data analysis is increasing every day

– Accessibility of large-scale open data

  • Public health studies

– Health data is widely accessible through government open data initiatives, geo-crowdsourcing, and social media
– Low-resolution spatial data (e.g., county or zip code) is used more often
– Lack of high-resolution spatial data
– Lack of efficient and scalable methods

SLIDE 4

Problems

  • Sensitivity of data, for instance:

– Patients’ privacy in public health studies
– Protected Health Information (PHI)
– Health Insurance Portability and Accountability Act (HIPAA)

  • High cost of commercial geocoding

– Geocoding 20M records costs ~10K USD

  • Limited scalability of current geocoding systems

– Daily/monthly transaction limits: 1M per month or 100K per day
– Geocoding 20M records → 200 days!
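The 200-day figure follows from the 100K-per-day cap. A small worked check in Python, using only the numbers on this slide (the function name is illustrative):

```python
import math


def days_needed(records: int, per_day: int) -> int:
    """Days required when at most `per_day` records can be geocoded daily."""
    return math.ceil(records / per_day)


records = 20_000_000
days = days_needed(records, 100_000)        # daily cap of 100K
months = math.ceil(records / 1_000_000)     # monthly cap of 1M
cost_per_record = 10_000 / records          # ~10K USD for 20M records

print(days, months, cost_per_record)
```

Under the monthly cap the same workload takes 20 months, which is even slower than the daily-cap estimate.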

SLIDE 5

Goal

  • Geocoding system with the following features:

– Free of charge → Suitable for academic usage
– Scalable and fast → Supports a high volume of input data
– Accurate and robust → Results must be reliable
– Local → Respects data sensitivity, such as patients’ privacy

  • Challenges

– Lack of a free, complete, accurate, and reliable reference
– Free data sources can be noisy and incomplete
– Different data sources do not share the same set of features
– When there are multiple possible answers, which one is better?

SLIDE 6

BACKGROUND

SLIDE 7

Classic Geocoding Model

  • Consists of two major parts:

– Parsing
– Searching

  • Fixed scoring system based on only text similarity
  • Improvements are based on techniques

Parsing: Raw Address → Tokenizing → Tokenized Address → Cleaning → Clean Tokens
Searching: Clean Tokens → Build Query → DataBase → Is valid? → Answer

Example:
Raw address: 12-11 North Stony Brook Road, Stony Brook, NY, 11794
Tokenized address: 12, 11, North, Stony Brook, Rd, Stony Brook, NY, 11794
Clean tokens: 12, 11, n, stony brook, rd, stony brook, ny, 11794
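The parsing stage can be sketched in Python as follows. This is a minimal illustration that reproduces the slide's example address, not the parser used by the authors; the abbreviation tables and function names are assumptions.

```python
# Small illustrative lookup tables (assumed, not from the slides).
DIRECTIONS = {"north": "n", "south": "s", "east": "e", "west": "w"}
STREET_TYPES = {"road": "Rd", "street": "St", "avenue": "Ave", "lane": "Ln"}


def tokenize(raw):
    """Split a raw address into number, direction, street name, type, and zone tokens."""
    parts = [p.strip() for p in raw.split(",")]
    words = parts[0].split()
    tokens = words[0].split("-")            # "12-11" -> ["12", "11"]
    words = words[1:]
    if words and words[0].lower() in DIRECTIONS:
        tokens.append(words.pop(0))         # leading direction, e.g. "North"
    street_type = None
    if words and words[-1].lower() in STREET_TYPES:
        street_type = STREET_TYPES[words.pop().lower()]   # "Road" -> "Rd"
    tokens.append(" ".join(words))          # remaining words form the street name
    if street_type:
        tokens.append(street_type)
    return tokens + parts[1:]               # city, state, zip follow unchanged


def clean(tokens):
    """Lowercase every token and abbreviate directions."""
    return [DIRECTIONS.get(t.lower(), t.lower()) for t in tokens]


raw = "12-11 North Stony Brook Road, Stony Brook, NY, 11794"
print(clean(tokenize(raw)))
```

A real parser needs far larger abbreviation tables and must handle addresses that do not start with a house number; the sketch only shows the two steps named in the diagram.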

SLIDE 8

DATA SOURCES

SLIDE 9

Data Sources

  • Linear-Based

– Topologically Integrated Geographic Encoding and Referencing (TIGER)
– Cons: Missing city information; based on address ranges → Interpolation error

  • Polygon/Point-Based

– Tax Parcels
– New York Street and Address Maintenance Program (SAM)
– OpenStreetMap (OSM)
– OpenAddresses
– Cons: Incompleteness, partial information, messiness

SLIDE 10

INTEGRATIVE GEOCODING

SLIDE 11

Integrative Geocoding Model

  • Consists of three major parts:

– Parsing
– Oriented Searching
– Intelligent Selection

  • Parallel processing approach
  • Training section as a pre-processing task
  • Effective and scalable integrative geocoder (EaserGeocoder)

SLIDE 12

Oriented Searching

  • 1. Generate queries based on each dataset’s characteristics

– E.g., TIGER does not have city name
– Increasing efficiency
– More accurate results

  • 2. Search Database
  • 3. Relaxed Search

– Finding nearest ones instead of exact match
– Expanding the scope iteratively

[Figure: the same search loop runs for each source (TIGER, OpenAddresses, Tax Parcels, SAM): Generate Specific Query → Search Engine → Acceptable? → Candidate Result, with Relax Query applied when the result is not acceptable. The SAM panel labels the check "Close enough?".]
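A minimal Python sketch of source-specific query generation with iterative relaxation. It uses an in-memory list in place of the search engine; the field lists, the relaxation order, and all function names are assumptions made for illustration. Only "TIGER has no city name" comes from the slide.

```python
# Fields each reference source can be queried on (illustrative).
SOURCE_FIELDS = {
    "TIGER": ["number", "street", "zip"],
    "OpenAddresses": ["number", "street", "city", "zip"],
}

# Order in which constraints are dropped when a query is relaxed (assumed).
RELAX_ORDER = ["zip", "city", "number"]


def build_query(address, source):
    """Keep only the fields that the given source actually stores."""
    return {f: address[f] for f in SOURCE_FIELDS[source] if f in address}


def search(records, query):
    """Exact-match lookup standing in for the search engine."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


def oriented_search(address, source, records):
    """Return (candidate, relaxed_fields); candidate is None if nothing matches."""
    query = build_query(address, source)
    relaxed = []
    for field in [None] + RELAX_ORDER:
        if field is not None:
            if field not in query:
                continue
            del query[field]                # widen the scope by one step
            relaxed.append(field)
        hits = search(records, query)
        if hits:
            # Among the hits, prefer the nearest house number.
            best = min(hits, key=lambda r: abs(r["number"] - address["number"]))
            return best, relaxed
    return None, relaxed


address = {"number": 21, "street": "airport rd", "city": "binghamton", "zip": "13901"}
tiger = [
    {"number": 19, "street": "airport rd", "zip": "13902"},
    {"number": 25, "street": "airport rd", "zip": "13902"},
]
print(oriented_search(address, "TIGER", tiger))
```

Returning the list of relaxed fields alongside the candidate lets a later selection step know how far the match is from exact.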

SLIDE 13

Intelligent Answer Selection – Case Study

  • “21 Airport Rd, Binghamton, NY 13901”

– Perfect match in TIGER, spatial error ~450 m
– Partial match in OpenAddresses, spatial error ~350 m

  • “510 Main St, Oneida, NY 13421”

– Perfect match in TIGER, spatial error ~50 m
– Partial match in OpenAddresses, spatial error ~30 km!

  • The partial matches are similar: only the zip code is missing in both cases
  • Preset rules for choosing the better reference lead to a non-optimal or even wrong answer
  • The challenge is to choose optimally between them in every case!

SLIDE 14

Intelligent Answer Selection

  • Why is text similarity alone not enough?
  • 1. Each source could be more accurate in one specific region
  • 2. Implicit factors such as population density
  • Machine-learning-based approach
  • Gradient tree boosting

– Learning small predictive models
– Decision trees
– Learning a model for predicting the best candidate

SLIDE 15

Classification

  • Originally binary classification

– Classify based on an acceptance threshold

  • Treats all correct candidates the same

– Some candidates are more correct!
– Considering spatial error

  • Multi-class classifier

– 3 classes between 0 and the threshold
– Choosing the nearest class
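One way to read the multi-class scheme, sketched in Python. The equal-width bins, the 400 m threshold (taken from the accuracy definition in the results), and the function names are assumptions; the slides do not state how the three classes are bounded. In the real system a trained model would predict the class, whereas here the class is computed from the known error to show the labeling.

```python
def error_class(spatial_error_m, threshold_m=400.0, n_classes=3):
    """Map a spatial error to a class label.

    Labels 0..n_classes-1 split [0, threshold) into equal-width bins (assumed);
    label n_classes means the candidate is beyond the threshold and rejected.
    """
    if spatial_error_m >= threshold_m:
        return n_classes
    width = threshold_m / n_classes
    return min(int(spatial_error_m // width), n_classes - 1)


def pick_best(predicted, n_classes=3):
    """Choose the source whose candidate falls in the nearest (lowest) class."""
    source = min(predicted, key=predicted.get)
    return source if predicted[source] < n_classes else None


# Spatial errors from the two case-study addresses.
airport_rd = {"TIGER": error_class(450), "OpenAddresses": error_class(350)}
main_st = {"TIGER": error_class(50), "OpenAddresses": error_class(30_000)}
print(pick_best(airport_rd), pick_best(main_st))
```

With these labels the two case-study addresses resolve to different sources, which a fixed preference rule cannot do.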

SLIDE 16

RESULTS

SLIDE 17

Accuracy

  • 18,890 residential addresses crawled from one real estate website
  • Defined 400 meters as the threshold for accuracy

Type            Name            Google %   Here %   MapQuest %   Consolidation %
Commercial      Google          -          97.18    96.55        99.10
Commercial      Here            94.88      -        97.17        99.26
Commercial      MapQuest        94.52      97.16    -            98.82
Commercial      GeoServices     94.68      95.30    94.71        96.43
Non-Commercial  EaserGeocoder   96.46      97.12    95.82        97.93
Non-Commercial  Nominatim       54.74      54.15    54.08        55.68
Non-Commercial  Geonames        82.65      83.20    83.53        84.40
Non-Commercial  DataSciToolkit  89.05      89.50    89.52        90.71

SLIDE 18

Spatial Error

Type            Name            Google (m)   Here (m)   MapQuest (m)   Consolidation (m)
Commercial      Google          -            24.29      55.52          18.37
Commercial      Here            24.29        -          55.20          17.67
Commercial      MapQuest        55.85        55.20      -              35.86
Commercial      GeoServices     35.75        31.08      51.16          29.02
Non-Commercial  EaserGeocoder   31.08        26.85      53.30          26.90
Non-Commercial  Nominatim       65.55        63.88      57.98          58.09
Non-Commercial  Geonames        93.38        91.78      75.32          83.58
Non-Commercial  DataSciToolkit  87.42        85.47      71.36          77.39

SLIDE 19

Spatial Accuracy Variation - Histogram

[Figure: histogram of spatial accuracy variation; number of addresses = 18,890]

SLIDE 20

Scalability Tests

SLIDE 21

EaserGeocoder

  • The system is available from:

http://bmidb.cs.stonybrook.edu/easergeocoder/

  • Completely free for academic usage!

SLIDE 22

Summary

  • Introduction

– Problem, Motivation, Goal

  • Background

– Classic Geocoding Model

  • Open Data Sources

– Linear-Based, Point-Based, Community Contributed

  • Integrative Geocoding

– Integrative Geocoding Model, Oriented Searching, Intelligent Answer Selection

  • Results

– Accuracy, Spatial Error, Spatial Accuracy Variation, Scalability

SLIDE 23

Thanks a lot for your attention

SLIDE 24

Any Questions?

SLIDE 25

EXTRA MATERIALS

SLIDE 26

Overview - Integrative Geocoding Model

  • Utilizing multiple free open reference sources

– Maximizing coverage and accuracy
– Paying no cost for data

  • Searching each data source based on its characteristics

– Sources are kept unchanged, except for data pre-cleaning and standardization
– Extracting candidates, the most similar matches from the sources

  • Choosing the best answer among candidates

– By using machine learning techniques

SLIDE 27

Traditional Geocoding Method

  • Street Network Map as the source
  • Interpolation methods for estimating the location
  • Accuracy depends on:

– Density
– Estimation error

  • The most common method in geocoding systems
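Address-range interpolation along a street segment can be sketched in Python as below. This is a generic linear interpolation under simplifying assumptions (a straight segment, evenly spaced house numbers, no side-of-street offset); the function and parameter names are illustrative.

```python
def interpolate(house_number, from_number, to_number, start, end):
    """Estimate (lat, lon) of a house number on a segment with a known address range.

    `start` and `end` are the (lat, lon) endpoints that correspond to
    `from_number` and `to_number`.
    """
    if to_number == from_number:
        t = 0.5                                  # single-number range: use the midpoint
    else:
        t = (house_number - from_number) / (to_number - from_number)
    t = min(max(t, 0.0), 1.0)                    # clamp to the segment
    lat = start[0] + t * (end[0] - start[0])
    lon = start[1] + t * (end[1] - start[1])
    return lat, lon


print(interpolate(50, 0, 100, (40.000, -73.000), (40.002, -73.004)))
```

The estimate is only as good as the even-spacing assumption, which is why parcel size variation and offsets from the segment ends appear as error sources for linear-based data.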

SLIDE 28

Linear-Based Dataset

  • Topologically Integrated Geographic Encoding and Referencing (TIGER)
  • Street Network Map
  • Vast coverage and reliable
  • Cons:

▪ Does not contain the exact location of each building address → Linear Interpolation
▪ Parcel homogeneity
▪ Offset from beginning and end

SLIDE 29

Point-Based Datasets

Tax Parcels

  • Legal division of land with the main purpose of providing a precise unit of taxation in support of taxpayers
  • Known as real property or real estate map
  • Cons:

– Missing many building addresses
– Partial information for many records

SAM

  • New York Street and Address Maintenance Program
  • Data collected by the state for Next-Generation 911 emergency systems
  • Data is clean and organized
  • Cons:

– Limited coverage: 51 of the 62 counties in New York

SLIDE 30

Community Contributed Datasets

OpenAddresses (OA)

  • Collection of authoritative data for address locations around the world
  • More than 466 million addresses worldwide

OpenStreetMap (OSM)

  • Large-scale world map project built through extensive collaborative contribution from 2 million users
  • Consists of points, lines, and polygons
  • More than 2.7 billion objects
  • Cons:

▪ Missing building addresses
▪ Partial information for many records
▪ Formatting problems, need massive cleaning
▪ Not 100% reliable

SLIDE 31

Metrics

  • Match Rate

– The proportion of input data that a geocoding system was able to geocode successfully
– Lack of a standard definition of successful geocoding/acceptable result

  • Spatial (Positional) Accuracy

– Average distance between true locations and the computed geocoded locations

  • Spatial accuracy variation

– Distribution of positional errors

  • Performance

– Processing time for a fixed number of inputs
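A Python sketch of how match rate, threshold accuracy, and mean spatial error could be computed. The haversine distance, the use of all inputs as the accuracy denominator, and the function names are assumptions; the 400 m default mirrors the threshold used in the results.

```python
import math

EARTH_RADIUS_M = 6_371_000.0   # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def evaluate(truth, results, threshold_m=400.0):
    """truth: list of (lat, lon); results: list of (lat, lon), or None for no match.

    Returns (match_rate, accuracy, mean_error_m).
    """
    errors = [haversine_m(*t, *r) for t, r in zip(truth, results) if r is not None]
    match_rate = len(errors) / len(truth)
    accuracy = sum(e <= threshold_m for e in errors) / len(truth)
    mean_error = sum(errors) / len(errors) if errors else float("nan")
    return match_rate, accuracy, mean_error


truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
results = [(0.0, 0.0), (0.01, 0.0), None]
print(evaluate(truth, results))
```

The list of per-address errors is also what a histogram of spatial accuracy variation would be built from.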

SLIDE 32

Case Study

  • Positional error between

– Street network map (SNM) data source versus parcel data
– Rural versus suburban versus urban

Positional error of 99% of addresses, in meters:

            SNM    Parcel
Urban       379    62
Suburban    1219   177
Rural       5706   582

SLIDE 33

Geocoding Model/Method Problems

  • Based on a single source
  • Studies focused on just a small region
  • Parcel data was limited

– Unavailable for public access
– Not containing every building

  • Fixed scoring method
  • Is parcel data always better than Street Network Map?

SLIDE 34

Scoring

  • Fixed scoring system
  • Each attribute has a weight; the score is the sum of the weights
  • 30/100 – Zone

– City Name
– County Name
– Zip Code

  • 70/100 – Address

– 15/75 – House Number
– 60/75 – Street Address

  • Street Address components

– 5/92 – Prefix
– 6/92 – PreType
– 70/92 – Street Name
– 6/92 – SufType
– 5/92 – Suffix
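A fixed weighted score of this kind can be sketched in Python. The nesting (Street Address split into five components, Address split into House Number and Street Address) is one reading of the denominators on the slide, and the equal split of the Zone weight across city, county, and zip code is an assumption, since the slide gives no sub-weights for them.

```python
STREET_WEIGHTS = {"prefix": 5, "pretype": 6, "street_name": 70, "suftype": 6, "suffix": 5}  # of 92
ADDRESS_WEIGHTS = {"house_number": 15, "street": 60}                                         # of 75
TOP_WEIGHTS = {"zone": 30, "address": 70}                                                    # of 100
ZONE_FIELDS = ["city", "county", "zip"]   # equal sub-weights: assumption


def weighted(values, weights):
    """Weighted average of per-attribute similarities in [0, 1]."""
    return sum(w * values[k] for k, w in weights.items()) / sum(weights.values())


def score(match):
    """`match` maps each attribute name to a similarity in [0, 1]; returns 0..100."""
    street = weighted(match, STREET_WEIGHTS)
    address = weighted({"house_number": match["house_number"], "street": street}, ADDRESS_WEIGHTS)
    zone = sum(match[f] for f in ZONE_FIELDS) / len(ZONE_FIELDS)
    return 100 * weighted({"zone": zone, "address": address}, TOP_WEIGHTS)


perfect = {k: 1.0 for k in [*STREET_WEIGHTS, "house_number", *ZONE_FIELDS]}
missing_zip = {**perfect, "zip": 0.0}
print(score(perfect), score(missing_zip))
```

Because the weights are fixed, two candidates that both lack a zip code receive the same score regardless of how far apart their locations are.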

SLIDE 35

Data Representation Types

  • Geocoding systems rely on an underlying reference dataset

– Linear-Based: each street is represented as a set of geolocation points
– Polygon-Based: geolocations of the boundary points of buildings are given
– Point-Based: a single building is represented by a single geolocation

SLIDE 36

Visualization

SLIDE 37

Spatial Error in References

  • Outdated data
  • Error in gathering data
  • Linear-Based

– Interpolation algorithms
– Parcel homogeneity
– Offset

  • Polygon-Based

– Parcel centroids

  • Choosing non-optimal result

[Figure: locations from TIGER, SAM, Parcel, and OpenAddresses around 3, 23, and 39 Hammond Lane, with labeled distances of 51.4 m, 39.8 m, and 40.08 m]

SLIDE 38

Search Engine

  • The major task is text searching and information extraction from data sources
  • The data management system is specialized for this purpose
  • Apache Solr

– Text-oriented search engine
– Automatic indexing for efficient searching
– Intelligent caching

  • Relational Database Approach

– Limited text search capabilities
– Generating lots of SQL queries

  • Each data source needs a separate core in Solr

SLIDE 39

“Ground Truth”

  • Major challenge in this problem
  • Impossible to collect the geolocations manually, due to the overwhelming scale
  • Approximated by using results from Google Geocoding
  • This can give rise to other problems

– Some queries are modified to find a match
– Some queries do not have any responses

  • By replacing or removing these, we avoid problematic addresses

SLIDE 40

Training Set

  • 19,095 random samples of physician addresses in New York State, provided by the Centers for Medicare and Medicaid Services (CMS)
  • Reasons:

– Publicly available
– Up-to-date and validated
– Population considered implicitly

  • Each address is geocoded by four data sources, generating four candidates

[Figure: 70,000 Physicians’ Addresses in NYS]

SLIDE 41

Distance Function

  • Absolute Distance

– Building Numbers

  • Normalized Levenshtein Distance (Edit Distance)

– Street, City Name and County Name

  • Equality and Availability

– Zip Code

  • Equality

– Base Number, Prefix and Postfix

  • Reference type (TIGER, OA, Parcels and OSM)
  • Input zip code
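A Python sketch of how some of these features could be computed for one query/candidate pair. Normalizing the edit distance by the longer string's length is an assumption (the slide does not give the normalization), the feature names are illustrative, and only a subset of the listed features is shown.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length, giving a value in [0, 1]."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def features(query, candidate, source):
    """Build a feature dictionary for one candidate returned by `source`."""
    return {
        "number_diff": abs(query["number"] - candidate["number"]),
        "street_dist": normalized_levenshtein(query["street"], candidate["street"]),
        "city_dist": normalized_levenshtein(query["city"], candidate.get("city") or ""),
        "zip_available": candidate.get("zip") is not None,
        "zip_equal": candidate.get("zip") == query["zip"],
        "source": source,
        "input_zip": query["zip"],
    }


query = {"number": 510, "street": "main st", "city": "oneida", "zip": "13421"}
candidate = {"number": 510, "street": "main st", "city": "oneida", "zip": None}
print(features(query, candidate, "OpenAddresses"))
```

Keeping zip availability separate from zip equality lets a model distinguish a missing zip code from a conflicting one.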

SLIDE 42

Decision Tree Sample

[Figure: sample decision tree with Yes/No branches. Split conditions include "Normalized Edit Distance between main tokens < 0.121324", "Zip codes are same", "Building number difference < 51.5", "County names are same", "Building number difference < 5.5", and "Reference zip code is available"; leaves hold small numeric scores such as 0.08608, 0.013467, 0.10384, 0.12309, and 0.027925.]

SLIDE 43

Scalable Design

  • Master-Slave approach

– Coordinator thread as master

▪ Creates and initializes other threads and processes their output
▪ Maximizes usage to boost throughput

– Slave threads responsible for geocoding addresses

▪ Have direct access to the search engine

  • MapReduce

– Set up a database server on each node
– Avoid communication overhead
– Pre-load into Hadoop Distributed File System
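The coordinator/worker pattern can be sketched in Python with a thread pool. This illustrates the pattern only and says nothing about how EaserGeocoder itself is implemented; `fake_geocode` is a hypothetical stand-in for a search-engine lookup.

```python
from concurrent.futures import ThreadPoolExecutor


def geocode_all(addresses, geocode_one, workers=4):
    """Coordinator: start worker threads, hand out addresses, collect results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(geocode_one, addresses))


def fake_geocode(address):
    # Hypothetical placeholder: a real worker would query the search engine here.
    return (address, len(address))


print(geocode_all(["21 Airport Rd", "510 Main St"], fake_geocode, workers=2))
```

In Python, threads mainly help when each worker spends its time waiting on I/O such as search-engine requests; CPU-bound work would need processes instead.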

SLIDE 44

Experimental Setup

  • 18,890 residential addresses crawled from one real estate website
  • Defined 400 meters as the threshold for accuracy
  • Commercial Geocoding Systems:

– Google, Here, MapQuest, and Texas A&M (GeoServices)
– Using commercial maps, costing thousands of dollars for each state

  • Non-commercial Geocoding Systems:

– Nominatim, Geonames, and DataScienceToolkit (DST)

  • Cluster node with 22 physical cores (2x Intel(R) Xeon(R) processor E5-2699 v3 at 2.20GHz) and 88 virtual cores, 5TB hard drive at 7200rpm, and 128GB memory
  • Apache Solr 6.2.0

SLIDE 45

Different Sources Comparison

Name            Geolocation Level   Accuracy (%)   Spatial Error (m)
TIGER           Street Level        91.51          91.20
Tax Parcels     Rooftop             71.93          35.08
OA              Rooftop             65.96          32.01
SAM             Rooftop             69.72          35.72
EaserGeocoder   Combined            96.46          31.08