Neural Embeddings for Populated GeoNames Locations (PowerPoint presentation)


SLIDE 1

Neural Embeddings for Populated GeoNames Locations

Mayank Kejriwal, Pedro Szekely USC Information Sciences Institute

SLIDE 2

Motivation: feature extraction from locations

  • Essential for machine learning problems involving locations
SLIDE 3

Machine learning applications

  • Toponym resolution
  • Much more likely to be Boston, MA if ‘New York’ and ‘Martha’s Vineyard’ were also extracted in a similar context
  • e.g. "Boston" in England, UK vs. "Boston" in Massachusetts, USA
  • Features are hybrid, i.e. must encode both location and ‘context’ (e.g. text)

SLIDE 4

Machine learning applications

  • Toponym resolution
  • Much more likely to be Boston, MA if ‘New York’ and ‘Martha’s Vineyard’ were also extracted in a similar context
  • e.g. "Boston" in England, UK vs. "Boston" in Massachusetts, USA
  • Features are hybrid, i.e. must encode both location and ‘context’ (e.g. text)
  • Named entity disambiguation
  • e.g. Was ‘Charlotte’ extracted as a name or a location?

SLIDE 5

Motivation: feature extraction from locations

  • Essential for machine learning problems

Why not use latitude-longitude directly?

SLIDE 6

What makes for a ‘good’ feature space?

  • Captures proximity semantics
  • Real-valued, not very high-dimensional
  • Not too sensitive (1.0 vs. 1.001)
  • Extensible
  • Does not necessarily require manual tuning
  • Generic, i.e. can be visualized in some way
SLIDE 7

Do lat-long points capture proximity semantics?

  • Only in a very dense, non-linear space

SLIDE 8

More formally...

  • dist(lat1, long1, lat2, long2) is well-approximated using the Haversine formula
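The Haversine formula mentioned above can be sketched in a few lines of Python (a standard textbook formulation, not the authors' code; the Earth-radius constant is the usual mean-radius approximation):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, long) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

As a sanity check, one degree of longitude along the equator comes out to roughly 111 km.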

SLIDE 9

Do lat-long points capture proximity semantics?

  • Discontinuous (in linear space)!
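The discontinuity is easiest to see at the antimeridian; a minimal illustration (my own example, not from the slides):

```python
# Two points that straddle the 180° meridian are geodesically very close,
# but numerically far apart if longitude is treated as a plain linear feature.
lon_a, lon_b = 179.9, -179.9

naive_diff = abs(lon_a - lon_b)                  # linear-space difference: 359.8°
true_diff = min(naive_diff, 360.0 - naive_diff)  # wrapped angular difference: 0.2°

print(naive_diff, true_diff)
```

A model fed raw longitudes sees these two points as nearly maximally distant, even though they are about 22 km apart on the ground.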
SLIDE 10

Do lat-long points capture proximity semantics?

  • Sensitive (more than other features typically used in machine learning pipelines)

SLIDE 11

What makes for a good feature space?

  • Captures proximity semantics
  • Real-valued, not very high-dimensional
  • Not too sensitive (1.0 vs. 1.001)
  • Extensible
  • Does not necessarily require manual tuning
  • Generic, i.e. can be visualized in some way
SLIDE 12

Idea: ‘Embed’ GeoNames as a weighted, directed network...

  • ...in a vector space!
  • Vector similarities (using dot product similarity) depend inversely on geodesic distances

[Figure: 2-dimensional un-normalized embeddings (latitude-longitude) in a complex, sensitive space vs. 100-dimensional normalized embeddings in dot product space]
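The dot-product similarity the embedding targets can be sketched as follows (a generic cosine/dot-product helper, not the authors' code; the toy 3-d vectors are made up, standing in for the real 100-d embeddings):

```python
import math

def dot_similarity(u, v):
    """Dot product of two equal-length vectors; for unit-normalized
    embeddings this equals the cosine similarity, bounded in [-1, 1]."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot product == cosine similarity."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Made-up 3-d stand-ins for embedding vectors:
boston = normalize([1.0, 2.0, 0.5])
nearby = normalize([1.1, 1.9, 0.6])    # geodesically close -> high similarity
faraway = normalize([-2.0, 0.1, 3.0])  # geodesically far -> low similarity

print(dot_similarity(boston, nearby), dot_similarity(boston, faraway))
```

Because the vectors are normalized, similarities stay bounded and change smoothly under small perturbations, which is exactly the sensitivity property raw lat-long lacks.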

SLIDE 13

Step 1: Determine set of nodes in network

  • Nodes in GeoNames identified by the following feature codes: [‘PPL’, ‘PPLA’, ‘PPLA2’, ‘PPLA3’, ‘PPLA4’, ‘PPLC’, ‘PPLCH’, ‘PPLF’, ‘PPLG’, ‘PPLH’, ‘PPLL’, ‘PPLQ’, ‘PPLR’, ‘PPLS’, ‘PPLW’, ‘PPLX’, ‘STLMT’]
  • ~4.4 million nodes
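Selecting these nodes from a GeoNames dump can be sketched as below. This assumes the standard tab-separated allCountries.txt layout, where the feature code is the 8th column (index 7); the sample rows are made up for illustration:

```python
POPULATED_PLACE_CODES = {
    'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF',
    'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT',
}

FEATURE_CODE_COL = 7  # position in the tab-separated GeoNames dump (assumed)

def populated_places(lines):
    """Yield rows (as field lists) whose feature code marks a populated place."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > FEATURE_CODE_COL and fields[FEATURE_CODE_COL] in POPULATED_PLACE_CODES:
            yield fields

# Made-up sample rows in the dump's column order (id, name, ..., feature code, ...):
sample = [
    '4930956\tBoston\tBoston\t\t42.36\t-71.06\tP\tPPL\tUS',
    '6324729\tCharles River\tCharles River\t\t42.37\t-71.07\tH\tSTM\tUS',
]
kept = list(populated_places(sample))
print([row[1] for row in kept])  # only the populated place survives
```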

SLIDE 14

Step 2: Determine edges and weights

  • Pairwise in the worst case
  • Slide a window over nodes sorted by latitude or longitude
  • Only form edges between nodes in the same window
  • Postprocess by removing nodes with 0 population

~357,000 nodes, ~9 million edges
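The sliding-window construction can be sketched as follows. This is my own reading of the slide, not the released code: nodes are sorted by longitude, only pairs inside a fixed-size window are connected, and the weight used here (inverse of a simple coordinate distance) is a hypothetical stand-in for the inverse-geodesic weighting:

```python
import math
from itertools import combinations

def window_edges(nodes, window=3):
    """Build weighted edges only between nodes that co-occur in a sliding
    window over the longitude-sorted list, avoiding all O(n^2) pairs.
    nodes: list of (node_id, lat, lon, population) tuples."""
    # Drop unpopulated places, per the slide's postprocessing step.
    nodes = [n for n in nodes if n[3] > 0]
    nodes.sort(key=lambda n: n[2])  # sort by longitude
    edges = {}
    for start in range(len(nodes) - window + 1):
        for a, b in combinations(nodes[start:start + window], 2):
            dist = math.hypot(a[1] - b[1], a[2] - b[2])  # stand-in distance
            edges[(a[0], b[0])] = 1.0 / (dist + 1e-9)    # closer -> heavier
    return edges

# Made-up nodes: (id, lat, lon, population)
cities = [
    ('A', 42.4, -71.1, 650000),
    ('B', 40.7, -74.0, 8000000),
    ('C', 51.5, -0.1, 9000000),
    ('D', 0.0, 10.0, 0),  # zero population: removed in postprocessing
]
edges = window_edges(cities, window=2)
```

With a window of 2 over the three surviving nodes, only longitude-adjacent pairs get edges, and nearer pairs receive heavier weights.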

SLIDE 15

Step 3: Run DeepWalk on network

  • DeepWalk (Perozzi et al., 2014) is a neural network algorithm for embedding the nodes of a graph; it has achieved strong results
  • Very fast!
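DeepWalk's core idea is to generate truncated random walks over the graph and feed them to a word2vec-style skip-gram model as if they were sentences. A minimal walk generator (the graph and parameters here are illustrative; the resulting corpus would then go to a skip-gram implementation such as gensim's Word2Vec):

```python
import random

def random_walks(adj, walks_per_node=2, walk_length=5, seed=42):
    """Generate truncated random walks over an adjacency dict
    {node: [neighbors]}; each walk acts as a 'sentence' for skip-gram."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: truncate the walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Tiny illustrative path graph A - B - C:
adj = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}
walks = random_walks(adj)
# Each walk could now be passed to a skip-gram model to learn node embeddings.
```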
SLIDE 16

Example in paper: North Dakota

SLIDE 17

Vectors, code and raw data are all on GitHub (also on figshare)

https://github.com/mayankkejriwal/Geonames-embeddings