  1. Neural Embeddings for Populated GeoNames Locations Mayank Kejriwal, Pedro Szekely USC Information Sciences Institute

  2. Motivation: feature extraction from locations • Essential for machine learning problems involving locations

  3. Machine learning applications • Toponym resolution, e.g. "Boston" in England, UK vs. "Boston" in Massachusetts, USA • Much more likely to be Boston, MA if 'New York' and 'Martha's Vineyard' were also extracted in a similar context • Features are hybrid, i.e. they must encode both location and 'context' (e.g. text)

  4. Machine learning applications • Toponym resolution, e.g. "Boston" in England, UK vs. "Boston" in Massachusetts, USA • Much more likely to be Boston, MA if 'New York' and 'Martha's Vineyard' were also extracted in a similar context • Features are hybrid, i.e. they must encode both location and 'context' (e.g. text) • Named entity disambiguation, e.g. was 'Charlotte' extracted as a name or a location?

  5. Motivation: feature extraction from locations • Essential for machine learning problems involving locations • Why not use latitude-longitude directly?

  6. What makes for a 'good' feature space? • Captures proximity semantics • Real-valued, not very high-dimensional • Not too sensitive (1.0 vs. 1.001) • Extensible • Does not necessarily require manual tuning • Generic, i.e. can be visualized in some way

  7. Do lat-long points capture proximity semantics? • Only in a very dense, non-linear space

  8. More formally... • dist(lat1, long1, lat2, long2) is well-approximated using the Haversine formula
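As a reference point, the Haversine distance named on the slide above can be computed in a few lines of Python; a mean Earth radius of 6371 km is assumed here.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2.0) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(a))

# One degree of longitude at the equator is roughly 111 km:
print(haversine_km(0, 0, 0, 1))
```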

  9. Do lat-long points capture proximity semantics? • Discontinuous (in linear space)!
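A minimal sketch of the discontinuity claim: two points just on opposite sides of the antimeridian look maximally far apart to a naive Euclidean distance on raw lat-long coordinates, yet are geodesically close. The coordinates below are illustrative, not from the paper.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees."""
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def euclidean_deg(lat1, lon1, lat2, lon2):
    """Naive straight-line distance in raw lat-long space (degrees)."""
    return sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)

# Two points ~22 km apart, straddling the antimeridian:
p, q = (0.0, 179.9), (0.0, -179.9)
print(euclidean_deg(*p, *q))   # ~359.8 degrees: looks maximally far in linear space
print(haversine_km(*p, *q))    # ~22 km: actually very close on the globe
```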

  10. Do lat-long points capture proximity semantics? • Sensitive (more than other features typically used in machine learning pipelines)

  11. What makes for a good feature space? • Captures proximity semantics • Real-valued, not very high-dimensional • Not too sensitive (1.0 vs. 1.001) • Extensible • Does not necessarily require manual tuning • Generic, i.e. can be visualized in some way

  12. Idea: 'Embed' GeoNames as a weighted, directed network... • ...in a vector space! • Vector similarities (using dot product similarity) depend inversely on geodesic distances • [Figure: 2-dimensional un-normalized (latitude-longitude) embeddings in a complex, sensitive space vs. 100-dimensional normalized embeddings in dot product space]
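The dot-product-similarity idea can be sketched with toy vectors. The three 4-dimensional embeddings below are invented purely for illustration (the actual embeddings are 100-dimensional vectors produced by DeepWalk); the point is only that, after normalization, nearby places should get the larger dot product.

```python
from math import sqrt

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products lie in [-1, 1]."""
    n = sqrt(dot(v, v))
    return [a / n for a in v]

# Hypothetical embeddings, invented for illustration:
boston_ma = normalize([0.9, 0.1, 0.3, 0.2])
nyc       = normalize([0.8, 0.2, 0.4, 0.1])
boston_uk = normalize([0.1, 0.9, 0.2, 0.8])

# Geodesically nearer places should have the higher similarity:
print(dot(boston_ma, nyc))        # close to 1
print(dot(boston_ma, boston_uk))  # noticeably smaller
```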

  13. Step 1: Determine set of nodes in network • Nodes in GeoNames identified by the following feature codes: ['PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH', 'PPLF', 'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS', 'PPLW', 'PPLX', 'STLMT'] • ~4.4 million nodes
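A sketch of this node-selection step, assuming the tab-separated column layout of the GeoNames allCountries.txt dump (feature code at column index 7, latitude and longitude at indices 4 and 5); the column positions are an assumption about the dump format, not taken from the paper.

```python
POPULATED_CODES = {'PPL', 'PPLA', 'PPLA2', 'PPLA3', 'PPLA4', 'PPLC', 'PPLCH',
                   'PPLF', 'PPLG', 'PPLH', 'PPLL', 'PPLQ', 'PPLR', 'PPLS',
                   'PPLW', 'PPLX', 'STLMT'}

def populated_nodes(lines):
    """Yield (geonameid, name, lat, lon) for rows whose feature code is one
    of the populated-place codes; other rows (mountains, rivers, ...) are
    skipped. Assumes GeoNames' tab-separated dump layout."""
    for line in lines:
        cols = line.rstrip('\n').split('\t')
        if len(cols) > 7 and cols[7] in POPULATED_CODES:
            yield cols[0], cols[1], float(cols[4]), float(cols[5])
```

Running this over the full dump would keep roughly the 4.4 million populated places cited on the slide.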

  14. Step 2: Determine edges and weights • Pairwise in the worst case • Slide a window over nodes sorted by latitude or longitude; only form edges between nodes in the same window • Postprocess by removing nodes with 0 population • ~357,000 nodes, ~9 million edges
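The windowing trick can be sketched as follows. Each node is connected only to its neighbors within a small window over the latitude-sorted order, avoiding the quadratic all-pairs comparison. The 1/(1 + distance) edge weight is an assumed choice (the slide does not specify the weighting function), picked only so that closer pairs get heavier edges.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in decimal degrees."""
    a = (sin(radians(lat2 - lat1) / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2))
         * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def window_edges(nodes, window=3):
    """nodes: list of (id, lat, lon). Sort by latitude and connect each node
    only to the next `window` nodes in sorted order, instead of all pairs.
    Weight 1/(1+distance) is an assumed choice: closer pairs weigh more."""
    nodes = sorted(nodes, key=lambda n: n[1])  # sort by latitude
    edges = []
    for i, (u, lat1, lon1) in enumerate(nodes):
        for v, lat2, lon2 in nodes[i + 1:i + 1 + window]:
            w = 1.0 / (1.0 + haversine_km(lat1, lon1, lat2, lon2))
            edges.append((u, v, w))
            edges.append((v, u, w))  # directed network: add both directions
    return edges
```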

  15. Step 3: Run DeepWalk on network • DeepWalk (Perozzi et al., 2014) is a neural network algorithm for embedding nodes in graphs; it has achieved strong results • Very fast!
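The corpus-building half of DeepWalk, truncated random walks over the graph, can be sketched in plain Python. In the full algorithm these walks are then fed as 'sentences' to a skip-gram model (e.g. gensim's Word2Vec) to produce the node embeddings; only the walk generation is shown here.

```python
import random

def random_walks(adj, num_walks=10, walk_length=40, seed=42):
    """Generate truncated random walks over an adjacency dict
    {node: [neighbors]}, the corpus-building step of DeepWalk.
    Starts num_walks walks from every node; each walk is truncated
    at walk_length steps or at a node with no out-neighbors."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append(walk)
    return walks
```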

  16. Example in paper: North Dakota

  17. Vectors, code and raw data all on GitHub (also, figshare) https://github.com/mayankkejriwal/Geonames-embeddings
