SLIDE 1

Learning to read URLs

Finding the word boundaries in multi-word domain names with Python and sklearn.

Calvin Giles

SLIDE 2

Who am I?

  • Data Scientist at Adthena
  • PyData Co-Organiser
  • Physicist
  • Like to solve problems pragmatically

SLIDE 3

The Problem

Given a domain name:

'powerwasherchicago.com'
'catholiccommentaryonsacredscripture.com'

Find the concatenated sentence:

'power washer chicago (.com)'
'catholic commentary on sacred scripture (.com)'

SLIDE 4

Why is this useful?

How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'?

How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'?

Domains resolved into words can be compared on a semantic level, not simply as strings.
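To make that concrete (an illustrative sketch, not from the slides): once domains are resolved into word lists, even a crude token-set measure such as Jaccard similarity becomes meaningful, where on the raw strings it is not. The second domain below is made up for the example.

    def jaccard(a, b):
        """Jaccard similarity of two token sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    jaccard(['powerwasherchicago'], ['powerwasherdallas'])                   # 0.0 as raw strings
    jaccard('power washer chicago'.split(), 'power washer dallas'.split())  # 0.5 as word sets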

SLIDE 5

Primary use case

Given 500 domains in a market, what are the themes?

SLIDE 6

Scope of project

As part of Adthena Labs, our internal idea-incubation programme, this approach was developed during a one-day hack to determine whether it could be useful to the business.

SLIDE 7

Adthena's Data

  • > 10 million unique domains
  • > 50 million unique search terms

3rd Party Data

  • Project Gutenberg (https://www.gutenberg.org/)
  • Google Ngram Viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

SLIDE 8

Process

  • 1. Learn some words
  • 2. Find where words occur in a domain name
  • 3. Choose the most likely set of words
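As a roadmap (paraphrasing the code to come; each of these functions is defined in the slides that follow, nothing here is new API):

    # 1. Learn some words: build a dictionary from a corpus
    dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
    # 2. Find where words occur in a domain name
    words = find_words_in_string('powerwasherchicago', dictionary)
    # 3. Choose the most likely set of words
    sentences = get_sentences('powerwasherchicago')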
SLIDE 9
  • 1. Learn some words

Build a dictionary using suitable documents. Documents: search terms

In [2]:
import pandas, os

search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]

Out[2]:
1000000               new 2014 mercedes benz b200 cdi
3000000                  weight watchers in glynneath
5000000               property for rent in batlow nsw
7000000                        us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())

SLIDE 10

In [126]:
st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

21.4k words found
Out[126]:
['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']

SLIDE 11

We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:

In [127]:
dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary

SLIDE 12

Actually, scrap that...

... and use the Google Ngram Viewer datasets:

SLIDE 13

In [212]:
dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"

SLIDE 14

That takes us to ~1M words! We even get some good two-letter words to work with:

In [130]:
print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))

142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
 '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36',
 '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50',
 '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64',
 '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78',
 '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92',
 '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be',
 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he',
 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no',
 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi',
 'we', 'ye']

SLIDE 15

In [144]:
# `choice` is numpy.random.choice
choice(list(dictionary), size=40)

Out[144]:
array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley',
       'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin',
       'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey',
       'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer',
       'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra',
       'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts',
       'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu',
       'conoció'], dtype='<U81')

SLIDE 16
  • 2. Find where words occur in a domain name

Find all substrings of a domain that are in our dictionary, along with their start and end indices.

SLIDE 17

In [149]:
def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indices = ((start, start + length)
                         for start in range(len(string))
                         for length in range(1, longest_word + 1))
    for start, end in substring_indices:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end of the string
            yield substring, start, start + len(substring)

SLIDE 18

In [234]:
domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag',
 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica',
 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher',
 'was', 'wash', 'washe', 'washer', 'we', 'wer']

SLIDE 19

In [235]:
domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat',
 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm',
 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred',
 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta',
 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men',
 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta',
 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu',
 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr',
 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar',
 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']

SLIDE 20
  • 3. Choose the most likely set of words

Simple approach to do this:

  • 1. Find all subsets of the set of words found
  • 2. Determine if that subset is non-overlapping
  • 3. Decide how likely the domain is given a particular subset: $P(d|s)$
  • 4. Decide how likely it is that the subset would occur overall: $P(s)$
  • 5. Determine the best subset: $\operatorname{argmax}_s P(s|d)$

SLIDE 21

We need some domain name data for the next part...

In [153]:
domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace("\.com$", "")
domains = domains.str.replace("^https?\:\/\/", "")
domains = domains.str.replace("^www\d?\.", "")
num_fmt(len(domains))

Out[153]:
'3.8M'

SLIDE 22

In [224]:
choice(domains, size=20)

Out[224]:
array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart',
       'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain',
       'miniot-online', 'chelseabarracksbarracks', 'zeroeasy',
       'newlookupholstery', 'radcliffehealth', 'embracingthemundane',
       'immunityassist', 'simplynostretchmarks', 'teachmetoswim',
       'thetford-europe', 'charlesallenford', 'china-chargermanufacturer',
       'coolbabykid'], dtype=object)

SLIDE 23
  • 1. Find all subsets of the set of words found

There are $2^n$ different sentences that can be constructed from $n$ substrings, including the empty sentence. We can get an idea of how bad that will be with a sample of the data.

SLIDE 24

In [53]:
longest_word = max(len(word) for word in dictionary)  # speeds up search

def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))

In [56]:
import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)

Out[56]:
count     1k
mean    28.3
std     15.8
min        1
25%       17
50%       26
75%       38
max       93
Name: Domain, dtype: object

In [227]:
num_fmt(2**28), 2**93

Out[227]:
('268M', 9903520314283042199192993792)

SLIDE 25

So the worst case in a sample of 1000 domains is $2^{93}$ subsets to test!

SLIDE 26

Combine steps 1 and 2

  • 1. Find all subsets of the set of words found
  • 2. Determine if that subset is non-overlapping

becomes:

  • 1. Find all subsets with non-overlapping words
  • 2. Do nothing :-)
SLIDE 27

3.1 Find all subsets with non-overlapping words

Build a tree of subsets of non-overlapping words by sorting the words by their start index.

...and only return the "best" few cases anyway

It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can prune the search significantly if we only consider sentences at least half as long as the best match. In practice, this does not appear to have any impact on the results, but it prevents an explosion of sentences with particularly long domains.

SLIDE 28

A little more code...

In [147]:
def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.

    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if (current_idx >= len(string)
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(string=string, words=words[i+1:],
                                                   part_sentence=new_part_sentence,
                                                   sentences=sentences,
                                                   threshold=threshold,
                                                   current_idx=end_idx,
                                                   current_score=new_score,
                                                   best_score=best_score)
    return sentences, best_score

SLIDE 29

Add a wrapper

In [148]:
def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))  # order by start index, as find_sentences requires
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]

SLIDE 30

In [64]:
sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)

245
Out[64]:
array([['pow', 'erw', 'as', 'her', 'chicago'],
       ['pow', 'erw', 'ashe', 'chica', 'go'],
       ['power', 'was', 'her', 'chica', 'go'],
       ['power', 'was', 'he', 'rch', 'cago'],
       ['power', 'was', 'her', 'chicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'ash', 'erc', 'hicago'],
       ['ower', 'wash', 'erc', 'hicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'was', 'her', 'chi', 'cago'],
       ['power', 'was', 'her', 'chic', 'ago'],
       ['power', 'as', 'he', 'rch', 'ica', 'go'],
       ['ower', 'washer', 'chicago'],
       ['owe', 'rwas', 'he', 'rch', 'ica', 'go'],
       ['power', 'washer', 'chic', 'go']], dtype=object)

SLIDE 31

In [65]:
sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)

540428
Out[65]:
array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'],
       ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'],
       ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'],
       ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'],
       ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'],
       ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'],
       ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'],
       ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'],
       ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'],
       ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'],
       ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'],
       ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'],
       ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'],
       ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'],
       ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']],
      dtype=object)

SLIDE 32

In [71]:
tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)

In [155]:
tail_sentences.describe().apply(int).apply(num_fmt)

Out[155]:
count       1k
mean     1.18k
std      10.7k
min          1
25%         12
50%         39
75%        145
max       280k
Name: Domain, dtype: object

SLIDE 33

In [73]:
domains.tail(1000)[tail_sentences <= 1].values

Out[73]:
array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa',
       'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic',
       'mrojp', 'irozho', 'bienve'], dtype=object)

SLIDE 34

In [74]:
domains.tail(1000)[tail_sentences > 10000].values

Out[74]:
array(['studentdebtreductioncenter', 'inspiredholisticwellness',
       'forensicaccountingexpert', 'medicalintuitivetraining',
       'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply',
       'commercialrefrigerationrepairmiami', 'athenatrainingacademy',
       'business-leadership-qualities', 'casaquetzalsanmigueldeallende',
       'landscapedesignimagingsoftware', 'southcaliforniauniversity',
       'replacementtractorpartsforsale', 'reinventinghealthcareinfo',
       'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy',
       'californiaconstructionjobs', 'margaritavilleislandhotel',
       'whatstoressellgarciniacambogia'], dtype=object)

In [75]:
[' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale')[:10]]

Out[75]:
['replacement tractor parts forsale',
 'replacement tractor parts forsa',
 'replacement tractor parts forsa le',
 'replacement tractor parts fors ale',
 'replacement tractor parts fors al',
 'replacement tractor parts fors le',
 'replacement tractor parts for sale',
 'replacement tractor parts for sal',
 'replacement tractor parts for ale',
 'replacement tractor parts for al']

SLIDE 35

3.2 Decide how likely the domain is given a particular subset

A first approach would be to say that the probability $P(d|s)$ decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length. To sort by this probability, we can therefore use the following:

In [77]:
def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    # coverage of the domain first; fewer words wins ties
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))

SLIDE 36

In [78]:
domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]

Out[78]:
[['power', 'washer', 'chicago'],
 ['pow', 'erw', 'asher', 'chicago'],
 ['powe', 'rwa', 'sher', 'chicago'],
 ['powe', 'rwas', 'her', 'chicago'],
 ['powe', 'rwas', 'herc', 'hicago'],
 ['power', 'was', 'her', 'chicago'],
 ['power', 'was', 'herc', 'hicago'],
 ['power', 'wash', 'erc', 'hicago'],
 ['power', 'wash', 'erch', 'icago'],
 ['power', 'washe', 'rch', 'icago'],
 ['power', 'washer', 'chi', 'cago'],
 ['power', 'washer', 'chic', 'ago'],
 ['power', 'washer', 'chica', 'go'],
 ['pow', 'erw', 'as', 'her', 'chicago'],
 ['pow', 'erw', 'as', 'herc', 'hicago']]

SLIDE 37

In [79]:
domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]

Out[79]:
[['catholic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['catholic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['catholic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['catholic', 'commentar', 'yons', 'acred', 'scripture'],
 ['catholic', 'commentary', 'on', 'sacred', 'scripture'],
 ['catholic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]

SLIDE 38

Let's see the top guesses for a selection of domains:

SLIDE 39

In [105]:
import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence

SLIDE 40

In [ ]:
def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]  # best first
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)

SLIDE 41

In [238]:
subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # correct guess index for the first 10 domains, or -1
df[['correct'] + list(range(3))]

Out[238]:
                               correct  0                            1                            2
hedgefundupdate.com                  0  hedge fund update            hedge fundu pdate            he dge fund update
traveldailynews.com                  3  travel dailynews             tra vel dailynews            trav el dailynews
miriamkhalladi.com                  -1  miria mkh alladi             miriam khal ladi             mir iam khal ladi
poolheatpumpstore.com                0  pool heat pump store         pool heat pumps tore         poo lhe at pump store
blogorganization.com                 0  blog organization            blo go rganization           blo gor ganization
smallcapvoice.com                    2  smallcap voice               smal lcap voice              small cap voice
cefcorp.com                          0  cef corp                     c efc orp                    cef c orp
lightandmotionphotography.com        3  light andmotion photography  lightand motion photography  ligh tand motion photography
uggbootrepairs.com                   0  ugg boot repairs             ugg boo tre pairs            ugg boo trep airs
abundancesecrets.com                 0  abundance secrets            abun dance secrets           abund ance secrets

SLIDE 42

In [239]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn

correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1,
           1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0,
           2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
           -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0,
           0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]

SLIDE 43

In [240]:
pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');

SLIDE 44

In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 guesses was correct 87 times.
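Both numbers can be recomputed from the `correct` list above as a quick sanity check (this check is mine, not from the slides):

    n_first_correct = sum(1 for c in correct if c == 0)   # first guess was right
    n_in_top_25 = sum(1 for c in correct if c != -1)      # some guess in the top 25 was right
    print(n_first_correct, n_in_top_25)  # 65 87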

SLIDE 45

Is this good enough?

Primary use case: given 500 domains in a market, what are the themes?

Expect ~325 domains in theme clusters and ~175 distributed randomly. This will probably still require human sanity checks.

SLIDE 46

What can be done?

So far, we only consider the likelihood of a domain given a sentence, $P(d|s)$. But how likely is the sentence itself? The next hack day is to develop a model for the sentence likelihood, $P(s)$.

SLIDE 47

Determine the best sentence

From Bayes:

$$P(s|d) = \frac{P(d|s)\,P(s)}{P(d)}$$

Since $P(d)$ is the same for all sentences, it can be ignored when finding the argmax:

$$\operatorname{argmax}_s P(s|d) = \operatorname{argmax}_s P(d|s)\,P(s)$$

SLIDE 48

What was done

  • Trained a dictionary using Google Ngram Viewer data
  • Found word substrings in domains
  • Built sentences from words, with crude cuts applied
  • Ordered predictions based on a crude score function
  • Measured performance on 100 labelled domains

SLIDE 49

What I used

Inspiration: Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)

Libraries: pandas, numpy, re, sklearn.feature_extraction.text.CountVectorizer

Functions:

build_dictionary(corpus, min_df=0)
find_words_in_string(string, dictionary, longest_word=None)
find_sentences(string, words, part_sentence, sentences, threshold=0.0)
get_sentences(domain, thresh=0.95)
score_d_given_s(sentence, domain)
guess(d, n_guesses=25)

SLIDE 50

After training, it can be used like this:

In [211]:
guess('powerwasherchicago')[0]

Out[211]:
'power washer chicago'

SLIDE 51

What still needs to be done for performance

  • Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits
  • Sentences need to be compared based on the likelihood of that sentence construction, i.e. $P(s)$ (see the sketch below)
  • Additional words need to be incorporated into the dictionary
  • Threshold hyper-parameters need tuning
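As an illustration of the second point (my sketch, not from the talk): a simple starting model for $P(s)$ would be a unigram model over word frequencies, for example derived from the ngram match counts already used to build the dictionary. Here `word_counts` and `total_count` are hypothetical inputs.

    import math

    def log_p_sentence(sentence, word_counts, total_count):
        """Unigram log-likelihood of a sentence, with add-one smoothing."""
        vocab_size = len(word_counts)
        return sum(math.log((word_counts.get(word, 0) + 1) / (total_count + vocab_size))
                   for word in sentence)

Under such a model, 'power washer chicago' would outscore segmentations of equal coverage built from rare fragments such as 'pow erw asher chicago'.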

SLIDE 52

...and to make it usable

  • Replace custom code with library functions where possible
  • Extend the remaining code to support array and dataframe inputs
  • Make it compatible with sklearn pipelines
  • Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
  • Optimise the substring search (one possible direction is sketched below)
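On the last point (my sketch, not from the talk): find_words_in_string can abandon a start position early if the dictionary is also indexed by its prefixes, since once the current substring is not a prefix of any word, no longer substring from that start can match either.

    def build_prefixes(dictionary):
        """Every non-empty prefix of every dictionary word."""
        return {word[:i] for word in dictionary for i in range(1, len(word) + 1)}

    def find_words_fast(string, dictionary, prefixes):
        for start in range(len(string)):
            for end in range(start + 1, len(string) + 1):
                substring = string[start:end]
                if substring not in prefixes:
                    break  # no dictionary word begins with this substring
                if substring in dictionary:
                    yield substring, start, end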

SLIDE 53

Think you can do better?

Get in touch: calvin.giles@gmail.com @calvingiles

SLIDE 54

In [122]:
import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
               'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)