Learning to read urls


  1. Learning to read urls: finding the word boundaries in multi-word domain names with Python and sklearn. Calvin Giles

  2. Who am I? Data Scientist at Adthena. PyData Co-Organiser. Physicist. I like to solve problems pragmatically.

  3. The Problem. Given a domain name ('powerwasherchicago.com', 'catholiccommentaryonsacredscripture.com'), find the concatenated sentence ('power washer chicago (.com)', 'catholic commentary on sacred scripture (.com)').

  4. Why is this useful? How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'? How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'? Domains resolved into words can be compared on a semantic level, not simply as strings.

  5. Primary use case: Given 500 domains in a market, what are the themes?

  6. Scope of project: As part of our internal idea incubation, Adthena Labs, this approach was developed during a one-day hack to determine whether it could be useful to the business.

  7. Adthena's Data: > 10 million unique domains; > 50 million unique search terms. 3rd Party Data: Project Gutenberg (https://www.gutenberg.org/); Google ngram viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html).

  8. Process: 1. Learn some words. 2. Find where words occur in a domain name. 3. Choose the most likely set of words.
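Before the step-by-step slides, here is an assumed bit of glue code (not from the talk) showing how the three steps chain together. build_dictionary and find_words_in_string are defined on later slides; choose_best_segmentation is a hypothetical helper for step 3, a sketch of which appears after slide 18.

    # Assumed glue code, not from the slides, tying the three steps together.
    dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)   # 1. learn some words
    candidates = find_words_in_string('powerwasherchicago', dictionary)  # 2. find word occurrences
    words = choose_best_segmentation('powerwasherchicago', candidates)   # 3. choose the most likely set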

  9. 1. Learn some words. Build a dictionary using suitable documents. Documents: search terms.

    In [2]:
    import pandas, os
    search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
    search_terms = search_terms['SearchTerm'].dropna().str.lower()
    search_terms.iloc[1000000::2000000]

    Out[2]:
    1000000               new 2014 mercedes benz b200 cdi
    3000000                  weight watchers in glynneath
    5000000               property for rent in batlow nsw
    7000000                        us plug adaptor for uk
    9000000    which features mobile is best for purchase
    Name: SearchTerm, dtype: object

    In [125]:
    from sklearn.feature_extraction.text import CountVectorizer

    def build_dictionary(corpus, min_df=0):
        vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # require 2+ characters
        vec.fit(corpus)
        return set(vec.get_feature_names())

  10. In [126]:
    st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
    dictionary_size = len(st_dictionary)
    print('{} words found'.format(num_fmt(dictionary_size)))
    sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

    21.4k words found
    Out[126]:
    ['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']
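The print calls above (and on the following slides) rely on a num_fmt helper that isn't defined in this excerpt. As an assumption only, here is a minimal sketch of a helper that would produce output like '21.4k':

    # Hypothetical num_fmt helper (not shown in the slides): format a count as a
    # short human-readable string, e.g. 21400 -> '21.4k', 72000 -> '72k'.
    def num_fmt(n):
        for scale, suffix in ((1000000, 'M'), (1000, 'k')):
            if n >= scale:
                return '{:.3g}{}'.format(n / scale, suffix)
        return str(n)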

  11. We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:

    In [127]:
    dictionary = st_dictionary
    for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
        if not fname.endswith('.txt'):
            continue
        with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
            book = pandas.Series(f.readlines())
        book = book.str.strip()
        book = book[book != '']
        book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
        dictionary_size = len(book_dictionary)
        print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
        dictionary |= book_dictionary
    print('{} words in dictionary'.format(num_fmt(len(dictionary))))

    2.11k words found in a_christmas_carol.txt
    1.65k words found in alice_in_wonderland.txt
    3.71k words found in huckleberry_finn.txt
    4.09k words found in pride_and_predudice.txt
    4.52k words found in sherlock_holmes.txt
    26.4k words in dictionary

  12. Actually, scrap that... and use the Google ngram viewer datasets:

  13. In [212]:
    dictionary = set()
    ngram_files = [fn for fn in os.listdir(ngram_data_directory)
                   if 'googlebooks' in fn and fn.endswith('_processed.csv')]
    for fname in ngram_files:
        ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
        ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                        | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
        ngrams = ngrams.ngram
        ngrams = ngrams.str.lower()
        ngrams = ngrams[ngrams != '']
        ngrams_dictionary = set(ngrams)
        dictionary_size = len(ngrams_dictionary)
        print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
        dictionary |= ngrams_dictionary
    print('{} words in dictionary'.format(num_fmt(len(dictionary))))

    2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
    12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
    5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
    4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
    3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
    2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
    2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
    2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
    2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
    2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
    61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
    55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
    72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
    46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
    36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
    32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
    36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
    37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
    30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
    12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
    31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
    36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
    63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"
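The *_processed.csv files read above are not produced in this excerpt. The raw Google Books 1-gram files are tab-separated with one row per ngram per year (ngram, year, match_count, volume_count), so presumably they were aggregated into a total match_count per ngram beforehand. A hedged sketch of what that preprocessing might look like:

    # Hypothetical preprocessing (the slides only read the result): collapse a raw
    # Google Books 1-gram file into a total match_count per ngram.
    import pandas

    def preprocess_ngram_file(raw_path, out_path):
        raw = pandas.read_csv(raw_path, sep='\t', header=None,
                              names=['ngram', 'year', 'match_count', 'volume_count'])
        raw['ngram'] = raw['ngram'].str.replace(r'_[A-Z]+$', '', regex=True)  # strip POS tags like '_NOUN'
        totals = raw.groupby('ngram', as_index=False)['match_count'].sum()
        totals.to_csv(out_path, index=False)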

  14. That takes us to ~1M words! We even get some good two-letter words to work with:

    In [130]:
    print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
    print(sorted({w for w in dictionary if len(w) == 2}))

    142 2-letter words
    ['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he', 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']

  15. In [144]:
    choice(list(dictionary), size=40)

    Out[144]:
    array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley',
           'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin', 'xciv',
           'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey',
           'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer',
           'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra', 'genéticos',
           'layn', 'dryl', 'thonis', 'legítimos', 'latts', 'radames', 'bwlch',
           'lanzamiento', 'quea', 'dumnoniorum', 'matu', 'conoció'], dtype='<U81')

  16. 2. Find where words occur in a domain name. Find all substrings of a domain that are in our dictionary, along with their start and end indices.

  17. In [149]:
    def find_words_in_string(string, dictionary, longest_word=None):
        if longest_word is None:
            longest_word = max(len(word) for word in dictionary)
        substring_indices = ((start, start + length)
                             for start in range(len(string))
                             for length in range(1, longest_word + 1))
        for start, end in substring_indices:
            substring = string[start:end]
            if substring in dictionary:
                # use len(substring) in case we sliced beyond the end
                yield substring, start, start + len(substring)

  18. In [234]:
    domain = 'powerwasherchicago'
    words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
    print(len(words))
    print(words)

    39
    ['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag', 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica', 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher', 'was', 'wash', 'washe', 'washer', 'we', 'wer']
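The remaining slides (step 3: choose the most likely set of words) are not included in this excerpt. Purely as an illustration, and not necessarily the approach from the talk, one way to pick a segmentation from the candidates above is dynamic programming that covers the whole string with the fewest words:

    # Illustrative sketch only (not the talk's method): choose a segmentation that
    # covers the whole string using the fewest dictionary words.
    def choose_best_segmentation(string, candidates):
        # candidates: (word, start, end) tuples as yielded by find_words_in_string
        by_end = {}
        for word, start, end in candidates:
            by_end.setdefault(end, []).append((word, start))
        # best[i] = (word_count, words) for the best cover of string[:i]
        best = {0: (0, [])}
        for end in range(1, len(string) + 1):
            for word, start in by_end.get(end, []):
                if start in best:
                    count, words = best[start]
                    if end not in best or count + 1 < best[end][0]:
                        best[end] = (count + 1, words + [word])
        return best.get(len(string), (None, None))[1]

    choose_best_segmentation(domain, find_words_in_string(domain, dictionary))
    # e.g. ['power', 'washer', 'chicago'] for domain = 'powerwasherchicago'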
