SLIDE 1

Learning to read URLs

Finding the word boundaries in multi-word domain names with Python and sklearn.

Calvin Giles

SLIDE 2

Who am I?

  • Data Scientist at Adthena
  • PyData Co-Organiser
  • Physicist
  • Like to solve problems pragmatically

SLIDE 3

The Problem

Given a domain name:

'powerwasherchicago.com'
'catholiccommentaryonsacredscripture.com'

Find the concatenated sentence:

'power washer chicago (.com)'
'catholic commentary on sacred scripture (.com)'

SLIDE 4

Why is this useful?

How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'?

How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'?

Domains resolved into words can be compared on a semantic level, not simply as strings.
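To make that concrete (an illustrative sketch, not from the slides): once domains are resolved into word lists, even a crude token-set measure such as Jaccard similarity becomes meaningful, where on the raw strings it is not. The second domain below is made up for the example.

    def jaccard(a, b):
        """Jaccard similarity of two token sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    jaccard(['powerwasherchicago'], ['powerwasherdallas'])                   # 0.0 as raw strings
    jaccard('power washer chicago'.split(), 'power washer dallas'.split())  # 0.5 as word sets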

SLIDE 5

Primary use case

Given 500 domains in a market, what are the themes?

SLIDE 6

Scope of project

As part of Adthena Labs, our internal idea-incubation programme, this approach was developed during a one-day hack to determine whether it could be useful to the business.

SLIDE 7

Adthena's Data

  • > 10 million unique domains
  • > 50 million unique search terms

3rd Party Data

  • Project Gutenberg (https://www.gutenberg.org/)
  • Google Ngram Viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

SLIDE 8

Process

  • 1. Learn some words
  • 2. Find where words occur in a domain name
  • 3. Choose the most likely set of words
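As a roadmap (paraphrasing the code to come; each of these functions is defined in the slides that follow, nothing here is new API):

    # 1. Learn some words: build a dictionary from a corpus
    dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
    # 2. Find where words occur in a domain name
    words = find_words_in_string('powerwasherchicago', dictionary)
    # 3. Choose the most likely set of words
    sentences = get_sentences('powerwasherchicago')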
SLIDE 9
  • 1. Learn some words

Build a dictionary using suitable documents. Documents: search terms

In [2]:
import pandas, os

search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]

Out[2]:
1000000               new 2014 mercedes benz b200 cdi
3000000                  weight watchers in glynneath
5000000               property for rent in batlow nsw
7000000                        us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())

SLIDE 10

In [126]:
st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

21.4k words found
Out[126]:
['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']

SLIDE 11

We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:

In [127]:
dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear in at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary

SLIDE 12

Actually, scrap that...

... and use the Google Ngram Viewer datasets:

SLIDE 13

In [212]:
dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"

SLIDE 14

That takes us to ~1M words! We even get some good two-letter words to work with:

In [130]:
print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))

142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
 '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36',
 '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50',
 '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64',
 '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78',
 '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92',
 '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be',
 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he',
 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no',
 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi',
 'we', 'ye']

SLIDE 15

In [144]:
# `choice` is numpy.random.choice
choice(list(dictionary), size=40)

Out[144]:
array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley',
       'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin',
       'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey',
       'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer',
       'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra',
       'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts',
       'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu',
       'conoció'], dtype='<U81')

SLIDE 16
  • 2. Find where words occur in a domain name

Find all substrings of a domain that are in our dictionary, along with their start and end indices.

SLIDE 17

In [149]:
def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indices = ((start, start + length)
                         for start in range(len(string))
                         for length in range(1, longest_word + 1))
    for start, end in substring_indices:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end of the string
            yield substring, start, start + len(substring)

SLIDE 18

In [234]:
domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag',
 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica',
 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher',
 'was', 'wash', 'washe', 'washer', 'we', 'wer']

SLIDE 19

In [235]:
domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat',
 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm',
 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred',
 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta',
 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men',
 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta',
 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu',
 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr',
 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar',
 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']

SLIDE 20
  • 3. Choose the most likely set of words

Simple approach to do this:

  • 1. Find all subsets of the set of words found
  • 2. Determine if that subset is non-overlapping
  • 3. Decide how likely the domain is given a particular subset: $P(d|s)$
  • 4. Decide how likely it is that the subset would occur overall: $P(s)$
  • 5. Determine the best subset: $\operatorname{argmax}_s P(s|d)$

SLIDE 21

We need some domain name data for the next part...

In [153]:
domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace("\.com$", "")
domains = domains.str.replace("^https?\:\/\/", "")
domains = domains.str.replace("^www\d?\.", "")
num_fmt(len(domains))

Out[153]:
'3.8M'

SLIDE 22

In [224]:
choice(domains, size=20)

Out[224]:
array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart',
       'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain',
       'miniot-online', 'chelseabarracksbarracks', 'zeroeasy',
       'newlookupholstery', 'radcliffehealth', 'embracingthemundane',
       'immunityassist', 'simplynostretchmarks', 'teachmetoswim',
       'thetford-europe', 'charlesallenford', 'china-chargermanufacturer',
       'coolbabykid'], dtype=object)

SLIDE 23
  • 1. Find all subsets of the set of words found

There are $2^n$ different sentences that can be constructed from $n$ substrings, including the empty sentence. We can get an idea of how bad that will be with a sample of the data.

SLIDE 24

In [53]:
longest_word = max(len(word) for word in dictionary)  # speeds up search

def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))

In [56]:
import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)

Out[56]:
count     1k
mean    28.3
std     15.8
min        1
25%       17
50%       26
75%       38
max       93
Name: Domain, dtype: object

In [227]:
num_fmt(2**28), 2**93

Out[227]:
('268M', 9903520314283042199192993792)

SLIDE 25

So the worst case in a sample of 1000 domains is $2^{93}$ subsets to test!

SLIDE 26

Combine steps 1 and 2

  • 1. Find all subsets of the set of words found
  • 2. Determine if that subset is non-overlapping

becomes:

  • 1. Find all subsets with non-overlapping words
  • 2. Do nothing :-)
SLIDE 27

3.1 Find all subsets with non-overlapping words

Build a tree of subsets of non-overlapping words by sorting the words by their start index.

...and only return the "best" few cases anyway

It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can prune the search significantly if we only consider sentences at least half as long as the best match. In practice, this does not appear to have any impact on the results, but it prevents an explosion of sentences with particularly long domains.

SLIDE 28

A little more code...

In [147]:
def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.

    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if (current_idx >= len(string)
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(string=string, words=words[i+1:],
                                                   part_sentence=new_part_sentence,
                                                   sentences=sentences,
                                                   threshold=threshold,
                                                   current_idx=end_idx,
                                                   current_score=new_score,
                                                   best_score=best_score)
    return sentences, best_score

SLIDE 29

Add a wrapper

In [148]:
def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))  # order by start index, as find_sentences requires
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]

SLIDE 30

In [64]:
sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)

245
Out[64]:
array([['pow', 'erw', 'as', 'her', 'chicago'],
       ['pow', 'erw', 'ashe', 'chica', 'go'],
       ['power', 'was', 'her', 'chica', 'go'],
       ['power', 'was', 'he', 'rch', 'cago'],
       ['power', 'was', 'her', 'chicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'ash', 'erc', 'hicago'],
       ['ower', 'wash', 'erc', 'hicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'was', 'her', 'chi', 'cago'],
       ['power', 'was', 'her', 'chic', 'ago'],
       ['power', 'as', 'he', 'rch', 'ica', 'go'],
       ['ower', 'washer', 'chicago'],
       ['owe', 'rwas', 'he', 'rch', 'ica', 'go'],
       ['power', 'washer', 'chic', 'go']], dtype=object)

SLIDE 31

In [65]:
sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)

540428
Out[65]:
array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'],
       ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'],
       ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'],
       ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'],
       ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'],
       ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'],
       ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'],
       ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'],
       ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'],
       ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'],
       ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'],
       ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'],
       ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'],
       ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'],
       ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']],
      dtype=object)

SLIDE 32

In [71]:
tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)

In [155]:
tail_sentences.describe().apply(int).apply(num_fmt)

Out[155]:
count       1k
mean     1.18k
std      10.7k
min          1
25%         12
50%         39
75%        145
max       280k
Name: Domain, dtype: object

SLIDE 33

In [73]:
domains.tail(1000)[tail_sentences <= 1].values

Out[73]:
array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa',
       'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic',
       'mrojp', 'irozho', 'bienve'], dtype=object)

SLIDE 34

In [74]:
domains.tail(1000)[tail_sentences > 10000].values

Out[74]:
array(['studentdebtreductioncenter', 'inspiredholisticwellness',
       'forensicaccountingexpert', 'medicalintuitivetraining',
       'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply',
       'commercialrefrigerationrepairmiami', 'athenatrainingacademy',
       'business-leadership-qualities', 'casaquetzalsanmigueldeallende',
       'landscapedesignimagingsoftware', 'southcaliforniauniversity',
       'replacementtractorpartsforsale', 'reinventinghealthcareinfo',
       'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy',
       'californiaconstructionjobs', 'margaritavilleislandhotel',
       'whatstoressellgarciniacambogia'], dtype=object)

In [75]:
[' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale')[:10]]

Out[75]:
['replacement tractor parts forsale',
 'replacement tractor parts forsa',
 'replacement tractor parts forsa le',
 'replacement tractor parts fors ale',
 'replacement tractor parts fors al',
 'replacement tractor parts fors le',
 'replacement tractor parts for sale',
 'replacement tractor parts for sal',
 'replacement tractor parts for ale',
 'replacement tractor parts for al']

SLIDE 35

3.2 Decide how likely the domain is given a particular subset

A first approach would be to say that the probability $P(d|s)$ decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length. To sort by this probability, we can therefore use the following:

In [77]:
def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    # coverage of the domain first; fewer words wins ties
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))

SLIDE 36

In [78]:
domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]

Out[78]:
[['power', 'washer', 'chicago'],
 ['pow', 'erw', 'asher', 'chicago'],
 ['powe', 'rwa', 'sher', 'chicago'],
 ['powe', 'rwas', 'her', 'chicago'],
 ['powe', 'rwas', 'herc', 'hicago'],
 ['power', 'was', 'her', 'chicago'],
 ['power', 'was', 'herc', 'hicago'],
 ['power', 'wash', 'erc', 'hicago'],
 ['power', 'wash', 'erch', 'icago'],
 ['power', 'washe', 'rch', 'icago'],
 ['power', 'washer', 'chi', 'cago'],
 ['power', 'washer', 'chic', 'ago'],
 ['power', 'washer', 'chica', 'go'],
 ['pow', 'erw', 'as', 'her', 'chicago'],
 ['pow', 'erw', 'as', 'herc', 'hicago']]

SLIDE 37

In [79]:
domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]

Out[79]:
[['catholic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['catholic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['catholic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['catholic', 'commentar', 'yons', 'acred', 'scripture'],
 ['catholic', 'commentary', 'on', 'sacred', 'scripture'],
 ['catholic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]

SLIDE 38

Let's see the top guesses for a selection of domains:

SLIDE 39

In [105]:
import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence

SLIDE 40

In [ ]:
def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]  # best first
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)

SLIDE 41

In [238]:
subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # correct guess index for the first 10 domains, or -1
df[['correct'] + list(range(3))]

Out[238]:
                               correct  0                            1                            2
hedgefundupdate.com                  0  hedge fund update            hedge fundu pdate            he dge fund update
traveldailynews.com                  3  travel dailynews             tra vel dailynews            trav el dailynews
miriamkhalladi.com                  -1  miria mkh alladi             miriam khal ladi             mir iam khal ladi
poolheatpumpstore.com                0  pool heat pump store         pool heat pumps tore         poo lhe at pump store
blogorganization.com                 0  blog organization            blo go rganization           blo gor ganization
smallcapvoice.com                    2  smallcap voice               smal lcap voice              small cap voice
cefcorp.com                          0  cef corp                     c efc orp                    cef c orp
lightandmotionphotography.com        3  light andmotion photography  lightand motion photography  ligh tand motion photography
uggbootrepairs.com                   0  ugg boot repairs             ugg boo tre pairs            ugg boo trep airs
abundancesecrets.com                 0  abundance secrets            abun dance secrets           abund ance secrets

SLIDE 42

In [239]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn

correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1,
           1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0,
           2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
           -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0,
           0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]

SLIDE 43

In [240]:
pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');

SLIDE 44

In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 guesses was correct 87 times.
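Both numbers can be recomputed from the `correct` list above as a quick sanity check (this check is mine, not from the slides):

    n_first_correct = sum(1 for c in correct if c == 0)   # first guess was right
    n_in_top_25 = sum(1 for c in correct if c != -1)      # some guess in the top 25 was right
    print(n_first_correct, n_in_top_25)  # 65 87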

SLIDE 45

Is this good enough?

Primary use case: given 500 domains in a market, what are the themes?

Expect ~325 domains in theme clusters and ~175 distributed randomly. This will probably still require human sanity checks.

SLIDE 46

What can be done?

So far, we only consider the likelihood of a domain given a sentence, $P(d|s)$. But how likely is the sentence itself? The next hack day is to develop a model for the sentence likelihood, $P(s)$.

SLIDE 47

Determine the best sentence

From Bayes:

$$P(s|d) = \frac{P(d|s)\,P(s)}{P(d)}$$

Since $P(d)$ is the same for all sentences, it can be ignored when finding the argmax:

$$\operatorname{argmax}_s P(s|d) = \operatorname{argmax}_s P(d|s)\,P(s)$$

SLIDE 48

What was done

  • Trained a dictionary using Google Ngram Viewer data
  • Found word substrings in domains
  • Built sentences from words, with crude cuts applied
  • Ordered predictions based on a crude score function
  • Measured performance on 100 labelled domains

SLIDE 49

What I used

Inspiration: Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)

Libraries: pandas, numpy, re, sklearn.feature_extraction.text.CountVectorizer

Functions:

build_dictionary(corpus, min_df=0)
find_words_in_string(string, dictionary, longest_word=None)
find_sentences(string, words, part_sentence, sentences, threshold=0.0)
get_sentences(domain, thresh=0.95)
score_d_given_s(sentence, domain)
guess(d, n_guesses=25)

SLIDE 50

After training, it can be used like this:

In [211]:
guess('powerwasherchicago')[0]

Out[211]:
'power washer chicago'

SLIDE 51

What still needs to be done for performance

  • Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits
  • Sentences need to be compared based on the likelihood of that sentence construction, i.e. $P(s)$ (see the sketch below)
  • Additional words need to be incorporated into the dictionary
  • Threshold hyper-parameters need tuning
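As an illustration of the second point (my sketch, not from the talk): a simple starting model for $P(s)$ would be a unigram model over word frequencies, for example derived from the ngram match counts already used to build the dictionary. Here `word_counts` and `total_count` are hypothetical inputs.

    import math

    def log_p_sentence(sentence, word_counts, total_count):
        """Unigram log-likelihood of a sentence, with add-one smoothing."""
        vocab_size = len(word_counts)
        return sum(math.log((word_counts.get(word, 0) + 1) / (total_count + vocab_size))
                   for word in sentence)

Under such a model, 'power washer chicago' would outscore segmentations of equal coverage built from rare fragments such as 'pow erw asher chicago'.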

SLIDE 52

...and to make it usable

  • Replace custom code with library functions where possible
  • Extend the remaining code to support array and dataframe inputs
  • Make it compatible with sklearn pipelines
  • Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
  • Optimise the substring search (one possible direction is sketched below)
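On the last point (my sketch, not from the talk): find_words_in_string can abandon a start position early if the dictionary is also indexed by its prefixes, since once the current substring is not a prefix of any word, no longer substring from that start can match either.

    def build_prefixes(dictionary):
        """Every non-empty prefix of every dictionary word."""
        return {word[:i] for word in dictionary for i in range(1, len(word) + 1)}

    def find_words_fast(string, dictionary, prefixes):
        for start in range(len(string)):
            for end in range(start + 1, len(string) + 1):
                substring = string[start:end]
                if substring not in prefixes:
                    break  # no dictionary word begins with this substring
                if substring in dictionary:
                    yield substring, start, end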

SLIDE 53

Think you can do better?

Get in touch: calvin.giles@gmail.com @calvingiles

SLIDE 54

In [122]:
import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
               'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)