Learning to read URLs

Finding the word boundaries in multi-word domain names with Python and sklearn.

Calvin Giles
Who am I?
- Data Scientist at Adthena
- PyData Co-Organiser
- Physicist
- Like to solve problems pragmatically
The Problem
Given a domain name:
'powerwasherchicago.com' 'catholiccommentaryonsacredscripture.com'
Find the concatenated sentence:
'power washer chicago (.com)' 'catholic commentary on sacred scripture (.com)'
Why is this useful?
How similar are 'powerwasherchicago.com' and 'extreme-tyres.co.uk'? How similar are 'power washer chicago (.com)' and 'extreme tyres (.co.uk)'? Domains resolved into words can be compared on a semantic level, not simply as strings.
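As a minimal sketch of the comparison this enables (Jaccard overlap of word sets is an illustrative choice, and the second domain is hypothetical, not from the dataset):

def word_similarity(words_a, words_b):
    """Jaccard overlap between two resolved word sets."""
    words_a, words_b = set(words_a), set(words_b)
    return len(words_a & words_b) / len(words_a | words_b)

# 'powerwasherchicago' and a hypothetical 'chicagocarpetcleaning' share no
# useful common substring, but resolved into words they share a city name:
word_similarity(['power', 'washer', 'chicago'],
                ['chicago', 'carpet', 'cleaning'])  # 0.2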
Primary use case
Given 500 domains in a market, what are the themes?
Scope of project
As part of Adthena Labs, our internal idea incubation, this approach was developed during a one-day hack to determine whether it could be useful to the business.
Adthena's Data
- >10 million unique domains
- >50 million unique search terms
3rd Party Data
- Project Gutenberg (https://www.gutenberg.org/)
- Google Ngram Viewer datasets (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)
Process
- 1. Learn some words
- 2. Find where words occur in a domain name
- 3. Choose the most likely set of words
- 1. Learn some words
Build a dictionary using suitable documents. Documents: search terms
In [2]:
import pandas, os

search_terms = pandas.read_csv(os.path.join(data_directory, 'search_terms.csv'))
search_terms = search_terms['SearchTerm'].dropna().str.lower()
search_terms.iloc[1000000::2000000]

Out[2]:
1000000                new 2014 mercedes benz b200 cdi
3000000                   weight watchers in glynneath
5000000                property for rent in batlow nsw
7000000                         us plug adaptor for uk
9000000    which features mobile is best for purchase
Name: SearchTerm, dtype: object

In [125]:
from sklearn.feature_extraction.text import CountVectorizer

def build_dictionary(corpus, min_df=0):
    vec = CountVectorizer(min_df=min_df, token_pattern=r'(?u)\b\w{2,}\b')  # require 2+ characters
    vec.fit(corpus)
    return set(vec.get_feature_names())
In [126]:
st_dictionary = build_dictionary(corpus=search_terms, min_df=0.00001)
dictionary_size = len(st_dictionary)
print('{} words found'.format(num_fmt(dictionary_size)))
sorted(st_dictionary)[dictionary_size//20::dictionary_size//10]

21.4k words found

Out[126]:
['430', 'benson', 'colo', 'es1', 'hd7', 'leed', 'nikon', 'razors', 'springs', 'vinyl']
We have 21 thousand words in our base dictionary. We can augment this with some books from Project Gutenberg:
In [127]:
dictionary = st_dictionary
for fname in os.listdir(os.path.join(data_directory, 'project_gutenberg')):
    if not fname.endswith('.txt'):
        continue
    with open(os.path.join(data_directory, 'project_gutenberg', fname)) as f:
        book = pandas.Series(f.readlines())
    book = book.str.strip()
    book = book[book != '']
    book_dictionary = build_dictionary(corpus=book, min_df=2)  # keep words that appear on at least 2 lines
    dictionary_size = len(book_dictionary)
    print('{} words found in {}'.format(num_fmt(dictionary_size), fname))
    dictionary |= book_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.11k words found in a_christmas_carol.txt
1.65k words found in alice_in_wonderland.txt
3.71k words found in huckleberry_finn.txt
4.09k words found in pride_and_predudice.txt
4.52k words found in sherlock_holmes.txt
26.4k words in dictionary
Actually, scrap that...
... and use the Google Ngram Viewer datasets instead:
In [212]:
dictionary = set()
ngram_files = [fn for fn in os.listdir(ngram_data_directory)
               if 'googlebooks' in fn and fn.endswith('_processed.csv')]
for fname in ngram_files:
    ngrams = pandas.read_csv(os.path.join(ngram_data_directory, fname))
    ngrams = ngrams[(ngrams.match_count > 10*1000*1000) & (ngrams.ngram.str.len() == 2)
                    | (ngrams.match_count > 1000) & (ngrams.ngram.str.len() > 2)]
    ngrams = ngrams.ngram
    ngrams = ngrams.str.lower()
    ngrams = ngrams[ngrams != '']
    ngrams_dictionary = set(ngrams)
    dictionary_size = len(ngrams_dictionary)
    print('{} valid words found in "{}"'.format(num_fmt(dictionary_size), fname))
    dictionary |= ngrams_dictionary
print('{} words in dictionary'.format(num_fmt(len(dictionary))))

2.93k valid words found in "googlebooks-eng-all-1gram-20120701-0_processed.csv"
12.7k valid words found in "googlebooks-eng-all-1gram-20120701-1_processed.csv"
5.58k valid words found in "googlebooks-eng-all-1gram-20120701-2_processed.csv"
4.09k valid words found in "googlebooks-eng-all-1gram-20120701-3_processed.csv"
3.28k valid words found in "googlebooks-eng-all-1gram-20120701-4_processed.csv"
2.72k valid words found in "googlebooks-eng-all-1gram-20120701-5_processed.csv"
2.52k valid words found in "googlebooks-eng-all-1gram-20120701-6_processed.csv"
2.18k valid words found in "googlebooks-eng-all-1gram-20120701-7_processed.csv"
2.08k valid words found in "googlebooks-eng-all-1gram-20120701-8_processed.csv"
2.5k valid words found in "googlebooks-eng-all-1gram-20120701-9_processed.csv"
61.6k valid words found in "googlebooks-eng-all-1gram-20120701-a_processed.csv"
55.2k valid words found in "googlebooks-eng-all-1gram-20120701-b_processed.csv"
72k valid words found in "googlebooks-eng-all-1gram-20120701-c_processed.csv"
46.1k valid words found in "googlebooks-eng-all-1gram-20120701-d_processed.csv"
36.2k valid words found in "googlebooks-eng-all-1gram-20120701-e_processed.csv"
32.4k valid words found in "googlebooks-eng-all-1gram-20120701-f_processed.csv"
36k valid words found in "googlebooks-eng-all-1gram-20120701-g_processed.csv"
37.9k valid words found in "googlebooks-eng-all-1gram-20120701-h_processed.csv"
30.3k valid words found in "googlebooks-eng-all-1gram-20120701-i_processed.csv"
12.3k valid words found in "googlebooks-eng-all-1gram-20120701-j_processed.csv"
31.4k valid words found in "googlebooks-eng-all-1gram-20120701-k_processed.csv"
36.7k valid words found in "googlebooks-eng-all-1gram-20120701-l_processed.csv"
63.6k valid words found in "googlebooks-eng-all-1gram-20120701-m_processed.csv"
That takes us to ~1M words! We even get some good two-letter words to work with:
In [130]:
print('{} 2-letter words'.format(len({w for w in dictionary if len(w) == 2})))
print(sorted({w for w in dictionary if len(w) == 2}))

142 2-letter words
['00', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', 'ad', 'al', 'am', 'an', 'as', 'at', 'be', 'by', 'cm', 'co', 'de', 'di', 'do', 'du', 'ed', 'el', 'en', 'et', 'ex', 'go', 'he', 'if', 'ii', 'in', 'is', 'it', 'iv', 'la', 'le', 'me', 'mg', 'mm', 'mr', 'my', 'no', 'of', 'oh', 'on', 'op', 'or', 're', 'se', 'so', 'st', 'to', 'un', 'up', 'us', 'vi', 'we', 'ye']
In [144]:
from numpy.random import choice  # import not shown in the original slides
choice(list(dictionary), size=40)

Out[144]:
array(['fades', 'archaeocyatha', 'subss', 'bikanir', 'fitn', 'cockley',
       'chinard', 'curtus', 'quantitiative', 'obfervation', 'poplin',
       'xciv', 'hanrieder', 'macaura', 'nakum', 'teuira', 'humphrey',
       'improvisationally', 'enforeed', 'caillie', 'plachter', 'feirer',
       'atomico', 'jven', 'ujvari', 'rekonstruieren', 'viverra',
       'genéticos', 'layn', 'dryl', 'thonis', 'legítimos', 'latts',
       'radames', 'bwlch', 'lanzamiento', 'quea', 'dumnoniorum', 'matu',
       'conoció'], dtype='<U81')
- 2. Find where words occur in a domain name
Find all substrings of a domain that are in our dictionary, along with their start and end indices.
In [149]:
def find_words_in_string(string, dictionary, longest_word=None):
    if longest_word is None:
        longest_word = max(len(word) for word in dictionary)
    substring_indices = ((start, start + length)
                         for start in range(len(string))
                         for length in range(1, longest_word + 1))
    for start, end in substring_indices:
        substring = string[start:end]
        if substring in dictionary:
            # use len(substring) in case we sliced beyond the end
            yield substring, start, start + len(substring)
In [234]:
domain = 'powerwasherchicago'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

39
['ago', 'as', 'ash', 'ashe', 'asher', 'cag', 'cago', 'chi', 'chic', 'chica', 'chicag', 'chicago', 'erc', 'erch', 'erw', 'go', 'he', 'her', 'herc', 'hic', 'hicago', 'ica', 'icago', 'owe', 'ower', 'pow', 'powe', 'power', 'rch', 'rwa', 'rwas', 'she', 'sher', 'was', 'wash', 'washe', 'washer', 'we', 'wer']
In [235]:
domain = 'catholiccommentaryonsacredscripture'
words = sorted({w for w, *_ in find_words_in_string(domain, dictionary)})
print(len(words))
print(words)

101
['acr', 'acre', 'acred', 'ary', 'aryo', 'at', 'ath', 'atho', 'athol', 'atholic', 'cat', 'cath', 'catho', 'cathol', 'catholi', 'catholic', 'cco', 'ccom', 'co', 'com', 'comm', 'comme', 'commen', 'comment', 'commenta', 'commentar', 'commentary', 'cre', 'cred', 'creds', 'cri', 'crip', 'cript', 'dsc', 'dscr', 'ed', 'eds', 'en', 'ent', 'enta', 'entar', 'entary', 'hol', 'holi', 'holic', 'icc', 'icco', 'ipt', 'lic', 'me', 'men', 'ment', 'menta', 'mentar', 'mentary', 'mm', 'mme', 'mment', 'nsa', 'nsac', 'nta', 'ntar', 'ntary', 'oli', 'olic', 'omm', 'omme', 'ommen', 'omment', 'on', 'ons', 'ptu', 'pture', 're', 'red', 'reds', 'rip', 'ript', 'ryo', 'ryon', 'ryons', 'sac', 'sacr', 'sacre', 'sacred', 'scr', 'scri', 'scrip', 'script', 'scriptur', 'scripture', 'tar', 'tary', 'tho', 'thol', 'tholic', 'tur', 'ture', 'ure', 'yon', 'yons']
- 3. Choose the most likely set of words
A simple approach:
- 1. Find all subsets of the set of words found
- 2. Determine if that subset is non-overlapping
- 3. Decide how likely the domain is given a particular subset
- 4. Decide how likely it is that the subset would occur overall
- 5. Determine best subset
i.e. use P(d|s) and P(s) to find argmax_s P(s|d).
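Taken literally, steps 1 and 2 look like the brute-force sketch below (for illustration only; the tree search later in the talk replaces it):

from itertools import combinations

def brute_force_sentences(found_words):
    """Enumerate every subset of (word, start, end) triples and keep the
    non-overlapping ones -- O(2^n) in the number of words found."""
    sentences = []
    for r in range(1, len(found_words) + 1):
        for subset in combinations(found_words, r):
            ordered = sorted(subset, key=lambda w: w[1])
            # non-overlapping: each word starts at or after the previous one ends
            if all(a[2] <= b[1] for a, b in zip(ordered, ordered[1:])):
                sentences.append([w for w, _, _ in ordered])
    return sentences

brute_force_sentences([('power', 0, 5), ('was', 5, 8), ('her', 8, 11)])
# [['power'], ['was'], ['her'], ['power', 'was'], ['power', 'her'],
#  ['was', 'her'], ['power', 'was', 'her']]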
We need some domain name data for the next part...
In [153]:
domains = pandas.read_csv(os.path.join(data_directory, 'domains.csv'))
domains = domains['Domain'].str.lower()
domains = domains[domains.str.endswith(".com")]
domains = domains.str.replace(r"\.com$", "")
domains = domains.str.replace(r"^https?\:\/\/", "")
domains = domains.str.replace(r"^www\d?\.", "")
num_fmt(len(domains))

Out[153]:
'3.8M'
In [224]:
choice(domains, size=20)

Out[224]:
array(['1topchannel', 'scales-chords', 'marcusmajestic', 'mylyfestart',
       'bluediamondturlock', 'bedfordvisionclinic', 'justinmccain',
       'miniot-online', 'chelseabarracksbarracks', 'zeroeasy',
       'newlookupholstery', 'radcliffehealth', 'embracingthemundane',
       'immunityassist', 'simplynostretchmarks', 'teachmetoswim',
       'thetford-europe', 'charlesallenford', 'china-chargermanufacturer',
       'coolbabykid'], dtype=object)
- 1. Find all subsets of the set of words found
There are 2^n different sentences that can be constructed from n substrings, including the empty sentence. We can get an idea how bad that will be with a sample of the data.
In [53]:
longest_word = max(len(word) for word in dictionary)  # speeds up search

def find_n_words_in_string(domain):
    return len(set(find_words_in_string(domain, dictionary, longest_word)))

In [56]:
import numpy
n_words = domains.tail(1000).apply(find_n_words_in_string)
n_words.describe().apply(num_fmt)

Out[56]:
count      1k
mean     28.3
std      15.8
min         1
25%        17
50%        26
75%        38
max        93
Name: Domain, dtype: object

In [227]:
num_fmt(2**28), 2**93

Out[227]:
('268M', 9903520314283042199192993792)
So the worst case in a sample of 1000 domains is 2^93 permutations to test!
Combine steps 1 and 2
- 1. Find all subsets of the set of words found
- 2. Determine if that subset is non-overlapping
becomes:
- 1. Find all subsets with non-overlapping words
- 2. Do nothing :-)
3.1 Find all subsets with non-overlapping words

Build a tree of subsets of non-overlapping words by sorting the words by their start index.

...and only return the "best" few cases anyway

It seems intuitive that sentences that match more of the domain are better. This is not infallible, but we can achieve a significant saving if we only consider sentences at least a given fraction of the length of the best match (95% in the wrapper below). In practice, this does not appear to have any impact on the results but prevents an explosion of sentences with particularly long domains.
A little more code...
In [147]:
def find_sentences(string, words, part_sentence, sentences, threshold=0.0,
                   current_idx=0, current_score=0, best_score=0):
    """
    Return sentences made of words that are common substrings of `string`.

    `words` MUST be ordered by start index or the results will be wrong!
    """
    current_threshold = int(best_score * threshold)
    if (current_idx >= len(string)
            or current_score + len(string) - current_idx < current_threshold):
        return sentences, best_score
    for i, (word, start_idx, end_idx) in enumerate(words):
        if current_idx > start_idx:
            continue
        new_score = current_score + len(word)
        best_score = max(best_score, new_score)
        new_part_sentence = part_sentence + [word]
        if new_score + len(string) - end_idx >= current_threshold:
            sentences.append((new_part_sentence, new_score))
            sentences, best_score = find_sentences(string=string,
                                                   words=words[i+1:],
                                                   part_sentence=new_part_sentence,
                                                   sentences=sentences,
                                                   threshold=threshold,
                                                   current_idx=end_idx,
                                                   current_score=new_score,
                                                   best_score=best_score)
    return sentences, best_score
Add a wrapper
In [148]:
def get_sentences(domain, thresh=0.95):
    words = set(find_words_in_string(domain, dictionary, longest_word))
    words = sorted(words, key=lambda x: (x[1], -x[2], x[0]))
    sentences, best_score = find_sentences(domain, words, [], [], thresh)
    return [sentence for sentence, score in sentences
            if score >= int(best_score * thresh)]
In [64]:
sentences = get_sentences('powerwasherchicago')
print(len(sentences))
choice(sentences, size=15)

245

Out[64]:
array([['pow', 'erw', 'as', 'her', 'chicago'],
       ['pow', 'erw', 'ashe', 'chica', 'go'],
       ['power', 'was', 'her', 'chica', 'go'],
       ['power', 'was', 'he', 'rch', 'cago'],
       ['power', 'was', 'her', 'chicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'ash', 'erc', 'hicago'],
       ['ower', 'wash', 'erc', 'hicago'],
       ['power', 'wash', 'erch', 'icago'],
       ['power', 'was', 'her', 'chi', 'cago'],
       ['power', 'was', 'her', 'chic', 'ago'],
       ['power', 'as', 'he', 'rch', 'ica', 'go'],
       ['ower', 'washer', 'chicago'],
       ['owe', 'rwas', 'he', 'rch', 'ica', 'go'],
       ['power', 'washer', 'chic', 'go']], dtype=object)
In [65]:
sentences = get_sentences('catholiccommentaryonsacredscripture')
print(len(sentences))
choice(sentences, size=15)

540428

Out[65]:
array([['cat', 'holi', 'ccom', 'me', 'nta', 'ryon', 'sacr', 'ed', 'scrip', 're'],
       ['catholic', 'co', 'mm', 'en', 'aryo', 'nsac', 'ed', 'scri', 'pture'],
       ['catholic', 'omm', 'enta', 'ryon', 'sacr', 'eds', 'crip', 'tur'],
       ['cathol', 'icc', 'ommen', 'tar', 'on', 'sacr', 'ed', 'script', 'ure'],
       ['at', 'holic', 'omme', 'ntary', 'ons', 'acred', 'scri', 'pture'],
       ['cathol', 'icc', 'omm', 'ntar', 'yons', 'creds', 'crip', 'ture'],
       ['cat', 'hol', 'icc', 'omm', 'entary', 'ons', 'acr', 'eds', 'cri', 'ptu', 're'],
       ['cath', 'lic', 'com', 'me', 'ntar', 'yon', 'sac', 're', 'dsc', 'ript', 'ure'],
       ['cathol', 'icco', 'mm', 'ntary', 'on', 'sac', 're', 'dsc', 'rip', 'ture'],
       ['catholic', 'co', 'mm', 'enta', 'ryon', 'sac', 're', 'dsc', 'rip', 'tur'],
       ['cat', 'holic', 'com', 'me', 'ntar', 'yon', 'sac', 'reds', 'cript', 're'],
       ['cat', 'holic', 'com', 'menta', 'ryon', 'acr', 'ed', 'cript', 'ure'],
       ['cat', 'oli', 'ccom', 'mentary', 'nsac', 'red', 'scri', 'pture'],
       ['cathol', 'icc', 'ommen', 'tary', 'on', 'sacr', 'ed', 'cri', 'ture'],
       ['cat', 'hol', 'ccom', 'me', 'ntar', 'on', 'sac', 'red', 'scripture']], dtype=object)
In [71]:
tail_sentences = domains.tail(1000).apply(get_sentences).apply(len)

In [155]:
tail_sentences.describe().apply(int).apply(num_fmt)

Out[155]:
count       1k
mean     1.18k
std      10.7k
min          1
25%         12
50%         39
75%        145
max       280k
Name: Domain, dtype: object
In [73]:
domains.tail(1000)[tail_sentences <= 1].values

Out[73]:
array(['cizerl', 'sahoko', 'pes-llc', 'mp3fil', 'wyzli', 'buypsa',
       'ylqhjt', 'sblgnt', 'axbet', 'eirnyc', 'wsl', 'kms88', 'paknic',
       'mrojp', 'irozho', 'bienve'], dtype=object)
In [74]:
domains.tail(1000)[tail_sentences > 10000].values

Out[74]:
array(['studentdebtreductioncenter', 'inspiredholisticwellness',
       'forensicaccountingexpert', 'medicalintuitivetraining',
       'lavidamassagesandyspringsga', 'thirdgenerationshootingsupply',
       'commercialrefrigerationrepairmiami', 'athenatrainingacademy',
       'business-leadership-qualities', 'casaquetzalsanmigueldeallende',
       'landscapedesignimagingsoftware', 'southcaliforniauniversity',
       'replacementtractorpartsforsale', 'reinventinghealthcareinfo',
       'shoppingforpowerinvertersnow', 'cambriaheightschristianacademy',
       'californiaconstructionjobs', 'margaritavilleislandhotel',
       'whatstoressellgarciniacambogia'], dtype=object)

In [75]:
[' '.join(sentence) for sentence in get_sentences('replacementtractorpartsforsale')[:10]]

Out[75]:
['replacement tractor parts forsale',
 'replacement tractor parts forsa',
 'replacement tractor parts forsa le',
 'replacement tractor parts fors ale',
 'replacement tractor parts fors al',
 'replacement tractor parts fors le',
 'replacement tractor parts for sale',
 'replacement tractor parts for sal',
 'replacement tractor parts for ale',
 'replacement tractor parts for al']
3.2 Decide how likely the domain is given a particular subset

A first approach would be to say that the probability P(d|s) decreases as each letter in the domain is omitted from the sentence. We could model this in an unnormalised way by counting the sentence length. To sort by this probability, we can therefore use the following:
In [77]:
def score_d_given_s(sentence, domain):
    domain_length = len(domain)
    sentence_length = sum(len(word) for word in sentence)
    return sentence_length / domain_length, 1.0 / (1 + len(sentence))
In [78]:
domain = 'powerwasherchicago'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[::-1][:15]

Out[78]:
[['power', 'washer', 'chicago'],
 ['pow', 'erw', 'asher', 'chicago'],
 ['powe', 'rwa', 'sher', 'chicago'],
 ['powe', 'rwas', 'her', 'chicago'],
 ['powe', 'rwas', 'herc', 'hicago'],
 ['power', 'was', 'her', 'chicago'],
 ['power', 'was', 'herc', 'hicago'],
 ['power', 'wash', 'erc', 'hicago'],
 ['power', 'wash', 'erch', 'icago'],
 ['power', 'washe', 'rch', 'icago'],
 ['power', 'washer', 'chi', 'cago'],
 ['power', 'washer', 'chic', 'ago'],
 ['power', 'washer', 'chica', 'go'],
 ['pow', 'erw', 'as', 'her', 'chicago'],
 ['pow', 'erw', 'as', 'herc', 'hicago']]
In [79]:
domain = 'catholiccommentaryonsacredscripture'
sentences = get_sentences(domain)
sorted(sentences, key=lambda s: score_d_given_s(s, domain))[:-15:-1]

Out[79]:
[['catholic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['catholic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['catholic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['catholic', 'commentar', 'yons', 'acred', 'scripture'],
 ['catholic', 'commentary', 'on', 'sacred', 'scripture'],
 ['catholic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commenta', 'ryons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yon', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentar', 'yons', 'acred', 'scripture'],
 ['cat', 'holic', 'commentary', 'on', 'sacred', 'scripture'],
 ['cat', 'holic', 'commentary', 'ons', 'acred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryon', 'sacred', 'scripture'],
 ['cath', 'olic', 'commenta', 'ryons', 'acred', 'scripture']]
Let's see the top guesses for a selection of domains:
In [105]:
import re

def flesh_out_sentence(sentence, domain):
    if sum(len(w) for w in sentence) == len(domain):
        return sentence
    full_sentence = []
    for word in sentence:
        start, end = re.search(re.escape(word), domain).span()
        if start > 0:
            full_sentence.append(domain[:start])
        full_sentence.append(word)
        domain = domain[end:]
    if len(domain) > 0:
        full_sentence.append(domain)
    return full_sentence
In [ ]:
def guess(d, n_guesses=25):
    guesses = []
    sentences = get_sentences(d)
    sentences = sorted(sentences, key=lambda s: score_d_given_s(s, d))[::-1]  # score against d, not a global domain
    i = 0
    for i, s in enumerate(sentences[:n_guesses]):
        s = flesh_out_sentence(s, d)
        guesses.append(' '.join(s))
    for _ in range(i + 1, n_guesses):
        guesses.append('')
    return pandas.Series(guesses)
In [238]:
subset = domains.iloc[len(domains)//200::len(domains)//100]
df = pandas.DataFrame(subset.apply(guess).values, index=(subset+'.com').values)
# df.to_csv(os.path.join(data_directory, 'predictions.csv'))
df = df.iloc[:10, :3]
df['correct'] = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0]  # index of the correct guess for the first 10 domains, or -1 if none
df[['correct'] + list(range(3))]

Out[238]:
                               correct  0                            1                            2
hedgefundupdate.com                  0  hedge fund update            hedge fundu pdate            he dge fund update
traveldailynews.com                  3  travel dailynews             tra vel dailynews            trav el dailynews
miriamkhalladi.com                  -1  miria mkh alladi             miriam khal ladi             mir iam khal ladi
poolheatpumpstore.com                0  pool heat pump store         pool heat pumps tore         poo lhe at pump store
blogorganization.com                 0  blog organization            blo go rganization           blo gor ganization
smallcapvoice.com                    2  smallcap voice               smal lcap voice              small cap voice
cefcorp.com                          0  cef corp                     c efc orp                    cef c orp
lightandmotionphotography.com        3  light andmotion photography  lightand motion photography  ligh tand motion photography
uggbootrepairs.com                   0  ugg boot repairs             ugg boo tre pairs            ugg boo trep airs
abundancesecrets.com                 0  abundance secrets            abun dance secrets           abund ance secrets
In [239]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn

correct = [0, 3, -1, 0, 0, 2, 0, 3, 0, 0, 4, 1, 0, 4, 0, 0, -1, 0, 0, -1,
           1, 8, 0, 0, 0, 0, 8, 0, -1, -1, 0, -1, 0, 3, 16, 0, 0, 0, 0, 0,
           2, 0, 0, 0, 0, 0, 0, 0, -1, -1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
           -1, 0, 0, -1, 0, 2, 4, 13, 0, -1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0,
           0, 3, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, -1]
In [240]:
pandas.Series(correct).hist(bins=range(-5, 25), normed=True, figsize=(12, 5))
plt.xlabel('correct guess no. or -1 if incorrect');
In a test of 100 samples, the first guess was correct 65 times, and one of the first 25 guesses was correct 87 times.
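Both numbers fall straight out of the `correct` list above (by the convention that -1 marks a domain with no correct guess):

top1 = sum(c == 0 for c in correct)   # 65: first guess correct
top25 = sum(c >= 0 for c in correct)  # 87: one of the 25 guesses correct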
Is this good enough?
Primary use case: given 500 domains in a market, what are the themes? Expect ~325 domains in theme clusters and ~175 distributed randomly. This will probably still require human sanity checks.
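As a sketch of how that theme question might be attacked with the resolved sentences (KMeans over a TF-IDF bag-of-words is an assumption for illustration, not something built on the hack day; `market_domains` is a hypothetical input):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def theme_clusters(resolved_sentences, n_themes=5):
    """Cluster domains by the words recovered from their names."""
    X = TfidfVectorizer().fit_transform(resolved_sentences)
    return KMeans(n_clusters=n_themes, random_state=0).fit_predict(X)

# e.g. theme_clusters([guess(d)[0] for d in market_domains])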
What can be done?
So far, we only consider the likelihood of a domain given a sentence. But how likely is the sentence? The next hack day's task is to develop a model for the sentence likelihood P(s).
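One plausible starting point (an assumption, not something implemented here) is a unigram model: score each sentence by the product of its word probabilities, which could be estimated from the same ngram match_count column used to build the dictionary. Working in logs avoids underflow:

def log_p_s(sentence, word_log_probs, oov_penalty=-20.0):
    """Unigram log-likelihood of a sentence; word_log_probs maps
    word -> log P(word), and unseen words get a fixed penalty."""
    return sum(word_log_probs.get(word, oov_penalty) for word in sentence)

# 'power washer chicago' then beats 'pow erw asher chicago', since common
# words carry far more probability mass than rare fragments.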
Determine the best sentence

From Bayes:

    P(s|d) = P(d|s) P(s) / P(d)

Since P(d) is the same for all sentences, it can be ignored when finding the argmax:

    argmax_s P(s|d) = argmax_s P(d|s) P(s)
What was done
- Trained a dictionary using google ngram viewer data
- Found word substrings in each domain
- Built sentences from words, with crude cuts applied
- Ordered predictions based on a crude score function
- Measured performance on 100 labelled domains
What I used
Inspiration: Peter Norvig's spell-correct (http://norvig.com/spell-correct.html)

Libraries: pandas, numpy, re, sklearn.feature_extraction.text.CountVectorizer

Functions:
- build_dictionary(corpus, min_df=0)
- find_words_in_string(string, dictionary, longest_word=None)
- find_sentences(string, words, part_sentence, sentences, threshold=0.0)
- get_sentences(domain, thresh=0.95)
- score_d_given_s(sentence, domain)
- guess(d, n_guesses=25)
After training, it can be used like this:
In [211]:
guess('powerwasherchicago')[0]

Out[211]:
'power washer chicago'
What still needs to be done for performance
- Performance needs to be tested against a larger labelled dataset, including robust train-develop-test splits
- Sentences need to be compared based on the likelihood of the sentence construction itself, i.e. P(s)
- Additional words need to be incorporated into the dictionary
- Threshold hyper-parameters need tuning
...and to make it usable
- Replace custom code with library functions where possible
- Extend remaining code to support array and dataframe inputs
- Make compatible with sklearn pipeline (see the sketch after this list)
- Improve .com, .co.uk etc. handling so it can be used on a wider set of domains
- Optimise substring search
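On the sklearn-pipeline point, a pipeline-compatible wrapper could be as small as the sketch below (the class name is made up, and it leans on the guess function above):

from sklearn.base import BaseEstimator, TransformerMixin

class DomainWordSplitter(BaseEstimator, TransformerMixin):
    """Turn domain names into space-separated sentences, so the output
    can feed straight into e.g. CountVectorizer."""

    def fit(self, X, y=None):
        return self  # stateless: the dictionary is built elsewhere

    def transform(self, X):
        return [guess(domain)[0] for domain in X]

# e.g. Pipeline([('split', DomainWordSplitter()), ('bow', CountVectorizer())])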
Think you can do better?
Get in touch: calvin.giles@gmail.com @calvingiles
In [122]:
import math

def num_fmt(num):
    i_offset = 12  # change this if you extend the symbols!!!
    prec = 3
    fmt = '.{p}g'.format(p=prec)
    symbols = [  # 'Y', 'Z', 'E', 'P',
               'T', 'G', 'M', 'k', '', 'm', 'u', 'n']
    try:
        e = math.log10(abs(num))
    except ValueError:
        return repr(num)
    if e >= i_offset + 3:
        return '{:{fmt}}'.format(num, fmt=fmt)
    for i, sym in enumerate(symbols):
        e_thresh = i_offset - 3 * i
        if e >= e_thresh:
            return '{:{fmt}}{sym}'.format(num/10.**e_thresh, fmt=fmt, sym=sym)
    return '{:{fmt}}'.format(num, fmt=fmt)