DATA ANALYSIS WITH
VECTOR FUNCTIONAL PROGRAMMING
A tour of the Q programming language
HISTORY OF VECTOR LANGUAGES
➤ Vectors (arrays), not scalars, are the principal data type
➤ Not a new idea (APL, 1965)
➤ Ok… maybe new compared to functional programming (λ-calculus, 1930s)
➤ Ken Iverson’s Iverson Notation
➤ Notation as a tool of thought
➤ Notation for people first, computers later
➤ Influenced: Mathematica, Matlab, R, Julia
➤ Descendants: I.N. → APL, J, A+, K, Q
The basic concepts
FUNCTION APPLICATION
➤ Monadic functions have a word name and take their argument on the right
abs -1
1
til 10
0 1 2 3 4 5 6 7 8 9
➤ Dyadic verbs appear between their arguments
1 + 2
3
9 mod 3
0
➤ Function application is a verb
abs @ -1
1
(-) . 1 2
-1
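The application forms on this slide can be checked against each other. A minimal sketch, assuming a standard q session; each line should evaluate to 1b because juxtaposition, bracket application, and the apply verbs (@ and .) are different spellings of the same call:

```q
// Monadic application: juxtaposition, brackets, and @ agree
(abs -3) ~ abs[-3]
(abs -3) ~ abs @ -3
// Dyadic application: infix, brackets, and . agree
(1 - 2) ~ (-)[1;2]
(1 - 2) ~ (-) . 1 2
```

Because @ and . are ordinary verbs, they can be modified by adverbs like any other verb.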
ATOMIC FUNCTIONS
➤ Primitive functions (and verbs) are atomic (apply to atoms)
➤ Evaluation is always right-to-left
➤ Typically read top-down (left-to-right)
5 * 10 + til 5
50 55 60 65 70
0 -1 -2 -3 -4
5 * (1; 2 3; (4; 5 6); 7 8; 9)
(5; 10 15; (20; 25 30); 35 40; 45)
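Right-to-left evaluation and atomic behaviour are easiest to see side by side. A small sketch, assuming a standard q session (the numbers are illustrative and not from the talk):

```q
2 * 3 + 4                 // 14, not 10: 3 + 4 is evaluated first
(2 * 3) + 4               // 10: parentheses force the other order
1 2 3 + 10 20 30          // 11 22 33: atoms are paired item by item
10 * (1; 2 3; (4; 5 6))   // the nesting is preserved: (10; 20 30; (40; 50 60))
```

There is no operator precedence to remember; the only rule is that everything to the right of a verb is its right argument.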
LIST VERBS
➤ List primitives (we have them too, they just use fewer characters):
2#til 10
0 1
-2#til 10
8 9
(til 4) , til 4
0 1 2 3 0 1 2 3
0 3 6 _ til 9
0 1 2
3 4 5
6 7 8
take (#) join (,) split (_)
MAPPING A LIST - FP 101
count each 0 3 6 _ til 9
3 3 3
0 3 6 _ til 9
0 1 2
3 4 5
6 7 8
3#0
0 0 0
3 3 3#'0 1 2
0 0 0
1 1 1
2 2 2
➤ If dyadic, combine with an adverb (a pairing operator)
➤ e.g., each-both (')
take (#) + each-both (') = take-each-both (#')
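The difference between a plain dyadic verb and the same verb modified by each-both shows up clearly with take. A short sketch, assuming a standard q session (the inputs are illustrative):

```q
count each ("ab"; "cde"; "")   // 2 3 0: a monadic function mapped with each
1 2 3 +' 10 20 30              // 11 22 33: same result as the atomic 1 2 3 + 10 20 30
2 3 #' "ab"                    // ("aa"; "bbb"): 2#"a" and 3#"b", paired item by item
2 3 # "ab"                     // ("aba"; "bab"): without the adverb, take reshapes to 2 rows of 3
```

Atomic verbs such as + already pair their arguments, so each-both only changes the result for verbs like take that treat a list argument as a whole.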
But Wait! There’s More!
ADVERBS
noun verbadverb noun
3 3 3 #' 0 1 2
FOLD AND SCAN ARE ADVERBS … MORE FP 101
➤ Fold (/) is an adverb; we call it over
0 +/ til 5
10
A plus reduction over 0 1 2 3 4
➤ Scan (\) returns the incremental values of over (left-to-right)
0 +\ til 5
0 1 3 6 10
Partial sums of 0 1 2 3 4
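Over and scan work with any dyadic verb, not only plus. A brief sketch, assuming a standard q session; sum and sums are the built-in keywords for the plus cases:

```q
(+/) 1 2 3 4                  // 10
(+\) 1 2 3 4                  // 1 3 6 10
(*/) 1 + til 5                // 120, the product 1*2*3*4*5
(*\) 1 + til 5                // 1 2 6 24 120, the running product
(0 +/ til 5) ~ sum til 5      // 1b
(0 +\ til 5) ~ sums til 5     // 1b
```

The left argument (0 in the slide's examples) is the initial accumulator; when it is omitted, the first list item plays that role.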
FLEXIBLE MAPPING WITH ADVERBS
➤ Only 6 adverbs, but they come up all the time
(floor;ceiling) @\: 5.5
5 6
max @/: 0 3 6 _ til 9
2 5 8
0 -': til 5
0 1 1 1 1
(min;max) @\:/: 0 3 6 _ til 9
0 2
3 5
6 8
each-left (\:) each-right (/:) each-prior (':) compose: each-left-each-right (\:/:)
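Join (,) makes the direction of each-left and each-right easy to see, because the result shows which argument was split into items. A small sketch, assuming a standard q session (inputs are illustrative):

```q
1 2 3 ,\: 10 20                        // (1 10 20; 2 10 20; 3 10 20): each left item with the whole right
1 2 3 ,/: 10 20                        // (1 2 3 10; 1 2 3 20): the whole left with each right item
(-':) 1 3 6 10                         // 1 2 3 4: each item minus the one before it
(deltas 1 3 6 10) ~ (-':) 1 3 6 10     // 1b: deltas is the keyword for this pattern
```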
Prime Numbers
THINKING IN ARRAYS - NO STINKING LOOPS*
function isPrime (n) {
  if (n < 2) return false;
  var q = Math.floor(Math.sqrt(n));
  for (var i = 2; i <= q; i++) {
    if (n % i == 0) {
      return false;
    }
  }
  return true;
}
* Steve Apter, nsl.com
THINKING IN ARRAYS
[Figure sequence, images not preserved: x mod y for 1 .. 100; x mod y = 0; y = 1 and y = x; primes]
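The array approach sketched on these slides can be retraced in a q session on a smaller range. This is a minimal sketch, assuming the range 1 .. 10 instead of 1 .. 100; the names n, m, and d are introduced here for illustration only:

```q
n : 1 + til 10        // candidates 1 .. 10
m : 0 = n mod/: n     // boolean table: m[j;i] is 1b when n[j] divides n[i]
d : sum m             // divisor count per candidate: 1 2 2 3 2 4 2 4 3 4
n where 2 = d         // 2 3 5 7: exactly two divisors, 1 and the number itself
```

No loop or early exit is written; the whole table of remainders is computed at once and the primes are selected from it.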
THE RESULT
➤ Extremely concise, 111 bytes
➤ 29 characters left for emojis when tweeting it!
p : {n where 2=sum 0=n mod/: n:1+til x}
rle : {(count;first)@\:/:(where not =':[x])_x}
expand : {(),/(#).'x}
Only short programs have any hope of being correct ~ Arthur Whitney
HOW CAN WE USE Q FOR DATA ANALYSIS?
➤ Q has dictionaries (associations) and tables (flipped dictionaries)
➤ Tables are first-class and columnar; operations on columns are fast and efficient
➤ It is actually the scripting language for kdb+
➤ Has an integrated SQL-like query language called q-sql
select avg price by sym from trades where date > .z.d - 5
➤ Has really nice temporal types, temporal arithmetic, and temporal joins
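The points about flipped dictionaries, q-sql, and temporal arithmetic can be tried on a tiny in-memory table. A minimal sketch, assuming a standard q session; the table t and its values are hypothetical stand-ins for the trades table in the query above:

```q
// A table is a flipped dictionary of equal-length columns
d : `sym`price ! (`a`b`a; 10 20 30f)
t : flip d
t ~ ([] sym:`a`b`a; price:10 20 30f)    // 1b
// q-sql over the in-memory table
select avg price by sym from t          // average price is 20 for both a and b
// Temporal arithmetic on month and date types
2008.02m + 1                            // 2008.03m
2013.03.01 - 2013.02.01                 // 28 (days)
```

Since each column is a vector, avg price runs as a single vector operation per group.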
STEP 1. GET SOME DATA
// System commands start with \
\wget .../pantheon.tsv
\wget .../pageviews_2008-2013.tsv -O pageviews.tsv
// ETL in Q
people : ("iSiSSSSSffsissssiffiiff"; enlist "\t") 0: `:pantheon.tsv;
pageviews : ("iSSiSisssss",72#"i"; enlist "\t") 0: `:pageviews.tsv;
Monthly page visit information for people on Wikipedia
We have a short fat table, want a long skinny table…
Each month is a single column
Callouts: file name; tab separated; column types
STEP 2. CLEAN THE DATA!
// All of the months
months : "M"$ssr[;"-";"."] each string 11_cols pageviews;
// Create a new table of the months flattened
monthly : ungroup 2!([] id : pageviews`id; lang : pageviews`lang; month : (count pageviews)#enlist months; clicks : flip pageviews c:11_cols pageviews)
// Left-Join click information with person information
clickinfo : monthly lj `id`lang xkey people;
id  name            occupation lang
307 Abraham Lincoln POLITICIAN am
307 Abraham Lincoln POLITICIAN an
307 Abraham Lincoln POLITICIAN ang
307 Abraham Lincoln POLITICIAN ar
307 Abraham Lincoln POLITICIAN arz
…

id  lang month   clicks
307 af   2008.02 5
307 af   2008.03 0
307 af   2008.04 5
307 af   2008.05 5
307 af   2008.06 1
…

Callouts: month values; long skinny table, 4 columns; left join
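The wide-to-long reshaping in this step can be tried on a two-row toy table. This is a sketch under stated assumptions: the table wide, its columns m1 and m2, and the result name tall are hypothetical, and the table is left unkeyed for brevity, whereas the script above keys it with 2! first:

```q
// Hypothetical wide table: one clicks column per month
wide : ([] id:307 308; lang:`af`en; m1:5 7; m2:0 9)
months : 2008.02 2008.03m
// One row per person, with the months and clicks held as nested lists, then flattened
tall : ungroup ([] id:wide`id; lang:wide`lang; month:(count wide)#enlist months; clicks:flip wide`m1`m2)
count tall        // 4: two months for each of the two input rows
tall`clicks       // 5 0 7 9
tall`id           // 307 307 308 308: atom columns are repeated per month
```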
STEP 3. ASK SOME QUESTIONS
select from clickinfo where occupation like "COMPUTER SCIENTIST"
STEP 4…CLEAN THE DATA… AGAIN…
// Page name for a given year, e.g. List_of_Google_Doodles_in_2010
file : {"List_of_Google_Doodles_in_",string `year$x};
// Download that page with the system wget
wget : {system "wget https://en.wikipedia.org/wiki/",file x};
// Parse one downloaded page into a dictionary of month -> celebrated names
process : {
  values : (string `January`February`March`April`May`June`July`August`September`October`November`December)!til 12;
  doc : read0 hsym `$file x;
  pars: where doc like\: "<p>*";
  celebrated : `$first @/:/: "\"" vs/:/: (@).' flip (d; where@/: not (d : "title=\"" vs/: doc pars) like\:\: "<p>*");
  headings : {[doc;x] first pos where (doc pos : x + neg til 10) like\: "<h3>*"}[doc] each pars;
  months : x + values first @/: "_" vs/: first @' "\"" vs/: ("id=\"" vs/: doc headings)@'1;
  : raze each celebrated group months;
  };
years : 2010.01 2011.01 2012.01 2013.01m;
wget each years;
results : raze process each years;
doodles : ungroup 1!flip `month`name!(key;value)@\:results;
month   name
2010.01 Django Reinhard
2010.01 Anton Chekhov
2010.02 2010 Winter Olympics
…
<p>On <b>Tuesday, July 6, 2010</b>, the birth
Kahlo">Frida Kahlo</a> was celebrated with a gold Google logo wrapped with vines, flowers, and a painting of herself in her painting styles.<sup id="cite_ref-18" class="reference"><a href="#cite_note-18">[18]</a></sup></p>
PARALLELIZATION IN Q
Same script as in Step 4, with each replaced by peach in the results line:
results : raze process peach years;
Done!
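Swapping each for peach does not change the result, only how the work is scheduled. A minimal sketch, assuming a standard q session; sq and data are hypothetical stand-ins for process and years:

```q
sq : {sum x * x}                       // stand-in for a per-item workload
data : (1 2 3; 4 5; 6 7 8 9)
(sq each data) ~ sq peach data         // 1b, with or without secondary threads
// Number of secondary threads available to peach
system "s"
```

As far as the q documentation describes it, peach only runs in parallel when the process was started with secondary threads (the -s command-line option); otherwise it behaves like each.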
STEP 5. ASK SOME MORE QUESTIONS!
// Annotate the doodled months in the main table
clickinfo: update doodle:(date,'name) in doodles from clickinfo;
// Get the average and median ratio between the max monthly clicks (with and without
// the doodled month) and the min monthly clicks; exclude 0-click months
(avg;med) @\: {
  exec (%) . (max clicks where doodle; max clicks where not doodle) - min clicks
    from flip x where not clicks = 0
  } each select clicks, doodle by name from clickinfo where name in doodles`name
58.34461 10.30895
Average: 58x
Median: 10x

name                 |
Winsor McCay         | 508.5705
Albert Szent-Györgyi | 404.2465
Nicolas Steno        | 360.9331
Gideon Sundback      | 340.303
Mary Leakey          | 337.1806
Dennis Gabor         | 274.4389
Grace Hopper         | 220.8074
…and summary
WHY SHOULD YOU CARE?
➤ High-level expressive notation
➤ Not just someone’s pet project
➤ Developed by Kx Systems (since 1993)
➤ Practical (dicts, tables, q-sql, temporals, etc…)
➤ Very fast
➤ Memory is getting larger, vector operations are getting faster (SIMD, SSE, AVX2, AVX512, …)
➤ …benchmarks available online
➤ It’s interesting, different, and will change how you think
l:{(3=not[x]*n)or(or). 3 4=\:x*n:2{flip+':[x]+1_x,0b}/x}
➤ Some references:
➤ Two books:
➤ Q Tips - Nick Psaris
➤ Q for Mortals - Jeff Borror
➤ code.kx.com
➤ kx.com
➤ /software-download.php
➤ /community.php
➤ Notation as a Tool of Thought - K. Iverson’s Turing Award Paper
@timthornton6