listening to big data

Listening to big data



  1. Listening to big data
     Or, philately will get you everywhere
     Mike Godfrey
     Software Architecture Group
     University of Waterloo

     Overview
     • Is clone analysis / empirical SE a Big Data problem?
       – … and should we care?
     • Looking hard for the Big Picture
       – And why sometimes that can be a bad idea
     • Let's go swimming with the data!
       – Some experiences and some advice

     "Big data"
     • Three Vs
       – Volume, Velocity, Variety
     • Why?
       – Enhanced decision making, insight discovery, and process optimization
     • Common problems:
       – Capture, curation, storage, search, sharing, transfer, analysis, and visualization

     (More data + simple algorithms) >> (complex algorithms)
     • Fantastic talk by Peter Norvig of Google: "The unreasonable effectiveness of data"
       http://www.youtube.com/watch?v=yvDCzhbjYWs
     • "Every time I fire a linguist, my scores get better."
       – [Fred Jelinek, paraphrased]
     • But does that work for clone detection / ESE too?
       – Should we all use N-gram algorithms?
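To make the "simple algorithms" question concrete, here is a minimal sketch of what an N-gram comparison of two code fragments could look like. It is an illustration only, not the method used in the talk or in any particular clone detector; the function names (`tokens`, `ngrams`, `similarity`), the crude regex tokenizer, and the choice of n = 3 are all assumptions made for this example.

```python
import re
from collections import Counter


def tokens(code):
    """Split source text into identifiers, numbers, and single punctuation marks.

    This is a deliberately crude tokenizer (an assumption for the sketch);
    it ignores language-specific rules such as strings and comments.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(seq, n=3):
    """Return a multiset (Counter) of all contiguous n-token windows."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def similarity(a, b, n=3):
    """Multiset Jaccard overlap of the two fragments' token n-grams (0.0 to 1.0)."""
    ga, gb = ngrams(tokens(a), n), ngrams(tokens(b), n)
    shared = sum((ga & gb).values())   # n-grams present in both fragments
    total = sum((ga | gb).values())    # n-grams present in either fragment
    return shared / total if total else 0.0


if __name__ == "__main__":
    print(similarity("x = a + b;", "x = a + b;"))
    print(similarity("x = a + b;", "y = a + b;"))
```

Note that this sketch treats code as a flat token stream and uses none of the syntactic structure that source code offers, which is exactly the trade-off the question on the slide is asking about.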

  2. Data quality

     (Big data + simple algorithms)?
     • NLP, for example, analyzes unstructured prose
       – Much variation: intent, word ordering, relationships, …
       – NLP often does some pre-processing, e.g., stemming
     • ESE examines development artifacts with lots of internal structure + external linkage, implicit and explicit
       – Source code text, including comments
       – Version control meta-data
       – Bug reports
       – …
     • When you have reliable structure, exploit it!
       – Yes?
       – So maybe big ESE data isn't really big data …

     Looking for the Big Picture
     "I used to think that the brain was the most wonderful organ in my body. Then I realized who was telling me this."
     (Emo Philips)

     A selective attention test
     http://www.youtube.com/watch?v=vJG698U2Mvo

     Trials and Errors: Why Science is Failing Us
     by Jonah Lehrer
     Wired Magazine, December 2011

  3. Tim Minchin
     http://www.upworthy.com/this-is-the-most-inspiring-yet-depressing-yet-hilarious-yet-horrifying-yet-heartwarming-grad-speech

     "Physics is the only real science. The rest are just stamp collecting."
     Ernest Rutherford (1871-1937)
     Father of atomic physics
     Nobel prize for … chemistry

  4. The "S" curve of successful growth
     [Figure: S-shaped curve; axes: size vs. time]

     Linux kernel: Growth of kernel src tree (# of files)
     [Figure: # of source code files (*.[ch]) over time, Jan 1993 to Apr 2001; series: development releases (1.1, 1.3, 2.1, 2.3) and stable releases (1.0, 1.2, 2.0, 2.2)]

     Linux kernel: Average / median .h file size
     [Figure: uncommented LOC over time, Jan 1993 to Apr 2001; series: average and median .h file size, for development and stable releases]

     Fit annotation on the slide: y = .21*x^2 + 252*x + 90,055 (r^2 = .997)
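The quadratic fit quoted on this slide page, y = .21*x^2 + 252*x + 90,055, can be evaluated directly to see how the fitted value and its slope change with x. The sketch below is only a worked evaluation of that formula. The extracted text does not say what x and y measure, so both are treated here as unitless, and the function names `fitted_size` and `growth_rate` are made up for this example.

```python
def fitted_size(x):
    """Evaluate the quadratic fit quoted on the slide: y = .21*x^2 + 252*x + 90,055."""
    return 0.21 * x ** 2 + 252 * x + 90_055


def growth_rate(x):
    """Derivative of the fit with respect to x: dy/dx = 0.42*x + 252."""
    return 0.42 * x + 252


if __name__ == "__main__":
    # Sample points are arbitrary; the slide does not state the units of x.
    for x in (0, 1000, 2000, 3000):
        print(x, round(fitted_size(x)), round(growth_rate(x)))
```

Because the slope 0.42*x + 252 increases with x, the fitted curve keeps steepening for positive x rather than levelling off.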

  5. "Cloning considered harmful" considered harmful
     "Number one in the stink parade is duplicated code. If you see the same code structure in more than one place, you can be sure that your program will be better if you find a way to unify them."
     "Bad Smells" [Beck/Fowler in Refactoring]

     Source code cloning
     1. Forking
       – Hardware variation, e.g., Linux SCSI drivers
       – Platform variation
       – Experimental variation
     2. Templating
       – Boilerplating
       – API / library protocols
       – Generalized programming idioms
       – Parameterized code
     3. Post-hoc customizing
       – Bug workarounds
       – Replicate + specialize

     Cloning harmfulness: Two open source case studies
     (Apache httpd 2.2.4 - 60 Tokens; Gnumeric 1.6.3 - 60 Tokens)

     Group        Pattern                   Apache           Gnumeric
                                            Good  Harmful    Good  Harmful
     Forking      Hardware variation           0        0       0        0
     Forking      Platform variation          10        0       0        0
     Forking      Experimental variation       4        0       0        0
     Templating   Boiler-plating               5        0       6        7
     Templating   API                          0        0       0        9
     Templating   Idioms                       0       12       1        1
     Templating   Parameterized code           5       12      10       34
     Customizing  Replicate + specialize      12        4      15       16
     Customizing  Bug workarounds              0        0       0        0
     Total                                    36       28      32       67

     What to do?
     • Swim with the data
     • Be the gorilla in the mist
     • Look for lumps under the carpet & ask "Why?"
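As a small worked example of looking at the numbers directly, the Apache and Gnumeric case-study counts on this slide page can be tallied to check the totals and to compare the share of clones judged "good" in each project. The counts are copied from the slide; the names `COUNTS`, `column_totals`, and `good_share` are assumptions introduced for this sketch.

```python
# Counts per (group, pattern): (Apache good, Apache harmful, Gnumeric good, Gnumeric harmful),
# copied from the case-study table on the slide.
COUNTS = {
    ("Forking", "Hardware variation"): (0, 0, 0, 0),
    ("Forking", "Platform variation"): (10, 0, 0, 0),
    ("Forking", "Experimental variation"): (4, 0, 0, 0),
    ("Templating", "Boiler-plating"): (5, 0, 6, 7),
    ("Templating", "API"): (0, 0, 0, 9),
    ("Templating", "Idioms"): (0, 12, 1, 1),
    ("Templating", "Parameterized code"): (5, 12, 10, 34),
    ("Customizing", "Replicate + specialize"): (12, 4, 15, 16),
    ("Customizing", "Bug workarounds"): (0, 0, 0, 0),
}


def column_totals(counts):
    """Sum each of the four columns across all patterns."""
    return tuple(sum(col) for col in zip(*counts.values()))


def good_share(good, harmful):
    """Fraction of classified clones that were judged good."""
    return good / (good + harmful)


if __name__ == "__main__":
    a_good, a_harm, g_good, g_harm = column_totals(COUNTS)
    print(a_good, a_harm, g_good, g_harm)            # 36 28 32 67, matching the Total row
    print(f"Apache:   {good_share(a_good, a_harm):.1%} good")
    print(f"Gnumeric: {good_share(g_good, g_harm):.1%} good")
```

The column sums reproduce the Total row of the table (36, 28, 32, 67), which gives 36 of 64 classified clones judged good for Apache and 32 of 99 for Gnumeric.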
