Learning from evolving streams
Online triage of bug reports
Grzegorz Chrupała
Spoken Language Systems, Saarland University
EACL 2012

Issue trackers
Used to track bugs or feature requests in software projects
May receive hundreds of reports per day
Reports need to be triaged: labeled and assigned to developers
Domain-specific challenges

Automate
Predict project subcomponent labels
Predict developers assigned to bugs

As social media
Issue trackers:
  ◮ very specialized social media
Practices (labeling, triage)
  ◮ Negotiated explicitly
  ◮ Emerging via imitation
  ◮ Influenced by automation

Concept drift
Practices evolve
Software projects mature
People involved come and go
For a learner, input and output change over time.

Contributions
Collect data from modern software projects
Analyze concept drift
Apply state-of-the-art online learning and improve on current approaches

Data
Alternate items assigned to dev and test
Dev set sizes:

Tracker     Output         # Items   # Labels
Chromium    Subcomponent    31,953         75
Chromium    Assigned        16,154        591
Android     Subcomponent       888         12
Android     Assigned           718         72
Firefox     Assigned        12,733        503
Launchpad   Assigned        18,634      1,970

Evolving class distribution
Chromium Subcomponent

Evolving class distribution
Launchpad Assigned

Progressive validation
For i = 1 to ∞
  ◮ Send input i to learner
  ◮ Receive prediction i and record error i
  ◮ Send true output i to learner
$\mathrm{Error}(n) = \sum_{i=1}^{n} \mathrm{error}(i)$

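The progressive validation loop can be sketched in a few lines of Python. This is only an illustration: the `MajorityLearner` class, the `zero_one` loss and the toy stream are hypothetical stand-ins, not the learners or data used in the talk.

```python
from collections import Counter

def progressive_validation(stream, learner, loss):
    """Return the cumulative error Error(n) after each item n of the stream."""
    cumulative, total = [], 0.0
    for x, y in stream:
        y_hat = learner.predict(x)   # 1. send the input, receive a prediction
        total += loss(y, y_hat)      # 2. record the error for this item
        cumulative.append(total)
        learner.update(x, y)         # 3. only now reveal the true output
    return cumulative

class MajorityLearner:
    """Hypothetical toy learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1

def zero_one(y, y_hat):
    """0/1 loss (an assumption; any per-item error measure fits here)."""
    return 0.0 if y == y_hat else 1.0

stream = [("r1", "ui"), ("r2", "ui"), ("r3", "net"), ("r4", "ui")]
errors = progressive_validation(stream, MajorityLearner(), zero_one)
print(errors)
```

Because every item is predicted before its label is revealed, each item serves first as test data and then as training data, so no separate held-out split is needed.
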
Evaluation of ranking
Triage assistant
  ◮ show user a ranked list of suggested targets
Mean reciprocal rank
$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}(i)^{-1}$

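A minimal sketch of the mean reciprocal rank computation, assuming each prediction is a list of labels ordered from best to worst. Treating a target that is missing from the list as contributing 0 is an assumption of this sketch, not something stated on the slide.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the true target over all items."""
    total = 0.0
    for ranking, target in zip(rankings, targets):
        if target in ranking:
            # Ranks are 1-based: the top suggestion has rank 1.
            total += 1.0 / (ranking.index(target) + 1)
        # Assumption: a target absent from the ranking contributes 0.
    return total / len(targets)

rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]
mrr = mean_reciprocal_rank(rankings, targets)
print(mrr)
```

Here the true target sits at ranks 1, 2 and 3, so the score is (1 + 1/2 + 1/3) / 3.
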
Features
Title unigram and bigram counts
Description unigram and bigram counts
Author ID
Year, month and day of submission

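One way these features might be extracted is sketched below. Lowercasing, whitespace tokenization and the feature-name prefixes are assumptions made for illustration; the slides do not say how the text was preprocessed.

```python
from collections import Counter
from datetime import date

def ngram_counts(text, prefix):
    """Unigram and bigram counts, namespaced by field (title or description)."""
    tokens = text.lower().split()   # assumption: lowercase, split on whitespace
    counts = Counter(f"{prefix}:1:{t}" for t in tokens)
    counts.update(f"{prefix}:2:{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return counts

def extract_features(title, description, author, submitted):
    """Sparse feature vector as a dict from feature name to value."""
    feats = Counter()
    feats.update(ngram_counts(title, "title"))
    feats.update(ngram_counts(description, "desc"))
    feats[f"author:{author}"] = 1
    # Year, month and day become separate indicator features.
    feats[f"year:{submitted.year}"] = 1
    feats[f"month:{submitted.month}"] = 1
    feats[f"day:{submitted.day}"] = 1
    return feats

feats = extract_features("Tab crash", "crash on new tab crash",
                         "u42", date(2011, 5, 3))
print(sorted(feats.items()))
```
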
Baselines
Window frequency
  ◮ Relative class frequencies in previous k ∈ {100, 1000} items
SVM minibatch
  ◮ Retrain every n = 100 steps on previous k = 1000 items
Perceptron
  ◮ Single pass, constant learning rate

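The window frequency baseline is simple enough to sketch directly: rank labels by how often they occurred among the previous k items. The class name and interface below are hypothetical, and the tiny window of k = 3 is chosen only to make the eviction visible.

```python
from collections import Counter, deque

class WindowFrequency:
    """Rank labels by their frequency in the last k items of the stream."""
    def __init__(self, k):
        self.window = deque(maxlen=k)
        self.counts = Counter()

    def rank(self):
        # Most frequent label first; labels outside the window are not ranked.
        return [label for label, _ in self.counts.most_common()]

    def update(self, y):
        if len(self.window) == self.window.maxlen:
            # The oldest label is about to fall out of the window.
            oldest = self.window[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.window.append(y)   # deque drops the oldest entry automatically
        self.counts[y] += 1

baseline = WindowFrequency(k=3)
for label in ["a", "a", "b", "c", "c"]:
    baseline.update(label)
ranking = baseline.rank()
print(ranking)
```

The baseline ignores the report text entirely, which makes it a useful reference point for how much of the performance is explained by the recent class distribution alone.
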
Bugzie
Tamrawi et al. 2011, Fuzzy set and cache-based approach for bug triaging.
Based on a fuzzy set membership function:
$\mu(y, X) = 1 - \prod_{x \in X} \left( 1 - \frac{n(y, x)}{n(y, \cdot) + n(\cdot, x) - n(y, x)} \right)$
Counts n(·, ·) updated incrementally
Feature cache: keep track of k most significant features

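The membership function can be computed from three incrementally updated count tables. The sketch below assumes that n(y, ·) counts reports with label y, n(·, x) counts reports containing feature x, and n(y, x) counts reports with both. It omits the feature cache, and the class name is hypothetical rather than taken from the Bugzie implementation.

```python
from collections import Counter

class FuzzySetScorer:
    """Incremental counts plus the fuzzy set membership score mu(y, X)."""
    def __init__(self):
        self.n_label = Counter()     # n(y, .)
        self.n_feature = Counter()   # n(., x)
        self.n_joint = Counter()     # n(y, x)

    def update(self, features, y):
        self.n_label[y] += 1
        for x in set(features):      # assumption: count each feature once per report
            self.n_feature[x] += 1
            self.n_joint[(y, x)] += 1

    def membership(self, y, features):
        product = 1.0
        for x in set(features):
            joint = self.n_joint[(y, x)]
            denom = self.n_label[y] + self.n_feature[x] - joint
            if denom > 0:            # skip pairs never seen, avoiding 0/0
                product *= 1.0 - joint / denom
        return 1.0 - product

scorer = FuzzySetScorer()
scorer.update(["crash", "tab"], "ui")
scorer.update(["crash"], "net")
mu_crash = scorer.membership("ui", ["crash"])
mu_tab = scorer.membership("ui", ["tab"])
mu_none = scorer.membership("net", ["tab"])
print(mu_crash, mu_tab, mu_none)
```

A feature that occurs only with one label drives the score for that label to 1, while a feature shared across labels contributes less.
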
Regression SGD
SGD with square loss as basic learner.
$w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla L(y^{(t)}, w^{(t)T} x^{(t)})$
$L(y, \hat{y}) = (y - \hat{y})^2$
Adaptive, per-feature learning rate (Duchi et al. 2010, Streeter and McMahan 2010)
Learning rate larger for infrequent features.

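A minimal sketch of SGD with square loss and a per-feature learning rate follows. It assumes an AdaGrad-style schedule, where each feature's rate is the base rate divided by the square root of that feature's accumulated squared gradients; the exact schedule and the base rate `eta` used in the talk are not given, so both are assumptions.

```python
import math

class AdaptiveSGDRegressor:
    """Sparse linear regression trained by SGD with per-feature learning rates."""
    def __init__(self, eta=0.5):
        self.eta = eta   # base learning rate (assumed value)
        self.w = {}      # weight per feature index
        self.g2 = {}     # accumulated squared gradient per feature

    def predict(self, x):
        return sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def update(self, x, y):
        err = self.predict(x) - y
        for i, v in x.items():
            g = 2.0 * err * v   # gradient of (y - y_hat)^2 with respect to w_i
            if g == 0.0:
                continue
            self.g2[i] = self.g2.get(i, 0.0) + g * g
            # Rarely updated features have a small g2, hence a larger step.
            step = self.eta / math.sqrt(self.g2[i])
            self.w[i] = self.w.get(i, 0.0) - step * g

model = AdaptiveSGDRegressor()
model.update({"a": 1.0}, 1.0)
first_weight = model.w["a"]
for _ in range(49):
    model.update({"a": 1.0}, 1.0)
final_prediction = model.predict({"a": 1.0})
print(first_weight, final_prediction)
```
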
Reduction from multiclass
One-versus-all reduction
$T(x, y) = \{ (x', I(y = y')) \mid y' \in Y,\ x'_{h(i, y')} = x_i \}$
h(i, y′) composes the index i with the label y′ by hashing.

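The reduction turns one multiclass example into one regression example per candidate label, with the label folded into every feature index by hashing. The sketch below uses CRC32 and a fixed number of buckets purely as an illustrative choice of h; the hash function and table size used in the talk are not specified.

```python
import zlib

def hashed_index(feature, label, num_buckets):
    """h(i, y'): combine a feature name and a label into one bucket index."""
    key = f"{label}\x00{feature}".encode("utf-8")
    return zlib.crc32(key) % num_buckets   # assumption: CRC32 as the hash

def one_vs_all(x, y, labels, num_buckets=2 ** 18):
    """T(x, y): one (x', target) pair per candidate label y'."""
    examples = []
    for candidate in labels:
        x_prime = {}
        for feature, value in x.items():
            j = hashed_index(feature, candidate, num_buckets)
            # Colliding features share a bucket, so their values add up.
            x_prime[j] = x_prime.get(j, 0.0) + value
        target = 1.0 if candidate == y else 0.0   # I(y = y')
        examples.append((x_prime, target))
    return examples

examples = one_vs_all({"crash": 1.0, "tab": 2.0}, "ui", ["ui", "net"])
print(examples)
```

At prediction time the same transformation would be applied for every candidate label, and the labels ranked by the regressor's output. With a fixed table size, a larger label set means more distinct (feature, label) pairs competing for the same buckets.
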
Summary of results (test)
[Figure: MRR (axis from 0.0 to 0.7) of Win, Svm, Perc, Bugz and Regr on CS, AS, CA, AA, FA and LA]

Chromium Subcomponent

Firefox Assigned
Data becomes more difficult around 9,000

Launchpad Assigned
Little concept drift
≈ 2000 labels: hashing collisions

Best improvement over Window
[Figure: improvement over the Window baseline (axis from 0.0 to 0.4) on ChS, AnS, ChA, AnA, FiA and LaA]

To conclude
Concept drift is a crucial concern
Modern online learner successfully tracks stream evolution
Data available at: www.lsv.uni-saarland.de/resources.htm
Ready to go beyond bag-of-words