Learning from evolving streams: Online triage of bug reports



slide-1
SLIDE 1

Learning from evolving streams

Online triage of bug reports

Grzegorz Chrupała

Spoken Language Systems Saarland University

EACL 2012

G. Chrupała (Saarland Uni) Learning from streams EACL 2012 1 / 23

slide-2
SLIDE 2

Issue trackers

Used to track bugs or feature requests in software projects
May receive hundreds of reports per day
Reports need to be triaged: labeled and assigned to developers
Domain-specific challenges


slide-3
SLIDE 3

slide-4
SLIDE 4

Automate

Predict project subcomponent labels
Predict developers assigned to bugs


slide-5
SLIDE 5

As social media

Issue trackers:

◮ very specialized social media

Practices (labeling, triage)

◮ Negotiated explicitly
◮ Emerging via imitation
◮ Influenced by automation


slide-6
SLIDE 6

Concept drift

Practices evolve
Software projects mature
People involved come and go
For a learner, input and output change over time.


slide-7
SLIDE 7

Contributions

Collect data from modern software projects
Analyze concept drift
Apply state-of-the-art online learning and improve on current approaches


slide-8
SLIDE 8

Data

Alternate items assigned to dev and test
Dev set sizes:

Tracker    Output        # Items  # Labels
Chromium   Subcomponent   31,953        75
Chromium   Assigned       16,154       591
Android    Subcomponent      888        12
Android    Assigned          718        72
Firefox    Assigned       12,733       503
Launchpad  Assigned       18,634     1,970
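
The alternating dev/test assignment could be sketched as below. This is only an illustration: the slide does not say whether the first item goes to dev or to test, so sending it to dev is an assumption, and `alternate_split` is a hypothetical helper name.

```python
def alternate_split(items):
    """Assign items alternately to dev and test, preserving stream order.

    Assumption: the first item goes to dev (not stated in the slides).
    """
    dev = items[0::2]   # 1st, 3rd, 5th, ... item
    test = items[1::2]  # 2nd, 4th, 6th, ... item
    return dev, test


reports = ["r1", "r2", "r3", "r4", "r5"]
dev, test = alternate_split(reports)
```

Both halves keep the chronological order of the original stream, which matters when the learner is evaluated online.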


slide-9
SLIDE 9

Evolving class distribution

Chromium Subcomponent


slide-10
SLIDE 10

Evolving class distribution

Launchpad Assigned


slide-11
SLIDE 11

Progressive validation

For i = 1 to ∞

◮ Send input i to learner
◮ Receive prediction i and record error i
◮ Send true output i to learner

$\mathrm{Error}(n) = \sum_{i=1}^{n} \mathrm{error}(i)$
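
A minimal sketch of the progressive validation loop, assuming a learner object with `predict` and `update` methods and a 0/1 error per item. `MajorityLearner` is a toy stand-in invented for this example, not one of the learners evaluated in the talk.

```python
from collections import Counter


class MajorityLearner:
    """Toy learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing seen yet
        return self.counts.most_common(1)[0][0]

    def update(self, x, y):
        self.counts[y] += 1


def progressive_validation(learner, stream):
    """Return the cumulative errors Error(1), ..., Error(n) over the stream."""
    total = 0
    cumulative = []
    for x, y in stream:
        y_hat = learner.predict(x)  # send input i, receive prediction i
        total += int(y_hat != y)    # record error i (0/1 loss, an assumption)
        learner.update(x, y)        # send true output i to the learner
        cumulative.append(total)
    return cumulative


stream = [("a", "ui"), ("b", "ui"), ("c", "net"), ("d", "ui")]
errors = progressive_validation(MajorityLearner(), stream)
```

Every item is used first for testing and then for training, so no separate held-out set is needed within the stream.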


slide-12
SLIDE 12

Evaluation of ranking

Triage assistant

◮ show user a ranked list of suggested targets

Mean reciprocal rank

$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}(i)^{-1}$
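
As an illustration, MRR over a list of ranked suggestions might be computed as follows. The function names are hypothetical, and giving a score of 0 when the true label is missing from the ranking is an assumption, since the slide does not cover that case.

```python
def reciprocal_rank(ranked_labels, true_label):
    """Return 1 / rank of the true label (1-based), or 0.0 if it is absent."""
    for position, label in enumerate(ranked_labels, start=1):
        if label == true_label:
            return 1.0 / position
    return 0.0  # assumption: a missing label contributes 0


def mean_reciprocal_rank(rankings, true_labels):
    """Average the reciprocal ranks over all N items."""
    scores = [reciprocal_rank(r, y) for r, y in zip(rankings, true_labels)]
    return sum(scores) / len(scores)


rankings = [
    ["alice", "bob", "carol"],  # true label at rank 1
    ["bob", "alice", "carol"],  # true label at rank 2
    ["carol", "bob", "alice"],  # true label at rank 3
]
truth = ["alice", "alice", "alice"]
mrr = mean_reciprocal_rank(rankings, truth)  # (1 + 1/2 + 1/3) / 3
```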


slide-13
SLIDE 13

Features

Title unigram and bigram counts
Description unigram and bigram counts
Author ID
Year, month and day of submission
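
One way these features could be extracted is sketched below. Lowercasing, whitespace tokenization, and the feature-name prefixes (`title:`, `desc:`, `author:` and so on) are all assumptions made for the example; the slides do not specify them.

```python
from collections import Counter
from datetime import date


def ngram_counts(text, prefix):
    """Unigram and bigram counts (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    counts = Counter(f"{prefix}:{t}" for t in tokens)
    counts.update(f"{prefix}:{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return counts


def extract_features(title, description, author_id, submitted):
    """Build a sparse feature dictionary for one bug report."""
    features = ngram_counts(title, "title")
    features.update(ngram_counts(description, "desc"))
    features[f"author:{author_id}"] = 1
    features[f"year:{submitted.year}"] = 1
    features[f"month:{submitted.month}"] = 1
    features[f"day:{submitted.day}"] = 1
    return features


feats = extract_features(
    "Crash on startup",
    "Browser crash on startup crash",
    "u42",
    date(2011, 5, 3),
)
```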


slide-14
SLIDE 14

Baselines

Window frequency

◮ Relative class frequencies in previous k ∈ {100, 1000} items

SVM minibatch

◮ Retrain every n = 100 steps on previous k = 1000 items

Perceptron

◮ Single pass, constant learning rate
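
The window frequency baseline is simple enough to sketch directly. The class name and method names below are hypothetical, and the tie-breaking between equally frequent labels is whatever `Counter.most_common` happens to produce.

```python
from collections import Counter, deque


class WindowFrequency:
    """Rank labels by relative frequency among the previous k items."""

    def __init__(self, k=100):
        self.window = deque(maxlen=k)  # oldest label is dropped automatically

    def rank(self):
        """Return (label, relative frequency) pairs, most frequent first."""
        counts = Counter(self.window)
        n = len(self.window)
        return [(label, c / n) for label, c in counts.most_common()]

    def update(self, label):
        self.window.append(label)


baseline = WindowFrequency(k=3)
for label in ["ui", "ui", "net", "net"]:
    baseline.update(label)
ranking = baseline.rank()  # window now holds ["ui", "net", "net"]
```

This baseline ignores the input entirely, so any gain over it reflects what a learner extracts from the report itself.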


slide-15
SLIDE 15

Bugzie

Tamrawi et al. 2011, Fuzzy set and cache-based approach for bug triaging.
Based on a fuzzy set membership function:

$\mu(y, X) = 1 - \prod_{x \in X} \left( 1 - \frac{n(y, x)}{n(y, \cdot) + n(\cdot, x) - n(y, x)} \right)$

Counts n(·, ·) updated incrementally

Feature cache: keep track of k most significant features
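
A rough sketch of the membership score with incrementally updated counts follows. It assumes n(y, ·) counts reports with label y, n(·, x) counts reports containing feature x, and n(y, x) counts reports with both; it treats each report as a set of features and leaves out the feature cache. The class is an illustration, not the Bugzie implementation.

```python
from collections import Counter


class FuzzyMembership:
    """Fuzzy-set membership score over incrementally updated counts."""

    def __init__(self):
        self.n_label = Counter()    # n(y, .): reports with label y (assumed meaning)
        self.n_feature = Counter()  # n(., x): reports containing feature x
        self.n_joint = Counter()    # n(y, x): reports with label y containing x

    def update(self, label, features):
        self.n_label[label] += 1
        for x in set(features):
            self.n_feature[x] += 1
            self.n_joint[(label, x)] += 1

    def score(self, label, features):
        """mu(y, X) = 1 - prod over x in X of (1 - n(y,x) / union count)."""
        product = 1.0
        for x in set(features):
            joint = self.n_joint[(label, x)]
            union = self.n_label[label] + self.n_feature[x] - joint
            if union > 0:  # skip unseen label/feature pairs
                product *= 1.0 - joint / union
        return 1.0 - product


model = FuzzyMembership()
model.update("alice", {"crash", "gpu"})
model.update("bob", {"crash"})
score_alice = model.score("alice", {"crash"})
score_bob_gpu = model.score("bob", {"gpu"})
```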


slide-16
SLIDE 16

→ Regression SGD

SGD with square loss as basic learner.

$w^{(t+1)} = w^{(t)} - \eta^{(t)} \nabla L(y^{(t)}, w^{(t)\top} x^{(t)})$

$L(y, \hat{y}) = (y - \hat{y})^2$

Adaptive, per-feature learning rate (Duchi et al. 2010, Streeter and McMahan 2010)

Learning rate larger for infrequent features.
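
The update above with a per-feature rate might look roughly like this. The sketch scales a base rate `eta` by the inverse square root of each feature's accumulated squared gradients, which is one common form of per-feature adaptation and an assumption here; the exact schedule and the value `eta=0.5` are not taken from the slides.

```python
import math
from collections import defaultdict


class AdaptiveSGDRegressor:
    """Square-loss SGD over sparse features with per-feature learning rates."""

    def __init__(self, eta=0.5):
        self.eta = eta                 # base learning rate (assumed value)
        self.w = defaultdict(float)    # weight per feature
        self.g2 = defaultdict(float)   # sum of squared gradients per feature

    def predict(self, x):
        return sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def update(self, x, y):
        y_hat = self.predict(x)
        for i, v in x.items():
            g = 2.0 * (y_hat - y) * v  # gradient of (y - y_hat)^2 w.r.t. w_i
            if g == 0.0:
                continue
            self.g2[i] += g * g
            # Rarely seen features have a small g2, hence a larger step.
            self.w[i] -= self.eta / math.sqrt(self.g2[i]) * g


model = AdaptiveSGDRegressor()
for _ in range(50):
    model.update({"a": 1.0}, 1.0)
prediction = model.predict({"a": 1.0})
```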


slide-17
SLIDE 17

Reduction from multiclass

One-versus-all reduction

$T(x, y) = \{ (x', I(y = y')) \mid y' \in Y,\ x'_{h(i, y')} = x_i \}$

h(i, y′) composes the index i with the label y′ by hashing.
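
A small sketch of this reduction: each multiclass example becomes one binary regression example per candidate label, with feature indices hashed together with the label. The use of CRC32, the bucket count `NUM_BUCKETS`, and summing values on collision are assumptions for illustration; the slides only say that h composes index and label by hashing.

```python
import zlib

NUM_BUCKETS = 2 ** 18  # assumed size of the hashed weight vector


def h(i, label, num_buckets=NUM_BUCKETS):
    """Compose feature index i with label y' into a single hashed index."""
    key = f"{label}|{i}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def reduce_one_vs_all(x, y, labels):
    """T(x, y): one (x', target) pair per candidate label y' in Y."""
    examples = []
    for candidate in labels:
        x_prime = {}
        for i, v in x.items():
            j = h(i, candidate)
            x_prime[j] = x_prime.get(j, 0.0) + v  # colliding features add up
        target = 1.0 if candidate == y else 0.0   # I(y = y')
        examples.append((x_prime, target))
    return examples


examples = reduce_one_vs_all(
    {"title:crash": 1.0, "author:u42": 1.0}, "ui", ["ui", "net"]
)
```

With a fixed number of buckets, a large label set makes collisions between (feature, label) pairs more likely.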


slide-18
SLIDE 18

Summary of results (test)

[Figure: MRR (axis 0.0 to 0.7) per dataset (CS, AS, CA, AA, FA, LA) for each method (Win, Svm, Perc, Bugz, Regr)]


slide-19
SLIDE 19

Chromium Subcomponent


slide-20
SLIDE 20

Firefox Assigned

Data becomes more difficult around item 9,000


slide-21
SLIDE 21

Launchpad Assigned

Little concept drift
≈ 2000 labels: hashing collisions


slide-22
SLIDE 22

Best improvement over Window

[Figure: best improvement over Window (axis 0.0 to 0.4) per dataset (ChS, AnS, ChA, AnA, FiA, LaA)]


slide-23
SLIDE 23

To conclude

Concept drift is a crucial concern
Modern online learner successfully tracks stream evolution
Data available at: www.lsv.uni-saarland.de/resources.htm
Ready to go beyond bag-of-words
