Presented by Karan Kurani and Jason Marcell (some slides adapted from presentation on 12th November)




SLIDE 1

Presented by - Karan Kurani and Jason Marcell (Some slides adapted from presentation on 12th November)

SLIDE 2

Jason Karan Kiyan Bistra Theo

SLIDE 3

• Goal
• Datasets
• Software Engineering
• Latent Dirichlet Allocation
• Methodology
• Results
• Future Work

SLIDE 4

• Find people who are doing computational sustainability (CompSust) work but are not aware of it, or whom we do not yet know about.
• Techniques:
  • Citation network analysis (not implemented yet)
  • Similarity measure
  • Combination of both

SLIDE 5

• CS-based: DBLP, arnetminer.org, CiteSeerX.
• Multidisciplinary: BASE, BioOne, ChemXSeer, Crossref (for citations).
• Currently used:

SLIDE 6

• Revision Control
• Logging
• Unit Testing
• Object-Relational Mapping
• Integrated Development Environment

SLIDE 7

• DBLP stats:
  • Total docs: 1,632,441
  • With abstract text: 653,507
  • With references: 316,559
• Possible approaches included:
  • LSA, pLSA and LDA.
  • All of them assume a bag-of-words model.
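To make the bag-of-words assumption concrete, here is a minimal sketch in Python. The function name `bag_of_words`, the tokenization rule, and the example sentences are illustrative assumptions, not part of the project's code. The point it shows is that word order is discarded and only per-document word counts remain.

```python
import re
from collections import Counter

def bag_of_words(text, stop_words=frozenset()):
    """Return word counts for a document, ignoring word order.

    Hypothetical helper: lowercases the text, keeps alphabetic tokens,
    and drops any token listed in stop_words.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Two made-up abstracts with the same words in a different order.
doc_a = bag_of_words("energy models for wind energy", stop_words={"for"})
doc_b = bag_of_words("wind energy for energy models", stop_words={"for"})

# Under a bag-of-words model the two documents are indistinguishable.
print(doc_a == doc_b)
```

LSA, pLSA and LDA all start from counts like these, typically arranged as a document-by-term matrix.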

SLIDE 8

*From the review paper “Topic Models” by David M. Blei (Princeton University) and John D. Lafferty (Carnegie Mellon University).

SLIDE 9

• Social networks data (Airoldi et al., 2007)
• Images (Fei-Fei and Perona, 2005; Russell et al., 2006; Blei and Jordan, 2003; Barnard et al., 2003)
• Population genetics data (Pritchard et al., 2000)
• Survey data (Erosheva et al., 2007)

SLIDE 10

• DBLP data set
• CompSust keyword filter
• Stop-words filter
• Mahout LDA
• Extract corpus and seed-paper topic distributions
• Distance measures: squared Euclidean distance, cosine distance, symmetric KL-divergence distance
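The three distance measures named on this slide can be sketched as follows, for comparing a seed paper's topic distribution with those of corpus papers. This is an illustrative sketch, not the project's implementation: the function names, the smoothing constant `eps`, the choice of cosine distance as one minus cosine similarity, and the choice of symmetric KL as the sum (rather than the average) of the two directed divergences are all assumptions, since the slides do not specify them.

```python
import math

def squared_euclidean(p, q):
    """Sum of squared differences between two topic distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cosine_distance(p, q):
    """One minus cosine similarity (assumed definition)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def kl_divergence(p, q, eps=1e-12):
    """Directed KL(p || q); eps is an assumed guard against zero entries."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)

def symmetric_kl(p, q):
    """KL(p || q) + KL(q || p); some authors use half of this sum."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def rank_by_distance(seed, corpus, distance):
    """Return document ids ordered from closest to farthest from the seed."""
    return sorted(corpus, key=lambda doc_id: distance(seed, corpus[doc_id]))

# Made-up three-topic distributions, for illustration only.
seed = [0.6, 0.3, 0.1]
corpus = {"paper_a": [0.7, 0.2, 0.1], "paper_b": [0.1, 0.1, 0.8]}
for measure in (squared_euclidean, cosine_distance, symmetric_kl):
    print(measure.__name__, rank_by_distance(seed, corpus, measure))
```

Papers at the top of each ranking are the candidates most similar in topic mix to the seed paper; the measures can disagree on ordering, which is one reason to compare several.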

SLIDE 11
SLIDE 12

• The evolving result set can be browsed on the web: http://www.cs.cornell.edu/~kiyan/compsust-sn/

SLIDE 13

• Noisy but encouraging (most of the results are recent, 2006-2010).
• Reasons:
  • Many false positives because of alternate uses of keywords.
  • Overfitting because of suboptimal parameters for LDA.

SLIDE 14

• Dynamic Topic Models
• Correlated Topic Models

SLIDE 15

• Add additional data sources.
• Customized web crawler.
• Incorporate network analysis (author-topic model, Link-LDA).

SLIDE 16