DEFINE RESEARCH QUESTIONS Over 11 million users on Stack Overflow - - PowerPoint PPT Presentation




SLIDE 1

An empirical study on negative answers at Stack Overflow

Ethan Wang & Sherry Zhu

▪ Motivation
▪ Research Questions
▪ Methodology
▪ Experiment Result
▪ Threats to Validity
▪ Future Work
▪ Conclusion

▪ Over 11 million users on Stack Overflow

▪ Many people experience Stack Overflow as a hostile place, especially newer coders, women, and people of color.

DEFINE RESEARCH QUESTIONS

▪ What is the distribution of positive / neutral / negative replies?

▪ What reasons lead a respondent to give a negative answer, and how are negative answers distributed across those reasons?

SLIDE 2

▪ Stack Exchange Data Explorer (2019-01-01 to 2019-10-31)

▪ Posts table schema
▪ Random sampling using NEWID()
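
The random-sampling step above could look roughly like the sketch below. It only builds the T-SQL text to paste into the Data Explorer. The helper name `build_sample_query`, the selected columns, the restriction to answers (`PostTypeId = 2`), and the `sample_size` parameter are assumptions for illustration; the slides do not give the actual query or the sample size.

```python
def build_sample_query(sample_size: int,
                       start: str = "2019-01-01",
                       end_exclusive: str = "2019-11-01") -> str:
    """Return a T-SQL query that draws a random sample of answers.

    ORDER BY NEWID() sorts rows by a freshly generated GUID, so the
    TOP rows of the result form a random sample of the matching posts.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return (
        f"SELECT TOP {int(sample_size)} Id, Body, CreationDate\n"
        "FROM Posts\n"
        "WHERE PostTypeId = 2  -- assumption: keep answers only\n"
        f"  AND CreationDate >= '{start}'\n"
        f"  AND CreationDate < '{end_exclusive}'\n"
        "ORDER BY NEWID()"
    )


if __name__ == "__main__":
    # Hypothetical sample size, chosen only for the demonstration.
    print(build_sample_query(1000))
```

Sorting by `NEWID()` is simple but scans every matching row, which is usually acceptable for a one-off export like this one.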

▪ Senti4SD Requirements:

  • 1. Input contains only plain text
  • 2. All text from a single answer must be on one line

▪ Text Sanitization Algorithm:

  • Replace all newlines and extra spaces
  • Remove all characters outside the visible ASCII range
  • Parse the text as HTML and remove all HTML tags
  • Remove all code blocks and links while parsing the HTML
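
A minimal sketch of the sanitization steps listed above, using only the Python standard library. It is not the authors' implementation: the names `AnswerTextExtractor` and `sanitize_answer` are hypothetical, the step order differs slightly (HTML is parsed first), and treating `<code>`, `<pre>`, and `<a>` content plus bare URLs as the "code blocks and links" to drop is an assumption.

```python
import re
from html.parser import HTMLParser


class AnswerTextExtractor(HTMLParser):
    """Collects visible text, skipping content inside code blocks and links."""

    SKIPPED_TAGS = {"code", "pre", "a"}  # assumption: what counts as code/link

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def sanitize_answer(html_body: str) -> str:
    """Turn one answer's HTML body into a single line of printable ASCII."""
    parser = AnswerTextExtractor()
    parser.feed(html_body)
    parser.close()
    text = parser.text()
    # Drop bare URLs that were not wrapped in an <a> tag.
    text = re.sub(r"https?://\S+", " ", text)
    # Anything outside printable ASCII (including newlines and tabs) becomes a space.
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    # Collapse runs of spaces so the answer fits on one clean line.
    return re.sub(r" {2,}", " ", text).strip()


if __name__ == "__main__":
    print(sanitize_answer("<p>Use <code>foo()</code>\n see <a href='x'>docs</a>  now.</p>"))
```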

SLIDE 3

▪ Divide data into smaller segments (5000 each)

▪ Filter all negative answers and randomly sample 200 records

▪ Use the online card sort tool "UsabiliTEST"

  • 1. Preparation
  • 2. Execution
  • 3. Interpretation
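
The segmentation and sampling steps above (segments of 5000, then 200 random negative answers) can be sketched as follows. The function names, the `(text, label)` record shape, the label string `"negative"`, and the fixed seed are assumptions made for illustration, not details from the slides.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def segments(rows: Sequence, size: int = 5000) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def sample_negative(labeled: Sequence[Tuple[str, str]],
                    k: int = 200,
                    seed: int = 0) -> List[str]:
    """Keep answers labeled 'negative' and draw up to k of them at random."""
    negatives = [text for text, label in labeled if label == "negative"]
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(negatives, min(k, len(negatives)))


if __name__ == "__main__":
    rows = [f"answer {i}" for i in range(12000)]
    print([len(s) for s in segments(rows)])
```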

▪ Regular Expression

  • Extract common patterns for each group

▪ K-means Clustering & Support Vector Machine

  • 1. Data Cleaning (lowercasing, punctuation removal)
  • 2. Stop Word Removal (the, he, she…)
  • 3. Text Vectorization (TF-IDF)
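
The three preprocessing steps above can be illustrated with a small dependency-free sketch. The stop-word set is an illustrative subset, the function names are hypothetical, and the weighting is the textbook `tf * log(N / df)` form; libraries such as scikit-learn use a smoothed variant, so the exact numbers would differ from whatever tooling the authors used.

```python
import math
import string
from collections import Counter

# Illustrative subset only; a real run would use a full stop-word list.
STOP_WORDS = {"the", "he", "she", "a", "an", "is", "it", "to", "of", "and"}


def clean(text):
    """Steps 1 and 2: lowercase, strip punctuation, drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def tfidf(documents):
    """Step 3: one sparse {term: weight} dict per document."""
    tokenized = [clean(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    for vec in tfidf(["The code is wrong.", "She said: the code works!"]):
        print(vec)
```

Terms that occur in every document (here `code`) get weight zero, which is why TF-IDF emphasizes the words that distinguish one answer from another.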

▪ SVM:

  • Model_1 classifies 'neutral' and 'negative'
  • Model_2 classifies 'negative' into multiple groups
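
The two-model arrangement above amounts to a cascade: the first model decides between 'neutral' and 'negative', and only negative answers reach the second model. The sketch below shows just that routing. The models are passed in as plain callables, and the keyword-based stand-ins are hypothetical placeholders for trained SVMs, not the authors' models.

```python
from typing import Callable

Classifier = Callable[[str], str]


def classify_answer(text: str, model_1: Classifier, model_2: Classifier) -> str:
    """Route one answer through the two-stage cascade."""
    if model_1(text) == "neutral":
        return "neutral"
    # Only answers that model_1 calls negative are split into finer groups.
    return model_2(text)


# Toy stand-ins so the sketch runs; real models would be trained classifiers.
def toy_model_1(text: str) -> str:
    return "negative" if "reproduce" in text or "unclear" in text else "neutral"


def toy_model_2(text: str) -> str:
    return "Irreproducible" if "reproduce" in text else "Vague"


if __name__ == "__main__":
    print(classify_answer("I cannot reproduce this.", toy_model_1, toy_model_2))
```

With a fitted scikit-learn pipeline, each callable could be a thin wrapper such as `lambda t: pipeline.predict([t])[0]`.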

▪ Five themes:

  • Neutral
  • Vague
  • Undetermined
  • Cold
  • Irreproducible
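
A regex-based classifier over themes like these might be organized as in the sketch below. The patterns are hypothetical placeholders: the slides do not list the real expressions or define each theme, so only two themes are shown and unmatched text returns `None`.

```python
import re
from typing import Optional

# Hypothetical placeholder patterns; the actual expressions are not in the slides.
THEME_PATTERNS = {
    "Irreproducible": [
        re.compile(r"\b(can(?:no|')t|could not|unable to) reproduce\b", re.IGNORECASE),
    ],
    "Vague": [
        re.compile(r"\b(not clear|unclear) what\b", re.IGNORECASE),
    ],
}


def match_theme(text: str) -> Optional[str]:
    """Return the first theme with a matching pattern, or None if none match."""
    for theme, patterns in THEME_PATTERNS.items():
        if any(pattern.search(text) for pattern in patterns):
            return theme
    return None


if __name__ == "__main__":
    print(match_theme("I can't reproduce this issue."))
```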

▪ K-means: Silhouette Coefficient = 0.002808
▪ Regex: Precision = 76.86%, Recall = 45.49%
▪ SVM: Precision = 86.89%, Recall = 48.38%
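
For reference, precision and recall for one class can be computed as in the sketch below. The function name and toy labels are illustrative, and how the figures above were aggregated across classes is not stated in the slides. The toy data reproduces the same qualitative pattern as the reported results: predictions that are usually right when made (high precision) but miss many true cases (low recall).

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    y_true = ["neg", "neg", "neg", "neu"]
    y_pred = ["neg", "neu", "neu", "neu"]
    print(precision_recall(y_true, y_pred, positive="neg"))
```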

SLIDE 4

[Figure panels: Regex, SVM]

THREATS TO VALIDITY

▪ Inherent subjectivity of the manual process
▪ Accuracy of the Senti4SD tool

FUTURE WORK

▪ Stack Overflow comments
▪ Interviews and surveys
▪ Positive feedback

SLIDE 5

  • 1. Hostility accounts for only a tiny portion of overall replies.
  • 2. The environment on Stack Overflow is satisfactory in general.