SLIDE 1

Indian Statistical Institute, Kolkata at PR-SOCO 2016 : A Simple Linear Regression Based Approach

Kripabandhu Ghosh (1, 2), Swapan Kumar Parui (1)

(1) Indian Statistical Institute, Kolkata, India
(2) Indian Institute of Technology, Kanpur, India

  • K. Ghosh (Indian Statistical Institute, Kolkata, India, Indian Institute of Technology, Kanpur, India)

Indian Statistical Institute, Kolkata at PR-SOCO 2016 : A Simple Linear Regression Based Approach1 / 39

SLIDE 2

Objective

To predict the BIG5 personality traits of a person from her Java program code.

SLIDE 3

Programming

SLIDE 4

Programming and personality

SLIDE 7

Outline

1. BIG5 personality
2. Features
3. Methodology
4. Results
5. Analysis

SLIDE 8

BIG5 personality

BIG5 personality traits

SLIDE 11

BIG5 personality : Neuroticism

Motivation

Neurotic individuals exhibit low emotional stability and so are likely to be less methodical when writing code.

SLIDE 14

BIG5 personality : Extroversion

Motivation

Extroverts are likely to express themselves and possibly provide meaningful comments in their code.

SLIDE 16

Features

FEATURES

SLIDE 17

Features

Determining factors

  • Readability
  • Efficiency

SLIDE 18

Features : Multi-line comments (MLC)

  • The number of genuine comment words in multi-line comments, i.e., between /* and */, found in the program code.
  • Cases where lines of code were commented out are not counted.
  • Code lines are eliminated, e.g., using the regular expression [a-zA-Z][a-zA-Z]*[ ]*( which matches System.out.println("Even"); in Java code.
  • The feature value was normalized by dividing it by the total number of words in the program file.
  • Indicator of code readability and meticulousness of the coder.
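The MLC feature described above can be sketched in Python roughly as follows. This is only an illustration, not the authors' implementation: the function name mlc_feature, whitespace-based word counting, and dropping a whole comment line when it matches the code pattern are all assumptions.

```python
import re

# Pattern from the slide for spotting commented-out code, e.g. System.out.println(
# The "(" is escaped because Python's re module treats a bare "(" as a group opener.
CODE_LIKE = re.compile(r"[a-zA-Z][a-zA-Z]*[ ]*\(")
BLOCK_COMMENT = re.compile(r"/\*(.*?)\*/", re.DOTALL)


def mlc_feature(source: str) -> float:
    """Genuine multi-line comment words divided by all words in the file."""
    total_words = len(source.split())
    if total_words == 0:
        return 0.0
    genuine = 0
    for body in BLOCK_COMMENT.findall(source):
        for line in body.splitlines():
            if CODE_LIKE.search(line):
                continue  # assumption: a comment line that looks like code is dropped whole
            # Ignore the decorative asterisks used by Javadoc-style comments.
            genuine += len([w for w in line.split() if w.strip("*")])
    return genuine / total_words
```

A genuine comment that happens to contain a word followed by an opening parenthesis would also be dropped by this pattern, so the count is a conservative estimate.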

SLIDE 19

Features : Multi-line comments (MLC)

Feature  Positive example                           Negative example
MLC      /**                                        /*System.out.println("Even");
          * Make the hash table logically empty.    printQ(qEven);
          */                                        System.out.println("Odd");
                                                    printQ(qOdd);*/
SLC      // Create a new double-sized, empty table  //String[] ss = linea.readLine().split(" ");
NES      for (int i=1; i<=casos; i++)               for (int i = 1; i< = casos; i++)
IS       import java.io.FileNotFoundException       import java.io.*

SLIDE 20

Features : Single-line comments (SLC)

  • The number of genuine comment words in single-line comments, i.e., comments following "//".
  • Cases where lines of code were commented out are not counted.
  • Code lines are eliminated in the same way as for MLC.
  • The feature value was normalized by dividing it by the total number of words in the program file.
  • Indicator of code readability and meticulousness of the coder.
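A minimal Python sketch of the SLC feature, under the same assumptions as for MLC: the name slc_feature is hypothetical, words are split on whitespace, and a comment is treated as commented-out code when it matches the code pattern. A "//" inside a string literal (for example a URL) would be miscounted by this simple version.

```python
import re

# Same code-elimination pattern as for MLC; "(" is escaped for Python's re module.
CODE_LIKE = re.compile(r"[a-zA-Z][a-zA-Z]*[ ]*\(")


def slc_feature(source: str) -> float:
    """Genuine single-line comment words divided by all words in the file."""
    total_words = len(source.split())
    if total_words == 0:
        return 0.0
    genuine = 0
    for line in source.splitlines():
        _, marker, comment = line.partition("//")
        if not marker or CODE_LIKE.search(comment):
            continue  # no comment on this line, or it looks like commented-out code
        genuine += len(comment.split())
    return genuine / total_words
```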

SLIDE 22

Features : Non-existent spaces (NES)

  • The number of lines containing non-existent spaces, i.e., operators written without surrounding spaces: i=1; i<=casos; as opposed to i = 1; i< = casos;
  • Regular expression: [a-z][a-z]* [a-z][a-z]*[=<>+] (e.g., matching int i=1)
  • The feature value was normalized by dividing it by the total number of lines in the program file.
  • Indicator of code readability and meticulousness of the coder.
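The NES feature can be illustrated with the regular expression quoted on the slide. In this hedged sketch the function name nes_feature is hypothetical, and blank lines are assumed to count toward the total number of lines.

```python
import re

# Pattern from the slide: a lowercase word, one space, another lowercase word,
# then an operator with no space before it (e.g. "int i=1").
NES_PATTERN = re.compile(r"[a-z][a-z]* [a-z][a-z]*[=<>+]")


def nes_feature(source: str) -> float:
    """Lines with a missing space before an operator, divided by all lines."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if NES_PATTERN.search(line))
    return hits / len(lines)
```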

SLIDE 24

Features : Import Specific (IS)

  • The number of instances where the programmer imported only the specific libraries needed, e.g., import java.io.FileNotFoundException as opposed to import java.io.*
  • The feature value was normalized by dividing it by the total number of lines in the program file.
  • Indicator of code efficiency as well as the experience and prudence of the coder.
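One possible reading of the IS feature is sketched below: an import statement counts as specific when it does not end in a wildcard. The function name is_feature and this wildcard test are assumptions made for illustration.

```python
def is_feature(source: str) -> float:
    """Non-wildcard import statements divided by all lines in the file."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    specific = 0
    for line in lines:
        statement = line.strip().rstrip(";").strip()
        # Assumption: "specific" means the import names a class rather than package.*
        if statement.startswith("import ") and not statement.endswith(".*"):
            specific += 1
    return specific / len(lines)
```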

SLIDE 27

Methodology

METHODOLOGY

SLIDE 28

Method : Multiple Linear Regression

  • The four features are the explanatory variables.
  • Each of the five BIG5 traits is, in turn, the dependent variable.

SLIDE 29

Method : Multiple Linear Regression

For a program code p, the score for each BIG5 trait is given as follows:

\[
\mathrm{score}_{\mathrm{BIG5}}(p) = \alpha + \beta_1\,\mathrm{MLC}(p) + \beta_2\,\mathrm{SLC}(p) + \beta_3\,\mathrm{NES}(p) + \beta_4\,\mathrm{IS}(p) \tag{1}
\]

We calculate the values of α and βi, i = 1, 2, 3, 4, from the training data using the linear regression implementation in R (https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/).
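The slides use R for the fit. As a stand-in, the following pure-Python sketch shows how the coefficients of equation (1) could be estimated by ordinary least squares via the normal equations. The names fit_linear_regression and predict are hypothetical, and nothing here reproduces the authors' R code.

```python
def fit_linear_regression(rows, targets):
    """Return [alpha, beta_1, ..., beta_k] minimizing the squared error."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 for alpha
    n = len(design[0])
    # Normal equations (X^T X) w = X^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; coefficients are not unique")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(coefficients, row):
    """Evaluate alpha + sum(beta_i * feature_i) for one feature row."""
    alpha, betas = coefficients[0], coefficients[1:]
    return alpha + sum(b * x for b, x in zip(betas, row))
```

Each row would hold the four feature values (MLC, SLC, NES, IS) of one program file, and one model would be fitted per trait.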

SLIDE 31

Results

RESULTS

SLIDE 32

Results

1. Run1.txt: The values of the dependent variables were generated on the test data using the regression equation (1) learned from the training data.

2. Run2.txt: For this run, for each BIG Five trait, we calculated the values of the dependent variables given by the linear regression equation (1) on the training set. We then calculated the error and removed the files in the training set with the three highest error values. We then trained the linear regression on the new training set and calculated the coefficients. Finally, values of the dependent variables were calculated on the test data. The purpose of this run is to remove some outliers from the training set.
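The Run2 procedure (fit, drop the three training files with the highest error, refit) can be sketched generically. The slides do not say how the error was measured, so absolute error is assumed here. The function names are hypothetical, and the one-feature line fit is only a stand-in model to keep the example short.

```python
def refit_without_worst(rows, targets, fit, predict, drop=3):
    """Fit, remove the `drop` training items with the largest error, then refit."""
    model = fit(rows, targets)
    # Assumption: "error" means the absolute difference between prediction and target.
    errors = [abs(predict(model, r) - y) for r, y in zip(rows, targets)]
    ranked = sorted(range(len(rows)), key=lambda i: errors[i], reverse=True)
    worst = set(ranked[:drop])
    kept = [i for i in range(len(rows)) if i not in worst]
    refitted = fit([rows[i] for i in kept], [targets[i] for i in kept])
    return refitted, sorted(worst)


# Stand-in model for the demo only: one-feature least squares (slope, intercept).
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept
```

Any pair of fit and predict functions with the same shape can be passed in, including a four-feature regression.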

SLIDE 33

Results : RMSE

Method           NEUROTICISM  EXTROVERSION  OPENNESS  AGREEABLENESS  CONSCIENTIOUSNESS
Run1.txt         10.22        8.60          7.16      9.60           9.99
Run2.txt         10.04        10.17         7.36      9.55           10.16
Baseline (bow)   10.29        9.06          7.74      9.00           8.47
Baseline (mean)  10.26        9.06          7.57      9.04           8.54
Reported best    9.78         8.60          6.95      8.79           8.38

Table : Root Mean Squared Error (RMSE). The best result produced by our submitted runs when compared to all the submitted runs is shown in bold.
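For reference, RMSE over a set of predictions can be computed as in this short sketch. The function name rmse and the input checks are illustrative choices, not part of the task's evaluation scripts.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))
```

Lower values are better, which is why a run matching the "Reported best" row has the lowest error among all submissions for that trait.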

SLIDE 34

Results : PC

Method           NEUROTICISM  EXTROVERSION  OPENNESS  AGREEABLENESS  CONSCIENTIOUSNESS
Run1.txt         0.36         0.35          0.33      0.09           -0.20
Run2.txt         0.27         0.04          0.27      0.11           -0.13
Baseline (bow)   0.06         0.12          -0.17     0.20           0.17
Baseline (mean)  0.00         0.00          0.00      0.00           0.00
Reported best    0.36         0.47          0.62      0.38           0.33

Table : Pearson Product-Moment Correlation (PC). The best result produced by our submitted runs when compared to all the submitted runs is shown in bold.
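The Pearson product-moment correlation between predicted and true scores can be computed as sketched below. The function name pearson is hypothetical. Returning 0.0 for a constant sequence is an assumed convention, since the correlation is mathematically undefined in that case.

```python
import math


def pearson(predicted, actual):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(actual)
    mean_p, mean_a = sum(predicted) / n, sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0  # assumed convention: undefined correlation reported as zero
    return cov / math.sqrt(var_p * var_a)
```

Unlike RMSE, higher values are better here, and a negative value means the predictions move against the true scores.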

SLIDE 36

Analysis

ANALYSIS

SLIDE 37

Analysis

BIG5 Trait         α (Intercept)  β1 (MLC)  β2 (SLC)  β3 (NES)  β4 (IS)
Neuroticism        55.30          10.82     -331.58   -57.15    -282.14
Extroversion       39.58          50.49     261.44    67.38     163.28
Openness           46.63          46.07     98.92     28.20     49.48
Agreeableness      42.521         -1.103    78.905    90.909    196.740
Conscientiousness  -1.708         -1.708    225.988   -67.633   135.353

Table : The regression coefficients for Run1
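To make the coefficients concrete, this sketch evaluates equation (1) for Neuroticism using the Run1 values from the table. The feature values in the example are hypothetical and not taken from the dataset. The function name score is likewise an illustrative choice.

```python
# Run1 coefficients for Neuroticism, read from the coefficient table.
NEUROTICISM = {"alpha": 55.30, "MLC": 10.82, "SLC": -331.58, "NES": -57.15, "IS": -282.14}


def score(coefficients, features):
    """Evaluate equation (1) for one program's feature values."""
    return coefficients["alpha"] + sum(
        coefficients[name] * value for name, value in features.items()
    )


# Hypothetical normalized feature values for one program file.
example = {"MLC": 0.05, "SLC": 0.02, "NES": 0.10, "IS": 0.01}
predicted = score(NEUROTICISM, example)
```

With these made-up inputs the predicted score is roughly 40.67. Raising SLC or IS lowers it, because both coefficients are negative.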

SLIDE 38

Analysis : Insights

  • The high-magnitude negative value of β2 indicates that a person who frequently provides Single Line Comments (SLC) in her code is likely to exhibit low Neuroticism.
  • The high-magnitude negative value of β4 indicates that a person who tends to import libraries selectively is likely to have low Neuroticism.
  • The positive values of β1, β2 and β4 indicate that a person who tends to provide genuine comments (both Multi Line and Single Line) and import specific libraries in her code is likely to have high Extroversion.
  • The observations for Openness are almost identical to those for Extroversion.

SLIDE 39

Thank you!
