CS 215: Data Interpretation and Analysis, Fall 2017



slide-1
SLIDE 1

Fall 2017 Instructors: Ajit Rajwade & Suyash Awate

CS 215: Data Interpretation and Analysis

slide-2
SLIDE 2

Where all do you analyze and interpret data?

(1) In Medicine: Examples

  • Pathology reports,
  • Epidemiology studies

https://ethnomed.org/clinical/tuberculosis/firland/epidemiology-of-tb

slide-3
SLIDE 3

Where all do you analyze and interpret data?

(2) In Sports

  • Tournament data
  • Player data
  • Questions like: Which is the best team? Which is the best batsman? Which is the best batsman from a given age group?

http://i.dawn.com/primary/2015/02/54d32f884dfd0.jpg?r=1999182479

slide-4
SLIDE 4

Where all do you analyze and interpret data?

(3) In Economics and Finance:

  • Country-wise data

List by the International Monetary Fund (2014)

Rank  Country/Region    GDP (millions of US$)
      World
1     United States     17,418,925
2     China             10,380,380
3     Japan              4,616,335
4     Germany            3,859,547
5     United Kingdom     2,945,146
6     France             2,846,889
7     Brazil             2,353,025
8     Italy              2,147,952
9     India              2,049,501
10    Russia             1,857,461
11    Canada             1,788,717
12    Australia          1,444,189
13    South Korea        1,416,949
14    Spain              1,406,855
15    Mexico             1,282,725
16    Indonesia            888,648
17    Netherlands          866,354
18    Turkey               806,108
19    Saudi Arabia         752,459
20    Switzerland          712,050

Gross Domestic Product (GDP) is the broadest quantitative measure of a nation's total economic activity. More specifically, GDP represents the monetary value of all goods and services produced within a nation's geographic borders over a specified period of time.

http://www.investinganswers.com/financial-dictionary/economics/gross-domestic-product-gdp-1223

slide-5
SLIDE 5

Where all do you analyze and interpret data?

(3) In Economics and Finance:

  • Country-wise data

http://ihds.umd.edu/IHDS_files/02HDinIndia.pdf

slide-6
SLIDE 6

Where all do you analyze and interpret data?

(3) In Economics and Finance:

  • Region-wise data within a country

GDP of Indian states and union territories in 2014–15

  • over ₹14 lakh crore (US$220 billion)
  • ₹10 lakh crore (US$160 billion) to ₹14 lakh crore (US$220 billion)
  • ₹8 lakh crore (US$120 billion) to ₹10 lakh crore (US$160 billion)
  • ₹6 lakh crore (US$93 billion) to ₹8 lakh crore (US$120 billion)
  • ₹4 lakh crore (US$62 billion) to ₹6 lakh crore (US$93 billion)
  • ₹2 lakh crore (US$31 billion) to ₹4 lakh crore (US$62 billion)
  • ₹1 lakh crore (US$16 billion) to ₹2 lakh crore (US$31 billion)
  • ₹0.5 lakh crore (US$7.8 billion) to ₹1 lakh crore (US$16 billion)
  • ₹0.25 lakh crore (US$3.9 billion) to ₹0.50 lakh crore (US$7.8 billion)
  • less than ₹0.25 lakh crore (US$3.9 billion)

Source: Wikipedia article

slide-7
SLIDE 7

Where all do you analyze and interpret data?

(4) In many other fields:

  • Weather forecasting
  • Psephology
  • Stock markets
  • Industrial testing
  • Market research (e.g. in industry and storehouses)
slide-8
SLIDE 8

So what’s this course all about?

  • Sounds like everything under the sun!

http://www.clipartpanda.com/clipart_images/clipart-sun-rays-clipart-1587813

slide-9
SLIDE 9

What’s this course all about?

  • A beginning course on probability and statistics
  • A very useful base for future courses in machine learning, data mining, statistics, image processing and computer vision.

slide-10
SLIDE 10

What’s this course all about? Three sections

  • Data analysis: The process of gathering, displaying/visualizing and summarizing the data
  • Probability: The “chance” that something happens
  • Statistical inference: The science of drawing precise inferences from the data gathered, using tools from probability

slide-11
SLIDE 11

Example in Toxicology

  • Imagine I invent two new medicines (say) to reduce blood pressure (BP).
  • I test the two medicines on two groups of rats, A and B, respectively.
  • I will then periodically measure the BP of the rats in groups A and B.
  • And seek to determine which medicine is “better”.

slide-12
SLIDE 12

Example in Toxicology: Data Analysis

  • What should be the size of A and B?
  • How should I pick the members of A and B? Example: can A be all males and B be all females? Can A be all white rats and B be all black rats?
  • Once I acquire the BP measurements, how do I display them succinctly? How do I compute averages?

slide-13
SLIDE 13

Example in Toxicology: Data Interpretation (or Statistical Inference)

  • Let’s say the average BP of A was much lower than that of B after feeding the two drugs.
  • Does this mean the first medicine is more effective?
  • Or was this just a matter of chance? (Example: If I flip an unbiased coin 50 times, I could end up with 30 heads, just by chance!)
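How likely is "30 heads just by chance"? The sketch below computes the exact probability of seeing at least 30 heads in 50 tosses, assuming the tosses are independent and the coin is fair. It uses the binomial formula, which these slides have not introduced yet, and the helper name `prob_at_least` is made up for this illustration.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k heads in n independent tosses,
    where each toss lands heads with probability p (assumed model)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

p_30_of_50 = prob_at_least(30, 50)
print(f"P(at least 30 heads in 50 fair tosses) = {p_30_of_50:.3f}")
```

Under these assumptions the probability comes out at roughly one in ten, so such an outcome is unusual but far from impossible.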

slide-14
SLIDE 14

One more example

  • Suppose your friend performs 10,000 independent tosses of an unbiased coin.
  • (S)he reports 5200 heads.
  • Is (s)he serious or joking?
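One hedged way to approach this question is to measure how far 5200 lies from the expected number of heads, in units of the standard deviation of the head count. The sketch assumes independent tosses of a fair coin and uses the binomial mean and standard deviation, which appear only later in a course like this. The function name `heads_z_score` is hypothetical.

```python
from math import sqrt

def heads_z_score(heads: int, tosses: int, p: float = 0.5) -> float:
    """Distance of the observed head count from its expected value,
    in standard deviations, under the assumed binomial model."""
    expected = tosses * p                      # expected number of heads
    std_dev = sqrt(tosses * p * (1 - p))       # spread of the head count
    return (heads - expected) / std_dev

z = heads_z_score(5200, 10_000)
print(f"5200 heads is {z:.1f} standard deviations above the expected 5000")
```

Here the expected count is 5000 and the standard deviation is 50, so the reported value sits 4 standard deviations above what a fair coin would typically give.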

slide-15
SLIDE 15

Course Information

  • Instructors: Ajit Rajwade (first half) and Suyash Awate (second half)
  • Lecture venue: CDEEP EEG 401 (GG Building, 4th Floor); timings: Slot 10, Tue and Fri, 2:00 to 3:25 pm (i.e. post lunch, and strong coffee). The class will be broadcast live to IIT Goa.
  • Course webpage (for the first half): http://www.cse.iitb.ac.in/~ajitvr/CS215_Fall2017/

slide-16
SLIDE 16

Fall 2017 Instructor: Ajit Rajwade

Descriptive Statistics


slide-17
SLIDE 17

Topic Overview

  • Some important terminology
  • Methods of data representation: frequency tables, graphs, pie-charts, scatter-plots
  • Data mean, median, mode, quantiles
  • Chebyshev’s inequality
  • Correlation coefficient

slide-18
SLIDE 18

Terminology

  • Population: The collection of all elements which we wish to study. Example: data about the occurrence of tuberculosis all over the world.
  • In this case, “population” refers to the set of people in the entire world.
  • The population is often too large to examine/study.
  • So we study a subset of the population, called a sample.
  • In an experiment, we basically collect values for attributes of each member of the sample (each member is also called a sample point).
  • An example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test.
  • See http://www.who.int/tb/publications/global_report/en/ for more information.

slide-19
SLIDE 19

Terminology

  • Discrete data: Data whose values are restricted to a finite (or, more generally, countable) set. E.g.: letter grades at IITB, genders, marital status (single, married, divorced), income brackets in India for tax purposes.
  • Continuous data: Data whose values belong to an uncountably infinite set (e.g.: a person’s height, temperature of a place, speed of a car at a time instant).

slide-20
SLIDE 20

Methods of Data Representation/Visualization


slide-21
SLIDE 21

Frequency Tables

  • For discrete data having a relatively small number of values, one can use a frequency table.
  • Each row of the table lists the data value followed by the number of sample points with that value (the frequency of that value).
  • The values need not always be numeric!

Grade   Number of students
AA      100
AB
BB
BC
CC

The definition of an ideal course (per student perspective) at IITB ;-)
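As a minimal sketch of how a frequency table might be built from raw observations, the snippet below counts letter grades with Python's `collections.Counter`. The list `grades` is made-up data for illustration only and does not come from the course.

```python
from collections import Counter

# Hypothetical raw sample: one letter grade per student (made-up data).
grades = ["AA", "BB", "AB", "BB", "CC", "AA", "BB", "BC"]

# Each distinct value is mapped to the number of sample points having it.
frequency_table = Counter(grades)

print("Grade  Number of students")
for grade in ["AA", "AB", "BB", "BC", "CC"]:
    print(f"{grade:<7}{frequency_table[grade]}")
```

Note that the data values here are strings, which matches the point that the values need not be numeric.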

slide-22
SLIDE 22

Frequency Tables

  • The frequency table can be visualized using a line graph or a bar graph or a frequency polygon.

Grade   Number of students
AA      5
AB      10
BB      30
BC      35
CC      20

[Bar graph: X axis = Marks, Y axis = Number of students]

A bar graph plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a thick vertical bar!

slide-23
SLIDE 23

Grade   Number of students
AA      5
AB      10
BB      30
BC      35
CC      20

[Line diagram: X axis = Marks, Y axis = Number of students]

A line diagram plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a vertical line!

slide-24
SLIDE 24

[Frequency polygon: X axis = Marks, Y axis = Number of students]

Grade   Number of students
AA      5
AB      10
BB      30
BC      35
CC      20

A frequency polygon plots the frequency of each data value on the Y axis, and connects consecutive plotted points by means of a line.

slide-25
SLIDE 25

Relative frequency tables

  • Sometimes the actual frequencies are not important.
  • We may be interested only in the percentage or fraction of those frequencies for each data value, i.e. the relative frequencies.

Grade   Fraction of number of students
AA      0.05
AB      0.10
BB      0.30
BC      0.35
CC      0.20
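The relative frequencies in this table can be obtained by dividing each frequency by the total number of sample points. The short sketch below does this for the grade counts used on the earlier slides (5, 10, 30, 35, 20); the variable names are chosen for illustration.

```python
# Grade frequencies from the earlier frequency table.
frequencies = {"AA": 5, "AB": 10, "BB": 30, "BC": 35, "CC": 20}

total = sum(frequencies.values())  # total number of sample points

# Relative frequency = frequency / total, so the values sum to 1.
relative_frequencies = {grade: count / total for grade, count in frequencies.items()}

for grade, fraction in relative_frequencies.items():
    print(f"{grade:<7}{fraction:.2f}")
```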

slide-26
SLIDE 26

Pie charts

  • For a small number of distinct data values which are non-numerical, one can use a pie-chart (it can also be used for numerical values).
  • It consists of a circle divided into sectors corresponding to each data value.
  • The area of each sector, as a fraction of the circle’s area, equals the relative frequency for that data value.

Population of native English speakers: https://en.wikipedia.org/wiki/Pie_chart
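Since a sector's area is proportional to its central angle, each sector of a pie chart can be drawn with an angle of 360 degrees times the relative frequency. The sketch below illustrates this using the relative grade frequencies from the previous slide; it only computes the angles and does not draw anything.

```python
# Relative frequencies from the relative frequency table for grades.
relative_frequencies = {"AA": 0.05, "AB": 0.10, "BB": 0.30, "BC": 0.35, "CC": 0.20}

# Central angle (in degrees) of the sector for each data value.
sector_angles = {grade: 360.0 * f for grade, f in relative_frequencies.items()}

for grade, angle in sector_angles.items():
    print(f"{grade:<7}{angle:6.1f} degrees")
```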

slide-27
SLIDE 27

Pie charts can be confusing

A big no-no with too many categories.

http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html

slide-28
SLIDE 28

Dealing with continuous data

  • Many a time the data can take continuous values (e.g.: temperature of a place at a time instant, speed of a car at a given time instant, weight or height of an animal, etc.)
  • In such cases, the data values are divided into intervals called bins.
  • The frequency now refers to the number of sample points falling into each bin.
  • The bins are often taken to be of equal length, though that is not strictly necessary.

slide-29
SLIDE 29

Dealing with continuous data

  • Let the sample points be {x_i}, 1 ≤ i ≤ N.
  • Let there be some K (K << N) bins, where the jth bin has interval [a_j, b_j).
  • Thus the frequency f_j for the jth bin is defined as follows:

$$f_j = \left|\{x_i : a_j \le x_i < b_j,\ 1 \le i \le N\}\right|$$

  • Such frequency tables are also called histograms, and they can also be used to store relative frequency instead of frequency.
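The definition of the bin frequency translates almost directly into code. The following is a minimal sketch that counts, for each half-open bin [a_j, b_j), the sample points falling inside it. The sample `heights_cm` and the bin edges are made up for illustration.

```python
def histogram(samples, bins):
    """For each bin (a, b), count the samples x with a <= x < b."""
    return [sum(1 for x in samples if a <= x < b) for (a, b) in bins]

# Hypothetical continuous data and equal-length bins (made-up values).
heights_cm = [150.2, 161.0, 158.7, 170.0, 172.4, 165.5, 180.1]
bins = [(150, 160), (160, 170), (170, 180), (180, 190)]

frequencies = histogram(heights_cm, bins)
relative_frequencies = [f / len(heights_cm) for f in frequencies]

for (a, b), f in zip(bins, frequencies):
    print(f"[{a}, {b}): {f}")
```

Because the bins are half-open, a value such as 170.0 is counted in [170, 180) and not in [160, 170).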

slide-30
SLIDE 30

Example of a histogram: in image processing

  • A grayscale image is a 2D array of size (say) H x W.
  • Each entry of this array is called a pixel and is indexed as (x,y), where x is the column index and y is the row index.
  • At each pixel, we have an intensity value which tells us how bright the pixel is (smaller values = darker shades, larger values = brighter shades).
  • Commonly, pixel values in grayscale photographic images are 8-bit (ranging from 0 to 255).
  • Histograms are widely used in image processing; in fact, a histogram is often used in image retrieval.

slide-31
SLIDE 31

Example: histogram of the well-known “barbara image”, using bins of length 10. This image has values from 0 to 255 and hence there are 26 bins.
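The count of 26 bins follows from dividing the 256 possible intensity levels by the bin length of 10 and rounding up; the last bin covers only the values 250 to 255. The sketch below shows this with equal-width binning by integer division. The pixel list is made up and merely stands in for real image data.

```python
from math import ceil

BIN_WIDTH = 10      # bin length used on the slide
NUM_LEVELS = 256    # 8-bit intensities: 0 to 255

num_bins = ceil(NUM_LEVELS / BIN_WIDTH)

def intensity_histogram(pixels, bin_width=BIN_WIDTH, num_levels=NUM_LEVELS):
    """Count pixel intensities in equal-width bins [0,10), [10,20), ..."""
    counts = [0] * ceil(num_levels / bin_width)
    for value in pixels:
        counts[value // bin_width] += 1   # index of the bin containing value
    return counts

# Hypothetical pixel intensities (made-up values, not the barbara image).
pixels = [0, 9, 10, 128, 249, 250, 255]
counts = intensity_histogram(pixels)
print(num_bins, counts)
```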


slide-32
SLIDE 32

Cumulative frequency plot

  • The cumulative (relative) frequency plot (also called an ogive) tells you the number (proportion) of sample points whose value is less than or equal to a given data value.

The cumulative frequency plot for the frequency plot on the previous slide!
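Cumulative frequencies are running sums of the per-bin frequencies, and dividing by the total gives the cumulative relative frequencies. A minimal sketch, using made-up bin counts:

```python
from itertools import accumulate

# Hypothetical frequencies of four consecutive bins (made-up values).
frequencies = [2, 2, 2, 1]

# Running sum: entry j is the number of sample points in bins 0..j.
cumulative = list(accumulate(frequencies))

total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

print(cumulative)
print(cumulative_relative)
```

The cumulative values never decrease, and the last cumulative relative frequency is always 1.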

slide-33
SLIDE 33

Digression: A curious looking histogram in image processing

  • Given the image I(x,y), let’s say we compute the x-gradient image in the following manner:

$$I_x(x,y) = I(x+1,y) - I(x,y), \quad \forall x, y,\ 1 \le x \le W,\ 1 \le y \le H$$

  • And we plot the histogram of the absolute values of the x-gradient image.
  • The next slide shows you how these histograms typically look! What do you observe?
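The following sketch computes the x-gradient of a tiny image and counts the absolute gradient values. It makes two assumptions that are not in the slides: the image is stored as a list of rows (so `image[y][x]`), and the last column is skipped because I(x+1, y) does not exist there. The pixel values are made up.

```python
from collections import Counter

def x_gradient(image):
    """Difference between each pixel and its right neighbour:
    I_x(x, y) = I(x+1, y) - I(x, y). The last column is dropped (assumption)."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in image]

# Hypothetical 3 x 4 grayscale image (made-up intensities).
image = [
    [10, 10, 12, 200],
    [10, 11, 11, 198],
    [9, 10, 12, 201],
]

gradient = x_gradient(image)

# Frequency of each absolute gradient value (a histogram with bins of length 1).
abs_gradient_histogram = Counter(abs(g) for row in gradient for g in row)
print(sorted(abs_gradient_histogram.items()))
```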

slide-34
SLIDE 34


slide-35
SLIDE 35


slide-36
SLIDE 36

Summarizing the Data


slide-37
SLIDE 37

Summarizing a sample-set

  • There are some values that can be considered “representative” of the entire sample-set. Such a value is called a “statistic”.
  • The most common statistic is the sample (arithmetic) mean:

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

  • It is basically what is commonly regarded as the “average value”.
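The sample mean formula amounts to summing the sample points and dividing by their number. A minimal sketch, with a made-up list of marks:

```python
def sample_mean(samples):
    """Arithmetic mean: (1/N) * sum of all sample points."""
    if not samples:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(samples) / len(samples)

# Hypothetical sample (made-up marks).
marks = [72, 85, 90, 64, 79]
print(sample_mean(marks))
```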

slide-38
SLIDE 38

Summarizing a sample-set

  • Another common statistic is the sample median, which is the “middle value”.
  • We sort the data array A from smallest to largest. If N is odd, then the median is the value at the (N+1)/2 position in the sorted array.
  • If N is even, the median can take any value in the interval (A[N/2],A[N/2+1]) – why?
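A sketch of the median computation described above. For even N it returns the midpoint of the two middle values, which is one common convention (an assumption here) for picking a single value from the interval. Python lists are 0-based, so the 1-based position (N+1)/2 becomes index N // 2.

```python
def sample_median(samples):
    """Median of a sample; for even N, the midpoint of the two middle values
    is returned (a common convention, assumed here)."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # 1-based position (N+1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2  # between A[N/2] and A[N/2+1]

print(sample_median([3, 1, 2]))
print(sample_median([4, 1, 3, 2]))
```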

slide-39
SLIDE 39

Properties of the mean and median

  • Suppose each sample point x_i were replaced by a x_i + b for some constants a and b.
  • What happens to the mean? What happens to the median?
  • Suppose each sample point x_i were replaced by its square.
  • What happens to the mean? What happens to the median?
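One way to explore these questions is to try them on a small sample. The snippet below is only a numerical check on made-up data, not a proof: it compares the mean and median before and after the transformation a x + b, and before and after squaring.

```python
from statistics import mean, median

# Hypothetical sample with both negative and positive values (made-up data).
samples = [-3, -1, 2, 4, 10]
a, b = 2, 5

transformed = [a * x + b for x in samples]
squared = [x * x for x in samples]

print("mean:  ", mean(samples), "->", mean(transformed), "(a*x+b),", mean(squared), "(squares)")
print("median:", median(samples), "->", median(transformed), "(a*x+b),", median(squared), "(squares)")
```

On this sample, both the mean and the median of the transformed data equal a times the original value plus b, whereas neither the mean nor the median of the squares equals the square of the original value.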

slide-40
SLIDE 40

Properties of the mean and median

  • Question: Consider a set of sample points x_1, x_2, …, x_N. For what value y is the sum total of the squared differences with every sample point the least? That is, what is:

$$\arg\min_{y} \sum_{i=1}^{N} (y - x_i)^2 \ ?$$

    Total squared deviation (or total squared loss). Answer: mean (proof done in class)

  • Question: For what value y is the sum total of the absolute differences with every sample point the least? That is, what is:

$$\arg\min_{y} \sum_{i=1}^{N} |y - x_i| \ ?$$

    Total absolute deviation (or total absolute loss). Answer: median (two proofs done in class – with and without calculus)
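The two answers can be checked numerically, which of course does not replace the proofs. The sketch below evaluates both losses on a grid of candidate values y and picks the minimizer by brute force. The sample, the grid spacing of 0.1 and the function names are choices made for this illustration.

```python
from statistics import mean, median

def total_squared_loss(y, samples):
    return sum((y - x) ** 2 for x in samples)

def total_absolute_loss(y, samples):
    return sum(abs(y - x) for x in samples)

samples = [1, 2, 3, 4, 20]

# Candidate values 0.0, 0.1, ..., 25.0 (a coarse grid, for illustration only).
candidates = [i / 10 for i in range(0, 251)]

best_squared = min(candidates, key=lambda y: total_squared_loss(y, samples))
best_absolute = min(candidates, key=lambda y: total_absolute_loss(y, samples))

print(best_squared, mean(samples))      # minimizer of squared loss vs. mean
print(best_absolute, median(samples))   # minimizer of absolute loss vs. median
```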

slide-41
SLIDE 41

Properties of the mean and median

  • The mean need not be a member of the original sample-set.
  • The median is always a member of the original sample-set if N is odd.
  • If N is even and the two middle values differ, the median is not unique, and any value strictly between the two middle values is not a member of the set.

slide-42
SLIDE 42

Properties of the mean and median

  • Consider a set of sample points x_1, x_2, …, x_N. Let us say that some of these values get grossly corrupted.
  • What happens to the mean?
  • What happens to the median?

slide-43
SLIDE 43

Example

  • Let A = {1,2,3,4,6}
  • Mean(A) = 3.2, median(A) = 3
  • Now consider A = {1,2,3,4,20}
  • Mean(A) = 6, median(A) = 3.
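The numbers in this example can be reproduced with Python's standard `statistics` module. The variable names below are chosen for illustration.

```python
from statistics import mean, median

A_original = [1, 2, 3, 4, 6]
A_corrupted = [1, 2, 3, 4, 20]   # the largest value has been corrupted

print(mean(A_original), median(A_original))
print(mean(A_corrupted), median(A_corrupted))
```

A single corrupted value shifts the mean from 3.2 to 6, while the median stays at 3.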