CS 215: Data Interpretation and Analysis, Fall 2017
Instructors: Ajit Rajwade & Suyash Awate
Where all do you analyze and interpret data?
(1) In Medicine: Examples
- Pathology reports,
- Epidemiology studies
https://ethnomed.org/clinical/tuberculosis/firland/epidemiology-of-tb
Where all do you analyze and interpret data?
(2) In Sports
- Tournament data
- Player data
- Questions like: Which is the best team? Which is the best batsman? Which is the best batsman from a given age group?
http://i.dawn.com/primary/2015/02/54d32f884dfd0.jpg?r=1999182479
Where all do you analyze and interpret data?
(3) In Economics and Finance:
- Country-wise data
List by the International Monetary Fund (2014)

Rank  Country/Region   GDP (millions of US$)
      World
1     United States    17,418,925
2     China            10,380,380
3     Japan             4,616,335
4     Germany           3,859,547
5     United Kingdom    2,945,146
6     France            2,846,889
7     Brazil            2,353,025
8     Italy             2,147,952
9     India             2,049,501
10    Russia            1,857,461
11    Canada            1,788,717
12    Australia         1,444,189
13    South Korea       1,416,949
14    Spain             1,406,855
15    Mexico            1,282,725
16    Indonesia           888,648
17    Netherlands         866,354
18    Turkey              806,108
19    Saudi Arabia        752,459
20    Switzerland         712,050

Gross Domestic Product (GDP) is the broadest quantitative measure of a nation's total economic activity. More specifically, GDP represents the monetary value of all goods and services produced within a nation's geographic borders over a specified period of time.
http://www.investinganswers.com/financial-dictionary/economics/gross-domestic-product-gdp-1223
Where all do you analyze and interpret data?
(3) In Economics and Finance:
- Country-wise data
http://ihds.umd.edu/IHDS_files/02HDinIndia.pdf
Where all do you analyze and interpret data?
(3) In Economics and Finance:
- Region-wise data within a country
GDP of Indian states and union territories in 2014–15
- over ₹14 lakh crore (US$220 billion)
- ₹10 lakh crore (US$160 billion) to ₹14 lakh crore (US$220 billion)
- ₹8 lakh crore (US$120 billion) to ₹10 lakh crore (US$160 billion)
- ₹6 lakh crore (US$93 billion) to ₹8 lakh crore (US$120 billion)
- ₹4 lakh crore (US$62 billion) to ₹6 lakh crore (US$93 billion)
- ₹2 lakh crore (US$31 billion) to ₹4 lakh crore (US$62 billion)
- ₹1 lakh crore (US$16 billion) to ₹2 lakh crore (US$31 billion)
- ₹0.5 lakh crore (US$7.8 billion) to ₹1 lakh crore (US$16 billion)
- ₹0.25 lakh crore (US$3.9 billion) to ₹0.50 lakh crore (US$7.8 billion)
- less than ₹0.25 lakh crore (US$3.9 billion)
Source: Wikipedia article
Where all do you analyze and interpret data?
(5) In many other fields:
- Weather forecasting
- Psephology
- Stock markets
- Industrial testing
- Market research (eg: in industry and storehouses)
So what’s this course all about?
Sounds like everything under the sun!
Image: http://www.clipartpanda.com/clipart_images/clipart-sun-rays-clipart-1587813
What’s this course all about?
- A beginning course on probability and statistics
- A very useful base for future courses in machine learning, data mining, statistics, image processing and computer vision
What’s this course all about? Three sections
- Data analysis: The process of gathering, displaying/visualizing and summarizing the data
- Probability: The “chance” that something happens
- Statistical inference: The science of drawing precise inferences from the gathered data using tools from probability
Example in Toxicology
- Imagine I invent two new medicines (say) to reduce blood pressure (BP).
- I test the two medicines on two groups of rats, A and B, respectively.
- I will then periodically measure the BP of the rats in groups A and B.
- And seek to determine which medicine is “better”.
Example in Toxicology: Data Analysis
- What should be the size of A and B?
- How should I pick the members of A and B? Example: can A be all males and B be all females? Can A be all white rats and B be all black rats?
- Once I acquire the BP measurements, how do I display them succinctly? How do I compute averages?
Example in Toxicology: Data Interpretation (or Statistical Inference)
- Let’s say the average BP of A was much lower than that of B after feeding the two drugs.
- Does this mean the first medicine is more effective?
- Or was this just a matter of chance? (Example: If I flip an unbiased coin 50 times, I could end up with 30 heads just by chance!)
One more example
- Suppose your friend performs 10,000 independent tosses of an unbiased coin.
- Your friend reports 5200 heads. Is (s)he serious or joking?
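The two coin examples can be compared numerically. The sketch below is only illustrative and previews material from the probability part of the course: it assumes the number of heads in n fair tosses follows a Binomial(n, 1/2) distribution, whose standard deviation is sqrt(n)/2. The helper names `binomial_tail` and `z_score` are made up for this example.

```python
from math import comb, sqrt

def binomial_tail(n: int, k: int) -> float:
    """P(at least k heads in n fair tosses), computed with exact integers."""
    term = comb(n, k)          # number of outcomes with exactly k heads
    total = 0
    for j in range(k, n + 1):
        total += term
        term = term * (n - j) // (j + 1)   # C(n, j+1) obtained from C(n, j)
    return total / 2**n        # all 2**n outcomes are equally likely

def z_score(n: int, k: int) -> float:
    """How many standard deviations k heads lies above the expected n/2."""
    return (k - n / 2) / (sqrt(n) / 2)

p_30_of_50 = binomial_tail(50, 30)            # 30 or more heads in 50 tosses
p_5200_of_10000 = binomial_tail(10_000, 5200) # 5200 or more heads in 10,000 tosses
z_5200 = z_score(10_000, 5200)

print(f"P(>= 30 heads in 50 tosses)        = {p_30_of_50:.4f}")
print(f"P(>= 5200 heads in 10,000 tosses)  = {p_5200_of_10000:.2e}")
print(f"5200 heads is {z_5200:.1f} standard deviations above 5000")
```

Both outcomes are 60% and 52% heads respectively, yet the first is quite ordinary while the second is extremely unlikely under the fair-coin assumption; the sample size matters.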
Course Information
- Instructors: Ajit Rajwade (first half) and Suyash Awate (second half)
- Lecture venue: CDEEP EEG 401 (GG Building, 4th Floor)
- Timings: Slot 10, Tue and Fri, 2:00 to 3:25 pm (i.e. post lunch, and strong coffee). The class will be broadcast live to IIT Goa.
Course webpage (for the first half):
http://www.cse.iitb.ac.in/~ajitvr/CS215_Fall2017/
Fall 2017 Instructor: Ajit Rajwade
Descriptive Statistics
Topic Overview
- Some important terminology
- Methods of data representation: frequency tables, graphs, pie-charts, scatter-plots
- Data mean, median, mode, quantiles
- Chebyshev’s inequality
- Correlation coefficient
Terminology
- Population: The collection of all elements which we wish to study. Example: data about the occurrence of tuberculosis all over the world.
- In this case, “population” refers to the set of people in the entire world.
- The population is often too large to examine/study.
- So we study a subset of the population, called a sample.
- In an experiment, we basically collect values for attributes of each member of the sample; each member is also called a sample point.
- An example of a relevant attribute in the tuberculosis study would be whether or not the patient yielded a positive result on the serum TB Gold test.
- See http://www.who.int/tb/publications/global_report/en/ for more information.
Terminology
- Discrete data: Data whose values are restricted to a finite set. E.g.: letter grades at IITB, genders, marital status (single, married, divorced), income brackets in India for tax purposes.
- Continuous data: Data whose values belong to an uncountably infinite set. E.g.: a person’s height, the temperature of a place, the speed of a car at a time instant.
Methods of Data Representation/Visualization
Frequency Tables
- For discrete data having a relatively small number of values, one can use a frequency table.
- Each row of the table lists the data value followed by the number of sample points with that value (the frequency of that value).
- The values need not always be numeric!

Grade   Number of students
AA      100
AB
BB
BC
CC

The definition of an ideal course (from a student's perspective) at IITB ;-)
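As a rough illustration of how such a table could be built from raw data, here is a minimal sketch using Python's `collections.Counter`. The list `grades` is a made-up sample of ten students, not data from the course.

```python
from collections import Counter

# Hypothetical sample: one letter grade per student.
grades = ["BB", "AA", "BC", "BB", "AB", "BB", "CC", "BC", "AA", "BB"]

# Counter maps each distinct value to the number of sample points with that value.
frequency = Counter(grades)

# Print the table in a fixed grade order; absent grades would show a count of 0.
print(f"{'Grade':<8}Number of students")
for grade in ["AA", "AB", "BB", "BC", "CC"]:
    print(f"{grade:<8}{frequency[grade]}")
```

The values being counted are strings here, which matches the remark that the data values need not be numeric.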
Frequency Tables
- The frequency table can be visualized using a line graph or a bar graph or a frequency polygon.

Grade   Number of students
AA      5
AB      10
BB      30
BC      35
CC      20

[Figure: bar graph; X axis: Marks, Y axis: Number of students]
A bar graph plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a thick vertical bar!
[Figure: line diagram of the same frequency table; X axis: Marks, Y axis: Number of students]
A line diagram plots the distinct data values on the X axis and their frequency on the Y axis by means of the height of a vertical line!
[Figure: frequency polygon of the same frequency table; X axis: Marks, Y axis: Number of students]
A frequency polygon plots the frequency of each data value on the Y axis, and connects consecutive plotted points by means of a line.
Relative frequency tables
- Sometimes the actual frequencies are not important.
- We may be interested only in the percentage or fraction of those frequencies for each data value, i.e. the relative frequencies.

Grade   Fraction of students
AA      0.05
AB      0.10
BB      0.30
BC      0.35
CC      0.20
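A relative frequency is just a frequency divided by the total number of sample points. The short sketch below applies this to the grade counts 5, 10, 30, 35, 20 used in the frequency-table slides; the variable names are chosen for this example only.

```python
# Frequencies from the grade table (100 students in total).
frequency = {"AA": 5, "AB": 10, "BB": 30, "BC": 35, "CC": 20}

total = sum(frequency.values())

# Divide every count by the total to get the fraction of students per grade.
relative_frequency = {grade: count / total for grade, count in frequency.items()}

for grade, fraction in relative_frequency.items():
    print(f"{grade:<8}{fraction:.2f}")
```

By construction the relative frequencies sum to 1, which is a handy sanity check.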
Pie charts
- For a small number of distinct data values which are non-numerical, one can use a pie-chart (it can also be used for numerical values).
- It consists of a circle divided into sectors, one for each data value.
- The area of each sector is proportional to the relative frequency of that data value.
Population of native English speakers: https://en.wikipedia.org/wiki/Pie_chart
Pie charts can be confusing
- A big no-no with too many categories.
http://stephenturbek.com/articles/2009/06/better-charts-from-simple-questions.html
Dealing with continuous data
- Many a time the data can take continuous values (e.g.: temperature of a place at a time instant, speed of a car at a given time instant, weight or height of an animal, etc.).
- In such cases, the range of data values is divided into intervals called bins.
- The frequency now refers to the number of sample points falling into each bin.
- The bins are often taken to be of equal length, though that is not strictly necessary.
Dealing with continuous data
- Let the sample points be {x_i}, 1 <= i <= N.
- Let there be some K (K << N) bins, where the j-th bin has interval [a_j, b_j).
- Thus the frequency f_j for the j-th bin is defined as follows:
  $f_j = |\{x_i : a_j \le x_i < b_j,\ 1 \le i \le N\}|$
- Such frequency tables are also called histograms, and they can also be used to store relative frequency instead of frequency.
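The bin-frequency definition translates almost directly into code. The following is a minimal sketch, assuming the bins are given by a sorted list of edges so that bin j is the half-open interval [edges[j], edges[j+1]); the function name `histogram` and the sample heights are made up for illustration.

```python
def histogram(samples, edges):
    """Return f_j = number of samples x with edges[j] <= x < edges[j+1]."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:   # half-open bin [a_j, b_j)
                counts[j] += 1
                break                          # each sample falls in at most one bin
    return counts

# Hypothetical continuous data: heights in centimetres.
heights_cm = [150.2, 161.0, 158.7, 172.4, 169.9, 180.1, 165.5, 170.0]
edges = [150, 160, 170, 180, 190]              # four bins of equal length 10

counts = histogram(heights_cm, edges)
relative = [c / len(heights_cm) for c in counts]

print(counts)     # frequency per bin
print(relative)   # relative frequency per bin
```

Note that the value 170.0 lands in the bin [170, 180) and not in [160, 170), because each bin includes its left endpoint and excludes its right endpoint.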
Example of a histogram: in image processing
- A grayscale image is a 2D array of size (say) H x W.
- Each entry of this array is called a pixel and is indexed as (x,y), where x is the column index and y is the row index.
- At each pixel, we have an intensity value which tells us how bright the pixel is (smaller values = darker shades, larger values = brighter shades).
- Commonly, pixel values in grayscale photographic images are 8-bit (ranging from 0 to 255).
- Histograms are widely used in image processing; in fact, a histogram is often used in image retrieval.
Example: histogram of the well-known “barbara image”, using bins of length 10. This image has values from 0 to 255 and hence there are 26 bins.
Cumulative frequency plot
- The cumulative (relative) frequency plot (also called an ogive) tells you the number (proportion) of sample points whose value is less than or equal to a given data value.
[Figure: the cumulative frequency plot for the histogram on the previous slide]
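Cumulative frequencies are running sums of the ordinary frequencies, taken in increasing order of the data values. A small sketch follows; the values and counts are hypothetical and not taken from the image histogram.

```python
from itertools import accumulate

# Hypothetical frequency table, with data values sorted in increasing order.
values = [50, 60, 70, 80, 90]
frequency = [20, 35, 30, 10, 5]

# cumulative[k] = number of sample points with value <= values[k]
cumulative = list(accumulate(frequency))

# Dividing by the total gives the cumulative relative frequency (a proportion).
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for v, c, r in zip(values, cumulative, cumulative_relative):
    print(f"value <= {v}: {c} sample points ({r:.2f} of the sample)")
```

The cumulative sequence never decreases, and its last entry equals the sample size (or 1 in the relative version).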
Digression: A curious looking histogram in image processing
- Given the image I(x,y), let’s say we compute the x-gradient image in the following manner:
  $I_x(x,y) = I(x+1,y) - I(x,y), \quad \forall x, y:\ 1 \le x \le W,\ 1 \le y \le H$
- And we plot the histogram of the absolute values of the x-gradient image.
- The next slide shows you how these histograms typically look! What do you observe?
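To make the computation concrete, here is a small sketch on a made-up 3 x 4 image stored as a list of rows. It assumes the forward difference I(x+1,y) - I(x,y) and simply skips the last column, where the right-hand neighbour does not exist; both the image and the bin length of 10 are illustrative choices.

```python
from collections import Counter

# Hypothetical tiny grayscale image: H = 3 rows, W = 4 columns, values in 0..255.
image = [
    [10, 10, 12, 200],
    [11, 10, 13, 198],
    [10, 11, 12, 201],
]

def abs_x_gradient(img):
    """Absolute forward difference between horizontally adjacent pixels."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in img
    ]

gradients = abs_x_gradient(image)

# Histogram with bins of length 10: bin index b covers [10*b, 10*b + 10).
bin_counts = Counter(value // 10 for row in gradients for value in row)

for b in sorted(bin_counts):
    print(f"[{10 * b}, {10 * b + 10}): {bin_counts[b]}")
```

In this toy image, neighbouring pixels mostly have similar intensities, so most absolute differences fall in the lowest bin; only the jump into the bright last column produces large values.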
Summarizing the Data
Summarizing a sample-set
- There are some values that can be considered “representative” of the entire sample-set. Such a value is called a “statistic”.
- The most common statistic is the sample (arithmetic) mean:
  $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
- It is basically what is commonly regarded as the “average value”.
Summarizing a sample-set
- Another common statistic is the sample median, which is the “middle value”.
- We sort the data array A from smallest to largest. If N is odd, then the median is the value at the (N+1)/2 position in the sorted array.
- If N is even, the median can take any value in the interval (A[N/2],A[N/2+1]) – why?
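The two statistics can be sketched in a few lines. For even N the code below returns the midpoint of the two middle elements; this is only one common convention (an assumption of this sketch), since any value between them is an acceptable median. The function names are illustrative.

```python
def sample_mean(values):
    """Arithmetic mean: sum of the sample points divided by their number."""
    return sum(values) / len(values)

def sample_median(values):
    """Middle value of the sorted sample (midpoint convention for even N)."""
    a = sorted(values)
    n = len(a)
    if n % 2 == 1:
        # 0-based index n // 2 is the (N+1)/2-th element in 1-based counting.
        return a[n // 2]
    # Even N: any value between the two middle elements works; take the midpoint.
    return (a[n // 2 - 1] + a[n // 2]) / 2

print(sample_mean([1, 2, 3, 4]))      # even N
print(sample_median([3, 1, 2]))       # odd N
print(sample_median([4, 1, 3, 2]))    # even N, midpoint of 2 and 3
```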
Properties of the mean and median
- Suppose each sample point x_i were replaced by a x_i + b for some constants a and b.
- What happens to the mean? What happens to the median?
- Suppose each sample point x_i were replaced by its square.
- What happens to the mean? What happens to the median?
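One way to explore these questions is to try them on a small sample before attempting a proof. The sketch below uses made-up numbers and Python's `statistics` module; it illustrates what happens in these particular cases and is not a substitute for the general argument.

```python
from statistics import mean, median

x = [1, 2, 3, 4, 20]          # hypothetical sample, N = 5
a, b = 3, -2                  # hypothetical constants

# Replace each x_i by a*x_i + b.
y = [a * xi + b for xi in x]
print(mean(y), a * mean(x) + b)        # both statistics follow the same map here
print(median(y), a * median(x) + b)

# Replace each x_i by its square.
squares = [xi ** 2 for xi in x]
print(mean(squares), mean(x) ** 2)     # mean of squares vs. square of the mean
print(median(squares), median(x) ** 2) # equal here because all x_i are non-negative

# With negative values, squaring no longer preserves the sorted order.
mixed = [-3, -1, 2]
mixed_squares = [xi ** 2 for xi in mixed]
print(median(mixed_squares), median(mixed) ** 2)
```

The last case shows why the answer for squaring depends on the signs of the sample points.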
Properties of the mean and median
- Question: Consider a set of sample points x_1, x_2, …, x_N. For what value y is the sum total of the squared differences with every sample point the least? That is, what is:
  $\arg\min_y \sum_{i=1}^{N} (x_i - y)^2$ ?
  This sum is the total squared deviation (or total squared loss).
  Answer: mean (proof done in class)
- Question: For what value y is the sum total of the absolute differences with every sample point the least? That is, what is:
  $\arg\min_y \sum_{i=1}^{N} |x_i - y|$ ?
  This sum is the total absolute deviation (or total absolute loss).
  Answer: median (two proofs done in class, with and without calculus)
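The two answers can be checked numerically on a small example. The sketch below does a brute-force search over a grid of candidate values y for a made-up sample; it is only a sanity check under those assumptions, not a replacement for the proofs done in class.

```python
from statistics import mean, median

x = [1, 2, 3, 4, 20]   # hypothetical sample, N = 5 (odd, so the median is unique)

def squared_loss(y):
    """Total squared deviation of the sample from y."""
    return sum((xi - y) ** 2 for xi in x)

def absolute_loss(y):
    """Total absolute deviation of the sample from y."""
    return sum(abs(xi - y) for xi in x)

# Candidate values 0.0, 0.1, ..., 25.0 (the grid spacing is an arbitrary choice).
candidates = [i / 10 for i in range(0, 251)]

best_squared = min(candidates, key=squared_loss)
best_absolute = min(candidates, key=absolute_loss)

print(best_squared, mean(x))      # minimizer of the squared loss vs. the mean
print(best_absolute, median(x))   # minimizer of the absolute loss vs. the median
```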
Properties of the mean and median
- The mean need not be a member of the original sample-set.
- The median is always a member of the original sample-set if N is odd.
- If N is even, the median is in general not unique, and a value chosen strictly between the two middle elements is not a member of the set.
Properties of the mean and median
- Consider a set of sample points x_1, x_2, …, x_N. Let us say that some of these values get grossly corrupted.
- What happens to the mean?
- What happens to the median?
Example
- Let A = {1,2,3,4,6}.
- Mean(A) = 3.2, median(A) = 3.
- Now consider A = {1,2,3,4,20}.
- Mean(A) = 6, median(A) = 3.
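The numbers in this example can be reproduced with Python's `statistics` module. This is just a quick check of the slide's arithmetic; the variable names are chosen for the sketch.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 6]
corrupted = [1, 2, 3, 4, 20]   # the last value has been grossly corrupted

print(mean(clean), median(clean))          # mean and median of the clean set
print(mean(corrupted), median(corrupted))  # the mean shifts, the median does not
```

A single corrupted value moves the mean from 3.2 to 6, while the median stays at 3.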