Measurement and Data Data describes the real world Data maps - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Measurement and Data

slide-2
SLIDE 2

Data describes the real world

  • Data maps entities in the domain of interest to symbolic representation by means of a measurement procedure
  • Numerical relationships between variables capture relationships between objects

  • Measurement process is crucial
slide-3
SLIDE 3

Types of Measurement

  • Ordinal, e.g., excellent=5, very good=4, good=3…
  • Nominal, e.g., religion, profession
    – Need non-metric methods
  • Ratio, e.g., weight
    – Has concatenation property: two weights add to balance a third (2+3 = 5)
    – Changing scale (multiplying values by a constant) does not change ratios
  • Interval, e.g., temperature, calendar time
    – Unit of measurement is arbitrary, as well as origin
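The difference between ratio and interval scales can be checked numerically. The sketch below is only an illustration: the conversion helpers `kg_to_lb` and `c_to_f` and the sample values are assumptions chosen for the example, not part of the slides.

```python
def kg_to_lb(kg):
    """Change of unit only (ratio scale): multiply by a constant."""
    return kg * 2.20462


def c_to_f(celsius):
    """Change of unit AND origin (interval scale)."""
    return celsius * 9.0 / 5.0 + 32.0


# Ratio scale: the ratio of two weights survives a change of unit.
ratio_kg = 10.0 / 5.0
ratio_lb = kg_to_lb(10.0) / kg_to_lb(5.0)

# Interval scale: the ratio of two temperatures does not survive...
ratio_c = 20.0 / 10.0
ratio_f = c_to_f(20.0) / c_to_f(10.0)

# ...but the ratio of two temperature differences does.
diff_ratio_c = (30.0 - 20.0) / (20.0 - 10.0)
diff_ratio_f = (c_to_f(30.0) - c_to_f(20.0)) / (c_to_f(20.0) - c_to_f(10.0))
```

Here `ratio_kg` and `ratio_lb` agree, `ratio_c` and `ratio_f` do not, and the two difference ratios agree again.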

slide-4
SLIDE 4

Distance Measures

  • Many data mining techniques (e.g., nn-classification, cluster analysis) are based on similarity measures between objects
  • s(i,j): similarity, d(i,j): dissimilarity
  • Possible transformations: d(i,j) = 1 - s(i,j) or d(i,j) = sqrt(2*(1 - s(i,j)))
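As a minimal sketch of the two transformations, assuming similarities are scaled so that s = 1 means identical objects and s lies in [0, 1] (the function names are illustrative):

```python
import math


def dissim_linear(s):
    """d = 1 - s."""
    return 1.0 - s


def dissim_sqrt(s):
    """d = sqrt(2 * (1 - s))."""
    return math.sqrt(2.0 * (1.0 - s))
```

Both map s = 1 to d = 0. They differ in scale: at s = 0 the first gives 1 and the second gives sqrt(2).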

slide-5
SLIDE 5

Metric Properties

  • 1. d(i,j) ≥ 0, with d(i,j) = 0 if and only if i = j: Positivity
  • 2. d(i,j) = d(j,i): Commutativity (symmetry)
  • 3. d(i,j) ≤ d(i,k) + d(k,j): Triangle Inequality
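The three properties can be tested by brute force on a small dissimilarity matrix. This is a sketch under the assumption that the properties are meant in their non-strict form (non-negativity, symmetry, and d(i,j) no larger than d(i,k) + d(k,j)). The name `is_metric` and the tolerance are illustrative.

```python
from itertools import product


def is_metric(d, tol=1e-12):
    """Check the three listed properties on a square matrix d (list of lists)."""
    n = len(d)
    for i, j in product(range(n), repeat=2):
        if d[i][j] < -tol:                    # positivity
            return False
        if abs(d[i][j] - d[j][i]) > tol:      # commutativity
            return False
    for i, j, k in product(range(n), repeat=3):
        if d[i][j] > d[i][k] + d[k][j] + tol:  # triangle inequality
            return False
    return True
```

For example, `[[0, 1, 2], [1, 0, 1], [2, 1, 0]]` passes, while `[[0, 1, 5], [1, 0, 1], [5, 1, 0]]` fails the triangle inequality because 5 > 1 + 1.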

slide-6
SLIDE 6

Euclidean Distance between vectors

$$ d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2} $$
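A direct transcription of the Euclidean distance between two p-dimensional vectors, as a minimal sketch (the function name is illustrative):

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))
```

For instance, `euclidean([0, 0], [3, 4])` evaluates to 5.0.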

slide-7
SLIDE 7

Commensurability

  • Euclidean distance assumes variables are commensurate
  • E.g., each variable a measure of length
  • If one were weight and other was length there is no obvious choice of units
  • Altering units would change which variables are important
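The effect of changing units can be seen on a toy example. The three people below, described by (height, weight in kg), are hypothetical values invented for illustration. Only the height unit changes between the two runs.

```python
import math


def nearest(query, others):
    """Label of the point in `others` closest to `query` (Euclidean distance)."""
    return min(others, key=lambda label: math.dist(query, others[label]))


# Height in metres: weight differences dominate the distance.
a_m = (1.80, 70.0)
others_m = {"b": (1.60, 71.0), "c": (1.81, 75.0)}

# Same people, height in centimetres: height differences dominate.
a_cm = (180.0, 70.0)
others_cm = {"b": (160.0, 71.0), "c": (181.0, 75.0)}

nearest_m = nearest(a_m, others_m)
nearest_cm = nearest(a_cm, others_cm)
```

With heights in metres the nearest neighbour of `a` is `b`. With heights in centimetres it becomes `c`, although the data describe the same people.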

slide-8
SLIDE 8

Standardizing the Data

  • Divide each variable by its standard deviation
  • Standard deviation for the kth variable is

$$ \sigma_k = \left( \frac{1}{n} \sum_{i=1}^{n} (x_k(i) - \mu_k)^2 \right)^{1/2} $$

where

$$ \mu_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i) $$
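A minimal sketch of this standardization for a single variable, using the 1/n form of the standard deviation. The function name and the sample heights are assumptions for illustration.

```python
import math


def standardize(column):
    """Divide every value of one variable by that variable's standard deviation."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / n)
    # sigma is zero for a constant column; such a variable cannot be standardized.
    return [v / sigma for v in column]


heights_m = standardize([1.60, 1.70, 1.80])
heights_cm = standardize([160.0, 170.0, 180.0])
```

After standardization the two lists agree up to rounding, so the choice of unit no longer affects distances computed from them.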

slide-9
SLIDE 9

Weighted Euclidean Distance

  • If we know relative importance of variables

$$ d_{WE}(i, j) = \left( \sum_{k=1}^{p} w_k (x_k(i) - x_k(j))^2 \right)^{1/2} $$
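The weighted form only adds one multiplication per coordinate. A sketch, with an illustrative function name and non-negative weights assumed:

```python
import math


def weighted_euclidean(x, y, w):
    """Euclidean distance with a weight w[k] on the k-th squared difference."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return math.sqrt(sum(wk * (xk - yk) ** 2 for xk, yk, wk in zip(x, y, w)))
```

With all weights equal to 1 this reduces to the ordinary Euclidean distance. A weight of 0 removes a variable from the comparison entirely.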

slide-10
SLIDE 10

Need for Covariance in distance measure

  • Suppose we measured a cup’s height 100 times and diameter only once
  • Clearly height will dominate although 99 of the height measurements are not contributing anything
  • They are very highly correlated
  • To eliminate redundancy we need a data-driven method

slide-11
SLIDE 11

Sample Covariance between X and Y

  • Measure of how X and Y vary together
  • Large positive value if large values of X tend to be associated with large values of Y and small values of X with small values of Y
  • Large negative value if large values of X tend to be associated with small values of Y

$$ Cov(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y}) $$
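A sketch of the sample covariance with the 1/n normalization (the function name is illustrative):

```python
def covariance(xs, ys):
    """Mean product of the deviations of x and y from their sample means."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
```

For `xs = [1, 2, 3]`, the value is positive with `ys = [2, 4, 6]` and negative with `ys = [6, 4, 2]`, matching the two cases described in the bullets.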

slide-12
SLIDE 12

Correlation Coefficient

The value of the covariance depends on the ranges of X and Y. This dependence is removed by dividing values of X by their standard deviation and values of Y by their standard deviation.

$$ \rho(X, Y) = \frac{\frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\sigma_x \sigma_y} $$
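The sketch below computes the correlation coefficient as the covariance divided by both standard deviations, assuming the same 1/n normalization is used for all three quantities (the function name is illustrative).

```python
import math


def correlation(xs, ys):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    # sx or sy is zero for a constant variable; correlation is undefined then.
    return cov / (sx * sy)
```

Unlike the covariance, the result does not change when X is rescaled by a positive constant, for example from metres to centimetres.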

slide-13
SLIDE 13

Correlation Matrix

slide-14
SLIDE 14

Mahalanobis Distance

$$ d_M(i, j) = \left[ (x(i) - x(j))^T \Sigma^{-1} (x(i) - x(j)) \right]^{1/2} $$
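A small sketch of the Mahalanobis distance, restricted to two variables so that the covariance matrix can be inverted by hand. It assumes the distance is the square root of the quadratic form of the difference vector with the inverse covariance matrix. The function names and the 1/n covariance estimate are assumptions for illustration.

```python
import math


def covariance_matrix_2d(points):
    """2x2 sample covariance matrix (1/n normalization) of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return [[sxx, sxy], [sxy, syy]]


def mahalanobis_2d(a, b, cov):
    """sqrt of (a - b)^T cov^{-1} (a - b) for 2-dimensional points."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    dx, dy = a[0] - b[0], a[1] - b[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)
```

With the identity matrix as covariance the result equals the Euclidean distance. A large variance in one variable shrinks that variable's contribution.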

slide-15
SLIDE 15

Generalizing Euclidean Distance

  • Minkowski or L_λ metric
  • λ = 2 gives the Euclidean metric

$$ \left( \sum_{k=1}^{p} |x_k(i) - x_k(j)|^{\lambda} \right)^{1/\lambda} $$

slide-16
SLIDE 16

Minkowski metric

  • λ = 1 is the Manhattan or city block metric

$$ \sum_{k=1}^{p} |x_k(i) - x_k(j)| $$

  • λ = infinity yields

$$ \max_k |x_k(i) - x_k(j)| $$
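The whole Minkowski family fits in one function. This is a sketch assuming λ ≥ 1 and absolute coordinate differences. The parameter name `lam` and the use of `math.inf` for the limiting case are illustrative choices.

```python
import math


def minkowski(x, y, lam):
    """Minkowski distance with exponent lam (lam >= 1, or math.inf)."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(lam):
        return max(diffs)                 # limiting case: largest difference
    return sum(d ** lam for d in diffs) ** (1.0 / lam)
```

For the vectors `[0, 0]` and `[3, 4]`, the distance is 7 for λ = 1 (city block), 5 for λ = 2 (Euclidean) and 4 for λ = infinity.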

slide-17
SLIDE 17

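Multivariate Binary Data

  • Most obvious measure is Hamming Distance normalized by number of bits; the corresponding similarity is the simple matching coefficient, where S_ab counts the positions in which the first vector has value a and the second has value b

$$ \frac{S_{11} + S_{00}}{S_{11} + S_{10} + S_{01} + S_{00}} $$

  • If we don’t care about irrelevant properties had by neither object we have Jaccard Coefficient

$$ \frac{S_{11}}{S_{11} + S_{10} + S_{01}} $$

  • Dice Coefficient extends this argument. If 00 matches are irrelevant then 10 and 01 mismatches should have half relevance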
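The three coefficients can be computed from the four match counts. In this sketch S_ab is taken to be the number of positions where the first vector has bit a and the second has bit b. The Dice formula 2·S11 / (2·S11 + S10 + S01) is the standard form of that coefficient and is stated here as an assumption, since the slide does not spell it out. All function names are illustrative.

```python
def match_counts(x, y):
    """Count positions by bit pair: keys '00', '01', '10', '11'."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for a, b in zip(x, y):
        counts[f"{a}{b}"] += 1
    return counts


def simple_matching(x, y):
    c = match_counts(x, y)
    return (c["11"] + c["00"]) / len(x)


def jaccard(x, y):
    c = match_counts(x, y)
    # Undefined (division by zero) when neither vector contains a 1.
    return c["11"] / (c["11"] + c["10"] + c["01"])


def dice(x, y):
    c = match_counts(x, y)
    return 2 * c["11"] / (2 * c["11"] + c["10"] + c["01"])
```

For `x = [1, 1, 0, 0, 0, 0]` and `y = [1, 0, 0, 0, 0, 1]` the counts are S11 = 1, S10 = 1, S01 = 1, S00 = 3, giving 4/6 for simple matching, 1/3 for Jaccard and 1/2 for Dice.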

slide-18
SLIDE 18

Some Similarity/Dissimilarity Measures for N-dim Binary Vectors


slide-19
SLIDE 19

Some Similarity/Dissimilarity Measures for N-dim Binary Vectors


slide-20
SLIDE 20

Weighted Dissimilarity Measures for Binary Vectors

  • Unequal importance to ‘0’ matches and ‘1’ matches
  • Multiply S00 by β ∈ [0,1]
  • Examples:

$$ D_{sm}(X, Y) = 1 - \frac{S_{11} + \beta \cdot S_{00}}{N} $$

$$ D_{rta}(X, Y) = \frac{2 (N - S_{11} - \beta \cdot S_{00})}{2N - S_{11} - \beta \cdot S_{00}} $$
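The sketch below assumes the two example measures read D_sm = 1 - (S11 + β·S00)/N and D_rta = 2(N - S11 - β·S00) / (2N - S11 - β·S00), where N is the vector length and S11, S00 count the positions in which both bits are 1 or both are 0. That reading of the formulas, and the function names, are assumptions.

```python
def _s11_s00(x, y):
    """Counts of 1-1 matches and 0-0 matches between two binary vectors."""
    s11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    s00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return s11, s00


def weighted_sm(x, y, beta):
    """Simple-matching dissimilarity with 0-0 matches down-weighted by beta."""
    n = len(x)
    s11, s00 = _s11_s00(x, y)
    return 1.0 - (s11 + beta * s00) / n


def weighted_rt(x, y, beta):
    """Rogers-Tanimoto-style dissimilarity with 0-0 matches weighted by beta."""
    n = len(x)
    s11, s00 = _s11_s00(x, y)
    return 2.0 * (n - s11 - beta * s00) / (2.0 * n - s11 - beta * s00)
```

Setting β = 1 treats both kinds of match equally. Setting β = 0 ignores 0-0 matches, so two vectors that agree mostly on absent properties look more dissimilar.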

slide-21
SLIDE 21

Transforming the Data

slide-22
SLIDE 22

  • V1 is non-linearly related to V2
  • V3 = 1/V2 is linearly related to V1

(Plot axes: V1, V2)
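A toy illustration of this kind of transformation. The relation V1 = 3/V2 + 1 and the sample values are invented for the example. The point is only that the correlation with V1 becomes exactly linear after replacing V2 by V3 = 1/V2.

```python
import math


def pearson(xs, ys):
    """Correlation coefficient (1/n normalization throughout)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    return cov / (sx * sy)


v2 = [1.0, 2.0, 4.0, 5.0, 10.0]
v1 = [3.0 / v + 1.0 for v in v2]   # hypothetical non-linear dependence on V2
v3 = [1.0 / v for v in v2]         # transformed variable

r_v1_v2 = pearson(v1, v2)          # noticeably short of -1
r_v1_v3 = pearson(v1, v3)          # 1 up to rounding
```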

slide-23
SLIDE 23

  • Variance increases (regression assumes variance is constant)
  • Square root transformation keeps the variance constant
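One setting in which the square root behaves this way is count data whose variance grows with its mean. The sketch below assumes Poisson-distributed counts, which the slide does not state, and computes the variance before and after the transformation exactly from the probability mass function rather than by simulation.

```python
import math


def poisson_variance(mean, transform):
    """Variance of transform(X) for X ~ Poisson(mean), summed over the pmf."""
    upper = int(mean + 20 * math.sqrt(mean)) + 20   # far into the tail
    p = math.exp(-mean)                             # P(X = 0)
    e1 = e2 = 0.0
    for k in range(upper + 1):
        t = transform(k)
        e1 += p * t
        e2 += p * t * t
        p *= mean / (k + 1)                         # P(X = k + 1) from P(X = k)
    return e2 - e1 * e1


means = [10.0, 40.0, 160.0]
raw_var = [poisson_variance(m, float) for m in means]       # grows with the mean
sqrt_var = [poisson_variance(m, math.sqrt) for m in means]  # stays close to 1/4
```

Under this assumption the raw variance equals the mean, while the variance of the square-rooted counts stays near 0.25 across all three means.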

slide-24
SLIDE 24

Form of Data

slide-25
SLIDE 25

Data Matrix

  • A set of p measurements on objects o(1)…o(n)
  • n rows and p columns
  • Also called standard data, data matrix or table

slide-26
SLIDE 26

Multirelational Data

  • Payroll database has
    – Employees table: name, department-name, age, salary
    – Department table: department-name, budget, manager
  • The tables are connected to each other by the department-name field and the fields name and manager
  • Can be combined together, e.g., with fields name, department-name, age, salary, budget, manager
  • Or create as many rows as department-names
  • Flattening may require needless replication of values
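Flattening the payroll example can be sketched as a join on department-name. The rows below are hypothetical values invented for illustration. Only the field names come from the slide.

```python
employees = [
    {"name": "Ann", "department-name": "Sales", "age": 41, "salary": 70000},
    {"name": "Bob", "department-name": "Sales", "age": 29, "salary": 52000},
    {"name": "Cho", "department-name": "R&D", "age": 35, "salary": 64000},
]
departments = [
    {"department-name": "Sales", "budget": 500000, "manager": "Ann"},
    {"department-name": "R&D", "budget": 900000, "manager": "Cho"},
]


def flatten(employees, departments):
    """One row per employee, with the department's fields copied into it."""
    by_dept = {d["department-name"]: d for d in departments}
    rows = []
    for e in employees:
        d = by_dept[e["department-name"]]
        rows.append({**e, "budget": d["budget"], "manager": d["manager"]})
    return rows


flat = flatten(employees, departments)
```

In `flat`, the Sales budget and manager appear once per Sales employee, which is the replication of values that the last bullet warns about.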
slide-27
SLIDE 27

Data Quality

slide-28
SLIDE 28

Outlier