Introduction to Data Science: Principles


Measurements and Data Types

A data analysis to get us going

Analysis of Baltimore crime data, downloaded from Baltimore City's awesome open data site. (This was downloaded a couple of years ago, so if you download it now you will get different results.) The repository for this particular data is here: https://data.baltimorecity.gov/Crime/BPD-Arrests/3i3v-ibrt

Getting data

We've previously prepared the data as a comma-separated value file (.csv file): each column defines an attribute that describes arrests, and each line contains attribute values (separated by commas) describing a specific arrest.

Note: To download this dataset and follow along, you can use the following code:

if (!dir.exists("data")) dir.create("data")
download.file("https://www.hcbravo.org/IntroDataSci/misc/BPD_Arrests.csv", destfile="data

To make use of this dataset, we want to assign the result of calling read_csv (i.e., the dataset) to a variable:

library(tidyverse)
arrest_tab <- read_csv("data/BPD_Arrests.csv")
arrest_tab
## # A tibble: 104,528 x 15
##    arrest   age race  sex   arrestDate arrestTime arrestLocation
##     <dbl> <dbl> <chr> <chr> <chr>      <time>     <chr>
## 1  1.11e7    23 B     M     01/01/2011 00'00"     <NA>
## 2  1.11e7    37 B     M     01/01/2011 01'00"     2000 Wilkens …

Now we can ask what type of value is stored in the arrest_tab variable:

class(arrest_tab)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"

The data.frame is a workhorse data structure in R. It encapsulates the idea of entities (in rows) and attribute values (in columns). We call these rectangular datasets. The other types, tbl_df and tbl, are added by tidyverse for improved functionality. Later, we will see how the pandas Python package provides the same semantics.
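As a minimal sketch of the rectangular idea, the following builds a tiny data frame by hand in base R. The name toy_tab and every value in it are made up for illustration; none of it comes from the arrests data.

```r
# Hypothetical toy data: two entities (rows), three attributes (columns).
toy_tab <- data.frame(
  age = c(23, 37),
  sex = c("M", "F"),
  district = c("A", "B"),
  stringsAsFactors = FALSE
)

# A data frame is rectangular: every column holds one value per row.
dim(toy_tab)            # number of rows, then number of columns
names(toy_tab)          # attribute (column) names
is.data.frame(toy_tab)  # also TRUE for tibbles, which extend data.frame
```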

We can also ask about other features of this dataset:

# This is a comment in R, by the way
# How many rows (entities) does this dataset contain?
nrow(arrest_tab)
## [1] 104528
# How many columns (attributes)?
ncol(arrest_tab)
## [1] 15

Now, in RStudio, you can view the data frame using View(arrest_tab).

Entities and attributes

We use the term entities to refer to the objects that a dataset represents. In our example dataset, each arrest is an entity. In a rectangular dataset (a data frame), entities correspond to rows in a table.

A dataset contains attributes for each entity. Attributes of each arrest would be the person's age, the type of offense, the location, etc. In a rectangular dataset, attributes correspond to the columns in a table.

This language of entities and attributes is commonly used in the database literature. In statistics you may see experimental units or samples for entities and covariates for attributes. In other instances you may see observations for entities and variables for attributes. In machine learning you may see examples for entities and features for attributes. For the most part, all of these are interchangeable.

This table summarizes the terminology:

Field              Entities                Attributes
Databases          Entities                Attributes
Machine Learning   Examples                Features
Statistics         Observations/Samples    Variables/Covariates

Categorical attributes

A categorical attribute for a given entity can take only one of a finite set of possible values.

For example, the sex variable can only have the value M or F, or be missing (we'll talk about missing data later in the semester).
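One common way to represent a categorical attribute in R is a factor, which stores the finite set of allowed values as its levels. The sketch below uses made-up values rather than the arrests data, and marks a missing entry with NA.

```r
# Hypothetical values, not drawn from the arrests data.
sex <- factor(c("M", "F", "M", NA, "F"), levels = c("F", "M"))

levels(sex)                  # the finite set of allowed values
table(sex, useNA = "ifany")  # counts per category, plus missing entries
is.na(sex)                   # which entries are missing
```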

The result of a coin flip is categorical. The outcome of rolling an 8-sided die is also categorical. Can you think of other examples?

Categorical data may be ordered or unordered. In our example, all categorical data is unordered. Examples of ordered categorical data are grades in a class and Likert scale categories (e.g., strongly agree, agree, neutral, disagree, strongly disagree).
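In R, ordered categories can be sketched with an ordered factor. The responses below are hypothetical; the point is only that order comparisons are defined while arithmetic is not.

```r
likert_levels <- c("strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree")

# Hypothetical survey responses.
responses <- factor(c("agree", "neutral", "strongly agree", "disagree"),
                    levels = likert_levels, ordered = TRUE)

responses[1] > responses[2]  # order comparisons are defined
max(responses)               # so are min and max
# Arithmetic such as responses[1] / responses[2] is not meaningful:
# R returns NA with a warning for factors.
```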

Discrete numeric attributes

These are attributes that can take specific values from elements of ordered, discrete (possibly infinite) sets. The most common set in this case would be the non-negative integers. This data is commonly the result of counting processes. In our example dataset, age, measured in years, is a discrete attribute.

Frequently, we obtain datasets as the result of summarizing, or aggregating, other underlying data. In our case, we could construct a new dataset containing the number of arrests per neighborhood (we will see how to do this later).

## # A tibble: 6 x 2
##   neighborhood      number_of_arrests
##   <chr>                         <int>
## 1 Abell                            62
## 2 Allendale                       297
## 3 Arcadia                          78
## 4 Arlington                       694
## 5 Armistead Gardens               153
## 6 Ashburton                        78

In this new dataset, each neighborhood is an entity, and the number_of_arrests attribute is a discrete numeric attribute.
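The slides defer the details of this aggregation, so the following is only one possible sketch, using base R's table() on made-up records. The name toy_arrests and its rows are hypothetical, and the approach shown later in the course may differ.

```r
# Hypothetical arrest records: one row per arrest.
toy_arrests <- data.frame(
  neighborhood = c("Abell", "Arcadia", "Abell",
                   "Allendale", "Abell", "Arcadia"),
  stringsAsFactors = FALSE
)

# Count rows per neighborhood; each row of the result is a neighborhood.
counts <- as.data.frame(table(neighborhood = toy_arrests$neighborhood),
                        responseName = "number_of_arrests",
                        stringsAsFactors = FALSE)
counts
```

The entities change from arrests to neighborhoods, and the new attribute is a count, which is why it is discrete.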

Other examples: the number of students in a class and the number of friends of a specific Facebook user are both discrete. Can you think of other examples?

Distinctions between ordered categorical and discrete numerical data: ordered categorical data do not have magnitude.

For instance, is an 'A' in a class twice as good as a 'C'? Is a 'C' twice as good as a 'D'? Not necessarily. Grades don't have an inherent magnitude.

However, if we encode grades as F=0, D=1, C=2, B=3, A=4, they do have magnitude. In that case, an 'A' is twice as good as a 'C', and a 'C' is twice as good as a 'D'.

In summary: if ordered data has magnitude, it is discrete numeric; if not, it is ordered categorical.
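The difference can be sketched in R by holding the same grades first as an ordered factor and then as numbers under the F=0 to A=4 encoding from the text. The variable names are illustrative only.

```r
# Grades as an ordered categorical attribute: order only, no magnitude.
grades <- factor(c("A", "C", "D"),
                 levels = c("F", "D", "C", "B", "A"), ordered = TRUE)
grades[1] > grades[2]  # an 'A' ranks above a 'C'

# The encoding from the text, F=0 ... A=4, as a named lookup vector.
grade_points <- c(F = 0, D = 1, C = 2, B = 3, A = 4)
grade_values <- unname(grade_points[as.character(grades)])

grade_values[1] / grade_values[2]  # 'A' relative to 'C' under this encoding
grade_values[2] / grade_values[3]  # 'C' relative to 'D' under this encoding
```

The ratios exist only because of the chosen encoding; a different encoding would give different ratios while leaving the order unchanged.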

Continuous numeric data

These are attributes that can take any value in a continuous set. For example, a person's height, in say inches, can take any value (within the range of human heights).

A different dataset: the entities are cars, and we look at the continuous numeric attributes speed and stopping distance.
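R ships with a built-in dataset named cars that records speed and stopping distance. Whether this is the dataset the slide refers to is an assumption; the sketch below only shows how such attributes look in R.

```r
# `cars` is a built-in R dataset (datasets package); that the slide used it
# is an assumption.
str(cars)           # columns: speed (mph) and dist (stopping distance, ft)
summary(cars$dist)

# Both attributes are stored as numeric values.
sapply(cars, is.numeric)
```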

Continuous Numeric Attributes

The distinction between continuous and discrete can be tricky: measurements that have finite precision are, in a sense, discrete. Remember, continuity is not a property of the specific dataset you have in hand; it is a property of the process you are measuring.

The number of arrests in a neighborhood cannot be fractional, regardless of the precision at which we measure it. On the other hand, if we had the appropriate tool, we could measure a person's height with infinite precision.
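A small simulation can illustrate the point about finite precision. The heights below are randomly generated stand-ins for a continuous process, and the half-inch tape measure is a hypothetical instrument.

```r
# Hypothetical heights in inches, standing in for a continuous process.
set.seed(1)
true_height <- runif(1000, min = 60, max = 75)

# A tape measure that reads to the nearest half inch.
recorded_height <- round(true_height * 2) / 2

length(unique(true_height))      # many distinct values
length(unique(recorded_height))  # at most 31 possible readings (60, 60.5, ..., 75)
```

The recorded values form a small discrete set, yet the underlying quantity is still continuous.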

This distinction is very important when we build statistical models of datasets for analysis. For now, think of discrete data as the result of counting, and continuous data as the result of some physical measurement. Here's a question: is age in our dataset a continuous or discrete numeric value?

Other examples

MNIST dataset of handwritten digits. Each image is an entity. Each image has a label attribute which states which of the digits 0, 1, ..., 9 is represented by the image. What type of data is this (categorical, continuous numeric, or discrete numeric)?

Each image is represented by grayscale values in a 28x28 grid. That's 784 attributes, one for each square in the grid, containing a grayscale value. What type of data are these other 784 attributes?
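To make the 784-attribute view concrete, here is a sketch that flattens a 28x28 grid into a single vector. The pixel values are randomly generated rather than loaded from MNIST, and the label 7 is arbitrary.

```r
# Hypothetical 28 x 28 grid of integer grayscale values in 0..255;
# real MNIST pixels are not loaded here.
set.seed(1)
img <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28, ncol = 28)

# Flatten the grid into one attribute per pixel.
pixels <- as.vector(img)
length(pixels)  # 28 * 28 attributes

# With the label attribute added, each image becomes one row (entity).
entity <- c(label = 7, pixels)
length(entity)
```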

Other important datatypes

Text: Arbitrary strings that do not encode a categorical attribute.

Datetime: Date and time of some event or observation (e.g., arrestDate, arrestTime).

Geolocation: Latitude and longitude of some event or observation (e.g., Location).

Relationships: Links between entities, with links having their own attributes (e.g., in a social network, how long two people have followed each other).
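As a sketch of why datetimes deserve their own type, the code below converts date strings shaped like the arrestDate values into R Date objects. The month/day/year format is an assumption (the sample rows cannot confirm it), and the second value is made up.

```r
# arrestDate values such as "01/01/2011" are read as plain strings (<chr>).
# Assumption: the format is month/day/year.
arrest_date_chr <- c("01/01/2011", "01/02/2011")
arrest_date <- as.Date(arrest_date_chr, format = "%m/%d/%Y")

class(arrest_date)  # "Date"
diff(arrest_date)   # dates support arithmetic; strings do not
```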

Units

Something that we tend to forget, but that is extremely important for the modeling and interpretation of data, is that attributes are measurements and that they have units. For example, the age of a person can be measured in different units: years, months, etc.

These can be converted to one another, but nonetheless, in a given dataset, that attribute or measurement will be recorded in some specific units. Similar arguments go for distances and times, for example.

In other cases, we may have unitless measurements (we will see an example of this later when we do dimensionality reduction). In these cases, it is worth thinking about why your measurements are unitless.

When performing analyses that try to summarize the effect of some measurement or attribute on another, units matter a lot! We will see the importance of this in our regression section. For now, make sure you make a mental note of the units for each measurement you come across. This is important when modeling and when interpreting the results of these models.
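As a preview of why units matter, the sketch below fits the same simple linear model twice on R's built-in cars data, once with speed in miles per hour and once in kilometers per hour. Using lm() here is an illustration only, not necessarily how the regression section proceeds.

```r
# Built-in `cars` data: speed in mph, stopping distance in feet.
fit_mph <- lm(dist ~ speed, data = cars)

# Same data with speed converted to km/h (1 mile = 1.609344 km).
cars_kmh <- transform(cars, speed = speed * 1.609344)
fit_kmh <- lm(dist ~ speed, data = cars_kmh)

coef(fit_mph)["speed"]  # feet of stopping distance per extra mph
coef(fit_kmh)["speed"]  # feet of stopping distance per extra km/h

# The fit is the same; only the slope's units, and so its value, change.
all.equal(unname(coef(fit_mph)["speed"]),
          unname(coef(fit_kmh)["speed"]) * 1.609344)
```

The two slopes describe the same relationship, but reading either one without its units would be misleading.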

Introduction to Data Science: Principles

Héctor Corrada Bravo

University of Maryland, College Park, USA 2020-01-28
