Final projects


SLIDE 1

Final projects

21 May: Class split into 4 sub-classes (one for each TA). Each group gives a ~8 min presentation (each person ~2 min):

  • Motivation & background: which data?
  • Small example
  • Final outcome (focused on method and data)
  • Difficulties

5 groups in each sub-class, 15 min in total per group. Class next week.

SLIDE 2

Topics: Make sure you only solve a small part

1  Source Localization in an Ocean Waveguide Using Unsupervised ML
3  ML Methods for Ship Detection in Satellite Images
4  Transparent Conductor Prediction
4  Classify Ships in San Francisco Bay Using Planet Satellite Imagery
2  Fruit Recognition
3  Bone Age Prediction
1  Facial Expression Classification into Emotions
2  Urban Scene Segmentation for Autonomous Vehicles
1  Face Detection Using Deep Learning
2  Understanding the Amazon Rainforest from Space Using NN
4  Mercedes-Benz Bench Test Time Estimation
3  Vegetation Classification in Hyperspectral Images
4  Threat Detection with CNN
2  Plankton Classification Using ResNet and Inception V3
3  U-net on Biomedical Images
4  Image to Image Transformation Using GAN
1  Dog Breed Classification Using CNN
1  Dog Breed Identification
2  Plankton Image Classification
3  Sunspot Detection

SLIDE 3

What is a Graph?

  • Set of nodes (vertices)
    – Might have properties associated with them
  • Set of edges (arcs), each consisting of a pair of nodes
    – Undirected
    – Directed
    – Unweighted or weighted
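These definitions map directly onto code. A minimal sketch (mine, not from the slides) using a plain Python dict as the adjacency structure; the node names and edges are made up for illustration:

```python
# A weighted directed graph as a dict: node -> list of (neighbor, weight).
# Hypothetical toy graph, just to make the definitions concrete.
graph = {
    "A": [("B", 2.0), ("C", 5.0)],   # directed, weighted edges A->B, A->C
    "B": [("C", 1.0)],
    "C": [],
}

# An undirected edge is simply stored in both directions:
def add_undirected_edge(g, u, v, w=1.0):
    g.setdefault(u, []).append((v, w))
    g.setdefault(v, []).append((u, w))

# An unweighted graph can use weight 1.0 for every edge.
```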
SLIDE 4

Road network

  • Model the road system as a graph
    – Nodes: points where roads meet
    – Edges: road segments between those points
  • Each edge has a weight, e.g.
    – Expected time to get from source node to destination node
    – Distance along the road from source node to destination node
  • Solve a graph optimization problem
    – Shortest weighted path between departure and destination node
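The shortest-weighted-path problem is classically solved with Dijkstra's algorithm. A sketch using the dict-of-edges convention from the previous slide; the function name and the assumption that the target is reachable are my own simplifications:

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest weighted path; graph: node -> list of (neighbor, weight)."""
    dist = {source: 0.0}
    prev = {}
    pq = [(0.0, source)]                      # (distance so far, node)
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            break
        if d > dist.get(u, float("inf")):     # stale queue entry, skip
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Reconstruct the path by walking predecessors back from the target
    # (assumes target is reachable from source).
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    return dist[target], [source] + path[::-1]
```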
SLIDE 5

Trees

[Figure: examples of an undirected tree, a directed tree, and a polytree.]

SLIDE 6

[Figure: spectral coherence between microphones i and j for a 30-microphone array at f = 750 Hz, with normalization |X(f,t)|² = 1. Location 1: Prince, “Sign o’ the times”; Location 1: Otis Redding, “Hard to handle”.]

SLIDE 7

Trees

What would you do tonight? Decide amongst the following: finish homework, go to a party, read a book, or hang out with friends.

[Decision tree:]
  • Homework deadline tonight? Yes → Do homework. No ↓
  • Party invitation? Yes → Go to the party. No ↓
  • Do I have friends? Yes → Hang out with friends. No → Read a book.
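This tree is nothing more than nested if/else rules; a literal transcription in Python (function name and boolean inputs are invented for illustration):

```python
def tonight(deadline_tonight, party_invitation, have_friends):
    # Each internal node of the tree is a yes/no test; each leaf is a decision.
    if deadline_tonight:
        return "do homework"
    if party_invitation:
        return "go to the party"
    if have_friends:
        return "hang out with friends"
    return "read a book"

print(tonight(False, True, False))  # -> "go to the party"
```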

SLIDE 8

Regression Trees (Fig 9.2 in Hastie)

[Figure 9.2 (Hastie): recursive binary splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 partition the (X1, X2) plane into regions R1, …, R5, shown both as a tree and as a partition of the plane.]

SLIDE 9

Details of the tree-building process

1. Divide the predictor space, the set of possible values for X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
2. For every observation that falls into region Rj, we make the same prediction, which is simply the mean of the response values for the training observations in Rj.

The goal is to find boxes R1, ..., RJ that minimize the RSS (residual sum of squares), given by

Σ_{j=1}^{J} Σ_{i ∈ Rj} (yi − ŷ_{Rj})²,

where ŷ_{Rj} is the mean response for the training observations within the jth box.
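A sketch of step 2 and the RSS objective for a fixed partition, assuming each training point already carries an integer region label (the arrays and function name here are hypothetical):

```python
import numpy as np

def region_means_and_rss(y, region):
    """y: response values; region: integer region label R_j for each point."""
    y_hat, rss = {}, 0.0
    for j in np.unique(region):
        in_j = (region == j)
        y_hat[j] = y[in_j].mean()                  # prediction for region R_j
        rss += ((y[in_j] - y_hat[j]) ** 2).sum()   # contribution to the RSS
    return y_hat, rss

# e.g. y = np.array([1.0, 1.2, 3.9, 4.1]), region = np.array([0, 0, 1, 1])
# gives predictions {0: 1.1, 1: 4.0} and RSS = 0.04
```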

SLIDE 10

Trees (Murphy 16.1)

We can write the model in the following form:

f(x) = E[y | x] = Σ_{m=1}^{M} wm I(x ∈ Rm) = Σ_{m=1}^{M} wm φ(x; vm)   (16.4)

An alternative approach is to dispense with kernels altogether, and try to learn useful features φ(x) directly from the input data. That is, we create what we call an adaptive basis-function model (ABM), which is a model of the form

f(x) = w0 + Σ_{m=1}^{M} wm φm(x)   (16.3)

[Figure: partition of the (X1, X2) plane by a classification and regression tree.]

SLIDE 11

Trees (here regression trees)

The Hastie book (chapters 8 & 9) is easiest to read.

f(x) = Σ_{m=1}^{M} cm I(x ∈ Rm)   (9.10)

ĉm = ave(yi | xi ∈ Rm)   (9.11)

We use a sum-of-squares criterion Σi (yi − f(xi))².

Define a split on variable j at point s:

R1(j, s) = {X | Xj ≤ s} and R2(j, s) = {X | Xj > s}.   (9.12)

Then we seek the splitting variable j and split point s that solve

min_{j,s} [ min_{c1} Σ_{xi ∈ R1(j,s)} (yi − c1)² + min_{c2} Σ_{xi ∈ R2(j,s)} (yi − c2)² ].   (9.13)
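A brute-force sketch of the search in (9.13): for every predictor j and every observed value s, score the split by the two within-region sums of squares (the inner minimizations are attained at the region means). Function and variable names are my own:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the (j, s) minimizing (9.13). X: (n, p), y: (n,)."""
    best = (None, None, np.inf)           # (j, s, score)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s           # R1(j, s)
            right = ~left                 # R2(j, s)
            if not left.any() or not right.any():
                continue
            # inner minimizations: c1, c2 are the region means
            score = ((y[left] - y[left].mean()) ** 2).sum() \
                  + ((y[right] - y[right].mean()) ** 2).sum()
            if score < best[2]:
                best = (j, s, score)
    return best
```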

SLIDE 12

Trees (here classification trees)

In a region Rm containing nm points, the proportion of points in class k is

p̂k(Rm) = (1/nm) Σ_{xi ∈ Rm} 1{yi = k}.

We then greedily choose j, s (with R1 = {X ∈ Rm : Xj ≤ s} and R2 = {X ∈ Rm : Xj > s}) by minimizing the misclassification error

argmin_{j,s} ( [1 − p̂c1(R1)] + [1 − p̂c2(R2)] ),

where c1 is the most common class in R1 (and similarly for c2).

[Figure: two candidate splits of the (x1, x2) plane.]

Finding the split in region Rm only requires checking Nm candidate split points per variable.
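A matching sketch for the classification criterion: the misclassification error 1 − p̂_c(R) of a non-empty region, and the score of a candidate split (again, my own names, in the spirit of the regression sketch above):

```python
import numpy as np

def misclassification_error(y):
    """1 - p_hat_c(R) for one non-empty region, c = most common class in R."""
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def split_error(X, y, j, s):
    """Score of the candidate split (j, s): the sum inside the argmin above."""
    left = X[:, j] <= s                   # R1
    return misclassification_error(y[left]) + misclassification_error(y[~left])
```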

SLIDE 13

Bootstrapping

The bootstrap is a fundamental resampling tool in statistics. The idea of the bootstrap is that we can estimate the true distribution F by the so-called empirical distribution F̂. Given the training data (xi, yi), i = 1, ..., n, the empirical distribution function is a discrete probability distribution that puts equal weight 1/n on each observed training point:

P_F̂((X, Y) = (x, y)) = 1/n if (x, y) = (xi, yi) for some i, and 0 otherwise.

A bootstrap sample of size m from the training data is (x*i, y*i), i = 1, ..., m, where each sample is drawn uniformly at random from the training data with replacement. This corresponds exactly to m independent draws from F̂. It approximates what we would see if we could sample more data from the true F. We often consider m = n, which is like sampling an entirely new training set.

From Ryan Tibshirani
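A sketch of bootstrap sampling with NumPy; `bootstrap_sample` is an illustrative helper of my own, not library API:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y, m=None):
    """m draws, uniformly at random with replacement (m = n by default)."""
    n = len(y)
    idx = rng.integers(0, n, size=m if m is not None else n)
    return X[idx], y[idx]        # m independent draws from F-hat
```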

SLIDE 14

Bagging

A single tree has high variance, since a small change in the data can change the tree. Within each region, the predictive task is greatly simplified, i.e.:

  • In classification, we predict the most commonly occurring class.
  • In regression, we take the average response value of points in the region.

Bootstrap aggregating, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; it is particularly useful for decision trees. We generate B bootstrapped training data sets, train our method on the bth bootstrapped training set to get f̂*b(x), the prediction at point x, and average all the predictions:

f̂bag(x) = (1/B) Σ_{b=1}^{B} f̂*b(x).
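A sketch of bagging with scikit-learn's DecisionTreeRegressor as the base learner (the slides name no library, so sklearn here is my assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, B=100, seed=0):
    """Fit B trees, each on its own bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def f_bag(trees, X):
    # average the B predictions f*_b(x)
    return np.mean([t.predict(X) for t in trees], axis=0)
```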

SLIDE 15

Example: bagging

Example (from ESL 8.7.1): n = 30 training data points, p = 5 features, and K = 2 classes. No pruning used in growing trees:

Bagging helps decrease the misclassification rate of the classifier (evaluated on a large independent test set); look at the orange curve.
SLIDE 16

Voting probabilities are not estimated class probabilities

Suppose that we wanted estimated class probabilities out of our bagging procedure. What about using, for each k = 1, ..., K:

p̂_k^bag(x) = (1/B) Σ_{b=1}^{B} 1{f̂_tree,b(x) = k},

i.e., the proportion of votes that were for class k? This is generally not a good estimate. Simple example: suppose that the true probability of class 1 given x is 0.75. Suppose also that each of the bagged classifiers f̂_tree,b(x) correctly predicts the class to be 1. Then p̂_1^bag(x) = 1, which is wrong. What’s nice about trees is that each tree already gives us a set of predicted class probabilities at x: p̂_k^tree,b(x), k = 1, ..., K. These are simply the proportions of points in the appropriate region that are in each class.
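A tiny numeric illustration of the failure mode, with the numbers from the example above:

```python
import numpy as np

# Suppose all B = 100 trees correctly predict class 1 at this x:
votes = np.ones(100)                 # f_tree,b(x) = 1 for every b
p_bag_vote = np.mean(votes == 1)     # proportion of votes for class 1
print(p_bag_vote)                    # 1.0, but the true probability is 0.75
```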

SLIDE 17

Alternative form of bagging

This suggests an alternative method for bagging. Now, given an input x ∈ R^p, instead of simply taking the prediction f̂_tree,b(x) from each tree, we go further and look at its predicted class probabilities p̂_k^tree,b(x), k = 1, ..., K. We then define the bagging estimates of class probabilities:

p̂_k^bag(x) = (1/B) Σ_{b=1}^{B} p̂_k^tree,b(x),   k = 1, ..., K.

The final bagged classifier just chooses the class with the highest probability:

f̂_bag(x) = argmax_{k=1,...,K} p̂_k^bag(x).

This form of bagging is preferred if it is desired to get estimates of the class probabilities. Also, it can sometimes help the overall prediction accuracy.

From Ryan Tibshirani
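A sketch of this probability-averaging variant, assuming `trees` is a list of sklearn DecisionTreeClassifier objects fit on bootstrap samples (as in the bagging sketch earlier) and that every tree saw all K classes, so the predict_proba columns align:

```python
import numpy as np

def bagged_class_probs(trees, X):
    # average each fitted tree's predicted class probabilities p_k^{tree,b}(x)
    return np.mean([t.predict_proba(X) for t in trees], axis=0)   # shape (n, K)

def f_bag(trees, X):
    # final bagged classifier: the class with the highest averaged probability
    probs = bagged_class_probs(trees, X)
    return trees[0].classes_[np.argmax(probs, axis=1)]
```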

SLIDE 18

Random Forest

Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees; this reduces the variance when averaging the trees. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each split is based on a random selection of m predictors, chosen as split candidates from the full set of p predictors; the split is allowed to use only one of those m predictors.
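In scikit-learn this per-split subsampling is the max_features argument; a usage sketch (the hyperparameter values and data names are illustrative, not prescribed by the slides):

```python
from sklearn.ensemble import RandomForestClassifier

# m = max_features predictors are drawn as split candidates at each node;
# "sqrt" (m close to sqrt(p)) is a common default for classification.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            random_state=0)
# rf.fit(X_train, y_train)     # hypothetical training data
# rf.predict(X_test)
```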

SLIDE 19

PATTERN RECOGNITION AND MACHINE LEARNING

CHAPTER 8: GRAPHICAL MODELS

SLIDE 20

10.1.4 Graph terminology

Before we continue, we must define a few basic terms, most of which are very intuitive.

A graph G = (V, E) consists of a set of nodes or vertices, V = {1, ..., V}, and a set of edges, E = {(s, t) : s, t ∈ V}. We can represent the graph by its adjacency matrix, in which we write G(s, t) = 1 to denote (s, t) ∈ E, that is, if s → t is an edge in the graph. If G(s, t) = 1 iff G(t, s) = 1, we say the graph is undirected; otherwise it is directed. We usually assume G(s, s) = 0, which means there are no self loops. Here are some other terms we will commonly use:

  • Parent For a directed graph, the parents of a node are all the nodes that feed into it: pa(s) ≜ {t : G(t, s) = 1}.
  • Child For a directed graph, the children of a node are all the nodes that feed out of it: ch(s) ≜ {t : G(s, t) = 1}.
  • Family For a directed graph, the family of a node is the node and its parents, fam(s) = {s} ∪ pa(s).
  • Root For a directed graph, a root is a node with no parents.
  • Leaf For a directed graph, a leaf is a node with no children.
  • Ancestors For a directed graph, the ancestors are the parents, grand-parents, etc. of a node. That is, the ancestors of t are the nodes that connect to t via a trail: anc(t) ≜ {s : s ❀ t}.
  • Descendants For a directed graph, the descendants are the children, grand-children, etc. of a node. That is, the descendants of s are the nodes that can be reached via trails from s: desc(s) ≜ {t : s ❀ t}.
  • Neighbors For any graph, we define the neighbors of a node as the set of all immediately connected nodes, nbr(s) ≜ {t : G(s, t) = 1 ∨ G(t, s) = 1}. For an undirected graph, we

SLIDE 21

write s ∼ t to indicate that s and t are neighbors (so (s, t) ∈ E is an edge in the graph).

  • Degree The degree of a node is the number of neighbors. For directed graphs, we speak of the in-degree and out-degree, which count the number of parents and children.
  • Cycle or loop For any graph, we define a cycle or loop to be a series of nodes such that we can get back to where we started by following edges, s1 − s2 − · · · − sn − s1, n ≥ 2. If the graph is directed, we may speak of a directed cycle. For example, in Figure 10.1(a), there are no directed cycles, but 1 → 2 → 4 → 3 → 1 is an undirected cycle.
  • DAG A directed acyclic graph or DAG is a directed graph with no directed cycles. See Figure 10.1(a) for an example.
  • Topological ordering For a DAG, a topological ordering or total ordering is a numbering of the nodes such that parents have lower numbers than their children. For example, in Figure 10.1(a), we can use (1, 2, 3, 4, 5), or (1, 3, 2, 5, 4), etc.
  • Path or trail A path or trail s ❀ t is a series of directed edges leading from s to t.
  • Tree An undirected tree is an undirected graph with no cycles. A directed tree is a DAG in which there are no directed cycles. If we allow a node to have multiple parents, we call it a polytree, otherwise we call it a moral directed tree.
  • Forest A forest is a set of trees.
  • Subgraph A (node-induced) subgraph GA is the graph created by using the nodes in A and their corresponding edges, GA = (VA, EA).
  • Clique For an undirected graph, a clique is a set of nodes that are all neighbors of each other. A maximal clique is a clique which cannot be made any larger without losing the clique property. For example, in Figure 10.1(b), {1, 2} is a clique but it is not maximal, since we can add 3 and still maintain the clique property. In fact, the maximal cliques are as follows: {1, 2, 3}, {2, 3, 4}, {3, 5}.
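A sketch of the adjacency-matrix versions of these definitions in NumPy (my own helpers; nodes are 0-indexed here rather than the 1-indexed ones used above, and the example DAG is made up):

```python
import numpy as np

def parents(G, s):   return np.flatnonzero(G[:, s]).tolist()   # t with G(t,s)=1
def children(G, s):  return np.flatnonzero(G[s, :]).tolist()   # t with G(s,t)=1
def neighbors(G, s): return np.flatnonzero(G[s, :] | G[:, s]).tolist()

# Hypothetical 5-node DAG (adjacency matrix with G[s, t] = 1 iff s -> t):
G = np.zeros((5, 5), dtype=int)
for s, t in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    G[s, t] = 1

print(parents(G, 3))    # [1, 2]
print(children(G, 0))   # [1, 2]
print(neighbors(G, 3))  # [1, 2, 4]
```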

SLIDE 22

Three types of graphical model

  • Directed graphs – useful for designing models
  • Undirected graphs – good for some domains, e.g. computer vision
  • Factor graphs – useful for inference and learning

SLIDE 23

Bayesian Networks (Bayes Nets) or Directed graphical model (DGM)

Decomposition: p(a, b, c) = p(c | a, b) p(b | a) p(a)   (Bishop eq. 8.1)
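A sketch of this decomposition for a small, hypothetical Bayes net over binary variables, with each conditional p(x_k | pa_k) stored as a dict CPT (all numbers invented):

```python
# Hypothetical net a -> b, a -> c, b -> c over binary variables.
# Each CPT maps (value, parent values...) -> probability.
p_a = {0: 0.4, 1: 0.6}
p_b_given_a = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.3, (1, 1): 0.7}
p_c_given_ab = {(0, 0, 0): 0.8, (1, 0, 0): 0.2,
                (0, 0, 1): 0.5, (1, 0, 1): 0.5,
                (0, 1, 0): 0.6, (1, 1, 0): 0.4,
                (0, 1, 1): 0.1, (1, 1, 1): 0.9}

def joint(a, b, c):
    # p(a, b, c) = p(a) * p(b|a) * p(c|a,b): a product over p(x_k | pa_k)
    return p_a[a] * p_b_given_a[(b, a)] * p_c_given_ab[(c, a, b)]

# Sanity check: the probabilities of all 8 configurations sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(total)  # 1.0
```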

SLIDE 24

Directed Graphs or Bayesian Networks

General factorization: p(x) = ∏_{k=1}^{K} p(xk | pak)   (Bishop eq. 8.5)

SLIDE 25

Bayesian Curve Fitting (1)

[Figures: the polynomial regression model as a directed graph, and its compact plate notation.]

SLIDE 26

Bayesian Curve Fitting (3)

[Figures: the graph with input variables and explicit hyperparameters shown, and the graph after conditioning on the observed data.]

SLIDE 27

Bayesian Curve Fitting —Prediction

Predictive distribution: p(t̂ | x̂, x, t) = ∫ p(t̂ | x̂, w) p(w | x, t) dw,

i.e., we marginalize the polynomial weights w against their posterior given the data (x, t).

SLIDE 28

Discrete Variables (1)

  • One variable x1 with K states: K − 1 parameters
  • General joint distribution over two variables: K² − 1 parameters
  • Two independent variables: 2(K − 1) parameters
  • General joint distribution over M variables: K^M − 1 parameters
  • M-node Markov chain: K − 1 + (M − 1)K(K − 1) parameters

E.g., for K = 2 and M = 3: the full joint needs 2³ − 1 = 7 parameters, while the Markov chain needs 1 + 2·2·1 = 5.

SLIDE 29

Discrete Variables: Bayesian Parameters (1)

SLIDE 30

Discrete Variables: Bayesian Parameters (2)

Shared prior

SLIDE 31

Parameterized Conditional Distributions

If x1, ..., xM are discrete, K-state variables, the general conditional distribution p(y = 1 | x1, ..., xM) has O(K^M) parameters.

The parameterized form requires only M + 1 parameters. (For binary variables, K = 2, the general form has 2^M parameters.)

SLIDE 32

Conditional Independence

a is independent of b given c: p(a | b, c) = p(a | c)

Equivalently: p(a, b | c) = p(a | c) p(b | c)

Notation: a ⊥⊥ b | c

SLIDE 33

Conditional Independence: Example 1

SLIDE 34

Conditional Independence: Example 1

SLIDE 35

Conditional Independence: Example 2

SLIDE 36

Conditional Independence: Example 2

SLIDE 37

Conditional Independence: Example 3

Note: this is the opposite of Example 1, with c unobserved.

SLIDE 38

Conditional Independence: Example 3

Note: this is the opposite of Example 1, with c unobserved.

SLIDE 39

D-separation

  • A, B, and C are non-intersecting subsets of nodes in a directed graph.
  • A path from A to B is blocked if it contains a node such that either
    a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
    b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
  • If all paths from A to B are blocked, A is said to be d-separated from B by C. Then the joint distribution over all variables satisfies A ⊥⊥ B | C.
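A brute-force sketch of this definition for small graphs (my own code, not from the slides): enumerate every path from A to B ignoring edge direction, and test each intermediate node against rules (a) and (b). G is a 0/1 adjacency matrix as a nested list, with G[s][t] = 1 meaning s → t:

```python
def descendants(G, s):
    """All nodes reachable from s by following directed edges."""
    seen, stack = set(), [s]
    while stack:
        u = stack.pop()
        for v, edge in enumerate(G[u]):
            if edge and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def undirected_paths(G, a, b, path=None):
    """All simple paths from a to b, ignoring edge direction."""
    path = path or [a]
    if a == b:
        yield path
        return
    for v in range(len(G)):
        if v not in path and (G[a][v] or G[v][a]):
            yield from undirected_paths(G, v, b, path + [v])

def path_blocked(G, path, C):
    for i in range(1, len(path) - 1):
        u, m, v = path[i - 1], path[i], path[i + 1]
        if G[u][m] and G[v][m]:            # arrows meet head-to-head at m
            # rule (b): blocked unless m or a descendant of m is in C
            if not ({m} | descendants(G, m)) & set(C):
                return True
        elif m in C:                       # rule (a): head-to-tail/tail-to-tail
            return True
    return False

def d_separated(G, A, B, C):
    """True if every path from A to B is blocked by C."""
    return all(path_blocked(G, p, C)
               for a in A for b in B
               for p in undirected_paths(G, a, b))
```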

D-separation: Example