

SLIDE 1

Computational Thinking ct.cs.ubc.ca

Administrative notes October 26, 2017

  • We’ll do some In the News Groupwork today
  • Reminder: my office hours for this Friday are cancelled
  • Reminder: optional project proposal resubmission deadline extension, now due tomorrow

  • Reminder: midterm #2 November 9 in class
SLIDE 2

Today we’re going to start on the group component of In The News call #2

  • Today we’re going to spend some time in class having your groups work on the group component
  • Make sure that you’ve read the grade rubric: https://www.ugrad.cs.ubc.ca/~cs100/2017W1/in-the-news.html#rubric
  • Make sure you comment on the CT Building Block, Application, and/or Impact!
  • Picking an article/topic and then looking for related articles is both okay and encouraged
  • You can pick an article that one of the people in your group did or any other article that has been posted
  • Make sure that you cite the articles that you use (pick any citation style); see the discussion of plagiarism on the project page: https://www.ugrad.cs.ubc.ca/~cs100/2017W1/project.html#plagiarism

SLIDE 3

Greed, for lack of a better word, is good

  • The algorithm that we used to create the decision tree is a greedy algorithm
  • In a greedy algorithm, you make the choice that’s optimal for now and hope that it’s the optimal choice in the long run
  • Sometimes it’s the best in the long run, sometimes it’s not.
  • In building a decision tree, greedy will not always be optimal, but it’s pretty good, and it’s much faster than an optimal approach
  • In some problems you can prove that greedy can find the best solution!
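
The "sometimes it's the best, sometimes it's not" point can be sketched with a small example that is not from the lecture: making change with as few coins as possible. The function names and the coin systems below are assumptions chosen only for illustration. The greedy rule always takes the largest coin that fits, and a slower exhaustive method gives the true minimum for comparison.

```python
def greedy_coins(amount, coins):
    """Greedy: always take the largest coin that still fits."""
    used = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            used.append(coin)
            amount -= coin
    return used


def fewest_coins(amount, coins):
    """Slower exhaustive answer: the true minimum number of coins."""
    best = [0] + [None] * amount  # best[a] = fewest coins that sum to a
    for a in range(1, amount + 1):
        options = [best[a - c] for c in coins
                   if c <= a and best[a - c] is not None]
        best[a] = min(options) + 1 if options else None
    return best[amount]


# Familiar coins: greedy matches the true minimum for this amount.
print(len(greedy_coins(30, [25, 10, 5, 1])), fewest_coins(30, [25, 10, 5, 1]))

# A made-up coin system: greedy picks 4 + 1 + 1, but 3 + 3 uses fewer coins.
print(len(greedy_coins(6, [4, 3, 1])), fewest_coins(6, [4, 3, 1]))
```

The same greedy rule is optimal for one set of coins and not for the other, so whether greedy works depends on the problem.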

SLIDE 4

Computational thinking in your life: homework

In a group, discuss your algorithms for how you decide what order to do your homework in and why you choose that order

  • The homework where I'm the most behind first
  • Whichever's due first
  • Whatever the next course is
  • Easy ones first
  • Hardest ones first
  • Ones that I like first
  • Whichever one I think is fastest
  • First come first served (first assigned)

SLIDE 5

Which algorithm is best requires knowing what you’re trying to optimize (the “why”)

In a group, design a greedy algorithm to reduce the length of your homework todo list as fast as possible

Hint: your algorithm should look like “always do the [property] remaining assignment next”

  • Do the shortest first

SLIDE 6

Clicker question: is it optimal?

Just guess: is a correctly written greedy algorithm that minimizes the length of your todo list by always doing the shortest assignment next optimal?

  • A. Yes
  • B. No
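
One way to test a guess on a small input is to compare the greedy order against every possible order. The sketch below assumes that "reducing the todo list as fast as possible" means minimizing the total time that assignments spend on the list (the sum of their finish times). That reading, the function names, and the durations are all assumptions, and a check on one input is evidence rather than a proof.

```python
from itertools import permutations


def total_time_on_list(order):
    """Add up how long each assignment stays on the todo list."""
    elapsed = 0
    total = 0
    for duration in order:
        elapsed += duration  # this assignment finishes now
        total += elapsed     # it was on the list until this moment
    return total


def shortest_first(durations):
    """Greedy rule: always do the shortest remaining assignment next."""
    return sorted(durations)


durations = [3, 1, 4, 2]  # hypothetical hours per assignment
greedy_cost = total_time_on_list(shortest_first(durations))
best_cost = min(total_time_on_list(p) for p in permutations(durations))
print(greedy_cost, best_cost)
```

If the two printed numbers match, no ordering of these four assignments beats shortest-first under the assumed measure.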
SLIDE 7

Are other scheduling criteria optimized with greedy algorithms?

  • Some yes: minimizing maximal lateness (greedily do the assignment with the closest due date first)
  • Some no: if you still want to reduce your todo list as much as possible, but you want to have different priorities for different classes, greedy is no longer optimal.
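
The "closest due date first" rule can be sketched as follows. Lateness is taken here to be finish time minus due date, so it is negative when an assignment is early. The task list and function names are hypothetical. The brute-force comparison only checks this one small input.

```python
from itertools import permutations


def max_lateness(order):
    """Largest value of (finish time - due date) over all assignments."""
    elapsed = 0
    worst = float("-inf")
    for duration, due in order:
        elapsed += duration
        worst = max(worst, elapsed - due)
    return worst


def earliest_due_first(tasks):
    """Greedy rule: always do the assignment with the closest due date next."""
    return sorted(tasks, key=lambda task: task[1])


# Each task is (hours needed, due time in hours from now); made-up values.
tasks = [(3, 6), (2, 8), (1, 9), (4, 9)]
greedy_worst = max_lateness(earliest_due_first(tasks))
best_worst = min(max_lateness(p) for p in permutations(tasks))
print(greedy_worst, best_worst)
```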

SLIDE 8

Popping back up a level…

The second type of data mining that we will look at in detail involves putting similar items together in groups

SLIDE 9

Exercise: Group this!

Given the list of items below, put items together into groups. You can have as many groups as you want. Groups do not need to have the same number of items.

SLIDE 10

Exercise: Group this!

What kind of groups did you get? What criteria did you use to form each group?

  • Digital images vs. non-digital images
  • Colours
  • 3D vs. 2D

SLIDE 11

Exercise: Group this! - Possible Solution 1

  • Group 1: Things I use when going to school
  • Group 2: People I can call 911 to get
  • Group 3: Flag

SLIDE 12

Exercise: Group this! - Possible Solution 2

  • Group 1: Things that are green
  • Group 2: Things that are blue
  • Group 3: Things that are red and white

SLIDE 13

What is clustering?

Clustering is partitioning a set of items into subgroups so as to ensure certain measures of quality (e.g., “similar” items are grouped together)

SLIDE 14

Why cluster? Netflix movie recommendations

“There’s a mountain of data that we have at our disposal,” says Todd Yellin, Netflix’s VP of product innovation. “That mountain is composed of two things. Garbage is 99 percent of that mountain. Gold is one percent…”

Group exercise: What information about customers do you think that Netflix uses when deciding what movies to recommend?

  • Previous movies that you've watched
  • Geographical area
  • Age
  • Gender
  • Time
  • TV shows vs. movies

SLIDE 15

Why cluster? Netflix movie recommendations

“There’s a mountain of data that we have at our disposal,” says Todd Yellin, Netflix’s VP of product innovation. “That mountain is composed of two things. Garbage is 99 percent of that mountain. Gold is one percent… Geography, age, and gender? We put that in the garbage heap. Where you live is not that important.”

https://www.wired.com/2016/03/netflixs-grand-maybe-crazy-plan-conquer-world/

SLIDE 16

Why cluster? Netflix movie recommendations

Netflix groups its tens of thousands of titles into a few thousand “clusters” based not on where people live, but on what they like. Netflix assigns each subscriber to a handful of these clusters, weighted by the degree to which each matches their taste. “When you have more than 75 million people around the world, you can get really specific about who’s your taste,” says Yellin.

https://www.wired.com/2016/03/netflixs-grand-maybe-crazy-plan-conquer-world/

SLIDE 17

Why cluster? Netflix movie recommendations

The movies recommended to you are based on those that others in your clusters watch or recommend. “We used to be more naive. We used to overexploit individual signals,” says Yellin. “If you watched a romantic comedy, years ago we would have overexploited that. The whole top of your screen would be more romantic comedies. Not a lot of variety. And that gets you into a quick cul-de-sac of too much content around one area.”

https://www.wired.com/2016/03/netflixs-grand-maybe-crazy-plan-conquer-world/

SLIDE 18

Why cluster? Netflix movie recommendations

A related problem: how to predict how users will rate a new movie? Netflix has a competition with a 1 million dollar prize for algorithms that do this well. They provide training data: 100 million ratings generated by over 480 thousand users on over 17 thousand movies. Competitors use clustering (among other techniques) in their solutions.

SLIDE 19

Why cluster? Breast cancer treatment

SLIDE 20

First, let’s define Gene Expression

http://learn.genetics.utah.edu/content/science/expression/

SLIDE 21

Why cluster? Breast cancer treatment

“Breast cancer patients with the same stage of disease can have markedly different treatment responses and overall outcome. [...] Chemotherapy or hormonal therapy reduces the risk of distant metastases by approximately one-third; however, 70–80% of patients receiving this treatment would have survived without it.”

SLIDE 22

Why cluster? Breast cancer treatment

“Here we applied supervised classification to identify a gene expression signature strongly predictive of a short interval to distant metastases ('poor prognosis' signature). Our findings provide a strategy to select patients who would benefit from adjuvant therapy.”

“An unsupervised, hierarchical clustering algorithm allowed us to cluster the 98 tumours on the basis of their similarities measured over [...] approximately 5,000 significant genes.”

SLIDE 23

Why cluster? Breast cancer treatment

Visual of two tumour clusters: one with primarily upregulated (red) genes, the other with almost all downregulated (green) genes. One tumour per row, one gene per column.

SLIDE 24

Why cluster?

  • A way to explore data for any hidden patterns or correlations
  • Once you see something, you can delve further, but it is a good way to quickly try to see if there are any possible relationships you have missed
  • Helps organize data
  • Reduces the number of data points (e.g., you can reduce a cluster to a representative data point)
  • Results might be fed into other data mining techniques
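
As a sketch of the "representative data point" idea, one common choice (an assumption here, since the slides do not say which representative to use) is the centroid: the coordinate-wise average of the points in the cluster. The function name and the sample points are made up for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


# A hypothetical cluster of three 2-dimensional points.
cluster = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]
representative = centroid(cluster)
print(representative)
```

Three points are replaced by one, which is the sense in which clustering can reduce the number of data points.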

SLIDE 25

Clustering by numbers

  • All of the examples we’ve seen can be framed as “clustering by numbers”

  • What do we mean by that?
SLIDE 26

Clustering by numbers

  • All of the examples we’ve seen can be framed as “clustering by numbers”
  • What do we mean by that?
  • We cluster points, typically in a high-dimensional space
  • The example here is a 2-dimensional space

SLIDE 27

Administrative notes October 31, 2017

  • Reminder: In the News #2 Groupwork due tonight at midnight
  • You must be in a group to submit! Sign up on Canvas now if you don’t have a group!

  • Reminder: midterm #2 November 9 in class
  • Not cumulative (i.e., material from midterm #1 will not appear)
SLIDE 28

The goal in clustering data is to find points that are “near” each other

  • For example, to form project groups, we might cluster students along the dimensions of “desired grade” and “procrastination tendency”
  • Most of the time, there are many more dimensions

[Figure: scatter plot of students; axes: desired grade and procrastination tendency]
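
A minimal sketch of what "near" can mean for the student example, assuming each student is a point (desired grade, procrastination tendency) and that nearness is ordinary straight-line (Euclidean) distance. The names, the scales, and the values are invented for illustration.

```python
import math

# Hypothetical students: (desired grade out of 100, procrastination from 0 to 10).
students = {
    "Ana": (90, 2),
    "Ben": (88, 3),
    "Cal": (65, 9),
}


def nearest(name):
    """Return the other student whose point is closest to the given student's."""
    others = [other for other in students if other != name]
    return min(others, key=lambda other: math.dist(students[name], students[other]))


print(nearest("Ana"))
print(nearest("Cal"))
```

The two dimensions use different scales here, so the grade dominates the distance. Rescaling the dimensions before clustering is a common adjustment, though the slides do not discuss it.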

SLIDE 29

Clustering by numbers: Netflix example

Clustering task: cluster movies based on whether subscribers give them similar ratings

Data: tens of thousands of movies; for each movie, subscriber ratings (there are almost 100 million subscribers!)

Data points: one point per movie: (rating_1, rating_2, ..., rating_n), where rating_k is the rating of subscriber k (or "null" if no rating)

Dimension: the number of subscribers who provide ratings
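
The data layout described above can be sketched as one tuple per movie, with `None` standing in for "null". The distance function below compares two movies using only the subscribers who rated both. This is one simple way to handle missing ratings and is an assumption for illustration, not a description of what Netflix does. The movie names and ratings are made up.

```python
import math

# One point per movie; position k holds subscriber k's rating, or None if no rating.
ratings = {
    "Movie A": (5, 4, None, 1),
    "Movie B": (4, 5, 2, None),
    "Movie C": (1, None, 5, 5),
}


def rating_distance(first, second):
    """Euclidean distance over the subscribers who rated both movies."""
    shared = [(a, b) for a, b in zip(first, second)
              if a is not None and b is not None]
    if not shared:
        return None  # no subscriber in common, so no comparison is possible
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))


dimension = len(ratings["Movie A"])  # number of subscribers
print(dimension)
print(rating_distance(ratings["Movie A"], ratings["Movie B"]))
print(rating_distance(ratings["Movie A"], ratings["Movie C"]))
```

Movies A and B are rated similarly by the subscribers they share, so their distance is smaller than the distance between A and C.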

SLIDE 30

Clustering by numbers: Breast cancer example

Clustering task: cluster breast cancer tumour samples, based on similarities between gene expression measurements

Data: 98 tumours; for each tumour, gene expression measurements of ~5,000 genes

Data points clicker question: How many data points? What is the data dimension?

  • A. 98 data points, dimension 5000
  • B. 5000 data points, dimension 98
SLIDE 31

Clustering by numbers: Breast cancer example

Clustering task: cluster breast cancer tumour samples, based on similarities between gene expression levels

Data: 98 tumours; for each tumour, gene expression levels of ~5,000 genes

Data points: one point per tumour: (level_1, level_2, ..., level_5000), where

  • level_k is the expression level of gene k in that tumour
  • the data dimension is the number of genes
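
The shape of this data set can be sketched as a table with one row per tumour and one column per gene. The values below are random placeholders, not real expression levels, and the variable names are assumptions. Only the shape matters for the example.

```python
import random

NUM_TUMOURS = 98   # one data point per tumour
NUM_GENES = 5000   # one coordinate per gene (approximate count from the slide)

random.seed(0)
# Placeholder numbers only; real levels would come from lab measurements.
data = [[random.gauss(0.0, 1.0) for _ in range(NUM_GENES)]
        for _ in range(NUM_TUMOURS)]

num_points = len(data)     # how many points we cluster
dimension = len(data[0])   # how many numbers describe each point
print(num_points, dimension)
```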
SLIDE 32

Measuring clustering quality

Knowing which data to cluster on is super important!

  • Netflix does not use data pertaining to geography, gender, or age of subscribers when clustering movies, so there are no dimensions for that data.
  • Libraries don’t cluster books by colour, rather by content
  • In what follows we'll assume that the data dimensions we’re clustering on are those that matter for quality

SLIDE 33

Measuring clustering quality

  • Suppose that we have two potential clusterings of a bunch of points. Which is better?

  • Let’s look at an example
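
One possible way to make "better" concrete, offered as an assumption rather than as the measure this course uses, is to add up the squared distance from every point to the centre of its own cluster. A smaller total means the clusters are tighter. The points and function names below are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    count = len(points)
    return tuple(sum(coords) / count for coords in zip(*points))


def scatter(clustering):
    """Total squared distance from each point to its own cluster's centroid."""
    total = 0.0
    for cluster in clustering:
        centre = centroid(cluster)
        for point in cluster:
            total += sum((x - c) ** 2 for x, c in zip(point, centre))
    return total


# Two ways to split the same four made-up points into two clusters.
clustering_a = [[(1, 1), (2, 1)], [(8, 8), (9, 8)]]
clustering_b = [[(1, 1), (8, 8)], [(2, 1), (9, 8)]]
print(scatter(clustering_a), scatter(clustering_b))
```

Under this measure the first split scores far lower, because it keeps nearby points together. Other goals, such as evenly sized groups, would need a different measure.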
SLIDE 34

Measuring clustering quality: Group exercise

Consider the following potential clusterings for assigning children to three different schools. What are the relative merits of each clustering?

[Figure: two candidate clusterings, labelled A and B]

SLIDE 35

Measuring clustering quality: Group exercise

Consider the following potential clusterings for assigning children to three different schools. What are the relative merits of each clustering?

  • A: more evenly split
  • B: maybe we want more diversity

SLIDE 36

Measuring clustering quality: Which do you like better and why?

[Figure: two candidate clusterings, labelled A and B]

  • A looks cooler!
  • Clustered better along the x axis
  • Yin and Yang, so obviously A is better
  • B is better because the red dots are lower than 5