What's Next? 1. What's next? 2. K-means What's next? Last Class - - PowerPoint PPT Presentation



slide-1
SLIDE 1

What's Next?

  • 1. What's next?
  • 2. K-means
slide-2
SLIDE 2

What's next?

  • Last Class Friday
  • No office hours Thurs
slide-3
SLIDE 3

What's next? Programming

  • Problems get bigger
  • Hundreds or thousands of lines of code in a program
  • Programs themselves become complex objects worthy of study!
  • Software engineering
  • Programming techniques that support large and complex software
  • Object-oriented programming (CS18)
  • Event-driven programming (most web stuff)
slide-4
SLIDE 4

What's next? Data structures

  • Lists, trees are very simple
  • Amenable to recursion approaches
  • Build on these: heaps, priority queues, …
  • Generalize:
  • Directed acyclic graphs
  • Prerequisite structure in course requirements make a good example
  • Directed graphs
  • Streets in a city (some of them one-way) for example
  • Edges often "labelled" with data like "how long to traverse this one block stretch?"
  • Problems like "find shortest path" (i.e., quickest route from here to there)
  • … [CS1570]
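As a rough illustration of the "find shortest path" problem on a directed graph with labelled edges, here is a minimal sketch using Dijkstra's algorithm with a priority queue. The slides do not say which algorithm the course uses, so this choice is an assumption, as are the names `shortest_path` and `streets` and the dictionary-of-lists graph representation.

```python
import heapq

def shortest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none exists.

    graph maps each node to a list of (neighbor, cost) pairs. A one-way
    street simply appears in one direction only.
    """
    best = {start: 0}          # cheapest known cost to reach each node
    previous = {}              # node -> the node we arrived from
    frontier = [(0, start)]    # priority queue ordered by cost so far
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            # Walk the "previous" links backwards to rebuild the route.
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue           # stale queue entry; a cheaper route was found
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(frontier, (new_cost, neighbor))
    return float("inf"), []

# Hypothetical street map: edge labels are minutes to traverse each block.
streets = {
    "A": [("B", 2), ("C", 5)],
    "B": [("C", 1), ("D", 4)],
    "C": [("D", 1)],
    "D": [],
}
```

With this map, `shortest_path(streets, "A", "D")` goes through B and C rather than taking either direct-looking edge, and asking for a route from D back to A finds nothing because every street here is one-way.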
slide-5
SLIDE 5

What's next? Analysis

  • Analysis of probabilistic programs like RandSelect
  • Analysis of performance of more complicated data structures
  • Analysis of algorithms like shortest-path
  • Study of "effective" solutions to (some instances of) provably hard problems

slide-6
SLIDE 6

What's next? Algorithms

  • How does Google work?
  • How does Facebook choose which ads to show you?
  • How do we recognize unusual behaviors?
  • Securities fraud
  • Crime
  • How do you make a drone deliver a package?
  • How does Disney/Pixar make Frozen II?
slide-7
SLIDE 7

A shift in style

  • In CS17, we've been very concrete: let's sort this list of numbers, let's find an integer in a tree with int-values at nodes, etc.
  • ADTs moved away from this a little: we have a Dictionary, but we don't know the details, only the runtime performance
  • In general CS work, the gap between the real world and the code is much greater

slide-8
SLIDE 8

A conceptual gap

  • The internet consists of a bunch of computers tied together by network connections from computers to routers (specialized computers that can pass data from one machine to another)
  • The routers are interconnected as well
  • The connections come and go; some are permanent, some are very temporary
  • How do we get data from my computer to yours?
  • We'll work out an algorithm in which we somehow represent what a router is or can do, but in discussing the algorithm, we'll just draw pictures, etc.
  • Leave implementation for later
slide-9
SLIDE 9

What makes the next steps difficult?

  • Complexity
  • Abstraction helps
  • Messiness
  • Real-world data doesn't arrive nicely formatted as lists
  • Real-world output often needs to be in specialized forms
  • Variety
  • Program output might need to go to a file, to your screen, to a remote computer, to your computer's speaker
  • Does every program need to consider the possibilities of every device?
  • Unreliability
  • Networks that fail
  • Programs interrupted by OS
  • Humans that type weird inputs
  • Data sources that are corrupted
slide-10
SLIDE 10
slide-11
SLIDE 11

Example problem and algorithm

  • We have a bunch of data:
  • We'd like to "classify" it into clusters (red dots could be cluster centers)
  • Notice how difficult it is to even specify the problem precisely!

slide-12
SLIDE 12

Idea

  • First, decide how many clusters (by hand?)
  • really annoying assumption, relieved by fancier methods
  • For our example, pick k = 2.
  • Grab any two points in the dataset as "centers"
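One way to "grab any k points" is to sample them at random from the dataset. This is only a sketch: the slides do not say how the points are chosen, and the function name `pick_initial_centers` and its `seed` parameter are made up for illustration.

```python
import random

def pick_initial_centers(points, k, seed=None):
    """Pick k different entries of the dataset to serve as starting centers."""
    rng = random.Random(seed)      # a seed makes the choice repeatable
    return rng.sample(points, k)   # sampling without replacement

# Hypothetical 2-D dataset, stored as a list of tuples.
data = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2)]
centers = pick_initial_centers(data, k=2, seed=17)
```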
slide-13
SLIDE 13
slide-14
SLIDE 14

Divide the data into those closer to each point
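A minimal sketch of this dividing step, assuming points are tuples of numbers and "closer" means ordinary straight-line (Euclidean) distance. Both are assumptions, since the slides leave the representation and the distance open, and `split_by_nearest` is a hypothetical name.

```python
import math

def split_by_nearest(points, centers):
    """Return one list of points per center: those nearest to that center."""
    groups = [[] for _ in centers]
    for p in points:
        # Index of the center at the smallest distance from p
        # (ties go to the earlier center).
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups

groups = split_by_nearest([(0, 0), (1, 0), (9, 9), (10, 10)],
                          centers=[(0, 0), (10, 10)])
```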

slide-15
SLIDE 15
slide-16
SLIDE 16

For each group, find the "mean"
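For points stored as tuples of numbers, one reading of the "mean" is the coordinate-by-coordinate average. The sketch below assumes that reading; `mean_point` is a hypothetical name.

```python
def mean_point(group):
    """Average a non-empty list of equal-length tuples, coordinate by coordinate."""
    if not group:
        raise ValueError("cannot take the mean of an empty group")
    n = len(group)
    # zip(*group) regroups the data so each item holds one coordinate
    # of every point: all the x values, then all the y values, and so on.
    return tuple(sum(coords) / n for coords in zip(*group))
```

For example, `mean_point([(0, 0), (2, 4)])` averages the x values and the y values separately, and the same function works unchanged for 3-D or 10-D points.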

slide-17
SLIDE 17
slide-18
SLIDE 18

Using these new means, reclassify!

slide-19
SLIDE 19
slide-20
SLIDE 20

Repeat until stabilized

  • What does "stabilized" mean?
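The whole loop can be sketched as follows. This takes "stabilized" to mean "recomputing the means no longer changes the centers", which is one possible answer to the question, not necessarily the one intended in lecture. The names `k_means` and `max_rounds`, the Euclidean distance, and the rule that an empty group keeps its old center are all assumptions.

```python
import math

def k_means(points, centers, max_rounds=100):
    """Alternate between grouping points and re-averaging until nothing changes."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(max_rounds):
        # Step 1: put each point in the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: math.dist(p, centers[j]))
            groups[i].append(p)
        # Step 2: move each center to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
        # Step 3: stop once the centers stop moving.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

data = [(0, 0), (0, 2), (10, 10), (10, 12)]
final_centers, final_groups = k_means(data, centers=[(0, 0), (0, 2)])
```

Even though both starting centers sit in the lower-left pair of points, the second center is pulled toward the upper-right pair after one round, and the groups settle after a couple more.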
slide-21
SLIDE 21
slide-22
SLIDE 22

What didn't I mention?

  • How to find distances
  • Are data points stored in a list? An array? A tree?
  • What are the piles we created?
  • Are data points lists of ints? of floats? Are they tuples?
  • Are they all 2-dimensional? Could this work in 3D? in 10D?
slide-23
SLIDE 23

Skills

  • Whatever math is needed
  • Whatever else is needed
  • For graphics: physics, …
  • An ability to guess some representation of the problem that might work
  • The ability to translate a pictorial record of a discussion into an actual algorithm ("pseudocode") and then a real program ("code")
  • Analysis (during and after the fact)
slide-24
SLIDE 24

Application: classifying web pages

  • Start with a word list of all words that you want to consider (e.g., the words in a Merriam-Webster Dictionary)
  • Take a web page, and for each word, mark how often it appears:

a          4
aardvark
absolute   2
abrupt
…          …

slide-25
SLIDE 25
  • List of associated numbers ("bag of words") tells you something about the web page
  • Two pages whose word-counts look alike are "nearby"
  • Challenges:
  • What if my word-counts are exactly 5 times your word-counts?
  • Our pages are probably very similar!
  • Idea from geometry: treat the count-list as a direction in N-dimensional space (where N is the number of words in your dictionary) and divide it by its length to get a length-1 "vector".
  • Use "angle between directions" as a measure of "distance"
  • Now apply k-means.
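The count-list, the length-1 vector, and the angle can be sketched like this. The helper names (`count_vector`, `unit`, `angle_between`) and the whitespace-based word splitting are assumptions for illustration; a real page would need proper tokenizing.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def unit(v):
    """Divide a vector by its length to get a length-1 vector."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("a page with no vocabulary words has no direction")
    return [x / length for x in v]

def angle_between(u, v):
    """Angle in radians between two count vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(unit(u), unit(v)))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot)))
```

Under this measure, a page whose counts are exactly 5 times another page's counts points in the same direction, so the angle between them is (up to rounding) zero, while pages sharing no words are at a right angle.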
slide-26
SLIDE 26

Problems

  • This doesn't work
  • "Common" words ("the", "of", "and", "a", …) completely dominate everything else.
  • Remove those? [Commonly called a "stop list"]
  • Then it works…sort of OK.
  • Exotic words ("epimetheus"), which don't get counted at all, may be the most important "signature"

  • Maybe we need to "weight" the word-counts by word-use-frequency!
  • That handles stop-words as well: they're so frequent that they get totally discounted
  • That doesn't work either. Sufficiently rare words mess up everything
  • Mis-spell "Brady" as "Berady" and suddenly your football webpage is all about Arabic surnames
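One common way to weight word-counts by how widely a word is used is to multiply each count by the log of (number of pages / number of pages containing the word). The slides do not name a specific formula, so treat this as a stand-in; `weighted_counts` and the toy pages are hypothetical.

```python
import math
from collections import Counter

def weighted_counts(pages):
    """pages is a list of word lists; return one {word: weight} dict per page."""
    n = len(pages)
    # For each word, the number of pages it appears in at least once.
    pages_containing = Counter(w for page in pages for w in set(page))
    result = []
    for page in pages:
        counts = Counter(page)
        result.append({
            # A word found on every page gets log(1) = 0: totally discounted.
            w: c * math.log(n / pages_containing[w])
            for w, c in counts.items()
        })
    return result

toy_pages = [
    ["the", "brady", "the"],
    ["the", "berady"],
    ["the", "pass"],
]
weights = weighted_counts(toy_pages)
```

In this toy data "the" is weighted down to zero, as hoped, but the one-off misspelling "berady" receives exactly the same weight as the genuinely informative "brady", which is the rare-word trouble described above.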
slide-27
SLIDE 27

Results

  • The core algorithm — cluster by "nearer than" — is simple
  • Applying it in a new domain makes us consider hard questions
  • What are "distance" and "sameness"?
  • Why is word-use so skewed? Why are rare words so rare?
  • How is the vocabulary of a sentence related to its meaning?
  • When we used a bag-of-words, did we already throw away the essence?
  • Most of these questions are outside the domain of "pure computer science."
  • …which just shows that "pure computer science" may be a misguided notion
  • The "rare words" problem, and related ones, actually showed that k-means isn't the right path to follow
  • Led to multidimensional scaling, topic clustering, etc.
slide-28
SLIDE 28

Summary

  • Pure CS is interesting…
  • … but it gets better (and harder) when it's influenced by the real world

  • CS also influences the world
  • We have a responsibility to consider that influence as we work