
What's Next? 1. What's next? 2. K-means

What's next? • Last Class Friday • No office hours Thurs

What's next? Programming • Problems get bigger • Hundreds or thousands of lines of code in a program • Programs themselves become complex objects worthy of study! • Software engineering • Programming techniques that support large and complex software • Object-oriented programming (CS18) • Event-driven programming (most web stuff) • …

What's next? Data structures • Lists, trees are very simple • Amenable to recursion approaches • Build on these: heaps, priority queues, … • Generalize: • Directed acyclic graphs • Prerequisite structure in course requirements makes a good example • Directed graphs • Streets in a city (some of them one-way) for example • Edges often "labelled" with data like "how long to traverse this one block stretch?" • Problems like "find shortest path" (i.e., quickest route from here to there) • … [CS1570]
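To make the directed-graph bullets concrete, here is a minimal sketch of "find shortest path" on a labelled street graph. It uses Dijkstra's algorithm, one standard approach that the slides do not name; the dictionary-of-dictionaries representation, the function name `shortest_path`, and the street names and travel times are all assumptions made for illustration.

```python
import heapq

def shortest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none."""
    # Priority queue of (cost so far, node, path taken to reach node).
    frontier = [(0, start, [start])]
    settled = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, minutes in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(frontier, (cost + minutes, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical street map: streets[a][b] is the minutes needed to go from a to b.
# "Hope" -> "Thayer" has no reverse edge, so it is a one-way street.
streets = {
    "Brook": {"Hope": 2, "Thayer": 7},
    "Hope": {"Thayer": 3},
    "Thayer": {"Brook": 7},
}
print(shortest_path(streets, "Brook", "Thayer"))  # (5, ['Brook', 'Hope', 'Thayer'])
```

The priority queue here is exactly the kind of structure the "heaps, priority queues" bullet refers to: it always hands back the cheapest unexplored route first.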

What's next? Analysis • Analysis of probabilistic programs like RandSelect • Analysis of performance of more complicated data structures • Analysis of algorithms like shortest-path • Study of "effective" solutions to (some instances of) provably hard problems

What's next? Algorithms • How does Google work? • How does Facebook choose which ads to show you? • How do we recognize unusual behaviors? • Securities fraud • Crime • How do you make a drone deliver a package? • How does Disney/Pixar make Frozen II?

A shift in style • In CS17, we've been very concrete: let's sort this list of numbers, let's find an integer in a tree with int-values at nodes, etc. • ADTs moved away from this a little: we have a Dictionary, but we don't know the details, only the runtime-performance • In general CS work, the gap between the real world and the code is much greater

A conceptual gap • The internet consists of a bunch of computers tied together by network connections from computers to routers (specialized computers that can pass data from one machine to another) • The routers are interconnected as well • The connections come and go; some are permanent, some are very temporary • How do we get data from my computer to yours? • We'll work out an algorithm in which we somehow represent what a router is or can do, but in discussing the algorithm, we'll just draw pictures, etc. • Leave implementation for later

What makes the next steps difficult? • Complexity • Abstraction helps • Messiness • Real-world data doesn't arrive nicely formatted as lists • Real-world output often needs to be in specialized forms • Variety • Program output might need to go to a file, to your screen, to a remote computer, to your computer's speaker • Does every program need to consider the possibilities of every device? • Unreliability • Networks that fail • Programs interrupted by OS • Humans that type weird inputs • Data sources that are corrupted • …

Example problem and algorithm • We have a bunch of data: • We'd like to "classify" it into clusters (red dots could be cluster centers) • Notice how difficult it is to even specify the problem precisely!

Idea • First, decide how many clusters (by hand?) • really annoying assumption, relieved by fancier methods • For our example, pick k = 2. • Grab any two points in the dataset as "centers"

Divide the data into those closer to each point

For each group, find the "mean"

Using these new means, reclassify!

Repeat until stabilized • What does "stabilized" mean?
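The steps so far (pick centers, divide the data, find the means, reclassify, repeat) can be sketched as a short program. This is only one possible reading, and it has to make choices the slides leave open: points are assumed to be equal-length tuples of floats stored in a list, distance is assumed to be Euclidean, the initial centers are simply the first k points, and "stabilized" is taken to mean that no point changed group in a round. The names `nearest`, `mean`, and `k_means` are hypothetical.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(points):
    """Coordinate-wise average of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_rounds=100):
    centers = list(points[:k])  # "grab any k points": here, simply the first k
    labels = None
    for _ in range(max_rounds):
        # Divide the data into groups by nearest center.
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assumed meaning of "stabilized"
            break
        labels = new_labels
        # Replace each center with the mean of its group.
        for i in range(k):
            group = [p for p, label in zip(points, labels) if label == i]
            if group:  # keep the old center if its group is empty
                centers[i] = mean(group)
    return centers, labels

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(k_means(data, 2))  # ([(0.0, 0.5), (10.0, 10.5)], [0, 0, 1, 1])
```

Because `math.dist` and `mean` work on tuples of any length, the same sketch runs unchanged on 3D or 10D points.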

What didn't I mention? • How to find distances • Are data points stored in a list? An array? A tree? • What are the piles we created? • Are data points lists of ints? of floats? Are they tuples? • Are they all 2-dimensional? Could this work in 3D? in 10D?

Skills • Whatever math is needed • Whatever else is needed • For graphics: physics, … • An ability to guess some representation of the problem that might work • The ability to translate a pictorial record of a discussion into an actual algorithm ("pseudocode") and then a real program ("code") • Analysis (during and after the fact)

Application: classifying web pages • Start with a word list of all words that you want to consider (e.g., the words in a Merriam-Webster Dictionary) • Take a web page, and for each word, mark how often it appears: a: 4, aardvark: 0, absolute: 2, abrupt: 0, …
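Counting words against a fixed word list can be sketched in a few lines. The tokenizing rule (lowercase runs of letters and apostrophes), the function name `bag_of_words`, and the sample page text are assumptions; the four-word vocabulary is the fragment shown above.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word appears in text, in vocabulary order."""
    # Assumed tokenizer: lowercase the text, then take runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)  # a Counter reports 0 for words it never saw
    return [counts[word] for word in vocabulary]

vocabulary = ["a", "aardvark", "absolute", "abrupt"]
page = "A storm made a mess: an absolute, absolute mess. A tree fell on a car."
print(bag_of_words(page, vocabulary))  # [4, 0, 2, 0]
```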

• List of associated numbers ("bag of words") tells you something about the web page • Two pages whose word-counts look alike are "nearby" • Challenges: • What if my word-counts are exactly 5 times your word-counts? • Our pages are probably very similar! • Idea from geometry: treat the count-list as a direction in N-dimensional space (where N is the number of words in your dictionary) and divide it by its length to get a length-1 "vector". • Use "angle between directions" as a measure of "distance" • Now apply k-means.
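The geometric idea can be sketched as follows: scale each count-list to length 1, then measure the angle between the two results. The function names `normalize` and `angle_between` are hypothetical, and the sketch assumes every count-list has at least one nonzero entry (an all-zero list has no direction).

```python
import math

def normalize(counts):
    """Scale a count-list to length 1 (assumes at least one nonzero count)."""
    length = math.sqrt(sum(c * c for c in counts))
    return [c / length for c in counts]

def angle_between(counts_a, counts_b):
    """Angle in radians between two count-lists viewed as directions."""
    a, b = normalize(counts_a), normalize(counts_b)
    cosine = sum(x * y for x, y in zip(a, b))
    # Rounding error can push the cosine slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))

mine = [20, 0, 10, 0]   # exactly 5 times the counts below
yours = [4, 0, 2, 0]
print(angle_between(mine, yours))        # very close to 0: same direction
print(angle_between([1, 0], [0, 1]))     # a right angle: no words in common
```

Dividing by the length is what makes the "5 times" pages coincide: both count-lists normalize to the same length-1 vector, so the angle between them is (up to rounding) zero.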

Problems • This doesn't work • "Common" words ("the", "of", "and", "a", …) completely dominate everything else. • Remove those? [Commonly called a "stop list"] • Then it works…sort of OK. • Exotic words ("epimetheus"), which don't get counted at all, may be the most important "signature" • Maybe we need to "weight" the word-counts by word-use-frequency! • That handles stop-words as well: they're so frequent that they get totally discounted • That doesn't work either. ☹ Sufficiently rare words mess up everything • Mis-spell "Brady" as "Berady" and suddenly your football webpage is all about Arabic surnames
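One way to "weight the word-counts by word-use-frequency" is sketched below. The slides do not say which weighting was used; this sketch assumes inverse document frequency, a common choice in which a word's weight is the logarithm of (number of pages / number of pages containing the word). The function names, the four-word vocabulary, and the three pages of counts are all made up for illustration.

```python
import math

def inverse_frequency_weights(pages):
    """pages: list of count-lists over the same vocabulary.
    Returns one weight per word: log(number of pages / pages containing it)."""
    n_pages = len(pages)
    weights = []
    for column in zip(*pages):  # one column of counts per vocabulary word
        pages_with_word = sum(1 for count in column if count > 0)
        # A word that appears on no page gets weight 0 (it has no counts anyway).
        weights.append(math.log(n_pages / pages_with_word) if pages_with_word else 0.0)
    return weights

def reweight(counts, weights):
    """Multiply each count by its word's weight."""
    return [c * w for c, w in zip(counts, weights)]

# Hypothetical vocabulary and counts for three pages.
vocabulary = ["the", "football", "brady", "berady"]
pages = [
    [50, 3, 2, 0],
    [40, 0, 0, 0],
    [60, 5, 0, 1],  # this page mis-spells "Brady" as "Berady"
]
weights = inverse_frequency_weights(pages)
print(dict(zip(vocabulary, weights)))
print(reweight(pages[2], weights))
```

Both effects from the slide show up in this toy data: "the" appears on every page, so its weight is 0 and it is totally discounted, while the one-off misspelling "berady" receives the largest weight available and can dominate its page.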

Results • The core algorithm (cluster by "nearer than") is simple • Applying it in a new domain makes us consider hard questions • What are "distance" and "sameness"? • Why is word-use so skewed? Why are rare words so rare? • How is the vocabulary of a sentence related to its meaning? • When we used a bag-of-words, did we already throw away the essence? • Most of these questions are outside the domain of "pure computer science." • …which just shows that "pure computer science" may be a misguided notion • The "rare words" problem, and related ones, actually showed that k-means isn't the right path to follow • Led to multidimensional scaling, topic clustering, etc.

Summary • Pure CS is interesting… • … but it gets better (and harder) when it's influenced by the real world • CS also influences the world • We have a responsibility to consider that influence as we work
