Introduction to Web Mining What is Web Mining? Discovering useful - - PowerPoint PPT Presentation


slide-1
SLIDE 1

CS 345A Data Mining Lecture 1

Introduction to Web Mining

slide-2
SLIDE 2

What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns

slide-3
SLIDE 3

Web Mining v. Data Mining

Structure (or lack of it)

Textual information and linkage structure

Scale

Data generated per day is comparable to largest conventional data warehouses

Speed

Often need to react to evolving usage patterns in real-time (e.g., merchandising)

slide-4
SLIDE 4

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-5
SLIDE 5

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-6
SLIDE 6

Size of the Web

Number of pages

  • Technically, infinite
  • Much duplication (30-40%)
  • Best estimate of “unique” static HTML pages comes from search engine claims

Google = 8 billion(?), Yahoo = 20 billion
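
To make the duplication point concrete, here is a minimal sketch of counting "unique" pages by hashing normalized page text. It only catches exact duplicates after whitespace and case normalization; near-duplicate detection needs other techniques. The function names and the toy pages are hypothetical and are not from the lecture.

```python
import hashlib
import re


def content_fingerprint(page_text):
    """Hash the page text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", page_text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_unique(pages):
    """Return (unique page count, fraction of pages that are duplicates)."""
    fingerprints = {content_fingerprint(body) for body in pages.values()}
    unique = len(fingerprints)
    duplicate_fraction = 1 - unique / len(pages) if pages else 0.0
    return unique, duplicate_fraction


# Hypothetical toy crawl: URL -> page text.
toy_pages = {
    "http://example.com/a": "Hello   World",
    "http://example.com/a?session=1": "hello world",
    "http://mirror.example.org/a": "Hello World\n",
    "http://example.com/b": "Another page",
    "http://example.com/c": "A third page",
}

unique_pages, duplicate_fraction = count_unique(toy_pages)
print(unique_pages, duplicate_fraction)
```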

slide-7
SLIDE 7

Netcraft survey

http://news.netcraft.com/archives/web_server_survey.html

slide-8
SLIDE 8

The web as a graph

Pages = nodes, hyperlinks = edges

  • Ignore content
  • Directed graph

High linkage

  • 10-20 links/page on average
  • Power-law degree distribution
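
As an illustration of the "pages = nodes, hyperlinks = edges" view, the sketch below builds a directed graph from a list of (source, target) link pairs and computes in-degrees and the average number of out-links per page. The toy link list and the name `build_graph` are assumptions made for this example.

```python
def build_graph(links):
    """Build adjacency sets and in-degree counts from (source, target) pairs."""
    out_links = {}
    in_degree = {}
    for src, dst in links:
        # Make sure both endpoints exist as nodes, even if they have no out-links.
        out_links.setdefault(src, set())
        out_links.setdefault(dst, set())
        in_degree.setdefault(src, 0)
        in_degree.setdefault(dst, 0)
        # Count each distinct hyperlink once.
        if dst not in out_links[src]:
            out_links[src].add(dst)
            in_degree[dst] += 1
    return out_links, in_degree


# Hypothetical toy crawl: page "a" links to "b" and "c", and so on.
toy_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

out_links, in_degree = build_graph(toy_links)
average_out_degree = sum(len(t) for t in out_links.values()) / len(out_links)
print(in_degree, average_out_degree)
```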

slide-9
SLIDE 9

Structure of Web graph

Let’s take a closer look at structure

Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls

Bow-tie structure

Not a “small world”
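
One way to picture the bow-tie is to split pages by reachability relative to a page in the central strongly connected core. The sketch below assumes a seed page known to lie in the core, then uses forward and backward breadth-first search to label core, IN, and OUT pages; everything else is lumped into "other". This is a simplified illustration on a toy graph, not the procedure from the paper, and all names are hypothetical.

```python
from collections import deque


def reachable(start, adjacency):
    """Return every node reachable from start by following adjacency."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(links, seed):
    """Split nodes into core / in / out / other relative to the seed's component."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in links:
        forward.setdefault(src, set()).add(dst)
        backward.setdefault(dst, set()).add(src)
        nodes.update((src, dst))
    downstream = reachable(seed, forward)   # pages the seed can reach
    upstream = reachable(seed, backward)    # pages that can reach the seed
    core = downstream & upstream
    return {
        "core": core,
        "in": upstream - core,
        "out": downstream - core,
        "other": nodes - upstream - downstream,
    }


# Hypothetical toy graph: c1, c2, c3 form a cycle (the core).
toy_links = [
    ("i1", "c1"), ("c1", "c2"), ("c2", "c3"), ("c3", "c1"),
    ("c3", "o1"), ("t1", "t2"),
]
parts = bow_tie(toy_links, seed="c1")
print(parts)
```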

slide-10
SLIDE 10

Bow-tie Structure

Source: Broder et al, 2000

slide-11
SLIDE 11

What can the graph tell us?

Distinguish “important” pages from unimportant ones

Page rank

Discover communities of related pages

Hubs and Authorities

Detect web spam

Trust rank
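
To give a feel for how link structure can separate "important" pages from unimportant ones, here is a minimal sketch of PageRank computed by power iteration. The damping factor of 0.85, the fixed iteration count, the uniform handling of pages without out-links, and the toy graph are all assumptions made for this illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank; every link target must also be a key."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = out_links[node]
            for target in targets:
                # Each page splits its rank evenly among its out-links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy graph: page -> pages it links to.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

ranks = pagerank(toy_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph, page "c" collects links from three pages and ends up with the highest score, while "d", which nobody links to, ends up with the lowest.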

slide-12
SLIDE 12

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-13
SLIDE 13

Power-law degree distribution

Source: Broder et al, 2000

slide-14
SLIDE 14

Power-laws galore

Structure

  • In-degrees
  • Out-degrees
  • Number of pages per site

Usage patterns

  • Number of visitors
  • Popularity, e.g., of products, movies, music
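
A power law means the number of items with value k falls off roughly as k raised to a negative exponent, so it shows up as a straight line on a log-log plot. The sketch below estimates that exponent with an ordinary least-squares fit in log-log space. This is only a rough estimator, the function name is hypothetical, and the synthetic counts (generated with an exponent of 2.1, chosen purely for illustration) are not data from the lecture.

```python
import math


def fit_power_law_exponent(degree_counts):
    """Estimate alpha in count(k) ~ C * k**(-alpha) by a log-log line fit.

    degree_counts maps a degree k (> 0) to the number of pages with that degree.
    """
    points = [
        (math.log(k), math.log(count))
        for k, count in degree_counts.items()
        if k > 0 and count > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope  # the slope of the fitted line is -alpha


# Synthetic, noise-free counts following an exact power law.
synthetic_counts = {k: 1e6 * k ** -2.1 for k in range(1, 101)}

alpha = fit_power_law_exponent(synthetic_counts)
print(round(alpha, 3))
```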

slide-15
SLIDE 15

The Long Tail

Source: Chris Anderson (2004)

slide-16
SLIDE 16

The Long Tail

Shelf space is a scarce commodity for traditional retailers

  • Also: TV networks, movie theaters,…

The web enables near-zero-cost dissemination of information about products

More choice necessitates better filters

  • Recommendation engines (e.g., Amazon)
  • How Into Thin Air made Touching the Void a bestseller

slide-17
SLIDE 17

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-18
SLIDE 18

Extracting Structured Data

http://www.simplyhired.com

slide-19
SLIDE 19

Extracting structured data

http://www.fatlens.com
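
As a rough idea of what structured data extraction involves, the sketch below turns a listing page into records using Python's standard `html.parser`. The markup (an `li` with class "job" containing `span` elements for title, company, and location) is entirely hypothetical and is not the markup of any site named on these slides; real pages need per-site or learned extraction rules.

```python
from html.parser import HTMLParser


class JobListingParser(HTMLParser):
    """Collect one dict per <li class="job"> in the hypothetical markup."""

    FIELDS = {"title", "company", "location"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being filled, or None outside a listing
        self._field = None    # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "job":
            self._current = {}
        elif tag == "span" and self._current is not None and css_class in self.FIELDS:
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Data Engineer</span>
      <span class="company">Example Corp</span>
      <span class="location">Palo Alto, CA</span></li>
  <li class="job"><span class="title">Web Crawler Developer</span>
      <span class="company">Sample Inc</span>
      <span class="location">Remote</span></li>
</ul>
"""

parser = JobListingParser()
parser.feed(SAMPLE_HTML)
records = parser.records
print(records)
```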

slide-20
SLIDE 20

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-21
SLIDE 21

Searching the Web

Diagram labels: Content aggregators, The Web, Content consumers

slide-22
SLIDE 22

Ads vs. search results

slide-23
SLIDE 23

Ads vs. search results

Search advertising is the revenue model

  • Multi-billion-dollar industry
  • Advertisers pay for clicks on their ads

Interesting problems

  • What ads to show for a search?
  • If I’m an advertiser, which search terms should I bid on, and how much should I bid?
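
One simple way to approach "what ads to show" when advertisers pay per click is to rank eligible ads by expected revenue per impression, that is, bid times estimated click-through rate. The sketch below assumes exact keyword matching against query terms and made-up bids and click-through rates; the `Ad` class and `select_ads` function are hypothetical names, and real systems are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    advertiser: str
    keyword: str
    bid_per_click: float  # dollars the advertiser pays per click
    est_ctr: float        # estimated probability that a shown ad is clicked


def select_ads(ads, query, slots=2):
    """Return up to `slots` ads for the query, ranked by bid * estimated CTR."""
    terms = set(query.lower().split())
    eligible = [ad for ad in ads if ad.keyword.lower() in terms]
    eligible.sort(key=lambda ad: ad.bid_per_click * ad.est_ctr, reverse=True)
    return eligible[:slots]


# Hypothetical inventory of ads.
inventory = [
    Ad("A", "insurance", bid_per_click=1.00, est_ctr=0.02),
    Ad("B", "insurance", bid_per_click=0.50, est_ctr=0.05),
    Ad("C", "insurance", bid_per_click=2.00, est_ctr=0.005),
    Ad("D", "flowers", bid_per_click=3.00, est_ctr=0.10),
]

shown = select_ads(inventory, "cheap insurance")
print([ad.advertiser for ad in shown])
```

Note that the highest bidder ("C") is not shown here: its low estimated click-through rate makes its expected revenue per impression the smallest of the three matching ads.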

slide-24
SLIDE 24

Sidebar: What’s in a name?

Geico sued Google, contending that it owned the trademark “Geico”

Thus, ads for the keyword geico couldn’t be sold to others

Court ruling: search engines can sell keywords including trademarks

No court ruling yet: whether the ad itself can use the trademarked word(s)

slide-25
SLIDE 25

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-26
SLIDE 26

Systems architecture

Diagram labels: Memory, Disk, CPU; Machine Learning, Statistics; “Classical” Data Mining

slide-27
SLIDE 27

Very Large-Scale Data Mining

Diagram labels: three nodes, each with its own Mem, Disk, CPU

Cluster of commodity nodes

slide-28
SLIDE 28

Systems Issues

Web data sets can be very large

Tens to hundreds of terabytes

Cannot mine on a single server!

Need large farms of servers

How to organize hardware/software to mine multi-terabyte data sets

Without breaking the bank!
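
A common pattern for mining data that does not fit on one server is to partition the records across nodes, let each node compute a partial result over its shard, and then combine the partial results. The sketch below simulates that in a single process: shards are plain lists, and the "nodes" run one after another in a loop. The function names, the CRC32-based partitioning, and the toy records are assumptions made for this illustration.

```python
import zlib
from collections import Counter


def partition(records, num_nodes):
    """Assign each (url, text) record to a shard by hashing its URL."""
    shards = [[] for _ in range(num_nodes)]
    for url, text in records:
        shard_id = zlib.crc32(url.encode("utf-8")) % num_nodes
        shards[shard_id].append((url, text))
    return shards


def count_terms(shard):
    """Work done independently on one node: term counts for its shard."""
    counts = Counter()
    for _, text in shard:
        counts.update(text.lower().split())
    return counts


def mine(records, num_nodes=3):
    """Combine per-shard partial counts into a global result."""
    partials = [count_terms(shard) for shard in partition(records, num_nodes)]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical toy records: (URL, page text).
toy_records = [
    ("http://example.com/1", "web mining at scale"),
    ("http://example.com/2", "mining the web graph"),
    ("http://example.com/3", "large scale data mining"),
    ("http://example.com/4", "the long tail"),
]

term_counts = mine(toy_records)
print(term_counts.most_common(3))
```

Because the partial counts are simply added together, the combined result does not depend on how the records were split across shards.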

slide-29
SLIDE 29

Web Mining topics

  • Web graph analysis
  • Power Laws and The Long Tail
  • Structured data extraction
  • Web advertising
  • Systems Issues

slide-30
SLIDE 30

Project

Lots of interesting project ideas

  • If you can’t think of one, please come discuss with us

Infrastructure

  • Google
  • Amazon EC2

Data

  • Netflix
  • Google
  • WebBase
  • TREC

slide-31
SLIDE 31

The World-Wide Web

Our modern-day Library of Alexandria

The Web