Web Usage Mining from Bing Liu. Web Data Mining: Exploring - PowerPoint PPT Presentation

Web Usage Mining
from Bing Liu, "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data", Springer
Chapter written by Bamshad Mobasher
Many slides are from a tutorial given by B. Berendt, B. Mobasher, M. Spiliopoulou
Data e Web Mining - S. Orlando

Introduction
- Web usage mining: the automatic discovery of patterns in clickstreams and associated data, collected or generated as a result of user interactions with one or more Web sites.
- Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.
- The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

Introduction
- Data in Web usage mining:
  - Web server logs
  - Site contents
  - Data about the visitors, gathered from external channels
  - Further application data
- Not all these data are always available.
- When they are, they must be integrated.
- A large part of Web usage mining is about processing usage/clickstream data.
  - After that, various data mining algorithms can be applied.

Web server logs

Terminology and levels of abstraction

Web usage mining (simplified view)

Web usage mining process

Data preparation

Data cleaning, fusion
- Data cleaning
  - remove irrelevant references and fields in server logs
  - remove references due to spider/robot navigation
  - remove erroneous references
  - add missing references due to caching (done after sessionization)
- Data fusion/integration
  - synchronize data from multiple server logs
  - integrate e-commerce and application server data
  - integrate meta-data (e.g., content labels)
  - integrate demographic / registration data
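The cleaning steps listed above can be sketched as a simple filter over parsed log entries. This is only an illustration: the `LogEntry` fields, the suffix list, and the robot markers are assumptions made for the example, not part of the slides, and real robot detection usually needs more than a user-agent check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    # Hypothetical, already-parsed view of one server log line.
    ip: str
    url: str
    status: int
    agent: str


# Assumed filter lists; a real site would tune these.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def is_relevant(entry):
    """Return True if the entry should survive data cleaning."""
    path = entry.url.split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_SUFFIXES):
        return False  # embedded object, not an explicit user request
    if path == "/robots.txt":
        return False  # only robots ask for this file
    if any(marker in entry.agent.lower() for marker in ROBOT_MARKERS):
        return False  # self-declared spider/robot
    if entry.status >= 400:
        return False  # erroneous reference
    return True


def clean(entries):
    return [e for e in entries if is_relevant(e)]


raw_log = [
    LogEntry("1.2.3.4", "/index.html", 200, "Mozilla/5.0"),
    LogEntry("1.2.3.4", "/logo.gif", 200, "Mozilla/5.0"),
    LogEntry("5.6.7.8", "/index.html", 200, "ExampleBot/1.0"),
    LogEntry("1.2.3.4", "/missing.html", 404, "Mozilla/5.0"),
    LogEntry("1.2.3.4", "/products.html?id=7", 200, "Mozilla/5.0"),
]
cleaned = clean(raw_log)
print([e.url for e in cleaned])
```

Only the two page requests made by the browser survive; the image, the robot request, and the failed request are dropped.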

Data transformation
- Data transformation
  - user identification
  - sessionization
  - pageview identification
    - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
  - episode identification
- Data reduction
  - sampling and dimensionality reduction (ignoring certain pageviews / items)
- Identifying user transactions
  - i.e., sets or sequences of pageviews, possibly with associated weights

Identify sessions (Sessionization)
- The quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.
- In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
- Reliable usage data are difficult to obtain due to
  - proxy servers and anonymizers,
  - dynamic IP addresses,
  - missing references due to caching, and
  - the inability of servers to distinguish among different visits.

Sessionization strategies
- Session reconstruction = correct mapping of activities to different individuals + correct separation of activities belonging to different visits of the same individual

User identification

Session uncertainty: evaluate real vs. reconstructed sessions

User identification: an example
- Combination of the IP address and Agent fields in Web logs
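A minimal sketch of the idea named on this slide: requests that share the same IP address and Agent string are attributed to the same user. The tuple layout, the sample entries, and the `userN` labels are assumptions made for the example; the original slide's log excerpt is not reproduced here.

```python
def identify_users(entries):
    """Group (time, ip, agent, url) entries by the (ip, agent) pair."""
    user_ids = {}   # (ip, agent) -> user label
    per_user = {}   # user label -> list of (time, url)
    for time, ip, agent, url in entries:
        key = (ip, agent)
        if key not in user_ids:
            user_ids[key] = f"user{len(user_ids) + 1}"
        per_user.setdefault(user_ids[key], []).append((time, url))
    return per_user


# Hypothetical log entries: same IP, two different agents, plus a second IP.
entries = [
    (1, "1.2.3.4", "Mozilla/5.0 (Windows)", "A"),
    (2, "1.2.3.4", "Mozilla/5.0 (Windows)", "B"),
    (3, "1.2.3.4", "Mozilla/5.0 (Macintosh)", "A"),
    (4, "5.6.7.8", "Mozilla/5.0 (Windows)", "C"),
    (5, "1.2.3.4", "Mozilla/5.0 (Macintosh)", "D"),
]
users = identify_users(entries)
for label, activity in users.items():
    print(label, activity)
```

The heuristic separates two visitors behind the same IP address when their agents differ, but it cannot separate visitors who share both fields.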

Sessionization heuristics
- Also called structure-oriented heuristics: they use either the static structure of the site or the implicit linkage structure inferred from the referrer fields
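A structure-oriented heuristic based on referrer fields can be sketched as follows. This is a simplified illustration, not the exact rule from the slides: a request joins the most recent session that already contains its referrer, and otherwise opens a new session. The `(url, referrer)` tuples are made-up data for a single user.

```python
def sessionize_by_referrer(requests):
    """requests: time-ordered (url, referrer) pairs for one user."""
    sessions = []
    for url, referrer in requests:
        target = None
        # Look at the most recent sessions first.
        for session in reversed(sessions):
            if referrer in session:
                target = session
                break
        if target is None:
            sessions.append([url])   # referrer unknown: start a new session
        else:
            target.append(url)       # referrer already seen: continue it
    return sessions


requests = [
    ("A", None),
    ("B", "A"),
    ("C", None),
    ("D", "B"),
    ("E", "C"),
]
print(sessionize_by_referrer(requests))
# [['A', 'B', 'D'], ['C', 'E']]
```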

Sessionization example: time-oriented heuristic
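One common time-oriented variant closes a session when the gap between two consecutive requests exceeds a threshold. The sketch below assumes a 30-minute threshold and timestamps in seconds; both are assumptions for the example, since the slide's own figure and threshold are not reproduced in the text.

```python
def sessionize_by_gap(requests, max_gap_seconds=1800):
    """requests: (timestamp_seconds, url) pairs for one user."""
    sessions = []
    last_time = None
    for time, url in sorted(requests):
        # A long pause (or the very first request) opens a new session.
        if last_time is None or time - last_time > max_gap_seconds:
            sessions.append([])
        sessions[-1].append(url)
        last_time = time
    return sessions


requests = [(0, "A"), (300, "B"), (900, "C"), (4000, "D"), (4100, "E")]
print(sessionize_by_gap(requests))
# [['A', 'B', 'C'], ['D', 'E']]
```

The pause between C (900 s) and D (4000 s) is longer than 1800 s, so D starts a second session.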

Pageview identification
- Pageview identification
  - Depends on the intra-page structure of sites
  - Identify the collection of Web files/objects/resources representing a specific "user event" corresponding to a click-through (e.g. viewing a product page, adding a product to a shopping cart)
  - In some cases it may be useful to consider pageviews at a higher level of aggregation
    - e.g. they may correspond to many user events related to the same concept category, like the purchase of a product on an online e-commerce site

Path completion
- Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.
- For instance,
  - if a user goes back to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
  - This results in the second reference to A not being recorded in the server logs.

Path completion
- Path completion:
  - How to infer missing user references due to caching.
- Effective path completion requires extensive knowledge of the link structure within the site.
- Referrer information in server logs can also be used in disambiguating the inferred paths.
- The problem gets much more complicated in frame-based sites.

Missing references due to caching
- Reconstruction by using knowledge about the site structure
  - also inferred from the referrer fields
- Many paths are possible
  - usually the selected path is the one requiring the fewest "back" references
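The reconstruction described above can be sketched with a navigation stack: when a request's referrer is not the page currently on top of the stack, the user is assumed to have pressed "back" until reaching the referrer, and those cached views are inserted into the path. This is a minimal illustration using referrer fields only; the function name and sample session are assumptions, and it walks back to the most recent occurrence of the referrer, which corresponds to the fewest "back" references.

```python
def complete_path(requests):
    """requests: time-ordered (page, referrer) pairs for one session."""
    path = []    # completed path, including inferred back references
    stack = []   # current chain of forward navigation
    for page, referrer in requests:
        if referrer is not None and referrer in stack:
            # Walk back until the referrer is on top of the stack.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])   # inferred view served from cache
        stack.append(page)
        path.append(page)
    return path


# The server saw A, B, C, D, E; E was reached from B.
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "C"), ("E", "B")]
print(complete_path(session))
# ['A', 'B', 'C', 'D', 'C', 'B', 'E']
```

The inferred references to C and B are the two "back" steps that never reached the server.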

Data modeling for Web Usage Mining
- Data preprocessing produces
  - a set of pageviews: $P = \{p_1, \ldots, p_n\}$
  - a set of user transactions: $T = \{t_1, \ldots, t_m\}$, where each transaction $t_i$ contains a subset of $P$
  - Each transaction
    $$t = \langle (p_1^t, w(p_1^t)), (p_2^t, w(p_2^t)), \ldots, (p_l^t, w(p_l^t)) \rangle$$
    is an $l$-length ordered sequence of pageviews, where each $w$ corresponds to a weight, e.g. the significance of the pageview
  - In collaborative filtering these weights correspond to explicit user ratings
  - In transactions collected from Web usage, a weight can be the duration of the page visit in the session

Data modeling for Web Usage Mining (cont.)
- In many mining tasks, the sequential ordering of the pageviews in a transaction is not important (e.g. clustering, association rule extraction)
- In this case a transaction can be represented as an $n$-length vector
  $$t = (w_1^t, w_2^t, \ldots, w_n^t)$$
  where the weight is 0 if the corresponding page is not present in $t$, and otherwise corresponds to the significance of the page in $t$

m × n user-pageview matrix (or transaction matrix):

         page A  page B  page C  page D  page E
user 0       15       4       1       0       0
user 1        2       0      25       0       0
user 2      200       1       0       0       3
user 3       56       0       0       4       4
user 4        0       0      23      50       0
user 5        0       0       5       3       0
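The conversion from an ordered, weighted transaction to an n-length vector can be sketched as below. The weights for user 0 and user 4 are taken from the matrix on this slide; the order of the pageviews inside each transaction and the choice to sum the weights of repeated pages are assumptions made for the example.

```python
PAGES = ["A", "B", "C", "D", "E"]


def to_vector(transaction, pages=PAGES):
    """Turn a sequence of (page, weight) pairs into an n-length weight vector."""
    weights = dict.fromkeys(pages, 0)
    for page, weight in transaction:
        weights[page] += weight   # assumption: repeated visits add up
    return [weights[p] for p in pages]


user0 = [("A", 15), ("B", 4), ("C", 1)]
user4 = [("C", 23), ("D", 50)]
user4_reordered = [("D", 50), ("C", 23)]

print(to_vector(user0))            # [15, 4, 1, 0, 0]
print(to_vector(user4))            # [0, 0, 23, 50, 0]
print(to_vector(user4_reordered))  # [0, 0, 23, 50, 0]
```

The last two calls give the same vector, which shows that this representation discards the order of the pageviews.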

Data modeling for Web Usage Mining (cont.)
- Given a user-pageview matrix (the m × n matrix shown on the previous slide), a number of unsupervised mining techniques can be exploited
- Clustering of transactions/sessions, to determine important visitor segments
- Clustering of pageviews (items), expressed in terms of user judgments, to discover important relationships between pageviews (items)
- Sequential (timestamps must be maintained) and non-sequential association rules, to discover important relationships between pageviews (items)
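Clustering of transactions needs a similarity measure between rows of the user-pageview matrix. The sketch below uses cosine similarity, which is one common choice and is not prescribed by the slides; the matrix values are the ones shown for users 0 to 5, and the helper names are made up for the example.

```python
from math import sqrt

# Rows are users 0..5, columns are pages A..E (values from the slide).
MATRIX = [
    [15, 4, 1, 0, 0],
    [2, 0, 25, 0, 0],
    [200, 1, 0, 0, 3],
    [56, 0, 0, 4, 4],
    [0, 0, 23, 50, 0],
    [0, 0, 5, 3, 0],
]


def cosine(u, v):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(matrix, i):
    """Index of the row most similar to row i."""
    others = [j for j in range(len(matrix)) if j != i]
    return max(others, key=lambda j: cosine(matrix[i], matrix[j]))


print(most_similar(MATRIX, 2))  # 3: users 2 and 3 both concentrate on page A
print(most_similar(MATRIX, 5))  # 1: users 5 and 1 both concentrate on page C
```

A clustering algorithm would group rows with high mutual similarity into the same visitor segment; applying the same measure to columns instead of rows compares pageviews.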

Data modeling for Web Usage Mining (cont.)
- Automatic integration of content information
  - textual features from the Web contents represent the underlying semantics of the pages
  - aiming to transform a user-pageview matrix into a content-enhanced transaction matrix

n × r pageview-term matrix:

         food  news  car  house  party  sky
page A      0     1    1      0      0    0
page B      1     0    0      1      0    0
page C      1     1    0      0      0    0
page D      0     0    1      0      0    1
page E      0     0    0      1      1    0

Data modeling for Web Usage Mining (cont.)

P = n × r pageview-term matrix:

         food  news  car  house  party  sky
page A      0     1    1      0      0    0
page B      1     0    0      1      0    0
page C      1     1    0      0      0    0
page D      0     0    1      0      0    1
page E      0     0    0      1      1    0

U = m × n user-pageview matrix (or transaction matrix):

         page A  page B  page C  page D  page E
user 0        1       1       0       0       0
user 1        0       0       1       0       0
user 2        1       0       0       0       1
user 3        1       0       0       1       1
user 4        0       0       1       1       0
user 5        0       0       1       0       0

U × P = m × r content-enhanced transaction matrix:

         food  news  car  house  party  sky
user 0      1     1    1      1      0    0
user 1      1     1    0      0      0    0
user 2      0     1    1      1      1    0
user 3      0     1    2      1      1    1
user 4      1     1    1      0      0    1
user 5      1     1    0      0      0    0
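The product U × P on this slide can be checked with a few lines of plain Python. The matrices are copied from the slide; the `matmul` helper is written for the example and stands in for any matrix library.

```python
# Pageview-term matrix P: rows are pages A..E,
# columns are food, news, car, house, party, sky.
P = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

# User-pageview matrix U: rows are users 0..5, columns are pages A..E.
U = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
]


def matmul(a, b):
    """Multiply matrix a (m x n) by matrix b (n x r), both as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


content_enhanced = matmul(U, P)
for user, row in enumerate(content_enhanced):
    print(f"user {user}", row)
```

Each row of the result counts, per term, how many of the user's visited pages contain that term. For example, user 3 visited pages A, D, and E, and "car" occurs in both A and D, giving the entry 2.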
