MLeXAI Workshop
Ingrid Russell, Zdravko Markov
Sanibel Island, FL, May 18, 2009

Web Document Classification
Originally developed in summer 2005 by Ingrid Russell, Zdravko Markov, and Todd Neller
Revised with material from the Probabilistic Reasoning Project, developed in summer 2007 by Zdravko Markov and Ingrid Russell

Introduction
• Topic directories (dmoz.org)
• Automatic classification of web pages
• Expanding and creating new directory structures
• Investigating the process of tagging (labeling) web pages using topic directory structures
• Applying Machine Learning techniques for automatic tagging

Objectives
• Learn basic concepts and techniques of machine learning
• Implement a learning system
• Understand the role of learning for improving performance and allowing a system to adapt based on previous experiences
• Understand the importance of data preparation and feature extraction in machine learning
• Learn and apply the vector space model for representing web documents

Project Phases
• Collect web documents
• Extract text and select features
• Represent documents as feature vectors (term-document matrix)
• Prepare data for Weka
• Create and evaluate ML models

Resources
• AI course web page (Prolog programs)
• Weka (software)
• DMW book (sample data)
• Related projects (Probabilistic Reasoning)
• Other (web crawling, text stat)

Reading
• Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, 2003.
• Tom Mitchell. Machine Learning, 1997.
• Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2005.
• Zdravko Markov and Daniel T. Larose. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007.

Data Collection
• Collect web pages from 5 different topics with at least 20 documents in each
• Choose a more elaborate topic structure (not necessarily a tree)
• Each document should have enough text content
• Each document must include enough terms to represent the topic

Data Collection Tools
• Topic directory (dmoz.org)
• Web browsing
• Web search
• Web crawler (WebSPHINX)

Feature Extraction
• Remove stopwords, apply stemming
• Compute term frequencies in the corpus
• Select the 100 most representative terms (consider TF and IDF factors)
• Create the term-document matrix (binary, TF, TFIDF)
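
The term-selection step can be illustrated with a minimal Python sketch. It is not the project's Prolog or Weka tooling: the stopword list, the function names `tokenize` and `select_terms`, and the scoring rule (corpus TF multiplied by log(N/DF)) are assumptions made for illustration, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def select_terms(docs, k):
    """Return the k terms with the highest corpus-wide TF * IDF score."""
    tokenized = [tokenize(d) for d in docs]
    tf = Counter(t for doc in tokenized for t in doc)       # corpus term frequency
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    score = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Sort by descending score, then alphabetically so ties are stable.
    return sorted(score, key=lambda t: (-score[t], t))[:k]

docs = [
    "the agent learns a policy",
    "sorting of an array is in place",
    "the agent sorts the array",
]
top_terms = select_terms(docs, 3)
print(top_terms)
```

With this scoring rule a term that appears in every document gets a score of zero, which is one simple way to take the IDF factor into account; other weightings are equally plausible.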

Feature Extraction Tools
• Prolog programs (described in project)
• Specialized text editors
• Weka (described in project)
• Custom-made programs

Term-Document Matrix
• Create a feature vector for each document
  – Binary (0/1, nominal)
  – Term frequency (counts)
  – TFIDF representation (numeric)
• Use Prolog programs or Weka
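
As a rough illustration of the three representations, the sketch below builds binary, TF, and TFIDF rows for a toy corpus in plain Python. The function name `term_document_matrix`, the toy documents, and the IDF definition log(N/DF) are assumptions; Weka and the project's Prolog programs may weight terms differently.

```python
import math
from collections import Counter

def term_document_matrix(tokenized_docs, terms):
    """Return (binary, tf, tfidf) matrices with one row per document."""
    n = len(tokenized_docs)
    counts = [Counter(doc) for doc in tokenized_docs]
    # Document frequency of each selected term.
    df = {t: sum(1 for c in counts if c[t] > 0) for t in terms}
    # IDF as log(N / DF); a term absent from every document gets weight 0.
    idf = {t: math.log(n / df[t]) if df[t] else 0.0 for t in terms}
    tf = [[c[t] for t in terms] for c in counts]
    binary = [[1 if v > 0 else 0 for v in row] for row in tf]
    tfidf = [[v * idf[t] for v, t in zip(row, terms)] for row in tf]
    return binary, tf, tfidf

# Toy corpus: each document is already tokenized (hypothetical data).
docs = [["agent", "learns", "agent"], ["sort", "array"], ["agent", "array"]]
terms = ["agent", "array", "sort"]
binary, tf, tfidf = term_document_matrix(docs, terms)
for name, matrix in (("binary", binary), ("tf", tf), ("tfidf", tfidf)):
    print(name, matrix)
```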

Data Preparation
• Create data files for Weka
  – CSV format
  – ARFF format
• Use different representations
  – Binary
  – TF
  – TFIDF
• Use Weka for conversions between formats and representations
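
For orientation, an ARFF file is plain text with a `@relation` line, one `@attribute` line per column, and a `@data` section of comma-separated rows. The sketch below renders a small numeric term-document matrix in that layout. The helper name `to_arff`, the relation name, and the toy data are hypothetical, and the sketch assumes terms and class labels are plain alphabetic words that need no quoting.

```python
def to_arff(relation, terms, rows, labels):
    """Render a numeric term-document matrix as ARFF text with a nominal class."""
    classes = sorted(set(labels))
    lines = [f"@relation {relation}", ""]
    # One numeric attribute per term (assumes terms need no quoting).
    lines += [f"@attribute {t} numeric" for t in terms]
    # The class attribute is nominal: its allowed values are listed in braces.
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"

terms = ["agent", "array", "sort"]
rows = [[2, 0, 0], [0, 1, 1]]
labels = ["agents", "sorting"]
arff_text = to_arff("webdocs", terms, rows, labels)
print(arff_text)
```

The returned string can be written to a file with an `.arff` extension and opened in Weka; the same rows with a header line of attribute names would form the CSV variant.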

Machine Learning and Model Evaluation
• Attribute ranking and selection
• Decision trees
• Naïve Bayes
• KNN
• Clustering
• Classification of new documents
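
To make one of these models concrete, here is a minimal sketch of k-nearest-neighbor classification over document vectors, evaluated with leave-one-out testing. It is not Weka's implementation: cosine similarity as the distance measure, the function names, and the toy data are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(data, k=3):
    """Classify each document using all the others and report the fraction correct."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, vec, k) == label:
            correct += 1
    return correct / len(data)

# Toy TF vectors over three terms, with topic labels (hypothetical data).
data = [
    ([2, 0, 1], "agents"), ([3, 0, 0], "agents"), ([1, 0, 1], "agents"),
    ([0, 2, 0], "sorting"), ([0, 3, 1], "sorting"), ([0, 1, 0], "sorting"),
]
print(knn_predict(data, [1, 0, 0], k=3))
print(leave_one_out_accuracy(data, k=3))
```

Classifying a new document then amounts to building its feature vector over the same terms and calling `knn_predict` on it.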

Sample Project 1 (UH)
• 5 topics, 116 documents, 1000 terms
  – Machine Learning
  – Agents
  – Sorting
  – MPEG
  – History of computing
• Feature extraction (binary representation) using TextSTAT, Excel, and VB
• ML models and error analysis: Decision tree, Naïve Bayes, KNN

Sample Project 2 (CCSU)
• Two separate topic structures:
  – Musical instruments (5 topics)
  – Four general topics: Non-profit, Government, Personal, Commercial
• Data preparation using Prolog and Weka
• ML models created by Weka
  – Increasing number of features (10, 20, 30, 40)
  – Naïve Bayes, KNN, WKNN, Decision tree (best)
  – Predicting class of new documents

Sample Project 3 (CCSU)
• 5 topics, 100 documents, 100 terms
  – Computer Science
  – Artificial Intelligence
  – Machine Learning
  – Data Mining
• Data preparation using Prolog and Weka
• ML models created by Weka
  – Increasing number of features (25, 50, 75, 100)
  – Naïve Bayes, KNN, Decision tree
  – Predicting class of 15 new documents
