MLeXAI Workshop, Ingrid Russell, Zdravko Markov, Sanibel Island, FL



slide-1
SLIDE 1

MLeXAI Workshop

Ingrid Russell, Zdravko Markov

Sanibel Island, FL, May 18, 2009

slide-2
SLIDE 2

Web Document Classification

Originally developed in summer 2005 by Ingrid Russell, Zdravko Markov, and Todd Neller

Revised with material from the Probabilistic Reasoning Project

Developed in summer 2007 by Zdravko Markov and Ingrid Russell

slide-3
SLIDE 3

Introduction

  • Topic directories (dmoz.org)
  • Automatic classification of web pages
  • Expanding and creating new directory structures
  • Investigating the process of tagging (labeling) web pages using topic directory structures
  • Applying Machine Learning techniques for automatic tagging

slide-4
SLIDE 4

Objectives

  • Learn basic concepts and techniques of machine learning
  • Implement a learning system
  • Understand the role of learning for improving performance and allowing a system to adapt based on previous experiences
  • Understand the importance of data preparation and feature extraction in machine learning
  • Learn and apply the vector space model for representing web documents

slide-5
SLIDE 5

Project Phases

  • Collect web documents
  • Extract text and select features
  • Represent documents as feature vectors (term-document matrix)

  • Prepare data for Weka
  • Create and evaluate ML models
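
The "extract text" phase can be sketched with Python's standard `html.parser` module. This is only an illustration, not the tooling used in the project: the class name `TextExtractor` and the helper `html_to_text` are hypothetical, and real pages may need extra handling (navigation menus, encodings, malformed markup).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Each collected page would be passed through `html_to_text` once, and the resulting plain text handed to the feature-extraction step.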
slide-6
SLIDE 6

Resources

  • AI course web page (Prolog programs)
  • Weka (software)
  • DMW book (sample data)
  • Related projects (Probabilistic Reasoning)
  • Other (web crawling, text stat)
slide-7
SLIDE 7

Reading

  • Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, 2003.
  • Tom Mitchell. Machine Learning, 1997.
  • Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2005.
  • Zdravko Markov and Daniel T. Larose. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007.

slide-8
SLIDE 8

Data Collection

  • Collect web pages from 5 different topics with at least 20 documents in each
  • Choose a more elaborate topic structure (not necessarily a tree)
  • Each document should have enough text content
  • Each document must include enough terms to represent the topic

slide-9
SLIDE 9

Data Collection Tools

  • Topic directory (dmoz.org)
  • Web browsing
  • Web search
  • Web crawler (WebSPHINX)
slide-10
SLIDE 10

Feature Extraction

  • Remove stopwords, apply stemming
  • Compute term frequencies in the corpus
  • Select 100 most representative terms (consider TF and IDF factors)
  • Create term-document matrix (binary, TF, TFIDF).
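
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the project's Prolog or Weka tooling. Several parts are assumptions: the stopword list is a tiny sample, `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and ranking terms by corpus TF multiplied by IDF is only one plausible way to "consider TF and IDF factors".

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, apply stemming."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def select_terms(docs, k=100):
    """Rank terms by corpus TF * IDF and return the k highest-scoring terms."""
    token_lists = [tokenize(d) for d in docs]
    # Corpus term frequency: total occurrences of each term.
    tf = Counter(t for tokens in token_lists for t in tokens)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    score = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]
```

With this scoring, a term that occurs in every document gets an IDF of zero and falls to the bottom of the ranking, which is the intended effect: such a term does not help separate topics.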

slide-11
SLIDE 11

Feature Extraction Tools

  • Prolog programs (described in project)
  • Specialized text editors
  • Weka (described in project)
  • Custom-made programs
slide-12
SLIDE 12

Term-Document Matrix

  • Create a feature vector for each document
    – Binary (0/1, nominal)
    – Term frequency (counts)
    – TFIDF representation (numeric)

  • Use Prolog programs or Weka
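
As a rough illustration of the three representations, the sketch below builds one feature vector per document from already-tokenized text. The function name `term_document_matrix` is hypothetical, and the TFIDF weighting used here, tf * log(N / df), is one common variant; the project's Prolog programs and Weka's filters may weight terms differently.

```python
import math

def term_document_matrix(token_lists, terms, mode="tf"):
    """Return one feature vector per document over the selected terms.

    mode: "binary" (0/1), "tf" (raw counts), or "tfidf" (tf * log(N / df)).
    """
    if mode not in ("binary", "tf", "tfidf"):
        raise ValueError(f"unknown mode: {mode}")
    n = len(token_lists)
    counts = [[tokens.count(t) for t in terms] for tokens in token_lists]
    if mode == "tf":
        return counts
    if mode == "binary":
        return [[1 if c > 0 else 0 for c in row] for row in counts]
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
    idf = [math.log(n / d) if d else 0.0 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]
```

For example, with documents `[["agent", "learn"], ["agent", "mpeg", "mpeg"]]` and terms `["agent", "mpeg"]`, the TF rows are `[1, 0]` and `[1, 2]`, and the binary rows are `[1, 0]` and `[1, 1]`.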
slide-13
SLIDE 13

Data Preparation

  • Create data files for Weka
    – CSV format
    – ARFF format
  • Use different representations
    – Binary
    – TF
    – TFIDF
  • Use Weka for conversions between formats and representations
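
An ARFF file is plain text: a `@relation` line, one `@attribute` line per feature, and a `@data` section with one comma-separated row per document. The sketch below writes such a file from a term-document matrix. The function name `to_arff` and the attribute name `class` are illustrative choices, and the code assumes that term names and topic labels contain no spaces, commas, or quotes and that no term is itself called `class` (otherwise quoting or renaming would be needed).

```python
def to_arff(relation, terms, rows, labels, binary=False):
    """Render a term-document matrix as ARFF text.

    rows: one feature vector per document; labels: the topic of each document.
    """
    # Binary features are declared nominal {0,1}; TF and TFIDF are numeric.
    term_type = "{0,1}" if binary else "numeric"
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {t} {term_type}" for t in terms]
    classes = ",".join(sorted(set(labels)))
    lines += [f"@attribute class {{{classes}}}", "", "@data"]
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + f",{label}")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file with an `.arff` extension should give something Weka can open; a CSV version is the same `@data` rows preceded by a single header line of attribute names.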

slide-14
SLIDE 14

Machine Learning and Model Evaluation

  • Attribute ranking and selection
  • Decision trees
  • Naïve Bayes
  • KNN
  • Clustering
  • Classification of new documents
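
To make "classification of new documents" concrete, here is a minimal k-nearest-neighbor sketch over document vectors. It is not Weka's implementation: cosine similarity is an assumed choice of measure (Weka's KNN may use a different distance), and the `weighted` flag is a simple similarity-weighted vote added for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(train_vectors, train_labels, query, k=3, weighted=False):
    """Predict the label of `query` from its k most similar training documents."""
    sims = sorted(
        ((cosine(vec, query), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, label in sims:
        # Plain KNN gives each neighbor one vote; the weighted variant
        # lets closer neighbors count for more.
        votes[label] += sim if weighted else 1
    return votes.most_common(1)[0][0]
```

A new document is first converted to a vector over the same selected terms as the training documents, then passed in as `query`.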
slide-15
SLIDE 15

Sample Project 1 (UH)

  • 5 topics, 116 documents, 1000 terms
    – Machine Learning
    – Agents
    – Sorting
    – MPEG
    – History of computing
  • Feature extraction (binary representation) by using TextSTAT, Excel and VB
  • ML models and error analysis: Decision tree, Naïve Bayes, KNN
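
For readers who want to see how Naïve Bayes works on a binary representation like the one used in this project, the following sketch may help. It is a Bernoulli variant with Laplace smoothing, chosen here for simplicity; it is an assumption, not a description of Weka's Naïve Bayes or of what the students actually ran. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(vectors, labels):
    """Estimate class priors and per-class term probabilities from 0/1 vectors."""
    n_terms = len(vectors[0])
    class_counts = Counter(labels)
    term_counts = defaultdict(lambda: [0] * n_terms)
    for vec, label in zip(vectors, labels):
        for j, v in enumerate(vec):
            term_counts[label][j] += 1 if v else 0
    model = {}
    for label, n in class_counts.items():
        prior = n / len(labels)
        # Laplace smoothing keeps probabilities away from 0 and 1.
        probs = [(term_counts[label][j] + 1) / (n + 2) for j in range(n_terms)]
        model[label] = (prior, probs)
    return model

def classify_nb(model, vec):
    """Return the class with the highest posterior log-probability."""
    def log_posterior(label):
        prior, probs = model[label]
        return math.log(prior) + sum(
            math.log(p if v else 1 - p) for p, v in zip(probs, vec)
        )
    return max(model, key=log_posterior)
```

Working in log space avoids numeric underflow when the number of terms is large, as with the 1000 terms of this project.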

slide-16
SLIDE 16

Sample Project 2 (CCSU)

  • Two separate topic structures:
    – Musical instruments (5 topics)
    – Four general topics: Non-profit, Government, Personal, Commercial
  • Data preparation using Prolog and Weka
  • ML models created by Weka
    – Increasing number of features (10, 20, 30, 40)
    – Naïve Bayes, KNN, WKNN, Decision tree (best)
    – Predicting class of new documents

slide-17
SLIDE 17

Sample Project 3 (CCSU)

  • 5 topics, 100 documents, 100 terms
    – Computer Science
    – Artificial Intelligence
    – Machine Learning
    – Data Mining
  • Data preparation using Prolog and Weka
  • ML models created by Weka
    – Increasing number of features (25, 50, 75, 100)
    – Naïve Bayes, KNN, Decision tree
    – Predicting class of 15 new documents
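
One way to compare models built with an increasing number of features is sketched below. The slides do not say how the Weka models were evaluated, so the protocol here is an assumption: leave-one-out accuracy, with a 1-nearest-neighbor classifier as a placeholder for whichever model is being tested. The columns of each vector are assumed to be ordered from most to least representative term, so taking the first n columns corresponds to using the top n features.

```python
def nearest_label(train_vectors, train_labels, query):
    """1-nearest-neighbor by squared Euclidean distance."""
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(vec, query))
    best = min(range(len(train_vectors)), key=lambda i: dist(train_vectors[i]))
    return train_labels[best]

def leave_one_out_accuracy(vectors, labels, n_features):
    """Accuracy when each document is classified from all the others,
    using only the first n_features columns (assumed ranked best-first)."""
    cut = [vec[:n_features] for vec in vectors]
    correct = 0
    for i, (vec, label) in enumerate(zip(cut, labels)):
        rest_vecs = cut[:i] + cut[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if nearest_label(rest_vecs, rest_labels, vec) == label:
            correct += 1
    return correct / len(labels)

def accuracy_by_feature_count(vectors, labels, counts=(25, 50, 75, 100)):
    """Map each feature count to its leave-one-out accuracy."""
    return {n: leave_one_out_accuracy(vectors, labels, n) for n in counts}
```

Plotting or tabulating the returned dictionary shows whether adding features keeps helping or whether accuracy levels off.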