
MLeXAI Workshop: Ingrid Russell, Zdravko Markov, Sanibel Island, FL



  1. MLeXAI Workshop • Ingrid Russell, Zdravko Markov • Sanibel Island, FL, May 18, 2009

  2. Web Document Classification • Originally developed in summer 2005 by Ingrid Russell, Zdravko Markov, and Todd Neller • Revised with material from the Probabilistic Reasoning Project, developed in summer 2007 by Zdravko Markov and Ingrid Russell

  3. Introduction • Topic directories (dmoz.org) • Automatic classification of web pages • Expanding and creating new directory structures • Investigating the process of tagging (labeling) web pages using topic directory structures • Applying Machine Learning techniques for automatic tagging

  4. Objectives • Learn basic concepts and techniques of machine learning • Implement a learning system • Understand the role of learning for improving performance and allowing a system to adapt based on previous experiences • Understand the importance of data preparation and feature extraction in machine learning • Learn and apply the vector space model for representing web documents

  5. Project Phases • Collect web documents • Extract text and select features • Represent documents as feature vectors (term-document matrix) • Prepare data for Weka • Create and evaluate ML models

  6. Resources • AI course web page (Prolog programs) • Weka (software) • DMW book (sample data) • Related projects (Probabilistic Reasoning) • Other (web crawling, text stat)

  7. Reading • Stuart Russell, Peter Norvig. Artificial Intelligence: A Modern Approach, 2003. • Tom Mitchell. Machine Learning, 1997. • Ian H. Witten and Eibe Frank. Data Mining: Practical ML Tools and Techniques, 2005. • Zdravko Markov and Daniel T. Larose. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, 2007.

  8. Data Collection • Collect web pages from 5 different topics with at least 20 documents in each • Choose a more elaborate topic structure (not necessarily a tree) • Each document should have enough text content • Each document must include enough terms to represent the topic

  9. Data Collection Tools • Topic directory (dmoz.org) • Web browsing • Web search • Web crawler (WebSPHINX)

  10. Feature Extraction • Remove stopwords, apply stemming • Compute term frequencies in the corpus • Select 100 most representative terms (consider TF and IDF factors) • Create term-document matrix (binary, TF, TFIDF)
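
The term-selection steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's Prolog or Weka tooling: the stopword list, the `tokenize` and `select_terms` helpers, and the scoring rule (corpus term frequency multiplied by log(N/df)) are all assumptions made for the example, and stemming is left out because it would need a separate stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase, keep alphabetic runs only, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def select_terms(docs, k):
    """Rank terms by corpus TF times IDF and return the k best."""
    tokenized = [tokenize(d) for d in docs]
    tf = Counter(t for doc in tokenized for t in doc)       # corpus term frequency
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    n = len(docs)
    score = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(score, key=lambda t: (-score[t], t))[:k]

docs = [
    "Machine learning is the study of learning agents",
    "Sorting algorithms and the analysis of sorting",
    "Agents in machine learning",
]
top_terms = select_terms(docs, 3)
print(top_terms)  # ['sorting', 'learning', 'algorithms']
```

With this scoring, a term that occurs in every document gets a weight of zero, which is one way of reading "consider TF and IDF factors"; the project itself may weigh the two differently.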

  11. Feature Extraction Tools • Prolog programs (described in project) • Specialized text editors • Weka (described in project) • Custom-made programs

  12. Term-Document Matrix • Create a feature vector for each document – Binary (0/1, nominal) – Term frequency (counts) – TFIDF representation (numeric) • Use Prolog programs or Weka
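
As a rough illustration of the three representations, the sketch below builds one feature vector per document from already tokenized text. The function name `term_document_matrix`, its `mode` parameter, and the IDF formula log(N/df) are assumptions chosen for the example; the slides use Prolog programs or Weka for this step.

```python
import math

def term_document_matrix(docs, terms, mode="tf"):
    """Return one row per document and one column per term.

    docs is a list of token lists; mode is "binary", "tf", or "tfidf".
    """
    n = len(docs)
    # Document frequency of each selected term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    rows = []
    for d in docs:
        row = []
        for t in terms:
            count = d.count(t)
            if mode == "binary":
                row.append(1 if count else 0)
            elif mode == "tf":
                row.append(count)
            elif mode == "tfidf":
                idf = math.log(n / df[t]) if df[t] else 0.0
                row.append(count * idf)
            else:
                raise ValueError(f"unknown mode: {mode}")
        rows.append(row)
    return rows

docs = [["machine", "learning", "learning"], ["sorting", "sorting"], ["machine", "agents"]]
terms = ["learning", "machine", "sorting"]
binary = term_document_matrix(docs, terms, "binary")  # [[1, 1, 0], [0, 0, 1], [0, 1, 0]]
tf = term_document_matrix(docs, terms, "tf")          # [[2, 1, 0], [0, 0, 2], [0, 1, 0]]
tfidf = term_document_matrix(docs, terms, "tfidf")
```

The binary and TF rows are integers, while the TFIDF rows are floats, which matches the nominal, count, and numeric distinction on the slide.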

  13. Data Preparation • Create data files for Weka – CSV format – ARFF format • Use different representations – Binary – TF – TFIDF • Use Weka for conversions between formats and representations
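
For readers who have not seen an ARFF file, the sketch below renders a small term-document matrix in that format. The helper `to_arff` and its parameters are hypothetical, and it assumes that term names and class labels are simple identifiers with no spaces or special characters (otherwise ARFF requires quoting) and that no term is itself named `class`.

```python
def to_arff(relation, terms, rows, labels, binary=False):
    """Render a term-document matrix and its class labels as ARFF text."""
    classes = sorted(set(labels))
    # Binary features are declared nominal {0,1}; TF and TFIDF are numeric.
    attr_type = "{0,1}" if binary else "numeric"
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {t} {attr_type}" for t in terms]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for row, label in zip(rows, labels):
        lines.append(",".join(str(v) for v in row) + "," + label)
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "webdocs",
    ["learning", "sorting"],
    [[1, 0], [0, 1]],
    ["ML", "Sorting"],
    binary=True,
)
print(arff_text)
```

The same rows written without the `@relation` and `@attribute` header, and with a first line of column names instead, would give the CSV variant.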

  14. Machine Learning and Model Evaluation • Attribute ranking and selection • Decision trees • Naïve Bayes • KNN • Clustering • Classification of new documents
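
To make "classification of new documents" concrete, here is a small k-nearest-neighbor sketch over term-frequency vectors. It is a plain-Python stand-in rather than Weka's implementation; the use of cosine similarity, the majority vote, and the names `cosine` and `knn_classify` are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(train_rows, train_labels, query, k=3):
    """Label the query by majority vote among its k most similar training rows."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training data: columns could be counts of "learning", "machine", "sorting".
train_rows = [[2, 1, 0], [1, 2, 0], [0, 0, 3], [0, 1, 2]]
train_labels = ["ML", "ML", "Sorting", "Sorting"]
predicted = knn_classify(train_rows, train_labels, [1, 1, 0], k=3)
print(predicted)  # ML
```

Cosine similarity is a common choice with the vector space model because it ignores document length, but Euclidean distance on the same vectors would also fit the slide's description.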

  15. Sample Project 1 (UH) • 5 topics, 116 documents, 1000 terms – Machine Learning – Agents – Sorting – MPEG – History of computing • Feature extraction (binary representation) by using TextSTAT, Excel and VB • ML models and error analysis: Decision tree, Naïve Bayes, KNN

  16. Sample Project 2 (CCSU) • Two separate topic structures: – Musical instruments (5 topics) – Four general topics: Non-profit, Government, Personal, Commercial • Data preparation using Prolog and Weka • ML models created by Weka – Increasing number of features (10, 20, 30, 40) – Naïve Bayes, KNN, WKNN, Decision tree (best) – Predicting class of new documents

  17. Sample Project 3 (CCSU) • 5 topics, 100 documents, 100 terms – Computer Science – Artificial Intelligence – Machine Learning – Data Mining • Data preparation using Prolog and Weka • ML models created by Weka – Increasing number of features (25, 50, 75, 100) – Naïve Bayes, KNN, Decision tree – Predicting class of 15 new documents
