COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC - PowerPoint PPT Presentation

COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS Using a Web Crawler in Python with Scrapy Bachelor Thesis - Final Presentation Louis Mbuyu Examiner: Prof. Dr. François Bry Supervisors: Prof. Dr. François Bry,


  1. COMPARISON OF CATEGORICAL PROPERTIES OFFERED BY MULTIPLE MOOC PLATFORMS
     Using a Web Crawler in Python with Scrapy
     Bachelor Thesis - Final Presentation
     Louis Mbuyu
     Examiner: Prof. Dr. François Bry
     Supervisors: Prof. Dr. François Bry, Yingding Wang
     12.04.18

  2. AGENDA
     1. Introduction / Goal
     2. Defining MOOC model
     3. Web Scraper / Results
     4. Gold standard selection
     5. Text categorisation approach
     6. Gold standard evaluation
     7. Evaluation of all platforms
     8. Conclusion & Future work

  3. 1. Introduction / Goal

  4. Motivation
     • IROM - Intelligent Recommender Of MOOCs
     • MOOC - Massive Open Online Course
     The goal of IROM
     • To improve learning and studying at the university.
     • To develop an intelligent MOOC search engine.
     Goal of thesis
     • Define a unified set of categories across all MOOC platforms.

  5. Motivation

  6. Modified Goal

  7. Tasks
     1. Define a MOOC model
     2. Build a Web Scraper and extract data
     3. Select a platform as the “Gold standard”
     4. Text categorisation approach (TF-IDF & cosine similarity)
     5. Evaluate the TF-IDF and cosine similarity approach
     6. Categorise courses from other platforms
     7. Evaluate the results

  8. 2. Defining MOOC model

  11. Motivation - MOOC platforms

  12. Unified MOOC Model (Table)

      Property            Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
      Url                 ✓         ✓        ✓    ✓            ✓           ✓
      Title               ✓         ✓        ✓    ✓            ✓           ✓
      Summary             ✓         ✓        ✓    ✓            ✓           ✓
      Description         ✓         ✓        ✓    ✓            ✓           ✓
      Subcategory         ✓         ✓        ✓    ✘            ✘           ✘
      Category            ✓         ✓        ✓    ✓            ✓           ✓
      WhyTakeThisCourse   ✓         ✓        ✓    ✓            ✓           ✓
      Provider            ✓         ✓        ✓    ✓            ✓           ✓
      Level               ✓         ✓        ✓    ✓            ✘           ✘
      ImageUrl            ✓         ✓        ✓    ✓            ✓           ✓
      Price               ✓         ✓        ✓    ✓            ✓           ✓
      Duration            ✓         ✓        ✓    ✓            ✓           ✓
      RatingValue         ✓         ✓        ✓    ✓            ✓           ✘
      RatingAmount        ✓         ✓        ✓    ✓            ✓           ✘
      StartDate           ✓         ✓        ✓    ✓            ✘           ✘
      EndDate             ✓         ✓        ✘    ✘            ✘           ✘
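
The unified model in the table could be represented as a single record type. The following is only a sketch: the class name `Course`, the `platform` field, and all Python types are assumptions for illustration (only the property names come from the slide), and properties marked ✘ for at least one platform are made optional.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Course:
    """One scraped course in the unified model (types are assumed)."""
    platform: str               # assumption: used to filter by platform
    url: str
    title: str
    summary: str
    description: str
    category: str
    why_take_this_course: str
    provider: str
    image_url: str
    price: str                  # assumption: kept as scraped text
    duration: int
    # Properties that the table marks as missing on some platforms.
    subcategory: Optional[str] = None
    level: Optional[str] = None
    rating_value: Optional[float] = None
    rating_amount: Optional[int] = None
    start_date: Optional[str] = None
    end_date: Optional[str] = None


# Hypothetical example record; the values are placeholders, not scraped data.
example = Course(
    platform="Udemy", url="https://example.com/course", title="Title",
    summary="Summary", description="Description", category="Business",
    why_take_this_course="Reason", provider="Provider",
    image_url="https://example.com/image.png", price="free", duration=4,
)
record = asdict(example)  # plain dict, e.g. for storing as a JSON document
```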

  13. 3. Web Scraper / Results

  14. Scraped Data (Table)

      Platform       Total number of courses
      Coursera       3,032
      Udacity        232
      Edx            1,098
      FutureLearn    193
      Open2Study     49
      Udemy          40,003
      All            44,607

  15. 4. Gold standard selection

  16. Gold standard criteria
      1. Number of categories
      2. Number of courses
      3. Diversity
      4. Represent University Subjects

  17. Gold standard elimination process

      Criterion          Coursera  Udacity  Edx  FutureLearn  Open2Study  Udemy
      No. of categories  ✓         ✓        ✓    ✓            ✘           ✘
      No. of courses     ✓         ✓        ✓    ✘            ✘           ✘
      Diversity          ✓         ✓        ✓    ✓            ✓           ✘
      University rep.    ✓         ✓        ✓    ✓            ✘           ✘

  18. Gold standard structure

  19. Gold standard structure

  20. Gold standard structure

  21. 5. Text categorisation approach

  22. Text categorisation approach (Step 1)
      Query the database (MongoDB): ‘Platform’ = ‘Coursera’ AND GROUP BY ‘Subcategory’
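
As a rough illustration of Step 1, the filter-and-group query might look like the sketch below. The lowercase field names `platform` and `subcategory`, the `PIPELINE` constant, and the helper `group_by_subcategory` are assumptions; the slides do not show the actual query. The pipeline is written in MongoDB's aggregation syntax but is not executed here; the function reproduces the same grouping in memory so the example runs without a database.

```python
from collections import defaultdict

# Hypothetical aggregation pipeline (field names assumed, not from the slides).
PIPELINE = [
    {"$match": {"platform": "Coursera"}},
    {"$group": {"_id": "$subcategory", "courses": {"$push": "$$ROOT"}}},
]


def group_by_subcategory(courses, platform="Coursera"):
    """In-memory equivalent: keep one platform, group its courses by subcategory."""
    groups = defaultdict(list)
    for course in courses:
        if course.get("platform") == platform and course.get("subcategory"):
            groups[course["subcategory"]].append(course)
    return dict(groups)


# Placeholder records for illustration only.
sample = [
    {"platform": "Coursera", "subcategory": "Finance", "title": "Course 1"},
    {"platform": "Coursera", "subcategory": "Finance", "title": "Course 2"},
    {"platform": "Coursera", "subcategory": "Marketing", "title": "Course 3"},
    {"platform": "Udacity", "subcategory": None, "title": "Course 4"},
]
groups = group_by_subcategory(sample)
```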

  23. Text categorisation approach (Step 2)

      Subcategories    Array of courses (JSON object)
      Finance          course 1, course 2, …, course n
      Marketing        course 1, course 2, …, course n
      Algorithms       course 1, course 2, …, course n
      …                …
      Subcategory m    course 1, course 2, …, course n

  24. Text categorisation approach (Step 3)
      Iterate through the courses, then extract and combine the ‘Title’, ‘Summary’ and ‘Description’.

      Subcategories    Array of courses (String)
      Finance          course 1, course 2, …, course n
      Marketing        course 1, course 2, …, course n
      …                …
      Subcategory m    course 1, course 2, …, course n

  25. Text categorisation approach (Step 4)
      Join all arrays/lists of strings into one string.

      Subcategories    Combined array of courses (String)
      Finance          “Intro into Finance. This course …”
      Marketing        “Marketing 101. Learn fundamentals …”
      …                …
      Subcategory m    course 1 course 2 … course n
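
Steps 3 and 4 could be sketched as follows. The function names `course_text` and `subcategory_documents` and the lowercase field names are hypothetical; the slides only state that title, summary and description are combined per course and then joined per subcategory.

```python
FIELDS = ("title", "summary", "description")  # assumed key names


def course_text(course):
    """Step 3: combine the text fields of one course into a single string."""
    parts = (course.get(field) for field in FIELDS)
    return " ".join(p.strip() for p in parts if p and p.strip())


def subcategory_documents(groups):
    """Step 4: join all course strings of a subcategory into one string."""
    return {
        name: " ".join(course_text(course) for course in courses)
        for name, courses in groups.items()
    }


# Placeholder input in the shape produced by grouping courses by subcategory.
groups = {
    "Finance": [
        {"title": "Intro into Finance.", "summary": "This course", "description": ""},
        {"title": "Stocks"},
    ],
}
documents = subcategory_documents(groups)
```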

  26. Text categorisation approach (Step 5)
      Preprocess the data: remove all stop words and punctuation, convert all words to lowercase, stem all words.

      Subcategories    Preprocessed combined array of courses
      Finance          “intro financ cours …”
      Marketing        “market learn fundamental …”
      …                …
      Subcategory m    course 1 course 2 … course n
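
The preprocessing in Step 5 might be approximated as below. This is a toy sketch: the stop-word list is a tiny hand-picked sample and `naive_stem` is a crude suffix stripper written for this example, not the stemmer used in the thesis (which the slides do not name). A real pipeline would more likely rely on an established stemmer such as Porter's.

```python
import re

# Assumption: a very small stop-word sample, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "this", "of", "to", "in", "into", "is", "for"})
SUFFIXES = ("ing", "es", "s", "e")  # checked in this order


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, remove stop words, stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


tokens = preprocess("Intro into Finance. This course")
```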

  27. Text categorisation approach (Step 6)
      Course (query) from another platform that needs to be categorised:
        {
          “title”: String,
          “courseUrl”: String,
          “imageUrl”: String,
          “description”: String,
          “duration”: Int,
          “category”: String,
          …
        }
      Extract and combine the ‘Title’, ‘Summary’ and ‘Description’.
      Preprocess the data: remove all stop words and punctuation, convert all words to lowercase, stem all words.

  28. Text categorisation approach (Step 6)
      Calculate TF-IDF and cosine similarity for all subcategories. The course is assigned to the subcategory with the highest value.

      Course(s):
        { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }
      → TF-IDF and cosine similarity →
      Subcategories (String): Finance, Marketing, …, Subcategory m
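
A minimal sketch of this categorisation step, using only the standard library, is given below. It assumes raw term counts for TF and `log(N / df)` for IDF, which is one common variant; the slides do not state which weighting the thesis used. The function names and the token lists are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: dict of subcategory name -> list of preprocessed tokens."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents.values():
        df.update(set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for name, tokens in documents.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def categorise(query_tokens, vectors, idf):
    """Return the best-matching subcategory and all similarity scores."""
    counts = Counter(query_tokens)
    query = {term: tf * idf[term] for term, tf in counts.items() if term in idf}
    scores = {name: cosine(query, vector) for name, vector in vectors.items()}
    return max(scores, key=scores.get), scores


# Placeholder token lists standing in for preprocessed subcategory documents.
documents = {
    "Finance": ["financ", "cours", "stock", "invest"],
    "Marketing": ["market", "learn", "brand", "cours"],
}
vectors, idf = tfidf_vectors(documents)
label, scores = categorise(["stock", "market", "invest", "financ"], vectors, idf)
```

With this weighting, a term that occurs in every subcategory document (here `cours`) gets an IDF of zero and therefore does not influence the result.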

  29. TF-IDF & Cosine Similarity
      • TF-IDF - term frequency-inverse document frequency.
      • Term frequency - how frequently a term appears in a given document.
      • Inverse document frequency - diminishes the weight of terms that appear very frequently in the corpus and increases the weight of terms that appear rarely.
      • Cosine similarity - a measure of similarity between two vectors that measures the cosine of the angle between them.
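
Written out, one common form of these definitions is the following. The notation is an assumption, since the slides give no formulas: $\operatorname{tf}(t, d)$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents (here, subcategories), $\operatorname{df}(t)$ is the number of documents containing $t$, and $\mathbf{q}$, $\mathbf{d}$ are the TF-IDF vectors of the query course and of a subcategory.

```latex
\[
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)},
\qquad
\cos(\mathbf{q}, \mathbf{d}) =
\frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
\]
```

Several smoothed variants of the IDF term are in use; which one the thesis applied is not stated on the slides.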

  30. 6. Approach evaluation

  31. Approach evaluation

      Coursera courses:
        { “title”: String, “courseUrl”: String, “imageUrl”: String, “description”: String, “duration”: Int, “category”: String, … }
      → TF-IDF and cosine similarity approach →
      New category

  32. Approach evaluation
      Accuracy - the ratio of correctly categorised courses to the total number of courses.
      Gold standard accuracy = 2625 / 3032 ≈ 0.87
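
The accuracy figure can be checked with a few lines of code. The `accuracy` helper is a hypothetical illustration of the definition above; only the counts 2625 and 3032 come from the slide.

```python
def accuracy(predicted, expected):
    """Share of courses whose predicted category equals the expected one."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Counts reported on the slide: 2625 of 3032 Coursera courses were
# assigned to their original category again.
gold_standard_accuracy = 2625 / 3032
print(round(gold_standard_accuracy, 2))  # 0.87
```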

  33. 7. Evaluation of all platforms

  34. Evaluation of all platforms (Udacity)

      Intro. to Android (Category: Android)
      → TF-IDF and cosine similarity approach →
      Category: Computer Science

      Good or bad outcome?

  35. Evaluation of all platforms (Udacity)
      Figure: gold standard (Coursera) categories and Udacity categories.

  36. Evaluation of all platforms (Udacity)
      *The heat-map shows the percentage of courses categorised to each particular category, with darker colours indicating a greater percentage.
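
The percentages behind such a heat-map could be computed as in the sketch below. It assumes that each row is an original platform category and that percentages are taken per row; the slide does not specify the normalisation. The function name and the sample pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def category_percentages(pairs):
    """pairs: iterable of (original_category, assigned_gold_standard_category)."""
    counts = defaultdict(Counter)
    for original, assigned in pairs:
        counts[original][assigned] += 1
    return {
        original: {
            assigned: 100.0 * n / sum(row.values()) for assigned, n in row.items()
        }
        for original, row in counts.items()
    }


# Placeholder assignments, not results from the thesis.
pairs = [
    ("Android", "Computer Science"),
    ("Android", "Computer Science"),
    ("Android", "Business"),
    ("Business", "Business"),
]
percentages = category_percentages(pairs)
```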

  37. Grading schema

  38. Evaluation of all platforms (Udacity): Udacity evaluation table

  39. Evaluation of all platforms (Udacity): Udacity courses distribution (Pie Chart)

  40. Evaluation of all platforms (Udacity): Udacity courses distribution (Table)

  41. Evaluation of all platforms (Edx)

  42. 8. Conclusion & Future work

  43. Conclusion
      1. Approx. 45,000 courses scraped and indexed for IROM.
      2. Using Coursera’s categories as the gold standard was a great outcome.
      3. The TF-IDF and cosine similarity measure was also a positive outcome.

  44. Future work
      1. Measure the quality of the scraped data
      2. Better approach - machine learning (neural networks, etc.)
      3. Evaluating text categorisation

  45. Thank you.
