Topic Modeling on ABC News
Mining Unstructured Data Team 2: Steve Barnard Charles Huang Shyam Senthilkumar Eda Wang May 21, 2019
ABC News headlines are used for topic modeling: 78,000 headlines from news articles published in 2015
Sample Headlines:
“Egyptian court orders retrial for journalist Peter Greste”
“Boat on fire at Jacobs Well Marina”
“Woman arrested after explosives allegedly found in car”
Source: https://www.kaggle.com/therohk/million-headlines
Attempted Domain Knowledge/Heuristic based topic selection:
12 on website: https://www.abc.net.au/
Differences between the two topic lists, with one source being US-centric and the other Australia-centric, made the topic themes harder to interpret (e.g., no business/politics and the addition of indigenous/rural). As an experiment to normalize the differences, Australian words were manually added to the stop words, with limited
Both topic lists are as of 2019; the data used for our LDA variants was from 2015, so topics may have changed.
ABC 12 Topics
(data source)
CNN 10 Topics
Coherence Based Topic Selection: Coherence scores fluctuated throughout our multiple iterations of LDA and its variants
○ Depending on model type and number
topics tended to perform better in terms of coherence
number of topics based on the lower end of possible topics as well as deciding on 10 as a comfortable starting point.
Coherence Scores v. Number of Topics for one of
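To make the idea of a coherence score concrete, here is a minimal sketch of one common measure, UMass coherence, averaged over word pairs. The slides do not say which coherence measure was used, so this is a stand-in for illustration only; the function name `umass_coherence`, the smoothing constant `eps`, and the toy documents are all assumptions.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents, eps=1.0):
    """Average UMass coherence over the word pairs of one topic.

    top_words: topic words ordered from most to least probable.
    documents: list of token lists (e.g. tokenized headlines).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        d_earlier = doc_freq(earlier)
        if d_earlier == 0:
            continue  # skip words that never occur in the corpus
        scores.append(math.log((doc_freq(earlier, later) + eps) / d_earlier))
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical toy corpus, not taken from the ABC dataset.
toy_docs = [
    ["police", "court", "murder"],
    ["police", "court", "charged"],
    ["cup", "win", "final"],
    ["cup", "final", "cricket"],
]
coherent = umass_coherence(["police", "court"], toy_docs)
mixed = umass_coherence(["police", "cup"], toy_docs)
```

Words that tend to appear in the same headlines score higher than words that never co-occur, which is the property a coherence-versus-number-of-topics plot relies on.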
Model comparison (columns: LSA | LDA v1 | LDA v2)
Dataset: ABC 2015 | ABC 2015 | ABC 2015
Stop words: ‘english’ plus additions | ‘english’ | ‘english’ plus additions
Pkg/Method: Sklearn KMeans | Gensim + Mallet’s LDA | Gensim
# of Topics: 10 | 10 | 10
Shared Themes: Politics, Local News, Crime, International
Unique Themes: Sports | Economy, Education, Car Accident, Health, Sports | Agriculture, Car Accident, Infrastructure
variations Output
Based on word frequency, additional stop words were added, such as time-frame words (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say). KMeans from the sklearn package was applied with n_clusters=10. KMeans clustering on the data with stop words removed produced interpretable topics such as Sports (topics #7, #2), Politics (#4, #10), local stories (#3, #6, #8, #9), Crime (#1), and International (#5).
Image: WordCloud based on news data
Topic 1 (Crime) – police, wa, court, charged, death, sa, murder, fire, car, crash
Topic 2 (Sports) – australia, world, cup, win, final, south, rugby, one, cricket, ntch
Topic 3 (Local) – country, hour, nt, vic, tas, nsw, march, qld
Topic 4 (Politics) – government, council, coast, sydney, health, funding, tasmanian, canberra
Topic 5 (International) – australian, us, market, china, west, open, share
Topic 6 (Local) – news, national, exchange, rural, abc, quiz, press, club, march, park
Topic 7 (Sports) – live, nrl, league, afl, streaming, updates, super, blog, export, final
Topic 8 (Local) – grandstand, drum, hill, capital, breakfast, march, broken, stumps, digital, confab
Topic 9 (Local) – rural, qld, sach, reporter, countrywide, sa, north, outback, drought, central
Topic 10 (Politics) – nsw, interview, election, extended, rural, rain, wrap, shark, baird, police
Topic 1 (International) – Australia, Australian, day, farmer, China, water, test, market, price, rise
Topic 2 (Politics) – government, call, change, act, urge, election, labor, support, law, group
Topic 3 (Sports) – win, open, world, set, lead, beat, record, final, return, cup
Topic 4 (Economy) – Canberra, report, home, family, Perth, child, service, worker, work, leave
Topic 5 (Local) – make, show, talk, Adelaide, png, head, ban, centre, food, project
Topic 6 (Education) – plan, council, WA, school, interview, community, high, hunter, cattle, student
Topic 7 (Health) – hospital, minister, Tasmanian, budget, concern, cut, funding, job, time
Topic 8 (Car Accident) – year, Queensland, Sydney, kill, south, crash, car, hit, die, Melbourne
Topic 9 (Crime) – man, police, charge, women, find, court, death, face, murder, miss
Topic 10 (Local) – rural, fire, gld, nsw, country_hour, national, nt, hour, warn, podcast
Standard English stop words were used, with lemmatization keeping only nouns, adjectives, verbs, and adverbs. The Gensim package with Mallet’s version of the LDA algorithm was used on the data. Interpretable topics include International (#1), Politics (#2), Sports (#3), Economy (#4), local stories (#5, #10), Education (#6), Health (#7), Car Accident (#8), and Crime (#9).
Topic 1 (Local) – australian, tasmanian, top, school, national, industry, victim, defence
Topic 2 (Politics) – australia, council, canberra, community, victorian, review, life, murder, concern, adelaide
Topic 3 (Infrastructure) – road, market, car, hospital, local, funding, law, ban, turnbull, people
Topic 4 (Local) – melbourne, china, water, company, group, claim, qld, act
Topic 5 (International) – former, wa, home, good, hunter, leader, islamic state, turkey, grandstand, call
Topic 6 (Agriculture) – queensland, perth, farmer, record, flood, high, worker, drought, force, price
Topic 7 (Crime) – police, race, sale, warning, drug, number, deal, court, storm, star
Topic 8 (Car Accident) – sydney, plan, death, test, family, government, change, crash, driver, big
Topic 9 (Local) – hour, hobart, league, child, report, season, charge, wa_country, vic_country
Topic 10 (Local) – fire, resident, darwin, attack, student, dog, story, award, mine, service
Based on word frequency, additional stop words were added, such as time-frame words (e.g., Jan, days, Sept, Christmas), nouns (e.g., woman, man), and verbs (e.g., say), with lemmatization keeping only nouns and adjectives. The Gensim package with Mallet’s version of the LDA algorithm was used on the data. Interpretable topics include Local (#1, #4, #9, #10), Politics (#2), Infrastructure (#3), International (#5), Crime (#7), and Car Accident (#8). The model had a coherence score of 0.57.
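The frequency-based stop-word step can be approximated as follows: count tokens, then review the most frequent ones by hand and decide which to add to the stop list. This is only a sketch; the helper name `frequent_tokens`, the whitespace tokenization, and the toy headlines are assumptions rather than the team's actual procedure.

```python
from collections import Counter


def frequent_tokens(headlines, base_stop_words=frozenset(), top_n=20):
    """Return the most common tokens not already covered by the base stop list."""
    counts = Counter(
        token
        for headline in headlines
        for token in headline.lower().split()
        if token not in base_stop_words
    )
    return counts.most_common(top_n)


# Hypothetical toy headlines, used only to show the output shape.
toy_headlines = ["man charged", "man found", "woman arrested"]
candidates = frequent_tokens(toy_headlines, top_n=1)
```

High-frequency but uninformative tokens surfaced this way (such as "man" in the toy input) are the kind of words the slide describes adding to the stop list.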
Weaknesses
➢ Results vary with the choice of the number of topics
➢ Topic interpretability and overlap:
Strengths
❖ Unsupervised model without any labeling requirements
❖ Treats each document as a mixture of different topics and each topic as a mixture of different words
❖ Provides understanding of the underlying topic distributions that drive news headlines
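The "mixture" view in the Strengths list can be written compactly. In the standard LDA formulation, the probability of word w appearing in document d is a sum over the K topics (K = 10 in the models above); the symbols w, d, z, and K are notation introduced here, not taken from the slides.

```latex
p(w \mid d) = \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
```

Here p(z = k | d) is the share of topic k in the headline and p(w | z = k) is the weight of word w within topic k, which is what the top-word lists per topic summarize.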
Our analysis was performed on a single Australian news source (ABC) in the year 2015
have changed over time
topics of interest vary around the world
political ideologies or differing reader bases to understand topic distributions across different demographics