
SLIDE 1

Topic Modeling on ABC News

Mining Unstructured Data
Team 2: Steve Barnard, Charles Huang, Shyam Senthilkumar, Eda Wang
May 21, 2019

SLIDE 2

ABC News headlines are used for topic modeling

  • 78,000 headlines from news articles published in 2015 (derived from Kaggle dataset - url listed below)
  • Sourced from the Australian Broadcasting Corp. (ABC)
  • ~200 articles published each day
  • Mix of international and Australian news
  • Expected LDA to find common news topics: Sports, Politics, Local News, Health, Science, etc.

Sample headlines:
  • “Egyptian court orders retrial for journalist Peter Greste”
  • “Boat on fire at Jacobs Well Marina”
  • “Woman arrested after explosives allegedly found in car”

Dataset: https://www.kaggle.com/therohk/million-headlines
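Filtering the Kaggle dump down to the 2015 headlines is a one-pass CSV scan. A minimal stdlib sketch, assuming the public dataset's two-column layout (`publish_date` as YYYYMMDD, `headline_text`); the in-memory sample below is illustrative, not real data:

```python
import csv
import io

def headlines_for_year(csv_file, year="2015"):
    """Return headline_text values whose publish_date falls in the given year.

    Assumes the million-headlines layout: a header row followed by
    rows of (publish_date as YYYYMMDD, headline_text).
    """
    reader = csv.DictReader(csv_file)
    return [row["headline_text"]
            for row in reader
            if row["publish_date"].startswith(year)]

# Tiny in-memory stand-in for the real CSV file.
sample = io.StringIO(
    "publish_date,headline_text\n"
    "20141231,old story about cricket\n"
    "20150102,egyptian court orders retrial for journalist peter greste\n"
    "20150614,boat on fire at jacobs well marina\n"
    "20160101,newer story\n"
)
headlines = headlines_for_year(sample)
print(len(headlines))  # 2
```

On the real file this yields the ~78,000 headlines used throughout the deck.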

SLIDE 3

Number of topics chosen based on a heuristic approach

Attempted domain-knowledge/heuristic-based topic selection:

  • Australian Broadcasting Corp. (ABC) news lists 12 topics on its website: https://www.abc.net.au/
  • CNN lists 10 topics (excluding video)

The difference in topic lists, with one being US-centric and the other Australia-centric, added some difficulty in interpreting topic themes (i.e., no business/politics, and the addition of indigenous/rural). As an experiment to normalize the differences, Australian words were manually added to the stop words, with limited success (common words such as (wo)man were also added).

Both topic counts are as of 2019; the data used for our LDA variants is from 2015, so the topics may have changed.

Images: ABC’s 12 topics and CNN’s 10 topics (data source).

SLIDE 4

Number of topics chosen based on coherence scores

Coherence-based topic selection: coherence scores fluctuated throughout our multiple iterations of LDA and its variants.

  • Coherence ranged from ~0.26 to 0.56, depending on model type, number of topics selected, stop words, etc.
  • Overall, models with a higher number of topics tended to perform better in terms of coherence.
  • We settled on k = 10 for the number of topics, at the lower end of the possible range and a comfortable starting point.

Image: coherence scores vs. number of topics for one of our models, illustrating the iterative search for an optimal topic count based on coherence.
SLIDE 5

Summary of topic modeling methods and observations

              LSA                      LDA v1                   LDA v2
Dataset       ABC 2015                 ABC 2015                 ABC 2015
Stop words    ‘english’ plus           ‘english’                ‘english’ plus
              variations                                        variations
Pkg/Method    Sklearn KMeans           Gensim + Mallet’s LDA    Gensim
# of Topics   10                       10                       10

Shared themes: Politics, Local News, Crime, International
Unique themes: LSA: Sports; LDA v1: Economy, Education, Car Accident, Health, Sports; LDA v2: Agriculture, Car Accident, Infrastructure
Output: topic lists shown on the following slides.

SLIDE 6

Observations – LSA (K-Means Clustering)

Based on word frequency, we added additional stop words such as time-frame words (e.g., Jan, days, Sept, Christmas), generic nouns (e.g., woman, man), and generic verbs (e.g., say). We applied KMeans from the sklearn package with n_clusters=10. K-Means clustering on the data with stop words removed produced interpretable topics such as Sports (topics #7, #2), Politics (topics #4, #10), local stories (topics #3, #6, #8, #9), Crime (topic #1), and International (topic #5).

Image: word cloud based on the news data

Crime Topic 1 – police, wa, court, charged, death, sa, murder, fire, car, crash
Sports Topic 2 – australia, world, cup, win, final, south, rugby, one, cricket, ntch
Local Topic 3 – country, hour, nt, vic, tas, nsw, march, qld
Politics Topic 4 – government, council, coast, sydney, health, funding, tasmanian, canberra
International Topic 5 – australian, us, market, china, west, open, share
Local Topic 6 – news, national, exchange, rural, abc, quiz, press, club, march, park
Sports Topic 7 – live, nrl, league, afl, streaming, updates, super, blog, export, final
Local Topic 8 – grandstand, drum, hill, capital, breakfast, march, broken, stumps, digital, confab
Local Topic 9 – rural, qld, sach, reporter, countrywide, sa, north, outback, drought, central
Politics Topic 10 – nsw, interview, election, extended, rural, rain, wrap, shark, baird, police

SLIDE 7

Observations – LDA v1 (Gensim + Mallet’s LDA)

International Topic 1 – Australia, Australian, day, farmer, China, water, test, market, price, rise
Politics Topic 2 – government, call, change, act, urge, election, labor, support, law, group
Sports Topic 3 – win, open, world, set, lead, beat, record, final, return, cup
Economy Topic 4 – Canberra, report, home, family, Perth, child, service, worker, work, leave
Local Topic 5 – make, show, talk, Adelaide, png, head, ban, centre, food, project
Education Topic 6 – plan, council, WA, school, interview, community, high, hunter, cattle, student
Health Topic 7 – hospital, minister, Tasmanian, budget, concern, cut, funding, job, time
Car Accident Topic 8 – year, Queensland, Sydney, kill, south, crash, car, hit, die, Melbourne
Crime Topic 9 – man, police, charge, women, find, court, death, face, murder, miss
Local Topic 10 – rural, fire, gld, nsw, country_hour, national, nt, hour, warn, podcast

Standard English stop words were used; lemmatization kept only nouns, adjectives, verbs, and adverbs. The Gensim package with Mallet's version of the LDA algorithm was applied to the data. Interpretable topics include International (#1), Politics (#2), Sports (#3), Economy (#4), local stories (#5, #10), Education (#6), Health (#7), Car Accident (#8), and Crime (#9).

SLIDE 8

Observations – LDA v2 (Gensim)

Local Topic 1 – australian, tasmanian, top, school, national, industry, victim, defence
Politics Topic 2 – australia, council, canberra, community, victorian, review, life, murder, concern, adelaide
Infrastructure Topic 3 – road, market, car, hospital, local, funding, law, ban, turnbull, people
Local Topic 4 – melbourne, china, water, company, group, claim, qld, act
International Topic 5 – former, wa, home, good, hunter, leader, islamic state, turkey, grandstand, call
Agriculture Topic 6 – queensland, perth, farmer, record, flood, high, worker, drought, force, price
Crime Topic 7 – police, race, sale, warning, drug, number, deal, court, storm, star
Car Accident Topic 8 – sydney, plan, death, test, family, government, change, crash, driver, big
Local Topic 9 – hour, hobart, league, child, report, season, charge, wa_country, vic_country
Local Topic 10 – fire, resident, darwin, attack, student, dog, story, award, mine, service

Based on word frequency, we added additional stop words such as time-frame words (e.g., Jan, days, Sept, Christmas), generic nouns (e.g., woman, man), and generic verbs (e.g., say); lemmatization kept only nouns and adjectives. Gensim's LDA implementation was applied to the data. Interpretable topics include Local (#1, #4, #9, #10), Politics (#2), Infrastructure (#3), International (#5), Crime (#7), and Car Accident (#8). This model had a coherence score of 0.57.
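The frequency-driven stop-word additions described above (time-frame words, generic nouns, generic verbs) can be sketched with a plain-Python preprocessor. The word lists below are illustrative stand-ins, not the deck's exact lists:

```python
import re

# Illustrative base list plus the deck's style of additions:
# time-frame words, generic nouns, and generic verbs.
EXTRA_STOPWORDS = {
    "jan", "sept", "days", "christmas",   # time frame
    "woman", "man",                       # generic nouns
    "say", "says",                        # generic verbs
}
BASE_STOPWORDS = {"the", "a", "an", "in", "on", "at", "of", "for", "to", "and"}
STOPWORDS = BASE_STOPWORDS | EXTRA_STOPWORDS

def tokenize(headline):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(tokenize("Woman arrested after explosives allegedly found in car"))
# ['arrested', 'after', 'explosives', 'allegedly', 'found', 'car']
```

The resulting token lists are what get lemmatized and fed to the dictionary/corpus step before fitting LDA.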

SLIDE 9

Strengths and Weaknesses of LDA

Strengths:
  ❖ Unsupervised model with no labeling requirements
  ❖ Treats each document as a mixture of topics and each topic as a mixture of words
  ❖ Provides an understanding of the underlying topic distributions that drive news headlines

Weaknesses:
  ➢ Results vary with the choice of the number of topics
  ➢ Topic interpretability and overlap can be problematic

SLIDE 10

Applications of LDA on News

Our analysis was performed on a single Australian news source (ABC) in the year 2015.

  • The method could be applied across numerous years to study how news topics covered in Australia have changed over time.
  • Topics learned from the Australian news could be compared to other countries’ to learn how the topics of interest vary around the world.
  • The methods could be applied to different news sources, for example those with contrasting political ideologies or differing reader bases, to understand topic distributions across different demographics.

SLIDE 11

Q&A