

SLIDE 1

BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections

Konstantin Vorontsov, Oleksandr Frei, Murat Apishev, Peter Romov, Marina Dudarenko
Yandex • CC RAS • MIPT • HSE • MSU
Analysis of Images, Social Networks and Texts
Ekaterinburg • 9–11 April 2015

SLIDE 2

Contents

1. Theory
   • Probabilistic Topic Modeling
   • ARTM — Additive Regularization for Topic Modeling
   • Multimodal Probabilistic Topic Modeling
2. BigARTM implementation — http://bigartm.org
   • BigARTM: parallel architecture
   • BigARTM: time and memory performance
   • How to start using BigARTM
3. Experiments
   • ARTM for combining regularizers
   • Multi-ARTM for classification
   • Multi-ARTM for multi-language TM

SLIDE 3

What is a “topic”?
• A topic is the specialized terminology of a particular domain area.
• A topic is a set of coherent terms (words or phrases) that often occur together in documents.
• Formally, a topic is a probability distribution over terms: p(w|t) is the (unknown) frequency of word w in topic t.
• Document semantics is a probability distribution over topics: p(t|d) is the (unknown) frequency of topic t in document d.
• Each document d consists of terms w1, w2, …, w_nd: p(w|d) is the (known) frequency of term w in document d.
• When writing term w in document d, the author thinks about topic t.
• A topic model tries to uncover the latent topics of a text collection.


SLIDE 4

Goals and applications of Topic Modeling
Goals:
• Uncover the hidden thematic structure of a text collection
• Find a compressed semantic representation of each document
Applications:
• Information retrieval for long-text queries
• Semantic search in large scientific document collections
• Revealing research trends and research fronts
• Expert search
• News aggregation
• Recommender systems
• Categorization, classification, summarization, segmentation of texts, images, video, signals, social media
• and many others


SLIDE 5

Probabilistic Topic Modeling: milestones and mainstream

1. PLSA — Probabilistic Latent Semantic Analysis (1999)
2. LDA — Latent Dirichlet Allocation (2003)
3. 100s of PTMs based on Graphical Models & Bayesian Inference

David Blei. Probabilistic topic models // Communications of the ACM, 2012. Vol. 55. No. 4. Pp. 77–84.


SLIDE 6

Generative Probabilistic Topic Model (PTM)
A topic model explains terms w in documents d by topics t:

p(w|d) = Σ_t p(w|t) p(t|d)

[Illustration: a Russian-language research abstract with terms highlighted by latent topic, shown alongside columns of top words with their topic–word probabilities p(w|t) and the document's topic distribution p(t|d).]

SLIDE 7

PLSA: Probabilistic Latent Semantic Analysis [T. Hofmann, 1999]
Given:
• D is a set (collection) of documents
• W is a set (vocabulary) of terms
• n_dw = how many times term w appears in document d
Find: parameters φ_wt = p(w|t), θ_td = p(t|d) of the topic model

p(w|d) = Σ_{t∈T} φ_wt θ_td.

The problem of log-likelihood maximization under non-negativity and normalization constraints:

Σ_{d,w} n_dw ln Σ_{t∈T} φ_wt θ_td → max_{Φ,Θ}

subject to φ_wt ≥ 0, Σ_{w∈W} φ_wt = 1; θ_td ≥ 0, Σ_{t∈T} θ_td = 1.
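To make the E- and M-steps behind this optimization concrete, here is a minimal PLSA EM sketch in numpy. It is an illustration only, not BigARTM code; the function name `plsa_em` and the dense-matrix representation are assumptions made for the example.

```python
import numpy as np

def plsa_em(N, T, iters=50, seed=0):
    """EM for PLSA: N is the W x D matrix of counts n_dw; T is the number of topics."""
    W, D = N.shape
    rng = np.random.default_rng(seed)
    Phi = rng.random((W, T)); Phi /= Phi.sum(axis=0)        # columns are p(w|t)
    Theta = rng.random((T, D)); Theta /= Theta.sum(axis=0)  # columns are p(t|d)
    for _ in range(iters):
        P = np.maximum(Phi @ Theta, 1e-12)  # model p(w|d)
        Z = N / P                           # n_dw / p(w|d)
        # E-step folded into count updates: n_wt = sum_d n_dw p(t|d,w), n_td likewise
        n_wt = Phi * (Z @ Theta.T)
        n_td = Theta * (Phi.T @ Z)
        Phi = n_wt / n_wt.sum(axis=0)       # M-step: phi_wt = norm_w(n_wt)
        Theta = n_td / n_td.sum(axis=0)     # M-step: theta_td = norm_t(n_td)
    return Phi, Theta
```

Each iteration does not decrease the log-likelihood Σ_{d,w} n_dw ln p(w|d); ARTM (slide 9 onwards) changes only the M-step.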


SLIDE 8

Topic Modeling is an ill-posed inverse problem
Topic Modeling is a problem of stochastic matrix factorization:

p(w|d) = Σ_{t∈T} φ_wt θ_td.

In matrix notation: P_{W×D} = Φ_{W×T} · Θ_{T×D}, where
• P = (p(w|d))_{W×D} is the known term–document matrix,
• Φ = (φ_wt)_{W×T} is the unknown term–topic matrix, φ_wt = p(w|t),
• Θ = (θ_td)_{T×D} is the unknown topic–document matrix, θ_td = p(t|d).
Matrix factorization is not unique, and the solution is not stable:

ΦΘ = (ΦS)(S⁻¹Θ) = Φ′Θ′

for all S such that Φ′ = ΦS and Θ′ = S⁻¹Θ are stochastic. Therefore regularization is needed to find an appropriate solution.
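A tiny numpy check of this non-uniqueness; the matrices are made-up toy values, and S is any invertible column-stochastic mixing matrix:

```python
import numpy as np

Phi = np.array([[0.7, 0.1],
                [0.2, 0.3],
                [0.1, 0.6]])        # term-topic matrix, columns sum to 1
Theta = np.array([[0.8, 0.3],
                  [0.2, 0.7]])      # topic-document matrix, columns sum to 1

S = np.array([[0.9, 0.2],
              [0.1, 0.8]])          # invertible, column-stochastic mixing
Phi2 = Phi @ S                      # alternative "topics"
Theta2 = np.linalg.inv(S) @ Theta   # compensating document weights

print(np.allclose(Phi @ Theta, Phi2 @ Theta2))  # True: identical p(w|d)
print(Phi2.sum(axis=0), Theta2.sum(axis=0))     # both still column-stochastic
print(np.allclose(Phi, Phi2))                   # False: different factors
```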


SLIDE 9

ARTM: Additive Regularization of Topic Models
Add regularization criteria R_i(Φ, Θ) → max, i = 1, …, n.
The problem of regularized log-likelihood maximization under non-negativity and normalization constraints:

Σ_{d,w} n_dw ln Σ_{t∈T} φ_wt θ_td + Σ_{i=1}^{n} τ_i R_i(Φ, Θ) → max_{Φ,Θ}
[the first term is the log-likelihood L(Φ,Θ); the second is R(Φ,Θ)]

subject to φ_wt ≥ 0, Σ_{w∈W} φ_wt = 1; θ_td ≥ 0, Σ_{t∈T} θ_td = 1,
where τ_i > 0 are regularization coefficients.

Vorontsov K. V., Potapenko A. A. Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization // AIST’2014, Springer CCIS, 2014. Vol. 436. pp. 29–46.


SLIDE 10

ARTM: available regularizers
• topic smoothing (equivalent to LDA)
• topic sparsing
• topic decorrelation
• topic selection via entropy sparsing
• topic coherence maximization
• supervised learning for classification and regression
• semi-supervised learning
• using document citations and links
• modeling temporal topic dynamics
• using vocabularies in multilingual topic models
• and many others
For example, topic smoothing alone reproduces LDA, as the worked equation below shows.
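A worked one-line example of the first item, plugging the smoothing regularizer into the M-step formula of slide 19 (single modality, no other regularizers):

```latex
R(\Phi) = \sum_{t \in T} \sum_{w \in W} \beta_w \ln \varphi_{wt}
\quad\Longrightarrow\quad
\varphi_{wt}\,\frac{\partial R}{\partial \varphi_{wt}} = \beta_w
\quad\Longrightarrow\quad
\varphi_{wt} = \operatorname{norm}_{w \in W}\bigl(n_{wt} + \beta_w\bigr)
```

That is, counts smoothed by a Dirichlet-style prior; the symmetric regularizer on Θ gives θ_td = norm(n_td + α_t), which is why smoothing alone reproduces LDA-style updates.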

Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning. Special Issue “Data Analysis and Intelligent Optimization with Applications”. Springer, 2014.


SLIDE 11

Multimodal Probabilistic Topic Modeling
Given a text document collection, a Probabilistic Topic Model finds:
• p(t|d) — a topic distribution for each document d,
• p(w|t) — a term distribution for each topic t.
[Diagram: text documents (doc1, doc2, doc3, doc4, …) mapped by Topic Modeling into a Documents × Topics matrix; topics of documents on the left, words and keyphrases of topics on the right.]

SLIDE 12

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), …
[Same diagram, now with document metadata attached: Authors, Data, Time, Conference, Organization, URL, etc.]

SLIDE 13

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), …
[Same diagram, with Images added as a modality.]

SLIDE 14

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), …
[Same diagram, with Links added as a modality.]

SLIDE 15

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), …
[Same diagram, with Ads added as a modality.]

SLIDE 16

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t), …
[Same diagram, with Users added as a modality.]

SLIDE 17

Multimodal Probabilistic Topic Modeling
A Multimodal Topic Model finds topic distributions for terms p(w|t), authors p(a|t), time p(y|t), objects on images p(o|t), linked documents p(d′|t), advertising banners p(b|t), users p(u|t), and binds all these modalities into a single topic model.
[Same diagram, with all modalities (metadata, Images, Links, Ads, Users) bound into one model.]

SLIDE 18

Multi-ARTM: combining multimodality with regularization
• M is the set of modalities
• W^m is the vocabulary of tokens of the m-th modality, m ∈ M
• W = W¹ ⊔ ⋯ ⊔ W^M is the joint vocabulary of all modalities
The problem of multimodal regularized log-likelihood maximization under non-negativity and normalization constraints:

Σ_{m∈M} λ_m Σ_{d∈D} Σ_{w∈W^m} n_dw ln Σ_{t∈T} φ_wt θ_td + Σ_{i=1}^{n} τ_i R_i(Φ, Θ) → max_{Φ,Θ}
[the first term sums the modality log-likelihoods L_m(Φ,Θ); the second is R(Φ,Θ)]

subject to φ_wt ≥ 0, Σ_{w∈W^m} φ_wt = 1 for each m ∈ M; θ_td ≥ 0, Σ_{t∈T} θ_td = 1,
where λ_m > 0, τ_i > 0 are regularization coefficients.


SLIDE 19

Multi-ARTM: multimodal regularized EM-algorithm
The EM-algorithm is a simple-iteration method for a system of equations.

Theorem. The local maximum (Φ, Θ) satisfies the following system of equations with auxiliary variables p_tdw = p(t|d,w):

p_tdw = norm_{t∈T} (φ_wt θ_td);
φ_wt = norm_{w∈W^m} (n_wt + φ_wt ∂R/∂φ_wt),   where n_wt = Σ_{d∈D} λ_{m(w)} n_dw p_tdw;
θ_td = norm_{t∈T} (n_td + θ_td ∂R/∂θ_td),   where n_td = Σ_{w∈d} λ_{m(w)} n_dw p_tdw.

Here norm_{t∈T}(x_t) = max{x_t, 0} / Σ_{s∈T} max{x_s, 0} is non-negative normalization, and m(w) is the modality of term w, so that w ∈ W^{m(w)}.


SLIDE 20

Fast online EM-algorithm for Multi-ARTM
Input: collection D split into batches D_b, b = 1, …, B;
Output: matrix Φ.
1. initialize φ_wt for all w ∈ W, t ∈ T;
2. n_wt := 0, ñ_wt := 0 for all w ∈ W, t ∈ T;
3. for all batches D_b, b = 1, …, B:
4.    iterate each document d ∈ D_b at a constant matrix Φ:
      (ñ_wt) := (ñ_wt) + ProcessBatch(D_b, Φ);
5.    if (synchronize) then:
6.       n_wt := n_wt + ñ_wt for all w ∈ W, t ∈ T;
7.       φ_wt := norm_{w∈W^m} (n_wt + φ_wt ∂R/∂φ_wt) for all w ∈ W^m, m ∈ M, t ∈ T;
8.       ñ_wt := 0 for all w ∈ W, t ∈ T;


SLIDE 21

Fast online EM-algorithm for Multi-ARTM: ProcessBatch
ProcessBatch iterates over documents d ∈ D_b at a constant matrix Φ.
matrix (ñ_wt) := ProcessBatch(set of documents D_b, matrix Φ):
1. ñ_wt := 0 for all w ∈ W, t ∈ T;
2. for all d ∈ D_b:
3.    initialize θ_td := 1/|T| for all t ∈ T;
4.    repeat:
5.       p_tdw := norm_{t∈T} (φ_wt θ_td) for all w ∈ d, t ∈ T;
6.       n_td := Σ_{w∈d} λ_{m(w)} n_dw p_tdw for all t ∈ T;
7.       θ_td := norm_{t∈T} (n_td + θ_td ∂R/∂θ_td) for all t ∈ T;
8.    until θ_d converges;
9.    ñ_wt := ñ_wt + λ_{m(w)} n_dw p_tdw for all w ∈ d, t ∈ T;
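The following numpy sketch ties slides 20 and 21 together for the single-modality case with no regularizers (λ_m = 1, ∂R/∂φ = ∂R/∂θ = 0), synchronizing after every batch. It is the pseudocode above made runnable, not BigARTM's parallel C++ implementation; `process_batch` and `online_em` are names chosen for this sketch.

```python
import numpy as np

def norm(x, axis=0):
    """Non-negative normalization of slide 19: max(x,0) / sum max(x,0)."""
    x = np.maximum(x, 0.0)
    s = x.sum(axis=axis, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)

def process_batch(batch, Phi, inner_iters=10):
    """Slide 21: accumulate n~_wt over the documents of one batch, Phi held fixed."""
    W, T = Phi.shape
    n_tilde = np.zeros((W, T))
    for n_d in batch:                   # n_d: length-W vector of counts n_dw
        words = np.nonzero(n_d)[0]
        theta = np.full(T, 1.0 / T)     # theta_td := 1/|T|
        for _ in range(inner_iters):    # "repeat ... until theta_d converges"
            p_tdw = norm(Phi[words] * theta, axis=1)       # p(t|d,w) for w in d
            n_td = (n_d[words, None] * p_tdw).sum(axis=0)  # n_td
            theta = norm(n_td)                             # theta update (R = 0)
        n_tilde[words] += n_d[words, None] * p_tdw         # n~_wt += n_dw p_tdw
    return n_tilde

def online_em(batches, W, T, seed=0):
    """Slide 20: online EM over batches, synchronizing Phi after each batch."""
    rng = np.random.default_rng(seed)
    Phi = norm(rng.random((W, T)))      # initialize phi_wt, columns sum to 1
    n_wt = np.zeros((W, T))
    for batch in batches:
        n_wt += process_batch(batch, Phi)   # n_wt := n_wt + n~_wt
        Phi = norm(n_wt)                    # phi_wt := norm_w(n_wt), R = 0
    return Phi
```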


SLIDE 22

ARTM approach: benefits and restrictions
Benefits:
• A single EM-algorithm for many models and their combinations
• PLSA, LDA, and 100s of PTMs are covered by ARTM
• No complicated inference and graphical models
• ARTM reduces barriers to entry into the PTM research field
• ARTM encourages any combinations of regularizers
• Multi-ARTM encourages any combinations of modalities
• Multi-ARTM is implemented in the BigARTM open-source project
Under development (not really restrictions):
• 3-matrix factorization P = ΦΨΘ, e.g. the Author-Topic Model
• Further generalization of hypergraph-based Multi-ARTM
• Adaptive optimization of regularization coefficients


SLIDE 23

The BigARTM project: main features
• Parallel online Multi-ARTM framework
• Open source: http://bigartm.org
• Distributed storage of the collection is possible
• Built-in regularizers: smoothing, sparsing, decorrelation, semi-supervised learning, and many others coming soon
• Built-in quality measures: perplexity, sparsity, kernel contrast and purity, and many others coming soon
• Many types of PTMs can be implemented via Multi-ARTM: multilanguage, temporal, hierarchical, multigram, and many others


SLIDE 24

The BigARTM project: parallel architecture
• Concurrent processing of batches
• Simple single-threaded code for ProcessBatch
• User controls when to update the model in the online algorithm
• Deterministic (reproducible) results from run to run


SLIDE 25

BigARTM vs Gensim vs Vowpal Wabbit
3.7M articles from Wikipedia, 100K unique words

Library               procs   train      inference   perplexity
BigARTM                 1      35 min     72 sec       4000
Gensim.LdaModel         1     369 min    395 sec       4161
VowpalWabbit.LDA        1      73 min    120 sec       4108
BigARTM                 4       9 min     20 sec       4061
Gensim.LdaMulticore     4      60 min    222 sec       4111
BigARTM                 8     4.5 min     14 sec       4304
Gensim.LdaMulticore     8      57 min    224 sec       4455

procs = number of parallel threads; inference = time to infer θ_d for 100K held-out documents; perplexity is calculated on held-out documents.


SLIDE 26

Running BigARTM in Parallel
• 3.7M articles from Wikipedia, 100K unique words
• Amazon EC2 c3.8xlarge (16 physical cores + hyperthreading)
• No extra memory cost for adding more threads


SLIDE 27

How to start using BigARTM
1. Download links, tutorials, documentation: http://bigartm.org
2. Linux: compile and start the examples; Windows: start the examples
A minimal Python session is sketched below.
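The sketch follows the Python API documented at http://bigartm.org; exact class and argument names can differ between BigARTM versions, and the ./batches path is an assumption:

```python
import artm  # BigARTM's Python module

# A collection previously converted into BigARTM batches on disk (assumed path).
batch_vectorizer = artm.BatchVectorizer(data_path='./batches',
                                        data_format='batches')

model = artm.ARTM(num_topics=100, dictionary=batch_vectorizer.dictionary)
model.scores.add(artm.PerplexityScore(name='perplexity',
                                      dictionary=batch_vectorizer.dictionary))

model.fit_offline(batch_vectorizer=batch_vectorizer, num_collection_passes=10)
print(model.score_tracker['perplexity'].last_value)
```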


SLIDE 28

How to start using BigARTM
1. Download links, tutorials, documentation: http://bigartm.org
2. Linux: compile and start the examples; Windows: start the examples
BigARTM community:
1. Post questions in the BigARTM discussion group: https://groups.google.com/group/bigartm-users
2. Report bugs in the BigARTM issue tracker: https://github.com/bigartm/bigartm/issues
3. Contribute to the BigARTM project via pull requests: https://github.com/bigartm/bigartm/pulls


SLIDE 29

License and programming environment
• Freely available for commercial usage (BSD 3-Clause license)
• Cross-platform: Windows, Linux, Mac OS X (32-bit, 64-bit)
• Simple command-line API — available now
• Rich programming API in C++ and Python — available now
• Rich programming API in C# and Java — coming soon


SLIDE 30

Combining Regularizers: experiment on the 3.7M Wikipedia collection
Additive combination of five regularizers:
• smoothing background (common lexis) topics B in Φ and Θ
• sparsing domain-specific topics S = T \ B in Φ and Θ
• decorrelation of topics in Φ
[Diagram: the matrices Φ_{W×T} and Θ_{T×D} with background and domain-specific topics marked.]


SLIDE 31

Combining Regularizers: experiment on the 3.7M Wikipedia collection
Additive combination of five regularizers: smoothing background (common lexis) topics B in Φ and Θ, sparsing domain-specific topics S = T \ B in Φ and Θ, and decorrelation of topics in Φ:

R(Φ, Θ) = β₁ Σ_{t∈B} Σ_{w∈W} β_w ln φ_wt + α₁ Σ_{d∈D} Σ_{t∈B} α_t ln θ_td
        − β₀ Σ_{t∈S} Σ_{w∈W} β_w ln φ_wt − α₀ Σ_{d∈D} Σ_{t∈S} α_t ln θ_td
        − γ Σ_{t∈T} Σ_{s∈T\t} Σ_{w∈W} φ_wt φ_ws,

where β₀, α₀, β₁, α₁, γ are regularization coefficients.
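A hypothetical BigARTM configuration for this combination; the regularizer class names follow the BigARTM Python API, while the topic split, names, and tau values are illustrative assumptions, not the experiment's actual settings:

```python
import artm

background = ['bg_{}'.format(i) for i in range(10)]   # background topics B
specific = ['sp_{}'.format(i) for i in range(290)]    # domain-specific topics S = T \ B

model = artm.ARTM(topic_names=background + specific)

model.regularizers.add(artm.SmoothSparsePhiRegularizer(
    name='smooth_phi_b', tau=1.0, topic_names=background))    # +beta1: smooth Phi on B
model.regularizers.add(artm.SmoothSparseThetaRegularizer(
    name='smooth_theta_b', tau=1.0, topic_names=background))  # +alpha1: smooth Theta on B
model.regularizers.add(artm.SmoothSparsePhiRegularizer(
    name='sparse_phi_s', tau=-1.0, topic_names=specific))     # -beta0: sparse Phi on S
model.regularizers.add(artm.SmoothSparseThetaRegularizer(
    name='sparse_theta_s', tau=-1.0, topic_names=specific))   # -alpha0: sparse Theta on S
model.regularizers.add(artm.DecorrelatorPhiRegularizer(
    name='decorrelator_phi', tau=1e5, topic_names=specific))  # gamma: decorrelate Phi
```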


SLIDE 32

Combining Regularizers: LDA vs ARTM
P10k, P100k — hold-out perplexity (10K and 100K documents); S_Φ, S_Θ — sparsity of the Φ and Θ matrices (in %); K_s, K_p, K_c — average topic kernel size, purity, and contrast.

Model   P10k   P100k   S_Φ    S_Θ    K_s    K_p     K_c
LDA     3436   3801     0.0    0.0    873   0.533   0.507
ARTM    3577   3947    96.3   80.9   1079   0.785   0.731

[Plots: convergence of LDA (thin lines) and ARTM (bold lines) as documents are processed; left: perplexity and sparsity of Φ and Θ; right: topic kernel size, purity, and contrast.]


SLIDE 33

EUR-Lex corpus
• 19 800 documents about European Union law
• Two modalities: 21K words, 3 250 categories (class labels)
• EUR-Lex is a “power-law dataset” with unbalanced classes:
  left: the number of unique labels with a given number of documents per label; right: the number of documents with a given number of labels.

Rubin T. N., Chambers A., Smyth P., Steyvers M. Statistical topic models for multi-label document classification // Machine Learning, 2012, 88(1-2). Pp. 157–208.


SLIDE 34

Multi-ARTM for classification
Regularizers:
• uniform smoothing for Θ
• uniform smoothing for the word–topic matrix Φ¹
• label regularization for the class–topic matrix Φ²:

R(Φ²) = τ Σ_{c∈W²} p̂_c ln p(c) → max,

where p(c) = Σ_{t∈T} φ_ct p(t) is the model distribution of class c; p(t) = n_t / n can be easily estimated along the EM iterations; p̂_c is the empirical frequency of class c in the training data.
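A hypothetical sketch of this two-modality setup via the BigARTM Python API. '@default_class' is BigARTM's default token modality; the '@labels_class' name, the weights, and tau are assumptions for illustration:

```python
import artm

# Words and class labels as two modalities; the class_ids weights play the
# role of the modality coefficients lambda_m from slide 18.
model = artm.ARTM(num_topics=1000,
                  class_ids={'@default_class': 1.0,   # lambda for words
                             '@labels_class': 5.0})   # lambda for class labels

# Label regularization for the class-topic matrix Phi^2.
model.regularizers.add(artm.LabelRegularizationPhiRegularizer(
    name='label_reg', tau=100.0, class_ids=['@labels_class']))
```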


SLIDE 35

The comparative study of models on the EUR-Lex classification task
DLDA (Dependency LDA) [Rubin 2012] is the nearest analog of Multi-ARTM for classification among Bayesian topic models.
Quality measures [Rubin 2012]:
• AUC-PR (%, ⇑) — area under the precision–recall curve
• AUC (%, ⇑) — area under the ROC curve
• OneErr (%, ⇓) — one error (the top-ranked label is not relevant)
• IsErr (%, ⇓) — is error (no perfect classification)
Results:

Model               |T|opt    AUC-PR   AUC    OneErr   IsErr
Multi-ARTM          10 000    51.3     98.0   29.1     95.5
DLDA [Rubin 2012]   200       49.2     98.2   32.0     97.2
SVM                           43.5     97.5   31.6     98.1


SLIDE 36

Multi-language ARTM
We consider languages as modalities in Multi-ARTM.
Collection of 216 175 Russian–English Wikipedia article pairs.
Top 10 words with p(w|t) probabilities (in %):

Topic 68 (English)        Topic 68 (Russian)        Topic 79 (English)    Topic 79 (Russian)
research      4.56        институт        6.03      goals       4.48      матч         6.02
technology    3.14        университет     3.35      league      3.99      игрок        5.56
engineering   2.63        программа       3.17      club        3.76      сборная      4.51
institute     2.37        учебный         2.75      season      3.49      фк           3.25
science       1.97        технический     2.70      scored      2.72      против       3.20
program       1.60        технология      2.30      cup         2.57      клуб         3.14
education     1.44        научный         1.76      goal        2.48      футболист    2.67
campus        1.43        исследование    1.67      apps        1.74      гол          2.65
management    1.38        наука           1.64      debut       1.69      забивать     2.53
programs      1.36        образование     1.47      match       1.67      команда      2.14


SLIDE 37

Multi-language ARTM
Collection of 216 175 Russian–English Wikipedia article pairs.
Top 10 words with p(w|t) probabilities (in %):

Topic 88 (English)        Topic 88 (Russian)        Topic 251 (English)   Topic 251 (Russian)
opera         7.36        опера           7.82      windows     8.00      windows        6.05
conductor     1.69        оперный         3.13      microsoft   4.03      microsoft      3.76
orchestra     1.14        дирижер         2.82      server      2.93      версия         1.86
wagner        0.97        певец           1.65      software    1.38      приложение     1.86
soprano       0.78        певица          1.51      user        1.03      сервер         1.63
performance   0.78        театр           1.14      security    0.92      server         1.54
mozart        0.74        партия          1.05      mitchell    0.82      программный    1.08
sang          0.70        сопрано         0.97      oracle      0.82      пользователь   1.04
singing       0.69        вагнер          0.90      enterprise  0.78      обеспечение    1.02
operas        0.68        оркестр         0.82      users       0.78      система        0.96

All |T| = 400 topics were reviewed by an independent assessor, who successfully interpreted 396 of them.


SLIDE 38

Conclusions
• ARTM (Additive Regularization for Topic Modeling) is a general framework that makes topic models easy to design, infer, explain, and combine.
• Multi-ARTM is a further generalization of ARTM for multimodal topic modeling.
• BigARTM is an open-source project for parallel online topic modeling of large text collections: http://bigartm.org
• Join the BigARTM community!