SLIDE 1
Semi-Supervised Sentiment Analysis in Hindi
Naman Bansal Umair Z Ahmed
SLIDE 2 MOTIVATION
Why Sentiment Analysis?
- Labeling the reviews with their sentiment would provide succinct
summaries to readers
- Helpful in business intelligence applications, recommender systems,
message filtering, …
Why semi-supervised?
Problems with supervised polarity classification systems:
- Typically domain-specific
- Expensive process of annotating a large amount of data (especially for
low-resource languages)
SLIDE 3 PREVIOUS WORKS
Dasgupta and Ng (2009) first mine the unambiguous reviews
using spectral techniques, then exploit them to classify the ambiguous reviews via a novel combination of active learning, transductive learning, and ensemble learning.
Joshi et al. (2010) created H-SWN (Hindi-SentiWordNet) using English SentiWordNet
and English-Hindi WordNet linking.
Bakliwal et al. (2012) created a Hindi Subjective Lexicon and used
Hindi WordNet to assign similar polarity to synonyms and
opposite polarity to antonyms.
SLIDE 4 PREVIOUS WORKS ON INDIAN LANGUAGES
Sentiment analysis for Indian languages has primarily
focused on using:
- Machine Translation to translate the data in English to Hindi.
- Bi-Lingual dictionary for English and Indian Languages
- Hindi WordNet expansion to exploit synonyms and antonym polarity
Un/Semi-supervised sentiment analysis techniques are
under-investigated in NLP
SLIDE 5 DATASET
IIT Bombay Movie Review Dataset
- Open source
- 300 Reviews (150 + 150)
IIIT Hyderabad Product Review Dataset
- On Request
- 700 Reviews (350 + 350)
Our contribution
- Building a movie review dataset from jagran.com
SLIDE 6
DATASET
<movie sentiment="neg" star="2" link="http://www.jagran.com/entertainment/reviews-mickey-virus-movie-review-10821431.html">
<review> चर्चित टीवी एंकर मनीष पॉल की इस फिल्म से बहुत उम्मीदें थीं। … </review>
<SelectedLines>
<line sentiment="pos"> मिकी वाइरस पूरी तरह से मनीष पॉल की फिल्म है और फिल्म में उनकी इमेज के हिसाब से ही दृश्य और स्थितियां रची गई थीं। मनीष ने अपने किरदार को बखूबी निभाया है। </line>
<line sentiment="neg"> फिल्म देखने के बाद न सिर्फ उम्मीदें धराशायी हुई बल्कि अच्छे खासे विषय को यूं ही जाया हो जाने का अफसोस भी हो रहा है। उनके अभिनय में इंटेंसिटी तो है लेकिन किरदार स्टीरियोटाइप होते जाए तो अच्छा अभिनेता भी बोर कर सकता है।</line>
</SelectedLines>
</movie>
SLIDE 7
PRE-PROCESSING DATA
Remove:
- Punctuation
- Numbers
- Words of length one
- Words that occur in only a single review
- Words with high document frequency, many of which are
stopwords or domain-specific general-purpose words
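The removal steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, whitespace tokenization, measuring word length in Unicode code points, and the `max_df_ratio` cut-off for "high document frequency" are all assumptions.

```python
import unicodedata
from collections import Counter


def tokenize(review):
    """Drop punctuation and digits, split on whitespace, drop one-character words."""
    # Unicode categories starting with P (punctuation, including the danda)
    # or N (numbers) are replaced by spaces; combining vowel signs are kept.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PN" else ch for ch in review
    )
    # Length is counted in code points (an assumption for Devanagari text).
    return [tok for tok in cleaned.split() if len(tok) > 1]


def preprocess(reviews, max_df_ratio=0.9):
    """Return one token list per review after document-frequency filtering.

    max_df_ratio is a hypothetical threshold; the slides do not give one.
    """
    tokenized = [tokenize(r) for r in reviews]
    # Document frequency: number of reviews in which each word appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_reviews = len(reviews)
    keep = {
        tok
        for tok, df in doc_freq.items()
        if df > 1 and df / n_reviews <= max_df_ratio
    }
    return [[tok for tok in toks if tok in keep] for toks in tokenized]
```

Using Unicode categories instead of a `\w`-based regular expression avoids stripping Devanagari vowel signs, which Python's `re` module does not treat as word characters.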
SLIDE 8 DATA REPRESENTATION
Each review is represented as a vector of unigrams, with a binary weight of 1 for each term present in the review and 0 otherwise. The dataset is represented as a matrix X of R+T such vectors, where R is the number of training samples, T is the number of test samples, and D is the number of feature words in the dataset (the dimension of each vector).
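A minimal sketch of this binary unigram representation, assuming each review has already been reduced to a list of tokens. The function names and the alphabetical ordering of the vocabulary are illustrative choices, not taken from the slides.

```python
def build_vocabulary(token_lists):
    """Sorted list of the D distinct feature words across all reviews."""
    return sorted({tok for toks in token_lists for tok in toks})


def to_binary_vectors(token_lists, vocabulary):
    """One D-dimensional 0/1 vector per review: 1 if the word occurs in it."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for toks in token_lists:
        vec = [0] * len(vocabulary)
        for tok in toks:
            if tok in index:  # words outside the vocabulary are ignored
                vec[index[tok]] = 1
        vectors.append(vec)
    return vectors
```

Repeated words do not change the vector, since the weight records presence rather than count.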
SLIDE 9 PROPOSED APPROACH
Deep Learning Architecture
One Input Layer h0
N hidden layers h1, h2, …, hN
One Output Layer
The input layer h0 has D units, equal to the number
of features of sample data x.
We seek the mapping function XL → YL using the L labeled data and the R+T-L unlabeled data.
Deep Learning
SLIDE 10
PROPOSED APPROACH
The semi-supervised learning method based on the ADN architecture can
be divided into two stages:
First, the ADN architecture is constructed by greedy layer-wise unsupervised learning, using RBMs (restricted Boltzmann machines) as building blocks. All the unlabeled data, together with the L labeled data, are used to find the parameter space W with N layers.
Second, the ADN architecture is trained by gradient descent on an exponential loss function: the parameter space W is retrained using only the L labeled data.
SLIDE 11
PROPOSED APPROACH
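Each pair of adjacent layers (h^(k-1), h^k) is modeled as an RBM with parameters θ, which assigns an energy E to every joint state (h^(k-1), h^k).
The probability that the model assigns to h^(k-1) is obtained by summing exp(-E) over all states of h^k and dividing by Z(θ), where Z(θ) denotes the normalizing constant.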
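In the usual RBM notation these two quantities can be written as below. The weight, bias, and index symbols ($w^k_{st}$, $b_s$, $c_t$, $s$, $t$) are assumed names; only θ and Z(θ) appear in the slides.

```latex
% Energy of a joint state of two adjacent layers (standard RBM form)
E(h^{k-1}, h^{k}; \theta)
  = -\sum_{s}\sum_{t} w^{k}_{st}\, h^{k-1}_{s} h^{k}_{t}
    - \sum_{s} b_{s}\, h^{k-1}_{s}
    - \sum_{t} c_{t}\, h^{k}_{t},
\qquad \theta = (w, b, c)

% Probability assigned to the lower layer, marginalizing over the upper layer
P(h^{k-1}; \theta)
  = \frac{1}{Z(\theta)} \sum_{h^{k}} \exp\!\bigl(-E(h^{k-1}, h^{k}; \theta)\bigr),
\qquad
Z(\theta) = \sum_{h^{k-1}} \sum_{h^{k}} \exp\!\bigl(-E(h^{k-1}, h^{k}; \theta)\bigr)
```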
SLIDE 12
PROPOSED APPROACH
The probability of turning on a unit in h^(k-1) is a logistic function of the states of h^k and w^k.
The probability of turning on unit t in h^k is a logistic function of the states of h^(k-1) and w^k.
The logistic function is: sigm(η) = 1 / (1 + e^(-η))
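A small pure-Python sketch of these two conditional probabilities for a single RBM. The weight matrix layout `w[s][t]`, the bias names `b` and `c`, and the function names are assumptions made for illustration.

```python
import math
import random


def sigm(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def p_upper_on(h_lower, w, c):
    """P(unit t of h^k is on | h^(k-1)) for every upper unit t."""
    return [
        sigm(c[t] + sum(w[s][t] * h_lower[s] for s in range(len(h_lower))))
        for t in range(len(c))
    ]


def p_lower_on(h_upper, w, b):
    """P(unit s of h^(k-1) is on | h^k) for every lower unit s."""
    return [
        sigm(b[s] + sum(w[s][t] * h_upper[t] for t in range(len(h_upper))))
        for s in range(len(b))
    ]


def sample(probs, rng):
    """Draw a binary state for each unit from its turn-on probability."""
    return [1 if rng.random() < p else 0 for p in probs]
```

With all weights and biases at zero, every unit turns on with probability 0.5, which is a quick sanity check for an implementation of this kind.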
SLIDE 13
PROPOSED APPROACH
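The optimization problem is formulated as minimizing the total loss on the L labeled data over the parameter space W.
The loss function is defined as the exponential loss.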
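As a rough illustration of an exponential-loss objective, the sketch below sums exp(-h·y) over labeled samples and output units. The ±1 label encoding, the per-unit summation, and the function names are assumptions; the slides do not spell out the exact form.

```python
import math


def exponential_loss(outputs, labels):
    """Sum of exp(-h * y) over labeled samples and output units.

    outputs[i][j] is the network output for sample i, unit j;
    labels[i][j] is assumed to be +1 or -1.
    """
    return sum(
        math.exp(-h * y)
        for out_row, label_row in zip(outputs, labels)
        for h, y in zip(out_row, label_row)
    )


def loss_gradient(outputs, labels):
    """Derivative of the loss with respect to each network output."""
    return [
        [-y * math.exp(-h * y) for h, y in zip(out_row, label_row)]
        for out_row, label_row in zip(outputs, labels)
    ]
```

Outputs that agree in sign with their labels shrink the loss toward zero, while confident mistakes are penalized exponentially; gradient descent would propagate `loss_gradient` back through the layers to update W.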