Analysis of Wikileaks Cables Using NLP Techniques (CS671: Natural Language Processing)



slide-1
SLIDE 1

Analysis of Wikileaks Cables Using NLP Techniques

CS671: Natural Language Processing
Arpit Jain, Sugam Anand
Mentor: Dr. Amitabha Mukerjee

slide-2
SLIDE 2

Why Wikileaks?

• The Wikileaks embassy cables release is a huge dataset of 251,287 official documents from more than 250 US embassies and consulates worldwide.

• The cables show the extent of US spying on its allies and the UN; turning a blind eye to corruption and human rights abuse in "client states"; backroom deals with supposedly neutral countries; lobbying for US corporations; and the measures US diplomats take to advance those who have access to them.

• Such a huge, rich and structured dataset can be analyzed with natural language processing and information retrieval techniques.

slide-3
SLIDE 3

Distribution of cables

http://wikileaks.org/cablegate.html

slide-4
SLIDE 4

Structure of Cables

A cable contains:

• Source: the embassy that sent the cable
• Destination: the target embassies
• Date: the sending date
• Body: the raw text
• Tags: meta information about the cable, such as its classification (classified, unclassified, secret, etc.)
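The cable fields listed on this slide could be held in a small record type. The following is only a minimal sketch: the class name `Cable`, its attribute names and the example values are assumptions made for illustration and do not come from the slides or from a real cable.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Cable:
    """One cable, mirroring the fields listed on the slide (names are hypothetical)."""
    source: str                  # embassy that sent the cable
    destinations: List[str]      # target embassies
    sent: date                   # sending date
    body: str                    # raw text
    tags: List[str] = field(default_factory=list)  # meta information, e.g. classification


# Hypothetical example values, not taken from a real cable.
example = Cable(
    source="Embassy Islamabad",
    destinations=["Secretary of State"],
    sent=date(2005, 10, 20),
    body="Relief operations continue after the earthquake.",
    tags=["UNCLASSIFIED"],
)
```

Keeping the date as a `date` object rather than a string makes it straightforward to group cables by time period later on.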

slide-5
SLIDE 5

Objective

• Diplomats communicate about topics, referencing people, places and organizations.

• Extract these entities from the Wikileaks cables.

• Guess the topic under discussion.

• Determine the opinion of the diplomats (and, by extension, of America) towards the topic.

• Map these over timelines.

slide-6
SLIDE 6

Methodology

• Get cables for multiple time periods for given embassies.

• Extract the entities using the NLTK Named Entity Recognizer or the Stanford CoreNLP Toolkit.

• Score these entities by their frequency of occurrence across the cables of a particular time frame.

• Guess the topics using a topic modelling approach such as LDA, PLSA or LSI.
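The entity-scoring step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the entities have already been extracted by a named entity recognizer, that each cable arrives as a `(period, entities)` pair, and that "frequency" means the number of cables in a period that mention the entity. The function name `score_entities` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def score_entities(cables: Iterable[Tuple[str, List[str]]]) -> Dict[str, Counter]:
    """Count, per time frame, in how many cables each entity occurs."""
    scores: Dict[str, Counter] = defaultdict(Counter)
    for period, entities in cables:
        # set() makes each entity count once per cable (an assumption;
        # counting every mention instead would use update(entities)).
        scores[period].update(set(entities))
    return dict(scores)


# Hypothetical input: (time frame, entities found in one cable).
cables = [
    ("2005", ["Kashmir", "Musharraf", "Kashmir"]),
    ("2005", ["Musharraf"]),
    ("2006", ["Balochistan"]),
]
scores = score_entities(cables)
top_2005 = scores["2005"].most_common(1)
```

Counting cables rather than raw mentions keeps one long cable from dominating the ranking for a period.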

slide-7
SLIDE 7

Progress

• For Iran RPO Dubai: 3853 entities in total, such as 'IRIG', 'supreme leader Khamenei', 'Khatami', 'Mousavi', 'Islamic Revolution', 'Middle East'.

• For Islamabad: 'Kashmir', 'Balochistan', 'Musharraf', 'North West Frontier Province'.

• For New Delhi: 'PM Manmohan Singh', 'BJP', 'NSSP', 'Tsunami Relief'.
slide-8
SLIDE 8

LDA Results for Islamabad

Relief operation by UN:

['0.211*"usaid/dart" + 0.178*"relief" + 0.115*"water" + 0.114*"earthquake" + 0.113*"shelter" + 0.112*"tents" + 0.103*"october" + 0.101*"u.n." + 0.097*"sanitation" + 0.095*"food"']

Existence of extremists in madrassa:

["0.018*ssp + 0.016*( + 0.012*2005 + 0.010*groups + 0.010*domestic + 0.010*leaders + 0.010*extremist + 0.010*madrassa + 0.009*'s + 0.008*its", '0.000*rns. + 0.000*opened + 0.000*increase + 0.000*2005. + 0.000*receiving + 0.000*viable + 0.000*shows + 0.000*rebuilding + 0.000*e. + 0.000*jalil']

slide-9
SLIDE 9

LDA Results for New Delhi

Nuclear Deal:

['0.115*"saran" + 0.113*"bjp" + 0.109*"nuclear" + 0.107*"congress" + 0.105*"jaishankar" + 0.103*"king" + 0.099*"pakistan" + 0.097*"nssp" + 0.094*"nepal" + 0.080*"iraq"']
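The LDA results on these slides are strings of the form `weight*term + weight*term + ...`. The sketch below shows one way such a string could be turned into `(term, weight)` pairs for sorting or plotting. The helper name `parse_topic` is hypothetical; the slides do not say which library printed the topics, so the parser assumes nothing beyond the string shape shown.

```python
import re
from typing import List, Tuple

# A weight, a literal "*", then the term (possibly wrapped in double quotes).
_PART = re.compile(r'\s*([0-9]*\.?[0-9]+)\*(.+?)\s*$')


def parse_topic(topic: str) -> List[Tuple[str, float]]:
    """Split 'w1*t1 + w2*t2 + ...' into [(t1, w1), (t2, w2), ...]."""
    pairs = []
    for part in topic.split(" + "):
        match = _PART.match(part)
        if match is None:
            continue  # skip fragments that do not look like weight*term
        weight = float(match.group(1))
        # Strip straight and curly double quotes; leave other characters alone.
        term = match.group(2).strip('"\u201c\u201d')
        pairs.append((term, weight))
    return pairs


# First three terms of the New Delhi topic shown on the slide.
nuclear = parse_topic('0.115*"saran" + 0.113*"bjp" + 0.109*"nuclear"')
```

Here `nuclear` holds `[("saran", 0.115), ("bjp", 0.113), ("nuclear", 0.109)]`; unquoted terms such as those in the Islamabad madrassa topic are parsed the same way.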

slide-10
SLIDE 10

References

@InProceedings{oconnor-stewart-smith-13_extracting-intl-relations-from-political-context,
  author = {O'Connor, Brendan and Stewart, Brandon M. and Smith, Noah A.},
  title = {Learning to Extract International Relations from Political Context},
  booktitle = {Proc. 51st ACL (Long papers)},
  month = {August},
  year = {2013},
  pages = {1094--1104},
  url = {http://www.aclweb.org/anthology/P13-1108}
}