Twitter Data Processing with MongoDB By Ama & Sameera - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Twitter Data Processing with MongoDB

By Ama & Sameera

slide-2
SLIDE 2

Introduction

  • Create a Twitter developer account
  • Get an access key
  • Access the REST API
  • Execute some POST and GET queries
  • Download a sample of Twitter streaming data
  • Analyze a single tweet object (JSON format)
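
To make the last step concrete, the sketch below parses one tweet-like JSON object and pulls out the two fields that the queries later in this deck rely on (`text` and `user.time_zone`). The sample values and the helper name `summarizeTweet` are invented for illustration; a real tweet carries many more fields.

```javascript
// Hypothetical, heavily trimmed tweet; all values are invented for illustration.
const rawTweet = JSON.stringify({
  id_str: "1",
  text: "Happy Sunday from Paris!",
  user: { screen_name: "example_user", time_zone: "Paris" },
});

// Parse the JSON text and keep only the fields of interest.
function summarizeTweet(json) {
  const tweet = JSON.parse(json);
  return {
    text: tweet.text,
    // user or time_zone may be missing, so fall back to null.
    timeZone: tweet.user && tweet.user.time_zone ? tweet.user.time_zone : null,
  };
}

console.log(summarizeTweet(rawTweet));
```

Pasting a raw tweet into a JSON viewer gives the same picture interactively; the code form is handy when many tweets need to be checked.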
slide-3
SLIDE 3

Running Hadoop

slide-4
SLIDE 4

Twitter Application

slide-5
SLIDE 5

Flume configuration

slide-6
SLIDE 6

Flume: data streaming

slide-7
SLIDE 7

Hadoop File System

slide-8
SLIDE 8

Running MongoDB services

slide-9
SLIDE 9

Twitter data import

slide-10
SLIDE 10

Data Structure

http://www.jsoneditoronline.org/

slide-11
SLIDE 11

Data Mining

slide-12
SLIDE 12

Tweets Per Topic

db.finaltwitterdata.aggregate([
  { $match: { $or: [ { 'text': { $regex: ".*Sunday.*" } }, { 'text': { $regex: ".*sunday.*" } } ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])
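
The query above lists one pattern per capitalization, so a tweet containing "SUNDAY" would not be counted. The plain JavaScript sketch below, run on invented sample texts, compares that approach with a single case-insensitive pattern. In MongoDB the equivalent is to add `$options: "i"` next to `$regex`; whether the original analysis wanted the broader match is an assumption.

```javascript
// Invented sample texts; not taken from the collected data set.
const texts = ["Lazy Sunday", "see you sunday", "SUNDAY FUNDAY", "Monday blues"];

// Mirrors the $or of two case-specific patterns used in the query.
const twoPatterns = texts.filter(
  (t) => /.*Sunday.*/.test(t) || /.*sunday.*/.test(t)
);

// One case-insensitive pattern; the leading and trailing .* are not needed
// because an unanchored regex already matches anywhere in the string.
const caseInsensitive = texts.filter((t) => /sunday/i.test(t));

console.log(twoPatterns.length, caseInsensitive.length);
```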

slide-13
SLIDE 13

Tweets vs. Time-Zone: Paris

[Chart: tweet count per user time zone; count axis from 1000 to 8000]

db.finaltwitterdata.aggregate([
  { $match: { $or: [ { 'text': { $regex: ".*Paris.*" } }, { 'text': { $regex: ".*paris.*" } } ] } },
  { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])

slide-14
SLIDE 14

Tweets vs. Time-Zone: Thanksgiving

[Chart: tweet count per user time zone; count axis from 1000 to 6000]

db.finaltwitterdata.aggregate([
  { $match: { $or: [ { 'text': { $regex: ".*Thanksgiving.*" } }, { 'text': { $regex: ".*thanksgiving.*" } } ] } },
  { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
])
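
For readers new to the aggregation pipeline, the sketch below reproduces the three stages of the time-zone queries ($match, $group, $sort) in plain JavaScript over a small in-memory array. The sample documents and the function name `countByTimeZone` are invented; this only illustrates what each stage computes and does not replace running the query in MongoDB.

```javascript
// Invented sample documents shaped like the fields the pipeline reads.
const tweets = [
  { text: "Happy Thanksgiving!", user: { time_zone: "Eastern Time (US & Canada)" } },
  { text: "thanksgiving dinner", user: { time_zone: "Pacific Time (US & Canada)" } },
  { text: "Thanksgiving travel", user: { time_zone: "Eastern Time (US & Canada)" } },
  { text: "Unrelated tweet", user: { time_zone: null } },
];

function countByTimeZone(docs, patterns) {
  const counts = new Map();
  for (const doc of docs) {
    // $match with $or: keep the document if any pattern matches its text.
    if (!patterns.some((p) => p.test(doc.text))) continue;
    // $group on user.time_zone with count: { $sum: 1 }.
    const zone = doc.user ? doc.user.time_zone : null;
    counts.set(zone, (counts.get(zone) || 0) + 1);
  }
  // $sort: { count: -1 } puts the largest group first.
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const result = countByTimeZone(tweets, [/Thanksgiving/, /thanksgiving/]);
console.log(result);
```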

slide-15
SLIDE 15

American Music Awards (AMA) 2015

slide-16
SLIDE 16

AMA: Artist of the year

db.finaltwitterdata.aggregate([
  { $match: { $or: [ { 'text': { $regex: ".*Nicki Minaj.*" } }, { 'text': { $regex: ".*@NICKIMINAJ.*" } }, { 'text': { $regex: ".*nicki minaj.*" } } ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])

slide-17
SLIDE 17

AMA: Performances

db.finaltwitterdata.aggregate([
  { $match: { $or: [ { 'text': { $regex: ".*5SOS.*" } }, { 'text': { $regex: ".*5 Seconds Of Summer.*" } }, { 'text': { $regex: ".*5 Seconds of Summer.*" } }, { 'text': { $regex: ".*5 seconds of summer.*" } } ] } },
  { $group: { _id: null, count: { $sum: 1 } } }
])

slide-18
SLIDE 18

AMA: Favorite Electronic Dance Music Artist

slide-19
SLIDE 19

Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture

Research Paper

slide-20
SLIDE 20

Introduction

After significant breaking news events, Twitter aims to provide relevant results within minutes, typically ten minutes.

Related query suggestion is a feature that most searchers are likely familiar with, e.g. when typing “Obama”.

Two systems were built to achieve this target, but only one was eventually deployed:

  • The first implementation was based on a typical Hadoop-based analytics stack.
  • The second implementation, which was eventually deployed, is a custom in-memory processing engine.

slide-21
SLIDE 21

Problem definition

  • "Search assistance" @ Twitter
  • The Twitter context introduces a real-time "twist".
  • At Twitter, search assistance needs to be provided in real time and must dynamically adapt to the rapidly evolving "global conversation".
  • The architecture considers three aspects of data (volume, velocity, and variety) and addresses the challenges of real-time data processing in the era of "big data".

slide-22
SLIDE 22

First approach: Hadoop

  • The first solution sought to take advantage of Twitter's existing analytics platform: Hadoop.
  • Incorporated into its Hadoop platform are components such as Pig, HBase, ZooKeeper, and Vertica.
  • Data is written to the Hadoop Distributed File System (HDFS) via a number of real-time and batch processes.
  • Instead of directly writing Hadoop code in Java, analytics at Twitter is performed mostly using Pig.

slide-23
SLIDE 23

Hadoop Platform

slide-24
SLIDE 24

Disadvantages

  • Although the system worked reasonably well in terms of output, latency was estimated in hours.
  • This is far from the targeted 10 minutes.
  • The latency is primarily attributed to:
      • The data import pipeline moving data from tens of thousands of production hosts into HDFS
      • MapReduce jobs
slide-25
SLIDE 25

New approach: In-memory processing engine

slide-26
SLIDE 26

New approach: Search Assistance Engine

The search assistance engine consists of:

  • A lightweight frontend serving requests from an in-memory cache.
  • A backend that consumes the firehose and query hose to compute related query suggestions and spelling corrections.

slide-27
SLIDE 27

Dataflow

The query path: as a query from a given user is delivered through the query hose, the following actions are taken:

  • Query statistics are updated in the query statistics store.
  • The query is added to the sessions store.
  • For each previous query in the session, a query co-occurrence is formed with the new query.
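
The three actions on the query path can be sketched in a few lines of JavaScript. This is a minimal illustration under stated assumptions: the store names (`queryStats`, `sessions`, `cooccurrences`), the function `onQuery`, and the use of simple counters are hypothetical, and anything the slides do not describe, such as how suggestions are ranked, is left out.

```javascript
// Hypothetical in-memory stores; names and shapes are illustrative only.
const queryStats = new Map();    // query -> number of times seen
const sessions = new Map();      // sessionId -> queries issued so far
const cooccurrences = new Map(); // JSON [previous, new] pair -> count

function onQuery(sessionId, query) {
  // 1. Update the query statistics store.
  queryStats.set(query, (queryStats.get(query) || 0) + 1);

  // 2. Add the query to the sessions store.
  const session = sessions.get(sessionId) || [];
  session.push(query);
  sessions.set(sessionId, session);

  // 3. Form a co-occurrence between each previous query and the new one.
  for (const previous of session.slice(0, -1)) {
    const key = JSON.stringify([previous, query]);
    cooccurrences.set(key, (cooccurrences.get(key) || 0) + 1);
  }
}

// Example: two queries in one session, one query in another.
onQuery("s1", "obama");
onQuery("s1", "white house");
onQuery("s2", "obama");
console.log(queryStats, cooccurrences);
```

Keeping all three stores in memory is what lets each incoming query be processed immediately, in contrast to the batch-oriented Hadoop approach described earlier.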

slide-28
SLIDE 28

Conclusion

  • Although the experience was instructive, the authors hope that future system designers can benefit from their story and build the right solution the first time.
  • It would be desirable to build a generic data processing platform capable of handling both “big data” and “fast data”.

slide-29
SLIDE 29

Thank you ☺

slide-30
SLIDE 30

Questions?