Twitter Data Processing with MongoDB By Ama & Sameera


  1. Twitter Data Processing with MongoDB By Ama & Sameera

  2. Introduction
     - Create a Twitter developer account
     - Get an access key
     - Access the REST API
     - Execute some POST and GET queries
     - Download a sample of Twitter streaming data
     - Analyze a single object: a tweet (JSON format)

  3. Running Hadoop

  4. Twitter Application

  5. Flume configuration

  6. Flume- data streaming

  7. Hadoop File System

  8. Running MongoDB services

  9. Twitter data import

  10. Data Structure
      - http://www.jsoneditoronline.org/
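Each tweet is a nested JSON document, and the queries on the later slides read nested fields such as `user.time_zone`. The sketch below shows how such a dotted path resolves against a document. The sample tweet is hypothetical and heavily trimmed (only `text` and `user.time_zone` are taken from the slides), and `getPath` is an illustrative helper, not a MongoDB function.

```javascript
// Hypothetical, heavily trimmed tweet; a real tweet object has many more fields.
const sampleTweet = {
  text: "Happy Sunday from Paris",
  user: { time_zone: "Paris" }
};

// Resolve a dotted path such as "user.time_zone" against a nested document.
// Returns undefined as soon as any step of the path is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value == null ? undefined : value[key]),
    doc
  );
}

console.log(getPath(sampleTweet, "text"));
console.log(getPath(sampleTweet, "user.time_zone"));
```

A path that does not exist in the document, for example `getPath(sampleTweet, "place.country")`, yields `undefined` rather than throwing.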

  11. Data Mining

  12. Tweets Per Topic

          db.finaltwitterdata.aggregate([
            { $match: { $or: [
                { 'text': { $regex: ".*Sunday.*" } },
                { 'text': { $regex: ".*sunday.*" } }
            ] } },
            { $group: { _id: null, count: { $sum: 1 } } }
          ])
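The topic count can also be expressed with a single case-insensitive match instead of one `$or` branch per capitalization. The sketch below is an alternative, not the query the authors ran: `topicCountPipeline` is a hypothetical helper, it assumes the topic is a plain word with no regex metacharacters, and it also counts variants such as "SUNDAY", so its result can differ from the slide's.

```javascript
// Build an aggregation pipeline that counts tweets mentioning a topic.
// `topicCountPipeline` is a hypothetical helper, not part of MongoDB.
function topicCountPipeline(topic) {
  return [
    // An unanchored regex matches the topic anywhere in the text,
    // and the "i" option makes the match case-insensitive.
    { $match: { text: { $regex: topic, $options: "i" } } },
    // A single group (_id: null) so that $sum: 1 counts every matching tweet.
    { $group: { _id: null, count: { $sum: 1 } } }
  ];
}

const sundayPipeline = topicCountPipeline("sunday");
console.log(JSON.stringify(sundayPipeline));
```

In the mongo shell, the pipeline would be passed as `db.finaltwitterdata.aggregate(topicCountPipeline("sunday"))`.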

  13. Tweets vs. Time-Zone: Paris

          db.finaltwitterdata.aggregate([
            { $match: { $or: [
                { 'text': { $regex: ".*Paris.*" } },
                { 'text': { $regex: ".*paris.*" } }
            ] } },
            { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
            { $sort: { count: -1 } }
          ])

      (Chart: tweet count per time zone, y-axis 0 to 8000)

  14. Tweets vs. Time-Zone: Thanksgiving

          db.finaltwitterdata.aggregate([
            { $match: { $or: [
                { 'text': { $regex: ".*Thanksgiving.*" } },
                { 'text': { $regex: ".*thanksgiving.*" } }
            ] } },
            { $group: { _id: "$user.time_zone", count: { $sum: 1 } } },
            { $sort: { count: -1 } }
          ])

      (Chart: tweet count per time zone, y-axis 0 to 6000)
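To make the match, group, and sort steps of the time-zone queries concrete, here is a small in-memory illustration in plain JavaScript. It is a sketch under assumptions: `countByTimeZone` and the sample tweets are hypothetical, and the keyword match is a case-insensitive substring test, which is a simplification of the two-variant regex used on the slides.

```javascript
// Count tweets containing `keyword` per user time zone, sorted by descending
// count. Mirrors the $match -> $group -> $sort stages, but in memory.
function countByTimeZone(tweets, keyword) {
  const needle = keyword.toLowerCase();
  const counts = new Map();
  for (const tweet of tweets) {
    // "Match" step: skip tweets whose text does not contain the keyword.
    if (!tweet.text || !tweet.text.toLowerCase().includes(needle)) continue;
    // "Group" step: tweets without a time zone are collected under null.
    const zone = (tweet.user && tweet.user.time_zone) || null;
    counts.set(zone, (counts.get(zone) || 0) + 1);
  }
  // "Sort" step: largest count first.
  return [...counts.entries()]
    .map(([zone, count]) => ({ _id: zone, count }))
    .sort((a, b) => b.count - a.count);
}

// Hypothetical sample data.
const sampleTweets = [
  { text: "Paris in the rain", user: { time_zone: "Paris" } },
  { text: "Thinking of paris today", user: { time_zone: "London" } },
  { text: "Flying to Paris", user: { time_zone: "Paris" } },
  { text: "Happy Sunday", user: { time_zone: "Paris" } },
  { text: "Paris!", user: {} }
];

const parisByZone = countByTimeZone(sampleTweets, "Paris");
console.log(parisByZone);
```

With this sample, the "Paris" time zone comes first with two matching tweets, and the tweet with no time zone ends up in its own `null` group.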

  15. American Music Awards (AMA) 2015

  16. AMA: Artist of the Year

          db.finaltwitterdata.aggregate([
            { $match: { $or: [
                { 'text': { $regex: ".*Nicky Minaj.*" } },
                { 'text': { $regex: ".*@NICKYMINAJ.*" } },
                { 'text': { $regex: ".*nicky minaj.*" } }
            ] } },
            { $group: { _id: null, count: { $sum: 1 } } }
          ])

  17. AMA: Performances

          db.finaltwitterdata.aggregate([
            { $match: { $or: [
                { 'text': { $regex: ".*5SOS.*" } },
                { 'text': { $regex: ".*5 Seconds Of Summer.*" } },
                { 'text': { $regex: ".*5 Seconds of Summer.*" } },
                { 'text': { $regex: ".*5 seconds of summer.*" } }
            ] } },
            { $group: { _id: null, count: { $sum: 1 } } }
          ])
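When an artist is mentioned under several names, the list of `$or` branches grows quickly. One possible alternative, sketched below, folds the aliases into a single case-insensitive alternation. `escapeRegex` and `aliasCountPipeline` are hypothetical helpers, and because the match ignores case it can count more tweets than the slide's query.

```javascript
// Escape regex metacharacters so that each alias is matched literally.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: a single $match whose pattern is the alternation
// of all aliases, followed by a $group that counts the matching tweets.
function aliasCountPipeline(aliases) {
  const pattern = aliases.map(escapeRegex).join("|");
  return [
    { $match: { text: { $regex: pattern, $options: "i" } } },
    { $group: { _id: null, count: { $sum: 1 } } }
  ];
}

const fiveSosPipeline = aliasCountPipeline(["5SOS", "5 Seconds of Summer"]);
console.log(fiveSosPipeline[0].$match.text.$regex);
```

The resulting pipeline would be passed to `db.finaltwitterdata.aggregate(...)` in the same way as the queries above.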

  18. AMA: Favorite Electronic Dance Music Artist

  19. Research Paper: Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture

  20. Introduction
      - After significant breaking news events, Twitter aims to provide relevant results within minutes, typically ten.
      - Related query suggestion is a feature that most searchers are likely familiar with, e.g. typing “Obama”
      - Two systems were built to meet this target, but only one was eventually deployed:
        - The first implementation was based on a typical Hadoop-based analytics stack.
        - The second implementation, which was eventually deployed, is a custom in-memory processing engine.

  21. Problem definition
      - "Search assistance" @ Twitter
      - The Twitter context introduces a real-time "twist"
      - At Twitter, search assistance needs to be provided in real time and must dynamically adapt to the rapidly evolving "global conversation".
      - The architecture considers three aspects of data (volume, velocity, and variety) and addresses the challenges of real-time data processing in the era of "big data".

  22. First approach: Hadoop
      - The first solution sought to take advantage of Twitter's existing analytics platform: Hadoop.
      - Its Hadoop platform incorporates components such as Pig, HBase, ZooKeeper, and Vertica.
      - Data is written to the Hadoop Distributed File System (HDFS) via a number of real-time and batch processes.
      - Instead of writing Hadoop code directly in Java, analytics at Twitter is performed mostly using Pig.

  23. Hadoop Platform

  24. Disadvantages
      - Although the system worked reasonably well in terms of output, latency was estimated in hours.
      - This is far from the targeted 10 minutes.
      - The latency is primarily attributed to:
        - The data import pipeline, which moves data from tens of thousands of production hosts onto HDFS
        - MapReduce jobs

  25. New approach: In-memory processing engine

  26. New approach: Search Assistance Engine
      The search assistance engine consists of:
      - A lightweight frontend serving requests from an in-memory cache
      - A backend that consumes the firehose and query hose to compute related query suggestions and spelling corrections

  27. Dataflow
      The query path: as a query from a given user is delivered through the query hose, the following actions are taken:
      - Query statistics are updated in the query statistics store.
      - The query is added to the sessions store.
      - For each previous query in the session, a query co-occurrence is formed with the new query.
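The three actions of the query path can be sketched with plain in-memory maps. This is only an illustration of the steps listed on the slide, not Twitter's implementation: the store layouts, the `processQuery` function, the session identifiers, and the `"earlier -> later"` key format are all assumptions.

```javascript
// Hypothetical in-memory stores; the slides name the stores but not their layout.
const queryStats = new Map();     // query -> number of times it was seen
const sessions = new Map();       // session id -> queries issued so far, in order
const cooccurrences = new Map();  // "earlier -> later" -> number of times paired

function processQuery(sessionId, query) {
  // Update the query statistics store.
  queryStats.set(query, (queryStats.get(query) || 0) + 1);

  // Form a co-occurrence between each previous query in the session and the
  // new query. This runs before the append so a query is not paired with itself.
  const previous = sessions.get(sessionId) || [];
  for (const earlier of previous) {
    const key = `${earlier} -> ${query}`;
    cooccurrences.set(key, (cooccurrences.get(key) || 0) + 1);
  }

  // Add the query to the sessions store.
  sessions.set(sessionId, [...previous, query]);
}

// Hypothetical traffic: one session with three queries, another with one.
processQuery("s1", "obama");
processQuery("s1", "white house");
processQuery("s1", "election");
processQuery("s2", "obama");

console.log([...cooccurrences.keys()]);
```

Co-occurrence counts accumulated this way are the kind of raw signal from which related query suggestions could later be ranked; the ranking itself is outside this sketch.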

  28. Conclusion
      - Although the experience was instructive, the authors hope that future system designers can benefit from their story and build the right solution the first time.
      - It would be desirable to build a generic data processing platform capable of handling both “big data” and “fast data”.

  29. Thank you ☺

  30. Questions?
