TEAM DATA Abhishek Raval Manikantha Dronamraju

OBJECTIVE
➢ Use the Bibliographic Data Set to generate interesting observations.
➢ Pre-process nodes and edges to generate records of the form
[PageId | PaperName | AdjList of Citations | AdjListSize | Number of times this page is cited | List of Authors (authorId:authorName:totalPublicationsOfAuthor) | Conference (conferenceId:conferenceName:conferencePublicationCount) | List of Terms (termId:keyword)]
from node data in the following format:
0 \tab term \tab systems \tab 0
75568 \tab paper \tab High-speed three dimensional laser sensor \tab 2000
1323803 \tab author \tab Phillip S. Yu \tab 550
1318800 \tab conf \tab IEEE Transactions on Information Theory \tab 7346
connected by edges of the form
1004048 \tab 568
➢ Find the paper with the highest PageRank. Is it the one with the most citations?
➢ Predict the keywords of any paper using the K Nearest Neighbour algorithm.
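A minimal sketch of how the tab-separated node and edge lines above could be read, written in Python for brevity rather than as the Hadoop code the team used. The names `Node`, `parse_node` and `parse_edge` are hypothetical, and the meaning of the fourth node field (called `value` here) is an assumption, since the slides do not define it.

```python
from typing import NamedTuple, Tuple


class Node(NamedTuple):
    node_id: int
    kind: str   # "term", "paper", "author" or "conf"
    name: str
    value: int  # fourth field; its interpretation is an assumption


def parse_node(line: str) -> Node:
    """Split one tab-separated node line into its four fields."""
    node_id, kind, name, value = line.rstrip("\n").split("\t")
    return Node(int(node_id), kind, name, int(value))


def parse_edge(line: str) -> Tuple[int, int]:
    """Split one tab-separated edge line into a pair of node ids."""
    src, dst = line.split("\t")
    return int(src), int(dst)
```

For example, `parse_node("1323803\tauthor\tPhillip S. Yu\t550")` yields a `Node` whose `kind` is `"author"`.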

IS PRE-PROCESSING EASY?
In class we talked a lot about pre-processing, but is it really just a matter of processing and reformatting data to fit our needs? What happens when every node carries multi-attribute values, each node has its own id, and the only way to connect them is a lookup in edges.txt? Let's find out more about that.

WHAT DID IT TAKE TO PRE-PROCESS DATA?
➢ 4 jobs, each doing a map-side join (joining papers with other papers, authors, conferences and keywords)
➢ 1 MR job for computing the citation count of each paper
➢ 1 final MR job for merging all the attributes
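The idea behind a map-side join can be sketched in plain Python: the smaller side of the join is held in memory as a lookup table, and each edge is joined against it as it streams past, with no reduce phase. This is an illustration under stated assumptions, not the team's Hadoop code. The function names are hypothetical, and it assumes that the paper id is the first field of each edge.

```python
from collections import defaultdict


def load_lookup(node_lines, kind):
    """Build the in-memory side of the join: id -> name for one node type."""
    table = {}
    for line in node_lines:
        node_id, node_kind, name, _ = line.rstrip("\n").split("\t")
        if node_kind == kind:
            table[int(node_id)] = name
    return table


def map_side_join(edge_lines, papers, others):
    """For each edge (paper, other), collect 'otherId:otherName' per paper.

    Assumes the paper id is the first field of the edge (an assumption).
    """
    joined = defaultdict(list)
    for line in edge_lines:
        src, dst = (int(x) for x in line.split("\t"))
        if src in papers and dst in others:
            joined[src].append(f"{dst}:{others[dst]}")
    return dict(joined)
```

Running the same join once per attribute type (papers, authors, conferences, keywords) mirrors the four jobs listed above.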

PAGE RANK
Recursive formula:
➢ Each page gets an initial PageRank of 1/N, where N is the number of nodes.
➢ A page's rank is then calculated from the ranks of its incoming links, based on the formula
PR(n) = (1 - d)/N + d · Σ_i PR(N_i)/C(N_i)
➢ PR(n) is the PageRank of node n.
➢ PR(N_i) is the PageRank of N_i, a node that links to n.
➢ C(N_i) is the number of nodes in the adjacency list of N_i.
➢ N is the number of pages.
➢ d is the damping factor, generally set to 0.85.
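One iteration of this update can be sketched as follows. It is a single-machine Python illustration, not the MapReduce job itself. The function name `pagerank_step` and the dictionary-based graph representation are assumptions, and this basic version passes no mass on from nodes with an empty adjacency list.

```python
def pagerank_step(ranks, adjacency, d=0.85):
    """Apply PR(n) = (1 - d)/N + d * sum(PR(Ni)/C(Ni)) once to every node.

    ranks:     dict mapping node -> current PageRank
    adjacency: dict mapping node -> list of nodes it links to
    """
    n = len(ranks)
    new_ranks = {page: (1 - d) / n for page in ranks}
    for page, links in adjacency.items():
        if not links:
            continue  # nodes without outgoing links pass nothing on here
        share = ranks[page] / len(links)
        for target in links:
            new_ranks[target] += d * share
    return new_ranks
```

Starting from `ranks = {page: 1 / N for page in pages}` and calling the function repeatedly gives the iterative computation described above.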

EXCEUTION  Internal Execution

DANGLING NODE
➢ A few nodes have no outbound links; these are known as dangling nodes.
➢ Since a dangling node has no outgoing link, the PageRank mass that reaches it is not passed on and is lost at every iteration.
➢ To handle this and distribute that mass equally among all nodes, we modify the formula as below:
PR(n) = (1 - d)/N + d · (Σ_i PR(N_i)/C(N_i) + δ/N)
➢ where δ is the dangling mass.
➢ To solve this in one MR job and avoid unnecessary overhead, we used a global counter to accumulate the mass at every reducer.
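The modified update can be sketched in Python as below. The local variable `dangling_mass` stands in for the global counter mentioned above. The function name and the dictionary-based graph are assumptions made for illustration, and the sketch runs on one machine rather than as a MapReduce job.

```python
def pagerank_step_with_dangling(ranks, adjacency, d=0.85):
    """One PageRank iteration that redistributes the dangling mass evenly.

    ranks:     dict mapping node -> current PageRank
    adjacency: dict mapping node -> list of nodes it links to
    """
    n = len(ranks)
    incoming = {page: 0.0 for page in ranks}
    dangling_mass = 0.0  # plays the role of the global counter
    for page, rank in ranks.items():
        links = adjacency.get(page, [])
        if links:
            share = rank / len(links)
            for target in links:
                incoming[target] += share
        else:
            dangling_mass += rank  # no outgoing link: collect the mass
    return {
        page: (1 - d) / n + d * (incoming[page] + dangling_mass / n)
        for page in ranks
    }
```

With this correction the ranks keep summing to 1 after each iteration, which is a convenient sanity check.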

EXECUTION TIME
Global Counter                                | MR Job
Iterations  Machines  Execution Time (Min)    | Iterations  Machines  Execution Time (Min)
20          1         37                      | 20          1         23
20          5         36                      | 20          5         22
20          10        37                      | 20          10        23

K NEAREST NEIGHBOUR
Iterate through each test case in testDataSet:
➢ Compute the distance between testPaper and each node in our dataset, and emit K nodes from each mapper task in the format (null, (distance, keyword)).
➢ Distance can be computed using Euclidean distance, Manhattan distance, etc. Since we were dealing with strings, we used edit distance.
➢ We considered AuthorName, ConferenceName, PaperName, the adjacency list of the paper, and Keywords to build our metric.
➢ Among the K nearest nodes, the most frequent keyword is the majority and becomes the output, as it is the closest to the test data.
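The steps above can be sketched in Python as follows. This is a simplified single-machine illustration: it compares paper titles only, whereas the slides combine several attributes, and the function names `edit_distance` and `predict_keyword` are hypothetical.

```python
import heapq
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def predict_keyword(test_title, training, k=3):
    """Return the most frequent keyword among the k nearest training papers.

    training: list of (paper title, keyword) pairs.
    """
    nearest = heapq.nsmallest(
        k, ((edit_distance(test_title, title), kw) for title, kw in training)
    )
    votes = Counter(kw for _, kw in nearest)
    return votes.most_common(1)[0][0]
```

In a distributed version, each mapper would emit its own K nearest candidates and a single reducer would take the overall K nearest before voting.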

EXECUTION (figure: Internal Execution)

EXECUTION TIME
Time taken on AWS
#Test Cases  Machines  Execution Time
15           1         35
15           5         23
15           10        11

CONCLUSION
➢ The paper with the highest PageRank is different from the paper that has been cited the most.
➢ The most cited paper is not distributed across all nodes in the data set,
➢ whereas the paper with the highest PageRank is distributed across many adjacency lists.
➢ The predictions of KNN deviate a lot depending on the selected value of K.
➢ Selecting the value of K carefully can tune the KNN predictions significantly.
➢ Pre-processing can be a challenging task: allow enough time for it, or else be very careful when selecting a dataset.
