Further plans and available data sets - PowerPoint PPT Presentation



SLIDE 1

Further plans and available data sets for research in directed networks

Andras Benczur, Institute for Computer Science and Control, Hungarian Academy of Sciences

benczur@sztaki.mta.hu http://datamining.sztaki.hu

14 June 2013

Supported by the EC FET Open project "New tools and algorithms for directed network analysis" (NADINE No 288956)

SLIDE 2

Overview

  • Web classification, ClueWeb12
  • Temporal ranking, learning to rank
  • Metadata extraction from PDF publications
  • Plagiarism detection
  • Twitter: 1 TB data available, user graph collection in progress (for Andreas’ data)

  • Distributed systems for very large problems

Hardware

  • 50-node old dual-core Hadoop cluster
  • 5-node new Hadoop/HBase cluster
  • 260 TB net Isilon storage

SLIDE 3

Automatic metadata extraction

  • Careful selection of open-source PDF converters
  • Feature generation
    • font size, face, upper/lower case, numeric characters, symbols
    • location (centered, vertical position, spacing, page number)
    • entity list (names, institutions)
  • Manual training for a Hungarian journal in Economics
  • Automatic training planned by using publication DBs
  • Selection of machine learning methods
    • Random forest is best; LogitBoost with trees is second best
    • Conditional random fields sound nice but are not nearly as good as claimed
  • Extraction depends on what we can train (manually label)
    • Author, title, institution
    • References extracted in structured form
    • Tables, figure captions
    • …?
SLIDE 4

Plagiarism detection

  • BonFIRE: Future Internet Research and Experimentation testbed
  • KOPI: a plagiarism detection toolkit
    • http://kopi.sztaki.hu/
  • Translation plagiarism (English and Hungarian)
  • Now serving English Wikipedia
  • Service puts a very heavy load on the search index (sentence-based checks, existing suboptimal code)
  • Index ported to several distributed key-value stores
  • We feed it with Web data
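A minimal sketch of the sentence-based check: fingerprint each sentence and look it up in a key-value store. The function names and the MD5 choice are assumptions for illustration, not KOPI's actual implementation.

```python
import hashlib

# Hypothetical sketch: sentence fingerprints as keys into a key-value store.
def fingerprint(sentence):
    # normalize whitespace and case before hashing
    norm = " ".join(sentence.lower().split())
    return hashlib.md5(norm.encode("utf-8")).hexdigest()

def check(doc_sentences, index):
    """index: fingerprint -> source document id (the key-value store)."""
    return {s: index[fingerprint(s)] for s in doc_sentences
            if fingerprint(s) in index}
```

A real deployment would replace the dict with a distributed store, which is exactly the porting effort described above.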
SLIDE 5
Crosslingual Web Classification

  • Save resources, select by quality and topic
  • Legal regulation (porn, illicit content)
  • Web-scale data (test: ClueWeb09, 25 TB, 0.5 billion English-language docs)
  • We just obtained ClueWeb12

With Julien Masanès and Philippe Rigaux, Internet Memory, Paris

Cross-Lingual Web Spam Classification. Garzó, Daróczy, Kiss, Siklósi, Benczúr. WebQuality 2013 (@WWW)
The Classification Power of Web Features. Erdélyi, Benczúr, Daróczy, Garzó, Kiss, Siklósi. Internet Mathematics, under revision

SLIDE 6

Large set of features

  • Term frequency
    • tf.idf or BM25 scores for frequent terms
  • Content
    • DOM, HTML, HTTP elements
    • Appearance of popular terms
    • Term, n-gram statistics, compressibility
  • Linkage
    • PageRank (truncated variants; ratios)
    • Neighborhood (only approximate counting is possible)
    • TrustRank
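For the link-based features, a minimal power-iteration PageRank sketch under stated assumptions (the toy graph format and parameters are illustrative; the truncated variants used in the feature set differ):

```python
import numpy as np

# Plain power-iteration PageRank over a directed edge list (a sketch, not the
# project's truncated variants).
def pagerank(links, n, damping=0.85, iters=50):
    """links: directed (src, dst) edges over nodes 0..n-1."""
    out_deg = np.zeros(n)
    for s, _ in links:
        out_deg[s] += 1
    pr = np.full(n, 1.0 / n)
    for _ in range(iters):
        nxt = np.full(n, (1.0 - damping) / n)
        for s, d in links:
            nxt[d] += damping * pr[s] / out_deg[s]
        nxt += damping * pr[out_deg == 0].sum() / n  # redistribute dangling mass
        pr = nxt
    return pr
```

Ratios of such scores (e.g. PageRank over in-degree) are the kind of derived link features mentioned above.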
SLIDE 7

Workflow (MapReduce jobs indicated)

SLIDE 8

SZTAKI Web Processing Framework

SLIDE 9

Crosslingual Web Classification

Terms in the English model are translated into Portuguese to classify in the target language. The strongest positive and negative predictions are then used for training a model in the target language.
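The transfer step just described can be sketched as follows. The dictionary, the bag-of-terms document representation, and the function name are all assumptions for illustration, not the actual pipeline.

```python
# Hypothetical sketch of the cross-lingual self-training step: translate the
# term weights of the English model, score target-language documents, and keep
# the strongest predictions as pseudo-labels for training a native model.
def transfer_pseudo_labels(en_model, en2pt, target_docs, k):
    """en_model: term -> weight; en2pt: bilingual dictionary;
    target_docs: list of documents, each a list of terms."""
    pt_model = {en2pt[t]: w for t, w in en_model.items() if t in en2pt}
    def score(doc):
        return sum(pt_model.get(t, 0.0) for t in doc)
    ranked = sorted(target_docs, key=score)
    # strongest negative and strongest positive predictions become labels
    return [(d, 0) for d in ranked[:k]] + [(d, 1) for d in ranked[-k:]]
```

The returned pseudo-labeled documents would then train a classifier directly in the target language.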

SLIDE 10

Temporal Wikipedia Search (Julianna)

SLIDE 11

YAGO: Yet Another Great Ontology

  • By MPII Saarbrücken, derived from Wikipedia, WordNet, and GeoNames
  • 10+ million entities (persons, organizations, cities), 120+ million facts
  • We are developing a visualization similar to the Wikipedia one (previous slide)
SLIDE 12

Temporal trends in blog data

(Example trend terms: Liberation_war, economic, promise, those, engine, this_year, in_effect, fulfill)

SLIDE 13
Temporal trends in blog data

  • Temporal Text Mining: probabilistic models, language models
  • Still in progress, challenging algorithmic issues

(Example trend terms: thesis, case, phd, plagiarism, semmelweis, university, case_discovery)

SLIDE 14

SZTAKI Full Text Search Technology

SLIDE 15

Network Influence in Recommenders

SLIDE 16

Apply for Twitter: retweets

  • Twitter data:
    • topics (~bursts: Occupy Wall Street, ...)
    • Andreas has 4 topics ("10o", "occupy", "20n", "yosoy132")
    • For all topics we have a set of tweets (can be a retweet)
  • In numbers:
    • Follower network: 10^6 users
    • Tweets: ~10^5–10^6 per topic
  • Social network (who follows whom) is missing
    • Needed since we only know the ROOT of a retweet sequence
    • Robert is collecting the network
SLIDE 17

The Matrix Factorization Recommender

Source of the next slides: Domonkos Tikk, CEO, Gravity

SLIDE 18

BRISMF model

  • Biased Regularized Incremental Simultaneous Matrix Factorization
  • Apply regularization to prevent overfitting
  • Bias values to further decrease RMSE
  • Model:

    r̂_ui = b_u + c_i + Σ_{k=1}^{K} p_uk · q_ki

SLIDE 19

BRISMF Learning

  • Loss function (minimized over the training set; e_ui = r_ui − r̂_ui denotes the prediction error):

    min Σ_{(u,i)∈R_train} (r_ui − b_u − c_i − Σ_{k=1}^{K} p_uk q_ki)² + λ (Σ_{u,k} p_uk² + Σ_{i,k} q_ki² + Σ_u b_u² + Σ_i c_i²)

  • SGD update rules:

    Δp_uk = η (e_ui · q_ki − λ · p_uk)
    Δq_ki = η (e_ui · p_uk − λ · q_ki)
    Δb_u = η (e_ui − λ · b_u)
    Δc_i = η (e_ui − λ · c_i)
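The SGD update rules above can be sketched in a few lines of code. The hyperparameter defaults and the training-triple format are illustrative assumptions.

```python
import numpy as np

# Minimal BRISMF-style SGD epoch following the update rules above
# (eta = learning rate, lam = regularization weight lambda).
def brismf_epoch(ratings, P, Q, b, c, eta=0.01, lam=0.05):
    """ratings: iterable of (user, item, rating) training triples.
    P: user factors (n_users x K), Q: item factors (K x n_items),
    b: user biases, c: item biases; all updated in place."""
    for u, i, r in ratings:
        e = r - (b[u] + c[i] + P[u] @ Q[:, i])   # e_ui = r_ui - r̂_ui
        p_old = P[u].copy()                      # simultaneous factor update
        P[u] += eta * (e * Q[:, i] - lam * p_old)
        Q[:, i] += eta * (e * p_old - lam * Q[:, i])
        b[u] += eta * (e - lam * b[u])
        c[i] += eta * (e - lam * c[i])
```

Repeating such epochs over the training triples drives down the regularized squared error of the loss function above.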

SLIDE 20

[Figure: worked numeric example of factorizing the rating matrix R into user factors P and item factors Q]

SLIDE 21
SLIDE 22

[Figure: predicted ratings computed from the factors P and Q fill in the missing entries of the rating matrix R]

SLIDE 23

Influence Learning by Gradient Descent

  • Present influence recommender:
    • heuristic weighted network learning
    • no artist-based learning part
  • Heuristic combination of the influence and factor models
  • Is it likely that user v influences user u on artist a?
  • Can user u be influenced at all in case of artist a?
  • Use the SGD method to learn user and artist factors:

    r̂_uat = Γ(Δt) · (p_v · q_a + b_v + c_i)
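One way to read the formula is a time-decay factor Γ(Δt) weighting a factor-model score. The exponential decay form, the parameter names, and the timescale tau below are assumptions for illustration, not the actual model.

```python
import math

# Hypothetical sketch: time-decayed influence score, assuming an exponential
# decay Gamma(dt) = exp(-dt / tau) weighting the factor-model score.
def influence_score(p_v, q_a, b_v, c_i, dt, tau=3600.0):
    """p_v: influencing user's factor vector, q_a: artist factor vector,
    b_v, c_i: bias terms, dt: elapsed time since v's action."""
    decay = math.exp(-dt / tau)                              # Gamma(delta t)
    base = sum(x * y for x, y in zip(p_v, q_a)) + b_v + c_i  # factor model
    return decay * base
```

Under this reading, an influence observed long ago contributes less to the prediction than a recent one.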

SLIDE 24

Distributed learning?

  • Hadoop has gathered a bad reputation recently
    • Tries to be too robust: keeps writing all temporary data to disk several times
    • Fails beyond a given number of servers
    • Learning and graph problems do more computation on less data than building a Google search index
  • My personal choice of frameworks:
    • GraphLab (Danny Bickson, HUJI)
      • Nearly as efficient as the best possible C++ code
      • But very hard to write
      • We work with them on implementing learning-to-rank methods
    • Stratosphere (Volker Markl, Kostas Tzoumas, TU Berlin)
      • Development coordinated by TU Berlin with many partners, incl. us
      • Promises to simplify complex workflows like the spam filter
  • Yet what many applications need would be:
    • Streaming (read data only once, no batch computations)
    • Fully distributed: no Facebook, Google, or Netflix knowing each and every online action in our lives; have P2P learning instead

SLIDE 25

A distributed systems comparison slide

“Scalable Machine Learning for Big Data” tutorial at ICDE 2012

SLIDE 26

Mobility Data Stream Processing (Orange D4D)

SLIDE 27

Stream Processing Architecture Overview

The goal is to hide Storm details from the user.

  • Streaming infrastructure pluggable (could combine with Stratosphere)
  • Persistence layer pluggable
SLIDE 28

Conclusions

  • Web classification plans to integrate with BUbiNG; use the SZTAKI cluster to test the crawler
  • Analyze ClueWeb12 and maybe a NADINE crawl?
  • Temporal ranking in Wikipedia – other temporal collections?
  • Use metadata extraction from online publications to infer topics and rich information that is available in full text only (beyond the usual DBLP graph analysis)
  • Network analysis in the plagiarism detection tool?
  • Twitter
    • Understand the 1 TB data
    • Find influences in the user graph that we collect for Andreas’ data
  • Distributed machine learning and graph algorithms
SLIDE 29

Questions?

András Benczúr, Head, Informatics Laboratory and “Big Data” Lab

http://datamining.sztaki.hu/

benczur@sztaki.mta.hu

14 June 2013 Web and Social Media