Data-intensive Programming Lecture #3
Timo Aaltonen, Department of Pervasive Computing
Guest Lectures
- I’ll try to organize two guest lectures
- Oct 14, Tapio Rautonen, Gofore Ltd, Making sense out of your big data
- Oct 7, ???
Outline
- Course Work
- Apache Sqoop
- SQL Recap
- MapReduce Examples
– Inverted Index
– Finding Friends
– Computing Page Rank
- (Hadoop)
– Combiner
– Other programming languages
Course Work
- MySportShop is a sports gear retailer. All sales happen
online in their webstore. Examples of their products are different game jerseys and sport watches.
- The webstore has an Apache web server for the incoming
HTTP requests. The web server logs all traffic to a log file.
– Using these logs, one can study the browsing behavior of the users.
- The sales data of MySportShop is in PostgreSQL, which is
a relational database. Among other things, the database has a table order_items containing data of all sales events of the shop.
Course Work: Questions
- Based on the data, answer the following questions
- 1. What are the top-10 best selling products in terms of
total sales?
- 2. What are the top-10 browsed products?
- 3. What anomaly is there between these two?
- 4. What are the most popular browsing hours?
Course Work
- Since the managers of the company don’t use Hadoop
but an RDBMS, all the data must be transferred to PostgreSQL
- In order to do that
– Transfer Apache logs (with Apache Flume) to the HDFS
– Compute the frequencies of viewing of different products using MapReduce (Question 2)
– Compute the viewing hour data with MapReduce (Q4)
– Transfer the results (with Apache Sqoop) to PostgreSQL
– Find answers to the questions in PostgreSQL using SQL (Q1-4)
Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
– We offer you a virtual machine on which all required software and data have been installed
– In the next weekly exercises, assistants will solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from TUT cloud
– All required software and data are installed
– No graphical user interface
– Guidance available in the weekly exercises
3. You can use your own installation or a cloud service
– No help from the course personnel
Course Work
- The work is done in groups of three
– Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
– opens today at 10 o'clock
- Deadline is Oct 14th
- Instructions for returning will be published later
– IntelliJ IDEA project
Course Work
- Material
– https://flume.apache.org/FlumeUserGuide.html
– https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
– http://hadoop.apache.org/docs/r2.7.3/
– https://www.postgresql.org/docs/9.5/static/index.html
MapReduce
- Simple programming model
- Map is stateless
– allows running map functions in parallel
- Also Reduce can be executed in parallel
- The canonical example is the word count
Inverted Index
- Collating
– Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
– Solution: The mapper computes the given function for each item and emits the value of the function as a key and the item itself as a value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the document ID where the term was found.
Simple Inverted Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, 1; doc, 1; contains, 1; text, 1
my, 2; doc, 2; contains, 2; my, 2; text, 2

- Reduced output: word, list of docIDs
this: 1
doc: 1, 2
contains: 1, 2
text: 1, 2
my: 2
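The map and reduce steps above can be simulated in-process as a small sketch (the helper names are made up; a real Hadoop job would implement Mapper and Reducer classes instead):

```python
from itertools import groupby
from operator import itemgetter

def map_doc(doc_id, text):
    """Mapper: emit a (word, doc_id) pair for every word in the document."""
    for word in text.lower().split():
        yield (word, doc_id)

def build_index(docs):
    """Simulate shuffle + reduce: sort pairs by word, then collect the
    (deduplicated, sorted) list of doc ids for each word."""
    pairs = sorted(p for doc_id, text in docs for p in map_doc(doc_id, text))
    return {word: sorted({d for _, d in group})
            for word, group in groupby(pairs, key=itemgetter(0))}

docs = [(1, "This doc contains text"), (2, "My doc contains my text")]
index = build_index(docs)
# index["doc"] == [1, 2]; index["my"] == [2]
```

Sorting the intermediate pairs plays the role of the shuffle phase: it brings all pairs with the same key to one place, just as Hadoop routes them to one reducer.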
(Normal) Inverted Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, (1,1); doc, (1,1); contains, (1,1); text, (1,1)
my, (2,1); doc, (2,1); contains, (2,1); my, (2,1); text, (2,1)

- Reduced output: word, list of (docID, frequency)
this: (1,1)
doc: (1,1), (2,1)
contains: (1,1), (2,1)
text: (1,1), (2,1)
my: (2,2)
Using Inverted Index: Searching
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,2), (2,1), (3,1), (4,1), (5,1)
– ink: (3,1), (4,1), (5,1)
– pink: (4,1), (5,1)
– thing: (3,1)
– wink: (1,1), (5,1)
Using Inverted Index
- Indexing makes search engines fast
- Data is sparse, since most words appear in only a few documents
– (id, val) tuples
– sorted by id
– compact
– very fast
- Linear merge
Index:
he: (1,2), (2,1), (3,1), (4,1), (5,1)
ink: (3,1), (4,1), (5,1)
pink: (4,1), (5,1)
thing: (3,1)
wink: (1,1), (5,1)
Linear Merge
- Find documents matching the query {ink, wink}
– Load the inverted lists for all query words
– Linear merge, O(n)
- n is the total number of items in the two lists
- f() is a scoring function: how well doc matches the query
ink  --> (3,1) (4,1) (5,1)
wink --> (1,1) (5,1)

Matching set:
1: f(0,1)
3: f(1,0)
4: f(1,0)
5: f(1,1)
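The linear merge can be sketched as follows (a minimal sketch with made-up names; it returns, per document, the counts that a scoring function f would consume):

```python
def linear_merge(list_a, list_b):
    """Merge two posting lists sorted by doc id in O(n) time.

    Each list holds (doc_id, term_frequency) tuples; the result maps
    doc_id -> (freq_in_a, freq_in_b), with 0 when the term is absent.
    """
    matches, i, j = {}, 0, 0
    while i < len(list_a) or j < len(list_b):
        da = list_a[i][0] if i < len(list_a) else float("inf")
        db = list_b[j][0] if j < len(list_b) else float("inf")
        if da < db:                      # doc only in list_a
            matches[da] = (list_a[i][1], 0); i += 1
        elif db < da:                    # doc only in list_b
            matches[db] = (0, list_b[j][1]); j += 1
        else:                            # doc in both lists
            matches[da] = (list_a[i][1], list_b[j][1]); i += 1; j += 1
    return matches

ink  = [(3, 1), (4, 1), (5, 1)]
wink = [(1, 1), (5, 1)]
# linear_merge(ink, wink) == {1: (0, 1), 3: (1, 0), 4: (1, 0), 5: (1, 1)}
```

Each doc id then goes through f(count_ink, count_wink) to produce the ranking score.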
Scoring Function
- Specify which docs are matched
– in: counts of query words in a doc
– out: ranking score
- how well doc matches the query
- 0 if document does not match
– Example: Boolean AND, where c(q, D) is the count of query word q in document D:
g(D, Q) = ∏_{q ∈ Q} (1 if c(q, D) > 0, 0 if c(q, D) = 0)
– 1 iff all query words are present
Phrases and Proximity
- Query “pink ink” as a phrase
- Using a regular index:
– match #and(pink, ink) ->
– scan the matched documents for the query string (slow)
- Idea: index all bi-grams as words
– can approximate "drink pink ink"
– fast, but the index size explodes
– inflexible: can't query #5(pink, ink)
- Construct proximity index
D4: The ink he likes to drink is pink.
D5: He likes to wink and drink pink ink.

drink_pink --> (5,1)
pink_ink --> (5,1)
Proximity Index
- Embed position information to the inverted lists
– called a positional/proximity index (prox-list)
– handles arbitrary phrases, windows
– key to "rich" indexing: structure, fields, tags, …
Proximity Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, (1,1); doc, (1,2); contains, (1,3); text, (1,4)
my, (2,1); doc, (2,2); contains, (2,3); my, (2,4); text, (2,5)

- Reduced output: word, list of (docID, position)
this: (1,1)
doc: (1,2), (2,2)
contains: (1,3), (2,3)
text: (1,4), (2,5)
my: (2,1), (2,4)
Proximity Index
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,4)
Using Proximity Index
- Query: “pink ink”
- Linear Merge
– compare the docIDs under the pointers
– if they match, check pos(ink) - pos(pink) = 1
– near operator

ink  --> (3,8) (4,2) (5,8)
pink --> (4,8) (5,7)
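The phrase check can be sketched as follows (a toy sketch with made-up names; for clarity it uses a nested scan, whereas a real engine would linear-merge the two sorted lists):

```python
def phrase_match(prox_a, prox_b, distance=1):
    """Find docs where term B occurs exactly `distance` positions after term A.

    prox_a / prox_b are positional postings: (doc_id, position) tuples.
    """
    hits = []
    for doc_a, pos_a in prox_a:
        for doc_b, pos_b in prox_b:
            # same document and the required positional offset
            if doc_a == doc_b and pos_b - pos_a == distance:
                hits.append(doc_a)
    return hits

pink = [(4, 8), (5, 7)]
ink  = [(3, 8), (4, 2), (5, 8)]
# phrase "pink ink": only doc 5 has pos(ink) - pos(pink) == 1
```

Passing a larger `distance` (or a window check) gives the near operator instead of an exact phrase.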
Structure and Tags
- Documents are not always flat
– meta-data: title, author, date
– structure: part, chapter, section, paragraph
– tags: named entity, link, translation
- Options for dealing with structure
– create a separate index for each field (like in SQL)
– push structure into index values
– construct an extent index
Extent Index
- Special “term” for each element, field or tag
– spans a region of text
- words in the span belong to the field
– allows multiple overlapping spans
– similar to stand-off annotation formats
Extent Index
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,4)
– link: (3,1:2), (4,1:2), (5,7:8)
Using Extent Index
- Query: find an ink-related hyper-link
- Same approach as with proximity
– only now "tag" and "word" must have distance = 0
– Linear merge, match when positions fall into the extent
– amenable to all optimizations

ink  --> (3,8) (4,2) (5,8)
link --> (3,1:2) (4,1:2) (5,7:8)
Overview on Inverted Indices
- Normal
- Positional
– phrases, near operator
- Extent
– metadata, structure
MR Example: Finding Friends
- http://stevekrenzel.com/finding-friends-with-mapreduce
- Facebook could use MapReduce in the following way
MR Example: Finding Friends
- Facebook has a list of friends
– the relation is bidirectional
- FB has lots of disk space and serves millions of requests per day
- Certain results are pre-computed to reduce the processing time of requests
– E.g. "You and Joe have 230 mutual friends"
– The list of common friends is quite stable
– so recalculating it on every request would be wasteful
MR Example: Finding Friends
- Idea: MapReduce is used to calculate the common friends daily and store the results
– later only a quick lookup is needed
- Assume the friends are stored as Person ⟶ [List of friends]
- A ⟶ [B, C, D]
- B ⟶ [A, C, D, E]
- C ⟶ [A, B, D, E]
- D ⟶ [A, B, C, E]
- E ⟶ [B, C, D]
MR Example: Finding Friends
- Each line is input for a mapper
- For every friend in the list of friends, the mapper emits a (key, value) pair, where
– key is (person, friend) if person < friend, and (friend, person) otherwise
– value is the list of the person's friends
MR Example: Finding Friends
map(A, [B, C, D]):
(A, B), [B, C, D]
(A, C), [B, C, D]
(A, D), [B, C, D]

map(B, [A, C, D, E]):
(A, B), [A, C, D, E]
(B, C), [A, C, D, E]
(B, D), [A, C, D, E]
(B, E), [A, C, D, E]

map(C, [A, B, D, E]):
(A, C), [A, B, D, E]
(B, C), [A, B, D, E]
(C, D), [A, B, D, E]
(C, E), [A, B, D, E]

map(D, [A, B, C, E]):
(A, D), [A, B, C, E]
(B, D), [A, B, C, E]
(C, D), [A, B, C, E]
(D, E), [A, B, C, E]

map(E, [B, C, D]):
(B, E), [B, C, D]
(C, E), [B, C, D]
(D, E), [B, C, D]
MR Example: Finding Friends
- After shuffling inputs to the reducers:
(A, B), [[B, C, D], [A, C, D, E]]
(A, C), [[B, C, D], [A, B, D, E]]
(A, D), [[B, C, D], [A, B, C, E]]
(B, C), [[A, C, D, E], [A, B, D, E]]
(B, D), [[A, C, D, E], [A, B, C, E]]
(B, E), [[A, C, D, E], [B, C, D]]
(C, D), [[A, B, D, E], [A, B, C, E]]
(C, E), [[A, B, D, E], [B, C, D]]
(D, E), [[A, B, C, E], [B, C, D]]
MR Example: Finding Friends
- Each line is given to a reducer
- The reducer computes the intersection of the sets
– and removes the persons from the key pair
- For example, (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]

(A, C), [B, D]
(A, D), [B, C]
(B, C), [A, D, E]
(B, D), [A, C, E]
(B, E), [C, D]
(C, D), [A, B, E]
(C, E), [B, D]
(D, E), [B, C]

- Now, when D visits B, the common friends [A, C, E] are found fast
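The whole map/shuffle/reduce pipeline above can be simulated in-process as a sketch (hypothetical function names; a real job would run the phases on Hadoop):

```python
from collections import defaultdict

def mutual_friends(friend_lists):
    """MapReduce-style mutual-friends computation, simulated in one process."""
    # Map + shuffle: for each (person, friends) line, emit the friend list
    # under a sorted pair key, so both members' lists meet at the same key.
    shuffled = defaultdict(list)
    for person, friends in friend_lists.items():
        for friend in friends:
            key = tuple(sorted((person, friend)))
            shuffled[key].append(set(friends))
    # Reduce: intersect the two friend lists collected for each pair.
    return {pair: sorted(lists[0] & lists[1]) for pair, lists in shuffled.items()}

graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}
common = mutual_friends(graph)
# common[("A", "B")] == ["C", "D"]; common[("B", "D")] == ["A", "C", "E"]
```

Because the relation is bidirectional, every pair key receives exactly two lists, one from each member's input line.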
MR: Page Rank
- Google’s description
– relies on the "uniquely democratic" nature of the web
– interprets a link from page A to page B as "a vote"
- A → B means A thinks B is worth something
– many links mean that B must be good
– content-independent measure
- Use as a ranking feature, combined with content
– not all pages linking to B are equally important
– a single link from Slashdot or CNN may be worth thousands
- Google PageRank
– how many “good” pages link to B
Page Rank: Random Surfer
- Analogy
– the user starts browsing from a random page
– picks a random out-going link
- repeat
– example: F→E→F→E→D→…
– with probability 1-λ, jump to a random page
- PageRank of page x
– probability of being on page x at a random moment
– formally:
Page Rank
- Initialize PR(x) = 1/N
- For every page: PR(x) = (1-λ)/N + λ * Σ_{y→x} PR(y)/out(y)
– y → x contributes part of its PR to x
– spreads PR equally among out-links
– PR scores should sum to 100%
- use two arrays PRt à PRt+1
- Iteration #1:
– PR(B) = 0.18 * 9.1 + 0.82 * [PR(C) + 1/3 * PR(E) + 1/2 * PR(F) + 1/2 * PR(G) + 1/2 * PR(I)] = 31
– PR(C) = 0.18 * 9.1 + 0.82 * 9.1 = 9.1
- Iteration #2:
– …
– PR(C) = 0.18 * 9.1 + 0.82 * PR(B) = 26
Page Rank
- Algorithm converges
- Observations:
– pages with no inlinks: PR = (1-λ) * 1/N = 0.18 * 9.1 ≈ 1.6
– same inlinks ⟹ same PR
– one inlink from a high-PR page >> many from low-PR pages
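The iteration above can be sketched as a plain power iteration (a toy sketch with a made-up three-page graph; it assumes every page has at least one out-link, i.e. no sink nodes):

```python
def pagerank(links, lam=0.82, iterations=50):
    """Iterate PR(x) = (1-lam)/N + lam * sum(PR(y)/out(y) for y -> x)."""
    n = len(links)
    pr = {page: 1.0 / n for page in links}          # PR(x) = 1/N
    for _ in range(iterations):
        # two arrays: pr is PR_t, nxt becomes PR_{t+1}
        nxt = {page: (1 - lam) / n for page in links}
        for y, outs in links.items():
            share = lam * pr[y] / len(outs)         # spread PR equally
            for x in outs:
                nxt[x] += share
        pr = nxt
    return pr

# hypothetical graph: A links to B and C, B to C, C back to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
# the scores sum to 1.0 (i.e. 100%)
```

C ends up with the highest score here, since it collects contributions from both A and B.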
PageRank with MapReduce
map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), …, PR(ym)/out(ym)}):
  PR(x) = (1-λ)/N + λ * Σ_j PR(yj)/out(yj)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
- Result goes recursively to another reducer
- Sink nodes still need to be considered
PageRank with MapReduce
map(y, {x1, x2, …, xn}) for j=1..n emit(xj, (
;<(>) @AB(>))
emit(y, {x1, …, xn} ) reduce(x, {
;<(>5) @AB(>5) , …, ;<(>E) @AB(>E) }, {x1, …, xn} )
𝑄𝑆 𝑦 =
567 8 + 𝜇 ∑ ;<(>) @AB(>) >→D
for j = 1 .. n emit(xj,
;<(D) @AB(D))
emit(x, {x1, …, xn} )
- Result goes recursively to another reducer
- Sink nodes still need to be considered
Combiners
[Diagram: on map node 1, the mapper emits (A,1), (A,1), (A,1) and (B,1); the combiner merges the three (A,1) pairs into (A,3) before sending them to the reduce node for key A]
Combiners
- Combiner can ”compress” data on a mapper node
before sending it forward
- Combiner input/output types must equal the
mapper output types
- In Hadoop Java, Combiners use the Reducer interface
job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
- Reducer can be used as a Combiner if it is commutative
and associative
– E.g. max is:
max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
- true for any order of function applications
– E.g. avg is not:
avg(1, 2, avg(3, 4, 5)) = 2.333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
- Note: if Reducer is not c&a, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case
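One standard workaround for the average (a sketch, not from the slides) is to let the combiner aggregate (sum, count) pairs, which are associative and commutative, and divide only in the final reduce step:

```python
def avg_map(values):
    """Mapper-side partial aggregate for an average: a (sum, count) pair."""
    return (sum(values), len(values))

def avg_combine(partials):
    """Combiner/Reducer: add partial sums and counts.
    Addition is associative and commutative, so any grouping is safe."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def avg_final(partial):
    """Final step: divide once, after all partials are merged."""
    s, c = partial
    return s / c

# avg_combine([avg_map([1, 2]), avg_map([3, 4, 5])]) == (15, 5) -> mean 3.0,
# regardless of how the values were split across mapper nodes
```

This is exactly the case where the Combiner differs from the Reducer: the Combiner merges (sum, count) pairs, while only the Reducer performs the division.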
Adding a Combiner to WordCount
[Diagram: the mapper emits (walk,1), (run,1), (walk,1); the combiner merges these into (run,1), (walk,2) before the shuffle]
Hadoop Streaming
- Map and Reduce functions can be implemented in
any language with the Hadoop Streaming API
- Input is read from standard input
- Output is written to standard output
- Input/output items are lines of the form
key\tvalue
– \t is the tabulator character
- Reducer input lines are grouped by key
– One reducer instance may receive multiple keys
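A possible mapper.py / reducer.py pair for word count might look like this (a sketch; the course's actual scripts are not shown here):

```python
import sys

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every word on the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input lines arrive sorted by key, so all lines
    for one key are contiguous; sum the counts per key."""
    current, count = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = key, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

# in mapper.py:  for out in mapper(sys.stdin): print(out)
# in reducer.py: for out in reducer(sys.stdin): print(out)
```

Note the reducer only relies on equal keys being adjacent (the sort guarantee), which is why one script instance can handle multiple keys.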
Run Hadoop Streaming
- Debug using Unix pipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
- On Hadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input sample.txt \
  -output output \
  -mapper ./mapper.py \
  -reducer ./reducer.py