Data-intensive Programming Lecture #3
Timo Aaltonen, Department of Pervasive Computing
Guest Lectures
- I’ll try to organize two guest lectures
- Oct 14, Tapio Rautonen, Gofore Ltd, Making sense out of your big data
- Oct 7, ???
Outline
- Course Work
- Apache Sqoop
- SQL Recap
- MapReduce Examples
– Inverted Index
– Finding Friends
– Computing Page Rank
- (Hadoop)
– Combiner
– Other programming languages
Course Work
- MySportShop is a sports gear retailer. All sales happen
online in their webstore. Examples of their products are different game jerseys and sport watches.
- The webstore has an Apache web server for the incoming
HTTP requests. The web server logs all traffic to a log file.
– Using these logs, one can study the browsing behavior of the users.
- The sales data of MySportShop is in PostgreSQL, which is
a relational database. Among other things, the database has a table order_items containing data of all sales events of the shop.
Course Work: Questions
- Based on the data, answer the following questions
- 1. What are the top-10 best selling products in terms of
total sales?
- 2. What are the top-10 browsed products?
- 3. What anomaly is there between these two?
- 4. What are the most popular browsing hours?
Course Work
- Since the managers of the company don’t use Hadoop
but an RDBMS, all the data must be transferred to PostgreSQL
- In order to do that
– Transfer Apache logs (with Apache Flume) to the HDFS
– Compute the frequencies of viewing of different products using MapReduce (Question 2)
– Compute the viewing hour data with MapReduce (Q4)
– Transfer the results (with Apache Sqoop) to PostgreSQL
– Find answers to the questions in PostgreSQL using SQL (Q1-4)
Environment: three options
1. You can use your own computer by installing VirtualBox 5.x
– We offer you a virtual machine on which all required software and data have been installed
– In the next weekly exercises, assistants will solve VirtualBox-related problems, if you encounter any
2. We offer you a virtual machine from TUT cloud
– All required software and data are installed
– No graphical user interface
– Guidance available in the weekly exercises
3. You can use your own installation or a cloud service
– No help from the course personnel
Course Work
- The work is done in groups of three
– Enroll in Moodle: https://moodle2.tut.fi/course/view.php?id=9954
– opens today at 10 o'clock
- Deadline is Oct 14th
- Instructions for returning will be published later
– IntelliJ IDEA project
Course Work
- Material
– https://flume.apache.org/FlumeUserGuide.html
– https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html
– http://hadoop.apache.org/docs/r2.7.3/
– https://www.postgresql.org/docs/9.5/static/index.html
MapReduce
- Simple programming model
- Map is stateless
– allows running map functions in parallel
- Also Reduce can be executed in parallel
- The canonical example is the word count
Inverted Index
- Collating
– Problem: There is a set of items and some function of one item. It is required to save all items that have the same value of the function into one file, or to perform some other computation that requires all such items to be processed as a group. The most typical example is the building of inverted indexes.
– Solution: The mapper computes the given function for each item and emits the value of the function as a key and the item itself as a value. The reducer obtains all items grouped by function value and processes or saves them. In the case of inverted indexes, the items are terms (words) and the function is the document ID where the term was found.
Simple Inverted Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, 1; doc, 1; contains, 1; text, 1
my, 2; doc, 2; contains, 2; my, 2; text, 2

- Reduced output: word, list of docIDs
this: 1
doc: 1, 2
contains: 1, 2
text: 1, 2
my: 2
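The map and reduce steps above can be simulated in-process as a small sketch (the helper names are made up; a real Hadoop job would implement Mapper and Reducer classes instead):

```python
from itertools import groupby
from operator import itemgetter

def map_doc(doc_id, text):
    """Mapper: emit a (word, doc_id) pair for every word in the document."""
    for word in text.lower().split():
        yield (word, doc_id)

def build_index(docs):
    """Simulate shuffle + reduce: sort pairs by word, then collect the
    (deduplicated, sorted) list of doc ids for each word."""
    pairs = sorted(p for doc_id, text in docs for p in map_doc(doc_id, text))
    return {word: sorted({d for _, d in group})
            for word, group in groupby(pairs, key=itemgetter(0))}

docs = [(1, "This doc contains text"), (2, "My doc contains my text")]
index = build_index(docs)
# index["doc"] == [1, 2]; index["my"] == [2]
```

Sorting the intermediate pairs plays the role of the shuffle phase: it brings all pairs with the same key to one place, just as Hadoop routes them to one reducer.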
(Normal) Inverted Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, (1,1); doc, (1,1); contains, (1,1); text, (1,1)
my, (2,1); doc, (2,1); contains, (2,1); my, (2,1); text, (2,1)

- Reduced output: word, list of (docID, frequency)
this: (1,1)
doc: (1,1), (2,1)
contains: (1,1), (2,1)
text: (1,1), (2,1)
my: (2,2)
Using Inverted Index: Searching
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,2), (2,1), (3,1), (4,1), (5,1)
– ink: (3,1), (4,1), (5,1)
– pink: (4,1), (5,1)
– thing: (3,1)
– wink: (1,1), (5,1)
Using Inverted Index
- Indexing makes search engines fast
- Data is sparse, since most words appear in only a few documents
– (id, val) tuples
– sorted by id
– compact
– very fast
- Linear merge
Index:
he: (1,2), (2,1), (3,1), (4,1), (5,1)
ink: (3,1), (4,1), (5,1)
pink: (4,1), (5,1)
thing: (3,1)
wink: (1,1), (5,1)
Linear Merge
- Find documents matching the query {ink, wink}
– Load the inverted lists for all query words
– Linear merge, O(n)
- n is the total number of items in the two lists
- f() is a scoring function: how well doc matches the query
ink  --> (3,1) (4,1) (5,1)
wink --> (1,1) (5,1)

Matching set:
1: f(0,1)
3: f(1,0)
4: f(1,0)
5: f(1,1)
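The linear merge can be sketched as follows (a minimal sketch with made-up names; it returns, per document, the counts that a scoring function f would consume):

```python
def linear_merge(list_a, list_b):
    """Merge two posting lists sorted by doc id in O(n) time.

    Each list holds (doc_id, term_frequency) tuples; the result maps
    doc_id -> (freq_in_a, freq_in_b), with 0 when the term is absent.
    """
    matches, i, j = {}, 0, 0
    while i < len(list_a) or j < len(list_b):
        da = list_a[i][0] if i < len(list_a) else float("inf")
        db = list_b[j][0] if j < len(list_b) else float("inf")
        if da < db:                      # doc only in list_a
            matches[da] = (list_a[i][1], 0); i += 1
        elif db < da:                    # doc only in list_b
            matches[db] = (0, list_b[j][1]); j += 1
        else:                            # doc in both lists
            matches[da] = (list_a[i][1], list_b[j][1]); i += 1; j += 1
    return matches

ink  = [(3, 1), (4, 1), (5, 1)]
wink = [(1, 1), (5, 1)]
# linear_merge(ink, wink) == {1: (0, 1), 3: (1, 0), 4: (1, 0), 5: (1, 1)}
```

Each doc id then goes through f(count_ink, count_wink) to produce the ranking score.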
Scoring Function
- Specify which docs are matched
– in: counts of query words in a doc
– out: ranking score
- how well doc matches the query
- 0 if document does not match
– Example: Boolean AND, where c(q, D) is the count of query word q in document D:
g(D, Q) = ∏_{q ∈ Q} (1 if c(q, D) > 0, 0 if c(q, D) = 0)
– 1 iff all query words are present
Phrases and Proximity
- Query “pink ink” as a phrase
- Using a regular index:
– match #and(pink, ink) ->
– scan the matched documents for the query string (slow)
- Idea: index all bi-grams as words
– can approximate "drink pink ink"
– fast, but the index size explodes
– inflexible: can't query #5(pink, ink)
- Construct proximity index
D4: The ink he likes to drink is pink.
D5: He likes to wink and drink pink ink.

drink_pink --> (5,1)
pink_ink --> (5,1)
Proximity Index
- Embed position information to the inverted lists
– called a positional/proximity index (prox-list)
– handles arbitrary phrases, windows
– key to "rich" indexing: structure, fields, tags, …
Proximity Index
Doc #1: "This doc contains text"
Doc #2: "My doc contains my text"

Map output:
this, (1,1); doc, (1,2); contains, (1,3); text, (1,4)
my, (2,1); doc, (2,2); contains, (2,3); my, (2,4); text, (2,5)

- Reduced output: word, list of (docID, position)
this: (1,1)
doc: (1,2), (2,2)
contains: (1,3), (2,3)
text: (1,4), (2,5)
my: (2,1), (2,4)
Proximity Index
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,4)
Using Proximity Index
- Query: “pink ink”
- Linear Merge
– compare the docIDs under the pointers
– if they match, check pos(ink) - pos(pink) = 1
– near operator

ink  --> (3,8) (4,2) (5,8)
pink --> (4,8) (5,7)
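The phrase check can be sketched as follows (a toy sketch with made-up names; for clarity it uses a nested scan, whereas a real engine would linear-merge the two sorted lists):

```python
def phrase_match(prox_a, prox_b, distance=1):
    """Find docs where term B occurs exactly `distance` positions after term A.

    prox_a / prox_b are positional postings: (doc_id, position) tuples.
    """
    hits = []
    for doc_a, pos_a in prox_a:
        for doc_b, pos_b in prox_b:
            # same document and the required positional offset
            if doc_a == doc_b and pos_b - pos_a == distance:
                hits.append(doc_a)
    return hits

pink = [(4, 8), (5, 7)]
ink  = [(3, 8), (4, 2), (5, 8)]
# phrase "pink ink": only doc 5 has pos(ink) - pos(pink) == 1
```

Passing a larger `distance` (or a window check) gives the near operator instead of an exact phrase.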
Structure and Tags
- Documents are not always flat
– meta-data: title, author, date
– structure: part, chapter, section, paragraph
– tags: named entity, link, translation
- Options for dealing with structure
– create a separate index for each field (like in SQL)
– push structure into index values
– construct an extent index
Extent Index
- Special “term” for each element, field or tag
– spans a region of text
- words in the span belong to the field
– allows multiple overlapping spans
– similar to stand-off annotation formats
Extent Index
- Documents
– D1: He likes to wink, he likes to drink.
– D2: He likes to drink, and drink, and drink.
– D3: The thing he likes to drink is ink.
– D4: The ink he likes to drink is pink.
– D5: He likes to wink and drink pink ink.
- Index
– he: (1,1), (1,5), (2,1), (3,3), (4,3), (5,1)
– ink: (3,8), (4,2), (5,8)
– pink: (4,8), (5,7)
– thing: (3,2)
– wink: (1,4), (5,4)
– link: (3,1:2), (4,1:2), (5,7:8)
Using Extent Index
- Query: find an ink-related hyper-link
- Same approach as with proximity
– only now "tag" and "word" must have distance = 0
– Linear merge, match when positions fall into the extent
– amenable to all optimizations

ink  --> (3,8) (4,2) (5,8)
link --> (3,1:2) (4,1:2) (5,7:8)
Overview on Inverted Indices
- Normal
- Positional
– phrases, near operator
- Extent
– metadata, structure
MR Example: Finding Friends
- http://stevekrenzel.com/finding-friends-with-mapreduce
- Facebook could use MapReduce in the following way
MR Example: Finding Friends
- Facebook has a list of friends
– the relation is bidirectional
- FB has lots of disk space and serves millions of requests per day
- Certain results are pre-computed to reduce the processing time of requests
– E.g. "You and Joe have 230 mutual friends"
– The list of common friends is quite stable
– so recalculating it on every request would be wasteful
MR Example: Finding Friends
- Idea: MapReduce is used to calculate the common friends daily and store the results
– later only a quick lookup is needed
- Assume the friends are stored as Person ⟶ [List of friends]
- A ⟶ [B, C, D]
- B ⟶ [A, C, D, E]
- C ⟶ [A, B, D, E]
- D ⟶ [A, B, C, E]
- E ⟶ [B, C, D]
MR Example: Finding Friends
- Each line is input for a mapper
- For every friend in the list of friends, the mapper emits a (key, value) pair, where
– key is (person, friend) if person < friend, and (friend, person) otherwise
– value is the list of the person's friends
MR Example: Finding Friends
map(A, [B, C, D]):
(A, B), [B, C, D]
(A, C), [B, C, D]
(A, D), [B, C, D]

map(B, [A, C, D, E]):
(A, B), [A, C, D, E]
(B, C), [A, C, D, E]
(B, D), [A, C, D, E]
(B, E), [A, C, D, E]

map(C, [A, B, D, E]):
(A, C), [A, B, D, E]
(B, C), [A, B, D, E]
(C, D), [A, B, D, E]
(C, E), [A, B, D, E]

map(D, [A, B, C, E]):
(A, D), [A, B, C, E]
(B, D), [A, B, C, E]
(C, D), [A, B, C, E]
(D, E), [A, B, C, E]

map(E, [B, C, D]):
(B, E), [B, C, D]
(C, E), [B, C, D]
(D, E), [B, C, D]
MR Example: Finding Friends
- After shuffling inputs to the reducers:
(A, B), [[B, C, D], [A, C, D, E]]
(A, C), [[B, C, D], [A, B, D, E]]
(A, D), [[B, C, D], [A, B, C, E]]
(B, C), [[A, C, D, E], [A, B, D, E]]
(B, D), [[A, C, D, E], [A, B, C, E]]
(B, E), [[A, C, D, E], [B, C, D]]
(C, D), [[A, B, D, E], [A, B, C, E]]
(C, E), [[A, B, D, E], [B, C, D]]
(D, E), [[A, B, C, E], [B, C, D]]
MR Example: Finding Friends
- Each line is given to a reducer
- The reducer computes the intersection of the sets
– and removes the persons from the key pair
- For example, (A, B), [[B, C, D], [A, C, D, E]] is reduced to (A, B), [C, D]

(A, C), [B, D]
(A, D), [B, C]
(B, C), [A, D, E]
(B, D), [A, C, E]
(B, E), [C, D]
(C, D), [A, B, E]
(C, E), [B, D]
(D, E), [B, C]

- Now, when D visits B, the common friends [A, C, E] are found fast
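The whole map/shuffle/reduce pipeline above can be simulated in-process as a sketch (hypothetical function names; a real job would run the phases on Hadoop):

```python
from collections import defaultdict

def mutual_friends(friend_lists):
    """MapReduce-style mutual-friends computation, simulated in one process."""
    # Map + shuffle: for each (person, friends) line, emit the friend list
    # under a sorted pair key, so both members' lists meet at the same key.
    shuffled = defaultdict(list)
    for person, friends in friend_lists.items():
        for friend in friends:
            key = tuple(sorted((person, friend)))
            shuffled[key].append(set(friends))
    # Reduce: intersect the two friend lists collected for each pair.
    return {pair: sorted(lists[0] & lists[1]) for pair, lists in shuffled.items()}

graph = {
    "A": ["B", "C", "D"],
    "B": ["A", "C", "D", "E"],
    "C": ["A", "B", "D", "E"],
    "D": ["A", "B", "C", "E"],
    "E": ["B", "C", "D"],
}
common = mutual_friends(graph)
# common[("A", "B")] == ["C", "D"]; common[("B", "D")] == ["A", "C", "E"]
```

Because the relation is bidirectional, every pair key receives exactly two lists, one from each member's input line.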
MR: Page Rank
- Google’s description
– relies on the "uniquely democratic" nature of the web
– interprets a link from page A to page B as "a vote"
- A → B means A thinks B is worth something
– many links mean that B must be good
– content-independent measure
- Use as a ranking feature, combined with content
– not all pages linking to B are equally important
– a single link from Slashdot or CNN may be worth thousands
- Google PageRank
– how many “good” pages link to B
Page Rank: Random Surfer
- Analogy
– the user starts browsing from a random page
– picks a random out-going link
- repeat
– example: F→E→F→E→D→…
– with probability 1-λ, jump to a random page
- PageRank of page x
– probability of being on page x at a random moment
– formally:
Page Rank
- Initialize PR(x) = 1/N
- For every page: PR(x) = (1-λ)/N + λ * Σ_{y→x} PR(y)/out(y)
– y → x contributes part of its PR to x
– spreads PR equally among out-links
– PR scores should sum to 100%
- use two arrays PRt à PRt+1
- Iteration #1:
– PR(B) = 0.18 * 9.1 + 0.82 * [PR(C) + 1/3 * PR(E) + 1/2 * PR(F) + 1/2 * PR(G) + 1/2 * PR(I)] = 31
– PR(C) = 0.18 * 9.1 + 0.82 * 9.1 = 9.1
- Iteration #2:
– …
– PR(C) = 0.18 * 9.1 + 0.82 * PR(B) = 26
Page Rank
- Algorithm converges
- Observations:
– pages with no inlinks: PR = (1-λ) * 1/N = 0.18 * 9.1 ≈ 1.6
– same inlinks ⟹ same PR
– one inlink from a high-PR page >> many from low-PR pages
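The iteration above can be sketched as a plain power iteration (a toy sketch with a made-up three-page graph; it assumes every page has at least one out-link, i.e. no sink nodes):

```python
def pagerank(links, lam=0.82, iterations=50):
    """Iterate PR(x) = (1-lam)/N + lam * sum(PR(y)/out(y) for y -> x)."""
    n = len(links)
    pr = {page: 1.0 / n for page in links}          # PR(x) = 1/N
    for _ in range(iterations):
        # two arrays: pr is PR_t, nxt becomes PR_{t+1}
        nxt = {page: (1 - lam) / n for page in links}
        for y, outs in links.items():
            share = lam * pr[y] / len(outs)         # spread PR equally
            for x in outs:
                nxt[x] += share
        pr = nxt
    return pr

# hypothetical graph: A links to B and C, B to C, C back to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
# the scores sum to 1.0 (i.e. 100%)
```

C ends up with the highest score here, since it collects contributions from both A and B.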
PageRank with MapReduce
map(y, {x1, x2, …, xn}):
  for j = 1..n:
    emit(xj, PR(y)/out(y))

reduce(x, {PR(y1)/out(y1), …, PR(ym)/out(ym)}):
  PR(x) = (1-λ)/N + λ * Σ_j PR(yj)/out(yj)
  for j = 1..n:
    emit(xj, PR(x)/out(x))
- Result goes recursively to another reducer
- Sink nodes still need to be considered
PageRank with MapReduce
map(y, {x1, x2, …, xn}) for j=1..n emit(xj, (
;<(>) @AB(>))
emit(y, {x1, …, xn} ) reduce(x, {
;<(>5) @AB(>5) , …, ;<(>E) @AB(>E) }, {x1, …, xn} )
𝑄𝑆 𝑦 =
567 8 + 𝜇 ∑ ;<(>) @AB(>) >→D
for j = 1 .. n emit(xj,
;<(D) @AB(D))
emit(x, {x1, …, xn} )
- Result goes recursively to another reducer
- Sink nodes still need to be considered
Combiners
[Diagram: on map node 1, the mapper emits (A,1), (A,1), (A,1) and (B,1); the combiner merges the three (A,1) pairs into (A,3) before sending them to the reduce node for key A]
Combiners
- Combiner can ”compress” data on a mapper node
before sending it forward
- Combiner input/output types must equal the
mapper output types
- In Hadoop Java, Combiners use the Reducer interface
job.setCombinerClass(MyReducer.class);
Reducer as a Combiner
- Reducer can be used as a Combiner if it is commutative
and associative
– E.g. max is:
max(1, 2, max(3, 4, 5)) = max(max(2, 4), max(1, 5, 3)) = 5
- true for any order of function applications
– E.g. avg is not:
avg(1, 2, avg(3, 4, 5)) = 2.333… ≠ avg(avg(2, 4), avg(1, 5, 3)) = 3
- Note: if Reducer is not c&a, Combiners can still be used
– The Combiner just has to be different from the Reducer and designed for the specific case
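One standard workaround for the average (a sketch, not from the slides) is to let the combiner aggregate (sum, count) pairs, which are associative and commutative, and divide only in the final reduce step:

```python
def avg_map(values):
    """Mapper-side partial aggregate for an average: a (sum, count) pair."""
    return (sum(values), len(values))

def avg_combine(partials):
    """Combiner/Reducer: add partial sums and counts.
    Addition is associative and commutative, so any grouping is safe."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (total, count)

def avg_final(partial):
    """Final step: divide once, after all partials are merged."""
    s, c = partial
    return s / c

# avg_combine([avg_map([1, 2]), avg_map([3, 4, 5])]) == (15, 5) -> mean 3.0,
# regardless of how the values were split across mapper nodes
```

This is exactly the case where the Combiner differs from the Reducer: the Combiner merges (sum, count) pairs, while only the Reducer performs the division.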
Adding a Combiner to WordCount
[Diagram: the mapper emits (walk,1), (run,1), (walk,1); the combiner merges these into (run,1), (walk,2) before the shuffle]
Hadoop Streaming
- Map and Reduce functions can be implemented in
any language with the Hadoop Streaming API
- Input is read from standard input
- Output is written to standard output
- Input/output items are lines of the form
key\tvalue
– \t is the tabulator character
- Reducer input lines are grouped by key
– One reducer instance may receive multiple keys
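A possible mapper.py / reducer.py pair for word count might look like this (a sketch; the course's actual scripts are not shown here):

```python
import sys

def mapper(lines):
    """Streaming mapper: emit 'word<TAB>1' for every word on the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input lines arrive sorted by key, so all lines
    for one key are contiguous; sum the counts per key."""
    current, count = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = key, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

# in mapper.py:  for out in mapper(sys.stdin): print(out)
# in reducer.py: for out in reducer(sys.stdin): print(out)
```

Note the reducer only relies on equal keys being adjacent (the sort guarantee), which is why one script instance can handle multiple keys.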
Run Hadoop Streaming
- Debug using Unix pipes:
cat sample.txt | ./mapper.py | sort | ./reducer.py
- On Hadoop:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input sample.txt \
  -output output \
  -mapper ./mapper.py \
  -reducer ./reducer.py