IR: Information Retrieval
FIB, Master in Innovation and Research in Informatics
Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldá
Department of Computer Science, UPC
Fall 2018 http://www.cs.upc.edu/~ir-miri
L.A. Barroso, J. Dean, U. Hölzle: “Web Search for a Planet: The Google Cluster Architecture”, 2003
◮ Proprietary implementation
◮ Implements old ideas from functional programming,
◮ HDFS: Open Source Hadoop Distributed File
◮ Pig: Yahoo! Script-like language for data
◮ Hive: Facebook SQL-like language /
◮ . . .
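The "old ideas from functional programming" mentioned above are, presumably, the map and reduce (fold) operations. The sketch below illustrates them with a single-process word count. It assumes these bullets refer to the MapReduce model, which the surviving slide text does not name explicitly. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical and do not belong to Hadoop or any other framework.

```python
from collections import defaultdict
from functools import reduce


def map_phase(doc_id, text):
    """Map step: emit one (word, 1) pair per token of a document."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: fold the list of counts for one word into a total."""
    return word, reduce(lambda a, b: a + b, counts, 0)


def word_count(docs):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair
             for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


# Illustrative input only; not taken from the slides.
docs = {"d1": "to be or not to be", "d2": "to do"}
counts = word_count(docs)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network. Here everything runs in one process, purely to show the shape of the computation.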
◮ 1000s of machines, 10,000s of disks
◮ Abstract hardware & distribution (compare MPI: explicit
◮ Easy to use: good learning curve for programmers
◮ Commodity machines: cheap, but unreliable
◮ Commodity network
◮ Automatic fault-tolerance and tuning. Fewer administrators
◮ Serialized for network transfer and system & language
◮ Data storage and retrieval components (e.g. HDFS in
◮ what to look for in the data?
◮ what questions to ask?
◮ how to model the data?
◮ where to start?
◮ LAMP = Linux + Apache HTTP server + MySQL + PHP
In other words: in a system made up of unreliable nodes and an unreliable network, it is impossible to both implement atomic reads & writes and ensure that every request gets an answer.
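The trade-off stated above can be illustrated with a toy model. This is a minimal sketch, not a real replication protocol: `Replica`, `TwoNodeStore`, and the `prefer_consistency` flag are hypothetical names invented for the example. Writes go through node `a` and are copied to node `b` only while the network is not partitioned.

```python
class Replica:
    """One node's local copy of a single value."""
    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy two-replica store; writes arrive at a, reads are served by b."""
    def __init__(self, prefer_consistency):
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False          # True while a cannot reach b
        self.prefer_consistency = prefer_consistency

    def write(self, value):
        if self.partitioned and self.prefer_consistency:
            # Keep replicas identical by refusing to answer.
            raise RuntimeError("unavailable: replica b is unreachable")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value          # replicate while the link is up

    def read_from_b(self):
        return self.b.value


# Consistency first: the write during the partition is refused.
cp = TwoNodeStore(prefer_consistency=True)
cp.write("x")
cp.partitioned = True
try:
    cp.write("y")
    cp_write_refused = False
except RuntimeError:
    cp_write_refused = True

# Availability first: the write is accepted, but b serves a stale value.
ap = TwoNodeStore(prefer_consistency=False)
ap.write("x")
ap.partitioned = True
ap.write("y")
stale = ap.read_from_b()
```

During the partition, the first store keeps both replicas equal but leaves a request without an answer. The second store answers every request but lets `b` return the old value `"x"` after `a` has accepted `"y"`.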
[source: https://www.tutorialspoint.com/apache_spark/apache_spark_pdf_version.htm]
[source: https://spark.apache.org/docs/latest/cluster-overview.html]
◮ Dataset partitioned among worker nodes
◮ Can be created from HDFS files
◮ Specifies data transformations
◮ Data moves from one state to another
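The bullets above can be made concrete with a small, pure-Python stand-in. It assumes they describe Spark's partitioned datasets and the transformations applied to them, which the surviving slide text does not state outright. `ToyRDD` and its methods are hypothetical and are not the Spark API. The class only mimics two ideas: data split into partitions, and transformations recorded first and executed only when a result is requested.

```python
class ToyRDD:
    """Hypothetical partitioned dataset with lazily applied transformations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions   # list of lists, one per "worker"
        self.ops = ops                 # pending (kind, function) steps

    @classmethod
    def from_lines(cls, lines, num_partitions):
        """Distribute lines round-robin, standing in for reading file blocks."""
        parts = [lines[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, fn):
        """Record a map step; returns a new dataset, nothing is computed yet."""
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        """Record a filter step; returns a new dataset, nothing is computed yet."""
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        """Apply all recorded steps, in order, to one partition."""
        items = part
        for kind, fn in self.ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        """Trigger the computation and gather results from all partitions."""
        return [x for part in self.partitions
                for x in self._run_partition(part)]


# Illustrative input only; a real job would read these lines from HDFS.
lines = ["a b", "", "c d e", "f"]
rdd = ToyRDD.from_lines(lines, num_partitions=2)
lengths = rdd.filter(lambda s: s != "").map(lambda s: len(s.split()))
result = lengths.collect()
```

Each call to `filter` or `map` produces a new dataset description and leaves the original untouched, which is one way to read "data moves from one state to another". Work happens only inside `collect`, partition by partition.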