CS 744: SCOPE
Shivaram Venkataraman Fall 2020
Hello!
ADMINISTRIVIA
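- Assignment grades this week
- Midterm details on Piazza
  → next Thursday; submit a single PDF file (e.g., convert a photo to PDF)
- Course Project Proposal Submission
  → HotCRP, peer review
  → Anonymous PDF upload: don't include your names in the PDF; only include them in HotCRP itself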
Course overview: Scalable Storage Systems, Datacenter Architecture, Resource Management, Computational Engines (MapReduce, Spark, Ray), and Applications (Machine Learning, SQL, Streaming, Graph)
SQL: STRUCTURED QUERY LANGUAGE
A language to query a database
DATABASE SYSTEMS
- OLAP: online analytical processing
- Transaction processing (e.g., airline systems)
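PROCEDURAL VS. RELATIONAL
Relational:
SELECT COUNT(*) FROM "users" WHERE age < 21

Procedural (Spark, Scala):
val lines = sc.textFile("users")
val csv = lines.map(x => x.split(','))
val young = csv.filter(x => x(1).toInt < 21)  // fields are strings, so convert before comparing
println(young.count())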
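To make the contrast concrete, here is a minimal Python sketch of the same count written both ways. The sample rows, the column order (name, age), and the use of SQLite are assumptions for illustration only; the slide uses Spark and an unspecified "users" input.

```python
import sqlite3

# Hypothetical sample data: (name, age) rows standing in for the "users" input.
rows = [("ann", 19), ("bob", 34), ("cy", 20), ("dee", 21)]

# Relational: declare a schema, then state what is wanted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
relational_count = conn.execute(
    "SELECT COUNT(*) FROM users WHERE age < 21"
).fetchone()[0]
conn.close()

# Procedural: parse each CSV line by hand and filter on a positional field.
lines = [f"{name},{age}" for name, age in rows]
parsed = (line.split(",") for line in lines)
procedural_count = sum(1 for fields in parsed if int(fields[1]) < 21)

print(relational_count, procedural_count)
```

Both versions compute the same number; the relational one names the column, while the procedural one has to know that age is the second field.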
Annotations: the SQL version assumes a schema over structured data, while the procedural version parses the CSV itself. The relational form is easier to program.
SCOPE
SELECT query, COUNT(*) AS count
FROM "search.log" USING LogExtractor
GROUP BY query
HAVING count > 1000
ORDER BY count DESC;
→ Developed at Microsoft
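As a rough illustration of what the query above computes, the following Python sketch groups log lines by query, keeps the frequent ones, and sorts by count. The `log_extractor` function is a hypothetical stand-in for LogExtractor (here assumed to yield one query string per non-empty line), and the tiny in-memory log and threshold are made up.

```python
from collections import Counter

def log_extractor(lines):
    """Hypothetical stand-in for LogExtractor: one query string per log line."""
    for line in lines:
        query = line.strip()
        if query:
            yield query

def frequent_queries(lines, threshold):
    counts = Counter(log_extractor(lines))                       # GROUP BY query, COUNT(*)
    kept = [(q, c) for q, c in counts.items() if c > threshold]  # HAVING count > threshold
    return sorted(kept, key=lambda qc: qc[1], reverse=True)      # ORDER BY count DESC

sample_log = ["cats", "dogs", "cats", "", "cats", "dogs", "fish"]
result = frequent_queries(sample_log, threshold=1)
print(result)
```

The slide's query uses a threshold of 1000; the sketch uses 1 only so that the toy log produces output.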
SCOPE OPERATORS
Input reading: What is different?
EXTRACT column[:<type>] [, ...]
FROM <input_stream(s)>
USING <Extractor> [(args)]
[HAVING <predicate>]
Compared with reading an RDD via sc.textFile:
① Schema information: columns (and optionally types) are declared when the input is read
② Pluggable extractor function (e.g., a CSV extractor)
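The two points above can be sketched in Python as an extractor that takes a schema and an optional predicate. The function name `csv_extractor`, its parameters, and the sample data are assumptions for illustration and are not SCOPE's actual extractor interface.

```python
import csv
import io

def csv_extractor(stream, schema, having=None):
    """Yield dict rows typed per `schema`, a list of (column, type) pairs."""
    for fields in csv.reader(stream):
        row = {name: cast(value) for (name, cast), value in zip(schema, fields)}
        # Optional predicate, applied while rows are being read.
        if having is None or having(row):
            yield row

data = io.StringIO("ann,19\nbob,34\ncy,20\n")
schema = [("name", str), ("age", int)]
young = list(csv_extractor(data, schema, having=lambda r: r["age"] < 21))
print(young)
```

Because the schema travels with the read, later operators can refer to `age` by name instead of by position.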
SQL OPERATORS
- Select: read rows that satisfy some predicate
- Join: equijoin with support for inner and outer joins
- GroupBy: group by some column
- OrderBy: sort the output
- Aggregations: COUNT, SUM, MAX, etc.
→ These operators cover a large share of user analytics
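To illustrate the difference between inner and outer equijoins from the list above, here is a small hash-based sketch in Python. The helper `equijoin` and the `users` / `visits` tables are hypothetical; a real engine would fill the missing right-side columns of unmatched rows with NULLs.

```python
from collections import defaultdict

def equijoin(left, right, key, outer=False):
    """Hash equijoin on `key`; with outer=True, keep unmatched left rows."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    out = []
    for l in left:
        matches = index.get(l[key], [])
        for r in matches:
            out.append({**l, **r})
        if outer and not matches:
            # Left outer join: keep the row; right-side columns are absent here.
            out.append(dict(l))
    return out

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
visits = [{"uid": 1, "page": "home"}, {"uid": 1, "page": "docs"}]

inner = equijoin(users, visits, "uid")
left_outer = equijoin(users, visits, "uid", outer=True)
print(len(inner), len(left_outer))
```

The inner join drops the user with no visits, while the left outer join keeps that user as a row without a page.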
LANGUAGE INTEGRATION
R1 = SELECT A+C AS ac, B.Trim() AS B1
     FROM R
     WHERE StringOccurs(C, "xyz") > 2

#CS
public static int StringOccurs(string str, string ptrn)
{
    int cnt = 0;
    int pos = -1;
    while (pos + 1 < str.Length)
    {
        pos = str.IndexOf(ptrn, pos + 1);
        if (pos < 0) break;
        cnt++;
    }
    return cnt;
}
#ENDCS
- Trim() comes from the C# standard library
- StringOccurs is a custom C# function defined inline and compiled by the C# compiler
- User-defined functions (UDFs)
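For readers less familiar with C#, the following Python sketch mirrors the StringOccurs loop and applies it as a filter, roughly as the SELECT above does. The rows of `R` are hypothetical, and columns A and C are assumed to be strings so that `A+C` is a concatenation.

```python
def string_occurs(text, pattern):
    """Count occurrences of `pattern` in `text`, mirroring the C# loop."""
    count = 0
    pos = -1
    while pos + 1 < len(text):
        pos = text.find(pattern, pos + 1)
        if pos < 0:
            break
        count += 1
    return count

# Hypothetical rows of R with string columns A, B, C.
R = [
    {"A": "a-", "B": "  x ", "C": "xyz-xyz-xyz"},
    {"A": "b-", "B": " y", "C": "xyz only once"},
]

# R1 = SELECT A+C AS ac, B.Trim() AS B1 FROM R WHERE StringOccurs(C, "xyz") > 2
R1 = [
    {"ac": r["A"] + r["C"], "B1": r["B"].strip()}
    for r in R
    if string_occurs(r["C"], "xyz") > 2
]
print(R1)
```

Only the first row passes the filter, since its C column contains "xyz" three times.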
MAPREDUCE-LIKE?
Process, Reduce, Combine

COMBINE S1 WITH S2
ON S1.A==S2.A AND S1.B==S2.B AND S1.C==S2.C
USING MultiSetDifference
PRODUCE A, B, C
- PROCESS → map-like: takes rows in and produces rows out
- REDUCE → like the reduce operator: runs on groups of rows
- COMBINE → takes two rowsets, similar to an equi-join; PRODUCE can emit many columns
- What if combine can be run multiple times?
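One way to read the COMBINE example is as a multiset difference keyed on (A, B, C). The Python sketch below is an interpretation under that assumption; the function name, the row representation, and the sample data are illustrative and are not taken from SCOPE's implementation.

```python
from collections import Counter

def multiset_difference(s1, s2, key_cols=("A", "B", "C")):
    """Emit each S1 row once per occurrence not cancelled by a matching S2 row."""
    def key(row):
        return tuple(row[c] for c in key_cols)

    # Counter subtraction keeps only keys with a positive remaining count.
    remaining = Counter(key(r) for r in s1) - Counter(key(r) for r in s2)
    return [dict(zip(key_cols, k)) for k in remaining.elements()]

S1 = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 1},
    {"A": 2, "B": 2, "C": 2},
]
S2 = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 3, "B": 3, "C": 3},
]

diff = multiset_difference(S1, S2)
print(diff)
```

The duplicated (1, 1, 1) row in S1 is cancelled once by S2, so one copy survives along with (2, 2, 2).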
EXECUTION: COMPILER
SELECT query, COUNT(*) AS count
FROM "search.log" USING LogExtractor
GROUP BY query
HAVING count > 1000
ORDER BY count DESC;

- Check syntax, resolve names
- Check that columns have been defined
- Result: internal parse tree
OPTIMIZER
Rewrite the query expression → lowest cost
Examples:
- Removing unnecessary columns
- Pushing down selection predicates
- Pre-aggregating
Also need to reason about partitioning (see VLDBJ paper)
→ Pre-aggregating is similar to a combiner
→ Pushing down predicates means filtering before grouping
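A small Python sketch can show why pushing a selection predicate down is worthwhile: both plans below return the same rows, but the pushed-down plan builds a smaller intermediate result. The tables, the `age < 21` predicate, and the two plan functions are hypothetical and only illustrate the idea, not SCOPE's optimizer.

```python
users = [
    {"uid": 1, "name": "ann", "age": 19},
    {"uid": 2, "name": "bob", "age": 34},
]
visits = [
    {"uid": 1, "page": "home"},
    {"uid": 2, "page": "docs"},
    {"uid": 2, "page": "home"},
]

def plan_filter_late(users, visits):
    """Join first, then filter: the join output holds every matching pair."""
    joined = [(u, v) for u in users for v in visits if u["uid"] == v["uid"]]
    result = [(u["name"], v["page"]) for u, v in joined if u["age"] < 21]
    return result, len(joined)

def plan_filter_early(users, visits):
    """Filter first (predicate pushed down), then join the smaller input."""
    young = [u for u in users if u["age"] < 21]
    joined = [(u, v) for u in young for v in visits if u["uid"] == v["uid"]]
    result = [(u["name"], v["page"]) for u, v in joined]
    return result, len(joined)

late_result, late_size = plan_filter_late(users, visits)
early_result, early_size = plan_filter_early(users, visits)
print(late_size, early_size)
```

An optimizer that can prove the two plans equivalent is free to pick the one with the smaller intermediate result.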
RUNTIME OPTIMIZATIONS
- Hierarchical aggregation
- Locality-sensitive task placement
- Grouping heuristics?
→ Hierarchical aggregation: not all links have the same bandwidth, so aggregate within a rack first, then across racks
→ Locality-sensitive task placement: similar to Spark / MR; they also do this for intermediate data
→ Grouping heuristics: vague in the paper; by default the number of partitions is set automatically
→ C# code is compiled to a binary
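Hierarchical aggregation can be sketched in Python as a two-level merge of partial counts: first within each rack, then across racks. The cluster layout and the record-counting cost model below are assumptions made for illustration; the slides do not specify how SCOPE measures or schedules this.

```python
from collections import Counter

# Hypothetical layout: rack -> list of per-machine partial counts.
cluster = {
    "rack1": [Counter(cats=3, dogs=1), Counter(cats=2, dogs=2)],
    "rack2": [Counter(dogs=4, fish=1), Counter(cats=1, fish=2)],
}

def aggregate(counters):
    """Merge partial counts by summing per key."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total

# Level 1: combine inside each rack, where links are assumed to be cheap.
rack_totals = {rack: aggregate(machines) for rack, machines in cluster.items()}
# Level 2: combine the per-rack results across racks.
global_total = aggregate(rack_totals.values())

# Records crossing rack boundaries: every machine's records vs. one set per rack.
flat_cross_rack = sum(len(m) for ms in cluster.values() for m in ms)
hier_cross_rack = sum(len(t) for t in rack_totals.values())
print(flat_cross_rack, hier_cross_rack)
```

The final counts are identical either way; the two-level version simply sends fewer records over the scarcer cross-rack links.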
SUMMARY, TAKEAWAYS
Relational API
Scope Execution
Precursor to systems like SparkSQL
→ Schema
→ UDFs

DISCUSSION
https://forms.gle/hL8VJ6uSG7Lzm164A
Suppose you have a column-oriented data layout on your storage system (example below). What are some reasons that a SCOPE query might be faster than running an equivalent MR program?
http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
→ Example of a column-oriented format: Apache Parquet
→ The notion of an Extractor works well with such storage
→ Pre-filtering is easier
→ If the query touches a single column, the extractor can read just that column, so this is efficient as well
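The single-column point can be illustrated with a toy Python comparison of how many values each layout has to read to answer `COUNT(*) WHERE age < 21`. The table contents and the "values read" measure are hypothetical simplifications; real formats such as Parquet add encoding, compression, and metadata that this sketch ignores.

```python
# Row-oriented layout: every field of every row is read.
rows = [("ann", 19, "WI"), ("bob", 34, "CA"), ("cy", 20, "WI")]

# Column-oriented layout of the same (hypothetical) table.
columns = {
    "name": ["ann", "bob", "cy"],
    "age": [19, 34, 20],
    "state": ["WI", "CA", "WI"],
}

# Row store: the scan touches all fields even though only age is needed.
row_values_read = sum(len(r) for r in rows)
row_count = sum(1 for _, age, _ in rows if age < 21)

# Column store: the scan touches only the age column.
col_values_read = len(columns["age"])
col_count = sum(1 for age in columns["age"] if age < 21)

print(row_values_read, col_values_read)
```

A query that declares which columns it needs lets the system skip the rest, which an opaque MR map function cannot easily reveal.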
Does SCOPE-like Optimizer help ML workloads? Consider the code in your
→ Column filtering ⇒ helps feature extraction
→ Joins in ML workloads: rare?
→ No details about caching intermediate data
→ Join strategies: hash join, sort-merge join
NEXT STEPS
- Next class: Elastic Data Warehousing with Snowflake
- Project proposals due tomorrow! See Piazza!
- Midterm coming up!