The Transition to Handhelds. Power wall: Previous generations reused software - PowerPoint PPT Presentation


Parallelizing the Web Browser. Chris Jones, Rose Liu, Leo Meyerovich, Krste Asanovic, and Rastislav Bodik. ParLab, UC Berkeley.


slide-1
SLIDE 1

Parallelizing the Web Browser

Chris Jones, Rose Liu, Leo Meyerovich, Krste Asanovic, and Rastislav Bodik
ParLab, UC Berkeley


slide-2
SLIDE 2

The Transition to Handhelds

[Figure: log price vs. time for successive computer generations, labeled Mainframe, Mini, WS, PC, Laptop, Handset, Ubiquitous]

Power wall: Previous generations reused software of their ancestors. Mobiles will need parallel software.

Soon on mobile: 4-cores x 2-threads x 8-SIMD = 64-way parallelism


slide-3
SLIDE 3

Why Parallelize a Browser?

  • Dominant application platform

– easy deployment: apps downloaded, JS portable
– productive programming: scripting, layout

  • … but not on handhelds

– native frameworks for: iPhone, Google Android
– slow: for Slashdot, Laptop: 3s => iPhone: 21s

  • Parallel browser may need new architecture

– ex: JavaScript relies on “gotos”, is too serial


slide-4
SLIDE 4

Anatomy of a Browser

[Diagram: browser pipeline grouped into Frontend, Layout, and Scripting. Boxes: web servers, decompress, lex, parse + build DOM, plugin (decode image, …), layout, render, script, page?]

slide-5
SLIDE 5

Project Status

  • 1. Developed work-efficient algorithms

work-efficient: no more work than sequential algo.
– layout: parallel-map with a tiling optimization
– layout: break up tree traversal into five parallel ones
– lexing: speculation to break sequential dependencies

  • 2. Reexamining the scripting programming model

– programmer productivity: from callbacks to actors
– performance: adding structure to detect dependences


slide-6
SLIDE 6

Frontend: Lexing

[Diagram: the browser pipeline (web servers, decompress, lex, parse + build DOM, plugin (decode image, …), layout, render, script, page?), shown as a section divider]


slide-7
SLIDE 7

Lexing, from 10,000 feet

Goal: given lexical spec and input, find lexemes

STag ::= <[^>]*>
Content ::= [^<]+
ETag ::= </[^>]*>

Input: <b>Berkeley!</b>

[Diagram: DFA for the spec, with edges labeled '<', '/', Σ – {'<'}, Σ – {'>'}, and Σ – {'/'}; the first lexeme of the input is marked STag]

(label each character with its state)
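
To make "label each character with its state" concrete, here is a minimal sequential sketch. The four-state DFA and its state names (`content`, `tagOpen`, `stag`, `etag`) are assumptions made for illustration; the slides only show the three rules and do not name the states.

```javascript
// Hypothetical DFA for the STag / Content / ETag spec. State names are
// illustrative and do not come from the slides.
const TRANSITIONS = {
  content: (c) => (c === "<" ? "tagOpen" : "content"),
  tagOpen: (c) => (c === "/" ? "etag" : c === ">" ? "content" : "stag"),
  stag: (c) => (c === ">" ? "content" : "stag"),
  etag: (c) => (c === ">" ? "content" : "etag"),
};

// Label each character with the state the DFA is in after consuming it.
function labelStates(input, start = "content") {
  const labels = [];
  let state = start;
  for (const ch of input) {
    state = TRANSITIONS[state](ch);
    labels.push(state);
  }
  return labels;
}

// Cut the input into lexemes by watching the state changes: a '<' seen in
// content opens a tag (and closes any pending Content), a '>' inside a tag
// closes that tag.
function lexemes(input) {
  const out = [];
  let state = "content";
  let begin = 0;
  for (let i = 0; i < input.length; i++) {
    const next = TRANSITIONS[state](input[i]);
    if (state === "content" && next === "tagOpen") {
      if (i > begin) out.push({ kind: "Content", text: input.slice(begin, i) });
      begin = i;
    } else if (state !== "content" && next === "content") {
      const kind = state === "etag" ? "ETag" : "STag";
      out.push({ kind, text: input.slice(begin, i + 1) });
      begin = i + 1;
    }
    state = next;
  }
  if (begin < input.length) {
    const kind = state === "content" ? "Content" : "Incomplete";
    out.push({ kind, text: input.slice(begin) });
  }
  return out;
}
```

Calling `lexemes("<b>Berkeley!</b>")` splits the slide's input into an STag, a Content, and an ETag lexeme. The loop carries `state` from one character to the next, which is the sequential dependence the following slides set out to break.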


slide-8
SLIDE 8

Inherently Sequential?

STag ::= <[^>]*>
Content ::= [^<]+
ETag ::= </[^>]*>

Input: <b>Berkeley!</b>

[Diagram: the DFA for the spec (edges labeled '<', '/', Σ – {'<'}, Σ – {'>'}, Σ – {'/'}); the input is split between Processor 1, Processor 2, …, and the state at the split is marked "?"]

slide-9
SLIDE 9

An observation

In lexing, irrespective of where DFA starts, it converges to a stable, recurring state

Input: <b>Berkeley!</b>

[Diagram: runs over the input begun in different states (the start state, “in ETag”, “in Content”) end up in the same state]

Lexing: Parallel scans thus need not scan from all possible states, just one, yielding a work-efficient algorithm.
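
The observation can be checked on a small example. The sketch below runs one copy of a DFA from every possible state over the same text and reports how many characters it takes until all copies agree. The DFA and its state names are hypothetical stand-ins for the one drawn on the slide.

```javascript
// Hypothetical DFA for the STag / Content / ETag spec (illustrative names).
const TRANSITIONS = {
  content: (c) => (c === "<" ? "tagOpen" : "content"),
  tagOpen: (c) => (c === "/" ? "etag" : c === ">" ? "content" : "stag"),
  stag: (c) => (c === ">" ? "content" : "stag"),
  etag: (c) => (c === ">" ? "content" : "etag"),
};
const STATES = Object.keys(TRANSITIONS);

// Run the DFA over the whole input from a given start state.
function run(input, start) {
  let state = start;
  for (const ch of input) state = TRANSITIONS[state](ch);
  return state;
}

// Number of characters after which runs from all start states agree,
// or -1 if they never agree on this input.
function convergenceLength(input) {
  let current = STATES.slice(); // one DFA copy per possible start state
  for (let i = 0; i < input.length; i++) {
    current = current.map((s) => TRANSITIONS[s](input[i]));
    if (new Set(current).size === 1) return i + 1;
  }
  return -1;
}
```

With this DFA, every copy lands in `content` at the first `>`, so a scan started mid-input from a guessed state becomes correct as soon as it passes a tag end. Text with no `>` at all (for example `"abc"`) never converges, which is why the speculation on the later slides can fail.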




slide-10
SLIDE 10

Our solution (1/2): Partition

  • split input into blocks with k-character overlap
  • scan in parallel; start block from a tolerant state

[Diagram: the input is cut into blocks handed to Processor 1, Processor 2, …; adjacent blocks overlap by k characters]

slide-11
SLIDE 11

Our solution (2/2): Speculate

  • split input into blocks with k-character overlap
  • scan in parallel; start block from a tolerant state
  • check if blocks converge: expected in k-overlap
  • speculation may fail; if so, block is rescanned

[Diagram: the input blocks with their k-character overlaps]
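
The four steps above can be sketched as follows. This is a single-threaded simulation under stated assumptions: the DFA, the choice of `content` as the tolerant state, and the function names are all hypothetical, and the per-block loop only marks where real parallel workers would run.

```javascript
// Hypothetical DFA for the STag / Content / ETag spec (illustrative names).
const TRANSITIONS = {
  content: (c) => (c === "<" ? "tagOpen" : "content"),
  tagOpen: (c) => (c === "/" ? "etag" : c === ">" ? "content" : "stag"),
  stag: (c) => (c === ">" ? "content" : "stag"),
  etag: (c) => (c === ">" ? "content" : "etag"),
};

// Label characters [from, to) of the input, starting in the given state.
function scan(input, from, to, state) {
  const labels = [];
  for (let i = from; i < to; i++) {
    state = TRANSITIONS[state](input[i]);
    labels.push(state);
  }
  return labels;
}

function speculativeLex(input, blockSize, k, tolerant = "content") {
  if (blockSize < 1) throw new RangeError("blockSize must be at least 1");
  // Step 1 (parallelizable): scan every block independently. Each block
  // warms up on the k characters before it, starting from the tolerant
  // state, and records the entry state it guessed this way.
  const blocks = [];
  for (let start = 0; start < input.length; start += blockSize) {
    const end = Math.min(start + blockSize, input.length);
    const warmup = Math.max(0, start - k);
    const all = scan(input, warmup, end, tolerant);
    const guess = start > warmup ? all[start - warmup - 1] : tolerant;
    blocks.push({ start, end, guess, labels: all.slice(start - warmup) });
  }
  // Step 2 (one comparison per block): the true entry state of a block is
  // the last state of its predecessor. A wrong guess triggers a rescan.
  let entry = "content"; // true start state of the whole input
  let rescans = 0;
  const labels = [];
  for (const block of blocks) {
    if (block.guess !== entry) {
      block.labels = scan(input, block.start, block.end, entry);
      rescans++;
    }
    labels.push(...block.labels);
    entry = block.labels[block.labels.length - 1];
  }
  return { labels, rescans };
}
```

Because the DFA is deterministic, a block whose guessed entry state equals the true one needs no further work. On `"<body>text"` with 3-character blocks, an overlap of `k = 3` needs no rescans, while `k = 0` mis-speculates on the second block, which starts inside the tag.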


slide-12
SLIDE 12

Speedup: Flex vs Cell

today’s page sizes: 5 cores are 4.5x faster than flex

baseline: (sequential) flex on the CELL main CPU


slide-13
SLIDE 13

Layout Solving (1/2)

[Diagram: the browser pipeline (web servers, decompress, lex, parse + build DOM, plugin (decode image, …), layout, render, script, page?), shown as a section divider]


slide-14
SLIDE 14

Rule Matching

Goal: Match rules with nodes:

– a rule: p img { fontsize: 7px }
– match tag path
– path-rule matching

  • end with the same node
  • and are a substring

[Diagram: DOM tree with nodes <body>, <p>, <p>, <img>, <b> and text nodes "hello", "ok ok ok ok", "world", "ok"; beside it a rule table with selectors (p; img; p img) and properties (height=83%, width=100px, float=left, fontsize=7px)]
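
A minimal sketch of the matching test as the slide words it: a selector matches a node when the selector's tags end at the same node as the node's tag path and form a contiguous run of it. The array representation and the function names are assumptions. Only the rule `p img { fontsize: 7px }` is taken from the slide; pairing `img` with `width=100px` is made up for the example.

```javascript
// True when the selector's tags are the tail of the node's root-to-node tag path.
function pathMatches(path, selector) {
  if (selector.length > path.length) return false;
  const offset = path.length - selector.length;
  return selector.every((tag, i) => path[offset + i] === tag);
}

// All rules whose selector matches the node at the end of the path.
function matchRules(path, rules) {
  return rules.filter((rule) => pathMatches(path, rule.selector));
}

// Hypothetical rule table; only the first rule appears on the slide.
const rules = [
  { selector: ["p", "img"], properties: { fontsize: "7px" } },
  { selector: ["img"], properties: { width: "100px" } },
];

// The <img> inside a <p> inside <body> matches both rules.
const matched = matchRules(["body", "p", "img"], rules);
```

Full CSS descendant selectors also allow other ancestors between `p` and `img`; this sketch follows the slide's simpler substring formulation, so an `<img>` at `body > p > b > img` would not match `p img` here.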


slide-15
SLIDE 15
Parallelization

  • 1000s nodes, 1000s rules
  • Assign nodes to cores

[Diagram: the DOM tree (<body>, <p>, <p>, <img>, <b>, text "hello", "ok ok ok ok", "world", "ok") next to the rule table with selectors (p; img; p img) and properties (height=83%, width=100px, float=left, fontsize=7px)]


slide-16
SLIDE 16

Tiling for Caches

[Diagram: the rule table with selectors (p; img; p img) and properties (height=83%, width=100px, float=left, fontsize=7px) next to the DOM tree (<body>, <p>, <p>, <img>, <b>, text "hello", "ok ok ok ok", "world", "ok")]

Problem: all the nodes + selectors might not fit in cache!
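
One way to read the tiling idea is as a blocked loop nest: rather than streaming every selector past every node, match a tile of nodes against a tile of selectors at a time so that each tile pair can stay cache-resident. The sketch below shows only that loop structure. The tile sizes, data layout, and function names are assumptions, and JavaScript will not expose the cache effect itself.

```javascript
// True when the selector's tags are the tail of the node's tag path.
function pathMatches(path, selector) {
  if (selector.length > path.length) return false;
  const offset = path.length - selector.length;
  return selector.every((tag, i) => path[offset + i] === tag);
}

// Naive order: for each node, walk the entire selector list.
function matchNaive(paths, selectors) {
  const hits = paths.map(() => []);
  for (let n = 0; n < paths.length; n++) {
    for (let s = 0; s < selectors.length; s++) {
      if (pathMatches(paths[n], selectors[s])) hits[n].push(s);
    }
  }
  return hits;
}

// Tiled order: visit (node tile, selector tile) pairs. Each node tile is
// independent of the others, so node tiles are the unit one would assign
// to cores.
function matchTiled(paths, selectors, nodeTile, ruleTile) {
  const hits = paths.map(() => []);
  for (let n0 = 0; n0 < paths.length; n0 += nodeTile) {
    const n1 = Math.min(n0 + nodeTile, paths.length);
    for (let s0 = 0; s0 < selectors.length; s0 += ruleTile) {
      const s1 = Math.min(s0 + ruleTile, selectors.length);
      for (let n = n0; n < n1; n++) {
        for (let s = s0; s < s1; s++) {
          if (pathMatches(paths[n], selectors[s])) hits[n].push(s);
        }
      }
    }
  }
  return hits;
}
```

Both functions perform the same comparisons and return the same matches; only the visiting order differs, which is consistent with the deck's work-efficiency goal.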



slide-17
SLIDE 17

Speedup (Cilk++)

[Chart: Speedup vs. Fastest Sequential (Slashdot). Speedup (axis 0 to 5) against cores (1 to 16) on a 2 socket x 4 core x 2 thread machine (2.6 GHz, 12x 1 GB). Series: Redundancy opt. + tiling (Cilk); Naïve + tiling (Cilk); Redundancy opt. + tiling (seq.); Naïve (Cilk); Naïve (seq)]


slide-18
SLIDE 18

Layout Solving (2/2)

[Diagram: the browser pipeline (web servers, decompress, lex, parse + build DOM, plugin (decode image, …), layout, render, script, page?), shown as a section divider]


slide-19
SLIDE 19

Problem: Layout a Page

[Diagram: the DOM tree (<body>, <p>, <p>, <img>, <b>, text "hello", "ok ok ok ok", "world", "ok") with each node annotated by style inputs such as w=100, fs=12, float=left, fs=50% and by solved values x, y, w, h; arrows between nodes are labeled with the attributes they carry: fs, Δ, w]


slide-20
SLIDE 20

It looks rather sequential..

[Diagram: the same annotated DOM tree, now with w=200 at the root; highlighted nodes (w=40, fs=6, x=0, y=0, h=10 and w=100, fs=12, x=0, y=10) are linked by arrows labeled Δ, Δ and fs, Δ, w]


slide-21
SLIDE 21

But not entirely

[Diagram: the same annotated DOM tree with w=200 at the root; a highlighted node (w=30, fs=12, x=50, y=10, h=10) with arrows labeled fs, Δ, w]


slide-22
SLIDE 22

5 Phases: Each Exhibits Tree Parallelism

[Diagram: the DOM tree (<body>, <p>, <p>, <img>, <b>, text "hello", "ok ok ok ok", "world", "ok") shown twice, annotated with font sizes (fs=12, fs=6, fs=50%), float=left, w=100, and preferred widths wp and wm (for example wp=80, wm=30)]

Phase 1: font size, temporary width
Phase 2: preferred max & min width
Phase 3: solved width
Phase 4: height, relative x/y position
Phase 5: absolute x/y position
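
The phase structure can be illustrated on a toy layout model. The sketch below alternates top-down and bottom-up traversals in the order of the five phases. Everything else is an assumption made for the example and is not the authors' algorithm: boxes only stack vertically, a glyph is half a font size wide, a text leaf wraps into ceil(maxW / width) lines, and the default font size is 12.

```javascript
const kids = (n) => n.children || [];

// In both traversals the recursive calls on different children are
// independent of each other; that is the tree parallelism in each phase.
function topDown(node, parent, visit) {
  visit(node, parent);
  for (const child of kids(node)) topDown(child, node, visit);
}
function bottomUp(node, visit) {
  for (const child of kids(node)) bottomUp(child, visit);
  visit(node);
}

function layout(root, viewportWidth) {
  // Phase 1 (top-down): font size, inherited or a percentage of the parent's.
  topDown(root, null, (n, p) => {
    const inherited = p ? p.fontSize : 12;
    if (typeof n.fs === "string") n.fontSize = (inherited * parseFloat(n.fs)) / 100;
    else n.fontSize = n.fs !== undefined ? n.fs : inherited;
  });
  // Phase 2 (bottom-up): preferred min and max widths.
  bottomUp(root, (n) => {
    if (n.text !== undefined) {
      const glyph = n.fontSize / 2; // assumed glyph width
      n.minW = Math.max(...n.text.split(" ").map((w) => w.length)) * glyph;
      n.maxW = n.text.length * glyph;
    } else {
      n.minW = Math.max(0, ...kids(n).map((c) => c.minW));
      n.maxW = Math.max(0, ...kids(n).map((c) => c.maxW));
    }
  });
  // Phase 3 (top-down): solved width, shrink-to-fit inside the parent.
  topDown(root, null, (n, p) => {
    const available = p ? p.width : viewportWidth;
    n.width = Math.max(n.minW, Math.min(available, n.maxW));
  });
  // Phase 4 (bottom-up): height and relative y position.
  bottomUp(root, (n) => {
    if (n.text !== undefined) {
      const lines = n.width > 0 ? Math.ceil(n.maxW / n.width) : 0;
      n.height = lines * n.fontSize;
    } else {
      let y = 0;
      for (const c of kids(n)) { c.relY = y; y += c.height; }
      n.height = y;
    }
  });
  // Phase 5 (top-down): absolute positions.
  topDown(root, null, (n, p) => {
    n.x = p ? p.x : 0;
    n.y = (p ? p.y : 0) + (n.relY || 0);
  });
  return root;
}

// Two paragraphs, the second at half the inherited font size.
const page = { children: [
  { fs: 12, children: [{ text: "hello world" }] },
  { fs: "50%", children: [{ text: "ok ok ok ok" }] },
] };
layout(page, 40);
```

Each phase reads only values produced by earlier phases or by the node's parent (top-down) or children (bottom-up), so sibling subtrees never depend on one another within a phase.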


slide-23
SLIDE 23

Results: layout (modeled)

[Chart: Modeled Speedup w/Cilk++. Average speedup (axis 0 to 4) against number of hardware threads (1 to 16). Series: Eight socket x 4 core AMD Opteron 2356 Barcelona, Sun X4600; Dual socket x 4 core AMD Opteron 2356 Barcelona, Sun X2200; Preproduction 2 socket x 4 core x 2 thread Intel Xeon Nehalem]

Baseline: Cilk++ model on 1 core.



slide-24
SLIDE 24

Scripting

[Diagram: the browser pipeline (web servers, decompress, lex, parse + build DOM, plugin (decode image, …), layout, render, script, page?), shown as a section divider]


slide-25
SLIDE 25

Why parallelize scripting (example)

Example: animate between different views

  • each transition: recolor, resize each state or county
  • animation rate 30fps => 33ms for 1000s of nodes

slide-26
SLIDE 26

The browser programming model

  • Nonpreemptive event model
  • Handlers respond to events
  • Handlers execute atomically

– document changes cause relayout
– style changes cause relayout

  • To parallelize, must understand how the document is shared

– document-carried dependencies:
    handler A: california.x = 100;
    handler B: var z = california.x;
– layout-carried dependencies:
    handler A: america.w = 200%;
    layout:    california.w = 200%;
    handler B: var z = california.w;

[Diagram: timeline containing script A (Alameda), script B (Menlo), and repeated layout and render steps, …]
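
The two kinds of dependency can be illustrated with explicit read and write sets. This is a hypothetical model, not something the slides specify: handlers list the `object.property` names they touch, `children` records which elements layout resizes as a consequence of a parent, and only the properties in `layoutProps` propagate through layout.

```javascript
// Hypothetical containment: resizing america makes layout resize california.
const children = { america: ["california"] };
// Hypothetical set of properties that layout propagates to children.
const layoutProps = new Set(["w"]);

// A handler's direct writes plus the writes layout performs as a consequence.
function withLayoutWrites(writes) {
  const all = new Set(writes);
  const pending = [...writes];
  while (pending.length > 0) {
    const [obj, prop] = pending.pop().split(".");
    if (!layoutProps.has(prop)) continue;
    for (const child of children[obj] || []) {
      const key = `${child}.${prop}`;
      if (!all.has(key)) {
        all.add(key);
        pending.push(key);
      }
    }
  }
  return all;
}

// True when `later` must observe the effects of `earlier`, so the two
// handlers cannot simply run in parallel.
function dependsOn(later, earlier) {
  const written = withLayoutWrites(earlier.writes);
  return later.reads.some((r) => written.has(r)) ||
         later.writes.some((w) => written.has(w));
}

// Document-carried: B reads exactly what A wrote.
const moveA = { reads: [], writes: ["california.x"] };
const moveB = { reads: ["california.x"], writes: [] };
// Layout-carried: A writes america.w, layout rewrites california.w, B reads it.
const sizeA = { reads: [], writes: ["america.w"] };
const sizeB = { reads: ["california.w"], writes: [] };
```

Here `dependsOn(moveB, moveA)` holds because both handlers name `california.x`, while `dependsOn(sizeB, sizeA)` holds only once layout's write is taken into account. The second kind of dependency is invisible to an analysis that looks at handler code alone.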


slide-27
SLIDE 27

Concurrency bugs

  • 1. GUI animations and interactions

– several animations modifying an object simultaneously

  • 2. Server interactions

– responses to requests may be delayed, reordered

  • 3. Eager script loading

– executing a script on a document before done loading


slide-28
SLIDE 28

“Gotos” in JavaScript

<div id="box" style="position:absolute; background: yellow;"> My box </div>
<script>
  // After each mouse move, reposition the box 500 ms later.
  document.addEventListener('mousemove', function (e) {
    var left = e.pageX;
    var top = e.pageY;
    setTimeout(function () {
      // CSS lengths need a unit outside quirks mode.
      document.getElementById("box").style.top = top + "px";
      document.getElementById("box").style.left = left + "px";
    }, 500);
  }, false);
</script>

slide-29
SLIDE 29

Preliminary design of our language

[Diagram: dataflow graph in which a mouse source emits top and left, each passing through a "delay 500" node into the top and left of body.column.div]

Program structure is clearer when data and control are explicit

  • in dataflow version: changing mouse coordinates are streams
  • coordinate streams adjust box position after they are delayed
  • structured names of document elements allow analysis
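
To show the contrast with the callback version, here is the same behavior written against a tiny push-based stream class. The `Stream` class, `followMouse`, and the injectable `schedule` parameter are hypothetical helpers written for this sketch; they are not the authors' language and not an existing library.

```javascript
// Minimal push-based stream (hypothetical stand-in for a dataflow language).
class Stream {
  constructor() {
    this.listeners = [];
  }
  subscribe(fn) {
    this.listeners.push(fn);
    return this;
  }
  emit(value) {
    for (const fn of this.listeners) fn(value);
  }
  // New stream carrying fn(value) for every value of this stream.
  map(fn) {
    const out = new Stream();
    this.subscribe((v) => out.emit(fn(v)));
    return out;
  }
  // New stream carrying the same values, ms milliseconds later.
  // schedule(callback, ms) defaults to setTimeout; it is a parameter so the
  // sketch can be exercised without real timers.
  delay(ms, schedule = setTimeout) {
    const out = new Stream();
    this.subscribe((v) => schedule(() => out.emit(v), ms));
    return out;
  }
}

// mouse -> top/left -> delay 500 -> box position, mirroring the diagram.
function followMouse(mouse, box, schedule) {
  mouse.map((e) => e.pageY).delay(500, schedule)
    .subscribe((top) => { box.style.top = top + "px"; });
  mouse.map((e) => e.pageX).delay(500, schedule)
    .subscribe((left) => { box.style.left = left + "px"; });
}

// In a browser one might wire it up as:
//   const mouse = new Stream();
//   document.addEventListener("mousemove", (e) => mouse.emit(e));
//   followMouse(mouse, document.getElementById("box"));
```

The data path from the mouse to each style property is now a visible chain of named stages rather than nested callbacks. That explicit structure is what a tool would need in order to see which parts of the document a program can touch.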

slide-30
SLIDE 30

Summary

  • 1. Developed work-efficient algorithms

– Rule matching: parallel-map with a tiling optimization
– Layout: break up tree traversal into five parallel ones
– Lexing: speculation to break sequential dependencies

  • 2. Reexamining the scripting programming model

– programmer productivity: from callbacks to actors

  • influenced by Flapjax, Ptolemy, Max/MSP, LabVIEW

– performance: adding structure to detect dependences

  • current browsers: JIT compilation, font vectorization, task parallelism, e.g. for image rendering; all these are useful, too.