Parsing JSON Really Quickly: Lessons Learned

SLIDE 1

SLIDE 2

Parsing JSON Really Quickly: Lessons Learned

Daniel Lemire
blog: https://lemire.me
twitter: @lemire
GitHub: https://github.com/lemire/
professor (Computer Science) at Université du Québec (TÉLUQ)
Montreal

SLIDE 3

How fast can you read a large file?

Are you limited by your disk, or are you limited by your CPU?

SLIDE 4

An iMac disk: 2.2 GB/s. Faster SSDs (e.g., 5 GB/s) are available.

SLIDE 5

Reading text lines (CPU only)

~0.6 GB/s on 3.4 GHz Skylake in Java

    void parseLine(String s) {
        volume += s.length();
    }

    void readString(StringReader data) {
        BufferedReader bf = new BufferedReader(data);
        bf.lines().forEach(s -> parseLine(s));
    }

Source available. Improved by JDK-8229022.

SLIDE 6

Reading text lines (CPU only)

~1.5 GB/s on 3.4 GHz Skylake in C++ (GNU GCC 8.3)

    size_t sum_line_lengths(char *data, size_t length) {
        std::stringstream is;
        is.rdbuf()->pubsetbuf(data, length);
        std::string line;
        size_t sumofalllinelengths{0};
        while (getline(is, line)) {
            sumofalllinelengths += line.size();
        }
        return sumofalllinelengths;
    }

Source available.

SLIDE 7

source

SLIDE 8

JSON

  • Specified by Douglas Crockford
  • RFC 7159 by Tim Bray in 2013
  • Ubiquitous format to exchange data

    {
      "Image": {
        "Width": 800,
        "Height": 600,
        "Title": "View from 15th Floor",
        "Thumbnail": {
          "Url": "http://www.example.com/81989943",
          "Height": 125,
          "Width": 100
        }
      }
    }

SLIDE 9

"Our backend spends half its time serializing and deserializing json"

SLIDE 10

JSON parsing

  • Read all of the content
  • Check that it is valid JSON
  • Check Unicode encoding
  • Parse numbers
  • Build DOM (document-object-model)

Harder than parsing lines?

SLIDE 11

Jackson JSON speed (Java)

twitter.json: 0.35 GB/s on 3.4 GHz Skylake

Source code available.

                      speed
    Jackson (Java)    0.35 GB/s
    readLines C++     1.5 GB/s
    disk              2.2 GB/s

SLIDE 12

RapidJSON speed (C++)

twitter.json: 0.650 GB/s on 3.4 GHz Skylake

                      speed
    RapidJSON (C++)   0.65 GB/s
    Jackson (Java)    0.35 GB/s
    readLines C++     1.5 GB/s
    disk              2.2 GB/s

SLIDE 13

simdjson speed (C++)

twitter.json: 2.4 GB/s on 3.4 GHz Skylake

                      speed
    simdjson (C++)    2.4 GB/s
    RapidJSON (C++)   0.65 GB/s
    Jackson (Java)    0.35 GB/s
    readLines C++     1.5 GB/s
    disk              2.2 GB/s

SLIDE 14

2.4 GB/s on a 3.4 GHz (+turbo) processor is ~1.5 cycles per input byte
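
The conversion behind this figure is a single division: cycles per byte equals clock frequency divided by throughput. The sketch below uses a hypothetical function name and assumes 1 GB = 10^9 bytes. At exactly 3.4 GHz the ratio is about 1.42; a turbo clock of 3.6 GHz would give 1.5, but the actual turbo frequency is not stated on the slide, so that figure is only an illustration.

```cpp
// Cycles per input byte = clock frequency / throughput.
// Both arguments are in units of 1e9 per second (GHz and GB/s),
// so the factor 1e9 cancels out.
double cycles_per_byte(double frequency_ghz, double throughput_gb_per_s) {
  return frequency_ghz / throughput_gb_per_s;
}
```

For example, `cycles_per_byte(3.4, 2.4)` is roughly 1.42.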

SLIDE 15

Trick #1: avoid hard-to-predict branches

SLIDE 16

Write random numbers on an array.

    while (howmany != 0) {
        out[index] = random();
        index += 1;
        howmany--;
    }

e.g., ~3 cycles per iteration

SLIDE 17

Write only odd random numbers:

    while (howmany != 0) {
        val = random();
        if (val & 1) { // val is odd  <=== new
            out[index] = val;
            index += 1;
        }
        howmany--;
    }

SLIDE 18

From 3 cycles to 15 cycles per value!

SLIDE 19

Go branchless!

    while (howmany != 0) {
        val = random();
        out[index] = val;
        index += (val bitand 1);
        howmany--;
    }

back to under 4 cycles! Details and code available
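
The two loops can be put side by side as self-contained functions to check that they keep the same values. This is a minimal sketch with hypothetical function names; it reads from an input vector instead of calling `random()`, and it assumes `out` has at least as many elements as `in`, since the branchless version always writes before deciding whether to advance. Whether a given compiler actually emits a branch for either version is not guaranteed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchy version: the 'if' depends on unpredictable data.
size_t keep_odd_branchy(const std::vector<uint32_t> &in,
                        std::vector<uint32_t> &out) {
  size_t index = 0;
  for (uint32_t val : in) {
    if (val & 1) {
      out[index] = val;
      index += 1;
    }
  }
  return index;
}

// Branchless version: always write, advance the index only when val is odd.
// An even value is overwritten by the next write.
size_t keep_odd_branchless(const std::vector<uint32_t> &in,
                           std::vector<uint32_t> &out) {
  size_t index = 0;
  for (uint32_t val : in) {
    out[index] = val;
    index += (val & 1);
  }
  return index;
}
```

Both functions return the number of odd values kept; only the first that many entries of `out` are meaningful.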

SLIDE 20

What if I keep running the same benchmark?

(same pseudo-random integers from run-to-run)

SLIDE 21

Trick #2: Use wide "words"

Don't process byte by byte
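
As a small illustration of working on wide words without any SIMD instructions, the sketch below tests eight bytes for ASCII with one 64-bit load and one mask. The function name is hypothetical and the example is not taken from the talk; it assumes at least eight readable bytes at `p`.

```cpp
#include <cstdint>
#include <cstring>

// True if none of the 8 bytes starting at p has its high bit set,
// i.e. all 8 bytes are ASCII. One load and one AND replace 8 comparisons.
bool is_ascii8(const char *p) {
  uint64_t word;
  std::memcpy(&word, p, sizeof(word)); // safe unaligned load
  return (word & 0x8080808080808080ULL) == 0;
}
```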

SLIDE 22

When possible, use SIMD

  • Available on most commodity processors (ARM, x64)
  • Originally added (Pentium) for multimedia (sound)
  • Add wider (128-bit, 256-bit, 512-bit) registers
  • Adds new fun instructions: do 32 table lookups at once.

SLIDE 23

    ISA                   where                          max. register width
    ARM NEON (AArch64)    mobile phones, tablets         128-bit
    SSE2 ... SSE4.2       legacy x64 (Intel, AMD)        128-bit
    AVX, AVX2             mainstream x64 (Intel, AMD)    256-bit
    AVX-512               latest x64 (Intel)             512-bit

SLIDE 24

  • "Intrinsic" functions (C, C++, Rust, ...) mapping to specific instructions on specific instruction sets
  • Higher level functions (Swift, C++, ...): Java Vector API
  • Autovectorization ("compiler magic") (Java, C, C++, ...)
  • Optimized functions (some in Java)
  • Assembly (e.g., in crypto)

SLIDE 25

Trick #3: avoid memory/object allocation

SLIDE 26

In simdjson, the DOM (document-object-model) is stored on one contiguous tape.
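
To make the idea of a tape concrete, here is a purely illustrative sketch: every node is one 64-bit word in a single array, with a type tag in the top 8 bits and a payload in the low 56 bits. The struct, its member names and this exact bit layout are assumptions made for the example and should not be read as simdjson's actual format. The point is that one up-front reservation replaces one heap allocation per node.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tape: a flat array of 64-bit words, one per node.
struct Tape {
  static constexpr uint64_t kPayloadMask = 0x00FFFFFFFFFFFFFFULL;
  std::vector<uint64_t> words;

  // Reserve once, sized from the input, so that appending never reallocates
  // as long as the document has no more nodes than input bytes.
  void reserve_for(size_t input_bytes) { words.reserve(input_bytes); }

  // Pack an 8-bit type tag and a 56-bit payload into one word.
  void append(char type, uint64_t payload) {
    uint64_t tag = static_cast<uint64_t>(static_cast<uint8_t>(type));
    words.push_back((tag << 56) | (payload & kPayloadMask));
  }

  char type_at(size_t i) const { return static_cast<char>(words[i] >> 56); }
  uint64_t payload_at(size_t i) const { return words[i] & kPayloadMask; }
};
```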

SLIDE 27

Trick #4: measure the performance!

benchmark-driven development

SLIDE 28

Continuous Integration Performance tests

performance regression is a bug that should be spotted early

SLIDE 29

Processor frequencies are not constant

  • Especially on laptops
  • CPU cycles different from time
  • Time can be noisier than CPU cycles

SLIDE 30

Specific examples

SLIDE 31

Example 1. UTF-8

  • Strings are ASCII (1 byte per code point)
  • Otherwise multiple bytes (2, 3 or 4)
  • Only 1.1 M valid UTF-8 code points

SLIDE 32

Validating UTF-8 with if/else/while

    if (byte1 < 0x80) {
        return true; // ASCII
    }
    if (byte1 < 0xE0) {
        if (byte1 < 0xC2 || byte2 > 0xBF) {
            return false;
        }
    } else if (byte1 < 0xF0) {
        // Three-byte form.
        if (byte2 > 0xBF
            || (byte1 == 0xE0 && byte2 < 0xA0)
            || (byte1 == 0xED && 0xA0 <= byte2)
            blablabla
        ) blablabla
    } else {
        // Four-byte form.
        .... blabla
    }
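
For reference, a complete branch-based validator in the same spirit might look like the sketch below. It is not the code elided on the slide; the function names are hypothetical, and the byte ranges follow the standard UTF-8 rules (no overlong forms, no surrogates, nothing above U+10FFFF). Every `if` here is a data-dependent branch, which is what the SIMD approach avoids.

```cpp
#include <cstddef>
#include <cstdint>

// True if byte b lies in the inclusive range [lo, hi].
static bool in_range(uint8_t b, uint8_t lo, uint8_t hi) {
  return b >= lo && b <= hi;
}

bool validate_utf8_branchy(const uint8_t *s, size_t len) {
  size_t i = 0;
  while (i < len) {
    uint8_t b1 = s[i];
    if (b1 < 0x80) { i += 1; continue; }  // ASCII
    if (b1 < 0xC2) return false;          // stray continuation or overlong
    if (b1 < 0xE0) {                      // two-byte form
      if (len - i < 2 || !in_range(s[i + 1], 0x80, 0xBF)) return false;
      i += 2;
    } else if (b1 < 0xF0) {               // three-byte form
      if (len - i < 3) return false;
      uint8_t lo = (b1 == 0xE0) ? 0xA0 : 0x80;  // reject overlong forms
      uint8_t hi = (b1 == 0xED) ? 0x9F : 0xBF;  // reject surrogates
      if (!in_range(s[i + 1], lo, hi) || !in_range(s[i + 2], 0x80, 0xBF))
        return false;
      i += 3;
    } else if (b1 < 0xF5) {               // four-byte form
      if (len - i < 4) return false;
      uint8_t lo = (b1 == 0xF0) ? 0x90 : 0x80;  // reject overlong forms
      uint8_t hi = (b1 == 0xF4) ? 0x8F : 0xBF;  // reject above U+10FFFF
      if (!in_range(s[i + 1], lo, hi) || !in_range(s[i + 2], 0x80, 0xBF) ||
          !in_range(s[i + 3], 0x80, 0xBF))
        return false;
      i += 4;
    } else {
      return false;                       // 0xF5..0xFF never appear in UTF-8
    }
  }
  return true;
}
```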

SLIDE 33

Using SIMD

  • Load 32-byte registers
  • Use ~20 instructions
  • No branch, no branch misprediction

SLIDE 34

Example: Verify that all byte values are no larger than 244

Saturated subtraction: x - 244 is non-zero if and only if x > 244.

    _mm256_subs_epu8(current_bytes, 244);

One instruction, checks 32 bytes at once!
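
The per-byte behaviour of the saturated subtraction can be modelled in scalar code. The sketch below is that scalar model, with hypothetical function names, not the vectorized implementation; the real `_mm256_subs_epu8` intrinsic takes two `__m256i` operands, so the constant 244 would first be broadcast to all 32 lanes (for example with `_mm256_set1_epi8`).

```cpp
#include <cstddef>
#include <cstdint>

// Unsigned saturating subtraction on one byte: max(a - b, 0).
uint8_t subs_u8(uint8_t a, uint8_t b) {
  return a > b ? static_cast<uint8_t>(a - b) : static_cast<uint8_t>(0);
}

// OR together the saturated differences; the accumulator is nonzero
// exactly when at least one byte is larger than 244.
bool has_byte_above_244(const uint8_t *bytes, size_t len) {
  uint8_t acc = 0;
  for (size_t i = 0; i < len; i++) {
    acc |= subs_u8(bytes[i], 244);
  }
  return acc != 0;
}
```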

SLIDE 35

processing random UTF-8

                 cycles/byte
    branching    11
    simdjson     0.5

20x faster!

Source code available.

SLIDE 36

Example 2. Classifying characters

  • comma (0x2c): ,
  • colon (0x3a): :
  • brackets (0x5b, 0x5d, 0x7b, 0x7d): [, ], {, }
  • white-space (0x09, 0x0a, 0x0d, 0x20)
  • others

Classify 16, 32 or 64 characters at once!

SLIDE 37

  • Divide values into two 'nibbles'
  • 0x2c is 2 (high nibble) and c (low nibble)
  • There are 16 possible low nibbles.
  • There are 16 possible high nibbles.

SLIDE 38

ARM NEON and x64 processors have instructions to look up 16-byte tables in a vectorized manner (16 values at a time): pshufb, tbl

SLIDE 39

Start with an array of 4-bit values

    [1, 1, 0, 2, 0, 5, 10, 15, 7, 8, 13, 9, 0, 13, 5, 1]

Create a lookup table

    [200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215]

    0 → 200, 1 → 201, 2 → 202

Result:

    [201, 201, 200, 202, 200, 205, 210, 215, 207, 208, 213, 209, 200, 213, 205, 201]
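
In scalar terms the lookup is `out[i] = table[indexes[i]]` for all 16 positions, which the vector instructions perform in one step. A minimal scalar model, with a hypothetical function name, reproduces the result above. It masks each index to its low 4 bits for safety; the hardware instructions treat out-of-range indexes differently, which this sketch does not model.

```cpp
#include <array>
#include <cstdint>

// Scalar model of a 16-entry byte table lookup: out[i] = table[indexes[i]].
// Assumes every index is in 0..15, as in the example.
std::array<uint8_t, 16> lookup16(const std::array<uint8_t, 16> &table,
                                 const std::array<uint8_t, 16> &indexes) {
  std::array<uint8_t, 16> out{};
  for (int i = 0; i < 16; i++) {
    out[i] = table[indexes[i] & 0x0F];
  }
  return out;
}
```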

SLIDE 40

Find two tables H1 and H2 such that the bitwise AND of the lookups classifies the characters.

    H1(low(c)) & H2(high(c))

  • comma (0x2c): 1
  • colon (0x3a): 2
  • brackets (0x5b, 0x5d, 0x7b, 0x7d): 4
  • most white-space (0x09, 0x0a, 0x0d): 8
  • white space (0x20): 16
  • others: 0

SLIDE 41

    const uint8x16_t low_nibble_mask =
        (uint8x16_t){16, 0, 0, 0, 0, 0, 0, 0, 0, 8, 12, 1, 2, 9, 0, 0};
    const uint8x16_t high_nibble_mask =
        (uint8x16_t){8, 0, 18, 4, 0, 1, 0, 1, 0, 0, 0, 3, 2, 1, 0, 0};
    const uint8x16_t low_nib_and_mask = vmovq_n_u8(0xf);

Five instructions:

    uint8x16_t nib_lo = vandq_u8(chunk, low_nib_and_mask);
    uint8x16_t nib_hi = vshrq_n_u8(chunk, 4);
    uint8x16_t shuf_lo = vqtbl1q_u8(low_nibble_mask, nib_lo);
    uint8x16_t shuf_hi = vqtbl1q_u8(high_nibble_mask, nib_hi);
    return vandq_u8(shuf_lo, shuf_hi);
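
The same two tables can be exercised one byte at a time in portable code. The sketch below is a scalar model of the five vector instructions, with hypothetical names, and only ASCII input is considered. Working the tables through by hand, the classes come out as brackets 1, comma 2, colon 4, tab/line feed/carriage return 8 and space 16, so the bit assignment in these particular tables differs from the illustrative numbering listed on the previous slide.

```cpp
#include <cstdint>

// Tables copied from the slide (indexed by low and high nibble).
static const uint8_t low_nibble_table[16] = {16, 0, 0, 0, 0, 0, 0, 0,
                                             0,  8, 12, 1, 2, 9, 0, 0};
static const uint8_t high_nibble_table[16] = {8, 0, 18, 4, 0, 1, 0, 1,
                                              0, 0, 0,  3, 2, 1, 0, 0};

// Scalar model: split the byte into nibbles, look each one up,
// and AND the two results to obtain the character class.
uint8_t classify(uint8_t c) {
  uint8_t nib_lo = c & 0x0F;
  uint8_t nib_hi = c >> 4;
  return low_nibble_table[nib_lo] & high_nibble_table[nib_hi];
}
```

For example, `classify(',')` looks up low nibble 0xc (value 2) and high nibble 0x2 (value 18), and 2 & 18 is 2.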

SLIDE 42

Example 3. Detecting escaped characters

" \ " \ \ \ \ " \ \ \ "

SLIDE 43

Can you tell where the strings start and end?

    { "\\\"Nam[{": [ 116,"\\\\"

... Without branching?

SLIDE 44

Escape characters follow an odd sequence of backslashes!

SLIDE 45

Identify backslashes:

    { "\\\"Nam[{": [ 116,"\\\\"
    ___111________________1111_   : B

Odd and even positions

    1_1_1_1_1_1_1_1_1_1_1_1_1_1   : E (constant)
    _1_1_1_1_1_1_1_1_1_1_1_1_1_   : O (constant)

SLIDE 46

Do a bunch of arithmetic and logical operations...

    (((B + (B & ~(B << 1) & E)) & ~B) & ~E) | (((B + ((B & ~(B << 1)) & O)) & ~B) & E)

Result:

    { "\\\"Nam[{": [ 116,"\\\\"
    ...
    ______1____________________

No branch!
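
The formula can be tried on a single 64-bit word. In the sketch below, bit i stands for character i (least significant bit first), so the diagrams read left to right from bit 0. The function names are hypothetical, and a backslash run that crosses a 64-bit word boundary is not handled; a real parser has to carry state between words. Adding the start of a run to B makes a carry ripple through the run and land on the first position after it; comparing that position's parity with the parity of the start tells whether the run length is odd.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Bitmask with bit i set when text[i] == c (first 64 characters only).
uint64_t mask_of(const std::string &text, char c) {
  uint64_t m = 0;
  for (size_t i = 0; i < text.size() && i < 64; i++) {
    if (text[i] == c) m |= uint64_t(1) << i;
  }
  return m;
}

// Positions that immediately follow an odd-length run of backslashes.
uint64_t escaped_positions(uint64_t B) {
  const uint64_t E = 0x5555555555555555ULL;  // even positions
  const uint64_t O = 0xAAAAAAAAAAAAAAAAULL;  // odd positions
  uint64_t starts = B & ~(B << 1);           // first backslash of each run
  // Runs starting at an even position are odd-length if they end
  // just before an odd position, and vice versa.
  uint64_t after_even_start = ((B + (starts & E)) & ~B) & ~E;
  uint64_t after_odd_start = ((B + (starts & O)) & ~B) & E;
  return after_even_start | after_odd_start;
}
```

With the example string, B has bits 3 to 5 and 22 to 25 set, and `escaped_positions(B)` has only bit 6 set, matching the diagram.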

SLIDE 47

Remove the escaped quotes, and the remaining quotes tell you where the strings are!

SLIDE 48

    { "\\\"Nam[{": [ 116,"\\\\"
    __1___1_____1________1____1   : all quotes
    ______1____________________   : escaped quotes
    __1_________1________1____1   : string-delimiter quotes

SLIDE 49

Find the span of the string

    mask = quote xor (quote << 1);
    mask = mask xor (mask << 2);
    mask = mask xor (mask << 4);
    mask = mask xor (mask << 8);
    mask = mask xor (mask << 16);
    ...

    __1_________1________1____1   (quotes)

becomes

    __1111111111_________11111_   (string region)
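
The shift-and-xor sequence computes a prefix XOR: bit i of the result is the XOR of bits 0 to i of the input, so the result is 1 from each opening quote up to, but not including, the matching closing quote. A minimal sketch for one 64-bit word follows, with a hypothetical function name; the final shift by 32 is what a 64-bit word needs, which is an assumption about what the elided lines contain. Strings that span several words would also need a carry from the previous word, which is not shown.

```cpp
#include <cstdint>

// Prefix XOR over one 64-bit word: bit i of the result is the XOR
// of bits 0..i of the input (bit 0 is the first character).
uint64_t prefix_xor(uint64_t quotes) {
  uint64_t mask = quotes ^ (quotes << 1);
  mask ^= mask << 2;
  mask ^= mask << 4;
  mask ^= mask << 8;
  mask ^= mask << 16;
  mask ^= mask << 32;
  return mask;
}
```

For the example, the string-delimiter quotes sit at positions 2, 12, 21 and 26, and the result has bits 2 to 11 and 21 to 25 set.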

SLIDE 50

Entire structure of the JSON document can be identified (as a bitset) without any branch!

SLIDE 51

Example 4. Decode bit indexes

Given the bitset 1000100010001, we want the location of the 1s (e.g., 0, 4, 8, 12)

SLIDE 52

    while (word != 0) {
        result[i] = trailingzeroes(word);
        word = word & (word - 1);
        i++;
    }

If number of 1s per 64-bit is hard to predict: lots of mispredictions!!!

SLIDE 53

Instead of predicting the number of 1s per 64-bit, predict whether it is in

  • {1, 2, 3, 4}
  • {5, 6, 7, 8}
  • {9, 10, 11, 12}

Easier!

SLIDE 54

Reduce the number of mispredictions by doing more work per iteration:

    while (word != 0) {
        result[i] = trailingzeroes(word);
        word = word & (word - 1);
        result[i+1] = trailingzeroes(word);
        word = word & (word - 1);
        result[i+2] = trailingzeroes(word);
        word = word & (word - 1);
        result[i+3] = trailingzeroes(word);
        word = word & (word - 1);
        i+=4;
    }

Discard bogus indexes by counting the number of 1s in the word directly (e.g., bitCount)
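
A self-contained version of the unrolled loop might look like the following sketch. It assumes the GCC/Clang builtins `__builtin_ctzll` and `__builtin_popcountll`; since `__builtin_ctzll` is undefined for a zero argument, the helper guards that case and writes 64 into the bogus slots. The function names are hypothetical, and the output buffer must have room for the count rounded up to a multiple of four (at most 64 entries), because every round writes four slots.

```cpp
#include <cstddef>
#include <cstdint>

// Number of trailing zero bits; returns 64 for a zero word.
// Assumption: GCC/Clang builtin, guarded because ctz(0) is undefined.
static inline uint32_t trailing_zeroes(uint64_t w) {
  return w ? static_cast<uint32_t>(__builtin_ctzll(w)) : 64u;
}

// Writes the positions of the 1-bits of word into result (ascending)
// and returns how many of the written entries are real.
size_t decode_bits(uint64_t word, uint32_t *result) {
  // The true count comes from popcount, not from the loop.
  size_t count = static_cast<size_t>(__builtin_popcountll(word));
  size_t i = 0;
  while (word != 0) {  // one branch per four indexes
    result[i] = trailing_zeroes(word);
    word &= word - 1;  // clear the lowest set bit
    result[i + 1] = trailing_zeroes(word);
    word &= word - 1;
    result[i + 2] = trailing_zeroes(word);
    word &= word - 1;
    result[i + 3] = trailing_zeroes(word);
    word &= word - 1;
    i += 4;
  }
  return count;  // entries at or past count are bogus and ignored
}
```

For the bitset 1000100010001 (0x1111), the function returns 4 and the first four entries are 0, 4, 8, 12.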

SLIDE 55

Example 5. Number parsing is expensive

strtod:

  • 90 MB/s
  • 38 cycles per byte
  • 10 branch misses per floating-point number

SLIDE 56

Check whether we have 8 consecutive digits

    bool is_made_of_eight_digits_fast(const char *chars) {
        uint64_t val;
        memcpy(&val, chars, 8);
        return (((val & 0xF0F0F0F0F0F0F0F0) |
                 (((val + 0x0606060606060606) & 0xF0F0F0F0F0F0F0F0) >> 4)) ==
                0x3333333333333333);
    }

SLIDE 57

Then construct the corresponding integer

Using only three multiplications (instead of 7):

    uint32_t parse_eight_digits_unrolled(const char *chars) {
        uint64_t val;
        memcpy(&val, chars, sizeof(uint64_t));
        val = (val & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;
        val = (val & 0x00FF00FF00FF00FF) * 6553601 >> 16;
        return (val & 0x0000FFFF0000FFFF) * 42949672960001 >> 32;
    }

Can do even better with SIMD
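
One way the two routines might be combined is sketched below: consume digits eight at a time while the fast check succeeds, then finish the tail one digit at a time. The two routines are repeated so that the block compiles on its own; `parse_leading_digits` is a hypothetical helper, not part of the talk. The sketch assumes a little-endian machine and does not check for overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// The two routines from the slides (little-endian byte order assumed).
static bool is_made_of_eight_digits_fast(const char *chars) {
  uint64_t val;
  std::memcpy(&val, chars, 8);
  return (((val & 0xF0F0F0F0F0F0F0F0ULL) |
           (((val + 0x0606060606060606ULL) & 0xF0F0F0F0F0F0F0F0ULL) >> 4)) ==
          0x3333333333333333ULL);
}

static uint32_t parse_eight_digits_unrolled(const char *chars) {
  uint64_t val;
  std::memcpy(&val, chars, sizeof(uint64_t));
  val = (val & 0x0F0F0F0F0F0F0F0FULL) * 2561 >> 8;      // pairs of digits
  val = (val & 0x00FF00FF00FF00FFULL) * 6553601 >> 16;  // groups of four
  return static_cast<uint32_t>(
      (val & 0x0000FFFF0000FFFFULL) * 42949672960001ULL >> 32);
}

// Hypothetical helper: value of the leading run of decimal digits in
// p[0..len). Eight digits at a time when possible, then a scalar tail.
uint64_t parse_leading_digits(const char *p, size_t len) {
  uint64_t value = 0;
  while (len >= 8 && is_made_of_eight_digits_fast(p)) {
    value = value * 100000000ULL + parse_eight_digits_unrolled(p);
    p += 8;
    len -= 8;
  }
  while (len > 0 && *p >= '0' && *p <= '9') {
    value = value * 10 + static_cast<uint64_t>(*p - '0');
    p++;
    len--;
  }
  return value;
}
```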

SLIDE 58

Runtime dispatch

On first call, pointer checks CPU, and reassigns itself. No language support.

SLIDE 59

    int json_parse_dispatch(...) {
        Architecture best_implementation = find_best_supported_implementation();
        // Selecting the best implementation
        switch (best_implementation) {
        case Architecture::HASWELL:
            json_parse_ptr = &json_parse_implementation<Architecture::HASWELL>;
            break;
        case Architecture::WESTMERE:
            json_parse_ptr = &json_parse_implementation<Architecture::WESTMERE>;
            break;
        default:
            return UNEXPECTED_ERROR;
        }
        return json_parse_ptr(....);
    }
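
The self-reassigning pointer can be shown in a few lines. In the sketch below every name is hypothetical: the two implementations are stand-ins that simply return the input length, and `cpu_supports_wide_simd` is a placeholder for a real CPU feature query. The pointer initially targets the dispatcher; the first call picks an implementation, stores it, and forwards the call, so later calls go straight to the chosen function.

```cpp
#include <atomic>
#include <cstddef>

// Stand-in implementations (assumption): both just return the length.
int count_generic(const char *, size_t len) { return static_cast<int>(len); }
int count_wide(const char *, size_t len) { return static_cast<int>(len); }

// Placeholder CPU check (assumption); real code would query the
// processor's feature flags here.
bool cpu_supports_wide_simd() { return false; }

using impl_fn = int (*)(const char *, size_t);

int dispatch(const char *buf, size_t len);

// The pointer starts out aimed at the dispatcher itself.
std::atomic<impl_fn> active_impl{&dispatch};

// First call: choose the best implementation, reassign the pointer,
// then forward the call.
int dispatch(const char *buf, size_t len) {
  impl_fn best = cpu_supports_wide_simd() ? &count_wide : &count_generic;
  active_impl.store(best);
  return best(buf, len);
}

// Public entry point: always calls through the pointer.
int process(const char *buf, size_t len) {
  return active_impl.load()(buf, len);
}
```

An atomic pointer is used so that concurrent first calls are well defined; each of them would simply store the same choice.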

SLIDE 60

Where to get it?

  • GitHub: https://github.com/lemire/simdjson/
  • Modern C++, single-header (easy integration)
  • ARM (e.g., iPhone), x64 (going back 10 years)
  • Apache 2.0 (no hidden patents)
  • Used by Microsoft FishStore and Yandex ClickHouse
  • wrappers in Python, PHP, C#, Rust, JavaScript (node), Ruby
  • ports to Rust, Go and C#

SLIDE 61

Reference

Geoff Langdale, Daniel Lemire, Parsing Gigabytes of JSON per Second, VLDB Journal, https://arxiv.org/abs/1902.08318

SLIDE 62

Credit

Geoff Langdale (algorithmic architect and wizard)

Contributors: Thomas Navennec, Kai Wolf, Tyler Kennedy, Frank Wessels, George Fotopoulos, Heinz N. Gies, Emil Gedda, Wojciech Muła, Georgios Floros, Dong Xie, Nan Xiao, Egor Bogatov, Jinxi Wang, Luiz Fernando Peres, Wouter Bolsterlee, Anish Karandikar, Reini Urban, Tom Dyson, Ihor Dotsenko, Alexey Milovidov, Chang Liu, Sunny Gleason, John Keiser, Zach Bjornson, Vitaly Baranov, Juho Lauri, Michael Eisel, Io Daza Dillon, Paul Dreik, Jérémie Piotte and others

SLIDE 63