S2Graph: A large-scale graph database with HBase
daumkakao
Reference
1. HBase Conference 2015
   1. http://www.slideshare.net/HBaseCon/use-cases-session-5
   2. https://vimeo.com/128203919
2. Deview 2015
3. Apache Con BigData Europe
   1. http://sched.co/3ztM
3
[Figure: a user's activities modeled as edges, each with an affinity score and a property. Edge types: Message, Write (length), Read, Coupon (price), Present (price: 3), Friend, Group (size: 6), Emoticon, Eat (rating), View (count), Play (level: 6), Style (share: 3), Advertise, Search (keyword), Listen (count), Like (count: 7), Comment.]
4
[Figure: the same graph with concrete values. Edge properties: Message length: 9, Write length: 3, Play level: 6, Style share: 3, Advertise ctr: 0.32, Search keyword: "HBase", Listen count: 6, Comment length: 15. Each edge carries an affinity score (values between 1 and 9), including the Friend edge. Targets: Message ID: 201, Ad ID: 603, Music ID: 603, Item ID: 13, Post ID: 97, Game ID: 1984.]
5
Scale (more than):
Social network: 10 billion edges, 200 million vertices, 50 million updates on existing edges.
User activities: over 1 billion new edges per day.
6
Peak graph-traversing queries per second: 20,000
Response time: 100 ms
7
[Figure: Person A (Post), Person B (Comment), Person C (Sharing), Person D (Mention); the connections between them are labeled "Fast".]
8
9
Each app server has to know each DB's sharding logic, which leads to a highly inter-connected architecture.
Data stores: Friend relationship, SNS feeds, Blog user activities, Messaging
Apps: Messaging App, SNS App, Blog App
10
SNS App Blog App Messaging App
stateless app servers
12
13
Recent messages in my chat rooms.
[Figure: Participates edges lead to Chat Rooms; Contains edges lead from a Chat Room to its Messages.]

SELECT b.*
FROM user_chat_rooms a, chat_room_messages b
WHERE a.user_id = 1
  AND a.chat_room_id = b.chat_room_id
  AND b.created_at >= yesterday
14
Recent messages in my chat rooms.
[Figure: Participates edges lead to Chat Rooms; Contains edges lead from a Chat Room to its Messages.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_chat_rooms", "direction": "out", "limit": 100}],
    [{"label": "chat_room_messages", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'
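
The request body above is plain JSON, so it can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names make_step and make_query are hypothetical and are not part of S2Graph, and the payload shape is copied from the curl example rather than from an API reference.

```python
import json

def make_step(label, direction="out", limit=10, where=None):
    """Return one step: a list holding a single query param (hypothetical helper)."""
    param = {"label": label, "direction": direction, "limit": limit}
    if where is not None:
        param["where"] = where
    return [param]

def make_query(user_id, steps, service="s2graph", column="user_id"):
    """Assemble a request body in the shape used by the curl example."""
    return {
        "srcVertices": [{"serviceName": service, "columnName": column, "id": user_id}],
        "steps": steps,
    }

query = make_query(1, [
    make_step("user_chat_rooms", limit=100),
    make_step("chat_room_messages", limit=10, where="created_at >= yesterday"),
])
body = json.dumps(query, indent=2)  # string to send as the POST body
print(body)
```

Each additional hop in the traversal is one more make_step entry, which mirrors how the SQL version would need one more join.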
15
Posts that my friends interacted with.
[Figure: Post 1, Post 2, Post 3]

SELECT a.*, b.*
FROM friends a, user_posts b
WHERE a.user_id = b.user_id
  AND b.updated_at >= yesterday
  AND b.action_type IN ('create', 'like', 'share')
16
Posts that my friends interacted with.
[Figure: Friends edges lead to friends; create/like/share edges lead from friends to Post 1, Post 2, Post 3.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "friends", "direction": "out", "limit": 100}],
    [{"label": "user_posts", "direction": "out", "limit": 10, "where": "created_at >= yesterday"}]
  ]
}
'
17
Products that similar users interacted with recently.
[Figure: Product 1, Product 2, Product 3]

SELECT a.*, b.*
FROM similar_users a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday
18
Products that similar users interacted with recently.
[Figure: Similar Users edges, then user-product interaction edges (click/buy/like/share) to Product 1, Product 2, Product 3. A "Batch" label also appears in the figure.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "filterOut": {
    "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
    "steps": [[{"label": "user_products_interact"}]]
  },
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "similar_users", "direction": "out", "limit": 100, "where": "similarity > 0.2"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= yesterday and price >= 1000"}]
  ]
}
'
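
To make the intent of this query concrete, here is a small in-memory sketch of the same idea in Python. It assumes that filterOut removes from the result anything reachable through the filterOut sub-query (here, products the user already interacted with); that reading is an assumption based on the query shape, not a statement of S2Graph's exact semantics. All data, names and the de-duplication of results are made up for illustration.

```python
# Toy data; all IDs and similarity values are invented for illustration.
similar_users = {1: [(2, 0.9), (3, 0.1), (4, 0.5)]}  # user -> [(similar user, similarity)]
interactions = {1: ["p1"], 2: ["p1", "p2"], 3: ["p9"], 4: ["p2", "p3"]}

def recommend(user, min_similarity=0.2):
    # Assumed filterOut behaviour: exclude what the user already interacted with.
    seen = set(interactions.get(user, []))
    result = []
    for other, similarity in similar_users.get(user, []):
        if similarity <= min_similarity:      # mirrors "where": "similarity > 0.2"
            continue
        for product in interactions.get(other, []):
            # De-duplication is a choice made for this sketch only.
            if product not in seen and product not in result:
                result.append(product)
    return result

print(recommend(1))
```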
19
Products similar to the ones I have shown interest in.
[Figure: Product 1, Product 2, Product 3, each linked to further products.]

SELECT a.*, b.*
FROM similar_ a, user_products b
WHERE a.sim_user_id = b.user_id
  AND b.updated_at >= yesterday
20
Products similar to the ones I have shown interest in.
[Figure: user-product interaction edges (click/buy/like/share) to Product 1, Product 2, Product 3, then Similar Products edges to further products. A "Batch" label also appears in the figure.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "similar_products", "direction": "out", "limit": 10, "where": "similarity > 0.2"}]
  ]
}
'
21
Daily top product per category, for the categories of products that I liked.
[Figure: Product 1, Product 2, Product 3 map to Category 1 and Category 2; each category keeps a TopK (k=1) product per timeUnit (day), shown as Product10 and Product20 under Yesterday and Today.]

SELECT c.*
FROM user_products a, product_categories b, category_daily_top_products c
WHERE a.user_id = 1
  AND a.product_id = b.product_id
  AND b.category_id = c.category_id
  AND c.time BETWEEN yesterday AND today
22
Daily top product per category, for the categories of products that I liked.
[Figure: user-product interaction edges (click/buy/like/share) to Product 1, Product 2, Product 3, which map to Category 1 and Category 2; each category keeps a TopK (k=1) product per timeUnit (day), shown as Product10 and Product20 under Yesterday and Today.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "product_cates", "direction": "out", "limit": 3}],
    [{"label": "category_products_topK", "direction": "out", "limit": 10}]
  ]
}
'
23
Products interacted with by users who interacted with the products that I interacted with.
[Figure: Product 1, Product 2, Product 3]

SELECT b.product_id, count(*)
FROM user_products a, user_products b
WHERE a.user_id = 1
  AND a.product_id = b.product_id
GROUP BY b.product_id
24
Products interacted with by users who interacted with the products that I interacted with.
[Figure: user-product interaction edges (click/buy/like/share) to Product 1, Product 2, Product 3.]

curl -XPOST localhost:9000/graphs/getEdges -H 'Content-Type: application/json' -d '
{
  "srcVertices": [{"serviceName": "s2graph", "columnName": "user_id", "id": 1}],
  "steps": [
    [{"label": "user_products_interact", "direction": "out", "limit": 100, "where": "created_at >= yesterday and price >= 1000"}],
    [{"label": "user_products_interact", "direction": "in", "limit": 10, "where": "created_at >= today"}],
    [{"label": "user_products_interact", "direction": "out", "limit": 10, "where": "created_at >= 1 hour ago"}]
  ]
}
'
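
The out, in, out pattern of this query is a classic co-occurrence walk. A rough Python sketch of the same walk over toy data follows; the data is invented, the time and price filters are left out, and excluding my own products from the result is a choice made for this sketch, not something the query above does.

```python
from collections import Counter

# Toy interaction data: user -> products (all values are made up).
user_products = {1: ["p1", "p2"], 2: ["p1", "p3"], 3: ["p2", "p3", "p4"]}

def co_interacted(user):
    mine = set(user_products[user])              # step 1 (out): my products
    counts = Counter()
    for other, products in user_products.items():
        if other == user or not mine & set(products):
            continue                             # step 2 (in): users sharing a product with me
        for product in products:                 # step 3 (out): what those users interacted with
            if product not in mine:              # sketch-only choice: skip what I already have
                counts[product] += 1
    return counts.most_common()

print(co_interacted(1))
```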
25
26
ID 1231-123 Prop1 Val1 Prop2 Val2 … …
27
CRUD in RDBMS)
direction)
follow).
Edge Reference 1,101,”friend”,”out” Prop1 Val1 Prop2 Val2 … …
28
Degree Q1 Q2 Q3 1-friend-
3 c-103 b-102 a-101
1 101 102 103
1. addIndex, createIndex
2. Automatically keep edges ordered for multiple indices.
3. Support int/long/float/string data types.
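
As an illustration of what "keeping edges ordered for multiple indices" can mean, here is a small in-memory Python sketch. It is not S2Graph's implementation: the class name EdgeIndexes, the index names and the sample edges are all hypothetical, and sorted lists stand in for the ordered storage.

```python
import bisect

class EdgeIndexes:
    """Keeps one sorted list of (sort key, target id) per index (hypothetical class)."""

    def __init__(self, index_props):
        # index_props: index name -> property names used as the sort key
        self.index_props = index_props
        self.sorted_edges = {name: [] for name in index_props}

    def insert(self, target_id, props):
        # Every insert updates all indices, so each stays ordered.
        for name, fields in self.index_props.items():
            key = tuple(props[f] for f in fields)
            bisect.insort(self.sorted_edges[name], (key, target_id))

    def top(self, name, limit):
        # Largest keys first, similar to the c-103, b-102, a-101 ordering above.
        return [target for _, target in reversed(self.sorted_edges[name])][:limit]

idx = EdgeIndexes({"by_name": ["name"], "by_age": ["age"]})
idx.insert(101, {"name": "a", "age": 15})
idx.insert(102, {"name": "b", "age": 21})
idx.insert(103, {"name": "c", "age": 30})
idx.insert(104, {"name": "d", "age": 18})
print(idx.top("by_name", 3))
print(idx.top("by_age", 2))
```

Because the order is maintained at write time, a read with a limit only has to scan the first few entries of the chosen index.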
29
class Query {        // Defines a breadth-first search
  List[VertexId] startVertices;
  List[Step] steps;
}
class Step {         // Defines one breadth
  List[QueryParam] queryParams;
}
class QueryParam {   // Defines each edge to traverse for the current breadth
  String label;
  String direction;
  Map options;
}
QueryParam Step1 Step2 Query
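
A minimal Python sketch of how such a Query could drive a breadth-first traversal is shown below. The three classes mirror the pseudocode above; the edge store EDGES, the run function and the sample data are hypothetical, and only a "limit" option is handled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class QueryParam:
    label: str
    direction: str = "out"
    options: Dict[str, int] = field(default_factory=dict)

@dataclass
class Step:
    query_params: List[QueryParam]

@dataclass
class Query:
    start_vertices: List[int]
    steps: List[Step]

# Toy edge store: (source, label, direction) -> target vertex ids (made-up data)
EDGES: Dict[Tuple[int, str, str], List[int]] = {
    (1, "friends", "out"): [2, 3],
    (2, "user_posts", "out"): [10, 11],
    (3, "user_posts", "out"): [11, 12],
}

def run(query: Query) -> List[int]:
    """Expand the frontier one step (one breadth) at a time."""
    frontier = list(query.start_vertices)
    for step in query.steps:
        next_frontier: List[int] = []
        for vertex in frontier:
            for param in step.query_params:
                targets = EDGES.get((vertex, param.label, param.direction), [])
                limit = param.options.get("limit", len(targets))
                for target in targets[:limit]:
                    if target not in next_frontier:
                        next_frontier.append(target)
        frontier = next_frontier
    return frontier

q = Query([1], [
    Step([QueryParam("friends")]),
    Step([QueryParam("user_posts", options={"limit": 2})]),
])
print(run(q))
```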
30
31
[Figure: a Post or Like is written (fan-out) to the Feed Queue of each friend.]
Write: # of friends
Read: O(1) for friends
Storage: AVG(# of friends) * total user activity
Query: O(1)
32
Post
Write: O(1)
Read: None
Storage: total user activity
Query: O(1) for friends + O(# of friends)
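
The two cost profiles above can be compared with a small Python sketch. The first half copies each activity into every friend's feed queue at write time; the second stores each activity once and gathers it per friend at query time. Function names and data are invented for illustration and do not come from the slides.

```python
from collections import defaultdict

friends = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}  # toy friend lists

# Fan-out on write: one copy of the activity per friend.
feed_queues = defaultdict(list)

def push_write(author, post):
    for friend in friends[author]:               # O(# of friends) writes
        feed_queues[friend].append((author, post))

def push_read(user):
    return list(feed_queues[user])               # single lookup at query time

# Store once, join at query time.
activities = defaultdict(list)

def pull_write(author, post):
    activities[author].append((author, post))    # O(1) write

def pull_read(user):
    feed = []
    for friend in friends[user]:                 # one lookup per friend
        feed.extend(activities[friend])
    return feed

for author, post in [("b", "post-1"), ("c", "post-2"), ("a", "post-3")]:
    push_write(author, post)
    pull_write(author, post)

print(sorted(push_read("a")), sorted(pull_read("a")))
```

Both reads return the same posts for user "a", but the fan-out variant stores one entry per friend per activity while the other stores one entry per activity.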
33
34
S2Graph
Write API + Query DSL
OpenSourced User/Item Similarity Apache Spark (Batch Computing Layer) TopK Counter Others S2Graph Bulk Loader will be open sourced soon
35
36
S2Graph
And many, many more. Just think of your service as a graph model.
38
Logical View

                 Tgt Vertex ID1   Tgt Vertex ID2   Tgt Vertex ID3
Src Vertex ID1   Properties       Properties       Properties
Src Vertex ID2   Properties       Properties       Properties

                 Index Values | Tgt Vertex ID1   Index Values | Tgt Vertex ID2
Src Vertex ID1   Non-index Properties            Non-index Properties
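
Read as a mapping, the second table says: the row is the source vertex, the column is the index values followed by the target vertex ID, and the cell holds the non-index properties. A tiny Python sketch of that mapping follows. The function to_cell, the "|" separator and the JSON cell value are assumptions for readability only; the actual byte layout used on HBase is not described here.

```python
import json

def to_cell(src, tgt, props, index_fields):
    """Map one edge to (row, column, value) following the logical view (hypothetical helper)."""
    index_values = [str(props[f]) for f in index_fields]
    column = "|".join(index_values + [str(tgt)])           # index values, then target vertex id
    rest = {k: v for k, v in props.items() if k not in index_fields}
    return str(src), column, json.dumps(rest, sort_keys=True)

cell = to_cell(1, 101, {"name": "a", "age": 15, "gender": "F"}, ["name"])
print(cell)
```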
39
Logical View
(row \ column)   Property Key1   Property Key2
Src Vertex ID1   Value1          Value2
Vertex ID2       Value1          Value2
40
Update/Delete edge is hard.
Backtracking from snapshotEdge
Problem
41
IndexedEdge
Row               Degree   Q1      Q2      Q3
1-friend-out-PK   3        c-103   b-102   a-101
  Q1 (c-103): age:30, gender:M
  Q2 (b-102): age:21
  Q3 (a-101): age:15, gender:F

SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:a:t0 age:15:t0 gender:F:t0

curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: application/json' -d '
[
  {"timestamp": t0, "from": 1, "to": 101, "label": "friend", "props": {"name": "a", "age": 15, "gender": "F"}},
  {"timestamp": t0, "from": 1, "to": 102, "label": "friend", "props": {"name": "b", "age": 21}},
  {"timestamp": t0, "from": 1, "to": 103, "label": "friend", "props": {"name": "c", "age": 30, "gender": "M"}}
]
'
42
SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:a:t0 name:d:t1 age:15:t0 age:26:t1 gender:F:t0
            IndexedEdge: delete(1, (a-101)) insert(1, (d-101))

curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: application/json' -d '
[
  {"timestamp": t-1, "from": 1, "to": 101, "label": "friend", "props": {"name": "k", "age": -10}},
  {"timestamp": t1, "from": 1, "to": 101, "label": "friend", "props": {"name": "d", "age": 26}}
]
'

1. Fetch SnapshotEdge
2. Check pending mutations and retry
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate IndexedEdge
6. CAS on new SnapshotEdge

If pending mutations exist, another thread is mutating this edge, so commit the pending mutations and retry.
If the CAS fails at step 4, another thread holds the lock, so retry.

SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:a:t0 age:15:t0 gender:F:t0
43
IndexedEdge
Row               Degree   Q0      Q1      Q2      Q3
1-friend-out-PK   3        d-101   c-103   b-102   a-101
  Q0 (d-101): age:26, gender:F
  Q1 (c-103): age:30, gender:M
  Q2 (b-102): age:21
  Q3 (a-101): age:15, gender:F

SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:a:t0 name:d:t1 age:15:t0 age:26:t1 gender:F:t0
            IndexedEdge: delete(1, (a-101)) insert(1, (d-101))

1. Fetch SnapshotEdge
2. Apply mutations stored in SnapshotEdge, if any exist
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate IndexedEdge
6. CAS on new SnapshotEdge

If any failure occurs at step 5, abort and retry from step 1. It is safe to issue the same mutation multiple times since S2Graph is idempotent.
44
IndexedEdge
Row               Degree   Q0      Q1      Q2
1-friend-out-PK   3        d-101   c-103   b-102
  Q0 (d-101): age:26, gender:F
  Q1 (c-103): age:30, gender:M
  Q2 (b-102): age:21

SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:d:t1 age:26:t1 gender:F:t0

1. Fetch SnapshotEdge
2. Apply mutations stored in SnapshotEdge, if any exist
3. Build update on Snapshot/Indexed Edge
4. CAS on new SnapshotEdge
5. Mutate IndexedEdge
6. CAS on new SnapshotEdge

If the CAS fails at step 6, retry from step 1.
45
1. Fetch SnapshotEdge
2. Build update on Snapshot/Indexed Edge
3. CAS on new SnapshotEdge
4. Mutate IndexedEdge
5. CAS on new SnapshotEdge

IndexedEdge
Row               Degree   Q0      Q1      Q2
1-friend-out-PK   3        d-101   c-103   b-102
  Q0 (d-101): age:26, gender:F
  Q1 (c-103): age:30, gender:M
  Q2 (b-102): age:21

SnapshotEdge
Row               Degree   Q1    Q2    Q3
1-friend-out-PK   3        103   102   101
  Q1 (103): name:c:t0 age:30:t0 gender:M:t0
  Q2 (102): name:b:t0 age:21:t0
  Q3 (101): name:d:t1 age:26:t1 gender:F:t0
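
The update steps on the preceding slides can be sketched in Python as follows. This is a toy, single-process model written under several assumptions: a dictionary with a lock stands in for a store that offers an atomic compare-and-set, properties are merged per key by timestamp (which matches the name:d:t1 age:26:t1 gender:F:t0 result on the slides), and every edge has a "name" property used as the index value. The names Store, merge, index_entry, apply_index and update are hypothetical.

```python
import threading

class Store:
    """Toy stand-in for a table with an atomic compare-and-set per key."""
    def __init__(self):
        self.snapshot = {}   # (src, dst) -> (props, pending index mutations or None)
        self.index = set()   # indexed edges as (src, "name-dst") entries
        self._lock = threading.Lock()

    def cas(self, key, expected, new):
        with self._lock:
            if self.snapshot.get(key) != expected:
                return False
            self.snapshot[key] = new
            return True

def merge(props, ts, new_props):
    """Per-property merge: a value is replaced only by one with a newer timestamp."""
    merged = dict(props)
    for k, v in new_props.items():
        if k not in merged or merged[k][1] < ts:
            merged[k] = (v, ts)
    return merged

def index_entry(src, dst, props):
    return (src, f"{props['name'][0]}-{dst}")

def apply_index(store, pending):
    for entry in pending["delete"]:
        store.index.discard(entry)   # safe to repeat
    for entry in pending["insert"]:
        store.index.add(entry)       # safe to repeat

def update(store, src, dst, ts, new_props, max_retries=10):
    key = (src, dst)
    for _ in range(max_retries):
        old = store.snapshot.get(key)                      # fetch SnapshotEdge
        props, pending = old if old else ({}, None)
        if pending is not None:                            # pending mutations: commit them, retry
            apply_index(store, pending)
            store.cas(key, old, (props, None))
            continue
        merged = merge(props, ts, new_props)               # build update
        if merged == props:
            return True                                    # older timestamp: nothing to change
        pending = {"delete": [index_entry(src, dst, props)] if props else [],
                   "insert": [index_entry(src, dst, merged)]}
        locked = (merged, pending)
        if not store.cas(key, old, locked):                # CAS on new SnapshotEdge (lock)
            continue
        apply_index(store, pending)                        # mutate IndexedEdge
        if store.cas(key, locked, (merged, None)):         # CAS on new SnapshotEdge (release)
            return True
    return False

store = Store()
update(store, 1, 101, 0, {"name": "a", "age": 15, "gender": "F"})
update(store, 1, 101, -1, {"name": "k", "age": -10})   # older than t0, has no effect
update(store, 1, 101, 1, {"name": "d", "age": 26})
print(store.index, store.snapshot[(1, 101)])
```

Storing the pending index mutations inside the snapshot value is what lets any later writer finish an interrupted update before starting its own.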
46
48
49
50
51
[Figure: QPS (queries per second) and latency (ms) versus # of app servers (1, 2, 4, 8). Latency axis: 50 to 200; QPS axis: 1,000 to 4,000.]
[Figure: QPS versus # of app servers (1 to 8). QPS axis: 500 to 3,000.]
[Figure: QPS and latency (ms) versus limit on first step (20, 40, 80, 200, 400, 800). Latency axis: 87.5 to 350; QPS axis: 500 to 2,000.]
53
[Figure: QPS and latency (ms) versus limits on path (10 -> 100; 100 -> 10; 10 -> 10 -> 10; 2 -> 5 -> 10 -> 10; 2 -> 5 -> 2 -> 5 -> 10). Latency axis: 37.5 to 150; QPS axis: 80 to 400.]
54
[Figure: latency versus requests per second. Latency axis: 1.25 to 5; request-per-second axis labels: 8000, 16000, 800000.]
55
[Figure: latency versus requests per second. Latency axis: 2 to 8; request-per-second axis: 2000 to 6000.]
56
57
58
59