Invitation to a New Kind of Database
Sheer El Showk, Cofounder, Lore Ai
www.lore.ai
We’re Hiring!

Overview
1. Problem statement (~2 minutes)
2. (Proprietary) Solution: Datomic (~10 minutes)
3. Proposed Open Source Solution (~8 minutes)
   a. Easy Version(s)
   b. Hard Version(s)
4. Discussion & More Brainstorming (after talk & all of tomorrow)

The hard thing about hard things...
➔ NoSQL vs SQL
➔ Data-model vs database type
   ◆ Column vs Row vs Document
➔ Backups: consistent snapshots across different (types of) DBs.
➔ Versioning: changes over time
➔ Trade-offs:
   ◆ Scaling
   ◆ Consistency
[Architecture diagram. Labels: Web Load Balancer, Django, Django API, Celery Workers, Cron-jobs, Logging, Broker, Web scrapers, NLP Services, KG Pipeline, Input, Process, Service or Entity Generator, MySQL (soon ElasticSearch), Minio (distributed blob storage), Redis (distributed caching), Mongo Cluster]

Deconstructing the Database
Outdated, Bad Ideas (Rich Hickey, inventor of Clojure):
● Changing data in place (mutability)
● Fixed schema structure
● Disk locality is an outdated idea (networks are faster than disks!)
● Database must be a server (why not db as a library?)
● Reads & Writes done by same system
Even modern “SQL/NoSQL” DBs make outdated trade-offs (e.g. consistency vs scalability)
https://youtu.be/Cym4TZwTCNU
⇒ Can we come up with more informed, better trade-offs?

Disclaimers
1. I have no affiliation with Datomic
2. I’ve never used Datomic
3. I’m compressing Hickey’s very nice talk into 10 min ⇒ Missing lots of important ideas/details
4. I’ve only been thinking about this for a few weeks (on & off)

Hickey’s Solution I: Fact-Based Schema
● “Atomic” unit of data (datom) is a “fact”:

    Transaction Time | Entity (subject) | Property/Relation | Value
    12234            | 334              | firstName         | sheer
    12235            | 334              | cofounderOf       | 445 (Lore)

● Fact “values” can be literals, lists, types or Ids of other entities (i.e. relations).
● Encode arbitrary schema but additionally includes “time” dimension.
● Don’t confound data with encoding.

“Equivalent” JSON:

    {
      _id: 334,
      first_name: “sheer”,
      ...
      company: {
        name: “Lore”,
        ….
      }
      ….
    }
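The relation between the fact table and the “equivalent” JSON can be sketched in a few lines of Python. This is only an illustration, not Datomic's API: the `Datom` tuple, the `as_document` helper and the third fact (the name of entity 445) are assumptions made up for the example.

```python
from collections import namedtuple

# A datom as described on the slide: (transaction time, entity, relation, value).
Datom = namedtuple("Datom", ["tx", "entity", "relation", "value"])

facts = [
    Datom(12234, 334, "firstName", "sheer"),
    Datom(12235, 334, "cofounderOf", 445),  # the value is the id of another entity
    Datom(12236, 445, "name", "Lore"),      # hypothetical extra fact for illustration
]


def as_document(facts, entity_id):
    """Fold all facts about one entity into a dict (the 'equivalent JSON' view)."""
    doc = {"_id": entity_id}
    for fact in sorted(facts, key=lambda f: f.tx):
        if fact.entity == entity_id:
            doc[fact.relation] = fact.value
    return doc


person = as_document(facts, 334)
company = as_document(facts, person["cofounderOf"])  # follow the relation by id
```

The document view is derived from the facts, not the other way round, which is one reading of “don’t confound data with encoding”.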

Hickey’s Solution II: Immutable Data
● All writes add/retract a “fact” from DB
● Writes are appended to a “log” ⇒ append only system
● Deletions are just a series of “retractions”
● Operations (add/del) are “time-stamped”

    Time | Op  | Ent   | Rel     | Value
    123  | add | sheer | email   | sheer@gmail.com
    124  | add | sheer | email   | sheer@yahoo.com
    125  | del | sheer | email   | sheer@gmail.com
    126  | add | sheer | livesIn | Paris

Data is never deleted!
● The database is just a log of very granular updates of facts.
● Consistency is trivial: select a timestamp and read log up to timestamp.
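“Select a timestamp and read log up to timestamp” can be made concrete with a small replay function over the log from the table. This is a minimal sketch; the `Entry` tuple and the `snapshot` function are names invented for the example, and a real system would use an index instead of scanning the whole log.

```python
from collections import namedtuple

Entry = namedtuple("Entry", ["time", "op", "ent", "rel", "value"])

# The append-only log from the slide.
LOG = [
    Entry(123, "add", "sheer", "email", "sheer@gmail.com"),
    Entry(124, "add", "sheer", "email", "sheer@yahoo.com"),
    Entry(125, "del", "sheer", "email", "sheer@gmail.com"),
    Entry(126, "add", "sheer", "livesIn", "Paris"),
]


def snapshot(log, as_of):
    """Return the set of (ent, rel, value) facts that hold at time `as_of`."""
    facts = set()
    for entry in log:
        if entry.time > as_of:
            break  # the log is append-only, hence ordered by time
        key = (entry.ent, entry.rel, entry.value)
        if entry.op == "add":
            facts.add(key)
        else:
            # "del" retracts the fact; the log entry itself is never removed
            facts.discard(key)
    return facts
```

Reading as of time 124 still shows both e-mail addresses, while reading as of 126 shows only the yahoo address and the Paris fact. The log itself keeps all four entries in both cases.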

Hickey’s Solution III: Structural Sharing
● Consistent reads require versioning (timestamps) so what about an index?
● Hickey (+Phil Bagwell) invented/improved “HAMT”
  ○ Hash Array Mapped Trie
  ○ Structural sharing for efficiently versioning indices
  ○ High (32-way) branching ratio for shallow trees
● Querying an element at a fixed timestamp only requires accessing a single path in the tree
  ○ Cache/access subset of index as required.
  https://hypirion.com/musings/understanding-persistent-vector-pt-1
● Universal fact-based schema:
  ○ only need fixed number (6) of indices
  ○ composite index on ent-rel-value, rel-ent-value, etc.
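Structural sharing is easiest to see in a toy version. The sketch below is an assumption-laden simplification: it is a fixed-size trie indexed by integers (closer to the persistent vector in the linked article than to a real hash-mapped trie), and it uses a branching factor of 4 instead of 32 so the tree stays small. The names `empty`, `get` and `assoc` are made up for the example.

```python
BRANCH = 4  # the slide mentions 32-way branching; 4 keeps the example small
DEPTH = 3   # the tree holds BRANCH ** DEPTH = 64 slots


def empty(depth=DEPTH):
    """Build an empty tree of nested tuples (immutable nodes)."""
    node = (None,) * BRANCH  # leaf level
    for _ in range(depth - 1):
        node = (node,) * BRANCH
    return node


def get(root, index):
    """Walk one path from the root down to the slot for `index`."""
    node = root
    for level in reversed(range(DEPTH)):
        node = node[(index // BRANCH ** level) % BRANCH]
    return node


def assoc(root, index, value, level=DEPTH - 1):
    """Return a new root with `index` set, copying only the nodes on the path."""
    slot = (index // BRANCH ** level) % BRANCH
    child = value if level == 0 else assoc(root[slot], index, value, level - 1)
    return root[:slot] + (child,) + root[slot + 1:]


v1 = assoc(empty(), 5, "a")  # version 1 of the index
v2 = assoc(v1, 40, "b")      # version 2 shares every untouched subtree with v1
```

Each write copies only `DEPTH` nodes. Here `v2[0] is v1[0]` holds, so the subtree containing slot 5 is the same object in both versions, and `v1` remains readable unchanged after `v2` is created.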

Hickey’s Solution IV: Separate Reads & Writes
● Reads, queries and index lookups happen in client process (i.e. here client means app server not user/browser).
● Client-side library provides in-process indexing, queries, aggregations, etc.
● Granular data model (fact-based) and incremental index ⇒ efficient caching of working set (facts & sub-trees)
● Consistency is trivially assured ⇒ each timestamp is consistent “snapshot” of DB.
● Trivial scalability ⇒ every app server is a processing peer.
● Only bottleneck is Transactor but it just appends updates to a log (can be made very fast).
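The split between a single writer and many reading peers might look roughly like the following. This is a sketch of the idea only, not how Datomic is implemented: the `Transactor` and `Peer` classes, their methods and the use of log position as timestamp are all assumptions for illustration, and everything lives in one process here.

```python
class Transactor:
    """Single writer: its only job is to append time-stamped facts to the log."""

    def __init__(self):
        self.log = []

    def transact(self, op, ent, rel, value):
        time = len(self.log) + 1  # monotonically increasing timestamp
        self.log.append((time, op, ent, rel, value))
        return time

    def entries_after(self, time):
        return self.log[time:]  # timestamps are 1-based log positions


class Peer:
    """App-server process: caches the log locally and answers queries itself."""

    def __init__(self, transactor):
        self.transactor = transactor
        self.cache = []

    def sync(self):
        """Pull only the entries this peer has not seen yet."""
        self.cache.extend(self.transactor.entries_after(len(self.cache)))

    def values(self, ent, rel, as_of=None):
        """Query the local cache only; `as_of` pins a consistent snapshot."""
        found = set()
        for time, op, e, r, value in self.cache:
            if as_of is not None and time > as_of:
                break
            if (e, r) == (ent, rel):
                if op == "add":
                    found.add(value)
                else:
                    found.discard(value)
        return found
```

Queries never touch the Transactor, so adding app servers adds read capacity, and a peer that is slightly behind still sees a consistent (older) snapshot.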

How much of this can we replicate? Improve?

Proposal
1. Versioned Document Store (“easy”)
   a. Update log via JSON-patch events
   b. Instantiate Documents in JSON-store (mongo) for near real-time indexing
2. Versioned Relational Document Store (“hardish”)
   a. Add “real-time” indexing to the above and restrict to “shallow” json linked by relations.
3. Fully Versioned Indices (“hard”)
   a. Implement HAMT or similar (see https://github.com/tobgu/pyrsistent)
4. Make it all efficient (“very hard”)
   a. Cythonize
   b. Tweak caching, disk layout, etc.

Versioned JSON Store
● Two ideas for write server (“Transactor”):
  ○ Append-only Mongo Collection
  ○ Redis-queue persisted to S3/Minio flat file
● Format:
  ○ Json-patch files (“transaction”)
  ○ Json-schema library to validate
● Read/Index servers:
  ○ Cluster of Mongo/ES servers “instantiating” docs using jsonpatch library.
  ○ Indexing “almost” real-time but only “current” timestamp.
Only capturing some benefits (little contention, versioning, some schema independence).
● Only server-side processing.
● No versioned index.
● Essentially: just versioned replication
[Diagram: writer options (Redis, Minio (distributed blob storage), Mongo) feed, via ETL with JSONPATCH, a read cluster of three Mongo/ES clusters]
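The ETL step, where a read server “instantiates” a document by replaying its patches, could be sketched as below. To stay dependency-free, `apply_patch` is a deliberately simplified stand-in for a real JSON Patch library (dict keys only, no array indices or pointer escaping). The `time` field and the sample transactions are assumptions made for the example.

```python
import copy


def apply_patch(doc, patch):
    """Apply a list of JSON-patch-style operations and return a new document.

    Simplified stand-in for a real JSON Patch library: handles only
    add/replace/remove on dict keys.
    """
    doc = copy.deepcopy(doc)
    for op in patch:
        *parents, key = op["path"].lstrip("/").split("/")
        target = doc
        for name in parents:
            target = target[name]
        if op["op"] in ("add", "replace"):
            target[key] = op["value"]
        elif op["op"] == "remove":
            del target[key]
        else:
            raise ValueError("unsupported op: %r" % op["op"])
    return doc


def instantiate(transactions, doc_id, as_of=None):
    """Replay the transaction log for one document up to timestamp `as_of`."""
    doc = {}
    for tx in transactions:
        if as_of is not None and tx["time"] > as_of:
            break
        if tx["_id"] == doc_id:
            doc = apply_patch(doc, tx["patches"])
    return doc


# Hypothetical append-only transaction log, ordered by time.
transactions = [
    {"time": 1, "_id": 334, "patches": [
        {"op": "add", "path": "/first_name", "value": "sheer"},
        {"op": "add", "path": "/company", "value": {"name": "Lore"}},
    ]},
    {"time": 2, "_id": 334, "patches": [
        {"op": "replace", "path": "/company/name", "value": "Lore Ai"},
    ]},
]
```

A read server would normally keep only the latest instantiated document, which is why this variant indexes just the “current” timestamp. Older versions remain recoverable by replaying with `as_of`.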

Versioned Relational JSON Store
● Enforce “shallow” JSON
  ○ Only strings, ints, lists, etc. as values
  ○ Dicts/subdocs replaced by “ids”
● Index in real-time:
  ○ Directly index append-log collection
  ○ Manual index in redis
  ○ Index individual “facts” not just docs
● Transactor manually validates patched JSON:
  ○ Enforce existence of IDs (foreign keys)
  ○ Ensure patch respects schema

Sample patch:

    {
      _id: 334,
      prev_checksum: “6d96617b37f4f662783c957”,
      patches: [
        {“op”: “add”, “path”: “/first_name”, “value”: “sheer”},
        {“op”: “add”, “path”: “/company”, “value”: Id(443)},
      ]
    }

Notes
● JSON schemas are JSON docs so can be versioned as special collection in DB.
● Subtle issue: do we enforce FK constraints if doc/entity “retracted”?
● Indexing/reads are still largely centralized (redis/mongo index on append-log).
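The Transactor-side checks listed above (shallow values, existing ids, and a `prev_checksum` that matches the current document) might be sketched like this. Everything here is an assumption for illustration: the `Id` class stands in for the slide's `Id(443)`, the SHA-256 checksum scheme is invented, and the in-memory `store` dict replaces a real database. Schema validation is left out.

```python
import hashlib
import json


class Id:
    """Reference to another entity, standing in for the slide's Id(443)."""

    def __init__(self, target):
        self.target = target


def checksum(doc):
    """Hash of a document's canonical JSON form (hypothetical checksum scheme)."""
    payload = json.dumps(doc, sort_keys=True,
                         default=lambda ref: {"$id": ref.target})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_shallow(value):
    """Allow scalars, Ids and flat lists; reject dicts/sub-documents."""
    if isinstance(value, list):
        return all(is_shallow(item) and not isinstance(item, list)
                   for item in value)
    return isinstance(value, (str, int, float, bool, Id)) or value is None


def validate(store, tx):
    """Return a list of problems; an empty list means the transaction is accepted."""
    errors = []
    current = store.get(tx["_id"], {})
    if tx["prev_checksum"] != checksum(current):
        errors.append("stale prev_checksum")
    for patch in tx["patches"]:
        value = patch.get("value")
        if not is_shallow(value):
            errors.append("non-shallow value at " + patch["path"])
        refs = value if isinstance(value, list) else [value]
        for ref in refs:
            if isinstance(ref, Id) and ref.target not in store:
                errors.append("unknown id %s at %s" % (ref.target, patch["path"]))
    return errors
```

The checksum check acts as an optimistic-concurrency guard: a patch computed against an outdated version of the document is rejected instead of being silently applied.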

Fully Versioned Index
Core Ideas:
● Can we move in-process?
  ○ HAMT or other persistent index? (see https://github.com/tobgu/pyrsistent)
● Cache/access individual facts or index subtrees directly in process (with two-tiered external cache like Datomic).
● Need to implement in-memory query engine using lazy-access cached persistent index.
  ○ Pandas (like?)
  ○ See https://github.com/tobgu/qcache
● Transactor or Indexer has to be able to merge “live” index (recent facts) with persistent index.

Make It All Very Efficient
● Cythonize (what parts?)
  ○ Jsonpatch
  ○ Schema validation
● Use dicts/native-types instead of JSON
● Fast serialization: msgpack, pickle, etc.
● Can we borrow existing OSS technologies for indexing, etc.?

Some Comments
● An index is a tree:
  ○ Represent in JSON/dict and version/store like log?
● Schemas are also just documents
  ○ Can version and manage as a special collection
● Transactions are also “first class” so can carry additional metadata
● More complex ops (+=1) require Transactor to translate op into atomic updates.
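The last point can be illustrated with a small sketch of how a Transactor might expand an increment into plain retract/add facts. The function name, the `(ent, rel) -> value` view of current facts and the sample data are assumptions for the example. The approach relies on the Transactor being the only writer, so the value it reads cannot change before the expanded operations are appended.

```python
def expand_increment(current, ent, rel, amount):
    """Translate `ent.rel += amount` into atomic retract/add operations.

    `current` maps (ent, rel) to the value the Transactor currently sees.
    """
    old = current[(ent, rel)]
    return [
        ("del", ent, rel, old),
        ("add", ent, rel, old + amount),
    ]


# Hypothetical current state held by the Transactor.
current = {("post1", "likes"): 41}
ops = expand_increment(current, "post1", "likes", 1)
```

Only the expanded operations go into the log, so readers replaying it never need to know that the original request was an increment.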

Interested in working on this? Contact me! Slides will go on blog.lore.ai
