an introduction reach1to1 - 1 / 25 what we do big data solutions - - PowerPoint PPT Presentation

an introduction


slide-1
SLIDE 1

reach1to1 - 1 / 25

an introduction

slide-2
SLIDE 2

reach1to1 - 2 / 25

what we do

big data solutions for capturing, storing, searching and analyzing structured and unstructured data from multiple sources

slide-3
SLIDE 3

reach1to1 - 3 / 25

big data technology benefits

distributed computing

  • cluster of low cost commodity servers
  • capable of handling unlimited growth in data size
  • distributed parallel processing models
  • no loss in performance with increasing data size
  • no licensing costs - primarily open source
slide-4
SLIDE 4

reach1to1 - 4 / 25

big data technologies

  • open source technologies that are developed and being used by companies like Google, Facebook, Twitter and LinkedIn

slide-5
SLIDE 5

reach1to1 - 5 / 25

case studies

  • patent document repository
      • a large international chemical manufacturer needs a high performance document repository capable of handling a large volume of patent documents with advanced search capabilities

  • log file analysis
      • a large telecom provider needs to analyze log files generated from automated customer support calls and call center logs without manual data collation

  • customer activity analysis
      • a fast growing low-cost airline needs to analyze customer activity to enable promotional fares to increase market share

slide-6
SLIDE 6

reach1to1 - 6 / 25

why reach1to1?

  • combined experience of over 20 years in NoSQL database technologies
  • expertise in the entire product development life cycle
  • handled a range of enterprise applications using NoSQL databases, including
      • sales monitoring and analytics
      • customer order tracking
      • accounts receivable tracking
      • customer support tracking
slide-7
SLIDE 7

reach1to1 - 7 / 25

patent document repository

  • a case study outline
slide-8
SLIDE 8

reach1to1 - 8 / 25

[diagram: folders (f1, f2, f3), documents (d1), document families (pf1, pf2, pf3), comments (c1, c2, c3)]

  • documents are organized into multiple folders that determine access rights and represent logical collections
  • documents are also grouped into patent families, which determine relationships based on the priority codes assigned to each document
  • users review documents and add comments that represent their views on the researched topic

data requirements

slide-9
SLIDE 9

reach1to1 - 9 / 25

[diagram: repository with batch operations (documents d1) and user operations (search, crud, comment)]

functional requirements

slide-10
SLIDE 10

reach1to1 - 10 / 25

  • documents are added or replaced in the repository in batches of up to thousands of documents
  • the critical performance metrics for batch operations are throughput and access delay
  • batch throughput is the rate at which documents are processed
  • access delay is the time from the start of the batch until its documents are available for user operations

[diagram: documents (d1) loaded into the repository]

batch operations
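
The two batch metrics defined on this slide can be made concrete with a small calculation. This is only a sketch: the `BatchRun` record, its field names and the sample numbers are assumptions made for illustration, not part of the repository described in the slides.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    """Hypothetical record of one batch; all field names are assumptions."""
    started_at: float    # seconds, when the batch started
    finished_at: float   # seconds, when the last document was processed
    available_at: float  # seconds, when the documents became usable by users
    documents: int       # number of documents in the batch


def throughput(run: BatchRun) -> float:
    """Documents processed per second over the whole batch."""
    elapsed = run.finished_at - run.started_at
    if elapsed <= 0:
        raise ValueError("batch must take a positive amount of time")
    return run.documents / elapsed


def access_delay(run: BatchRun) -> float:
    """Seconds from the start of the batch until documents are available."""
    return run.available_at - run.started_at


# Made-up numbers: 2000 documents processed in 40 s, available after 45 s.
run = BatchRun(started_at=0.0, finished_at=40.0, available_at=45.0, documents=2000)
print(throughput(run))    # 50.0 documents per second
print(access_delay(run))  # 45.0 seconds
```

The two numbers can move independently: a batch can have high throughput and still have a long access delay if documents only become visible once the whole batch has been indexed.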

slide-11
SLIDE 11

reach1to1 - 11 / 25

  • create / retrieve / update / delete
      – documents based on access rights
  • comment
      – crud operations on comments
      – comments can be private or public
  • search
      – facility for advanced full text search features
      – facility for faceted search for drilling down into search results
      – search results need to contain highlights for matching terms
      – search based on concordance

[diagram: repository with search and add / update comment operations]

user operations

slide-12
SLIDE 12

reach1to1 - 12 / 25

[diagram labels: client application, repository application server, client API, synchronization, object oriented database, storage & retrieval, advanced full text search, indexing, document families, relationships]

architecture

slide-13
SLIDE 13

reach1to1 - 13 / 25

object oriented database

  • HBase used for persistence
  • provides random, real-time read/write access
  • capable of hosting very large data
  • can use clusters of servers
  • multi-value and hierarchical parameters mapped to column families and columns
  • links between documents and related objects stored as linked object ids

persistence
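
One way to picture the mapping of multi-value and hierarchical parameters onto column families and columns is to flatten a nested record into `family:qualifier` cells. The sketch below is plain Python and does not call HBase. The function name, the `_0`, `_1` suffix scheme for multi-value parameters and the sample record are all assumptions made for illustration.

```python
def to_cells(row_key, record):
    """Flatten {family: {qualifier: value}} into (row, 'family:qualifier', value) cells.

    A list value (a multi-value parameter) becomes one column per item, with
    the item index as a suffix. This naming scheme is an assumption.
    """
    cells = []
    for family, columns in record.items():
        for qualifier, value in columns.items():
            if isinstance(value, list):
                for i, item in enumerate(value):
                    cells.append((row_key, f"{family}:{qualifier}_{i}", str(item)))
            else:
                cells.append((row_key, f"{family}:{qualifier}", str(value)))
    return cells


# Made-up document: "meta" and "links" stand in for column families,
# and the "links" family holds the ids of related objects.
document = {
    "meta": {"title": "example patent", "inventors": ["inventor-1", "inventor-2"]},
    "links": {"folder": "f1", "comments": ["c1", "c2"]},
}

for cell in to_cells("d1", document):
    print(cell)
```

Storing only the ids of linked objects keeps each row small. The related folder or comment is then fetched with a second lookup by its own row key.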

slide-14
SLIDE 14

reach1to1 - 14 / 25

[diagram: folders (f1, f2, f3), documents (d1), patents (p1, p2, p3), comments (c1, c2, c3)]

data model

slide-15
SLIDE 15

reach1to1 - 15 / 25

advanced full text search

  • Solr provides powerful full-text search, hit highlighting, faceted search and dynamic clustering
  • highly scalable, with distributed search and index replication
  • documents, comments and patents are indexed in a 1+n+m denormalized index structure
  • field collapsing is used to group multiple search results
  • pivoted faceting is used to keep facet results accurate despite the duplicate entries

indexing

slide-16
SLIDE 16

reach1to1 - 16 / 25

example: 1 folder, 1 document, 2 patents (n = 2), 3 comments (m = 3)

  • 1 entry with folder+document properties
  • n entries with folder+document + patent properties
  • m entries with folder+document + comment properties

1 + n + m = 1 + 2 + 3 => 6 index entries

indexing model
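
The 1+n+m indexing model can be sketched as a function that repeats the folder and document properties on every entry. The code is plain Python and does not use Solr. The field prefixes, the `entry_type` field and the `collapse` helper (a simple stand-in for the idea behind field collapsing) are assumptions made for illustration.

```python
def build_index_entries(folder, document, patents, comments):
    """Return 1 + n + m flat index entries for one document."""
    # Folder and document properties are copied onto every entry.
    base = {f"folder_{k}": v for k, v in folder.items()}
    base.update({f"doc_{k}": v for k, v in document.items()})

    entries = [{**base, "entry_type": "document"}]            # 1 entry
    for patent in patents:                                     # n entries
        entries.append({**base, "entry_type": "patent",
                        **{f"patent_{k}": v for k, v in patent.items()}})
    for comment in comments:                                   # m entries
        entries.append({**base, "entry_type": "comment",
                        **{f"comment_{k}": v for k, v in comment.items()}})
    return entries


def collapse(entries, field):
    """Keep the first entry for each distinct value of `field`."""
    seen = {}
    for entry in entries:
        seen.setdefault(entry[field], entry)
    return list(seen.values())


entries = build_index_entries(
    folder={"id": "f1"},
    document={"id": "d1"},
    patents=[{"id": "p1"}, {"id": "p2"}],
    comments=[{"id": "c1"}, {"id": "c2"}, {"id": "c3"}],
)
print(len(entries))                      # 6, i.e. 1 + 2 + 3
print(len(collapse(entries, "doc_id")))  # 1, one result per document
```

Because the same document appears in several entries, a query that matches more than one of them needs grouping so that the user sees one result per document.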

slide-17
SLIDE 17

reach1to1 - 17 / 25

graph traversal

  • Neo4j for mapping a graph of documents based on their tags
  • a high performance graph database with transaction support
  • documents, tags and families are created as vertices
  • edges between document and tag vertices
  • family is a fully connected sub-graph

relationships
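
The graph structure on this slide can be sketched in memory without Neo4j. The sketch assumes one reading of "fully connected sub-graph": every pair of documents in a family is joined by an edge. The function names and sample ids are hypothetical.

```python
from itertools import combinations


def tag_edges(doc_tags):
    """One edge per (document, tag) pair."""
    return [(doc, tag) for doc, tags in doc_tags.items() for tag in tags]


def family_edges(members):
    """Edges of a fully connected sub-graph: one edge per unordered pair."""
    return list(combinations(sorted(members), 2))


# Made-up data: three documents, two tags, one family.
doc_tags = {"d1": ["t1"], "d2": ["t1", "t2"], "d3": ["t2"]}
family = {"d1", "d2", "d3"}

print(tag_edges(doc_tags))   # [('d1', 't1'), ('d2', 't1'), ('d2', 't2'), ('d3', 't2')]
print(family_edges(family))  # [('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]
```

A family of k documents needs k(k-1)/2 edges under this reading, so any member can reach any other in a single hop.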

slide-18
SLIDE 18

reach1to1 - 18 / 25

[diagram: document vertex, tags, three families]

grouping into families

slide-19
SLIDE 19

reach1to1 - 19 / 25

repository application server

  • oodebe is a synchronization engine that is based on node.js
  • provides a consistent client api that encapsulates combined synchronous operations across multiple big data repository components
  • includes a scripting engine
  • includes advanced sequencing patterns
      • serial, parallel, waterfall, concurrent queues etc.
  • provides for multiple concurrent operations with provision for logical object-level locks

client API
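
The sequencing patterns named on this slide (serial, parallel, waterfall) can be sketched generically. The sketch uses Python's `asyncio` rather than node.js and does not reproduce the oodebe API. The `store` and `index` steps are hypothetical stand-ins for operations on the repository components.

```python
import asyncio


async def serial(tasks):
    """Run zero-argument coroutine functions one after another."""
    results = []
    for task in tasks:
        results.append(await task())
    return results


async def parallel(tasks):
    """Start all tasks together and wait for every result (order preserved)."""
    return await asyncio.gather(*(task() for task in tasks))


async def waterfall(steps, initial):
    """Run steps in order, feeding each result into the next step."""
    value = initial
    for step in steps:
        value = await step(value)
    return value


# Hypothetical repository steps; real ones would call the stores.
async def store(doc):
    return {**doc, "stored": True}


async def index(doc):
    return {**doc, "indexed": True}


async def main():
    # Waterfall: the stored document is passed on to the indexing step.
    print(await waterfall([store, index], {"id": "d1"}))
    # Parallel: two independent operations run concurrently.
    print(await parallel([lambda: store({"id": "d2"}), lambda: index({"id": "d3"})]))


asyncio.run(main())
```

Waterfall fits an add/update that must reach the database before the index. Parallel fits operations on unrelated objects.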

slide-20
SLIDE 20

reach1to1 - 20 / 25

[diagram: synchronization server]

  • inputs: batch operations (documents d1), user operations (search, crud, comment)
  • interfaces: scripts, web services, client API
  • operations: add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder, start batch, batch status, retrieve document, retrieve comment, search query
  • back ends: object oriented database, graph index, full text search index

slide-21
SLIDE 21

reach1to1 - 21 / 25

operation              average time
search query           1.3 secs
retrieve document      0.3 secs
retrieve comment       0.3 secs
add/update document    0.3 secs
delete document        0.25 secs
add/update comment     0.24 secs
delete comment         0.25 secs
add/update folder      not implemented
delete folder          not measured

performance benchmarks

note: timings are averages across a pre-defined set of operations

slide-22
SLIDE 22

reach1to1 - 22 / 25

data size

processing speed

  • hadoop scales to thousands of commodity computers using all cores and spindles simultaneously
  • proven data size scalability – e.g. Facebook has 21 PB of data in a single hadoop cluster
  • solr has built-in capabilities for replication that allow it to scale up for very high query volumes without loss of performance – e.g. solr has production instances of over 200 mn items
  • neo4j enterprise version includes high availability clustering and can traverse up to 1-2 mn hops per second

data complexity

scalability

slide-23
SLIDE 23

reach1to1 - 23 / 25

data size processing speed

  • hbase column families and columns provide a flexible way to manage sparse data structures
  • using object links allows additional objects to be linked to documents
  • neo4j can be used to handle more hierarchical data structures that require traversals
  • solr schema can be extended easily by adding new fields, though re-indexing is required after a change
  • additional index servers can be added to manage new types of queries and synchronized by oodebe synchronization scripts

data complexity

scalability

slide-24
SLIDE 24

reach1to1 - 24 / 25

data size

processing speed

  • node.js allows clusters of worker processes with facility to monitor and automatically manage them
  • batch throughput can be optimized by using concurrent queues and multiple worker processes
  • custom client applications can be developed that manage complex processes faster, and invoked through synchronization scripts
  • solr batch updates and caching can be used to speed up updates and queries respectively

data complexity

scalability
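
The idea of raising batch throughput with a concurrent queue and several workers can be sketched as follows. The slides describe node.js worker processes. This sketch substitutes Python threads and the standard `queue` module, and the `process_batch` function and its parameters are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker that no more work is coming


def process_batch(documents, handler, workers=4):
    """Process documents with a fixed pool of worker threads fed by one queue."""
    todo = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            doc = todo.get()
            if doc is _DONE:
                return
            try:
                outcome = handler(doc)
            except Exception as exc:  # keep the worker alive on a bad document
                outcome = exc
            with lock:
                results.append((doc, outcome))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for doc in documents:
        todo.put(doc)
    for _ in threads:
        todo.put(_DONE)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


# Made-up handler: pretend that doubling a number is the processing step.
results = process_batch(range(10), lambda d: d * 2, workers=3)
print(sorted(outcome for _, outcome in results))
```

Results arrive in completion order rather than submission order, so each outcome is returned with the document it belongs to.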

slide-25
SLIDE 25

reach1to1 - 25 / 25

thank you

info@reach1to1.com +91-98201-94408