reach1to1 - 1 / 25
an introduction
reach1to1 - 2 / 25
what we do
big data solutions for capturing, storing, searching and analyzing structured and unstructured data from multiple sources
reach1to1 - 3 / 25
big data technology benefits
distributed computing
- cluster of low cost commodity servers
- capable of handling unlimited growth in data size
- distributed parallel processing models
- no loss in performance with increasing data size
- no licensing costs - primarily open source
reach1to1 - 4 / 25
big data technologies
- open source technologies that are developed and used by companies like Google, Facebook, Twitter and LinkedIn
reach1to1 - 5 / 25
case studies
- patent document repository
  - a large international chemical manufacturer requires a high performance document repository capable of handling a large volume of patent documents with advanced search capabilities
- log file analysis
  - a large telecom provider needs to analyze log files generated from automated customer support calls and call center logs without manual data collation
- customer activity analysis
  - a fast growing low-cost airline needs to analyze customer activity to enable promotional fares that increase market share
reach1to1 - 6 / 25
why reach1to1?
- combined experience of over 20 years in NoSQL database technologies
- expertise in entire product development life cycle
- handled range of enterprise applications using NoSQL databases including
  - sales monitoring and analytics
  - customer order tracking
  - accounts receivable tracking
  - customer support tracking
reach1to1 - 7 / 25
patent document repository
- a case study outline
reach1to1 - 8 / 25
data requirements
[diagram: folders (f1, f2, f3), documents (d1), document families (pf1, pf2, pf3), comments (c1, c2, c3)]
- documents are organized into multiple folders that determine access rights and represent logical collections
- documents are also grouped into patent families that determine relationships based on the priority codes assigned to each document
- users review documents and add comments that represent their views on the researched topic
reach1to1 - 9 / 25
functional requirements
[diagram: repository with batch operations (documents) and user operations (search, crud, comment)]
reach1to1 - 10 / 25
batch operations
[diagram: documents (d1) loaded into the repository]
- documents are added or replaced in the repository in batches consisting of up to thousands of documents
- the critical performance metrics for batch operations are throughput and access delay
- batch throughput is the rate of processing of documents
- access delay is the time it takes from the start of the batch till documents are available for user operations
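The two batch metrics can be sketched as simple calculations over batch timestamps. This is a minimal illustration, not the repository's own measurement code: the `BatchRun` record, its field names and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BatchRun:
    """Timestamps (in seconds) for one batch. All fields are hypothetical."""
    started_at: float    # when the batch began
    available_at: float  # when its documents became available for user operations
    finished_at: float   # when the last document was processed
    documents: int       # number of documents in the batch


def throughput(run: BatchRun) -> float:
    """Batch throughput: documents processed per second."""
    return run.documents / (run.finished_at - run.started_at)


def access_delay(run: BatchRun) -> float:
    """Access delay: time from batch start until documents are usable."""
    return run.available_at - run.started_at


# Illustrative values only, not measured results.
run = BatchRun(started_at=0.0, available_at=12.0, finished_at=40.0, documents=2000)
print(throughput(run))    # 50.0
print(access_delay(run))  # 12.0
```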
reach1to1 - 11 / 25
user operations
[diagram: repository with search and add / update comment]
- create / retrieve / update / delete: documents, based on access rights
- comment: crud operations on comments; comments can be private or public
- search
  - facility for advanced full text search features
  - facility for faceted search for drilling down into search results
  - search results need to contain highlights for matching terms
  - search based on concordance
reach1to1 - 12 / 25
architecture
[diagram: client application; repository application server (client API, synchronization); object oriented database (storage & retrieval); advanced full text search (indexing); document families (relationships)]
reach1to1 - 13 / 25
persistence (object oriented database)
- HBase used for persistence
- provides random, real-time read/write access
- capable of hosting very large data
- can use clusters of servers
- multi-value and hierarchical parameters mapped to column families and columns
- links between documents and related objects stored as linked object ids
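One way to picture the mapping of multi-value and hierarchical parameters to column families and columns is the pure-Python sketch below. It does not call any HBase client; the `to_cells` helper, the `family:qualifier` naming scheme and the sample document are assumptions made for illustration, not the schema used in the case study.

```python
def to_cells(doc: dict) -> dict:
    """Flatten a document into {"family:qualifier": value} cells (hypothetical scheme)."""
    cells = {}
    for family, value in doc.items():
        if isinstance(value, dict):
            # hierarchical parameter: one column per nested key
            for qualifier, v in value.items():
                cells[f"{family}:{qualifier}"] = str(v)
        elif isinstance(value, list):
            # multi-value parameter: one column per position
            for i, v in enumerate(value):
                cells[f"{family}:{i}"] = str(v)
        else:
            # single value: one column in its own family
            cells[f"{family}:value"] = str(value)
    return cells


# Made-up sample document; "links" holds linked object ids.
doc = {
    "title": "sample patent",
    "inventors": ["inventor a", "inventor b"],
    "applicant": {"name": "example corp", "country": "IN"},
    "links": ["comment:c1", "comment:c2"],
}
cells = to_cells(doc)
```

Because each row only stores the cells it actually has, a document with no comments or a single inventor simply has fewer columns, which is what makes this layout suitable for sparse data.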
reach1to1 - 14 / 25
data model
[diagram: folders (f1, f2, f3), documents (d1), patents (p1, p2, p3), comments (c1, c2, c3)]
reach1to1 - 15 / 25
indexing (advanced full text search)
- Solr provides powerful full-text search, hit highlighting, faceted search and dynamic clustering
- highly scalable, with distributed search and index replication
- documents, comments and patents are indexed in a 1+n+m denormalized index structure
- field collapsing is used to group multiple search results
- pivoted faceting is used to keep facet results accurate despite the duplicate entries
reach1to1 - 16 / 25
indexing model
1 folder, 1 document, 2 patents, 3 comments => 6 index entries (1+n+m, with n = 2 and m = 3)
- 1 entry: folder+document properties
- n entries: folder+document + patent properties
- m entries: folder+document + comment properties
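The 1+n+m denormalization can be sketched as follows. This is an illustration under assumptions: the `index_entries` helper, the `entry_type` field and the key prefixes are hypothetical, and no Solr API is involved.

```python
def index_entries(folder: dict, document: dict, patents: list, comments: list) -> list:
    """Build 1 + n + m flat index entries for one document (hypothetical field names)."""
    # folder and document properties are repeated on every entry
    base = {f"folder_{k}": v for k, v in folder.items()}
    base.update({f"doc_{k}": v for k, v in document.items()})

    entries = [dict(base, entry_type="document")]            # 1 entry
    for patent in patents:                                    # n entries
        extra = {f"patent_{k}": v for k, v in patent.items()}
        entries.append(dict(base, entry_type="patent", **extra))
    for comment in comments:                                  # m entries
        extra = {f"comment_{k}": v for k, v in comment.items()}
        entries.append(dict(base, entry_type="comment", **extra))
    return entries


entries = index_entries(
    folder={"name": "f1"},
    document={"id": "d1"},
    patents=[{"id": "p1"}, {"id": "p2"}],
    comments=[{"id": "c1"}, {"id": "c2"}, {"id": "c3"}],
)
print(len(entries))  # 6
```

Since every entry repeats the document's properties, a search that matches those properties returns the same document several times, which is why grouping of results and duplicate-aware facet counts are needed on the query side.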
reach1to1 - 17 / 25
relationships (graph traversal)
- Neo4j for mapping a graph of documents based on their tags
- a high performance graph database with transaction support
- documents, tags and families are created as vertices
- edges between document and tag vertices
- family is a fully connected sub-graph
reach1to1 - 18 / 25
grouping into families
[diagram: document vertices linked to tags and grouped into families]
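The grouping can be sketched in plain Python without a graph database. The slides do not spell out the grouping rule, so this sketch assumes that documents reachable from one another through shared tags belong to the same family; the function names and sample data are hypothetical, and no Neo4j API is used.

```python
from collections import defaultdict
from itertools import combinations


def group_into_families(doc_tags: dict) -> list:
    """Group documents that are connected through shared tags (assumed rule)."""
    docs_by_tag = defaultdict(set)
    for doc, tags in doc_tags.items():
        for tag in tags:
            docs_by_tag[tag].add(doc)

    families, seen = [], set()
    for start in doc_tags:
        if start in seen:
            continue
        family, stack = set(), [start]
        while stack:  # traverse document -> tag -> document edges
            doc = stack.pop()
            if doc in family:
                continue
            family.add(doc)
            for tag in doc_tags[doc]:
                stack.extend(docs_by_tag[tag] - family)
        seen |= family
        families.append(family)
    return families


def family_edges(family: set) -> list:
    """Edges that make a family a fully connected sub-graph."""
    return list(combinations(sorted(family), 2))


doc_tags = {"d1": {"t1"}, "d2": {"t1", "t2"}, "d3": {"t2"}, "d4": {"t3"}}
families = group_into_families(doc_tags)
```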
reach1to1 - 19 / 25
client API (repository application server)
- oodebe is a synchronization engine that is based on node.js
- provides a consistent client api that encapsulates combined synchronous operations across multiple big data repository components
- includes a scripting engine
- includes advanced sequencing patterns: serial, parallel, waterfall, concurrent queues etc.
- provides for multiple concurrent operations with provision for logical object-level locks
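The serial and parallel sequencing patterns and the object-level locks can be illustrated with a small asyncio sketch. This is not the oodebe API (which is node.js based); `run_step`, `add_document` and `run_batch` are hypothetical stand-ins, and the step names "store", "index" and "link" are assumed for illustration.

```python
import asyncio
from collections import defaultdict


async def run_step(name: str, doc_id: str, log: list) -> None:
    """Hypothetical stand-in for a call to one repository component."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    log.append((name, doc_id))


async def add_document(doc_id: str, locks: dict, log: list) -> None:
    # logical object-level lock: one operation at a time per document id
    async with locks[doc_id]:
        # serial: the document is stored before anything else happens
        await run_step("store", doc_id, log)
        # parallel: search index and graph are updated together
        await asyncio.gather(
            run_step("index", doc_id, log),
            run_step("link", doc_id, log),
        )


async def run_batch(doc_ids: list) -> list:
    locks = defaultdict(asyncio.Lock)
    log = []
    # concurrent operations across documents; same-id operations queue on the lock
    await asyncio.gather(*(add_document(d, locks, log) for d in doc_ids))
    return log


log = asyncio.run(run_batch(["d1", "d2", "d1"]))
```

Operations on different documents interleave freely, while the two operations on d1 run one after the other because they share a lock.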
reach1to1 - 20 / 25
synchronization server
[diagram: batch operations and user operations reach the synchronization server through scripts, web services and the client API; the server works with the object oriented database, graph index and full text search index]
- operations: add/update document, delete document, add/update comment, delete comment, add/update folder, delete folder, start batch, batch status, retrieve document, retrieve comment, search query
reach1to1 - 21 / 25
performance benchmarks
operation             average time
search query          1.3 secs
retrieve document     0.3 secs
retrieve comment      0.3 secs
add/update document   0.3 secs
delete document       0.25 secs
add/update comment    0.24 secs
delete comment        0.25 secs
add/update folder     not implemented
delete folder         not measured
note: timings are averages across a pre-defined set of operations
reach1to1 - 22 / 25
scalability: data size
- hadoop scales to thousands of commodity computers, using all cores and spindles simultaneously
- proven data size scalability, e.g. Facebook has 21 PB of data in a single hadoop cluster
- solr has built-in capabilities for replication that allow it to scale up for very high query volumes without loss of performance, e.g. solr has production instances of over 200 mn items
- neo4j enterprise version includes high availability clustering and can traverse up to 1-2 mn hops per second
reach1to1 - 23 / 25
scalability: data complexity
- hbase column families and columns provide a flexible way to manage sparse data structures
- using object links allows additional objects to be linked to documents
- neo4j can be used to handle more hierarchical data structures that require traversals
- solr schema can be extended easily for adding new fields, though re-indexing is required after a change
- additional index servers can be added to manage new types of queries and synchronized by oodebe synchronization scripts
reach1to1 - 24 / 25
scalability: processing speed
- node.js allows clusters of worker processes with facility to monitor and automatically manage them
- batch throughput can be optimized by using concurrent queues and multiple worker processes
- custom client applications can be developed that manage complex processes faster, and invoked through synchronization scripts
- solr batch updates and caching can be used to speed up updates and queries respectively
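The idea of spreading a batch across several workers can be sketched with a small worker pool. This is only an analogy under stated assumptions: the deck describes node.js worker processes, whereas this sketch uses Python threads as a stand-in, and `process_document` and `process_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(doc_id: str) -> str:
    """Hypothetical per-document work, e.g. store and index one document."""
    return f"{doc_id}:done"


def process_batch(doc_ids: list, workers: int = 4) -> list:
    """Process a batch with a fixed pool of workers; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_document, doc_ids))


results = process_batch(["d1", "d2", "d3"], workers=2)
```

The number of workers is the tuning knob here: more workers raise batch throughput until the underlying storage and index servers become the bottleneck.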
reach1to1 - 25 / 25
thank you
info@reach1to1.com +91-98201-94408