Building a large scale SaaS app: Open Source, Storage and Scalability

SLIDE 1

Building a large scale SaaS app

Open Source, Storage and Scalability Dan Hanley, CTO http://www.magus.co.uk

14 March, 2008

SLIDE 2

Agenda

Who are Magus? What do we do? Who do we do it for? How do we do it? SOA Scalability Storage F/OSS

SLIDE 3

The Magus proposition

  • Leading provider of innovative web-content engineering solutions to global corporations
  • Specialise in managed applications that help clients build value from their online assets and from the wider web
  • Three main applications: ActiveStandards, RemoteSearch, CrucialInformation
  • Delivering solutions since 1995

SLIDE 4

Our managed applications

Delivering Software-as-a-service (ASP model)

  • ActiveStandards: designed to help companies stay on-brand, on-line by tracking and managing corporate web standards compliance, worldwide
  • RemoteSearch: a multi-site search engine, providing integrated search frameworks for enterprise websites
  • CrucialInformation: a premium current awareness service delivering high-quality, strategic intelligence from the web and syndicated services

SLIDE 5

ActiveStandards

SLIDE 6

RemoteSearch

SLIDE 7

CrucialInformation

SLIDE 8

Social Networking

SLIDE 9

Our clients

SLIDE 10

Technically - where we were

  • 1 product
  • Web design business
  • All home grown
  • No appservers
  • No failover
  • No common infrastructure
  • Scalability worries
  • No version control
  • Unclear methodology

SLIDE 11

Technically – where we are now

  • 3 main applications
  • Bespoke capability
  • Common infrastructure
  • Platform of services
  • Fault tolerant
  • Scalable
  • Defined process & methodology

SLIDE 12

Approach

  • Do a lot with a little – 35 people, punching above our weight

  • Don't reinvent the wheel
  • Extract commonality – keep it DRY

SLIDE 13

The components of the stack

  • Trawl
  • Routing
  • Harvest
  • Index
  • Search
  • Store
  • Quartz
  • ClientEngine
  • Analysis
  • Monitor
  • Profile
  • LinkChecker

SLIDE 14

Logical architecture

REST (not SOAP)

SLIDE 15

Trawl

  • Responsible for managing the gathering of data in its raw form into the Store.
  • Currently have Trawlers for:
      HTTP
      FTP (several flavors)
      RSS, Atom etc
      SMTP
      Google
      Technorati
      Moreover
      FT (several flavors)

SLIDE 16

Trawler service

Pluggable architecture based on JMX MBean service
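The pluggable idea can be illustrated with a small registry sketch. This is not the JMX-based implementation the slide refers to: `TrawlerRegistry`, `register` and `trawl` are hypothetical names, the plug-ins are keyed by URL scheme purely as an assumption, and the stand-in trawlers return canned strings instead of fetching anything.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# A trawler takes a source URL and returns the raw content it gathered.
Trawler = Callable[[str], str]


class TrawlerRegistry:
    """Maps a URL scheme (http, ftp, ...) to the trawler that handles it."""

    def __init__(self) -> None:
        self._trawlers: Dict[str, Trawler] = {}

    def register(self, scheme: str, trawler: Trawler) -> None:
        self._trawlers[scheme.lower()] = trawler

    def trawl(self, url: str) -> str:
        scheme = urlparse(url).scheme.lower()
        if scheme not in self._trawlers:
            raise LookupError(f"no trawler registered for scheme {scheme!r}")
        return self._trawlers[scheme](url)


registry = TrawlerRegistry()
# Stand-in trawlers: a real one would fetch over the network.
registry.register("http", lambda url: f"raw HTTP content of {url}")
registry.register("ftp", lambda url: f"raw FTP content of {url}")
```

Adding support for a new source type then means registering one more callable, without touching the code that dispatches jobs.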

SLIDE 17

Harvest

  • Responsible for extracting explicit data from Links and storing the fielded data in the database, and the non-fielded data in the Store.
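As a rough illustration of the fielded / non-fielded split, the sketch below treats an HTML page's `<title>` as the only fielded value and everything else as non-fielded text. The choice of fields, the `harvest` function and the two plain dicts standing in for the database and the Store are all assumptions made for the example; the slides do not say which fields the real Harvest service extracts.

```python
from html.parser import HTMLParser


class PageHarvester(HTMLParser):
    """Collects the <title> as a fielded value and all other text as non-fielded content."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Route text to the title or the body bucket, skipping pure whitespace.
        target = self.title_parts if self._in_title else self.text_parts
        if data.strip():
            target.append(data.strip())


def harvest(url, html, database, store):
    """Put fielded data in `database` and non-fielded text in `store` (both dicts here)."""
    parser = PageHarvester()
    parser.feed(html)
    parser.close()
    database[url] = {"title": " ".join(parser.title_parts)}
    store[url] = " ".join(parser.text_parts)
```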

SLIDE 18

Harvest service

Pluggable architecture based on JMX MBean service

SLIDE 19

Index

  • Responsible for building, purging, maintaining indices.

SLIDE 20

Search

  • Responsible for searching indices and delivering results.

SLIDE 21

Analysis

  • Responsible for deriving scores for information implicit in the page:
      Sentiment
      Readability
      Language detection etc

SLIDE 22

Monitor

− Badly named, should be called “Classifier”
− Responsible for creating filings between Links and Categories.
− A Link can be a bookmark, news item, blog article etc.
− A Category can be a User's Bookmarks, a News Topic, an AST Guideline etc.
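One way to picture a "filing" is as a (Link, Category) pair produced whenever a category's rule accepts a link. The sketch below assumes exactly that; the `Link` and `Category` fields, the `matches` rule and the `classify` function are hypothetical and are not taken from the Magus code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Link:
    url: str
    text: str


@dataclass(frozen=True)
class Category:
    name: str
    # Hypothetical rule: returns True when the link belongs in this category.
    matches: Callable[[Link], bool]


def classify(links: List[Link], categories: List[Category]) -> List[Tuple[str, str]]:
    """Return one (link URL, category name) filing for every matching pair."""
    return [
        (link.url, category.name)
        for link in links
        for category in categories
        if category.matches(link)
    ]
```

A link can end up in several categories, and a category can hold many links, so the filings form a many-to-many relation.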

SLIDE 23

Classifier (monitor) service

Pluggable architecture based on JMX MBean service

SLIDE 24

LinkChecker

  • Responsible for checking the life of links and removing them correctly from the system when they have expired.
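The slide does not say how expiry is decided. Purely as an assumption, the sketch below treats a link as expired when it has not been confirmed live within a maximum age, and reads "removing correctly" as deleting it from every place that holds it. The function name, the 30-day default and the dict-based stores are all hypothetical.

```python
from datetime import datetime, timedelta


def remove_expired(last_seen, stores, now, max_age=timedelta(days=30)):
    """Delete every link not seen within max_age from last_seen and from each store.

    last_seen: dict of URL -> datetime the link was last confirmed live.
    stores: iterable of dicts keyed by URL (e.g. content store, index).
    Returns the sorted list of removed URLs.
    """
    expired = sorted(url for url, seen in last_seen.items() if now - seen > max_age)
    for url in expired:
        del last_seen[url]
        for store in stores:
            # pop with a default so a store that never held the link is not an error
            store.pop(url, None)
    return expired
```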

SLIDE 25

Routing

  • Manages the workflow of jobs through the stack
  • Has the capability to dynamically load-balance workloads.
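The slides do not describe the balancing policy. A common one is "least outstanding work", sketched below as an assumption: each job goes to the worker with the fewest unfinished jobs. `LeastLoadedRouter` and the worker names are hypothetical.

```python
class LeastLoadedRouter:
    """Sends each job to the worker with the fewest outstanding jobs."""

    def __init__(self, workers):
        self.outstanding = {name: 0 for name in workers}

    def route(self, job):
        # Ties are broken by worker name so the choice is deterministic.
        worker = min(self.outstanding, key=lambda w: (self.outstanding[w], w))
        self.outstanding[worker] += 1
        return worker

    def complete(self, worker):
        """Record that one job on `worker` has finished."""
        if self.outstanding[worker] == 0:
            raise ValueError(f"{worker} has no outstanding jobs")
        self.outstanding[worker] -= 1
```

Because the counts change as jobs finish, a slow worker naturally receives less new work than a fast one.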

SLIDE 26

Content stores

  • We needed a multi-terabyte (currently 24 TB) distributed, fail-safe filesystem
  • NFS was crumbling under load
  • ZFS was vapourware
  • Lustre was too complex
  • We built our own!
  • Magus Contentstores, responsible for holding both the raw and processed non-fielded content of links which have been trawled and harvested

SLIDE 27

Content stores - configuration

<mbean code="uk.co.magus.store.service.StoreService"
       name="magus.service.store:service=StoreServiceLocalCalls">
  <attribute name="JndiName">magus/services/StoreServiceLocalCalls</attribute>
  <attribute name="Config">
    <TryEachStripeStore>
      <List>
        <MirrorStore>
          <List>
            <RemoteStore>nas:1299;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
            <RemoteStore>m4:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
          </List>
        </MirrorStore>
        <MirrorStore>
          <List>
            <RemoteStore>nas:1199;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
            <RemoteStore>m5:1099;StoreServiceRemoteCallsInvokeTarget</RemoteStore>
          </List>
        </MirrorStore>
      </List>
    </TryEachStripeStore>
  </attribute>
  <depends>jboss:service=Naming</depends>
</mbean>
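The element names in this configuration suggest a composite of stores: two mirrored pairs, combined into stripes. The sketch below shows one plausible reading, in which a mirror writes to every replica and reads from the first one that answers, while the stripe store writes to one stripe and tries each stripe on read. These semantics are inferred from the names only; the hash-based stripe choice, `MemoryStore` and the method names are assumptions, not the Magus implementation.

```python
import zlib


class MemoryStore:
    """Stand-in for a RemoteStore; keeps content in a dict."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.available = True

    def put(self, key, content):
        if not self.available:
            raise OSError(f"{self.name} is down")
        self.data[key] = content

    def get(self, key):
        if not self.available:
            raise OSError(f"{self.name} is down")
        return self.data.get(key)


class MirrorStore:
    """Writes to every replica; reads from the first replica that answers."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, content):
        for replica in self.replicas:
            replica.put(key, content)

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except OSError:
                continue  # fail over to the next replica
        raise OSError("all replicas are down")


class TryEachStripeStore:
    """Writes to one stripe chosen by key hash; reads try each stripe in turn."""

    def __init__(self, stripes):
        self.stripes = stripes

    def put(self, key, content):
        index = zlib.crc32(key.encode("utf-8")) % len(self.stripes)
        self.stripes[index].put(key, content)

    def get(self, key):
        for stripe in self.stripes:
            content = stripe.get(key)
            if content is not None:
                return content
        return None


# Same shape as the configuration: two mirrored pairs, striped together.
nas_a, m4 = MemoryStore("nas:1299"), MemoryStore("m4:1099")
nas_b, m5 = MemoryStore("nas:1199"), MemoryStore("m5:1099")
store = TryEachStripeStore([MirrorStore([nas_a, m4]), MirrorStore([nas_b, m5])])
```

In this reading, striping spreads content across pairs for capacity, and mirroring within each pair keeps reads working when one node of the pair is unavailable.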

SLIDE 29

Store Interfaces

SLIDE 30

Store JMX Beans

SLIDE 31

Contentstore - engines

  • Can use many types of engine on a node
  • Currently supports:
      MySQL
      SleepyCat
      Filesystem
  • These can be decorated to enhance functionality
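"Decorated" here presumably refers to the decorator pattern: a wrapper exposes the same interface as the engine it wraps and adds behaviour around it. The sketch below uses compression as the added behaviour, which is an assumption chosen for illustration; `DictEngine`, `CompressingEngine` and the `put`/`get` interface are hypothetical.

```python
import zlib


class DictEngine:
    """Stand-in storage engine holding bytes in memory."""

    def __init__(self):
        self.blobs = {}

    def put(self, key, data: bytes):
        self.blobs[key] = data

    def get(self, key) -> bytes:
        return self.blobs[key]


class CompressingEngine:
    """Decorator: same put/get interface, but compresses before delegating."""

    def __init__(self, inner):
        self.inner = inner

    def put(self, key, data: bytes):
        self.inner.put(key, zlib.compress(data))

    def get(self, key) -> bytes:
        return zlib.decompress(self.inner.get(key))
```

Because the wrapper and the wrapped engine share an interface, decorators can be stacked and applied to any engine type without changing the callers.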

SLIDE 32

Content Store Classes

SLIDE 33

Quartz

  • Responsible for firing messages on time.
  • The “heartbeat” of the stack.
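The "firing messages on time" role can be sketched with a priority queue ordered by due time. This is a conceptual illustration only and does not use or imitate the Quartz scheduler's API; the `Heartbeat` class, its simulated clock and the message strings are hypothetical.

```python
import heapq
import itertools


class Heartbeat:
    """Fires queued messages once the (simulated) clock reaches their due time."""

    def __init__(self):
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def schedule(self, due, message):
        heapq.heappush(self._queue, (due, next(self._order), message))

    def tick(self, now):
        """Return every message due at or before `now`, earliest first."""
        fired = []
        while self._queue and self._queue[0][0] <= now:
            _, _, message = heapq.heappop(self._queue)
            fired.append(message)
        return fired
```

Each call to `tick` plays the part of one heartbeat: whatever has become due is released to the rest of the stack.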

SLIDE 34

Client Engine

  • Responsible for stack-based processing for Client Applications.
  • Keeps “heavy lifting” out of the Web Tier.
  • Coordinates Client Application requests across multiple stack services.

SLIDE 35

Management Application

  • Manage taxonomy
  • Manage rules
  • Manage scheduling
  • Focus on managing the business
  • Leave service management to JMX or web consoles
  • Swing

SLIDE 36

Management App

SLIDE 37

Management App

SLIDE 38

Management App

SLIDE 39

Management App

SLIDE 40

Profile

  • An internal service used to collect metrics on system-wide performance

SLIDE 42

Infrastructure architecture

SLIDE 43

Methodology

  • Agility – sprints
  • Issue tracking – Jira
  • Regular, scheduled deployments
  • Consolidated build & version control

SLIDE 44

Deployment

[Diagram: numbered deployment workflow. Boxes: Subversion (code repository), Developer, Developer Local Box, Bamboo, Dev Cluster, Test Cluster, Stress Test, Production, Release Note, Product Owner / Dev TL, Granite TL, JBoss ON.]

  • 1. Check out
  • 2. Code / Local Test
  • 3. Check In
  • 4. Auto Check out
  • 5. Build / Unit Tests / Metrics
  • 6 & 12. Notify
  • 7. Publish results
  • 8. Deploy
  • 9 & 11. FIT Tests
  • 10. Deploy Dependencies
  • 13. Prepare Release Note
  • 14. Get Release Note
  • 15. Reject Release
  • 16. Get Application Artifacts
  • 17. Manage Test & Production Environments
  • 18. Deploy Applications
  • 19. Deploy Applications

SLIDE 45

Throughput

  • 11,000 sources in system
  • ~16,000,000 pages rolling store
  • ~200,000 new pages per day
  • Average < 2 minutes from page detection to fully classified and indexed.
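A little arithmetic puts these figures in perspective. The sketch below uses only the approximate numbers on the slide; the retention estimate assumes a steady intake with the oldest pages rolling off, and the in-flight figure applies Little's law (items in the system = arrival rate × time in the system) using the 2-minute average as an upper bound. Both are assumptions of this example, not claims from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

rolling_store_pages = 16_000_000   # approximate, from the slide
new_pages_per_day = 200_000        # approximate, from the slide
max_avg_latency_seconds = 2 * 60   # "< 2 minutes" taken as an upper bound

# Average intake rate across a whole day.
pages_per_second = new_pages_per_day / SECONDS_PER_DAY

# How many days of intake the rolling store holds at that rate.
retention_days = rolling_store_pages / new_pages_per_day

# Little's law: average number of pages being processed at any moment.
pages_in_flight = pages_per_second * max_avg_latency_seconds

print(f"{pages_per_second:.2f} pages/second on average")    # 2.31
print(f"{retention_days:.0f} days of intake in the store")  # 80
print(f"at most about {pages_in_flight:.0f} pages in flight on average")  # 278
```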

SLIDE 46

Cost comparisons

  • Apples and oranges?

Proprietary                                           Licence Free
Product                 Per CPU     CPUs  Total       Product     Per CPU  CPUs  Total
Oracle                  20,000.00   10    $200,000    MySQL       $0.00    10    $0.00
Weblogic AS             10,000.00   38    $380,000    JBoss AS    $0.00    38    $0.00
MS Windows Server       3,919.00    48    $188,112    Redhat/Apa  $0.00    48    $0.00
Visual Team Studio      1,000.00    12    $12,000     Eclipse     $0.00    12    $0.00
ClearCase               4,125.00    1     $4,125      Subversion  $0.00    1     $0.00
Jira                    2,000.00    1     $2,000      Trac        $0.00    1     $0.00
Autonomy IDOL bundl     75,000.00   2     $150,000    Carrot2     $0.00    12    $0.00
IBM Intelligent Datami  132,000.00  1     $132,000    LingPipe    $0.00    12    $0.00
Verity K2               50,000.00   2     $100,000    Lucene      $0.00    8     $0.00
                                                      UIMA        $0.00    12    $0.00
Total                                     $1,068,237                             $0.00
                                          £580,531.26
                                          €849,629.77
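The per-row totals on the proprietary side can be re-derived as per-CPU price × CPU count. The sketch below does that with the figures as transcribed here (product names are kept as they appear, including truncated ones). Every row total matches the slide, but the nine rows sum to $1,168,237, which is $100,000 more than the $1,068,237 printed as the total; the transcript alone does not show which figure was intended.

```python
# (product, list price per CPU in USD, CPU count), as transcribed from the slide
proprietary = [
    ("Oracle", 20_000, 10),
    ("Weblogic AS", 10_000, 38),
    ("MS Windows Server", 3_919, 48),
    ("Visual Team Studio", 1_000, 12),
    ("ClearCase", 4_125, 1),
    ("Jira", 2_000, 1),
    ("Autonomy IDOL bundl", 75_000, 2),
    ("IBM Intelligent Datami", 132_000, 1),
    ("Verity K2", 50_000, 2),
]

# Row total = per-CPU price x number of CPUs.
line_totals = {name: price * cpus for name, price, cpus in proprietary}
grand_total = sum(line_totals.values())

slide_total = 1_068_237  # total as printed on the slide

for name, total in line_totals.items():
    print(f"{name:<24}${total:>10,}")
print(f"{'Sum of rows':<24}${grand_total:>10,}")
print(f"{'Slide total':<24}${slide_total:>10,}")
print(f"{'Difference':<24}${grand_total - slide_total:>10,}")
```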

SLIDE 47

Thank you

Questions?