SLIDE 1

CAPTO, January 2010

SLIDE 2

The problem to solve

  • Information published on the internet is no longer manageable; as a consequence, internet searches are no longer precise;
  • Because of the overwhelming amount of information and the inherent nature of the internet (polling protocol), manual internet retrieval can be an exhausting activity;
  • The relevant information is only a fraction of what is available;
  • All these problems, which lead to a loss of information (hence power), also affect the information created by a company itself;

SLIDE 3

The Goal

To have a way to retrieve information that is:

  • On time => when needed
  • Precise => noise reduction
  • Fruitful => structured and harmonized
  • Complete => extracted from any media

SLIDE 4

The solution

Capto is a complete solution for creating information by acquiring and indexing media from multiple sources.

SLIDE 5

Characteristics

  • Focus on relevant information;
  • A single portal to retrieve all the information you need;
  • Users can subscribe to ‘information channels’ and are notified when new pertinent information is created;
  • A complete information management workflow;
SLIDE 6

Technical Characteristics

  • Enhanced crawling capabilities (authentication, JavaScript processing, Web 2.0);
  • Distributed and scalable acquisition from internet sources;
  • Enhanced text indexing (stemming, ranking (BM25), probabilistic search,…);
  • A highly configurable CMS portal (JSR-168 compatible portlets, which can be registered in any legacy CMS);
  • Can scale up to millions of indexed documents;
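
The BM25 ranking mentioned in the list above can be illustrated with a minimal sketch. This is a textbook form of BM25 with a smoothed IDF, not Capto's actual implementation; the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; assumes `docs` is non-empty.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up sample documents, already tokenized by whitespace.
docs = [
    "environmental regulation for energy production".split(),
    "stock market news".split(),
    "regional energy regulation update".split(),
]
scores = bm25_scores("energy regulation".split(), docs)
# Document indices ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

With equal term frequencies, the shorter matching document ranks first because of the length normalization controlled by `b`; a document containing none of the query terms scores zero.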
SLIDE 7

Application domains

  • Data monitoring: finance, stock markets…
  • Information monitoring and analysis (document repositories, news, web press, news feeds, blogs, e-mails,…)
  • Brand analysis (brand monitoring, sentiment analysis,…)
  • Massive text indexing and retrieval
  • …and, by and large, any domain where the retrieval and analysis of information creates new (and more useful) information;

SLIDE 8

The architecture

[Architecture diagram. Labels: Domain dependent; Domain independent; www; External File System, DBMS,…]

SLIDE 9

PA case history: Edison

The problem: monitoring Italian laws and regulations on the environmental impact of energy production.

The solution:

  • Automatic acquisition from several national, regional, federal and local web portals;
  • A complete validation workflow;
  • Information precision: before (manual acquisition) <50%, after ~100%
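
The before/after precision figures can be read against the usual information-retrieval definition of precision (the fraction of retrieved items that are relevant), assuming that is the sense intended here. The sketch below uses made-up document IDs purely to show the calculation; it does not reproduce the Edison data, and the helper name `precision` is hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (0.0 if nothing was retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)

# Hypothetical document IDs, for illustration only (not Edison data).
relevant = {"law-1", "law-2", "law-3", "law-4"}
manual = ["law-1", "law-2", "ad-1", "ad-2", "news-1"]
automatic = ["law-1", "law-2", "law-3", "law-4"]

manual_precision = precision(manual, relevant)        # 2 of 5 retrieved are relevant
automatic_precision = precision(automatic, relevant)  # 4 of 4 retrieved are relevant
```

In this toy case the manual result set has a precision of 0.4 and the automatic one 1.0, which mirrors the kind of gap the slide reports.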

SLIDE 10

Other products on the market

Text indexing and ranking:

  • Apache Lucene (http://lucene.apache.org)
  • ClusterClick (www.clusterclick.com)
  • Amberfish (http://www.etymon.com/tr.html)
  • Terrier (http://ir.dcs.gla.ac.uk/terrier/)

Document Management:

  • OpenText (www.opentext.com)
  • SearchExpress (www.searchexpress.com)
  • IndexData (www.indexdata.com)
  • AutonomyVirage (www.virage.com)

Internet Information Retrieval:

  • HtDig (www.htdig.org)
SLIDE 11

Conclusions

  • Can be used to monitor the acquisition of multimedia from internet sources;
  • Can be used to index and retrieve textual information from any archived media;
  • Can be used to shorten the time-to-information;
  • Can be used to provide more precise information (and to map the information you already have);
  • Can be easily adopted (low cost of software adoption);
  • Domain agnostic and multi-language