
SLIDE 1

CAPTO, January 2010

SLIDE 2

The problem to solve

The information published on the internet is no longer manageable; as a consequence, internet searches are imprecise;
Given the overwhelming amount of information and the inherent nature of the internet (a polling protocol), manual retrieval is an exhausting activity for a human;
The relevant information is only a fraction of what is available;
All these problems, which lead to a loss of information (and hence of power), also apply to the information a company creates itself;

SLIDE 3

The Goal

To have a way to retrieve information that is:
On time => When needed
Precise => Noise reduction
Fruitful => Structured and harmonized
Complete => Extracted from any media

SLIDE 4

The solution

Capto is a complete solution for creating information by acquiring and indexing media from multiple sources

SLIDE 5

Characteristics

Focus on relevant information;
A single portal to retrieve all the information you need;
Users can subscribe to ‘information channels’ and are notified when new pertinent information is created;
A complete information management workflow;

SLIDE 6

Technical Characteristics

Enhanced crawling capabilities (authentication, JavaScript processing, Web 2.0);
Distributed and scalable acquisition from internet sources;
Enhanced text indexing (stemming, BM25 ranking, probabilistic search, …);
A highly configurable CMS portal (JSR-168 compatible portlets that can be registered in any legacy CMS);
Can scale up to millions of indexed documents;
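
As an illustration of the BM25 ranking mentioned above, here is a minimal Python sketch of the standard Okapi BM25 formula. It is not Capto's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already tokenized documents (no stemming applied here).
docs = [
    "energy production environmental regulation".split(),
    "stock market news".split(),
    "regional environmental law on energy".split(),
]
scores = bm25_scores("environmental energy".split(), docs)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The second document shares no term with the query and scores zero; the first outranks the third only because it is shorter, which is the length normalization controlled by `b`.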

SLIDE 7

Application domains

Data monitoring:
Finance, stock markets, …
Information monitoring and analysis (document repositories, news, web press, news feeds, blogs, email, …)
Brand analysis (brand monitoring, sentiment analysis, …)
Massive text indexing and retrieval
… and, by and large, any domain where the retrieval and analysis of information creates new (and more useful) information;

SLIDE 8

The architecture

[Architecture diagram. Labels: Domain dependent; Domain independent; www; External File System, DBMS, …]

SLIDE 9

PA case history: Edison

The problem: monitoring Italian laws and regulations on the environmental impact of energy production.

The solution:

Automatic acquisition from several national, regional, federal and local web portals;
A complete validation workflow;
Information precision: <50% before (manual acquisition), ~100% after
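
The slide does not say how precision was measured. Assuming the usual information-retrieval definition (the fraction of retrieved items that are relevant), the metric can be sketched as follows. The function name `precision` and all document IDs are hypothetical and are not data from the Edison case.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant (0.0 if nothing retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)

# Hypothetical document IDs, for illustration only.
relevant = {"law-1", "law-2", "law-3"}
manual = ["law-1", "blog-7", "ad-3", "news-9"]  # noisy manual search
automatic = ["law-1", "law-2", "law-3"]         # validated acquisition

manual_p = precision(manual, relevant)   # 1 relevant out of 4 retrieved
auto_p = precision(automatic, relevant)  # 3 relevant out of 3 retrieved
```

Precision says nothing about relevant documents that were never retrieved; that would be measured by recall, which the slide does not report.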

SLIDE 10

Other products on the market

Text indexing and ranking:

Apache Lucene (http://lucene.apache.org)
ClusterClick (www.clusterclick.com)
Amberfish (http://www.etymon.com/tr.html)
Terrier (http://ir.dcs.gla.ac.uk/terrier/)

Document Management:

OpenText (www.opentext.com)
SearchExpress (www.searchexpress.com)
IndexData (www.indexdata.com)
Autonomy Virage (www.virage.com)

Internet Information Retrieval:

HtDig (www.htdig.org)

SLIDE 11

Conclusions

Can be used to monitor the acquisition of multimedia from internet sources;
Can be used to index and retrieve textual information from any archived media;
Can be used to shorten the time-to-information;
Can be used to provide more precise information (and to map the information you have);
Can be easily adopted (low cost of software adoption);
Domain agnostic and multi-language;