CAPTO, January 2010
The problem to solve
- The information published on the internet is no longer manageable; as a consequence, internet searches are imprecise;
- Because of the overwhelming amount of information and the inherent nature of the internet (a polling protocol), manual retrieval is an exhausting activity for a human;
- The relevant information is only a fraction of what is available;
- All these problems, which lead to a loss of information (and hence of power), also apply to the information a company creates itself;
The Goal
To have a way to retrieve information that is:
- On time => When needed
- Precise => Noise reduction
- Fruitful => Structured and harmonized
- Complete => Extracted from any media
The solution
Capto is a complete solution for creating information by acquiring and indexing media from multiple sources.
Characteristics
- Focus on relevant information;
- A single portal to retrieve all the information you need;
- Users can subscribe to ‘information channels’ and are notified when new pertinent information is created;
- A complete information management workflow;
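
The ‘information channels’ idea above can be illustrated with a small publish/subscribe sketch. This is only an assumption about how such notifications might be modelled, not a description of how Capto implements them; the `ChannelHub` class, its methods, and the channel name are hypothetical.

```python
from collections import defaultdict


class ChannelHub:
    """Hypothetical in-memory registry of information channels."""

    def __init__(self):
        # Maps a channel name to the callbacks of its subscribers.
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        """Register a callback to be invoked for each new document on a channel."""
        self._subscribers[channel].append(callback)

    def publish(self, channel, document):
        """Notify every subscriber of the channel; return how many were notified."""
        callbacks = self._subscribers.get(channel, [])
        for callback in callbacks:
            callback(document)
        return len(callbacks)


# Usage: a user subscribes to a channel and receives new documents in an inbox.
inbox = []
hub = ChannelHub()
hub.subscribe("environmental-regulations", inbox.append)
notified = hub.publish("environmental-regulations", "New regional decree")
```

A real deployment would presumably deliver notifications by email or through the portal rather than through in-process callbacks.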
Technical Characteristics
- Enhanced crawling capabilities (authentication, JavaScript processing, Web 2.0);
- Distributed and scalable acquisition from internet sources;
- Enhanced text indexing (stemming, BM25 ranking, probabilistic search, …);
- A highly configurable CMS portal (JSR-168 compatible portlets that can be registered in any legacy CMS);
- Can scale up to millions of indexed documents;
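
To make the BM25 ranking mentioned above concrete, here is a minimal sketch of the standard Okapi BM25 scoring formula. It is not Capto's code: the function name, the parameter defaults (`k1=1.5`, `b=0.75`), the smoothed IDF variant, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Usage: three tiny, made-up documents ranked against a two-term query.
docs = [
    ["energy", "law", "italy"],
    ["stock", "market", "news"],
    ["energy", "market"],
]
scores = bm25_scores(["energy", "law"], docs)
```

The first document matches both query terms and so ranks highest; the second matches neither and scores zero. Stemming, which the slide also lists, would be applied to the tokens before scoring.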
Application domains
- Data Monitoring:
- Finance, stock markets…
- Information monitoring and analysis (document repositories, news, web press, news feeds, blogs, emails, …)
- Brand analysis (brand monitoring, sentiment analysis,…)
- Massive text indexing and retrieval
- …and, by and large, any domain where retrieving and analysing information creates new (and more useful) information;
The architecture
[Architecture diagram: domain-dependent and domain-independent components; sources include the www and external file systems, DBMS, …]
PA case history: Edison
The problem: monitoring Italian laws and regulations on the environmental impact of energy production.
The solution:
- Automatic acquisition from several national,
regional, federal and local web portals;
- A complete validation workflow;
- Information precision:
before (manual acquisition) <50%, after ~100%
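
Assuming "information precision" here means the usual information-retrieval metric (the share of acquired documents that are actually relevant), it could be computed as in the sketch below. The function and the document identifiers are hypothetical and are not data from the Edison project.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant (0.0 if none retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Usage with made-up identifiers: 2 of the 4 acquired documents are relevant.
acquired = ["doc-a", "doc-b", "doc-c", "doc-d"]
relevant = ["doc-a", "doc-b"]
p = precision(acquired, relevant)
```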
Other products on the market
Text indexing and ranking:
- Apache Lucene (http://lucene.apache.org)
- ClusterClick (www.clusterclick.com)
- Amberfish (http://www.etymon.com/tr.html)
- Terrier (http://ir.dcs.gla.ac.uk/terrier/)
Document Management:
- OpenText (www.opentext.com)
- SearchExpress (www.searchexpress.com)
- IndexData (www.indexdata.com)
- AutonomyVirage (www.virage.com)
Internet Information Retrieval:
- HtDig (www.htdig.org)
Conclusions
- Can be used to monitor the acquisition of multimedia from internet sources;
- Can be used to index and retrieve textual information from any archived media;
- Can be used to shorten the time-to-information;
- Can be used to provide more precise information (and to map the information you have);
- Can be easily adopted (low cost of software adoption);
- Domain-agnostic and multi-language.