FireCite: Lightweight real-time reference string extraction from web pages
Ching Hoi Andy Hong Jesse Prabawa Gozali Min-Yen Kan School of Computing National University of Singapore
Outline: Introduction, Reference String Recognition, Reference String Parsing, Firefox Extension, Conclusion
Are there reference strings? Where are the reference strings?
[Figure: pipeline of repeated Selection, Parsing, and Verification stages. Labels: text segments, parsed text segments, excluded text segments.]
System                                            Precision   Recall   F1-measure
FireCite (All 40 pages)                           0.575       0.960    0.719
FireCite (Only 20 pages with reference strings)   0.655       0.960    0.779

Test set: 40 staff homepages from 4 universities
Reference strings found: 364/379 (96.0%)
False positives: 269 (42.5%)
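
The all-pages figures can be checked against the reported counts. The sketch below assumes the standard definitions of precision, recall, and F1, and assumes that the 15 reference strings not found (379 minus 364) count as false negatives; the function name is illustrative and does not come from the slides.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) computed from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported for all 40 pages:
# 364 of 379 reference strings found, 269 false positives.
precision, recall, f1 = precision_recall_f1(364, 269, 379 - 364)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.575 recall=0.960 f1=0.719
```

Under this reading, the 42.5% false-positive figure is 269 out of the 633 extracted segments (364 + 269), which is the complement of the 0.575 precision.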
Jesse Prabawa Gozali and Min-Yen Kan (2007) A Rich OPAC User Interface with AJAX, In Proceedings of the Joint Conference on Digital Libraries (JCDL '07). Vancouver, Canada, June, pp. 329-330. Short paper.
Purpose
Only identify authors, title, date
Tokenising
Atlas , L . , and S . Shamma , “ Joint Acoustic and Modulation Frequency , ” EURASIP JASP , 2003 .
Advantages
− Reduce number of computations
− Allow information-richer learning features
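
The slides do not state the tokenising rule, so the following is only a sketch that reproduces the spacing of the example above. It assumes a token is either a run of word characters or a single punctuation mark; the actual token granularity used by FireCite may differ, and `TOKEN_PATTERN` and `tokenise` are hypothetical names.

```python
import re

# Assumed rule: a token is a run of word characters or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise(reference_string):
    """Split a reference string into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(reference_string)


reference = (
    "Atlas, L., and S. Shamma, \u201cJoint Acoustic and Modulation "
    "Frequency,\u201d EURASIP JASP, 2003."
)
# Joining the tokens with spaces gives the same layout as the slide example.
print(" ".join(tokenise(reference)))
```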
Labelling
Repairs
Token-level F-measure
Page (No. of …)   Title   Authors   Date    All Tokens
A (72)            0.902   0.893     0.988   0.708
B (52)            0.953   0.957     0.990   0.960
C (29)            0.684   0.304     0.774   0.651
D (68)            0.753   0.968     0.889   0.917
E (8)             0.692   0.875     1.000   0.889
F (45)            0.847   1.000     0.989   0.966
Overall (274)     0.836   0.916     0.948   0.878
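
For readers unfamiliar with the metric, here is a minimal sketch of how a token-level F-measure for one field could be computed. It assumes each token carries one gold label and one predicted label; the label names and the toy data are hypothetical and are not taken from the evaluation.

```python
def token_level_f1(gold_labels, predicted_labels, field):
    """F-measure for one field, counting each token as one decision."""
    pairs = list(zip(gold_labels, predicted_labels))
    tp = sum(1 for g, p in pairs if g == field and p == field)
    fp = sum(1 for g, p in pairs if g != field and p == field)
    fn = sum(1 for g, p in pairs if g == field and p != field)
    if tp == 0:
        return 0.0  # no correct tokens for this field
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels (not data from the slides).
gold = ["author", "author", "title", "title", "title", "date"]
pred = ["author", "title", "title", "title", "other", "date"]
for field in ("title", "author", "date"):
    print(field, round(token_level_f1(gold, pred, field), 3))
```

A field-level measure would instead score a whole field as right or wrong, which is why token-level and field-level figures are not directly comparable.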
Field-level F-measure
System Name   Title   Authors   Date   Overall
FireCite      0.92    0.96      0.97   0.94
ParsCit       0.96    0.99      0.97   0.94
FLUX-CiM      0.93    0.95      0.98   0.97
Parser     Classifier Type             Size of classifier model (KB)   Size of dictionary (KB)
FireCite   Decision Tree               12.6                            0.0
FLUX-CiM   Knowledge-Based             >786 (estimated)                0.0
ParsCit    Conditional Random Fields   7339                            1722
Time taken (milliseconds)
Page Type                         Minimum   Maximum   Average
Pages with reference strings      90        544       192
Pages without reference strings   6         222       74
All pages                         6         544       133
Results
Future work