Improving Temporal Language Models for Determining Time of Non-Timestamped Documents - PowerPoint PPT Presentation



SLIDE 1

Improving Temporal Language Models for Determining Time of Non-Timestamped Documents

Nattiya Kanhabua and Kjetil Nørvåg

  • Dept. of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway

ECDL 2008 Conference, Århus, Denmark

SLIDE 2

ECDL 2008, Norwegian University of Science and Technology

Agenda

  • Motivation and Challenge
  • Preliminaries
  • Our Approaches
  • Evaluation
  • Conclusion

SLIDE 3

Motivation

Research Question

"How to improve search results in long-term archives of digital documents?"

Answer

Extend keyword search with temporal information: temporal text-containment search [Nørvåg'04]

Temporal Information

  • Timestamp, e.g. the created or updated date.
  • In local archives, the timestamp can be found in document metadata, which is trustworthy.
  • Q: Is a document timestamp in a WWW archive also trustworthy?
  • A: Not always; some problems:
    1. A lack of metadata preservation
    2. A time gap between crawling and indexing
    3. Relocation of web documents

SLIDE 4

Challenge

"I found a bible-like document, but I have no idea when it was created."

"You should ask the Guru!"

"Let me see… This document probably originated in 850 A.D., with 95% confidence."

"For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"

SLIDE 5

Preliminaries

“A model for dating documents”

  • Temporal Language Models, presented in [de Jong et al. '04].
  • Based on the statistics of word usage over time.
  • Compare a non-timestamped document with a reference corpus.
  • The reference time partition with the most overlapping term usage gives the tentative timestamp.

Word        Partition
earthquake  2004
Thailand    2004
tsunami     2004
tidal wave  1999
Japan       1999
tsunami     1999

Temporal Language Models

A non-timestamped document: "tsunami Thailand"

Matching it against the partitions word by word gives the partition scores "1999": 1 and "2004": 1 + 1 = 2, so 2004 is the most likely timestamp.
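The overlap counting in this example can be sketched in a few lines (an illustrative toy, not the full statistical model; the word lists mirror the table on this slide):

```python
# Toy temporal language model: for each time partition, the words observed in it.
partitions = {
    "1999": ["tidal wave", "Japan", "tsunami"],
    "2004": ["earthquake", "Thailand", "tsunami"],
}

def score_partitions(doc_words, partitions):
    """Count, per partition, how many document words occur in that partition."""
    return {
        label: sum(1 for w in doc_words if w in words)
        for label, words in partitions.items()
    }

doc = ["tsunami", "Thailand"]          # the non-timestamped document
scores = score_partitions(doc, partitions)
best = max(scores, key=scores.get)     # partition with the highest overlap
```

Here `best` is "2004", matching the slide's walk-through.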

SLIDE 6

Proposed Approaches

Three ways of improving temporal language models:

1) Data preprocessing
2) Word interpolation
3) Similarity score

SLIDE 7

Data Preprocessing

A direct comparison between the words extracted from a document and the temporal language models limits accuracy. Semantic-based preprocessing techniques:

  • Word filtering: only the top-N ranked words according to TF-IDF scores are selected as index terms.
  • Part-of-speech tagging: the most interesting classes of words are selected, e.g. nouns, verbs, and adjectives.
  • Collocation extraction: co-occurrence of different words can alter the meaning, e.g. "United States".
  • Word sense disambiguation: identifying the correct sense of a word by analyzing its context in a sentence, e.g. "bank".
  • Concept extraction: comparing two language models on the concept level avoids the low-frequency word problem.
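The word-filtering step can be sketched with plain TF-IDF scoring (a hedged sketch; the paper's actual tokenization, weighting variant, and cutoff N are not specified here and are assumptions):

```python
import math
from collections import Counter

def top_n_tfidf(doc_tokens, corpus, n):
    """Keep only the top-n document terms by TF-IDF as index terms."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)

    def tfidf(term):
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log(num_docs / (1 + df))            # smoothed idf (assumed variant)
        return tf[term] * idf

    return sorted(tf, key=tfidf, reverse=True)[:n]

# Tiny reference collection; every document is a set of its terms.
corpus = [{"tsunami", "Thailand", "the"},
          {"football", "cup", "the"},
          {"tsunami", "Japan", "the"}]
terms = top_n_tfidf(["tsunami", "tsunami", "Thailand", "the", "the", "the"], corpus, 2)
```

Frequent-everywhere terms like "the" are pushed below the cutoff, while distinctive terms survive as index terms.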

SLIDE 8

Word Interpolation

When a word has a zero probability for a time partition because of the limited size of the corpus collection, it could still have a non-zero frequency in that period in documents outside the corpus.

"A word is categorized into one of two classes depending on its characteristics over time: recurring or non-recurring."

  • Recurring: related to periodic events, e.g. "Summer Olympic", "World Cup", "French Open".
  • Non-recurring: words that are not recurring, e.g. "Terrorism", "Tsunami".

Recurring words are identified by looking at the overlap of word distributions at the (flexible) endpoints of possible periods: every year or every 4 years.

SLIDE 9

Word Interpolation (cont’)

(Figure: yearly frequency of "Terrorism", (a) before and (b) after interpolating the non-recurring gaps NR1, NR2, NR3.)

"How to interpolate words depends on which category a word belongs to: recurring or non-recurring."

(Figure: frequency of "Olympic games" in 1996, 2000, 2004, and 2008, (a) before and (b) after interpolating.)
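One way to realize interpolation for a non-recurring word (an illustrative sketch only; the paper's exact adjustment rules are not reproduced here) is to fill zero-frequency gaps linearly between the nearest non-zero years:

```python
def interpolate_non_recurring(freqs):
    """Fill zero-frequency gaps of a non-recurring word by linear
    interpolation between the nearest non-zero neighbours (sketch)."""
    out = list(freqs)
    nonzero = [i for i, f in enumerate(out) if f > 0]
    for a, b in zip(nonzero, nonzero[1:]):
        for i in range(a + 1, b):          # positions strictly inside the gap
            t = (i - a) / (b - a)
            out[i] = (1 - t) * out[a] + t * out[b]
    return out

# Yearly frequencies of a word, with a zero gap in the middle year.
series = [100, 0, 300]
filled = interpolate_non_recurring(series)
```

The gap year receives the midpoint value 200.0; recurring words would instead be handled at their period endpoints, as described above.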

SLIDE 10

Similarity Score

"A term weighting scheme that concerns temporality: temporal entropy, based on the term selection method presented in [Lochbaum, Streeter '89]."

  • The higher temporal entropy a term has, the better it represents a partition.
  • A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.
  • Tells how well a term separates a partition from the others.
  • Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
  • A measure of the temporal information that a word conveys.

Temporal Entropy

TE(wi) = 1 + (1 / log NP) · Σj P(pj|wi) · log P(pj|wi)

where P(pj|wi) is the probability that a partition pj contains the term wi, and NP is the total number of partitions in the corpus.
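Under this definition, temporal entropy can be computed directly from per-partition term frequencies (a sketch; the natural-log base and the handling of zero counts are assumptions):

```python
import math

def temporal_entropy(tf_per_partition, num_partitions):
    """TE(w) = 1 + (1/log Np) * sum_j P(pj|w) * log P(pj|w),
    where P(pj|w) = tf(w, pj) / sum_k tf(w, pk)."""
    total = sum(tf_per_partition)
    probs = [tf / total for tf in tf_per_partition if tf > 0]  # skip zero counts
    return 1 + sum(p * math.log(p) for p in probs) / math.log(num_partitions)

# A term concentrated in one partition vs. one spread evenly over all four.
concentrated = temporal_entropy([9, 0, 0, 0], 4)
spread = temporal_entropy([3, 3, 3, 3], 4)
```

A term confined to a single partition gets the maximum TE of 1.0; a term spread evenly over all partitions gets 0.0, matching the intuition on this slide.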

SLIDE 11

Similarity Score (cont’)

"By analyzing search statistics [Google Zeitgeist], we can increase the probability for a particular time partition."

  • P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query.
  • f(R) converts a rank into a weight; a higher-ranked query is more important.
  • The GZ score is combined linearly with the original similarity score [de Jong et al. '04].

An inverse partition frequency, ipf = log N/n
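The inverse partition frequency is the partition-level analogue of idf. A minimal sketch, assuming N is the number of partitions and n the number of partitions containing the term:

```python
import math

def ipf(term, partitions):
    """Inverse partition frequency: log(N/n), where N is the number of
    partitions and n the number of partitions containing the term."""
    n = sum(1 for words in partitions.values() if term in words)
    return math.log(len(partitions) / n)

partitions = {
    "1999": {"tidal wave", "Japan", "tsunami"},
    "2004": {"earthquake", "Thailand", "tsunami"},
}
rare = ipf("Thailand", partitions)   # appears in 1 of 2 partitions
common = ipf("tsunami", partitions)  # appears in every partition
```

A term present in every partition gets ipf = 0 and carries no temporal evidence, while a term confined to one partition gets the maximal weight.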

SLIDE 12

Experimental Setting

A reference corpus

  • Documents with known dates.
  • Collected from the Internet Archive.
  • News history web pages, e.g. ABC News, CNN, New York Post, etc.

Word        Probability  Partition
earthquake  0.080        2004
Thailand    0.012        2004
tsunami     0.091        2004
tidal wave  0.009        1999
Japan       0.003        1999
tsunami     0.015        1999

Temporal Language Models

  • A list of words and their probabilities in each time partition.
  • Intended to capture word usage within a certain time period.

SLIDE 13

Experiments

Constraints of a training set:

  • 1. Cover the domain of a document to be dated.
  • 2. Cover the time period of a document to be dated.

A reference corpus (15 sources):

  • Training set: 10 news sources selected from various domains.
  • Testing set: 1000 documents randomly selected from 5 other sources (different from the training sources).

Precision = the fraction of documents correctly dated.
Recall = the fraction of correctly dated documents processed.

SLIDE 14

Experiment (cont’)

Experiment | Evaluation Aspects | Description
A | Semantic-based preprocessing | Various combinations of semantics: 1) POS - WSD - CON - FILT; 2) POS - COLL - WSD - FILT; 3) POS - COLL - WSD - CON - FILT
B | Temporal Entropy, Google Zeitgeist | Combinations of TE and GZ, with or without semantic-based preprocessing.
C | Dating task and confidence | Similar to other classification tasks, the system should be able to tell how much confidence it has in assigning a timestamp. Confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
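The confidence measure of aspect C can be sketched as the score gap between the two top-ranked partitions (assuming the partition scores are already normalized; the normalization itself is not shown):

```python
def confidence(partition_scores):
    """Distance between the scores of the 1st- and 2nd-ranked partitions."""
    top_two = sorted(partition_scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

c = confidence({"1999": 0.2, "2004": 0.7, "2005": 0.1})
```

A large gap (here 0.5) means the best partition clearly dominates; near-tied partitions yield a confidence near zero.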

SLIDE 15

(Figure: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month; panel (a) compares the Baseline with A.1, A.2, A.3; panel (b) compares the Baseline with TE, GZ, S-TE, S-GZ.)

Results

Semantic-based preprocessing:

  • Increases precision at almost all granularities except 1-week.
  • At a small granularity, it is hard to gain high accuracy.

Temporal Entropy, Google Zeitgeist:

  • By applying semantic-based preprocessing first, TE and GZ obtain a high improvement.
  • Semantic-based preprocessing generates collocations and concepts, which are weighted highly by TE and GZ (most search statistics are noun phrases).

SLIDE 16

Results (cont’)

(Figure: precision and recall (%) as a function of confidence level, from 0.0 to 1.0.)

Confidence levels and document dating accuracy

The higher the confidence, the more reliable the results.

SLIDE 17

Conclusion

  • Our approaches considerably increase quality compared to the baseline based on the previous approach.
  • Applications that require high precision can select only documents whose timestamp has been determined with high confidence.
  • Future research: apply other classification algorithms to document dating; introduce a weighting scheme so that only significant words are interpolated.

SLIDE 18

Questions

Questions are welcome ☺

SLIDE 19

Related Work

Only a small amount of work exists on determining the time of documents. It can be divided into two categories: determining the time of creation of a document or its contents, and determining the time of the topic of the contents.

Two techniques are employed: learning-based and non-learning.

Learning-based:

  • Learns from a set of training documents.
  • [de Jong et al. '05] is based on a statistical language model.
  • Gives the most likely time of origin, which is similar to the written time of a document.

Non-learning:

  • Does not require a corpus collection.
  • [Swan, Allan '99] and [Swan, Jensen '00] use a statistical method called hypothesis testing.
  • [Mani, Wilson '00] and [Llidó et al. '01] require explicitly time-tagged documents, which are resolved into a concrete or absolute date.
  • Gives a summary of the time of events that appear in the document content.

SLIDE 20

Temporal Language Models

Given a collection of corpus documents C = {d1, d2, …, dn}, a document model is defined as di = {{w1, w2, …, wn}, (ti, ti+1)}, where ti < ti+1 and ti < Time(di) < ti+1.

Similarity between two language models: "a normalized log-likelihood ratio [Kraaij '05]":

Score(di, pj) = Σ_{w ∈ di} P(w|di) · log( P(w|pj) / P(w|C) )

where P(w|di) is the probability of word w in document di, P(w|pj) is the probability of w in time partition pj, and P(w|C) is the probability of w in the corpus collection C.
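The scoring function can be sketched directly (a minimal version; the smoothing of zero partition probabilities against the corpus model, the mixing weight `lam`, and the probability values below are all illustrative assumptions, not the paper's settings):

```python
import math

def nllr_score(doc_probs, part_probs, corpus_probs, lam=0.9):
    """Score(di, pj) = sum_w P(w|di) * log(P(w|pj) / P(w|C)),
    with the partition model smoothed against the corpus model."""
    score = 0.0
    for w, p_doc in doc_probs.items():
        # Linear smoothing so unseen partition words keep a small probability.
        p_part = lam * part_probs.get(w, 0.0) + (1 - lam) * corpus_probs[w]
        score += p_doc * math.log(p_part / corpus_probs[w])
    return score

doc = {"tsunami": 0.5, "Thailand": 0.5}                       # P(w|di)
p2004 = {"tsunami": 0.091, "Thailand": 0.012, "earthquake": 0.080}
p1999 = {"tsunami": 0.015, "Japan": 0.003, "tidal wave": 0.009}
corpus = {"tsunami": 0.05, "Thailand": 0.006, "Japan": 0.002,
          "tidal wave": 0.005, "earthquake": 0.04}            # P(w|C)
better = nllr_score(doc, p2004, corpus) > nllr_score(doc, p1999, corpus)
```

The 2004 partition scores higher for this document, so it would be chosen as the tentative timestamp.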