Improving Temporal Language Models for Determining Time of Non-Timestamped Documents - PowerPoint PPT Presentation



SLIDE 1

Improving Temporal Language Models for Determining Time of Non-Timestamped Documents

Nattiya Kanhabua and Kjetil Nørvåg

  • Dept. of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway

ECDL 2008 Conference, Århus, Denmark

SLIDE 2

ECDL 2008, Norwegian University of Science and Technology

Agenda

  • Motivation and Challenge
  • Preliminaries
  • Our Approaches
  • Evaluation
  • Conclusion

SLIDE 3

Motivation

Research Question

"How to improve search results in long-term archives of digital documents?"

Answer

Extend keyword search with temporal information: temporal text-containment search [Nørvåg'04]

Temporal Information

  • Timestamp, e.g. the created or updated date.
  • In local archives, the timestamp can be found in document metadata, which is trustworthy.
  • Q: Is a document timestamp in a WWW archive also trustworthy?
  • A: Not always; some problems:
    1. A lack of metadata preservation
    2. A time gap between crawling and indexing
    3. Relocation of web documents

SLIDE 4

Challenge

"I found a bible-like document, but I have no idea when it was created."

"You should ask the Guru!"

"Let me see… This document probably originated in 850 A.D., with 95% confidence."

"For a given document with uncertain timestamp, can the contents be used to determine the timestamp with a sufficiently high confidence?"

SLIDE 5

Preliminaries

“A model for dating documents”

  • Temporal Language Models, presented in [de Jong et al. '04].
  • Based on the statistics of word usage over time.
  • Compare a non-timestamped document with a reference corpus.
  • The reference time partition with the most overlapping term usage gives the tentative timestamp.

Word        Partition
earthquake  2004
Thailand    2004
tsunami     2004
tidal wave  1999
Japan       1999
tsunami     1999

Temporal Language Models

A non-timestamped document: "tsunami Thailand"

Matching it against the partitions word by word gives the partition scores "1999": 1 and "2004": 1 + 1 = 2, so 2004 is the most likely timestamp.
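The overlap counting in this example can be sketched in a few lines (an illustrative toy, not the full statistical model; the word lists mirror the table on this slide):

```python
# Toy temporal language model: for each time partition, the words observed in it.
partitions = {
    "1999": ["tidal wave", "Japan", "tsunami"],
    "2004": ["earthquake", "Thailand", "tsunami"],
}

def score_partitions(doc_words, partitions):
    """Count, per partition, how many document words occur in that partition."""
    return {
        label: sum(1 for w in doc_words if w in words)
        for label, words in partitions.items()
    }

doc = ["tsunami", "Thailand"]          # the non-timestamped document
scores = score_partitions(doc, partitions)
best = max(scores, key=scores.get)     # partition with the highest overlap
```

Here `best` is "2004", matching the slide's walk-through.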

SLIDE 6

Proposed Approaches

Three ways of improving temporal language models:

1) Data preprocessing
2) Word interpolation
3) Similarity score

SLIDE 7

Data Preprocessing

A direct comparison between the words extracted from a document and the temporal language models limits accuracy. Semantic-based preprocessing techniques:

  • Word filtering: only the top-N ranked words according to TF-IDF scores are selected as index terms.
  • Part-of-speech tagging: the most interesting classes of words are selected, e.g. nouns, verbs, and adjectives.
  • Collocation extraction: co-occurrence of different words can alter the meaning, e.g. "United States".
  • Word sense disambiguation: identifying the correct sense of a word by analyzing its context in a sentence, e.g. "bank".
  • Concept extraction: comparing two language models on the concept level avoids the low-frequency word problem.
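The word-filtering step can be sketched with plain TF-IDF scoring (a hedged sketch; the paper's actual tokenization, weighting variant, and cutoff N are not specified here and are assumptions):

```python
import math
from collections import Counter

def top_n_tfidf(doc_tokens, corpus, n):
    """Keep only the top-n document terms by TF-IDF as index terms."""
    tf = Counter(doc_tokens)
    num_docs = len(corpus)

    def tfidf(term):
        df = sum(1 for d in corpus if term in d)       # document frequency
        idf = math.log(num_docs / (1 + df))            # smoothed idf (assumed variant)
        return tf[term] * idf

    return sorted(tf, key=tfidf, reverse=True)[:n]

# Tiny reference collection; every document is a set of its terms.
corpus = [{"tsunami", "Thailand", "the"},
          {"football", "cup", "the"},
          {"tsunami", "Japan", "the"}]
terms = top_n_tfidf(["tsunami", "tsunami", "Thailand", "the", "the", "the"], corpus, 2)
```

Frequent-everywhere terms like "the" are pushed below the cutoff, while distinctive terms survive as index terms.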

SLIDE 8

Word Interpolation

When a word has a zero probability for a time partition because of the limited size of the corpus collection, it could still have a non-zero frequency in that period in documents outside the corpus.

"A word is categorized into one of two classes depending on its characteristics over time: recurring or non-recurring."

  • Recurring: related to periodic events, e.g. "Summer Olympic", "World Cup", "French Open".
  • Non-recurring: words that are not recurring, e.g. "Terrorism", "Tsunami".

Recurring words are identified by looking at the overlap of word distributions at the (flexible) endpoints of possible periods: every year or every 4 years.

SLIDE 9

Word Interpolation (cont’)

(Figure: yearly frequency of "Terrorism", (a) before and (b) after interpolating the non-recurring gaps NR1, NR2, NR3.)

"How to interpolate words depends on which category a word belongs to: recurring or non-recurring."

(Figure: frequency of "Olympic games" in 1996, 2000, 2004, and 2008, (a) before and (b) after interpolating.)
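One way to realize interpolation for a non-recurring word (an illustrative sketch only; the paper's exact adjustment rules are not reproduced here) is to fill zero-frequency gaps linearly between the nearest non-zero years:

```python
def interpolate_non_recurring(freqs):
    """Fill zero-frequency gaps of a non-recurring word by linear
    interpolation between the nearest non-zero neighbours (sketch)."""
    out = list(freqs)
    nonzero = [i for i, f in enumerate(out) if f > 0]
    for a, b in zip(nonzero, nonzero[1:]):
        for i in range(a + 1, b):          # positions strictly inside the gap
            t = (i - a) / (b - a)
            out[i] = (1 - t) * out[a] + t * out[b]
    return out

# Yearly frequencies of a word, with a zero gap in the middle year.
series = [100, 0, 300]
filled = interpolate_non_recurring(series)
```

The gap year receives the midpoint value 200.0; recurring words would instead be handled at their period endpoints, as described above.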

SLIDE 10

Similarity Score

"A term weighting scheme that concerns temporality: temporal entropy, based on the term selection method presented in [Lochbaum, Streeter '89]."

  • The higher temporal entropy a term has, the better it represents a partition.
  • A term occurring in few partitions has higher temporal entropy than one appearing in many partitions.
  • Tells how well a term separates a partition from the others.
  • Captures the importance of a term in a document collection, whereas TF-IDF weights a term in a particular document.
  • A measure of the temporal information that a word conveys.

Temporal Entropy

TE(wi) = 1 + (1 / log NP) · Σj P(pj|wi) · log P(pj|wi)

where P(pj|wi) is the probability that a partition pj contains the term wi, and NP is the total number of partitions in the corpus.
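Under this definition, temporal entropy can be computed directly from per-partition term frequencies (a sketch; the natural-log base and the handling of zero counts are assumptions):

```python
import math

def temporal_entropy(tf_per_partition, num_partitions):
    """TE(w) = 1 + (1/log Np) * sum_j P(pj|w) * log P(pj|w),
    where P(pj|w) = tf(w, pj) / sum_k tf(w, pk)."""
    total = sum(tf_per_partition)
    probs = [tf / total for tf in tf_per_partition if tf > 0]  # skip zero counts
    return 1 + sum(p * math.log(p) for p in probs) / math.log(num_partitions)

# A term concentrated in one partition vs. one spread evenly over all four.
concentrated = temporal_entropy([9, 0, 0, 0], 4)
spread = temporal_entropy([3, 3, 3, 3], 4)
```

A term confined to a single partition gets the maximum TE of 1.0; a term spread evenly over all partitions gets 0.0, matching the intuition on this slide.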

SLIDE 11

Similarity Score (cont’)

"By analyzing search statistics [Google Zeitgeist], we can increase the probability for a particular time partition."

  • P(wi) is the probability that wi occurs: P(wi) = 1.0 for a gaining query, P(wi) = 0.5 for a declining query.
  • f(R) converts a rank into a weight; a higher-ranked query is more important.
  • The GZ score is combined linearly with the original similarity score [de Jong et al. '04].

An inverse partition frequency, ipf = log N/n
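The inverse partition frequency is the partition-level analogue of idf. A minimal sketch, assuming N is the number of partitions and n the number of partitions containing the term:

```python
import math

def ipf(term, partitions):
    """Inverse partition frequency: log(N/n), where N is the number of
    partitions and n the number of partitions containing the term."""
    n = sum(1 for words in partitions.values() if term in words)
    return math.log(len(partitions) / n)

partitions = {
    "1999": {"tidal wave", "Japan", "tsunami"},
    "2004": {"earthquake", "Thailand", "tsunami"},
}
rare = ipf("Thailand", partitions)   # appears in 1 of 2 partitions
common = ipf("tsunami", partitions)  # appears in every partition
```

A term present in every partition gets ipf = 0 and carries no temporal evidence, while a term confined to one partition gets the maximal weight.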

SLIDE 12

Experimental Setting

A reference corpus

  • Documents with known dates.
  • Collected from the Internet Archive.
  • News history web pages, e.g. ABC News, CNN, New York Post, etc.

Word        Probability  Partition
earthquake  0.080        2004
Thailand    0.012        2004
tsunami     0.091        2004
tidal wave  0.009        1999
Japan       0.003        1999
tsunami     0.015        1999

Temporal Language Models

  • A list of words and their probabilities in each time partition.
  • Intended to capture word usage within a certain time period.

SLIDE 13

Experiments

Constraints of a training set:

  • 1. Cover the domain of a document to be dated.
  • 2. Cover the time period of a document to be dated.

A reference corpus (15 sources):

  • Training set: 10 news sources selected from various domains.
  • Testing set: 1000 documents randomly selected from 5 other sources (different from the training sources).

Precision = the fraction of documents correctly dated.
Recall = the fraction of correctly dated documents processed.

SLIDE 14

Experiment (cont’)

Experiment | Evaluation Aspects | Description
A | Semantic-based preprocessing | Various combinations of semantics: 1) POS - WSD - CON - FILT; 2) POS - COLL - WSD - FILT; 3) POS - COLL - WSD - CON - FILT
B | Temporal Entropy, Google Zeitgeist | Combinations of TE and GZ, with or without semantic-based preprocessing.
C | Dating task and confidence | Similar to other classification tasks, the system should be able to tell how much confidence it has in assigning a timestamp. Confidence is measured by the distance between the scores of the 1st- and 2nd-ranked partitions.
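The confidence measure of aspect C can be sketched as the score gap between the two top-ranked partitions (assuming the partition scores are already normalized; the normalization itself is not shown):

```python
def confidence(partition_scores):
    """Distance between the scores of the 1st- and 2nd-ranked partitions."""
    top_two = sorted(partition_scores.values(), reverse=True)[:2]
    return top_two[0] - top_two[1]

c = confidence({"1999": 0.2, "2004": 0.7, "2005": 0.1})
```

A large gap (here 0.5) means the best partition clearly dominates; near-tied partitions yield a confidence near zero.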

SLIDE 15

(Figure: precision (%) at granularities 1-week, 1-month, 3-month, 6-month, and 12-month; panel (a) compares the Baseline with A.1, A.2, A.3; panel (b) compares the Baseline with TE, GZ, S-TE, S-GZ.)

Results

Semantic-based preprocessing:

  • Increases precision at almost all granularities except 1-week.
  • At a small granularity, it is hard to gain high accuracy.

Temporal Entropy, Google Zeitgeist:

  • By applying semantic-based preprocessing first, TE and GZ obtain a high improvement.
  • Semantic-based preprocessing generates collocations and concepts, which are weighted highly by TE and GZ (most search statistics are noun phrases).

SLIDE 16

Results (cont’)

(Figure: precision and recall (%) as a function of confidence level, from 0.0 to 1.0.)

Confidence levels and document dating accuracy

The higher the confidence, the more reliable the results.

SLIDE 17

Conclusion

  • Our approaches considerably increase quality compared to the baseline based on the previous approach.
  • Applications that require high precision can select only documents whose timestamp has been determined with high confidence.
  • Future research: apply other classification algorithms to document dating; introduce a weighting scheme so that only significant words are interpolated.

SLIDE 18

Questions

Questions are welcome ☺

SLIDE 19

Related Work

Only a small amount of work exists on determining the time of documents. It can be divided into two categories: determining the time of creation of a document or its contents, and determining the time of the topic of the contents.

Two techniques are employed: learning-based and non-learning.

Learning-based:

  • Learns from a set of training documents.
  • [de Jong et al. '05] is based on a statistical language model.
  • Gives the most likely time of origin, which is similar to the written time of a document.

Non-learning:

  • Does not require a corpus collection.
  • [Swan, Allan '99] and [Swan, Jensen '00] use a statistical method called hypothesis testing.
  • [Mani, Wilson '00] and [Llidó et al. '01] require explicitly time-tagged documents, which are resolved into a concrete or absolute date.
  • Gives a summary of the time of events that appear in the document content.

SLIDE 20

Temporal Language Models

Given a collection of corpus documents C = {d1, d2, …, dn}, a document model is defined as di = {{w1, w2, …, wn}, (ti, ti+1)}, where ti < ti+1 and ti < Time(di) < ti+1.

Similarity between two language models: "a normalized log-likelihood ratio [Kraaij '05]":

Score(di, pj) = Σ_{w ∈ di} P(w|di) · log( P(w|pj) / P(w|C) )

where P(w|di) is the probability of word w in document di, P(w|pj) is the probability of w in time partition pj, and P(w|C) is the probability of w in the corpus collection C.
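The scoring function can be sketched directly (a minimal version; the smoothing of zero partition probabilities against the corpus model, the mixing weight `lam`, and the probability values below are all illustrative assumptions, not the paper's settings):

```python
import math

def nllr_score(doc_probs, part_probs, corpus_probs, lam=0.9):
    """Score(di, pj) = sum_w P(w|di) * log(P(w|pj) / P(w|C)),
    with the partition model smoothed against the corpus model."""
    score = 0.0
    for w, p_doc in doc_probs.items():
        # Linear smoothing so unseen partition words keep a small probability.
        p_part = lam * part_probs.get(w, 0.0) + (1 - lam) * corpus_probs[w]
        score += p_doc * math.log(p_part / corpus_probs[w])
    return score

doc = {"tsunami": 0.5, "Thailand": 0.5}                       # P(w|di)
p2004 = {"tsunami": 0.091, "Thailand": 0.012, "earthquake": 0.080}
p1999 = {"tsunami": 0.015, "Japan": 0.003, "tidal wave": 0.009}
corpus = {"tsunami": 0.05, "Thailand": 0.006, "Japan": 0.002,
          "tidal wave": 0.005, "earthquake": 0.04}            # P(w|C)
better = nllr_score(doc, p2004, corpus) > nllr_score(doc, p1999, corpus)
```

The 2004 partition scores higher for this document, so it would be chosen as the tentative timestamp.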