Leon Derczynski - Supervised by Dr Amanda Sharkey - 2006



SLIDE 1

Leon Derczynski - Supervised by Dr Amanda Sharkey - 2006

SLIDE 2

This abstract relates to a document about low-price movies. The document contains the words “cheap film”, but is not useful.

  • Little human feedback is gathered on what makes a document relevant; the process is mainly automated.

  • The algorithms that decide relevancy are extremely complex and need to be built from scratch. In 2003, Google used over 120 independent variables to sort results.

Is it possible to teach a system how to identify relevant documents without defining any explicit rules?

SLIDE 3

To teach a system how to distinguish relevant documents from irrelevant ones, a large amount of training data is required. A wide range of documents and queries is needed to give a realistic model. Early work on indexing documents – dating back to the 1960s – provides collections of sample queries matched to relevant document content. Cyril Cleverdon pioneered work on organising information and creating indexes. He led the creation of a 1400-strong set of aerospace documents, accompanied by hundreds of natural-language queries. A list of matching documents was also manually created for each query. This set of documents, queries and relevance judgements became known as the Cranfield collection.

SLIDE 4

Searching all documents for a given query is a very time-consuming process. Documents can be indexed according to the words they contain, which shrinks the search space considerably. This allows documents containing keywords to be identified rapidly –

  • only one lookup needs to be performed for each word in the query!

Document A: “The aerodynamic properties of wing surfaces under pressure change according to temperature. The amount of pressure will also risk deforming the wing, thus moving any heat spots and adjusting flow.”

Document B: “High pressure water hoses are a fantastic tool for cleaning your garden. They also have uses in farming, where cattle enjoy a high hygiene standard due to regular washdowns.”

The resulting index:

  deforming → A
  pressure → A, B
  properties → A
  surfaces → A
  washdowns → B
  standard → B
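The index above can be sketched as a simple dictionary mapping each word to the set of documents containing it; the two short document strings here are abbreviations of the slide's examples, for illustration only.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,")].add(doc_id)
    return index

def lookup(index, query):
    """One index lookup per query word: documents containing every word."""
    result = None
    for word in query.lower().split():
        hits = index.get(word, set())
        result = hits if result is None else result & hits
    return result if result else set()

docs = {
    "A": "The aerodynamic properties of wing surfaces under pressure",
    "B": "High pressure water hoses are a fantastic tool",
}
index = build_index(docs)
```

Each query word costs a single dictionary lookup, regardless of how many documents are indexed.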

SLIDE 5

Identify document features

A set of statistics can be used to describe a document. They can be about the document itself, or about a particular word in the document. These numeric descriptions then become training examples for a machine learning algorithm.

[Diagram: feature groups – independent stats, overall keyword info, localised keyword info – feeding the learner]

Human judgement, from the reference collection, labels each example as positive or negative.

For example, two documents can be assessed based on a query such as:

“what chemical kinetic system is applicable to hypersonic aerodynamic problems”

A set of statistics describing each document relative to the query can then be derived.
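A minimal sketch of deriving such statistics for one keyword; the particular features below are illustrative stand-ins, not the project's exact feature set.

```python
def keyword_features(text, keyword):
    """Numeric statistics describing one keyword's behaviour in a document.
    Illustrative features: frequency, density, first position, and the
    proportion of sentences containing the keyword."""
    words = text.lower().split()
    sentences = [s for s in text.lower().split(".") if s.strip()]
    positions = [i for i, w in enumerate(words) if w.strip(".,") == keyword]
    with_kw = sum(1 for s in sentences if keyword in s.split())
    return {
        "frequency": len(positions),
        "density": len(positions) / len(words),
        "first_position": positions[0] if positions else -1,
        "sentence_proportion": with_kw / len(sentences),
    }
```

Feature vectors like this, paired with the human relevance judgement, become the positive and negative training examples.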

SLIDE 6

Decision trees are acyclic graphs that have a decision at each branch, based on an attribute of an example, and end at leaves which classify a document as relevant or not relevant.


A C4.5 Decision Tree, produced in an effort to emulate the decisions of the Cranfield judges. The full version of this tree attained an 80.4% accuracy rate.

[Tree diagram. Attributes tested include: first position of keyword; ratio of sentences missing the keyword to those containing it; absolute average word length; keyword density in keyword sentences; number of sentences in the document; proportion of paragraphs containing the keyword; absolute position of paragraphs containing the keyword; mean position in paragraph of sentences with the keyword; keyword frequency; mean position in sentence of the keyword; keyword density. Branch thresholds and the positive/negative leaves are omitted here.]
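A trained tree reduces to nested threshold tests. The fragment below is a hand-written sketch in the spirit of the slide's C4.5 tree; the thresholds are invented for illustration and are not the learned values.

```python
def classify(features):
    """Walk a tiny decision-tree fragment: each branch tests one
    attribute against a threshold, leaves return the class.
    Thresholds here are illustrative, not the learned ones."""
    if features["keyword_density"] <= 0.0098:
        return "negative"
    if features["keyword_frequency"] <= 6:
        if features["mean_sentence_position"] <= 1.1:
            return "negative"
        return "positive"
    return "positive"
```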

SLIDE 7

Neural nets

Neural nets have a set of nodes, each of which has various weights assigned to its inputs. These are coupled with attributes, and when a certain internal value is reached, the output value changes. Backpropagation is used to help converge on a net that solves the problem.
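A single node of such a net can be sketched as a weighted sum against a threshold; backpropagation (not shown) would adjust the weights to reduce classification error.

```python
def node_output(inputs, weights, threshold):
    """One network node: output 1 when the weighted sum of the
    inputs reaches the node's internal threshold, otherwise 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0
```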

K-Nearest Neighbour

K-Nearest Neighbour plots all training data as points in multi-dimensional space, with one dimension for each attribute. New examples are classified by working out the weighted average classification of the k nearest training examples.

[Diagram: Document A, the query, and Document B plotted as points in feature space]
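The k-NN idea can be sketched as below; for simplicity this version uses an unweighted majority vote over the k nearest points (the slide's variant weights the votes by distance), and the training points are made-up examples.

```python
from collections import Counter

def knn_classify(point, training, k=3):
    """Classify a new example by majority vote among the k nearest
    training examples, using Euclidean distance in attribute space."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(training, key=lambda ex: dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up feature vectors with relevance labels, for illustration.
training = [
    ((0, 0), "negative"), ((0, 1), "negative"),
    ((5, 5), "positive"), ((5, 6), "positive"), ((6, 5), "positive"),
]
```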

SLIDE 8

The task is possible: all of the algorithms managed to learn to identify a substantial proportion of the relevant documents.

[Chart: MED classification accuracy (50–100%) against number of hidden units (1–20), for three runs (acc1, acc2, acc3)]

[Chart: accuracy difference (0–40%) across four sets as the number of negative examples grows from 100 to 600]

Not every document suggested as relevant by human judges could be matched by the system. Sometimes, words were used that did not occur in the document. Adding synonym lookup or a thesaurus should help.
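Synonym lookup would let a query match documents that use different wording; a minimal sketch, with a tiny hand-written synonym table standing in for a real thesaurus such as WordNet.

```python
# Hypothetical synonym table standing in for a real thesaurus.
SYNONYMS = {
    "cheap": {"inexpensive", "low-price"},
    "film": {"movie", "picture"},
}

def expand_query(query):
    """Return the query words plus any known synonyms, so documents
    phrased differently from the query can still be matched."""
    expanded = set()
    for word in query.lower().split():
        expanded.add(word)
        expanded |= SYNONYMS.get(word, set())
    return expanded
```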