Template-Based Information Mining from HTML Documents

Jane Yung-jen Hsu & Wen-tau Yih
Computer Science and Information Engineering
National Taiwan University

AAAI-97

Outline

  • The Web Information Mining Problem
  • A Model of Electronic Documents
  • Document Templates
  • Template-based Information Extraction
  • A Case Study: FAQ Miner
  • Conclusion

Web Information Mining

Search for relevant documents
  • Search engines
  • Web guides
  • White & yellow pages

Extract target information from documents
  • Document analysis
  • Information extraction in resource discovery
  • Smart web shopping

The Myth about Keywords

Relevant information can be found using keyword-based methods, e.g.
  • Search for relevant documents
  • Filter undesirable information
  • Extract useful information

Are keywords sufficient to satisfy most of our informational needs?

Problem with Keywords: Example

Sample FAQ Documents

Semi-Structured Document Hypothesis

A semi-structured document, e.g. an HTML document with tags, provides sufficient structural hints to enable effective extraction of semantically meaningful information.

Machine readable

↑ machine usable

Basic Elements of a Document

  • Content: the actual data in a document
  • Format: the visual presentation of a document
  • Structure: the logical elements and their relationships

Content

MEMORANDUM TO: JOHN SMITH, GRADUATE OFFICE FROM: MARK SAM SUBJ: STUDENT APPEALS MEETING DATE: 8 APR, 1997 There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore. Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

MEMORANDUM

TO: JOHN SMITH, GRADUATE OFFICE
FROM: MARK SAM
SUBJ: STUDENT APPEALS MEETING
DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Format

BLAHBLAHBL

BLA HBLA HBLAHB LBLAHBLA HBLAHB BLAHA HBLA HBL BLAHB LBLAHBL ABLAHBL ABLAHBLA BLAHB L AHBL AHBL Blahb lahb la h blahbla hb lah Blahblahb la Hblahb Lahbla hb Blahblahbl Ahbl ahb lahb la hblah blah bl ahbl ahbl ah Blah bla Hblahblahb Blahbl blah blahb lahbla hb lahblahb Bl ahb lahbla hblahbl ahblah blahbla Hbla Hblahbl ahbl hblah

MEMORANDUM

TO: JOHN SMITH, GRADUATE OFFICE
FROM: MARK SAM
SUBJ: STUDENT APPEALS MEETING
DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Structure

MEMORANDUM

TO: JOHN SMITH, GRADUATE OFFICE
FROM: MARK SAM
SUBJ: STUDENT APPEALS MEETING
DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Memorandum
  • Title
  • Header Block
      • Receiver field
      • Sender field
      • Subject field
      • Date field
  • Memo Body
      • Paragraph 1
      • Paragraph 2

A Model of Electronic Documents

  • S: a set of structural components
      Title, Header Block, Memo Body
  • C: the sequence of content symbols
      MEMORANDUM TO: JOHN SMITH, GRADUATE OFFICE ....
  • F: format properties of elements in C
      Blah blah blah
  • A partial ordering over S
  • A mapping between C and S
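
The five parts of the model can be written down as a small data structure. The following is only a sketch under stated assumptions: the class name `Document`, the field names `parent` and `span`, the choice of whitespace-separated tokens as content symbols, and the shortened memo are all illustrative and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Document:
    S: set[str]                        # structural components
    C: list[str]                       # sequence of content symbols (assumed: tokens)
    F: dict[int, dict[str, str]]       # format properties, keyed by index into C
    parent: dict[str, str]             # ordering over S, stored as child -> parent
    span: dict[str, tuple[int, int]]   # mapping between C and S (half-open ranges)

    def text_of(self, component: str) -> str:
        """Content symbols that the mapping assigns to one component."""
        start, end = self.span[component]
        return " ".join(self.C[start:end])

    def ancestors(self, component: str) -> list[str]:
        """Components above the given one, nearest first."""
        chain = []
        while component in self.parent:
            component = self.parent[component]
            chain.append(component)
        return chain


# A shortened, hypothetical version of the memo used on the slides.
memo = Document(
    S={"Memo", "Title", "Header", "Receiver", "Sender", "Body"},
    C="MEMORANDUM TO: JOHN SMITH FROM: MARK SAM Please attend.".split(),
    F={0: {"weight": "bold"}},         # made-up format property for the title
    parent={"Title": "Memo", "Header": "Memo", "Body": "Memo",
            "Receiver": "Header", "Sender": "Header"},
    span={"Memo": (0, 9), "Title": (0, 1), "Header": (1, 7),
          "Receiver": (1, 4), "Sender": (4, 7), "Body": (7, 9)},
)

print(memo.text_of("Receiver"))   # TO: JOHN SMITH
print(memo.ancestors("Sender"))   # ['Header', 'Memo']
```

Storing the mapping as index ranges keeps C untouched, so content, format and structure stay separate, as the three basic elements suggest.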

Properties of Document Structure

  • Context continuity
  • Partial order between levels
  • Total order within the same level
  • Order-preserving

Memo
  • Title
  • Header
      • Sender
      • Receiver
      • Subject
      • Date
  • Body
      • Paragraph 1
      • Paragraph 2
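
One way to read these properties is as checks on the mapping between content and structure. The sketch below assumes an interpretation the slides do not spell out: every component covers one contiguous range of content positions, a child's range lies inside its parent's range, and components on the same level can be sorted by position without overlapping. The function name `check_structure` and the sample ranges are hypothetical.

```python
def check_structure(parent, span):
    """Return a list of violations; an empty list means all checks pass."""
    problems = []

    # Between levels: a child's range must lie inside its parent's range.
    for child, par in parent.items():
        child_start, child_end = span[child]
        par_start, par_end = span[par]
        if not (par_start <= child_start and child_end <= par_end):
            problems.append(f"{child} not inside {par}")

    # Within one level: siblings sorted by start position must not overlap.
    siblings = {}
    for child, par in parent.items():
        siblings.setdefault(par, []).append(child)
    for kids in siblings.values():
        kids.sort(key=lambda k: span[k][0])
        for left, right in zip(kids, kids[1:]):
            if span[left][1] > span[right][0]:
                problems.append(f"{left} overlaps {right}")

    return problems


# Hypothetical half-open ranges over a nine-token memo.
parent = {"Title": "Memo", "Header": "Memo", "Body": "Memo",
          "Receiver": "Header", "Sender": "Header"}
span = {"Memo": (0, 9), "Title": (0, 1), "Header": (1, 7),
        "Receiver": (1, 4), "Sender": (4, 7), "Body": (7, 9)}

print(check_structure(parent, span))                        # []
print(check_structure(parent, dict(span, Sender=(3, 7))))   # ['Receiver overlaps Sender']
```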

Template-based Information Extraction

  • Title
      MEMORANDUM
  • Header
      • Receiver: TO: JOHN SMITH, GRADUATE OFFICE
      • Sender: FROM: MARK SAM
      • Subject: SUBJ: STUDENT APPEALS MEETING
      • Date: DATE: 8 APR, 1997
  • Body
      • Paragraph 1: There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.
      • Paragraph 2: Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Memo in SGML

<memorandum>
  <title> MEMORANDUM </title>
  <header>
    <rec> TO: JOHN SMITH, GRADUATE OFFICE </rec>
    <send> FROM: MARK SAM </send>
    <subj> SUBJ: STUDENT APPEALS MEETING </subj>
    <date> DATE: 8 APR, 1997 </date>
  </header>
  <body>
    <paragraph> There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore. </paragraph>
    <paragraph> Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234. </paragraph>
  </body>
</memorandum>

MEMORANDUM

  • TO: JOHN SMITH, GRADUATE OFFICE
  • FROM: MARK SAM
  • SUBJ: STUDENT APPEALS MEETING
  • DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Memo in HTML (1/2)

<BODY>
  <H1> MEMORANDUM </H1>
  <HR>
  <UL>
    <LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>
    <LI> FROM: MARK SAM </LI>
    <LI> SUBJ: STUDENT APPEALS MEETING </LI>
    <LI> DATE: 8 APR, 1997 </LI>
  </UL>
  <HR>
  <P> There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore. </P>
  <P> Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234. </P>
</BODY>

MEMORANDUM

TO: JOHN SMITH, GRADUATE OFFICE
FROM: MARK SAM
SUBJ: STUDENT APPEALS MEETING
DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.
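
To illustrate how tags can act as structural hints, the sketch below reads a shortened version of this markup with Python's standard `html.parser` module. The class name `MemoParser` and the tag-to-role table (`H1` as title, `LI` as header field, `P` as body paragraph) are assumptions made for this example; the slides do not prescribe them.

```python
from html.parser import HTMLParser


class MemoParser(HTMLParser):
    """Collect the text found under H1, LI and P elements."""

    # Assumed mapping from tag name to structural role (illustrative only).
    ROLES = {"h1": "title", "li": "header", "p": "body"}

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "header": [], "body": []}
        self._role = None   # role of the element currently open, if any
        self._buf = []      # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag in self.ROLES:
            self._role = self.ROLES[tag]
            self._buf = []

    def handle_data(self, data):
        if self._role:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._role and self.ROLES.get(tag) == self._role:
            text = " ".join("".join(self._buf).split())  # normalise whitespace
            self.fields[self._role].append(text)
            self._role = None


html = (
    "<BODY><H1> MEMORANDUM </H1><HR><UL>"
    "<LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>"
    "<LI> FROM: MARK SAM </LI>"
    "</UL><HR>"
    "<P> Please make every effort to attend. </P></BODY>"
)

parser = MemoParser()
parser.feed(html)
print(parser.fields["title"])    # ['MEMORANDUM']
print(parser.fields["header"])   # ['TO: JOHN SMITH, GRADUATE OFFICE', 'FROM: MARK SAM']
```

A fixed tag-to-role table like this one only fits pages written in this particular style. The second HTML version of the memo marks the title with FONT and B and separates paragraphs with BR, so the same table would find no title there.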

Memo in HTML (2/2)

<BODY>
  <P> <FONT SIZE=6> <B> MEMORANDUM </B></FONT> </P>
  <HR>
  <UL>
    <LI> TO: JOHN SMITH, GRADUATE OFFICE </LI>
    <LI> FROM: MARK SAM </LI>
    <LI> SUBJ: STUDENT APPEALS MEETING </LI>
    <LI> DATE: 8 APR, 1997 </LI>
  </UL>
  <HR>
  There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.
  <BR> <BR>
  Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.
</BODY>

MEMORANDUM

TO: JOHN SMITH, GRADUATE OFFICE
FROM: MARK SAM
SUBJ: STUDENT APPEALS MEETING
DATE: 8 APR, 1997

There will be a meeting of the Committee on Student Appeals on Wednesday, June 10, 1997 at 10:00 a.m. to 1:00 p.m. in Room 504 Cullimore.

Please make every effort to attend. If you cannot attend, please contact Mary Armour, ext. 1234.

Template: FAQ Documents

Standard_TFAQ
  • Title: <TITLE> TERM_faq_title </TITLE>
  • toc
      index_indicator: TERM_TOC_indicator
      index_body: (ordered_list <OL> list_item* </OL> | unordered_list <UL> list_item* </UL>)
  • q_a_pairs: question_answer_paragraph*

list_item: <LI> Hyperlink_Anchor TERM_question </A> </LI>
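
A rough impression of what matching a page against this template involves can be given with regular expressions. This is a simplified sketch, not the matcher used in FAQ Miner: the function name `match_standard_tfaq`, the `toc_indicator` pattern, and the sample page are hypothetical, and the q_a_pairs part of the template is not handled.

```python
import re


def match_standard_tfaq(html, toc_indicator=r"table of contents|index"):
    """Return the title and table-of-contents questions, or None if the page does not fit."""
    flags = re.IGNORECASE | re.DOTALL

    # Title: <TITLE> TERM_faq_title </TITLE>
    title = re.search(r"<title>(.*?)</title>", html, flags)
    # index_body: an ordered or unordered list
    toc = re.search(r"<(ol|ul)>(.*?)</\1>", html, flags)
    if not title or not toc:
        return None

    # index_indicator: some indicator text must appear before the list (assumed wording)
    if not re.search(toc_indicator, html[:toc.start()], flags):
        return None

    # list_item: <LI> Hyperlink_Anchor TERM_question </A> </LI>
    questions = re.findall(r"<li>\s*<a\b[^>]*>(.*?)</a>\s*</li>", toc.group(2), flags)
    if not questions:
        return None

    return {"title": title.group(1).strip(),
            "questions": [q.strip() for q in questions]}


page = """<HTML><HEAD><TITLE> Widget FAQ </TITLE></HEAD><BODY>
<H2>Table of Contents</H2>
<OL>
<LI><A HREF="#q1">What is a widget?</A></LI>
<LI><A HREF="#q2">How do I install it?</A></LI>
</OL>
<A NAME="q1"></A><P>What is a widget? A small part.</P>
</BODY></HTML>"""

result = match_standard_tfaq(page)
print(result["title"])       # Widget FAQ
print(result["questions"])   # ['What is a widget?', 'How do I install it?']
```

Returning None when any part of the template is missing lets a caller try another template or treat the page as a failed match.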

The FAQ Agent

[Diagram labels: FAQ Worm, FAQ Miner, FAQ Knowledge Base, FAQ Documents, User Input, FAQ Answer Finder, Answers, FAQ Information]

FAQ Miner Architecture

[Diagram labels: Template Matching, Template KB, Extract Target Information, Template Modification, FAQ Document, Learning Modules; branches: Fail, Success]
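
The arrows of the architecture diagram are not preserved in this text, so the control flow below is an assumption: a document is matched against each template in the template KB, a successful match leads to extraction, and a failure is handed to a template-modification step that may add a new template. The names `mine`, `modify_template`, and the toy dictionary-based documents are hypothetical.

```python
def mine(document, template_kb, modify_template=None):
    """Try each known template; on failure, optionally derive and try a new one."""
    for name, matcher in template_kb.items():
        extracted = matcher(document)
        if extracted is not None:          # assumed "Success" branch
            return name, extracted

    if modify_template is not None:        # assumed "Fail" branch
        name, matcher = modify_template(document, template_kb)
        template_kb[name] = matcher        # remember the new template
        extracted = matcher(document)
        if extracted is not None:
            return name, extracted
    return None


# Toy stand-ins: documents are dicts, matchers return the TOC or None.
kb = {"Standard_TFAQ": lambda d: d.get("toc") if d.get("indicator") else None}


def relax(document, template_kb):
    """Hypothetical modification step: drop the TOC-indicator requirement."""
    return "No_TOC_Indicator", lambda d: d.get("toc")


print(mine({"indicator": True, "toc": ["q1"]}, kb))   # ('Standard_TFAQ', ['q1'])
print(mine({"toc": ["q2"]}, kb, relax))               # ('No_TOC_Indicator', ['q2'])
```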

Sample FAQ documents

Experimental Results

Template            # of documents   Success Ratio
Standard_TFAQ       62               56.4%
No_TOC_Indicator    10               9.1%
Near Pass           13               11.8%
Difficult           25               22.7%
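
The percentages are consistent with each count divided by the sum of the four counts. The slide does not state that total; the 110 below is derived from the table, and the variable names are illustrative.

```python
counts = {
    "Standard_TFAQ": 62,
    "No_TOC_Indicator": 10,
    "Near Pass": 13,
    "Difficult": 25,
}

total = sum(counts.values())   # 110, derived from the table rather than stated on the slide
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<18}{counts[name]:>4}{pct:>7.1f}%")
```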

Concluding Remarks

Document structure facilitates information extraction.
  • HTML documents are tree-structured.
  • HTML tags provide hints for structural elements.
  • Effective information mining is possible.

What's next?
  • Tree-structured document templates
  • Semantic parsing