Automatic text classification and extraction of entities and their properties from the text

Anton Kolonin, Webstructor project 2015, SIBIRCON/SibMedInfo


SLIDE 1

http://www.webstructor.net/

Automatic text classification and extraction of entities and their properties from the text

Anton Kolonin, Webstructor project 2015, SIBIRCON/SibMedInfo

SLIDE 2

Unified approach: Different cases

Here’s the Tylenol twist: Before they began writing, half of each group received acetaminophen while the other half swallowed a placebo. Even among those people who wrote about death, the Tylenol takers set bail at roughly $300, a sign that acetaminophen may significantly reduce feelings of existential anxiety, explains study lead author Daniel Randles, a PhD candidate in UBC’s department of... psychology.

acetaminophen may significantly reduce feelings of existential anxiety, explains study lead author Daniel Randles.

Category: “Healthcare”
Entity (Case): “Treatment: Healing anxiety with Tylenol”
Brand: Tylenol
Substance: acetaminophen
Reliability: medium
Effect: positive
Diagnosis: Anxiety
Reporter: Daniel Randles

Diagram labels: HAS, IS

Diagram tokens: tylenol acetaminophen placebo significantly reduce feelings study acetaminophen may reduce anxiety explains

SLIDE 3

Classification with feature vectors: Training process

Diagram entities: Token, Feature, Text, TextFeature, TextToken, FeatureToken, Text Category, CategoryFeature, Category, Domain

Diagram inputs: Item descriptions; Keywords and phrases; Multiple categories and specific attributes

Diagram processes: Category instantiation, Feature instantiation, Category Feature inference

Three sub-processes contribute to the learning process.

The first is Category instantiation, which takes the attributes defined for the texts in the training corpus (either encoded in the text as tags or taken from the respective database table fields) and creates categories for them, given the domain indicated by the attribute.

The second is Feature instantiation, which takes the text in the training corpus and decomposes it into tokens and features according to the parser, tokenizer and feature builder, depending on the implementation.

These two processes are independent of each other, but both precede the third, Category Feature inference. It uses statistics to infer the relevance of the features encountered in the texts to the categories associated with those texts.
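The three training sub-processes on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Webcat implementation: the whitespace tokenizer, the unigram-plus-bigram feature builder, and the relevance statistic (the share of texts containing a feature that carry a given category) are all hypothetical choices, since the slide only says that the inference "employs statistics".

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace, trim punctuation."""
    tokens = (t.strip(".,:;!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t]


def build_features(tokens):
    """Assumed feature builder: unigrams plus adjacent bigrams."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def train(corpus):
    """corpus: iterable of (text, {domain: category}) pairs."""
    categories = defaultdict(set)        # domain -> categories seen
    feature_counts = Counter()           # feature -> number of texts containing it
    joint_counts = defaultdict(Counter)  # (domain, category) -> feature -> count

    for text, attributes in corpus:
        # Feature instantiation: text -> tokens -> features.
        features = set(build_features(tokenize(text)))
        feature_counts.update(features)
        for domain, category in attributes.items():
            # Category instantiation: one category per attribute value.
            categories[domain].add(category)
            joint_counts[(domain, category)].update(features)

    # Category Feature inference (assumed statistic): fraction of the texts
    # containing a feature that are labelled with the category.
    relevance = {
        key: {f: n / feature_counts[f] for f, n in counts.items()}
        for key, counts in joint_counts.items()
    }
    return dict(categories), relevance


corpus = [
    ("Tylenol may reduce anxiety", {"Category": "Healthcare"}),
    ("Stocks reduce risk", {"Category": "Finance"}),
]
categories, relevance = train(corpus)
```

With this toy corpus, a feature seen under only one category gets the maximum relevance for it, while a feature shared by both categories is split between them.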

SLIDE 4

Classification with feature vectors: Recognition process

Diagram entities: Token, Feature, Text, TextFeature, TextToken, FeatureToken, Text Category, CategoryFeature, Category, Domain

Diagram inputs: Categories assigned to particular descriptions; Category assignments learned from training corpus

Diagram processes: Feature detection, Text Category inference

Two sub-processes contribute to the rule-applying process; the process flow diagram depicts the dependencies between the sub-processes and the data.

The first is Feature detection, which takes the text in novel data and decomposes it into tokens and features according to the parser, tokenizer and feature builder, depending on the implementation. It is similar to Feature instantiation during learning, with one key difference: only the features instantiated earlier, during learning, can be detected; no new features are instantiated.

The second is Text Category inference. It uses statistics to infer the relevance of texts to categories through the features detected in the texts and learned for those categories.
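The recognition step described on this slide can be illustrated with a small Python sketch. It assumes a hypothetical table of learned feature relevances per category (the `learned` dictionary below is made-up data) and a hypothetical scoring rule, the mean relevance over detected features; the slide does not specify which statistic is actually used.

```python
def detect_features(text, known_features):
    """Feature detection: keep only features already instantiated in training.

    Assumes unigram features and a whitespace tokenizer for brevity.
    """
    tokens = (t.strip(".,:;!?\"'()") for t in text.lower().split())
    return {t for t in tokens if t in known_features}


def classify(text, relevance):
    """Text Category inference: score each category by its detected features.

    relevance: {category: {feature: weight}} learned earlier (hypothetical).
    Returns {category: score}; empty if no learned feature is detected.
    """
    known = set()
    for weights in relevance.values():
        known.update(weights)

    detected = detect_features(text, known)
    if not detected:
        return {}
    return {
        category: sum(weights.get(f, 0.0) for f in detected) / len(detected)
        for category, weights in relevance.items()
    }


# Made-up learned weights, for illustration only.
learned = {
    "Healthcare": {"tylenol": 1.0, "reduce": 0.5, "anxiety": 1.0},
    "Finance": {"stocks": 1.0, "reduce": 0.5, "risk": 1.0},
}
scores = classify("Aspirin and Tylenol reduce anxiety", learned)
best = max(scores, key=scores.get)
```

Note that "aspirin" and "and" are silently ignored here because they were never instantiated as features, which mirrors the key difference from training stated on the slide.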

SLIDE 5

Webcat: Plain user interface

SLIDE 6

Webcat: Expert user interface

SLIDE 7

Webcat: Extracting entity properties

SLIDE 8

Webcat: Complex patterns and rules

  • Sparse N-gram: “tylenol significantly reduce feelings of existential anxiety” = “tylenol ... reduce ... anxiety”
  • Priority on order: “reduce ... feelings” is more important than “reduce AND feelings”
  • Boolean ranking: “acetaminophen AND tylenol” is more important than “placebo”, regardless of statistics
  • Contextual scoping: “tylenol” implies that “may” is a reliability measure, not a month of the year
  • Compression of vector space: “disease OR illness” is much faster than “disease OR illness OR ail OR blast OR sick” (even if a little less accurate)
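The first two rules on this slide, sparse N-grams and priority on order, can be sketched as follows. The function names and the score values (2 for an ordered match, 1 for an unordered co-occurrence) are hypothetical and chosen only to show that an ordered match outranks a plain AND; they are not taken from Webcat.

```python
def is_sparse_ngram(tokens, terms):
    """True if terms occur in tokens in the given order, gaps allowed."""
    remaining = iter(tokens)
    # "term in remaining" consumes the iterator up to the first hit, so each
    # following term has to be found after the previous one.
    return all(term in remaining for term in terms)


def contains_all(tokens, terms):
    """True if every term occurs somewhere in tokens (plain AND)."""
    present = set(tokens)
    return all(term in present for term in terms)


def score(tokens, terms):
    """Assumed ranking: ordered match > unordered co-occurrence > no match."""
    if is_sparse_ngram(tokens, terms):
        return 2
    if contains_all(tokens, terms):
        return 1
    return 0


sentence = "tylenol significantly reduce feelings of existential anxiety".split()
ordered = score(sentence, ["reduce", "feelings"])
unordered = score(sentence, ["feelings", "reduce"])
```

Both term lists pass the AND test, but only the first one respects the word order of the sentence, so it receives the higher score.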

SLIDE 9

Accuracy: Human vs. Computer: Sources of errors

  • 76%-96% Software matches Human
  • 1-8% Software performs better than Human
  • 1-8% Human typos in training corpus
  • 1-8% Lack of data in training corpus
  • 1-8% Software lacks Human-level intelligence

SLIDE 10

Statistical «fuzzy» learning vs. «rigid» patterns and rules

Chart: Accuracy (0% to 100%) against Situations (Familiar to Novel), with one curve for the approach based on rules and patterns and one for the approach based on statistics.

SLIDE 11

Finding entities with properties: Hierarchical patterns

acetaminophen may significantly reduce feelings of existential anxiety

Diagram: a hierarchical pattern built over the sentence above.
Tokens: acetaminophen, tylenol, aspirin, acetylsalicylic acid; reduce, reduces, treat, treats; existential, anxiety, frustration, head cold
Node labels: “drug”, “procedure”, “disease”, “positive effect”, “negative effect”, NOUN, VERB
Operators: OR, AND

SLIDE 12

Hierarchical patterns: Definition

<pattern> := <token> | <regexp> | <variable> | <set>
<set> := <conjunctive-set> | <N-gram> | <syn-set>
<conjunctive-set> := ( <pattern> * )
<N-gram> := [ <pattern> * ]
<syn-set> := { <pattern> * }

Examples

Pattern: {[$description catheter] [$coating coating] [$inner-diameter {diameter inner-diameter}] [$tip tip] [$pattern pattern]}

Text: X Convey Guiding Catheter. Unique hydrophilic coating. Small atraumatic soft tip. Ultra-thin 1 × 2 flat wire braid pattern

Result: { coating : 'hydrophilic', description : 'convey guiding', pattern : 'ultra-thin 1 × 2 flat wire braid', tip : 'soft' }
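A minimal Python sketch of how the three set types in this grammar might be evaluated against a token sequence. The class names `AllOf`, `NGram` and `AnyOf` are hypothetical, and so are the matching semantics: a conjunctive-set requires every member, an N-gram requires its members in order with gaps allowed, and a syn-set requires any one member. Variables such as `$coating` and regular expressions are left out to keep the example short, so this does not reproduce the catheter extraction above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class AllOf:   # <conjunctive-set> := ( <pattern> * )
    parts: Tuple["Pattern", ...]


@dataclass(frozen=True)
class NGram:   # <N-gram> := [ <pattern> * ]
    parts: Tuple["Pattern", ...]


@dataclass(frozen=True)
class AnyOf:   # <syn-set> := { <pattern> * }
    parts: Tuple["Pattern", ...]


Pattern = Union[str, AllOf, NGram, AnyOf]


def match(pattern: Pattern, tokens: Sequence[str], start: int = 0) -> Optional[int]:
    """Return the index just past the match found at or after start, else None."""
    if isinstance(pattern, str):
        # A plain token matches its first occurrence at or after start.
        for i in range(start, len(tokens)):
            if tokens[i] == pattern:
                return i + 1
        return None
    if isinstance(pattern, NGram):
        # Members must match one after another; gaps between them are allowed.
        pos = start
        for part in pattern.parts:
            end = match(part, tokens, pos)
            if end is None:
                return None
            pos = end
        return pos
    ends = [match(part, tokens, start) for part in pattern.parts]
    if isinstance(pattern, AnyOf):
        # Any member suffices; prefer the one that ends earliest.
        found = [e for e in ends if e is not None]
        return min(found) if found else None
    # AllOf: every member must match somewhere at or after start.
    if any(e is None for e in ends):
        return None
    return max(ends, default=start)


drug = AnyOf(("acetaminophen", "tylenol", "aspirin"))
effect = AnyOf(("reduce", "reduces", "treat", "treats"))
disease = AnyOf(("anxiety", "frustration"))
treatment = NGram((drug, effect, disease))

tokens = "acetaminophen may significantly reduce feelings of existential anxiety".split()
end = match(treatment, tokens)
```

The nested `treatment` pattern matches the example sentence from the slides because a drug term, an effect verb and a disease term appear in that order; reversing the order of the words makes the N-gram fail while an `AllOf` over the same parts would still succeed.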

SLIDE 13

Goal: Learn patterns experientially

Diagram: Big Picture.
Perception: Visual Perception, Olfactory Perception, Tactile Perception, Audial Perception, Emotional Perception
Cognition: Objective Cognition, Social Cognition, Linguistic Cognition, Abstract Cognition
Other labels: Elower, Eupper, r, Actdocument, Actaction, I

SLIDE 14

Part-of-speech tagging? Need full semantic context to be precise

Косой косой косарь Косой с косой косой косил на косе косо

(Russian; roughly: “The squint-eyed, tipsy mower Kosoy was mowing crookedly on the sand spit with a crooked scythe.” The sentence is built almost entirely from homonymous forms of «кос-».)

Questions that each word answers: Where? Who (profession)? What kind (state of intoxication)? What kind (property of eyesight)? How? Did what? By means of what? With what? Who (name, nickname)?

SLIDE 15

Current implementation

SLIDE 16

Thank you for your attention!

Automatic text classification and extraction of entities and their properties

Anton Kolonin, Webstructor project 2015, SIBIRCON/SibMedInfo