

SLIDE 1


KDD 2018

OpenTag: Open Attribute Value Extraction From Product Profiles

Guineng Zheng*, Subhabrata MukherjeeΔ, Xin Luna DongΔ, FeiFei Li*

ΔAmazon.com, *University of Utah

SLIDE 2


Motivation


"Alexa, what are the flavors of Nescafe?" — Nescafe Coffee flavors include caramel, mocha, vanilla, coconut, cappuccino, original/regular, decaf, espresso, and cafe au lait decaf.

SLIDE 3


Attribute value extraction from product profiles

(Figure: product profile with Flavor and Brand values highlighted)

SLIDE 4


Characteristics of Attribute Extraction

Open World Assumption

  • No Predefined Attribute Value
  • New Attribute Value Discovery

Limited semantics, irregular syntax

  • Most titles have 10-15 words
  • Most bullets have 5-6 words
  • Phrases, not sentences
  • Lack of regular grammatical structure in titles and bullets
  • Attribute stacking:
  • 1. Rachael Ray Nutrish Just 6 Natural Dry Dog Food, Lamb Meal & Brown Rice Recipe
  • 2. Lamb Meal is the #1 Ingredient
  • 1. beef flavor 2. lamb flavor 3. venison flavor

SLIDE 5


Prior Work and Our Contributions

Comparison dimensions: Open World Assumption; No Lexicon, No Hand-crafted Features; Active Learning

  • Ghani et al. 2003; Putthividhya et al. 2011; Ling et al. 2012; Petrovski et al. 2017
  • Huang et al. 2015; Kozareva et al. 2016
  • Kozareva et al. 2016; Lample et al. 2016; Ma et al. 2016
  • OpenTag (this work): addresses all three

SLIDE 6


Outline


  • Problem Definition
  • Models
  • Experiments
  • Active Learning
  • Experiments
SLIDE 7


Recap: Problem Statement


Input: Product Profile (Title, Description, Bullets)    Output: Extractions (Flavor, Brand)

Title: CESAR Canine Cuisine Variety Pack Filet Mignon & Porterhouse Steak Dog Food (Two 12-Count Cases)
Description: A Delectable Meaty Meal for a Small Canine. Looking for the right food … This delicious dog treat contains tender slices of meat in gravy and is formulated to meet the nutritional needs of small dogs. CESAR Canine Cuisine provides complete and balanced nutrition …

Extractions:
  • Flavor: filet mignon; porterhouse steak
  • Brand: cesar canine cuisine

Given product profiles (e.g., titles, descriptions, bullets) and a set of attributes, extract the values of those attributes from the profile text

SLIDE 8


Attribute Extraction as Sequence Tagging

INPUT x = {w1, w2, …, wn}: token sequence, e.g. w1 beef, w2 meal, w3 &, w4 ranch, w5 raised, w6 lamb, w7 recipe
OUTPUT y = {t1, t2, …, tn}: one tagging decision per token from {B, I, O, E}

  • B: Beginning of attribute value
  • I: Inside of attribute value
  • O: Outside of attribute value
  • E: End of attribute value

Tagging "beef meal & ranch raised lamb recipe" as B E O B I E O yields the Flavor extractions {beef meal} and {ranch raised lamb}.
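A minimal sketch (not the paper's code) of how a BIOE tag sequence decodes into extracted values; incomplete spans (a B with no matching E) are simply dropped here, a convention the slide does not specify:

```python
def decode_bioe(tokens, tags):
    """Turn parallel token/tag sequences into extracted value strings."""
    values, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # beginning of a value: start a fresh span
            current = [token]
        elif tag == "I" and current:    # inside: extend the open span
            current.append(token)
        elif tag == "E" and current:    # end: close the span and emit it
            current.append(token)
            values.append(" ".join(current))
            current = []
        else:                           # O (or a stray tag): discard any open span
            current = []
    return values

tokens = "beef meal & ranch raised lamb recipe".split()
tags = ["B", "E", "O", "B", "I", "E", "O"]
print(decode_bioe(tokens, tags))  # → ['beef meal', 'ranch raised lamb']
```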

SLIDE 9


Outline

  • Introduction
  • Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
  • Active Learning

SLIDE 10


OpenTag Architecture

SLIDE 11


OpenTag Architecture (1/4): Word Embedding


Map ‘beef’, ‘chicken’, ‘pork’ to nearby points in the Flavor embedding space

SLIDE 12


OpenTag Architecture (2/4): Bidirectional LSTM


Capture long and short range dependencies in input sequence via forward and backward hidden states

SLIDE 13


OpenTag Architecture (3/4): CRF


  • Bi-LSTM captures dependency between token sequences, but not between output tags

  • Conditional Random Field (CRF) enforces tagging consistency
SLIDE 14


OpenTag Architecture (4/4): Attention


  • Focus on important hidden concepts, downweight the rest => attention!
  • Attention matrix A to attend to important BiLSTM hidden states (ht)
  • αt,tʹ ∈ A captures importance of ht w.r.t. htʹ
  • Attention-focused representation lt of token xt given by:
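The equation for lt did not survive the slide export. Assuming the standard attention-weighted sum over the BiLSTM hidden states (consistent with the αt,tʹ notation above), it would read:

```latex
l_t = \sum_{t'=1}^{n} \alpha_{t,t'} \, h_{t'}
```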
SLIDE 15


OpenTag Architecture

SLIDE 16


Experimental Discussions: Datasets

SLIDE 17


Results


Overall, OpenTag obtains high F-score of 82.8%

SLIDE 18


Results


  • Highest improvement in F-score of 5.3% over BiLSTM-CRF for product descriptions
  • However, less accurate than titles
SLIDE 19


OpenTag discovers new attribute-values not seen during training with 82.4% F-score


No overlap in attribute values between train and test splits

SLIDE 20


Interpretability via Attention

SLIDE 21


OpenTag achieves better concept clustering

(Figures: distribution of word vectors before attention vs. after attention)

SLIDE 22


Semantically related words come closer in the embedding space

SLIDE 23


Outline

  • Introduction
  • Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
  • Active Learning

SLIDE 24


Active Learning: Motivation

  • Annotating training data is expensive and time-consuming
  • Does not scale to thousands of verticals with hundreds of attributes and thousands of values in each domain

SLIDE 25


Active Learning (Settles, 2009)


  • A query selection strategy like uncertainty sampling selects the sample with the highest uncertainty for annotation
  • Ignores difficulty in estimating individual tags
SLIDE 26


Tag Flip as Query Strategy

  • Simulate a committee of OpenTag learners over multiple epochs
  • Most informative sample => major disagreement among committee members for tags of its tokens across epochs

  • Use dropout mechanism for simulating committee of learners


Tokens:  duck  ,  filet  mignon  and  ranch  raised  lamb  flavor
Epoch 1: B     O  B      E       O    B      I       E     O
Epoch 2: B     O  B      O       O    O      O       B     O

Tag flips = 4

  • Most informative sample has highest tag flips across all the epochs
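A sketch of the flip count for one sample (a hypothetical helper, not the paper's code): tags predicted in consecutive epochs are compared position by position, and the disagreements are summed. On the slide's example it reproduces the 4 flips shown above.

```python
def count_tag_flips(epoch_tags):
    """Sum, over consecutive epoch pairs, the token positions whose
    predicted tag changed; higher counts mean more committee disagreement."""
    return sum(
        sum(p != c for p, c in zip(prev, curr))
        for prev, curr in zip(epoch_tags, epoch_tags[1:])
    )

# "duck , filet mignon and ranch raised lamb flavor" tagged in two epochs
epoch1 = ["B", "O", "B", "E", "O", "B", "I", "E", "O"]
epoch2 = ["B", "O", "B", "O", "O", "O", "O", "B", "O"]
print(count_tag_flips([epoch1, epoch2]))  # → 4
```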
SLIDE 27


Tag Flip (red) better than Uncertainty Sampling (blue)

(Figures: TF vs. LC on detergent data; TF vs. LC on multi extraction)

SLIDE 28


OpenTag reduces burden of human annotation by 3.3x

(Figures: learning from scratch on detergent data; learning from scratch on multi extraction)


  • OpenTag requires only 500 training samples to obtain > 90% P-R
  • Active learning brings it down to 150 training samples to match similar performance

SLIDE 29


Production Impact

Coverage of the existing production system vs. OpenTag:

Attribute     Previous Coverage (%)   OpenTag Coverage (%)   Increase in Coverage (%)
Attribute_1   23                      78                     53
Attribute_2   21                      72                     45
Attribute_3   < 1                     56                     50
Attribute_4   < 1                     49                     48

SLIDE 30


Summary

  • OpenTag models open world assumption (OWA), multi-word and multiple attribute value extraction with sequence tagging

  • Word embeddings + Bi-LSTM + CRF + attention
  • OpenTag + Active learning reduces burden of human annotation (by 3.3x)
  • Method of tag flip as query strategy
  • Interpretability
  • Better concept clustering, attention heatmap, etc.

SLIDE 31


Summary


Thank you for your attention!

SLIDE 32


Backup Slides

SLIDE 33


Word Embedding

  • Map words co-occurring in a similar context to nearby points in embedding space

  • Pre-trained embeddings learn single representation for each word
  • But ‘duck’ as a Flavor should have different embedding than ‘duck’ as a Brand
  • OpenTag learns word embeddings conditioned on attribute-tags

SLIDE 34


Bi-directional LSTM

  • LSTMs (Hochreiter & Schmidhuber, 1997) capture long and short range dependencies between tokens, making them suitable for modeling token sequences
  • Bidirectional LSTMs improve over LSTMs by capturing both forward (ft) and backward (bt) states at each timestep t
  • Hidden state ht at each timestep is generated as ht = σ([bt, ft]), where σ combines the two states

SLIDE 35


Bi-directional LSTM

(Figure: BiLSTM unrolled over the title "ranch raised beef flavor")

  • Word index -> word embedding: GloVe embeddings, 50 dimensions (e1-e4)
  • Forward LSTM: 100 units (f1-f4)
  • Backward LSTM: 100 units (b1-b4)
  • Hidden vector: 100+100 = 200 units (h1-h4)
  • Output tags {B, I, O, E} trained with cross-entropy loss
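The dimension bookkeeping in the figure can be sanity-checked with a short sketch (random arrays stand in for the real LSTM outputs; only the shapes matter here):

```python
import numpy as np

seq_len, emb_dim, lstm_units = 4, 50, 100  # "ranch raised beef flavor", GloVe-50, 100-unit LSTMs

embeddings = np.random.randn(seq_len, emb_dim)   # e1..e4
forward = np.random.randn(seq_len, lstm_units)   # f1..f4 (stand-in for forward LSTM states)
backward = np.random.randn(seq_len, lstm_units)  # b1..b4 (stand-in for backward LSTM states)

# h_t concatenates backward and forward states: 100 + 100 = 200 units per token
hidden = np.concatenate([backward, forward], axis=1)
print(hidden.shape)  # → (4, 200)
```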

SLIDE 36


Conditional Random Fields (CRF)

  • Bi-LSTM captures dependency between token sequences, but not between output tags
  • Likelihood of a token-tag being ‘E’ (end) or ‘I’ (inside) increases if the previous token-tag was ‘I’ (inside)
  • Given an input sequence x = {x1, x2, …, xn} with tags y = {y1, y2, …, yn}, a linear-chain CRF models:
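The equation itself did not survive the slide export. A standard linear-chain CRF formulation, assuming feature functions f_k with weights θ_k over adjacent tag pairs and a partition function Z(x):

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{n}
    \exp\Big( \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big)
```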

SLIDE 37


Bi-directional LSTM + CRF

(Figure: BiLSTM + CRF unrolled over the title "ranch raised beef flavor")

  • Word index -> word embedding: GloVe embeddings, 50 dimensions
  • Forward LSTM: 100 units; Backward LSTM: 100 units
  • Hidden vector: 100+100 = 200 units
  • CRF layer on top: CRF feature space formed by Bi-LSTM hidden states

SLIDE 38


Attention Mechanism

  • Not all hidden states equally important for the CRF
  • Focus on important concepts, downweight the rest => attention!
  • Attention matrix A to attend to important BiLSTM hidden states (ht)
  • αt,tʹ ∈ A captures importance of ht w.r.t. htʹ
  • Attention-focused representation lt of token xt given by:

(Figure: attention maps hidden states h1-h4 to attention-focused representations l1-l4)

SLIDE 39


Final Classification


  • Training: maximize log-likelihood of the joint distribution
  • Inference: best possible tag sequence with the highest conditional probability

CRF feature space formed by attention-focused representation of hidden states

SLIDE 40


Uncertainty Sampling: Probability as Query Strategy

  • Select instance with maximum uncertainty
  • Best possible tag sequence from CRF:
  • Label instance with maximum uncertainty:
  • Considers the entire label sequence y, ignoring difficulty in estimating individual tags yt ∈ y
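The two equations referenced above were lost in export. The standard least-confidence formulation they most likely showed (θ denotes the model parameters):

```latex
% Best possible tag sequence from the CRF (Viterbi decoding):
y^{*} = \arg\max_{y}\; p(y \mid x;\, \theta)

% Select the instance with maximum uncertainty:
x^{*} = \arg\max_{x}\; \big( 1 - p(y^{*} \mid x;\, \theta) \big)
```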

SLIDE 41


Tag Flip as Query Strategy

  • Most informative instance has the maximum tag flips aggregated over all of its tokens across all the epochs
  • Top B samples with the highest number of flips are manually annotated with tags

Tokens:  duck  ,  filet  mignon  and  ranch  raised  lamb  flavor
Epoch 1: B     O  B      E       O    B      I       E     O
Epoch 2: B     O  B      O       O    O      O       B     O
Tag flips = 4

SLIDE 42


Multiple attribute values

  • Predicting multiple attribute values jointly
  • Modify tagging strategy to have a separate tag-set {Ba, Ia, Oa, Ea} for each attribute ‘a’
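A sketch of the expanded tag-set for joint extraction (the `B_a` suffix naming is illustrative; the slide writes Ba, Ia, Oa, Ea for attribute a):

```python
def per_attribute_tagset(attributes):
    """Expand {B, I, O, E} into one tag set per attribute for joint extraction."""
    return [f"{t}_{a}" for a in attributes for t in ("B", "I", "O", "E")]

print(per_attribute_tagset(["flavor", "brand"]))
# → ['B_flavor', 'I_flavor', 'O_flavor', 'E_flavor', 'B_brand', 'I_brand', 'O_brand', 'E_brand']
```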
