

SLIDE 1


KDD 2018

OpenTag: Open Attribute Value Extraction From Product Profiles

Guineng Zheng*, Subhabrata MukherjeeΔ, Xin Luna DongΔ, FeiFei Li*

ΔAmazon.com, *University of Utah

SLIDE 2


Motivation


"Alexa, what are the flavors of Nescafe?" — Nescafe Coffee flavors include caramel, mocha, vanilla, coconut, cappuccino, original/regular, decaf, espresso, and cafe au lait decaf.

SLIDE 3


Attribute value extraction from product profiles

(Figure: product profile with Flavor and Brand values highlighted)

SLIDE 4


Characteristics of Attribute Extraction

Open World Assumption

  • No Predefined Attribute Value
  • New Attribute Value Discovery

Limited semantics, irregular syntax

  • Most titles have 10-15 words
  • Most bullets have 5-6 words
  • Phrases, not sentences
  • Lack of regular grammatical structure in titles and bullets
  • Attribute stacking:
  • 1. Rachael Ray Nutrish Just 6 Natural Dry Dog Food, Lamb Meal & Brown Rice Recipe
  • 2. Lamb Meal is the #1 Ingredient
  • 1. beef flavor 2. lamb flavor 3. venison flavor

SLIDE 5


Prior Work and Our Contributions

Comparison dimensions: Open World Assumption; No Lexicon, No Hand-crafted Features; Active Learning

  • Ghani et al. 2003; Putthividhya et al. 2011; Ling et al. 2012; Petrovski et al. 2017
  • Huang et al. 2015; Kozareva et al. 2016
  • Kozareva et al. 2016; Lample et al. 2016; Ma et al. 2016
  • OpenTag (this work): addresses all three

SLIDE 6


Outline


  • Problem Definition
  • Models
  • Experiments
  • Active Learning
  • Experiments
SLIDE 7


Recap: Problem Statement


Input: Product Profile (Title, Description, Bullets)    Output: Extractions (Flavor, Brand)

Title: CESAR Canine Cuisine Variety Pack Filet Mignon & Porterhouse Steak Dog Food (Two 12-Count Cases)
Description: A Delectable Meaty Meal for a Small Canine. Looking for the right food … This delicious dog treat contains tender slices of meat in gravy and is formulated to meet the nutritional needs of small dogs. CESAR Canine Cuisine provides complete and balanced nutrition …

Extractions:
  • Flavor: filet mignon; porterhouse steak
  • Brand: cesar canine cuisine

Given product profiles (e.g., titles, descriptions, bullets) and a set of attributes, extract the values of those attributes from the profile text

SLIDE 8


Attribute Extraction as Sequence Tagging

INPUT x = {w1, w2, …, wn}: token sequence, e.g. w1 beef, w2 meal, w3 &, w4 ranch, w5 raised, w6 lamb, w7 recipe
OUTPUT y = {t1, t2, …, tn}: one tagging decision per token from {B, I, O, E}

  • B: Beginning of attribute value
  • I: Inside of attribute value
  • O: Outside of attribute value
  • E: End of attribute value

Tagging "beef meal & ranch raised lamb recipe" as B E O B I E O yields the Flavor extractions {beef meal} and {ranch raised lamb}.
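A minimal sketch (not the paper's code) of how a BIOE tag sequence decodes into extracted values; incomplete spans (a B with no matching E) are simply dropped here, a convention the slide does not specify:

```python
def decode_bioe(tokens, tags):
    """Turn parallel token/tag sequences into extracted value strings."""
    values, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # beginning of a value: start a fresh span
            current = [token]
        elif tag == "I" and current:    # inside: extend the open span
            current.append(token)
        elif tag == "E" and current:    # end: close the span and emit it
            current.append(token)
            values.append(" ".join(current))
            current = []
        else:                           # O (or a stray tag): discard any open span
            current = []
    return values

tokens = "beef meal & ranch raised lamb recipe".split()
tags = ["B", "E", "O", "B", "I", "E", "O"]
print(decode_bioe(tokens, tags))  # → ['beef meal', 'ranch raised lamb']
```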

SLIDE 9


Outline

  • Introduction
  • Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
  • Active Learning

SLIDE 10


OpenTag Architecture

SLIDE 11


OpenTag Architecture (1/4): Word Embedding


Map ‘beef’, ‘chicken’, ‘pork’ to nearby points in the Flavor embedding space

SLIDE 12


OpenTag Architecture (2/4): Bidirectional LSTM


Capture long and short range dependencies in input sequence via forward and backward hidden states

SLIDE 13


OpenTag Architecture (3/4): CRF


  • Bi-LSTM captures dependency between token sequences, but not between output tags

  • Conditional Random Field (CRF) enforces tagging consistency
SLIDE 14


OpenTag Architecture (4/4): Attention


  • Focus on important hidden concepts, downweight the rest => attention!
  • Attention matrix A to attend to important BiLSTM hidden states (ht)
  • αt,tʹ ∈ A captures importance of ht w.r.t. htʹ
  • Attention-focused representation lt of token xt given by:
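The equation for lt did not survive the slide export. Assuming the standard attention-weighted sum over the BiLSTM hidden states (consistent with the αt,tʹ notation above), it would read:

```latex
l_t = \sum_{t'=1}^{n} \alpha_{t,t'} \, h_{t'}
```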
SLIDE 15


OpenTag Architecture

SLIDE 16


Experimental Discussions: Datasets

SLIDE 17


Results


Overall, OpenTag obtains high F-score of 82.8%

SLIDE 18


Results


  • Highest improvement in F-score of 5.3% over BiLSTM-CRF for product descriptions
  • However, less accurate than titles
SLIDE 19


OpenTag discovers new attribute-values not seen during training with 82.4% F-score


No overlap in attribute values between train and test splits

SLIDE 20


Interpretability via Attention

SLIDE 21


OpenTag achieves better concept clustering

(Figures: distribution of word vectors before attention vs. after attention)

SLIDE 22


Semantically related words come closer in the embedding space

SLIDE 23


Outline

  • Introduction
  • Models
  • BiLSTM
  • BiLSTM + CRF
  • Attention Mechanism
  • OpenTag Architecture
  • Active Learning

SLIDE 24


Active Learning: Motivation

  • Annotating training data is expensive and time-consuming
  • Does not scale to thousands of verticals with hundreds of attributes and thousands of values in each domain

SLIDE 25


Active Learning (Settles, 2009)


  • A query selection strategy like uncertainty sampling selects the sample with the highest uncertainty for annotation
  • Ignores difficulty in estimating individual tags
SLIDE 26


Tag Flip as Query Strategy

  • Simulate a committee of OpenTag learners over multiple epochs
  • Most informative sample => major disagreement among committee members for tags of its tokens across epochs

  • Use dropout mechanism for simulating committee of learners


Tokens:  duck  ,  filet  mignon  and  ranch  raised  lamb  flavor
Epoch 1: B     O  B      E       O    B      I       E     O
Epoch 2: B     O  B      O       O    O      O       B     O

Tag flips = 4

  • Most informative sample has highest tag flips across all the epochs
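A sketch of the flip count for one sample (a hypothetical helper, not the paper's code): tags predicted in consecutive epochs are compared position by position, and the disagreements are summed. On the slide's example it reproduces the 4 flips shown above.

```python
def count_tag_flips(epoch_tags):
    """Sum, over consecutive epoch pairs, the token positions whose
    predicted tag changed; higher counts mean more committee disagreement."""
    return sum(
        sum(p != c for p, c in zip(prev, curr))
        for prev, curr in zip(epoch_tags, epoch_tags[1:])
    )

# "duck , filet mignon and ranch raised lamb flavor" tagged in two epochs
epoch1 = ["B", "O", "B", "E", "O", "B", "I", "E", "O"]
epoch2 = ["B", "O", "B", "O", "O", "O", "O", "B", "O"]
print(count_tag_flips([epoch1, epoch2]))  # → 4
```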
SLIDE 27


Tag Flip (red) better than Uncertainty Sampling (blue)

(Figures: TF vs. LC on detergent data; TF vs. LC on multi extraction)

SLIDE 28


OpenTag reduces burden of human annotation by 3.3x

(Figures: learning from scratch on detergent data; learning from scratch on multi extraction)


  • OpenTag requires only 500 training samples to obtain > 90% P-R
  • Active learning brings it down to 150 training samples to match similar performance

SLIDE 29


Production Impact

Coverage of the existing production system vs. OpenTag:

Attribute     Previous Coverage (%)   OpenTag Coverage (%)   Increase in Coverage (%)
Attribute_1   23                      78                     53
Attribute_2   21                      72                     45
Attribute_3   < 1                     56                     50
Attribute_4   < 1                     49                     48

SLIDE 30


Summary

  • OpenTag models open world assumption (OWA), multi-word and multiple attribute value extraction with sequence tagging

  • Word embeddings + Bi-LSTM + CRF + attention
  • OpenTag + Active learning reduces burden of human annotation (by 3.3x)
  • Method of tag flip as query strategy
  • Interpretability
  • Better concept clustering, attention heatmap, etc.

SLIDE 31


Summary


Thank you for your attention!

SLIDE 32


Backup Slides

SLIDE 33


Word Embedding

  • Map words co-occurring in a similar context to nearby points in embedding space

  • Pre-trained embeddings learn single representation for each word
  • But ‘duck’ as a Flavor should have different embedding than ‘duck’ as a Brand
  • OpenTag learns word embeddings conditioned on attribute-tags

SLIDE 34


Bi-directional LSTM

  • LSTMs (Hochreiter & Schmidhuber, 1997) capture long and short range dependencies between tokens, making them suitable for modeling token sequences
  • Bidirectional LSTMs improve over LSTMs by capturing both forward (ft) and backward (bt) states at each timestep t
  • Hidden state ht at each timestep is generated as ht = σ([bt, ft]), where σ combines the two states

SLIDE 35


Bi-directional LSTM

(Figure: BiLSTM unrolled over the title "ranch raised beef flavor")

  • Word index -> word embedding: GloVe embeddings, 50 dimensions (e1-e4)
  • Forward LSTM: 100 units (f1-f4)
  • Backward LSTM: 100 units (b1-b4)
  • Hidden vector: 100+100 = 200 units (h1-h4)
  • Output tags {B, I, O, E} trained with cross-entropy loss
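The dimension bookkeeping in the figure can be sanity-checked with a short sketch (random arrays stand in for the real LSTM outputs; only the shapes matter here):

```python
import numpy as np

seq_len, emb_dim, lstm_units = 4, 50, 100  # "ranch raised beef flavor", GloVe-50, 100-unit LSTMs

embeddings = np.random.randn(seq_len, emb_dim)   # e1..e4
forward = np.random.randn(seq_len, lstm_units)   # f1..f4 (stand-in for forward LSTM states)
backward = np.random.randn(seq_len, lstm_units)  # b1..b4 (stand-in for backward LSTM states)

# h_t concatenates backward and forward states: 100 + 100 = 200 units per token
hidden = np.concatenate([backward, forward], axis=1)
print(hidden.shape)  # → (4, 200)
```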

SLIDE 36


Conditional Random Fields (CRF)

  • Bi-LSTM captures dependency between token sequences, but not between output tags
  • Likelihood of a token-tag being ‘E’ (end) or ‘I’ (inside) increases if the previous token-tag was ‘I’ (inside)
  • Given an input sequence x = {x1, x2, …, xn} with tags y = {y1, y2, …, yn}, a linear-chain CRF models:
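The equation itself did not survive the slide export. A standard linear-chain CRF formulation, assuming feature functions f_k with weights θ_k over adjacent tag pairs and a partition function Z(x):

```latex
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{n}
    \exp\Big( \sum_{k} \theta_k \, f_k(y_{t-1}, y_t, x, t) \Big)
```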

SLIDE 37


Bi-directional LSTM + CRF

(Figure: BiLSTM + CRF unrolled over the title "ranch raised beef flavor")

  • Word index -> word embedding: GloVe embeddings, 50 dimensions
  • Forward LSTM: 100 units; Backward LSTM: 100 units
  • Hidden vector: 100+100 = 200 units
  • CRF layer on top: CRF feature space formed by Bi-LSTM hidden states

SLIDE 38


Attention Mechanism

  • Not all hidden states equally important for the CRF
  • Focus on important concepts, downweight the rest => attention!
  • Attention matrix A to attend to important BiLSTM hidden states (ht)
  • αt,tʹ ∈ A captures importance of ht w.r.t. htʹ
  • Attention-focused representation lt of token xt given by:

(Figure: attention maps hidden states h1-h4 to attention-focused representations l1-l4)

SLIDE 39


Final Classification


  • Training: maximize log-likelihood of the joint distribution
  • Inference: best possible tag sequence with the highest conditional probability

CRF feature space formed by attention-focused representation of hidden states

SLIDE 40


Uncertainty Sampling: Probability as Query Strategy

  • Select instance with maximum uncertainty
  • Best possible tag sequence from CRF:
  • Label instance with maximum uncertainty:
  • Considers the entire label sequence y, ignoring difficulty in estimating individual tags yt ∈ y
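The two equations referenced above were lost in export. The standard least-confidence formulation they most likely showed (θ denotes the model parameters):

```latex
% Best possible tag sequence from the CRF (Viterbi decoding):
y^{*} = \arg\max_{y}\; p(y \mid x;\, \theta)

% Select the instance with maximum uncertainty:
x^{*} = \arg\max_{x}\; \big( 1 - p(y^{*} \mid x;\, \theta) \big)
```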

SLIDE 41


Tag Flip as Query Strategy

  • Most informative instance has the maximum tag flips aggregated over all of its tokens across all the epochs
  • Top B samples with the highest number of flips are manually annotated with tags

Tokens:  duck  ,  filet  mignon  and  ranch  raised  lamb  flavor
Epoch 1: B     O  B      E       O    B      I       E     O
Epoch 2: B     O  B      O       O    O      O       B     O
Tag flips = 4

SLIDE 42


Multiple attribute values

  • Predicting multiple attribute values jointly
  • Modify tagging strategy to have a separate tag-set {Ba, Ia, Oa, Ea} for each attribute ‘a’
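A sketch of the expanded tag-set for joint extraction (the `B_a` suffix naming is illustrative; the slide writes Ba, Ia, Oa, Ea for attribute a):

```python
def per_attribute_tagset(attributes):
    """Expand {B, I, O, E} into one tag set per attribute for joint extraction."""
    return [f"{t}_{a}" for a in attributes for t in ("B", "I", "O", "E")]

print(per_attribute_tagset(["flavor", "brand"]))
# → ['B_flavor', 'I_flavor', 'O_flavor', 'E_flavor', 'B_brand', 'I_brand', 'O_brand', 'E_brand']
```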
