Learning to Read by Spelling: Towards Unsupervised Text Recognition


slide-1
SLIDE 1

Learning to Read by Spelling

Towards Unsupervised Text Recognition

Ankush Gupta Andrea Vedaldi Andrew Zisserman

Visual Geometry Group (VGG) University of Oxford ICVGIP 2018, Hyderabad

slide-2
SLIDE 2

Ankush Gupta, Andrea Vedaldi, Andrew Zisserman, Visual Geometry Group (VGG), Oxford | ICVGIP 2018, Hyderabad

Text Recognition

Imaged Text → ASCII Text, e.g. “tion of regular fits of the gout, one or more joints”

Assumes localisation is given

  • Word / line level bounding boxes
slide-3
SLIDE 3

Let’s solve this

slide-4
SLIDE 4

Text Recognition: Sequence Learning 101

[Figure: text image → ConvNet → Sequence Model (e.g. RNNs) → output characters (a .. n .. <stop>)]

Paired Image / Annotations, e.g. “tion of regular fits of the gout one or more joints part of the brain the tunica arachnoides was rated and of a pale yellow colour and with the”

Lots of paired data

slide-5
SLIDE 5

Text Recognition: Paired Data?

Manual labour

  • Expensive
  • Boring..

Synthetic Data

  • New engine for each domain
  • Complex pipelines
  • Domain gap

Jaderberg et al., NIPS DLW 2014; Gupta et al., CVPR 2016, BMVC 2018

slide-6
SLIDE 6

Can we do without paired data?

slide-7
SLIDE 7

Language is highly structured

  • There are ~8 billion random strings of length 7 (26 characters), but only 15K are valid English strings
  • The frequency of characters and words, and their co-occurrence (n-grams etc.), further constrain the output

We leverage this structure for supervision.
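The "~8 billion" figure can be checked directly. This is a minimal sketch of the arithmetic only; the 15K count of valid strings is taken from the slide as given, and the variable names are illustrative.

```python
# Count all lowercase strings of length 7 over a 26-letter alphabet and
# compare against the number of valid English strings quoted on the slide.
ALPHABET_SIZE = 26
LENGTH = 7
VALID_STRINGS = 15_000  # figure quoted on the slide, not recomputed here

total_strings = ALPHABET_SIZE ** LENGTH
valid_fraction = VALID_STRINGS / total_strings

print(f"{total_strings:,} possible strings")  # 8,031,810,176 possible strings
print(f"valid fraction: {valid_fraction:.2e}")
```

Only about two in a million random strings are valid, which is the structure the method exploits for supervision.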

slide-8
SLIDE 8

Method

slide-9
SLIDE 9

Text Recognition

2 sub-problems

  • 1. Segment text image into characters, and cluster into consistent classes
  • 2. Assign each cluster to the correct “character” label

→ Solve for a |A| x |A| permutation matrix, where A is the alphabet, e.g. the 26 English letters {a,b,c,…,z}

[Figure labels: visual, language]
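To make the second sub-problem concrete, here is a toy sketch that brute-forces the cluster-to-letter assignment for a three-symbol alphabet by counting how many decoded words fall in a lexicon. This is not the method of the talk: exhaustive search is swapped in purely for illustration and is infeasible for 26 letters (26! permutations). The function name, the toy clusters and the lexicon are all hypothetical.

```python
from itertools import permutations

def best_assignment(cluster_words, alphabet, lexicon):
    """Return the cluster-to-letter assignment that decodes the most valid words."""
    def score(letters):
        # letters[c] is the letter assigned to cluster id c
        decoded = ("".join(letters[c] for c in word) for word in cluster_words)
        return sum(w in lexicon for w in decoded)
    return max(permutations(alphabet), key=score)

# Toy data (hypothetical): cluster ids 0, 1, 2 stand for unknown letters.
cluster_words = [[2, 1, 0], [0, 1, 2], [1, 0], [0, 1, 0]]
lexicon = {"bat", "tab", "at", "tat"}

assignment = best_assignment(cluster_words, "abt", lexicon)
print(assignment)  # ('t', 'a', 'b'): cluster 0 is 't', 1 is 'a', 2 is 'b'
```

The returned tuple is one row-per-cluster view of a permutation matrix: exactly one letter per cluster and one cluster per letter.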

slide-10
SLIDE 10

Unpaired Text Recognition

Learn from unaligned text corpora and text images

  • Visual: images with <= L characters → Fully-Conv Recognition Net → softmax at each position (L x |A|) → “fake” text
  • Language: valid language strings, e.g. from WMT, NewsGroup etc. → one-hot (L x |A|) → “real” text
  • Discriminator: “real” vs. “fake” text → Adversarial Loss
  • |A| = 28: English letters {a,b,c,…,z} + space + pad
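The two inputs to the discriminator share one shape: a sequence of L distributions over the 28 symbols. The sketch below shows one way to build the one-hot "real" side and a per-position softmax for the "fake" side. The symbol ordering, the pad index and the helper names are assumptions, not details from the slides.

```python
import math
import string

ALPHABET = string.ascii_lowercase + " "  # 26 letters + space
PAD = len(ALPHABET)                      # assumed: pad is the last index (27)
NUM_SYMBOLS = len(ALPHABET) + 1          # 28, as on the slide

def one_hot_encode(text, max_len):
    """Encode text as max_len one-hot rows, padding short strings with PAD."""
    indices = [ALPHABET.index(ch) for ch in text[:max_len]]  # ValueError on other chars
    indices += [PAD] * (max_len - len(indices))
    return [[1.0 if j == i else 0.0 for j in range(NUM_SYMBOLS)] for i in indices]

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

real = one_hot_encode("to be", max_len=8)    # 8 rows of 28 values
fake_row = softmax([0.0] * NUM_SYMBOLS)      # uniform prediction at one position
print(len(real), len(real[0]), round(sum(fake_row), 6))
```

Because both sides are rows that sum to one, the discriminator can consume recognizer outputs and corpus strings through the same input layer.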

slide-11
SLIDE 11

Unpaired Text Recognition

slide-12
SLIDE 12

Pitfall: Uncorrelated Predictions

The “recognizer” can generate valid text without “recognizing”

→ Fool the discriminator without solving the task, e.g. use the “text-image” as noise → learn a generator for valid English strings

[Figure: Fully-Conv Recognition Net outputs “this is a valid English string which looks real to the discriminator”, an invalid recognition]

slide-13
SLIDE 13

Uncorrelated Predictions: Solution

  • Local recognizer: restricted receptive field (~3 characters)
  • Global discriminator: checks validity of the entire sentence
  • No reconstruction, unlike CycleGAN, because text → image is highly ambiguous

[Figure: text image “tion of regular fits of the gout one or more joints” → local recognizer → Discriminator]
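How local a fully-convolutional recognizer is can be read off its layer configuration. The sketch below uses the standard receptive-field recurrence for stacked 1-D convolutions; the layer list is hypothetical and is not the architecture from the talk.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of stacked 1-D convolutions.

    layers: list of (kernel_size, stride) tuples, first layer first.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field

# Hypothetical stack: three 3-wide convolutions, the middle one with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1)]))  # 9
```

Dividing the result by the typical character width in pixels gives the span in characters, which the slide puts at roughly three.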

slide-14
SLIDE 14

Experiments

slide-15
SLIDE 15

Synthetic Text Images

  • Fixed-width font
  • WMT Newscrawl text source (EMNLP datasets)
  • Control over nuisance factors → used for analysis
  • 100K training sequences, 1K test sequences
slide-16
SLIDE 16

Real Text Images

  • Google Books scans
  • Non-fixed width font
  • Varying word spacing due to alignment
  • “See-through” from back
  • Different case (small / capital)
  • Italics / noise / fading etc.
  • 3K training lines, 300 test lines (no overlapping pages)
  • Use Google OCR output as unaligned “text source”

slide-17
SLIDE 17

Training Strategy

Sample images and valid strings independently

  • Images → Fully-Conv Recognition Net → softmax predictions
  • Valid Language Strings → one-hot strings
  • Softmax predictions and one-hot strings → Discriminator → Adversarial Loss
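Independent sampling means a batch never pairs an image with its own transcription. A minimal sketch of such a sampler follows; the pool contents and the function name are hypothetical placeholders.

```python
import random

def sample_unpaired_batch(image_pool, string_pool, batch_size, rng=random):
    """Draw images and strings separately, so no image/label alignment is used."""
    images = rng.sample(image_pool, batch_size)
    strings = rng.sample(string_pool, batch_size)
    return images, strings

# Hypothetical pools: image identifiers and corpus lines with no correspondence.
image_pool = [f"line_{i:04d}.png" for i in range(100)]
string_pool = [f"corpus sentence {i}" for i in range(500)]

images, strings = sample_unpaired_batch(image_pool, string_pool, 4, random.Random(0))
print(len(images), len(strings))  # 4 4
```

The images would go through the recognizer and the strings through one-hot encoding before both reach the discriminator.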

slide-18
SLIDE 18

Results

slide-19
SLIDE 19

Synthetic Text Images

  • ~99% character accuracy
  • ~95% word accuracy
  • Trained on 24-length sequences → test on 3, 5, 7, 9, 11, 24, 32, 48 → generalization to other lengths

[Plot: test accuracy]
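The slides do not define how character and word accuracy are computed. One common convention, assumed here, is one minus the normalised edit distance for characters and position-wise exact match for words. The example strings are a fragment of the real-text example shown later in the talk.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def char_accuracy(prediction, truth):
    """Assumed definition: 1 - edit distance / reference length, floored at 0."""
    return max(0.0, 1.0 - edit_distance(prediction, truth) / len(truth))

def word_accuracy(prediction, truth):
    """Assumed definition: fraction of reference words matched at the same position."""
    pred_words, true_words = prediction.split(), truth.split()
    return sum(p == t for p, t in zip(pred_words, true_words)) / len(true_words)

truth = "regular fits of the gout one or more joints"
prediction = "regular fits of the got onne o more joints"
print(char_accuracy(prediction, truth), word_accuracy(prediction, truth))
```

A few character errors spread over different words lower word accuracy far more than character accuracy, which is why the two figures on the slides differ so much.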

slide-20
SLIDE 20

Real Text Images

45% word accuracy!

Why? Varying spacing / non-fixed-width font is challenging for the fully-conv. recognizer

slide-21
SLIDE 21

Real Text Images

Why? Varying spacing / non-fixed-width font is challenging for the fully-conv. recognizer

→ Let features travel using a “skip-RNN” in the last layer

[Figure labels: Conv, RNN, Skip]

slide-22
SLIDE 22

Real Text Images

85% word accuracy (vs. 45% before)! 96.2% character accuracy.

slide-23
SLIDE 23

Real Text Example

Panel labels: Ground Truth, Prediction, Image

the different forms in which this disease ap pears have rendered it necessary to divide it into regular and aregular gout in the former the attacks of which are known by the denomina tion of regular fits of the got onne o more joints of the extremities become inflamed painful and tender and frequently in an exquisite deqree a symptoneatic fever proportioned to the degree of pain and inflammation with evening exacerba tions accompand the other complaints which dis tress the patient for uncertain perions sometimes for several weeks wo he the fit goes off the piont which have been the seat of the disease are always found to have become rigid and inflexible in pro portion to the degree in which the disease has existed in themuffeequently remaining enlarged and incapable of free motion for a considerable timer cen he other hand the patient at the sas time experiences so perfect an exemption from disease as generally to lead to the opinion that the fit has occasioned the most salutary changes in the sten in ■ne■ e erd y i er in the irregular gout the affection of the joints is much less confined than in the former some e times it leaves the joints at fort tttached and fixes on some distant part■ and sometimes after harassing the patient by making a circuit in■

the different forms in which this disease ap pears have rendered it necessary to divide it into regular and irregular gout in the former the attacks of which are known by the denomina tion of regular fits of the gout one or more joint of the extremities become inflamed painful and tender and frequently in an exquisite degree a symptomatic fever proportioned to the degree of pain and inflammation with evening exacerba tions accompany the other complaints which dis tress the patient for uncertain periods sometimes for several weeks when the fit goes off the joints which have been the seat of the disease are always found to have become rigid and inflexible in pro portion to the degree in which the disease has existed in them■ frequently remaining enlarged and incapable of free motion for a considerable time on the other hand the patient at the same time experiences so perfect an exemption from disease ag generallyto leadu to the opinion that the fit has occasioned the most salutary changes in the system ■ ■■■ ■ ■i■■ in the irregular gout the affection of the joints is much less confined than in the former yiomeu timcs it leaves the joints at first attacked and fixes on some distant part■ and sometimes after harassing the patient by making a circuit in■

slide-24
SLIDE 24

Analysis

slide-25
SLIDE 25

Effect of Sequence Length on Training Convergence

Training with sequences of different lengths:

  • Short lengths 3-5: no convergence
  • Longer sequence → faster convergence: 13 > 11 > 9 > 7

slide-26
SLIDE 26

Which symbol is learnt first?

Learning order per training run (first learnt to last learnt):

  • Run 1: s i e u a o l t d h g k m y c p r n f b x w v z j q
  • Run 2: g e i s n l a u r d o h t k y m c v w x p b j f q z
  • Run 3: g i e o a s u p t d l c n r w h k y v m b z q f j x
  • Run 4: a e t s d i u l o h r n y m v w b j x f z c g p k q

[Colour bar: low to high]

  • Symbols are learnt roughly in the order of their frequency (Spearman’s rank correlation coefficient ρ = 0.80, p-value < 1e−5).
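For readers who want to reproduce this kind of check, the sketch below computes Spearman's rank correlation between two orderings of the same symbols using the no-ties formula ρ = 1 − 6Σd² / (n(n² − 1)). The learning order is the fourth ordering listed on the slide; the frequency ranking is a commonly cited approximate English letter order and is an assumption, so the result need not match the ρ = 0.80 reported, which was presumably computed against the training corpus.

```python
def spearman_rho(order_a, order_b):
    """Spearman's rho between two orderings of the same items (no ties)."""
    n = len(order_a)
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    d_squared = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))

learning_order = "aetsdiulohrnymvwbjxfzcgpkq"   # one training run, from the slide
frequency_order = "etaoinshrdlcumwfgypbvkjxqz"  # assumed ranking, most frequent first

rho = spearman_rho(learning_order, frequency_order)
print(f"rho = {rho:.2f}")
```

Identical orderings give ρ = 1 and exactly reversed orderings give ρ = −1, which makes a quick sanity check for the helper.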

slide-27
SLIDE 27

Text Corpus Ablation

  • Completely unrelated lexicon (#3): small adverse effect
  • Related lexicon (#2): no such effect

[Bar chart: char and word accuracy (axis from 88 to 100) for three text sources: WMT, WMT (No overlap), War & Peace]

slide-28
SLIDE 28

Extensions

  • Not text-image specific → apply to any input domain, as long as the output is still language

  • Examples:
  • Speech
  • Sign language / gestures
  • Lip reading
slide-29
SLIDE 29

Learning to Read by Spelling

Towards Unsupervised Text Recognition

Ankush Gupta Andrea Vedaldi Andrew Zisserman

Any Questions?

Visual Geometry Group (VGG) University of Oxford ICVGIP 2018, Hyderabad

slide-30
SLIDE 30

Which symbol is learnt first?

[Plot: character accuracy (0.0 to 1.0) against training iterations (2000 to 10000), one curve per character; legend order: a e t s d i u l o h r n y m v w b j x f z c g p k q]

slide-31
SLIDE 31

Learning Dynamics

Predictions over training iterations (earliest first):

  • tttttttttttttttttttttttttttttttttttttttttttrssssss
  • ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz
  • bcorote to whol th ticunthss tidio tiostolonzzzz
  • trougfht to ferr oy disectins it has dicomered
  • brought to view by dissection it was discovered

  • 1. First {space} / segmentation into words is learnt
  • 2. Next, common words like {to, it}
  • 3. Last, less frequent characters like {v, w}.

The last transcription also corresponds to the ground truth (punctuation is not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy).