Deep Structured Output Learning for Unconstrained Text Recognition




  1. Deep Structured Output Learning for Unconstrained Text Recognition. Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman. Visual Geometry Group, Department of Engineering Science, University of Oxford, UK

  2. Text Recognition. Localized text image as input, character string as output. Example outputs: DISTRIBUTED, COSTA, DENIM, FOCAL

  3. Text Recognition. State of the art in constrained text recognition: word classification [Jaderberg, NIPS DLW 2014]; static n-gram and word language models [Bissacco, ICCV 2013]. Example: APARTMENTS

  4. Text Recognition. State of the art in constrained text recognition: word classification [Jaderberg, NIPS DLW 2014]; static n-gram and word language models [Bissacco, ICCV 2013]. But what about a random string? A new, unmodeled word?

  5. Text Recognition. Unconstrained text recognition, e.g. for house numbers [Goodfellow, ICLR 2014], business names, phone numbers, emails, etc. Random string: RGQGAN323. New, unmodeled word: TWERK

  6. Overview
  • Two models for text recognition [Jaderberg, NIPS DLW 2014]: a Character Sequence Model and a Bag-of-N-grams Model
  • Joint formulation: a CRF to construct the graph, a structured output loss, and back-propagation for joint optimization
  • Experiments: the joint model generalizes to perform zero-shot recognition, and recovers constrained-model performance when constrained

  7. Character Sequence Model. Deep CNN to encode the image, with a per-character decoder. Backbone: 5 convolutional layers and 2 fully connected layers, with ReLU and max-pooling. Feature map sizes: 32 ⨉ 100 ⨉ 1 (input) → 32 ⨉ 100 ⨉ 64 → 16 ⨉ 50 ⨉ 128 → 8 ⨉ 25 ⨉ 256 → 8 ⨉ 25 ⨉ 512 → 4 ⨉ 13 ⨉ 512 → 1 ⨉ 1 ⨉ 4096 → 1 ⨉ 1 ⨉ 4096. On top: 23 output classifiers, each over 37 classes (0-9, a-z, null). The fixed 32 ⨉ 100 input size distorts the aspect ratio.
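A minimal PyTorch sketch of this architecture. The layer widths and feature-map sizes follow the slide; the 3 ⨉ 3 kernels, padding, pooling placement, and the ceil-mode pooling needed to reach 4 ⨉ 13 are assumptions, not confirmed details of the paper's network.

```python
import torch
import torch.nn as nn

class CharSeqModel(nn.Module):
    """Sketch of the character sequence model: CNN encoder + 23 classifier heads."""
    def __init__(self, num_chars=23, num_classes=37):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 32x100 -> 16x50
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x50 -> 8x25
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),                  # 8x25x256
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),                               # 8x25 -> 4x13
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),                  # 4x13x512
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 4 * 13, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # One independent classifier per character position, 37 classes each
        self.heads = nn.ModuleList(
            nn.Linear(4096, num_classes) for _ in range(num_chars))

    def forward(self, x):
        h = self.fc(self.features(x))
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, 23, 37)
```

A batch of grayscale 32 ⨉ 100 images maps to per-position character logits, one softmax-able vector per character slot.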

  8. Character Sequence Model. Deep CNN to encode the image; per-character decoder. [Figure: the CHAR CNN maps the 32 ⨉ 100 ⨉ 1 input x through Φ(x) to 23 character outputs, each a 1 ⨉ 1 ⨉ 37 distribution P(c_i | Φ(x)).]

  9. Bag-of-N-grams Model. Represent a string by the character N-grams contained within it, e.g. for "spires": 1-grams: s, p, i, r, e; 2-grams: sp, pi, ir, re, es; 3-grams: spi, pir, ire, res; 4-grams: spir, pire, ires
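The representation above can be sketched in a few lines (the function name is hypothetical):

```python
def bag_of_ngrams(word, max_n=4):
    """Return the set of character N-grams (N = 1..max_n) contained in word."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

# The slide's example string
grams = bag_of_ngrams("spires")
```

Note the set collapses repeats: "spires" contains "s" twice but contributes it once, matching the bag-of-N-grams view on the slide.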

  10. Bag-of-N-grams Model. Deep CNN to encode the image (same backbone feature maps: 32 ⨉ 100 ⨉ 1 → 32 ⨉ 100 ⨉ 64 → 16 ⨉ 50 ⨉ 128 → 8 ⨉ 25 ⨉ 256 → 8 ⨉ 25 ⨉ 512 → 4 ⨉ 13 ⨉ 512 → 1 ⨉ 1 ⨉ 4096 → 1 ⨉ 1 ⨉ 4096), with an N-gram detection vector as output (1 ⨉ 1 ⨉ 10000). Limited set of 10k modeled N-grams, e.g. a, b, ak, ke, ra, aba, rake, raze.
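The detection vector's training target can be sketched as a multi-hot encoding over the modeled N-gram set. The tiny vocabulary below reuses the slide's example N-grams and is purely illustrative; it stands in for the actual 10k-entry list.

```python
def bag_of_ngrams(word, max_n=4):
    """Character N-grams (N = 1..4) contained in word."""
    return {word[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

# Hypothetical stand-in for the model's 10k-entry modeled N-gram set
NGRAM_VOCAB = ["a", "b", "ak", "ke", "ra", "aba", "rake", "raze"]

def ngram_target(word, vocab=NGRAM_VOCAB):
    """Multi-hot target: 1.0 where the N-gram occurs in the word, else 0.0."""
    present = bag_of_ngrams(word)
    return [1.0 if g in present else 0.0 for g in vocab]
```

For "rake", every vocabulary entry that occurs in the word ("a", "ak", "ke", "ra", "rake") lights up, and the rest stay zero.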

  11. Joint Model. Can we combine these two representations? [Figure: the CHAR CNN produces the 23 per-character distributions (1 ⨉ 1 ⨉ 37 each); the NGRAM CNN produces the 1 ⨉ 1 ⨉ 10000 N-gram detection vector.]

  12. Joint Model. [Figure: character scores f(x) from the CHAR CNN arranged as a graph over character positions, with example characters a, e, k, q, r.]

  13. Joint Model. [Figure: the graph spans the maximum number of characters; NGRAM CNN scores g(x) are combined with the CHAR CNN scores f(x).]

  14. Joint Model. Decoding selects w* = arg max_w S(w, x) via beam search, combining the CHAR CNN scores f(x) and the NGRAM CNN scores g(x).
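A toy sketch of this decoding, under stated assumptions: fixed word length, a tiny alphabet, and a score S(w, x) that simply adds per-position character scores to bonuses for modeled N-grams. The decomposition follows the slides, but the exact weighting and handling of the null class are simplifications.

```python
def word_score(word, f, g, max_n=4):
    """S(w, x): per-position character scores f plus N-gram scores g."""
    s = sum(f[i][c] for i, c in enumerate(word))
    s += sum(g.get(word[i:i + n], 0.0)
             for n in range(1, max_n + 1)
             for i in range(len(word) - n + 1))
    return s

def beam_search(f, g, alphabet, beam=3):
    """Extend each kept prefix by one character per position, keep the top-`beam`."""
    prefixes = [""]
    for _ in f:
        candidates = {p + c for p in prefixes for c in alphabet}
        prefixes = sorted(candidates, key=lambda w: -word_score(w, f, g))[:beam]
    return prefixes[0]

# Illustrative scores: 4 character positions, with "z" a strong distractor for "r"
f = [{"r": 1.0, "a": 0.2, "k": 0.1, "e": 0.1, "z": 0.9},
     {"r": 0.1, "a": 1.0, "k": 0.2, "e": 0.1, "z": 0.1},
     {"r": 0.1, "a": 0.1, "k": 1.0, "e": 0.2, "z": 0.6},
     {"r": 0.1, "a": 0.1, "k": 0.2, "e": 1.0, "z": 0.1}]
g = {"rake": 2.0}  # the N-gram model strongly supports the full word
```

Here the N-gram bonus reinforces the correct path; with `g` empty the search would fall back on the character scores alone.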

  15. Structured Output Loss. The score of the ground-truth word should be greater than or equal to the score of the highest-scoring incorrect word plus a margin: S(w_gt, x) ≥ μ + S(w, x) for all w ≠ w_gt. Enforcing this as a soft constraint leads to a hinge loss.
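As a sketch, the soft constraint turns into the standard structured hinge loss below; the margin value and the dictionary-of-scores interface are illustrative assumptions.

```python
def structured_hinge_loss(scores, gt_word, margin=1.0):
    """max(0, margin + max_{w != gt} S(w, x) - S(gt, x)).

    `scores` maps candidate words to S(w, x). The loss is zero once the
    ground-truth word beats every incorrect word by at least `margin`.
    """
    best_incorrect = max(s for w, s in scores.items() if w != gt_word)
    return max(0.0, margin + best_incorrect - scores[gt_word])
```

Because the loss is built from S(w, x), its gradient flows back through both the character and N-gram scores, which is what allows joint optimization by back-propagation.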

  16. Structured Output Loss

  17. Experiments

  18. Datasets. All models trained purely on synthetic data [Jaderberg, NIPS DLW 2014]. Generation pipeline: font rendering; border/shadow & color; composition; projective distortion; natural image blending. Realistic enough to transfer to testing on real-world images.

  19. Datasets. Synth90k: lexicon of 90k words, 9 million images, with training + test splits. Download from http://www.robots.ox.ac.uk/~vgg/data/text/

  20. Datasets. Real-world test sets: ICDAR 2003, ICDAR 2013, Street View Text, IIIT 5k-word

  21. Training. Pre-train the CHAR and NGRAM models independently, then use them to initialize the joint model and continue training jointly.

  22. Experiments: Joint Improvement. The joint model outperforms the character sequence model alone.
      Train Data   Test Data   CHAR   JOINT
      Synth90k     Synth90k    87.3   91.0
      Synth90k     IC03        85.9   89.6
      Synth90k     SVT         68.0   71.7
      Synth90k     IC13        79.5   81.8
      Example corrections (CHAR / JOINT / GT): grahaws / grahams / grahams; mediaal / medical / medical; chocoma_ / chocomel / chocomel; iustralia / australia / australia

  23. Joint Model Corrections. [Figure: examples where an edge is down-weighted in the graph and edges are up-weighted in the graph.]

  24. Experiments: Zero-Shot Recognition. There is a large difference for the CHAR model when not trained on the test words; the joint model recovers performance.
      Train Data   Test Data      CHAR   JOINT
      Synth90k     Synth90k       87.3   91.0
      Synth90k     Synth72k-90k   87.3   -
      Synth90k     Synth45k-90k   87.3   -
      Synth90k     IC03           85.9   89.6
      Synth90k     SVT            68.0   71.7
      Synth90k     IC13           79.5   81.8
      Synth1-72k   Synth72k-90k   82.4   89.7
      Synth1-45k   Synth45k-90k   80.3   89.1

  25. Experiments: Comparison. IC03, SVT, IC13: no lexicon; IC03-Full: fixed lexicon.
      Model Type            Model                    IC03   SVT    IC13   IC03-Full
      Unconstrained         Baseline (ABBYY)         -      -      -      55.0
      Unconstrained         Wang, ICCV '11           -      -      -      62.0
      Unconstrained         Bissacco, ICCV '13       -      78.0   87.6   -
      Unconstrained         Yao, CVPR '14            -      -      -      80.3
      Language Constrained  Jaderberg, ECCV '14      -      -      -      91.5
      Language Constrained  Gordo, arXiv '14         -      -      -      -
      Language Constrained  Jaderberg, NIPSDLW '14   98.6   80.7   90.8   98.6
      Unconstrained         CHAR                     85.9   68.0   79.5   96.7
      Unconstrained         JOINT                    89.6   71.7   81.8   97.0

  26. Experiments: Comparison. No-lexicon columns: IC03, SVT, IC13. Fixed-lexicon columns: IC03-Full, SVT-50, IIIT5k-50, IIIT5k-1k.
      Model Type            Model                    IC03   SVT    IC13   IC03-Full   SVT-50   IIIT5k-50   IIIT5k-1k
      Unconstrained         Baseline (ABBYY)         -      -      -      55.0        35.0     24.3        -
      Unconstrained         Wang, ICCV '11           -      -      -      62.0        57.0     -           -
      Unconstrained         Bissacco, ICCV '13       -      78.0   87.6   -           90.4     -           -
      Unconstrained         Yao, CVPR '14            -      -      -      80.3        75.9     80.2        69.3
      Language Constrained  Jaderberg, ECCV '14      -      -      -      91.5        86.1     -           -
      Language Constrained  Gordo, arXiv '14         -      -      -      -           90.7     93.3        86.6
      Language Constrained  Jaderberg, NIPSDLW '14   98.6   80.7   90.8   98.6        95.4     97.1        92.7
      Unconstrained         CHAR                     85.9   68.0   79.5   96.7        93.5     95.0        89.3
      Unconstrained         JOINT                    89.6   71.7   81.8   97.0        93.2     95.5        89.6

  27. Summary
  • Two models for text recognition
  • Joint formulation: structured output loss, back-propagation for joint optimization
  • Experiments: the joint model improves accuracy on language-based data; degrades gracefully when the input is not from a language (the N-gram model doesn't contribute much); sets a benchmark for unconstrained accuracy and competes with purely constrained models.

  28. jaderberg@google.com

