Natural Language Processing: Historical Document Transcription



slide-1
SLIDE 1

Natural Language Processing

Historical Document Transcription

Dan Klein — UC Berkeley

Joint work with Taylor Berg-Kirkpatrick and Greg Durrett [ACL 2013]

slide-2
SLIDE 2

Historical Document

slide-3
SLIDE 3

Historical Document

Old Bailey Court Proceedings 1775

slide-4
SLIDE 4

Transcription

Document Image

slide-5
SLIDE 5

Transcription

Document Image

and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold eievén pair of than for xiirce guincas, and dclivcrcd the rcll'l:.in- d:r hack lo :11: prifuner. 1 fold ftvcn pairof filk to Mark Simpcr : nncpuir of mixcd. and. mo pair of Ifircad to lhz: foolnun, and on: pair of zhrzad to lh: barber. ' Q: What is the foolmarfs name? Fraum Mgfzr. I dun’: know. Hairy Hzrvir. l was flandingar the Camp Icr waizin far the thcrrilfs ufliceruo employ in: : Mo 3‘: daughter came for me to 0 am! take the prifoncr. 1 Wm! to |hc Old aailcy

Transcription (Google Tesseract)

slide-6
SLIDE 6

Pipelined Approach

slide-11
SLIDE 11

Pipelined Approach

m

slide-12
SLIDE 12

Pipelined Approach

m o

slide-13
SLIDE 13

Pipelined Approach

m o d

slide-14
SLIDE 14

Historical Document

slide-15
SLIDE 15

Unknown Fonts

slide-19
SLIDE 19

Unknown Fonts

long s glyph

slide-20
SLIDE 20

Wandering Baseline

slide-24
SLIDE 24

Uneven Inking

slide-28
SLIDE 28

Various Historical Documents

1725, 1875, 1823, 1883

slide-29
SLIDE 29

Our Approach

slide-33
SLIDE 33

Generative Model

p r i s o n e r

slide-37
SLIDE 37

Language Model

Generative Model

p r i s o n e r

slide-38
SLIDE 38

Typesetting Model

Generative Model

p r i s o n e r

slide-52
SLIDE 52

Generative Model

p r i s o n e r

Rendering Model

slide-55
SLIDE 55

Generative Model

p r i s o n e r

slide-56
SLIDE 56

Generative Model

Language Model (text E): P(E)

p r i s o n e r

slide-57
SLIDE 57

Generative Model

Language Model (text E): P(E)
Typesetting Model (layout T): P(T|E)

P(E) · P(T|E)

p r i s o n e r

slide-58
SLIDE 58

Generative Model

Language Model (text E): P(E)
Typesetting Model (layout T): P(T|E)
Rendering Model (image X): P(X|E, T)

P(E) · P(T|E) · P(X|E, T)

p r i s o n e r

slide-59
SLIDE 59

Generative Model

Language Model (text E): P(E)
Typesetting Model (layout T): P(T|E)
Rendering Model (image X): P(X|E, T)

P(E) · P(T|E) · P(X|E, T)
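The factorization on this slide can be read as a sum of log-probabilities. The following is a minimal sketch, assuming each factor has already been computed elsewhere; the function name `log_joint` and the toy numbers are illustrative and do not come from the talk.

```python
import math

def log_joint(log_p_e: float, log_p_t_given_e: float, log_p_x_given_e_t: float) -> float:
    """Return log P(E, T, X) = log P(E) + log P(T|E) + log P(X|E, T)."""
    return log_p_e + log_p_t_given_e + log_p_x_given_e_t

# Toy factor values, made up for illustration only.
score = log_joint(math.log(0.5), math.log(0.2), math.log(0.1))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once the factors range over every character in a line.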

slide-60
SLIDE 60

Language Model

E

slide-62
SLIDE 62

Language Model

E

Characters e_{i-1}, e_i, e_{i+1} (example: "r", "a", "t")

slide-63
SLIDE 63

Language Model

E

Characters e_{i-1}, e_i, e_{i+1} (example: "r", "a", "t")

Kneser-Ney smoothed character 6-gram
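To make the character n-gram idea concrete, here is a small sketch of a character-level n-gram scorer. It deliberately swaps in add-k smoothing for the Kneser-Ney smoothing named on the slide, since add-k fits in a few lines; the function names and the default `k` are assumptions for illustration.

```python
import math
from collections import Counter

def train_char_ngram(text: str, n: int = 6):
    """Count character n-grams and their (n-1)-character contexts."""
    padded = " " * (n - 1) + text  # pad so the first characters have a context
    ngrams = Counter(padded[i:i + n] for i in range(len(text)))
    contexts = Counter(padded[i:i + n - 1] for i in range(len(text)))
    vocab = set(padded)
    return ngrams, contexts, vocab

def log_prob(text: str, model, n: int = 6, k: float = 0.1) -> float:
    """Add-k smoothed log-probability of text (NOT Kneser-Ney; a simpler stand-in)."""
    ngrams, contexts, vocab = model
    padded = " " * (n - 1) + text
    total = 0.0
    for i in range(len(text)):
        gram = padded[i:i + n]
        context = gram[:-1]
        total += math.log((ngrams[gram] + k) / (contexts[context] + k * len(vocab)))
    return total

model = train_char_ngram("the prisoner at the bar the prisoner", n=3)
```

With this toy model, a string seen in training such as "the" scores higher than a scrambled string such as "hte", which is the signal the transcription model uses to prefer plausible text.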

slide-64
SLIDE 64

Typesetting Model

Character e_i: "a"

slide-65
SLIDE 65

Typesetting Model

T

Character e_i: "a"

slide-66
SLIDE 66

Typesetting Model

T

Character e_i: "a"

Left pad width l_i (axis: 1 to 5)

slide-67
SLIDE 67

Typesetting Model

T

Character e_i: "a"

Left pad width l_i (axis: 1 to 5)
Glyph box width g_i (axis: 1 to 30)

slide-68
SLIDE 68

Typesetting Model

T

Character e_i: "a"

Left pad width l_i (axis: 1 to 5)
Glyph box width g_i (axis: 1 to 30)
Right pad width r_i (axis: 1 to 5)

slide-71
SLIDE 71

Typesetting Model

T

Character e_i: "a"

Left pad width l_i (axis: 1 to 5)
Glyph box width g_i (axis: 1 to 30)
Right pad width r_i (axis: 1 to 5)
Vertical offset v_i

slide-72
SLIDE 72

Typesetting Model

T

Character e_i: "a"

Left pad width l_i (axis: 1 to 5)
Glyph box width g_i (axis: 1 to 30)
Right pad width r_i (axis: 1 to 5)
Vertical offset v_i
Inking level d_i
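The typesetting variables listed above can be pictured as one small record per character. The sketch below samples such a record; the width ranges follow the axes shown on the slide, while the uniform distributions, the offset and inking ranges, and all names are assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Typesetting:
    left_pad: int         # l_i, axis 1..5 on the slide
    glyph_width: int      # g_i, axis 1..30 on the slide
    right_pad: int        # r_i, axis 1..5 on the slide
    vertical_offset: int  # v_i, range is an assumption
    inking: int           # d_i, range is an assumption

    def total_width(self) -> int:
        """Horizontal space the character occupies on the line."""
        return self.left_pad + self.glyph_width + self.right_pad

def sample_typesetting(rng: random.Random, n_offsets: int = 3, n_inking: int = 3) -> Typesetting:
    # Uniform draws are placeholders; a learned model would replace them.
    return Typesetting(
        left_pad=rng.randint(1, 5),
        glyph_width=rng.randint(1, 30),
        right_pad=rng.randint(1, 5),
        vertical_offset=rng.randrange(n_offsets),
        inking=rng.randrange(n_inking),
    )

layout = sample_typesetting(random.Random(0))
```

Keeping padding separate from glyph width lets the same glyph shape sit in tighter or looser spacing without changing the shape itself.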

slide-73
SLIDE 73

Rendering Model

slide-75
SLIDE 75

Rendering Model

Glyph box

slide-76
SLIDE 76

Rendering Model

Glyph box
Glyph box width g_i

slide-77
SLIDE 77

Rendering Model

Glyph box
Glyph box width g_i
Vertical offset v_i

slide-78
SLIDE 78

Rendering Model

Glyph box
Glyph box width g_i
Vertical offset v_i
Inking level d_i

slide-79
SLIDE 79

Rendering Model

X

Glyph box
Glyph box width g_i
Vertical offset v_i
Inking level d_i

slide-80
SLIDE 80

Rendering Model

X

Glyph box
Glyph box width g_i
Vertical offset v_i
Inking level d_i
Glyph shape parameters

slide-82
SLIDE 82

Rendering Model

X

Glyph box
Glyph box width g_i
Vertical offset v_i
Inking level d_i
Glyph shape parameters
Bernoulli pixel probs

slide-83
SLIDE 83

Rendering Model

X

Glyph box
Glyph box width g_i
Vertical offset v_i
Inking level d_i
Glyph shape parameters
Bernoulli pixel probs
Sample pixels
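The last step on this slide, sampling pixels from Bernoulli probabilities, can be sketched in a few lines. This is an illustration only: the grid of probabilities is made up, and the function name is hypothetical.

```python
import random

def sample_glyph_pixels(pixel_probs, rng: random.Random):
    """Draw each pixel independently: 1 (ink) with probability p, else 0."""
    return [[1 if rng.random() < p else 0 for p in row] for row in pixel_probs]

# A 2x3 grid of made-up ink probabilities for one glyph box.
probs = [[0.0, 1.0, 0.0],
         [1.0, 1.0, 0.0]]
pixels = sample_glyph_pixels(probs, random.Random(0))
```

Because each pixel is an independent draw, the likelihood of an observed glyph box is a product over pixels, which keeps scoring a candidate glyph cheap.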

slide-98
SLIDE 98

Log-linear Interpolation

slide-99
SLIDE 99

Log-linear Interpolation

Glyph shape parameters φ
Bernoulli pixel probs θ

slide-104
SLIDE 104

Log-linear Interpolation

Bernoulli pixel probs θ
Glyph shape parameters φ
Interpolation weights α

slide-105
SLIDE 105

Log-linear Interpolation

Bernoulli pixel probs θ
Glyph shape parameters φ
Interpolation weights α
Dot product α^⊤ φ

slide-106
SLIDE 106

Log-linear Interpolation

Bernoulli pixel probs θ
Glyph shape parameters φ
Interpolation weights α
Dot product α^⊤ φ
Apply logistic

slide-110
SLIDE 110

Log-linear Interpolation

Bernoulli pixel probs θ_j
Glyph shape parameters φ
Interpolation weights α_j

slide-111
SLIDE 111

Log-linear Interpolation

Bernoulli pixel probs θ_j
Glyph shape parameters φ
Interpolation weights α_j

θ_j ∝

slide-112
SLIDE 112

Log-linear Interpolation

Bernoulli pixel probs θ_j
Glyph shape parameters φ
Interpolation weights α_j

θ_j ∝ exp[α_j^⊤ φ]
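The two steps named on these slides, a dot product followed by a logistic, can be sketched as below. For a single Bernoulli pixel, normalizing exp[α_j^⊤ φ] against the constant 1 gives the logistic function. The weight values and function names are assumptions chosen for illustration.

```python
import math

def logistic(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def pixel_probs(alpha, phi):
    """theta_j = logistic(alpha_j . phi), one weight vector alpha_j per output pixel."""
    return [logistic(sum(a * p for a, p in zip(alpha_j, phi))) for alpha_j in alpha]

# Two shape parameters interpolated onto three output pixels (made-up weights).
phi = [2.0, -2.0]
alpha = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
theta = pixel_probs(alpha, phi)
```

Changing the number of rows in `alpha` changes the number of output pixels, which is how one set of shape parameters can serve glyph boxes of different widths.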

slide-113
SLIDE 113

Learning and Inference

E

X T

slide-114
SLIDE 114

Learning and Inference

  • Learn font parameters using EM

E

X T

slide-117
SLIDE 117

Learning and Inference

  • Learn font parameters using EM
  • Initialize font parameters with mixtures of modern fonts

E

X T

slide-118
SLIDE 118

Learning and Inference

  • Learn font parameters using EM
  • Initialize font parameters with mixtures of modern fonts
  • Semi-Markov DP to compute expectations

E

X T

slide-119
SLIDE 119

Learning and Inference

  • Learn font parameters using EM
  • Initialize font parameters with mixtures of modern fonts
  • Semi-Markov DP to compute expectations
  • Efficient inference using a coarse-to-fine approach

E

X T
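The semi-Markov dynamic program mentioned in the bullets can be illustrated with a forward pass over segmentations of a line. This is a heavily simplified sketch: segments carry no character labels or language-model state, and `seg_log_score` is a hypothetical stand-in for the real per-segment score.

```python
import math

def logsumexp(values):
    """log(sum(exp(v))) computed stably."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def semi_markov_forward(width: int, max_seg: int, seg_log_score) -> float:
    """Log of the summed score over all ways to split columns [0, width)
    into consecutive segments of width 1..max_seg."""
    forward = [-math.inf] * (width + 1)
    forward[0] = 0.0  # empty prefix
    for end in range(1, width + 1):
        terms = [
            forward[end - w] + seg_log_score(end - w, end)
            for w in range(1, min(max_seg, end) + 1)
        ]
        forward[end] = logsumexp(terms)
    return forward[width]

# With a constant score of 0, the result is the log of the number of segmentations.
log_count = semi_markov_forward(4, 2, lambda start, end: 0.0)
```

A matching backward pass would give the segment posteriors that an EM E-step needs; it is omitted here to keep the sketch short.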

slide-120
SLIDE 120

System Output Example

slide-122
SLIDE 122

System Output Example

how the murderers came to

slide-127
SLIDE 127

System Output Example

slide-129
SLIDE 129

System Output Example

taken ill and taken away -- I remember

slide-133
SLIDE 133

Experiments

slide-146
SLIDE 146

Experiments

Test data

  • Old Bailey (1715-1905): 20 images, 30 lines each
  • Trove (1803-1893): 10 images, 30 lines each

Baselines

  • Google Tesseract
  • ABBYY FineReader 11

Language models

  • New York Times: 34M words, NYT Gigaword
  • Old Bailey: 32M words, manually transcribed

slide-151
SLIDE 151

Results

Old Bailey Court Proceedings (1715-1905)

Word Error Rate (chart axis: 10 to 60)

  • Google Tesseract: 54.8
  • ABBYY FineReader: 40.0
  • Ocular w/ NYT: 28.1
  • Ocular w/ OB: 24.1

[Berg-Kirkpatrick et al. 2013]
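For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the system output and the reference, divided by the reference length. The sketch below follows that common definition; the exact scoring used for the numbers above is not stated in the slides, and the example strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance / reference length, as a percentage.
    Assumes the reference contains at least one word."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete a reference word
                cur[j - 1] + 1,          # insert a hypothesis word
                prev[j - 1] + (r != h),  # match or substitute
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)

wer = word_error_rate("the prisoner at the bar", "the prifoner at bar")
```

Here one substitution and one deletion against a five-word reference give a rate of 40, on the same 0 to 100 scale as the chart.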

slide-155
SLIDE 155

Results

Trove Historical Newspapers (1803-1893)

Word Error Rate (chart axis: 10 to 60)

  • Google Tesseract: 59.3
  • ABBYY FineReader: 49.2
  • Ocular w/ NYT: 33.0

slide-156
SLIDE 156

Results

Trove Historical Newspapers (1803-1893)

Word Error Rate (chart axis: 10 to 60)

  • Google Tesseract: 59.3
  • ABBYY FineReader: 49.2
  • Ocular w/ NYT: 25.6

[Berg-Kirkpatrick et al. 2014]

slide-157
SLIDE 157

Transcription

and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold eievén pair of than for xiirce guincas, and dclivcrcd the rcll'l:.in- d:r hack lo :11: prifuner. 1 fold ftvcn pairof filk to Mark Simpcr : nncpuir of mixcd. and. mo pair of Ifircad to lhz: foolnun, and on: pair of zhrzad to lh: barber. ' Q: What is the foolmarfs name? Fraum Mgfzr. I dun’: know. Hairy Hzrvir. l was flandingar the Camp Icr waizin far the thcrrilfs ufliceruo employ in: : Mo 3‘: daughter came for me to 0 am! take the prifoncr. 1 Wm! to |hc Old aailcy

Google Tesseract

slide-158
SLIDE 158

Transcription

and Ch’: priftmer anhc bar. Jacob Lazarus and his IHP1 uh: prifoner. were both together when! rcccivcd lhczn. I fold eievén pair of than for xiirce guincas, and dclivcrcd the rcll'l:.in- d:r hack lo :11: prifuner. 1 fold ftvcn pairof filk to Mark Simpcr : nncpuir of mixcd. and. mo pair of Ifircad to lhz: foolnun, and on: pair of zhrzad to lh: barber. ' Q: What is the foolmarfs name? Fraum Mgfzr. I dun’: know. Hairy Hzrvir. l was flandingar the Camp Icr waizin far the thcrrilfs ufliceruo employ in: : Mo 3‘: daughter came for me to 0 am! take the prifoncr. 1 Wm! to |hc Old aailcy

Google Tesseract

the prisoner at the bar. Jacob Lazarus and his wife, the prisoners were both together when I received them. I sold eleven pair of them for three guineas, and delivered the remain- der back to the prisoner. I sold, seven pair of silk to Mark Simpert one pair of mixed, and two pair of thread to the footman, and one pair of thread to the barber,

  • Ms. What in the footman's name?

Franco Asyut, I don't know- Nearly Norris. I was standing at the Comp- ter waiting for the sherrill's officers to employ me a Moses's daughter came for me to go and take the prisoner. I went to the Old Bailey

Ocular

slide-159
SLIDE 159

Learned Fonts

Initializer

slide-160
SLIDE 160

Learned Fonts

Initializer

g

slide-162
SLIDE 162

Learned Fonts

Initializer

Year axis: 1700 1740 1780 1820 1860 1900

slide-163
SLIDE 163

Unobserved Pixels

slide-167
SLIDE 167

Conclusion

slide-170
SLIDE 170

Conclusion

  • Unsupervised font learning yields state-of-the-art results on documents where font is unknown
  • Generatively modeling sources of noise specific to printing-press era documents is effective
  • Ocular available as a downloadable tool: nlp.cs.berkeley.edu/ocular.shtml

slide-171
SLIDE 171

Conclusion

Thanks!