Creating a Handwriting Recognition Corpus for Bushman Languages

Sep 06, 2023

SLIDE 1

Creating a Handwriting Recognition Corpus for Bushman Languages

Kyle Williams and Hussein Suleman

SLIDE 2

BUSHMAN PEOPLE

Bushman people of Southern Africa
Earliest inhabitants of Earth
Unique view of the world
No living speakers of many Bushman languages

Digital Libraries Laboratory, University of Cape Town

SLIDE 3

BLEEK AND LLOYD COLLECTION

Collection contains notebooks, art and dictionaries
Bushman culture encoded in metaphorical stories
Preserving this collection → preserving Bushman culture

SLIDE 4

BLEEK AND LLOYD COLLECTION

SLIDE 5

BLEEK AND LLOYD COLLECTION

Already have systems for preserving and viewing the collection
Next step involves enhancing use
Make text searchable
Index text
Reprint of text in books
Text-to-speech
Need a corpus of transcriptions

SLIDE 6

BUSHMAN TEXT

Text contains complex diacritics
Stacked above and below characters
Span multiple characters

SLIDE 7

BUSHMAN TEXT

Diacritics cannot be represented using Unicode
No one left that speaks the |xam language!
Over 137 different diacritics (more still being found)

SLIDE 8

ENCODING

Bushman text cannot be encoded using Unicode
LaTeX IPA package contains diacritics
Allows for custom macros to be created
Stacked, nested, multiple characters
\uline{a} →
\xbelow{\uline{a}} →
\xbelow{aa} →
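
The nested encodings above can be read as a small tree: each macro wraps the characters (or other macros) it applies to. The following is a minimal Python sketch of how such strings might be parsed for later processing. Only the macro names \uline and \xbelow come from the slide; the function names and the output shape (tuples of macro name and children) are assumptions made for illustration, not the authors' tooling.

```python
def parse_encoding(text):
    """Parse a macro-encoded string into a nested list.

    Plain characters become one-character strings; a macro call
    becomes a tuple (macro_name, [children]).
    """
    nodes, pos = _parse_sequence(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected '}}' at position {pos}")
    return nodes


def _parse_sequence(text, pos):
    """Parse items until a closing brace or the end of the string."""
    nodes = []
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "\\":
            # Read the macro name: the letters following the backslash.
            end = pos + 1
            while end < len(text) and text[end].isalpha():
                end += 1
            name = text[pos + 1:end]
            if not name or end >= len(text) or text[end] != "{":
                raise ValueError(f"malformed macro at position {pos}")
            # Recurse into the braces so nested macros become nested tuples.
            children, close = _parse_sequence(text, end + 1)
            if close >= len(text) or text[close] != "}":
                raise ValueError(f"missing '}}' for macro {name}")
            nodes.append((name, children))
            pos = close + 1
        else:
            nodes.append(text[pos])
            pos += 1
    return nodes, pos


if __name__ == "__main__":
    for sample in (r"\uline{a}", r"\xbelow{\uline{a}}", r"\xbelow{aa}"):
        print(sample, parse_encoding(sample))
```

Keeping the structure explicit makes it possible to distinguish a diacritic stacked on another diacritic from one spanning several base characters.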

SLIDE 9

ENCODING

SLIDE 10

XÒÄ'XÒÄ - “TO WRITE”

An AJAX tool to create a Bushman corpus
Automatic algorithms
User input
Preprocessing
Line and word segmentation
Transcription
Job and user management

SLIDE 11

TEXT SELECTION

SLIDE 12

LINE SEGMENTATION

Projection profile-based line segmentation
Count foreground-background transitions for each row
Minima suggest space between lines
Could represent space between base character and diacritics
Gaussian smoothing of projection profile
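
The steps on this slide can be sketched roughly as follows, assuming a binarised page stored as a list of rows with 1 for ink and 0 for background. The function names, the default sigma and the border handling are assumptions made for illustration; the slides do not give the tool's actual parameters. Smoothing is what suppresses the shallow minima between a base character and its diacritics, so a larger sigma merges more of them at the cost of precision.

```python
import math


def transition_profile(image):
    """Count foreground/background transitions along each row."""
    return [
        sum(1 for left, right in zip(row, row[1:]) if left != right)
        for row in image
    ]


def gaussian_smooth(values, sigma=2.0):
    """Smooth a 1-D profile with a normalised Gaussian kernel."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for offset, weight in zip(range(-radius, radius + 1), kernel):
            # Clamp indices so the borders reuse the edge values.
            j = min(max(i + offset, 0), last)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def find_line_boundaries(image, sigma=2.0):
    """Return row indices at local minima of the smoothed profile."""
    profile = gaussian_smooth(transition_profile(image), sigma)
    boundaries = []
    i = 1
    while i < len(profile) - 1:
        if profile[i] < profile[i - 1]:
            # Walk across a flat valley floor, then take its middle row.
            j = i
            while j < len(profile) - 1 and profile[j + 1] == profile[i]:
                j += 1
            if j < len(profile) - 1 and profile[j + 1] > profile[i]:
                boundaries.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return boundaries
```

Each returned row index is a candidate cut between two text lines; in a semi-automatic tool these candidates would still be open to correction by the user.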

SLIDE 13

LINE SEGMENTATION

SLIDE 14

LINE SEGMENTATION

SLIDE 15

WORD SEGMENTATION

Line slant is automatically corrected
Connected components in text lines are identified
Distances between adjacent components are calculated
Distances above threshold separate words
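
A rough sketch of the component-and-gap idea, assuming a single text line that has already been slant-corrected and binarised (1 for ink, 0 for background). The function names, the use of 8-connectivity and the way the gap is measured (background columns between horizontal extents) are assumptions for illustration; the slides do not specify these details or the threshold value.

```python
from collections import deque


def connected_components(image):
    """Return (left, right) column extents of 8-connected ink blobs."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    extents = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            left = right = x
            while queue:
                cy, cx = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            extents.append((left, right))
    return extents


def segment_words(image, gap_threshold):
    """Group components into words; return (left, right) per word."""
    words = []
    for left, right in sorted(connected_components(image)):
        # Gap = background columns between this blob and the current word.
        if words and left - words[-1][1] - 1 <= gap_threshold:
            words[-1] = (words[-1][0], max(words[-1][1], right))
        else:
            words.append((left, right))
    return words
```

Because the gap is measured against the running right edge of the current word, a diacritic that sits above or below a base character overlaps it horizontally and is merged into the same word rather than split off.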

SLIDE 16

WORD SEGMENTATION

SLIDE 17

TRANSCRIPTION

SLIDE 18

CORPUS CREATION WORKSHOPS

Workshop held to create Bushman corpus
29 data capturers recruited
900 pages from 2 authors randomly selected
729 pages were segmented into lines and words
1547 text lines were transcribed
452 text lines could not be transcribed
Interface didn't support characters, noise, English

SLIDE 19

CORPUS CREATION WORKSHOPS

Quality and efficiency of data capturers evaluated
5 data capturers asked to return
1700 more lines transcribed
More efficient and potentially fewer errors

SLIDE 20

CORPUS CREATION WORKSHOP

SLIDE 21

USER CONTRIBUTIONS

SLIDE 22

DATA QUALITY

Quality represented by accuracy and correctness of transcriptions
Useful in planning for follow-on workshops
Random transcriptions by each user evaluated by research assistant
Wrong diacritics, characters, etc.
Average of 0.48 errors per text line
Acceptable for lay persons?

SLIDE 23

EFFICIENCY VS QUALITY

SLIDE 24

CONCLUSIONS

Creation of corpora for historical texts is often difficult due to complexities of script
Semi-automatic tool allowed for more efficient and less expensive creation of corpus
Currently being used in handwriting recognition study
Applicable to other historical collections

SLIDE 25

THANK YOU

Questions?