Creating a Handwriting Recognition Corpus for Bushman Languages

SLIDE 1

Creating a Handwriting Recognition Corpus for Bushman Languages

Kyle Williams and Hussein Suleman

SLIDE 2

BUSHMAN PEOPLE

  • Bushman people of Southern Africa
  • Earliest inhabitants of Earth
  • Unique view of the world
  • No living speakers of many Bushman languages

Digital Libraries Laboratory, University of Cape Town

SLIDE 3

BLEEK AND LLOYD COLLECTION

  • Collection contains notebooks, art and dictionaries
  • Bushman culture encoded in metaphorical stories
  • Preserving this collection → preserving Bushman culture

SLIDE 4

BLEEK AND LLOYD COLLECTION

SLIDE 5

BLEEK AND LLOYD COLLECTION

  • Systems for preserving and viewing the collection already exist
  • Next step involves enhancing use
  • Make text searchable
  • Index text
  • Reprint of text in books
  • Text-to-speech
  • Need a corpus of transcriptions
SLIDE 6

BUSHMAN TEXT

  • Text contains complex diacritics
  • Stacked above and below characters
  • Span multiple characters
SLIDE 7

BUSHMAN TEXT

  • Diacritics cannot be represented using Unicode
  • No one left that speaks the |xam language!
  • Over 137 different diacritics (more still being found)
SLIDE 8

ENCODING

  • Bushman text cannot be encoded using Unicode
  • LaTeX IPA package contains diacritics
  • Allows for custom macros to be created
  • Stacked, nested, multiple characters
  • \uline{a} → [rendered glyph not preserved]
  • \xbelow{\uline{a}} → [rendered glyph not preserved]
  • \xbelow{aa} → [rendered glyph not preserved]
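
The nested macro encoding above lends itself to mechanical processing. The following is a minimal Python sketch, not the authors' tooling, of how such strings could be parsed into a tree and reduced to their base characters (for example, for a diacritic-insensitive search index). The function names parse_encoding and base_text are hypothetical, and the sketch assumes every macro takes exactly one braced argument, as in the three examples on this slide.

```python
def parse_encoding(text):
    """Parse a macro-encoded string into a nested list of nodes.

    Plain characters become one-character strings. A macro such as
    \\xbelow{...} becomes a (name, children) tuple, parsed recursively.
    Assumption: every macro has exactly one braced argument.
    """
    def parse(i, inside_braces):
        nodes = []
        while i < len(text):
            ch = text[i]
            if ch == "}":
                if not inside_braces:
                    raise ValueError(f"unmatched '}}' at position {i}")
                return nodes, i + 1
            if ch == "\\":
                # Read the macro name: the run of letters after the backslash.
                j = i + 1
                while j < len(text) and text[j].isalpha():
                    j += 1
                name = text[i + 1:j]
                if not name or j >= len(text) or text[j] != "{":
                    raise ValueError(f"expected a braced macro at position {i}")
                children, i = parse(j + 1, True)
                nodes.append((name, children))
            else:
                nodes.append(ch)
                i += 1
        if inside_braces:
            raise ValueError("missing closing '}'")
        return nodes, i

    nodes, _ = parse(0, False)
    return nodes


def base_text(nodes):
    """Drop all diacritic macros and keep only the base characters."""
    return "".join(
        node if isinstance(node, str) else base_text(node[1]) for node in nodes
    )


tree = parse_encoding(r"\xbelow{\uline{a}}")
print(tree)             # nested (macro, children) tuples
print(base_text(tree))  # base characters only
```

Keeping the macro structure as a tree, rather than stripping macros with a regular expression, means stacked and multi-character diacritics survive a round trip and can still be inspected later.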
SLIDE 9

ENCODING

SLIDE 10

XÒÄ'XÒÄ - “TO WRITE”

  • An AJAX tool to create a Bushman corpus
  • Automatic algorithms
  • User input
  • Preprocessing
  • Line and word segmentation
  • Transcription
  • Job and user management
SLIDE 11

TEXT SELECTION

SLIDE 12

LINE SEGMENTATION

  • Projection profile-based line segmentation
  • Count foreground-background transitions for each row
  • Minima suggest space between lines
  • A minimum could also represent the space between a base character and its diacritics
  • Gaussian smoothing of the projection profile suppresses such minor minima
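
The steps on this slide can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the tool's actual implementation: the page is assumed to be a binarised image given as a list of rows of 0/1 values, the function names are hypothetical, and the smoothing width sigma is a free parameter that the slides do not specify.

```python
import math


def transition_profile(image):
    """Count foreground/background transitions in each row of a 0/1 image."""
    return [sum(1 for a, b in zip(row, row[1:]) if a != b) for row in image]


def gaussian_smooth(values, sigma=2.0):
    """Smooth a 1-D profile with a normalised Gaussian kernel (edges clamped)."""
    radius = max(1, int(3 * sigma))
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma))
              for k in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the borders
            acc += w * values[j]
        smoothed.append(acc)
    return smoothed


def local_minima(values):
    """Return indices of interior local minima; a flat minimum yields its centre."""
    minima = []
    n = len(values)
    i = 1
    while i < n - 1:
        if values[i] < values[i - 1]:
            j = i
            while j + 1 < n and values[j + 1] == values[j]:
                j += 1
            if j + 1 < n and values[j + 1] > values[j]:
                minima.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return minima


def segment_lines(image, sigma=2.0):
    """Candidate cut rows between text lines: minima of the smoothed profile."""
    return local_minima(gaussian_smooth(transition_profile(image), sigma))


# Two synthetic "text lines" separated by a blank gap.
text_row = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
blank_row = [0] * 10
page = [blank_row] * 2 + [text_row] * 5 + [blank_row] * 6 + [text_row] * 5 + [blank_row] * 2
print(segment_lines(page, sigma=1.0))
```

A larger sigma merges shallow dips, such as the gap between a base character and a stacked diacritic, into the surrounding line, at the cost of possibly missing tightly spaced lines; the right value would have to be tuned on real notebook pages.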
SLIDE 13

LINE SEGMENTATION

SLIDE 14

LINE SEGMENTATION

SLIDE 15

WORD SEGMENTATION

  • Line slant is automatically corrected
  • Connected components in text lines are identified
  • Distances between adjacent components are calculated
  • Distances above a threshold separate words
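
The component-and-gap approach above can be sketched as follows. This is a hedged illustration rather than the tool's code: the text line is assumed to be an already slant-corrected binary image (a list of rows of 0/1 values), the function names are hypothetical, 8-connectivity is an assumption, and gap_threshold is a parameter whose real value the slides do not give.

```python
from collections import deque


def component_spans(line):
    """Return the sorted (left, right) column span of each 8-connected component."""
    rows, cols = len(line), len(line[0])
    seen = [[False] * cols for _ in range(rows)]
    spans = []
    for r in range(rows):
        for c in range(cols):
            if line[r][c] and not seen[r][c]:
                # Breadth-first flood fill over one component.
                left = right = c
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    left, right = min(left, x), max(right, x)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and line[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                spans.append((left, right))
    return sorted(spans)


def segment_words(line, gap_threshold):
    """Merge components into words; a gap wider than the threshold starts a new word."""
    words = []
    for left, right in component_spans(line):
        # Gap = number of blank columns between this component and the current word.
        if words and left - words[-1][1] - 1 <= gap_threshold:
            words[-1] = (words[-1][0], max(words[-1][1], right))
        else:
            words.append((left, right))
    return words


pattern = ["#..#......#.#",
           "#..#......#.#",
           "#..#......#.#"]
line = [[1 if ch == "#" else 0 for ch in row] for row in pattern]
print(segment_words(line, gap_threshold=3))
```

Because gaps are measured between horizontal spans, a diacritic that sits above or below its base character overlaps it horizontally and is merged into the same word rather than being split off.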
SLIDE 16

WORD SEGMENTATION

SLIDE 17

TRANSCRIPTION

SLIDE 18

CORPUS CREATION WORKSHOPS

  • Workshop held to create Bushman corpus
  • 29 data capturers recruited
  • 900 pages from 2 authors randomly selected
  • 729 pages were segmented into lines and words
  • 1547 text lines were transcribed
  • 452 text lines could not be transcribed
  • Reasons: characters not supported by the interface, noise, English text
SLIDE 19

CORPUS CREATION WORKSHOPS

  • Quality and efficiency of data capturers evaluated
  • 5 data capturers asked to return
  • 1700 more lines transcribed
  • More efficient and potentially fewer errors
SLIDE 20

CORPUS CREATION WORKSHOP

SLIDE 21

USER CONTRIBUTIONS

SLIDE 22

DATA QUALITY

  • Quality represented by accuracy and correctness of transcriptions
  • Useful in planning for follow-on workshops
  • Random transcriptions by each user evaluated by a research assistant
  • Wrong diacritics, characters, etc.
  • Average of 0.48 errors per text line
  • Acceptable for lay persons?
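
An errors-per-line figure like the one above can be computed from the evaluated samples as in the sketch below. The slides do not say whether the 0.48 average was pooled over all evaluated lines or averaged per data capturer, so this sketch computes both; the function names and the sample numbers are hypothetical and are not data from the workshop.

```python
def error_rates(samples):
    """Errors per evaluated text line for each data capturer.

    samples maps a capturer id to (lines_evaluated, errors_found).
    Capturers with no evaluated lines are skipped.
    """
    return {user: errors / lines
            for user, (lines, errors) in samples.items() if lines}


def pooled_error_rate(samples):
    """Errors per line pooled over every evaluated line."""
    total_lines = sum(lines for lines, _ in samples.values())
    total_errors = sum(errors for _, errors in samples.values())
    return total_errors / total_lines if total_lines else 0.0


def mean_user_error_rate(samples):
    """Unweighted mean of the per-capturer error rates."""
    rates = list(error_rates(samples).values())
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical evaluation results, for illustration only.
samples = {"capturer_a": (10, 4), "capturer_b": (20, 12)}
print(error_rates(samples))
print(pooled_error_rate(samples), mean_user_error_rate(samples))
```

The two averages differ when capturers were evaluated on different numbers of lines, which matters when per-capturer rates are used to decide who to invite back to a follow-on workshop.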
SLIDE 23

EFFICIENCY VS QUALITY

SLIDE 24

CONCLUSIONS

  • Creation of corpora for historical texts is often difficult due to the complexities of the script
  • Semi-automatic tool allowed for more efficient and less expensive creation of the corpus
  • Currently being used in a handwriting recognition study
  • Applicable to other historical collections
SLIDE 25

THANK YOU

Questions?
