
SLIDE 1

Speech Processing 15-492/18-495

Multilinguality

SLIDE 2

Dealing with all Languages

• Over 6000 languages
  – Maybe not all commercially interesting … now
• Major languages (economic)
  – Cell phone manufacturers list 46 languages
  – But even those are not all covered

SLIDE 3

What you need

• ASR
  – Acoustic model (lots of speakers)
  – Pronunciation lexicon
  – Language model
• TTS
  – Acoustic model (one speaker)
  – Pronunciation lexicon
  – Text analysis

SLIDE 4

Writing Systems

• Romanized writing systems
  – Latin-1 (ISO-8859-1)
  – Covers many Western European languages
• Cyrillic
  – Covers many Eastern European languages
• Arabic scripts
  – Arabic(s), Farsi, Urdu, etc.
• Devanagari
  – Covers many Northern Indian languages
• Chinese Hanzi
  – Covers some Chinese dialects, but in different versions
• Many other scripts, some non-standard

SLIDE 5

Writing Systems

• Letter based
  – Latin, Cyrillic
• Consonant based
  – Arabic, Hebrew
• Mora based
  – Half syllable or syllable
  – Indian scripts, Japanese native scripts
• Syllable based
  – Hangul, Chinese

SLIDE 6

Standards

• Writing standards
  – Taught at schools, newspapers, computer support
  – Typically standardized spelling
• May be mostly spoken
  – Occasionally written

SLIDE 7

Language Specific Issues

• No explicit markings
  – Stress, accent, tones
• No word boundaries
  – Chinese, Thai
• No (short) vowels
  – Arabic, Hebrew
• Rich morphology
  – Many different words in the language
  – Finnish, Turkish, Greenlandic
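
For languages written without word boundaries, text has to be split into words before a lexicon or language model can be applied. The following is a minimal sketch of greedy longest-match segmentation, assuming a word list is already available. The function name `segment` and the toy vocabulary are illustrative only, and an unspaced English string stands in for Chinese or Thai text.

```python
def segment(text, vocab, max_len=None):
    """Greedy longest-match (maximum matching) segmentation.

    At each position take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches.
    """
    if max_len is None:
        max_len = max((len(w) for w in vocab), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it on its own.
            words.append(text[i])
            i += 1
    return words


# Toy vocabulary (hypothetical).
VOCAB = {"when", "is", "the", "next", "bus", "station", "to"}
print(segment("whenisthenextbus", VOCAB))
```

Greedy matching is only a baseline: it needs a lexicon to begin with and can pick the wrong split when a long word happens to cover two shorter ones.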

SLIDE 8

Genre Specific Issues

• No capitals, punctuation
• Unpunctuated
• Plain vs polite form
• Speech vs text form
• Many foreign phrases
  – (technology-directed genres)
• Many new abbreviations
  – E.g. SMS messages

SLIDE 9

Character Encoding

• Unicode vs UTF-8 vs Latin
  – Documents mix them
• Sometimes accents are omitted
  – For ease of typing
• Lots of standards
  – Unicode, EUC, Big5, TIS-620, …
  – Everyone has their own standard
  – Some create their own standards
• Mixed character sets
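
Mixed encodings and omitted accents both show up when collecting text for a new language. The sketch below shows one common way to cope, assuming the only two candidates are UTF-8 and Latin-1. The helper names `decode_mixed` and `strip_accents` are made up for this example; real collections usually need a wider set of encodings.

```python
import unicodedata


def decode_mixed(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # but it may be the wrong guess for other legacy encodings.
        return raw.decode("latin-1")


def strip_accents(text: str) -> str:
    """Remove combining accents so accented and unaccented spellings match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


utf8_line = "café".encode("utf-8")
latin1_line = "café".encode("latin-1")
print(decode_mixed(utf8_line) == decode_mixed(latin1_line))
print(strip_accents("café"))
```

Stripping accents is useful for matching words typed without them, but it throws away distinctions that matter for pronunciation, so it is better suited to lookup than to building the lexicon itself.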

SLIDE 10

Phoneme Sets

• Hard to find consensus for new languages
  – Typically lots of different dialects
• What level of distinction?
  – Some sets are good for speech but not really phonetic
  – /t/ vs /dx/ in “water”
• Often doesn’t include foreign phones
  – /w/ in German is common for younger people

SLIDE 11

Words

• May be hard to define
  – No word boundaries
• Rich morphology
  – Words have many variations of compounds
  – Yomenakatta -> could not read
  – Yomemasendeshita -> could not read (polite)
• Gender specific speech
  – Boku vs atashi
• Language mixtures

SLIDE 12

Pronunciation lexicons

• “Proper” speech vs “actual” speech
• Hard to generalize
  – Chinese
• Cross lingual pronunciations
  – “Human” (English/German)

SLIDE 13

“Industry” way

• Collect at least 300 hours of speech
  – At least 20 different speakers
  – Mixture of gender, age, etc.
  – Through desired channel (phone/desktop)
• Collect at least 5 hours from one speaker
  – High quality recording studio
  – Data should be targeted to application
• Build pronunciation lexicon
  – Expert phonologist

SLIDE 14

Industry way

• Probably 3-6 months
  – Lead developer
  – Local language expert
  – Lots of human transcribers
• Costs?
  – Many hundreds of thousands

SLIDE 15

Or cheaper (?) …

• Find existing data
  – Linguistic Data Consortium (UPenn)
  – ELRA (European equivalent)
  – Appen, Australia
• Find local people who have collected data
• Found data might be in the wrong format
  – Data cleaning is often the most expensive part

SLIDE 16

Standardized Datasets

• GlobalPhone
  – 20+ languages, for ASR/TTS
• LDC/DARPA/IARPA sets
  – Mostly English, Arabic and Chinese
• BABEL dataset
  – 35 low resource languages (telephone conversations)
• Librivox
  – Audio books
• Voxforge
  – Open source collected languages
• Mozilla
  – Open source multilingual sets

SLIDE 17

CMU Wilderness Dataset

• 500+ languages
  – 20 hours aligned for each language
  – Single speaker
  – Mined from read audio books (Bible)
  – 20+ languages, for ASR/TTS

SLIDE 18

Actual way

• Often mixture
  – Found data for initial model
  – Collect data with actual/initial application

SLIDE 19

Multilingual Systems

• Support lots of different languages
  – Press 1 for Spanish
  – Press 2 for Gujarati …
• Automatically detect language
• Mixed language

SLIDE 20

Multilingual (Menu)

• Speak in your language
  – Eki-mai no tsugi no bus no ha?
  – When is the next bus to the station?
• Need multiple recognizers
  – Run in parallel and take best result
• Or shared acoustic models
  – Recognizing both languages at once (mix)
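
The "run in parallel and take best result" option can be sketched as follows. This is an illustration only: the recognizers are stand-in functions invented for the example, each assumed to return a hypothesis and a score, and the sketch assumes the scores are comparable across languages, which in practice takes some normalization.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_best(audio, recognizers):
    """Run every recognizer on the same audio and keep the best result.

    `recognizers` maps a language name to a callable that returns
    (text, score); a higher score means a more confident hypothesis.
    """
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        futures = {lang: pool.submit(rec, audio) for lang, rec in recognizers.items()}
        results = {lang: fut.result() for lang, fut in futures.items()}
    best_lang = max(results, key=lambda lang: results[lang][1])
    text, score = results[best_lang]
    return best_lang, text, score


# Stand-in recognizers (hypothetical); a real system would decode the audio.
def fake_english(audio):
    return ("when is the next bus", -42.0)


def fake_japanese(audio):
    return ("tsugi no basu", -57.5)


print(recognize_best(b"...", {"en": fake_english, "ja": fake_japanese}))
```

The cost grows with every language added, which is one reason the shared acoustic model alternative is attractive.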

SLIDE 21

Multilingual (in line)

• Code switching
  – Europe, India, bilingual areas
  – Hinglish, Spanglish
• Borrowed words and phrases
  – Dad, time kyu hua hai
  – One lakh
  – Computer walla
  – Numbers
• Can be inflected
  – Was updated -> up gedaten

SLIDE 22

Speech Processing 11-492/18-492

Multilinguality SPICE: making it easier

SLIDE 23

Dealing with all Languages

• Over 6000 languages
  – Maybe not all commercially interesting … now
• Major languages (economic)
  – Cell phone manufacturers list 46 languages
  – But even those are not all covered

SLIDE 24

Motivation

• Computerization: speech is a key technology
  – Mobile devices, ubiquitous information access
• Globalization: multilinguality
  – More than 6000 languages in the world
  – Multiple official languages
    – Europe has 20+ official languages
    – South Africa has 11 official languages
• Speech processing in multiple languages
  – Cross-cultural human-human interaction
  – Human-machine interface in mother tongue

SLIDE 25

Challenges

• Algorithms are language independent but require data
  – Dozens of hours of audio recordings and corresponding transcriptions
  – Pronunciation dictionaries for large vocabularies (>100,000 words)
  – Written text corpora of millions of words in the domains in question
  – Bilingual aligned text corpora
• BUT: such data is only available in very few languages
  – Audio data: 40 languages; transcriptions take up to 40x real time
  – Large vocabulary pronunciation dictionaries: 20 languages
  – Small text corpora: 100 languages; large corpora: 30 languages
  – Bilingual corpora in very few language pairs, pivot mostly English
• Additional complications:
  – Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  – Few native speakers at hand for minority (endangered) languages
  – Languages without writing systems

SLIDE 26

Solution: Learning Systems

• Systems that learn a language from the user
• Efficient learning algorithms for speech processing
  – Learning:
    – Interactive learning with user in the loop
    – Statistical modeling approaches
  – Efficiency:
    – Reduce amount of data (save time and costs): by a factor of 10
    – Speed up development cycles: days rather than months
• Rapid language adaptation from universal models
• Bridge the gap between language and technology experts
  – Technology experts do not speak all languages in question
  – Native users are not in control of the technology

SLIDE 27

Sharing data between modules

[Figure: speech-to-speech translation between a source language Ls and a target language Lt. Each direction uses an acoustic model (AM), a pronunciation dictionary (Dict: word -> phone sequence), a language model (LM: n-grams), and a translation lexicon (Lex: word s -> word t). The same dictionaries and language models appear in several modules.]

SLIDE 28

SPICE

Speech Processing: Interactive Creation and Evaluation toolkit

• National Science Foundation grant, 10/2004, 3 years
• Principal Investigators: Tanja Schultz and Alan Black
• Bridge the gap between technology experts and language experts
  – Automatic Speech Recognition (ASR)
  – Machine Translation (MT)
  – Text-to-Speech (TTS)
• Develop web-based intelligent systems
  – Interactive learning with user in the loop
  – Rapid adaptation of universal models to unseen languages
• SPICE webpage: http://cmuspice.org

SLIDE 29

Spice Project Page

SLIDE 30

Speech Processing Systems

[Figure: system overview. Input: speech -> AM, Lex, LM -> NLP / MT -> TTS -> Output: speech & text (example: "Hello"). Resources: phone set & speech data; pronunciation rules, e.g. hi /h/ /ai/, you /j/ /u/, we /w/ /i/; text data, e.g. "hi you", "you are", "I am".]

SLIDE 31

Rapid Portability: Data

[Figure: the system overview diagram (Input: speech -> AM, Lex, LM -> NLP / MT -> TTS -> Output: speech & text), highlighting the phone set and speech data.]

SLIDE 32

Finding “Nice” Prompts

• From very large text databases
• Find “nice” sentences:
  – Containing only high frequency words
  – 5-15 words
• Find grapheme/phoneme balanced set
  – Select sentences with best triphone/graph
• 500-1000 sentences
• Collect for ASR and TTS acoustic modeling
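
A rough sketch of this selection procedure is shown below. It assumes whitespace-tokenized text and uses letter trigrams as a stand-in for triphones, since no pronunciation lexicon may exist yet for a new language. The function name `select_prompts` and its parameters are illustrative, not the SPICE implementation.

```python
from collections import Counter


def select_prompts(sentences, n_prompts, top_words=1000, min_len=5, max_len=15):
    """Pick 'nice' sentences, then greedily maximize letter-trigram coverage."""
    tokenized = [s.lower().split() for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks)
    common = {w for w, _ in freq.most_common(top_words)}

    # Keep sentences of min_len..max_len words made only of frequent words.
    nice = [
        " ".join(toks) for toks in tokenized
        if min_len <= len(toks) <= max_len and all(w in common for w in toks)
    ]

    def trigrams(sentence):
        return {sentence[i:i + 3] for i in range(len(sentence) - 2)}

    covered, chosen = set(), []
    candidates = {s: trigrams(s) for s in nice}
    while candidates and len(chosen) < n_prompts:
        # Take the sentence that adds the most unseen trigrams.
        best = max(candidates, key=lambda s: len(candidates[s] - covered))
        if not candidates[best] - covered:
            break  # nothing new left to cover
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen
```

With a real corpus one would call it with `n_prompts` somewhere in the 500-1000 range and then have a native speaker review the result.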

SLIDE 33

Prompt Selection Issues

• Need good text
  – De-htmlify, well-written, no misspellings
• Need word segmentation
  – Japanese, Chinese, Thai
• Natural text is often mixed language
  – Hindi newspaper text has lots of English words
• Automatic selection has errors
  – Need speaker to do further selection
  – E.g. lots of telephone numbers, formatting commands
• CMU Arctic used similar methods

SLIDE 34

Recording Prompts

SLIDE 35

GlobalPhone

Multilingual Database
• Widespread languages
• Native speakers
• Uniform data
• Broad domain
• Large text resources
  – Internet, newspaper

Corpus
• 19 languages … counting
• 1800 native speakers
• 400 hrs audio data
• Read speech
• Filled pauses annotated

Languages: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, Turkish + Thai + Creole + Polish + Bulgarian + ... ???

Now available from ELRA!

SLIDE 36

Speech Recognition in 17 Languages

SLIDE 37

Rapid Portability: Acoustic Models

[Figure: the system overview diagram (Input: speech -> AM, Lex, LM -> NLP / MT -> TTS -> Output: speech & text), highlighting the phone set, speech data, and the AM.]

SLIDE 38

Universal Sound Inventory

Speech production is independent of language -> IPA

1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
   • Reduction from 485 to 162 sound classes
   • m, n, s, l appear in all 12 languages
   • p, b, t, d, k, g, f and i, u, e, a, o appear in almost all

Problem: the context of sounds is language specific. How to get context dependent models for new languages?

Solution:
1) Multilingual decision context trees
2) Specialize decision tree by adaptation

[Figure: decision tree for contexts of /k/ in the German words Blaukraut, Brautkleid, Brotkorb, Weinkarte. The questions "1=Plosiv?" and "+2=Vokal?" (answers N/J) split the model k (0) into k (1) and k (2).]
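
The data sharing idea can be illustrated with a toy example: phones that carry the same IPA symbol in several languages are merged into one sound class, and that class is trained on the pooled data. The inventories below are hypothetical and heavily truncated; they are not the 12-language setup from the slide.

```python
from collections import Counter

# Toy phone inventories in IPA symbols (hypothetical, heavily truncated).
INVENTORIES = {
    "german": {"m", "n", "s", "l", "p", "t", "k", "a", "i", "u", "ç"},
    "spanish": {"m", "n", "s", "l", "p", "t", "k", "a", "i", "u", "ɲ", "ɾ"},
    "japanese": {"m", "n", "s", "p", "t", "k", "a", "i", "ɯ", "ɾ"},
}


def universal_inventory(inventories):
    """Merge per-language phone sets into one shared set of sound classes.

    Returns (shared_classes, counts), where counts[phone] is the number of
    languages whose data would be pooled to train that class.
    """
    counts = Counter(p for phones in inventories.values() for p in phones)
    return set(counts), counts


classes, counts = universal_inventory(INVENTORIES)
total_language_specific = sum(len(p) for p in INVENTORIES.values())
in_all = sorted(p for p, c in counts.items() if c == len(INVENTORIES))
print(total_language_specific, "->", len(classes), "classes; in all languages:", in_all)
```

The same counting on real inventories is what gives a reduction like the one quoted on the slide; the context dependent models still need the decision tree step described above.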

SLIDE 39

Choosing Phonemes

SLIDE 40

Rapid Portability: Acoustic Model

[Figure: bar chart of word error rate (%) for acoustic model porting configurations labelled Ø Tree, ML-Tree, Po-Tree, PDTS, with data amounts labelled 0:15, 0:25, 1:30 and 16:30. Word error rates shown: 69.1, 57.1, 49.9, 40.6, 32.8, 28.9, 19.6, 19.]
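
Word error rate, the measure on the chart's axis, is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and equal cost for substitutions, deletions and insertions (the function name is chosen for this example):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("when is the next bus", "when is next bus stop"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.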

SLIDE 41

Rapid Portability: Pronunciation Dictionary

[Figure: the system overview diagram (Input: speech -> AM, Lex, LM -> NLP / MT -> TTS -> Output: speech & text), highlighting the pronunciation rules, the text data, and the Lex.]

Pronunciation examples:
„adios“ -> /a/ /d/ /i/ /o/ /s/
„Hallo“ -> /h/ /a/ /l/ /o/
„Phydough“ -> ???

SLIDE 42

Phoneme- vs Grapheme based ASR

[Figure: chart comparing phoneme, grapheme, and grapheme (FTT) systems for English, Spanish, German, Russian, Thai.]

Problem: 1 grapheme ≠ 1 phoneme

Flexible Tree Tying (FTT): one decision tree
• Improved parameter tying
• Less over-specification
• Fewer inconsistencies

[Figure: a single decision tree with questions 0=vowel?, 0=obstruent?, 0=begin-state?, 1=syllabic?, 0=mid-state?, -1=obstruent?, 0=end-state? and leaves AX-m, IX-m, AX-b.]

SLIDE 43

Dictionary: Interactive Learning

• Follow the work of Davel & Barnard
• Word list: extract from text
• G-2-P (grapheme context):
  – explicit mapping rules
  – neural networks
  – decision trees
  – instance learning
• Update after each wi -> more effective training
• Kominek & Black

[Figure: flowchart of the learning loop.]
1. From the word list W, select the best next word wi
2. Generate pronunciation P(wi) with G-2-P
3. Play P(wi) to the user via TTS
4. P(wi) okay?
   – Yes: add to Lex, delete wi
   – No: improve P(wi), update G-2-P, delete wi
   – Skip
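
The loop in the flowchart can be sketched in a few lines. This is a simplified illustration, not the SPICE Lex Learner: the G2P model here is a naive one-letter-to-one-phone table, words are taken in the order given rather than by a "best select" criterion, and the user is represented by a callback named `ask_user`, all of which are assumptions made for the example.

```python
def learn_lexicon(word_list, ask_user, g2p=None):
    """Build a lexicon word by word, updating the G2P model after each answer.

    `ask_user(word, guess)` returns the phones the user accepts: the guess
    itself if it is okay, a corrected list otherwise, or None to skip.
    """
    g2p = dict(g2p or {})          # letter -> phone, a deliberately naive model
    lexicon, corrections = {}, 0
    for word in word_list:
        guess = [g2p.get(ch, "?") for ch in word]
        answer = ask_user(word, guess)
        if answer is None:
            continue               # user skipped this word
        if answer != guess:
            corrections += 1
        lexicon[word] = answer
        # Update immediately so the next guess benefits from this answer.
        if len(answer) == len(word):
            g2p.update(zip(word, answer))
    return lexicon, g2p, corrections


# Simulated user backed by a tiny gold lexicon (hypothetical data).
GOLD = {"adios": list("adios"), "dos": list("dos"), "sol": list("sol")}
lexicon, g2p, corrections = learn_lexicon(
    ["adios", "dos", "sol"], lambda word, guess: GOLD[word]
)
print(corrections, "corrections for", len(lexicon), "words")
```

Counting corrections per word gives a rough picture of how quickly the model stops needing the user, which ties in with the question of how many words to present.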

SLIDE 44

Spice: Lex Learner

SLIDE 45

Spice: Lex Learner

SLIDE 46

Issues and Challenges

• How to make best use of the human?
  – Definition of successful completion
  – Which words to present in what order
  – How to be robust against mistakes
  – Feedback that keeps users motivated to continue
• How many words?
  – G2P complexity is language dependent
  – 80% coverage: hundred (SP) to thousands (EN)
  – G2P rule system perplexity:

    Language     Perplexity
    English      50.11
    Dutch        16.80
    German       16.70
    Afrikaans    11.48
    Italian       3.52
    Spanish       1.21

SLIDE 47

Rapid Portability: LM

[Figure: the system overview diagram (Input: speech -> AM, Lex, LM -> NLP / MT -> TTS -> Output: speech & text), highlighting the text data and the LM. Text data: Internet / TV, inquiry, automatic extraction. LM bridge languages: resource rich languages -> resource low languages.]

SLIDE 48

Parametric TTS

• Text-to-speech for G2P learning:
  – Technique: phoneme-by-phoneme concatenation, speech not natural but understandable (Marelie Davel)
  – Units are based on IPA phoneme examples
  – PRO: covers languages through simple adaptation
  – CONS: not good enough for speech applications
• Text-to-speech for applications:
  – Statistical parametric systems: clustergen
  – Clusters representing context-dependent allophones
  – PRO: can work with little speech (10 minutes)
  – PRO: robust to erroneous data
  – CONS: speech sounds buzzy, lacks natural prosody

SLIDE 49

SPICE: Afrikaans - English

• Goal: build Afrikaans – English S2S using SPICE
  – Cooperation with University Stellenbosch and ARMSCOR
  – Bilingual PhD visited CMU for 3 months (Herman Engelbrecht)
  – Afrikaans: related to Dutch and English, g-2-p very close, regular grammar, simple morphology
• SPICE: all components apply the statistical modeling paradigm
  – ASR: HMMs, N-gram LM (JRTk-ISL)
  – MT: Statistical MT (SMT-ISL)
  – TTS: Unit-Selection (Festival)
  – Dictionary: G-2-P rules using CART decision trees
• Text: 39 hansard; 680k words; 43k bilingual aligned sentence pairs
• Audio: 6 hours read speech; 10k utterances

SLIDE 50

SPICE: Time effort

• Results: ASR 20% WER; MT A-E (E-A) BLEU 34.1 (34.7), NIST 7.6 (7.9)
• Shared pronunciation dictionaries (for ASR+TTS) and LM (for ASR+MT)
• Most time consuming process: data preparation -> reduce amount of data!
• Still too much expert knowledge required (e.g. ASR parameter tuning!)

SLIDE 51

Current Tests

• 11 students in CMU class
  – Hindi (2), Vietnamese (2), French, German (2), Bulgarian, Telugu, Cantonese, Mandarin
• Build complete S2S system
  – Teams of 2 for translation on small domain
  – Translation is simple phrase-based
• Purpose:
  – Have students get full experience
  – Find bugs/limitations in the system
  – Evaluate resulting systems for development time and accuracy
