Linguistic Hacking: How to know what a text in an unknown language is about?

SLIDE 1

Linguistic Hacking

How to know what a text in an unknown language is about?

Martin.Haase@uni-bamberg.de
maha@jabber.ccc.de

SLIDE 2

Contents

  • how to identify the language of a written text
      • in traditional ways,
      • with the help of computer technology.
  • how to get at least some information out of an unknown text.

SLIDE 3

Hacking

“the intellectual challenge of creatively overcoming or circumventing limitations.”

Eric Raymond (1996): The New Hacker’s Dictionary.

SLIDE 4

Spoken texts?

multi-language corpus of telephone calls

SLIDE 5

Writing Systems

  • Roman (thousands of languages)
  • Cyrillic (> 60 languages)
  • Arabic (> 20 languages)
  • Devanāgarī (> 10 languages, not counting derivative writing systems)
  • Hebrew (~ 3 – 5 languages)

SLIDE 6

Devanāgarī

देवनागरी

SLIDE 11

Hebrew

  • Old and Modern Hebrew,
  • Ladino (with different varieties),
  • Judeo-Arabic,
  • Yiddish.

SLIDE 14

Norman C. Ingle (1980): Language Identification Table. London: Technical Translation International.

SLIDE 16

Computer-aided identification

  • frequencies of unique characters and character strings
  • common words recognition
  • n-gram analysis
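
As a rough illustration of the "common words recognition" idea, the sketch below counts how many tokens of a text appear in a short list of very frequent words per language. The word lists, the function name `guess_by_common_words`, and the choice of three languages are assumptions made for this example only; a usable system would need far longer lists drawn from corpora.

```python
import re
from collections import Counter

# Tiny illustrative word lists (assumptions for this sketch, not from the slides).
COMMON_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "les", "et", "est", "des", "un", "une"},
}


def guess_by_common_words(text):
    """Return (language, hits) for the word list matching the most tokens."""
    tokens = re.findall(r"\w+", text.lower())
    scores = Counter()
    for language, words in COMMON_WORDS.items():
        # Count every token occurrence that is in this language's list.
        scores[language] = sum(1 for token in tokens if token in words)
    return scores.most_common(1)[0]
```

The approach only works when the text is long enough to contain several function words, and closely related languages share many of them.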

SLIDE 17

“Text”

SLIDE 18

Trigrams of “Text”, padded with one leading and one trailing space: ( TE), (TEX), (EXT), (XT )
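
The trigram split can be reproduced with a few lines of code. This is a minimal sketch; the function names `char_ngrams` and `ngram_profile` and the single-space padding are assumptions chosen to match the slide, not the behaviour of any particular tool.

```python
from collections import Counter


def char_ngrams(word, n=3, pad=" "):
    """Split one word into overlapping character n-grams, padded on both sides."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_profile(text, n=3):
    """Count the n-grams of every whitespace-separated word in a text."""
    counts = Counter()
    for word in text.upper().split():
        counts.update(char_ngrams(word, n))
    return counts


print(char_ngrams("TEXT"))
# [' TE', 'TEX', 'EXT', 'XT ']
```

An n-gram identifier would compare such a profile of the unknown text with profiles built from known texts in each candidate language.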

SLIDE 20

variant of the unique character string approach

SLIDE 21

compression efficiency

SLIDE 25

reference model + text in language to be identified → gzip → compression efficiency
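
The compression idea can be sketched as follows: append the unknown text to a reference text in each candidate language and see how many extra bytes the compressor needs. The sketch uses Python's `zlib` module, which implements the same DEFLATE algorithm as gzip; the function names and the "fewest extra bytes wins" decision rule are assumptions for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the DEFLATE-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference: str, unknown: str) -> int:
    """Extra compressed bytes needed when `unknown` is appended to `reference`."""
    ref = reference.encode("utf-8")
    return compressed_size(ref + unknown.encode("utf-8")) - compressed_size(ref)


def guess_language(references: dict, unknown: str) -> str:
    """Pick the reference language in which the unknown text compresses best."""
    return min(references, key=lambda lang: extra_bytes(references[lang], unknown))
```

The fewer extra bytes are needed, the more character strings the unknown text shares with the reference. Results are only meaningful with reasonably long reference texts; DEFLATE also looks back only about 32 KB, so the end of the reference matters most.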

SLIDE 26

Interesting applications

  • measuring linguistic difference → language families
  • determining types of text
  • spam detection?

SLIDE 27

  • TextCat (http://odur.let.rug.nl/vannoord/TextCat/Demo/), n-gram based, 76 languages, usable as a web application,
  • Languid (http://languid.cantbedone.org/), downloadable program, web application not running,
  • Langid (http://complingone.georgetown.edu/~langid/), n-gram based, 65 languages, web application,
  • LanguageGuesser (http://www.xrce.xerox.com/cgi-bin/mltt/LanguageGuesser), frequency tests on characters and character sequences, about 40 languages, web application,
  • Polyglot 3000 (http://www.polyglot3000.com/), corpora, method unknown, currently 441 languages, closed-source Windows freeware. :-(

SLIDE 28

approaching “content analysis”

SLIDE 29

Hacker’s approach

  • numbers, dates, words from another language
  • typographic hints:
      • bold or italic print,
      • colored or underlined text chunks,
      • capital letters
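
Some of these hints can be collected automatically from plain text. The sketch below pulls out numbers (including date-like digit groups) and capitalised words that do not open a sentence, which are often proper names. The function name `surface_hints` and both patterns are assumptions for illustration; bold, italic, colored, or underlined print is lost in plain text and would have to be read from the markup instead. The capital-letter hint is of little use for languages such as German, which capitalise every noun.

```python
import re

# Digit groups, optionally joined by typical date/number separators.
NUMBER = r"\d+(?:[.,:/-]\d+)*"


def surface_hints(text):
    """Collect numbers and mid-sentence capitalised words from plain text."""
    numbers = re.findall(NUMBER, text)
    # Remove numbers first so that dots inside dates are not read as sentence ends.
    without_numbers = re.sub(NUMBER, " ", text)
    capitalised = []
    sentence_start = True
    for token in re.findall(r"[^\W\d_]+|[.!?]", without_numbers):
        if token in ".!?":
            sentence_start = True
            continue
        if token[0].isupper() and not sentence_start:
            capitalised.append(token)
        sentence_start = False
    return {"numbers": numbers, "capitalised": capitalised}
```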

SLIDE 30

Zipf’s law (of abbreviation)

Very frequent words tend to be shorter and carry less lexical information, whereas infrequent words tend to be longer and carry more lexical information.

SLIDE 31

less lexical information implies more grammatical information and vice versa

SLIDE 32

most interesting for us: words with more specific lexical information

SLIDE 33

Ignore all short words! (even if they recur throughout the text)

SLIDE 34

Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i gagana, ‘ua ‘avea ma gagana lona lua a le tele o tagata ‘o le vasa Pasefika, e pei ‘o Samoa. E iai le manatu, ‘o le gagana fa‘aperetania, ‘ua matuā talitonu i ai le tele o tagata Samoa e fa‘apea ‘o le gagana e maua ai le atamai ma le poto. ‘E talitonu fo‘i nisi o i latou, ‘e lē aoga la latou gagana. E lē sa‘o lea tāofi, ‘auā e ‘avatu le gagana fa‘aperetania i Samoa, ‘ua leva ona atamamai ma popoto tagata Samoa e fai lo latou soifua ma lo latou lalolagi.
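
The "ignore all short words" rule is easy to try out on the first two sentences of this sample. In the sketch below, the function name `long_word_counts` and the cut-off of five characters are assumptions; the right threshold depends on the language and has to be found by experiment.

```python
from collections import Counter


def long_word_counts(text, min_length=5):
    """Count words of at least `min_length` characters, ignoring shorter ones."""
    counts = Counter()
    for raw in text.split():
        # Strip punctuation and the leading glottal-stop mark before measuring.
        word = raw.strip(".,;:!?‘’'\"()").lower()
        if len(word) >= min_length:
            counts[word] += 1
    return counts


sample = (
    "Ua salalau lenei gagana i le lalolagi atoa. ‘O lenei fo‘i gagana, "
    "‘ua ‘avea ma gagana lona lua a le tele o tagata ‘o le vasa Pasefika, "
    "e pei ‘o Samoa."
)

print(long_word_counts(sample).most_common(2))
# [('gagana', 3), ('lenei', 2)]
```

Frequent short items such as "le", "o" and "‘ua" drop out, and the longer words that remain are the candidates for carrying the topic of the text.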
