Ruth Vatvedt Fjeld & Rune Lain Knudsen LBK2013 a lexicographic - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Ruth Vatvedt Fjeld & Rune Lain Knudsen LBK2013 – a lexicographic corpus for modern Norwegian bokmål

slide-2
SLIDE 2

Purpose:

  • Lemma selection

Frequency-based lemma selection

Neologisms/obsolete words
– Single-word lemmas
  » Mus (mouse) (meaning change)
  » Tastafon (obsolete word)
– Multiword lemmas
  » Være lutter øre (be all ears) (obsolete)

slide-3
SLIDE 3

LBK – Lexicographical Bokmål Corpus

  • The documents in LBK2013 are restricted to the period 1985-2013.

– Availability
– Modern language
– Changes in the lexicon relative to existing dictionaries (built from excerpts of older language)
– A balanced corpus of 100 million tokens

slide-4
SLIDE 4

Selection of text types

  • Modern fiction
  • Textbooks
  • Blogs
  • Factual prose
  • Law
  • Medicine
  • Natural sciences
  • Humanities
  • Sports …

Institutt for lingvistiske og nordiske studier (ILN)

slide-5
SLIDE 5

Demography markers

  • Age
  • Sex
  • Place of birth and youth
  • Year of birth
  • Publisher
  • Year of publication
  • Such metadata make it easy to construct subcorpora for comparative investigations and a wide range of queries
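
As a rough illustration of how such markers could drive subcorpus construction, here is a minimal Python sketch. The `Document` record, its field names and the sample entries are all invented for the example; they are not the LBK metadata schema.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical record: field names loosely mirror the markers listed above.
    doc_id: str
    author_sex: str
    author_birth_year: int
    publisher: str
    publication_year: int


def subcorpus(docs, **criteria):
    """Return the documents whose metadata satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    callable predicate applied to the field value.
    """
    def matches(doc):
        for field, wanted in criteria.items():
            value = getattr(doc, field)
            if callable(wanted):
                if not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [doc for doc in docs if matches(doc)]


# Invented sample data, for illustration only.
DOCS = [
    Document("d1", "F", 1960, "Forlag A", 1992),
    Document("d2", "M", 1975, "Forlag B", 2008),
    Document("d3", "F", 1981, "Forlag A", 2011),
]

# Texts by female authors published in 2000 or later.
recent_female = subcorpus(
    DOCS,
    author_sex="F",
    publication_year=lambda year: year >= 2000,
)
print([doc.doc_id for doc in recent_female])
```

Two subcorpora selected this way can then be compared with the same frequency or concordance queries.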


slide-6
SLIDE 6

Text categories LBK2013


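  • Fiction (Skjønnlitteratur): 35 %
  • Non-fiction (Sakprosa): 49 %
  • Non-standardised text (Unormert): 5 %
  • Newspapers and weekly magazines (Aviser og kulørte ukeblader): 6 %
  • TV subtitles (TV-tekster): 5 %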

slide-7
SLIDE 7

How?

LBK makes use of the IMS Corpus Workbench, a widely used tool set for managing and querying large text corpora. It is made available to researchers through Glossa, a web-based interface for corpora developed at the Text Laboratory, ILN, at the University of Oslo. Every document is POS-tagged with the Oslo-Bergen tagger. Additional metadata, such as bibliographic and ethnographic information, is manually annotated and stored as TEI headers.
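
To give an idea of what reading bibliographic metadata from a TEI header can look like, here is a small Python sketch using the standard library. The sample header is invented and deliberately minimal; the actual layout of the LBK headers is not shown in the slides, so the element paths are an assumption based on common TEI usage.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; the "tei" prefix is just a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample header, not taken from LBK.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Eksempeltittel</title>
        <author>Ola Nordmann</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Eksempelforlaget</publisher>
        <date>2005</date>
      </publicationStmt>
      <sourceDesc><p>Invented sample</p></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(xml_text):
    """Extract a few bibliographic fields from a TEI header string."""
    root = ET.fromstring(xml_text)
    base = "tei:teiHeader/tei:fileDesc/"

    def text_at(path):
        node = root.find(base + path, TEI_NS)
        return node.text if node is not None else None

    return {
        "title": text_at("tei:titleStmt/tei:title"),
        "author": text_at("tei:titleStmt/tei:author"),
        "publisher": text_at("tei:publicationStmt/tei:publisher"),
        "year": text_at("tei:publicationStmt/tei:date"),
    }


header = read_header(SAMPLE)
print(header)
```

Fields that are absent from a header come back as `None` rather than raising an error, which keeps batch processing of many documents simple.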

slide-8
SLIDE 8

Resources

[Chart; axis scale from 20 000 000 to 120 000 000]

slide-9
SLIDE 9

Staff

[Chart; axis scale from 1 to 8. Categories: Project leader, Assistant, Engineer]

slide-10
SLIDE 10

New statistical tools

  • Frequency counts
  • Concordances
  • DeepDict analysis (Bick)
  • Word Sketch Engine (Kilgarriff)
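
The first two items can be sketched in a few lines of Python. This is only a toy version working on a plain token list, with invented function names and sample text; it is not how the IMS Corpus Workbench, DeepDict or the Sketch Engine work internally.

```python
from collections import Counter


def frequency_counts(tokens):
    """Count word forms, ignoring case."""
    return Counter(token.lower() for token in tokens)


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines as (left, hit, right) tuples.

    `width` is the number of context tokens kept on each side.
    """
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample text, for illustration only.
TOKENS = "Musa løp over gulvet mens jeg flyttet musa på skjermen".split()

counts = frequency_counts(TOKENS)
print(counts.most_common(1))
for left, hit, right in concordance(TOKENS, "musa", width=2):
    print(f"{left:>20} [{hit}] {right}")
```

Seeing both hits side by side is what lets a lexicographer spot that the same form is used for the animal and for the computer device.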
slide-11
SLIDE 11

Why compile a balanced corpus?

Statistical analysis of interesting subcorpora for
– Actual use of recommended morphology (standardisation and documentation)

word form                 TV-text   Total corpus   NoTa
tiden/tida (time)         72/28     92/8           60/40
takken/takka (thanks)     100/0     100/0
hjelpen/hjelpa (help)     91/9      95/5           50/50
lysten/lysta (desire)     100/0     100/0          100/0
moren/mora (mother)       81/19     91/9           79/11
kvinnen/kvinna (woman)    100/0     99/1           100/0
uken/uka (week)           42/58     63/37          21/79
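
Percentage pairs like these can be derived from raw counts of the two competing forms in a subcorpus. The following Python sketch shows one way to do that; the function name and the sample tokens are invented, and the rounding is an assumption, since the slides do not say how the figures were computed.

```python
from collections import Counter


def variant_split(tokens, form_a, form_b):
    """Return the percentage split between two variant word forms.

    The result is a pair (pct_a, pct_b) that sums to 100, or None when
    neither form occurs. Matching ignores case.
    """
    counts = Counter(token.lower() for token in tokens)
    count_a = counts[form_a.lower()]
    count_b = counts[form_b.lower()]
    total = count_a + count_b
    if total == 0:
        return None
    pct_a = round(100 * count_a / total)
    return pct_a, 100 - pct_a


# Invented sample tokens, for illustration only.
TOKENS = ["uken", "uka", "Uka", "uka", "tiden", "tiden", "tida"]

print(variant_split(TOKENS, "uken", "uka"))
print(variant_split(TOKENS, "tiden", "tida"))
```

Running the same function over a TV-text subcorpus, the total corpus and a speech corpus would give one row of such a comparison per word pair.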

slide-12
SLIDE 12

How to mark up a corpus

  • PoS-tagging by automatic analysis
  • Grammar: valency/argument structure etc.
  • Jeg har tenkt til å gjøre det (I intend to do it)
  • Flaska knuste (the bottle broke)
slide-13
SLIDE 13

Muslim as first element of compounds (1985-2000)


            1985-1990   1991-1995   1996-2000
muslim      5           215         164
muslimsk    1           176         139

Other forms: 8 muslimsk-kroatisk, 2 muslimsk-kroatiske, 6 muslimsk-dominert, 2 muslimske, 2 muslimsk-dominert

slide-14
SLIDE 14

Muslim in compounds (2001-2013)


2001-2005             2006-2010           2011-2013
1073 muslim           1217 muslim         948 muslim
987 muslimsk          923 muslimsk        499 muslimsk
14 muslimbrødrene     4 muslimene         3 muslimhat
5 muslimbror          1 muslimhets        2 muslimskdominert
3 muslimskføde        1 muslimsirkel      1 muslimhater
2 muslimsk-arabisk    1 muslimdominert    1 muslimhatende
1 muslimhater         1 muslimskhet       1 muslimvennlig
1 muslimsk-jødisk     1 muslimfrykten     1 muslimisme
1 muslimskdominert    1 muslimdebatt      1 muslimhets
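
Counts of this kind can be produced by collecting every word form that begins with a given first element and grouping the hits by publication period. Here is a minimal Python sketch of that idea; the function names, the `(year, tokens)` input format and the sample documents are all invented for the example.

```python
from collections import Counter

PERIODS = [(2001, 2005), (2006, 2010), (2011, 2013)]


def period_of(year, periods):
    """Return the (start, end) period containing `year`, or None."""
    for start, end in periods:
        if start <= year <= end:
            return (start, end)
    return None


def prefix_forms_by_period(docs, prefix, periods):
    """Count word forms starting with `prefix`, grouped by period.

    `docs` is an iterable of (publication_year, tokens) pairs.
    Matching ignores case; documents outside all periods are skipped.
    """
    result = {period: Counter() for period in periods}
    for year, tokens in docs:
        period = period_of(year, periods)
        if period is None:
            continue
        for token in tokens:
            form = token.lower()
            if form.startswith(prefix):
                result[period][form] += 1
    return result


# Invented sample documents, for illustration only.
DOCS = [
    (2002, ["Muslim", "muslimsk", "norsk"]),
    (2012, ["muslimhat", "debatt"]),
    (1999, ["muslim"]),
]

by_period = prefix_forms_by_period(DOCS, "muslim", PERIODS)
for period, forms in by_period.items():
    print(period, forms.most_common())
```

A plain prefix match also picks up inflected forms of the base word, so the output still needs manual sorting into compounds and inflections.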