Authoring Support with acrolinx IQ - PowerPoint PPT Presentation

Authoring Support with acrolinx IQ. acrolinx - the company; production of technical documents; NLP for spelling and terminology, grammar, and style.


  1. Authoring Support with acrolinx IQ™

  2. • acrolinx - the company
     • production of technical documents
     • NLP for
       ◦ spelling and terminology
       ◦ grammar
       ◦ style
       ◦ consistent phrasing

  3. • software for information quality assurance
     • spin-off from the German Research Center for Artificial Intelligence (DFKI), Saarbrücken
     • technology under development since 1997 (since 2002 as acrolinx)
     • headquarters in Berlin, about 40 employees
     • users in 25 countries, checking millions of words a month

  4. • Software: Adobe, Autodesk, CA, EMC, IBM, SAS Institute, Symantec
     • Life Sciences: Dräger, GE, Medtronic, Siemens
     • Communications: AlcatelLucent, Cisco, Huawei, Motorola, SonyEricsson
     • Industrial: DAF, HOMAG, John Deere, MAN, SEW Eurodrive
     • Technology: Bosch, Embraer, KonicaMinolta, Philips, Siemens, Leica GeoSystems

  5. • correctness
     • spelling
     • understandability
     • grammar
     • readability
     • style
     • translatability
     • terminology
     • consistency
     • less ambiguity
     • corporate wording

  6. • Translation costs
     • Support costs

  7. • words + phrases
       ◦ spelling: variants, such as US English vs. UK English
       ◦ terminology: set-up and administration of terminology, terminology checking
     • sentences
       ◦ grammar: grammar checking
       ◦ style: checking of style guidelines; checking for consistency, translatability, readability
     • text
       ◦ structure: document structure
       ◦ multilinguality

  8. • language analysis
       ◦ words are defined in a dictionary
       ◦ anything not in the dictionary is an error
       ◦ high recall, low precision (depending on the domain)
     • error analysis
       ◦ errors are defined
       ◦ based on words and rules
       ◦ high precision, recall is dependent on data
     • unknown words that are not defined as errors are term candidates
       ◦ consider terminology work
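
The two spell-checking strategies on this slide can be contrasted in a minimal sketch. The word lists and function names below are hypothetical toy data, not acrolinx resources; a real system would rely on large dictionaries and morphological analysis.

```python
# Toy word lists, invented for illustration only.
DICTIONARY = {"close", "the", "door", "of", "our", "car"}
ERROR_DICTIONARY = {"teh": "the", "dor": "door"}


def check_by_dictionary(words):
    """Language analysis: anything not in the dictionary is flagged."""
    return [w for w in words if w.lower() not in DICTIONARY]


def check_by_error_dictionary(words):
    """Error analysis: only explicitly defined errors are flagged."""
    return {
        w: ERROR_DICTIONARY[w.lower()]
        for w in words
        if w.lower() in ERROR_DICTIONARY
    }


def term_candidates(words):
    """Unknown words that are not defined as errors are term candidates."""
    return [
        w for w in check_by_dictionary(words)
        if w.lower() not in ERROR_DICTIONARY
    ]


words = "Close teh door of our XYZ car".split()
print(check_by_dictionary(words))        # ['teh', 'XYZ']
print(check_by_error_dictionary(words))  # {'teh': 'the'}
print(term_candidates(words))            # ['XYZ']
```

The dictionary check also flags the product name XYZ (high recall, low precision), while the error dictionary flags only the known misspelling (high precision, recall limited by the data).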

  9. • tokenization
     • POS-tagging
     • morphology
     • dictionary
     • error dictionary

  10. • Close the door of our XYZ car.
        ◦ token classes: capital word, lower word, space, dot_EOS
      • 花子が本を読んだ。 (Hanako read a book.)
        ◦ 花子 が 本 を 読ん だ 。
        ◦ token classes: Kanji, Hiragana, dot_EOS
      • based on rules and lists of abbreviations
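
As a rough illustration of rule-based tokenization with an abbreviation list, the sketch below assigns the token classes named on this slide. The regular expression, the `ABBREVIATIONS` set, and the extra `dot_abbrev` class are assumptions made for the example, not the acrolinx implementation. It splits Japanese text only where the script changes, so it yields 読 / んだ rather than the 読ん / だ segmentation shown above, which needs a dictionary.

```python
import re

# Hypothetical abbreviation list: a dot after one of these is not a sentence end.
ABBREVIATIONS = {"Dr", "No", "etc"}

TOKEN_RE = re.compile(
    r"(?P<space>\s+)"
    r"|(?P<kanji>[\u4e00-\u9fff]+)"
    r"|(?P<hiragana>[\u3040-\u309f]+)"
    r"|(?P<word>[A-Za-z]+)"
    r"|(?P<dot>[.\u3002])"
    r"|(?P<other>.)"
)


def tokenize(text):
    """Return (token, class) pairs using simple rules and an abbreviation list."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "word":
            kind = "capital_word" if value[0].isupper() else "lower_word"
        elif kind == "dot":
            previous = tokens[-1][0] if tokens else ""
            kind = "dot_abbrev" if previous in ABBREVIATIONS else "dot_EOS"
        tokens.append((value, kind))
    return tokens


for token, kind in tokenize("Close the door of our XYZ car."):
    print(repr(token), kind)
```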

  11. • Close the door of our XYZ car.
        ◦ V DET N PREP PRON NE N
      • XML and attribute-value structures
      • statistical methods
      • large dictionaries

  12. • Close the door of our XYZ car.
        ◦ close: Lemma: close, Tense: present_imp
        ◦ car: Lemma: car, Person: third, Number: singular, Case: nominative_accusative
      • based on dictionaries, rules for inflection and derivation

  13. • Consistency!
      • ideally: 1 term = 1 meaning = 1 translation
      • less ambiguity, better comprehension, translatability, etc.
      • multilingual consistency
      • corporate wording
      • lower costs (translation but also support)

  14. • When analyzing terminology in documents, we find many variants that are used at the same time:
        ◦ web server – web-server
        ◦ upload protection – upload-protection
        ◦ timeout – time out
        ◦ Reset – ReSet
        ◦ sub station – sub-station

  15. • author/company defines term banks
      • list of deprecated terms
        ◦ deprecated term: vehicle
        ◦ approved term: car
      • list of approved terms
      • identification of so-called “variants”
        ◦ approved term: SWASSNet User
        ◦ deprecated term: SWASSNet user, SWASS-Net User
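
A term bank of this kind can be pictured as a mapping from deprecated terms to approved ones. The sketch below is only an illustration: the `TERM_BANK` structure and the `check_terms` function are hypothetical, and the matching is a plain case-sensitive string search with none of the morphological handling a real checker would need.

```python
import re

# Hypothetical term bank built from the examples on this slide:
# deprecated term -> approved term.
TERM_BANK = {
    "vehicle": "car",
    "SWASSNet user": "SWASSNet User",
    "SWASS-Net User": "SWASSNet User",
}


def check_terms(text, term_bank=TERM_BANK):
    """Return sorted (position, deprecated term, approved term) findings."""
    findings = []
    for deprecated, approved in term_bank.items():
        # Match whole terms only; case matters because some variants differ by case.
        pattern = r"(?<!\w)" + re.escape(deprecated) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            findings.append((match.start(), deprecated, approved))
    return sorted(findings)


print(check_terms("Each SWASSNet user parks the vehicle."))
# [(5, 'SWASSNet user', 'SWASSNet User'), (29, 'vehicle', 'car')]
```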

  16. ◦ orthographic variants
         - hyphen, blank, case: term bank, termbank
      ◦ semi-orthographic variants
         - number: 6-digit, six-digit
         - trademark: acrolinx IQ™, acrolinx IQ
      ◦ syntactic variants
         - preposition: oil level, level of oil
         - gerund/noun: call center, calling center
      ◦ synonyms
         - “classical”: vehicle, car
      ◦ language-specific variants (e.g. linking elements (Fugenelemente) in German, Katakana in Japanese)
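
Orthographic variants (hyphen, blank, case) are the simplest class to detect automatically. One possible approach, shown here as a sketch with hypothetical function names, is to reduce every term to a normalized key and group the terms that share a key. The other variant classes would need morphology or a synonym list and are not covered.

```python
import re
from collections import defaultdict


def variant_key(term):
    """Drop blanks and hyphens and ignore case."""
    return re.sub(r"[\s\-]+", "", term).lower()


def group_variants(terms):
    """Group terms that differ only by hyphen, blank, or case."""
    groups = defaultdict(set)
    for term in terms:
        groups[variant_key(term)].add(term)
    # Keep only keys that occur in more than one written form.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


terms = ["web server", "web-server", "timeout", "time out", "Reset", "ReSet", "door"]
print(group_variants(terms))
# {'webserver': ['web server', 'web-server'], 'timeout': ['time out', 'timeout'],
#  'reset': ['ReSet', 'Reset']}
```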

  17. • in terminology: SpeicherKarte (German: memory card)

  18. • term: MMC-Speicherkarten (MMC memory cards, deprecated), suggested: PC-Speicherkarten (PC memory cards)

  19. • Terminology, Documentation, Localization
      • Term Validation: term candidates are validated
      • Term Discovery: document repository is analysed for terms
      • Term Deployment: term checking
      • TermHarvesting™: new terms are identified as content is checked

  20. • NLP methods for term extraction
        ◦ corpus analysis (morphology, POS, NER)
        ◦ information extraction (potential product names)
        ◦ ontologies (e.g. semantic groups)
      • NLP methods for setting up a term database
        ◦ morphology (finding the lemma)
        ◦ POS
      • NLP methods for term checking
        ◦ variants
        ◦ similar words
        ◦ inflection

  21. • descriptive grammar
        ◦ definition of correct grammar
        ◦ e.g. HPSG, LFG, chunk grammar, statistical grammars
        ◦ anything that's not analyzable must be a grammar error
        ◦ preconditions:
           - grammar with large coverage
           - giant dictionaries
           - robust, but not too robust parsing
           - efficient parsing methods
        ◦ high recall, low precision
      • error grammar
        ◦ grammar errors are implemented
        ◦ preconditions:
           - work with error corpora
           - error grammar with a high number of error types
           - depth of analysis varies with the type of error to be described
        ◦ high precision, recall is based on the number of rules

  22. • subject-verb agreement:
        ◦ Check if instructions are programmed in such a way that a scan never finish.
        ◦ When the operations is completed, the return to home completes.
      • a/an distinction:
        ◦ a isolating transformer
        ◦ an program
      • wrong verb form:
        ◦ it cannot communicates with them
        ◦ IP can be automatically get

  23. • write_words_together
        ◦ @can ::= [ TOK "^(can)$" MORPH.READING.MCAT "^Verb$" ];
        ◦ The application can not start.
        ◦ The application can tomorrow not start.
        ◦ TRIGGER(80) == @can^1 [@adv]* 'not'^2 -> ($can, $not)
          -> { mark: $can, $not;
               suggest: $can -> '', $not -> 'cannot';
             }
        ◦ Branch circuits can not only minimize system damage but can interrupt the flow of fault current
        ◦ NEG_EV(40) == $can 'not' 'only' @verbInf []* 'but';
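
To make the logic of this rule easier to follow, here is a rough Python approximation using regular expressions. It is not the acrolinx rule engine: the `ADVERBS` list is a hypothetical stand-in for `@adv`, the morphological checks (`MORPH.READING.MCAT`, `@verbInf`) are omitted, and the sketch assumes that a matching negative-evidence pattern simply suppresses the flag, since the slides do not say how the scores 80 and 40 are combined.

```python
import re

# Hypothetical stand-in for the @adv class in the rule above.
ADVERBS = ("also", "really", "tomorrow")
_adv = "|".join(ADVERBS)

# TRIGGER: "can", optional adverbs, then "not".
TRIGGER = re.compile(rf"\b(?P<can_tok>can)(?:\s+(?:{_adv}))*\s+(?P<not_tok>not)\b")
# NEG_EV: "... not only ... but" is a legitimate construction.
NEG_EV = re.compile(r"\s+only\b.*\bbut\b")


def suggest_cannot(sentence):
    """Return the suggested rewrite, or None if nothing is flagged."""
    for match in TRIGGER.finditer(sentence):
        if NEG_EV.match(sentence, match.end()):
            continue  # negative evidence found, do not flag
        # Mirror the suggestion: $can -> '', $not -> 'cannot'.
        between = sentence[match.end("can_tok"):match.start("not_tok")].strip()
        return (
            sentence[:match.start("can_tok")]
            + (between + " " if between else "")
            + "cannot"
            + sentence[match.end("not_tok"):]
        )
    return None


print(suggest_cannot("The application can not start."))
# The application cannot start.
print(suggest_cannot("The application can tomorrow not start."))
# The application tomorrow cannot start.
print(suggest_cannot("Branch circuits can not only minimize system damage "
                     "but can interrupt the flow of fault current"))
# None
```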

  24. • controlled languages
        ◦ AECMA - now: AeroSpace and Defence Industries Association of Europe (ASD): ASD-STE100 (Simplified English)
        ◦ Caterpillar Technical English (CTE)
      • disadvantage:
        ◦ very restrictive! Prescriptive rules define allowed structures and allowed vocabulary, so all other structures and words are disallowed
        ◦ low acceptance by users

  25. • rules define errors (just as grammar rules do)
      • rules are defined by user / author
      • acceptance is much higher

  26. • style guidelines can be different for different usages
        ◦ text type (e.g., press release – technical documentation)
        ◦ domain (e.g., software – machines)
        ◦ readers (e.g., end users – service personnel)
        ◦ authors (e.g., Germans tend to write long sentences)

  27. • avoid_latin_expressions
      • avoid_modal_verbs
      • avoid_passive
      • avoid_split_infinitives
      • avoid_subjunctive
      • use_serial_comma
      • use_comma_after_introductory_phrase
      • spell_out_numerals
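
The slides give only the names of these style rules, not their definitions. As a sketch of how such user-selectable rules might be organized, the example below keys two of them by name. The patterns are guesses at what the names suggest (a few Latin abbreviations, single-digit numerals), and `STYLE_RULES` and `check_style` are hypothetical names.

```python
import re

# Hypothetical patterns: the slides name these rules but do not define them.
STYLE_RULES = {
    "avoid_latin_expressions": re.compile(r"(?<!\w)(?:e\.g\.|i\.e\.|etc\.)"),
    # A single digit that is not part of a longer or decimal number.
    "spell_out_numerals": re.compile(r"(?<![\d.,])\b\d\b(?![.,]\d)"),
}


def check_style(text, enabled=tuple(STYLE_RULES)):
    """Return sorted (position, rule name, matched text) findings."""
    findings = []
    for name in enabled:
        for match in STYLE_RULES[name].finditer(text):
            findings.append((match.start(), name, match.group()))
    return sorted(findings)


print(check_style("Press 3 keys, e.g. Ctrl."))
# [(6, 'spell_out_numerals', '3'), (14, 'avoid_latin_expressions', 'e.g.')]
```

The `enabled` parameter reflects the point that different text types, domains, or readers can use different subsets of rules.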

  28. • use_units_consistently
      • abbreviate_currency
      • COMPANY_trademark
      • do_not_refer_to_COMPANY_intranet
      • add_tag_to_UI_string
      • avoid_trademark_as_noun
      • avoid_articles_in_title

  29. • avoid_nested_sentences
      • avoid_ing_words
      • keep_two_verb_parts_together
      • avoid_parenthetical_expressions
      • dependent on the MT system and language pair

  30. ◦ replacement of words or phrases
      ◦ replacement using the correct writing with uppercase or lowercase
      ◦ replacement of words using the correct inflection
      ◦ generation of whole sentences (e.g. passive – active) requires semantic analysis and generation and is therefore not (yet) possible
