English version of Introduction to Computational Linguistics, slides


slide-1
SLIDE 1 See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/270686919

English version of Introduction to Computational Linguistics, slides

Conference Paper · November 2014

DOI: 10.13140/2.1.2987.2964

1 author: Fiorella Dotti, Universidad Autónoma de Madrid

All content following this page was uploaded by Fiorella Dotti on 11 January 2015.

slide-2
SLIDE 2

Introduction to Computational Linguistics

Fiorella C. Dotti UAM

University of Salamanca

slide-3
SLIDE 3

Universidad de Salamanca Fiorella C. Dotti

What is Computational Linguistics?

ACL Definition: “The study of language from a computational perspective”.
An area of knowledge that combines theoretical and applied linguistics, statistics, computer science and mathematics, among other fields, in order to further our understanding of natural language and help us develop new language technologies.

slide-4
SLIDE 4

What is Computational Linguistics?

The field is closely related to Natural Language Processing (NLP). The relationship between NLP and Computational Linguistics has been described as similar to the one between Engineering and Science: Computational Linguistics is more concerned with causes and origins, while NLP is more concerned with direct application.

slide-5
SLIDE 5

What is Computational Linguistics?

In our area of study, both fields are constantly overlapping.

Most likely, we are interested in both things: improving recognition and finding out the underlying cause.

slide-6
SLIDE 6

Approaches to CL and NLP

The first approaches were mostly rule-based. Rule-based approaches typically make intensive use of hand-crafted resources. Creating these resources is expensive.

slide-7
SLIDE 7

Approaches to CL and NLP

Then, statistical approaches started to be used. Statistical approaches often do not rely on as much hand-crafted information as rule-based approaches. This makes them cheaper and, as a result, more popular.

slide-8
SLIDE 8

Approaches to CL and NLP

Nevertheless, statistical approaches tend to work well only for cases that are very frequent in the data, so a combined approach (rule-based + statistics) is rising in popularity.

slide-9
SLIDE 9

Some results of CL + NLP

❖ Speech recognition systems (NOT the same as voice recognition)
❖ Search engines
❖ Automatic ontology creation
❖ Automatic correction systems
❖ Sentiment analysis
❖ Automatic summarization
❖ Machine translation
❖ Automated natural language generation
❖ Natural language understanding

slide-10
SLIDE 10

Speech recognition systems

The most common example would be using a search engine or a digital assistant by means of speaking to your phone. Speech recognition systems use statistical techniques such as Hidden Markov Models to calculate the probability that a phoneme will be followed by another and identify the most likely intended word/sentence.
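
As a rough illustration of “the probability that a phoneme will be followed by another”, the sketch below estimates those probabilities by counting adjacent pairs. The phoneme sequences and the function name are invented for the example, and this is only one ingredient of a Hidden Markov Model, not a recognizer.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next | current) by counting adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[current][following] += 1
    probs = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probs[current] = {nxt: n / total for nxt, n in followers.items()}
    return probs

# Toy, hand-written phoneme sequences (hypothetical data).
toy_sequences = [
    ["k", "ae", "t"],   # "cat"
    ["k", "ae", "p"],   # "cap"
    ["k", "ih", "t"],   # "kit"
]
probs = transition_probabilities(toy_sequences)
print(probs["k"])   # how likely each phoneme is to follow "k" in the toy data
```

With real data, the counts would come from a large transcribed speech corpus rather than three hand-written words.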

slide-11
SLIDE 11
slide-12
SLIDE 12

The photo depicts the launch of STS-26 (September 1988), the first return to flight mission after the Challenger accident. This was the first shuttle mission to use a non-critical speech recognition system. Weightlessness affected the astronauts’ articulation, so that templates created on the ground were ineffective, while templates that were created in microgravity were highly effective (as long as personal templates were created as well).

slide-13
SLIDE 13

Another possible aerospace application

This video captures a real conversation between a hypoxic pilot and air traffic controllers. The pilot is physically and cognitively unable to effectively control the plane and can only respond to direct instructions. Do you think a speech recognition system could have helped him? How?

slide-14
SLIDE 14

https://www.youtube.com/watch?v=_IqWal_EmBg

slide-15
SLIDE 15

Search Engines

Similarly to speech recognition systems, they calculate the probability of a word being followed by another, and the probability that it refers to one topic or another (an area of study known as word sense disambiguation). They also identify keywords (something that Search Engine Optimization makes use of) and try to repair user error (e.g., typing “machne learning” would return suggested results for “machine learning”).
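
One simple way to sketch the “repair user error” step is to replace each unknown word with the most similar word in a known vocabulary. The example below uses Python's standard `difflib` module; the vocabulary, the cutoff and the function name are assumptions made for the example, and real search engines use far richer signals.

```python
import difflib

# Hypothetical mini-vocabulary; a real engine would use a large lexicon or query logs.
VOCABULARY = ["machine", "learning", "marine", "translation", "language"]

def suggest_query(query, vocabulary=VOCABULARY):
    """Replace each unknown word with its closest vocabulary entry, if any."""
    corrected = []
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Keep only the single best match that is at least 80% similar.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(suggest_query("machne learning"))
```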

slide-16
SLIDE 16

Automatic Ontology creation

An ontology is a formal framework that we can use to represent knowledge. Natural language understanding and keyword extraction techniques (as well as others) can be used to extract pieces of information and their relationships to one another (e.g., DBpedia, an ontology extracted from Wikipedia).

slide-17
SLIDE 17

Automatic error correction systems

Nowadays, it is not infrequent to teach a class whose students have five or more different mother tongues. New European standards demand learner autonomy. This is a very hard situation for teachers.

slide-18
SLIDE 18

Automatic error correction systems

Automatic error correction systems help because:

  • 1. They are always available, so students can practise at any time.
  • 2. They do not have a native language limitation: they can detect and trace errors from students with different L1s.

slide-19
SLIDE 19

Sentiment analysis

Big companies invest large amounts of money in obtaining information about their customers. One of the main ways to do so is by monitoring and participating in social networks. “Community Managers” are not able to stay up to date on absolutely everything related to the brand in real time.

slide-20
SLIDE 20

Sentiment analysis

If a program can detect customers’ opinions and understand how they see the brand’s competitors, marketing campaigns can be finely targeted. There are many possible methods to use in this area (bag of words, Support Vector Machines, etc.). Many companies offer their services in this area.

slide-21
SLIDE 21

Machine translation

Automatically translate a text from its source language into a target language. This is how CL started: In the 1950s, American defense agencies wanted to be able to translate scientific articles from Russian into English. Russian agencies were trying to do the same.

slide-22
SLIDE 22

Machine Translation

Current examples include, most famously, Google Translate. Google can detect the source language automatically. Most systems use parallel corpora (a corpus of texts in one language and their corresponding translations into other languages) and dictionaries. Rule-based techniques provide a syntactic basis, while statistical techniques help with false-friend detection.

slide-23
SLIDE 23

Natural Language Generation

The creation of natural language by a machine. We can determine how close it is to what a human will say by using a series of tests. One of the best known tests for this purpose is the Turing test.

slide-24
SLIDE 24

Natural Language Generation

Often used to create a more “user-friendly” experience (databases, Q&A systems). Also useful to improve accessibility for users with disabilities that prevent them from speaking, reading, etc.

slide-25
SLIDE 25

Natural Language Understanding

A ‘smarter’ computer: it entails not only being able to ‘read’ the text, but also to make logical inferences from it. Present in the technologies that we have reviewed before, though it is also an area of research in its own right.

slide-26
SLIDE 26

How do these systems work? Several components:

  • A. Statistical systems
  • B. Linguistic systems
  • C. Programming
  • D. Extra resources (of any type)
slide-27
SLIDE 27

Statistical systems: a quick intro

There are many statistical methods, but in essence they mostly count the number of instances of a particular phenomenon and the circumstances surrounding it, and decide how likely it is that this would have occurred by chance.

slide-28
SLIDE 28

Statistical systems: a quick intro

Central to this idea is the concept of statistical significance: something is statistically significant if it is not likely to have happened by chance alone. There are tables with values that allow researchers to identify when something is or is not statistically significant.
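
To make the idea of “looking a value up in a table” concrete, here is a minimal sketch of one common test, the chi-square test on a 2x2 table of counts. The choice of test, the counts and the function name are assumptions made for the example; 3.841 is the usual table value for 1 degree of freedom at the 0.05 level.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (a * d - b * c) ** 2 / denominator

# Critical value from a chi-square table: 1 degree of freedom, alpha = 0.05.
CRITICAL_VALUE = 3.841

# Hypothetical counts: rows = two learner groups, columns = error present / absent.
statistic = chi_square_2x2(30, 10, 10, 30)
print(statistic, statistic > CRITICAL_VALUE)
```

If the statistic is above the table value, the difference between the two groups is unlikely to be due to chance alone at that level.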

slide-29
SLIDE 29

Linguistic systems

Rules that are derived from linguistic knowledge, e.g.:
Example sentence: “He are busy”
Linguistic rule: the third person singular of the verb ‘to be’ is “is”.
Therefore, the sentence is incorrect.
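
The rule above could be written down in code roughly as follows. This is a minimal sketch: the table of forms and the function name are made up for the example, and it only looks at a pronoun immediately followed by a present-tense form of ‘to be’.

```python
# Present-tense forms of 'to be' for each subject pronoun (the linguistic rule).
BE_FORMS = {"i": "am", "you": "are", "he": "is", "she": "is", "it": "is",
            "we": "are", "they": "are"}

def check_be_agreement(sentence):
    """Return None if the sentence passes the rule, else a correction message."""
    words = sentence.lower().split()
    for subject, verb in zip(words, words[1:]):
        if subject in BE_FORMS and verb in {"am", "is", "are"}:
            expected = BE_FORMS[subject]
            if verb != expected:
                return f"'{subject} {verb}' should be '{subject} {expected}'"
    return None

print(check_be_agreement("He are busy"))
```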

slide-30
SLIDE 30

Programming

The backbone and glue of it all. Not necessarily innovative, sometimes it just acts as a facilitating medium (you wouldn’t be able to process a 3,000,000 word corpus without some programming involved).

slide-31
SLIDE 31

A programming primer

Computers are best suited to process numerical information and can also deal with logical representation, but they appear to handle more advanced concepts because we use programs on top of basic programs. The trick lies in reducing everything to its simplest expression so that the computer can understand it.

slide-32
SLIDE 32

Example: Turn Off The Light

We could program a robot to walk down the aisle and turn off the light. The problem is that the robot’s computer will not be able to work with that right away. We must break the problem into smaller pieces of information.
slide-33
SLIDE 33

Example: Turn Off The Light

  • 1. Ascertain that the light is on
  • 2. Walk down the aisle
  • 3. Find the light switch
  • 4. Press the light switch.
slide-34
SLIDE 34

Example: Turn Off The Light

  • 2. Walk down the aisle

★ Error: What is ‘walk’?
★ Error: What is ‘the aisle’?
★ Error: How do I know which way is down?

slide-35
SLIDE 35

Example: Turn Off The Light

Walk = put one foot in front of the other in a straight line.
The aisle = where you will be walking: a straight line with walls on the sides (use computer vision, or sensors?).
Down = you will know the way down because it is further from where you are now (or there is light, or, if it is physically down, you can use your oscillometer).

slide-36
SLIDE 36

Example: Turn Off The Light

We need to do this for each one of our subdivisions. Each one of these smaller steps would be defined in code. Walking would be a function, the aisle would be a variable, etc. All the steps would ultimately become one function: Turn off the light. The user would probably only see a button that gives them the option to give the robot that order.
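
A minimal sketch of how the pieces could fit together in Python. The robot is simulated with a plain dictionary, all names are invented for the example, and the light switch is simply assumed to be at the end of the aisle (so “find the light switch” is not modelled).

```python
def light_is_on(world):
    return world["light_on"]

def walk_down_aisle(world):
    # 'Walking' is reduced to moving one step at a time until the end of the aisle.
    while world["robot_position"] < world["aisle_length"]:
        world["robot_position"] += 1

def press_switch(world):
    world["light_on"] = not world["light_on"]

def turn_off_the_light(world):
    """The single top-level order that the user's button would trigger."""
    if not light_is_on(world):
        return  # nothing to do
    walk_down_aisle(world)
    press_switch(world)

# The aisle is a variable: here, a hypothetical aisle that is 5 steps long.
world = {"light_on": True, "robot_position": 0, "aisle_length": 5}
turn_off_the_light(world)
print(world)
```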
slide-37
SLIDE 37

Python

Python is a good programming language for someone in computational linguistics:

  • 1. Easy to learn. Reads almost like English.
  • 2. Good for prototyping.
  • 3. Good community. Many projects in CL and NLP use Python, and it is a de facto industry standard.
  • 4. ‘Batteries included’ → lots of functionality right out of the box.

slide-38
SLIDE 38

Data types

Data types exist in all programming languages. They are ways to store data. Some of the most widely used ones are:
➢ Strings
➢ Lists
➢ Dictionaries (‘mappings’)
➢ Tuples
➢ Numbers
➢ Sets

slide-39
SLIDE 39

Data types: Strings

Strings are immutable (they never change). They can contain numbers or text, but when you use numbers inside them, the computer will not recognize them as such.
a_string = "The University of Salamanca"
b_string = "12345"
c_string = "10000"

slide-40
SLIDE 40

Data types: Strings

If we add strings together, we only get concatenated strings (one goes after the other; the numbers are not treated as such):
>>> b_string + c_string
'1234510000'
>>> a_string + b_string
'The University of Salamanca12345'
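
If we do want the digits to behave as numbers, we have to convert the strings first. A small sketch, using Python's built-in `int()` and redefining the two strings so that it runs on its own:

```python
b_string = "12345"
c_string = "10000"

concatenated = b_string + c_string        # strings: joined end to end
summed = int(b_string) + int(c_string)    # integers: added arithmetically

print(concatenated)
print(summed)
```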

slide-41
SLIDE 41

Data types: Strings

Strings are best used for information that won’t change and that should be kept in a certain order, e.g., a sentence:

some_string = "The man that you saw yesterday is asking me for help."

slide-42
SLIDE 42

Data type: List

A list is an ordered sequence of elements but, in contrast to the string, it is mutable: its order and length can change.

shopping_list = ["milk", "cereal", "bacon"]

We can iterate over lists, that is, we can treat each one of their elements on its own. We can also sort them alphabetically and do many other things.
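
A short sketch of what “mutable” means in practice, using the same shopping list (the added item is made up for the example):

```python
shopping_list = ["milk", "cereal", "bacon"]

shopping_list.append("apples")   # the length changes: lists are mutable
shopping_list.sort()             # the order changes too (alphabetical, in place)

for item in shopping_list:       # iterate: treat each element on its own
    print(item)
```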
slide-43
SLIDE 43

Data type: Tuple

The tuple is an ordered, immutable, usually small sequence of elements:
tuple_bigram = ('a', 'walk')
Tuples are useful when maintaining order is important, but we need more atomicity and structure.
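
For instance, the bigrams (pairs of adjacent words) of a sentence can be stored as tuples. A minimal sketch, with an invented sentence and naive whitespace tokenization:

```python
sentence = "we went for a walk"
tokens = sentence.split()

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)

# Tuples are immutable, so they can be used as dictionary keys, e.g. for counting.
bigram_counts = {}
for bigram in bigrams:
    bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1
```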

slide-44
SLIDE 44

Data type: Dictionary

Dictionaries are similar to real-life dictionaries: each entry has a key and a value. The problem is that dictionaries in Python cannot have repeated keys (a repeated key overwrites the earlier value) and they only take one value per key, so they become problematic if a word has multiple values.
dictionary = {'map': 'mapa', 'cat': 'gato', ...}
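
One common workaround for the one-value-per-key limitation is to store a list as the value. A minimal sketch using `collections.defaultdict` from the standard library; the word pairs are invented for the example:

```python
from collections import defaultdict

# Each key maps to a list, so a word can keep several translations.
translations = defaultdict(list)

# Hypothetical English-Spanish pairs; 'bank' has two senses.
pairs = [("map", "mapa"), ("cat", "gato"), ("bank", "banco"), ("bank", "orilla")]
for english, spanish in pairs:
    translations[english].append(spanish)

print(translations["bank"])
```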

slide-45
SLIDE 45

Data type: Numbers

Numbers can be integers, floats, complex numbers, etc. Integers and floats are the most commonly used. You can use numbers for any mathematical operation you want.
>>> my_integer = 1
>>> my_float = 2.0
>>> my_integer + my_float
3.0

slide-46
SLIDE 46

Data type: Sets

Sets do not take duplicates. They are often used to remove duplicate instances:
>>> my_list = ['jam', 'toast', 'cereal', 'jam']
>>> my_set = set(my_list)
>>> my_set
{'jam', 'toast', 'cereal'}
(The order in which a set displays its elements is not guaranteed.)

slide-47
SLIDE 47

Data type: Booleans

Something is True or False:
>>> my_favorite_animal = 'lion'
>>> my_favorite_animal == 'giraffe'
False

slide-48
SLIDE 48

Iteration

To iterate means to go over (elements) in sequence.
>>> my_list = ['giraffe', 'lion', 'turtle']
>>> for element in my_list:
...     print(element)
...
giraffe
lion
turtle

slide-49
SLIDE 49

Conditional Statements

If something is true, then do something. Else, do something else:
age = 12
if age > 10:
    print("You are more than 10 years old")
else:
    print("You are 10 years old or younger.")

slide-50
SLIDE 50

While Loops

While a condition is met, do something. It is possible to create infinite loops in this manner (be careful!)
i = 10
while i > 5:
    print('still a while to go!')
    i = i - 1

slide-51
SLIDE 51

Functions

A collection of instructions that may or may not return a result:
def count_to_ten(initial_number):
    while initial_number < 10:
        initial_number = initial_number + 1
    print('done')

count_to_ten(5)

slide-52
SLIDE 52

Recursion

A function can call itself until a stopping condition is met. This is often useful to solve problems.
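
A small sketch of recursion on a linguistic example: counting the words in a sentence whose phrases are nested inside each other. The bracketing and the function name are invented for the example.

```python
def count_words(tree):
    """Count the words (strings) in an arbitrarily nested list."""
    if isinstance(tree, str):      # stopping condition: a single word
        return 1
    # Otherwise the function calls itself on each nested part.
    return sum(count_words(child) for child in tree)

# A hypothetical bracketing of "the man saw the dog".
nested_sentence = [["the", "man"], ["saw", ["the", "dog"]]]
print(count_words(nested_sentence))
```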

slide-53
SLIDE 53

Objects

Everything in Python is an object. We can also define custom-made classes, which are the ‘templates’ for a specific object (instance). We could, for instance, create a class called ‘Verb’ that had “infinitive”, “gerund” and “regular or irregular” as attributes.
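
The ‘Verb’ class could look roughly like this. It is only a sketch: the `simple_past` method is an extra added for illustration, and its rule is deliberately naive (it ignores consonant doubling, for example).

```python
from dataclasses import dataclass

@dataclass
class Verb:
    infinitive: str
    gerund: str
    regular: bool   # True = regular, False = irregular

    def simple_past(self):
        """Naive past form; irregular verbs would need a lookup table."""
        if not self.regular:
            raise ValueError(f"'{self.infinitive}' is irregular; no rule applies")
        if self.infinitive.endswith("e"):
            return self.infinitive + "d"
        return self.infinitive + "ed"

# Two instances created from the same 'template'.
talk = Verb(infinitive="talk", gerund="talking", regular=True)
swim = Verb(infinitive="swim", gerund="swimming", regular=False)
print(talk.simple_past())
```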
slide-54
SLIDE 54

Modules and libraries

Python allows for modularity, which means that we can code things in different files, and import them as we need to. We can also use libraries. If we wanted to reuse code that someone else has written before, we could, as long as it is an available library, or we have the source code.
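
As a small illustration of reusing existing code, the sketch below imports `Counter` from Python's standard `collections` module instead of writing a word counter from scratch. The text is invented for the example.

```python
# Reusing code someone else wrote: Counter ships with Python's standard library.
from collections import Counter

text = "the cat saw the dog and the dog saw the cat"
word_counts = Counter(text.split())

print(word_counts["the"])           # how many times 'the' occurs
print(word_counts.most_common(1))   # the single most frequent word and its count
```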

slide-55
SLIDE 55

Programming: the bottom line

You need to convert your idea into something the computer can understand. You need to decompose the problem into its smaller parts, and these must be expressible in one of the data types or objects we have seen before, for a computer to be able to work on it. This can be a daunting task, so there are two main approaches: bottom-up or top-down.

slide-56
SLIDE 56

The bottom up approach

Used when there is much prototyping to be done. It is not a good approach to sustain if you are going to tackle a big problem with many interconnected parts, because it can be hard to keep the big picture in sight. It is done by testing a solution to one of the smaller problems, and then ‘moving up’ conceptually, going from the components to the larger scheme of things.

slide-57
SLIDE 57

The top down approach

You start from the highest possible conceptualization of the process, and you work your way down. Cons: you might find that solving the “smaller” problems actually requires so much code that it “breaks” your top-level code. Mostly useful when there is less prototyping to do and the solutions are known.

slide-58
SLIDE 58

SECTION 3: A First Approach to CL Problem Solving.

We will go over a couple of the previously presented areas of CL and NLP and try to work out the way in which they work:

Sentiment analysis
Automatic error correction (we will only gloss over it, as the next block will be dedicated to a case study on this)
slide-59
SLIDE 59

Problem solving: Sentiment analysis

Case study: We are hired by a well-known advertising company (A). Their client wants to know how their new snack product fares against their major competitor. The competition is fierce, and their client needs an automated system in order to keep up with the avalanche of information found in social networks, blogs, etc. We are tasked with developing a system to solve this problem.

slide-60
SLIDE 60

Sentiment analysis: what we know

  • The client is focusing on social media and blogs.
  • The format of Tweets and microblogging services is different from that of traditional blogs.
  • The product might also be reviewed in specialized review sites. The writing will be different too.

slide-61
SLIDE 61

Sentiment analysis: what we know

Our system should, regardless of media:

  • 1. Retrieve the text in question
  • 2. If possible, retrieve any information about the poster.
  • 3. Analyze the text
  • 4. Determine whether the text is positive towards their product or not, or whether the competitor is preferred.

slide-62
SLIDE 62

Sentiment analysis: problems

➢ Microblogs provide very little text on which to base our decision.
➢ Multi-word expressions: “great taste” is clear enough, but “not so great taste” could throw a simple recognizer off-track.
➢ What resources are open to us? Do we have access to large, relevant corpora? Or are they locked out because of financial or legal reasons?
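
The “not so great taste” problem can be seen in a few lines of code. The sketch below compares a simple word-by-word recognizer with one that flips the polarity of a word when a negation appears shortly before it. The lexicon, the list of negators and the window size are assumptions made for the example.

```python
# Hypothetical mini-lexicon of word polarities.
LEXICON = {"great": 1, "good": 1, "tasty": 1, "bad": -1, "awful": -1}
NEGATORS = {"not", "never", "no"}

def naive_score(text):
    """Sum word polarities, ignoring context."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

def negation_aware_score(text, window=2):
    """Flip a word's polarity if a negator appears within `window` words before it."""
    words = text.lower().split()
    score = 0
    for i, word in enumerate(words):
        polarity = LEXICON.get(word, 0)
        if any(w in NEGATORS for w in words[max(0, i - window):i]):
            polarity = -polarity
        score += polarity
    return score

print(naive_score("not so great taste"), negation_aware_score("not so great taste"))
```

The naive scorer treats both phrases as positive; the second one at least catches this particular pattern, although real negation is considerably harder.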

slide-63
SLIDE 63

Sentiment analysis

Assume that we have the text and the author’s information: We can use a supervised learning system (create a corpus with positive or negative scores for each opinion regarding the product and train our system based on this) → Rotten Tomatoes (each review is matched to a numerical rating, 5 stars being the best).

slide-64
SLIDE 64

Sentiment analysis

We must bear in mind that there are some steps before we can start processing information. For instance, if we use a dictionary, we will have to use lemmatization. Otherwise, our recognizer would fail because it would not know things such as that the word “liked” is a past form of the verb “to like”.
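
A minimal sketch of why lemmatization matters when looking words up in a dictionary. Both tables are tiny and invented for the example; a real lemmatizer would rely on much larger resources.

```python
# Hypothetical lookup table from inflected forms to lemmas.
LEMMAS = {"liked": "like", "likes": "like", "liking": "like",
          "loved": "love", "hated": "hate"}
# Hypothetical polarity dictionary, keyed by lemma.
POLARITY = {"like": 1, "love": 1, "hate": -1}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

def polarity_of(word):
    return POLARITY.get(lemmatize(word.lower()), 0)

# Without lemmatization 'liked' is not found; with it, it maps to 'like'.
print(POLARITY.get("liked", 0), polarity_of("liked"))
```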

slide-65
SLIDE 65

Sentiment analysis

We can score phrases according to their probability of being positive or negative. A parser can provide extra information, such as Part of Speech (POS) information. In this way, we will try to reconstruct the meaning of the phrase.

slide-66
SLIDE 66

Sentiment analysis

We could use a statistical classifier (such as Support Vector Machines, for instance). These classifiers try to extract the main features of a negative or positive text (though they can be applied for other purposes as well) and, according to these, train the system so that it can classify texts for which no manually annotated version is available.
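
A Support Vector Machine would normally come from a library. To keep the example self-contained, the sketch below swaps in a much simpler statistical classifier (a Naive Bayes classifier with add-one smoothing), which follows the same train-then-classify idea. The training texts, labels and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label). Returns the counts the classifier needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical manually annotated training corpus.
training = [
    ("great taste love it", "positive"),
    ("really tasty snack", "positive"),
    ("awful taste hate it", "negative"),
    ("stale and bland snack", "negative"),
]
model = train(training)
print(classify("love this tasty snack", model))
```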

slide-67
SLIDE 67

Sentiment analysis

We could also combine both methods: For instance, we could use the statistical classifier, but if we observe that it always fails when it finds a certain word, we can provide it with a series of rules in relation to that error and correct or supplement the results of the initial classifier.

slide-68
SLIDE 68

Sentiment analysis

In order to find what technique yields the best results, experimentation is required. Then, we can count the number of false positives and false negatives that we have. There are many standards to test how effective an automatic system is, but one of the most used ones is known as “Precision and Recall”.
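
Precision is the share of the texts that the system labelled positive that really are positive; recall is the share of the really positive texts that the system found. A minimal sketch of the computation, with invented gold annotations and system output:

```python
def precision_recall(gold, predicted, positive_label="positive"):
    """Compute precision and recall for one label from two parallel lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    false_pos = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    false_neg = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

# Hypothetical gold annotations and system output for six texts.
gold      = ["positive", "positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative", "negative"]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here the system makes one false positive and one false negative, so both measures come out at two thirds.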

slide-69
SLIDE 69

Case study: automatic text correction

Students in our class find verbs difficult to master. Sometimes they use the wrong inflection for the pronoun that they are using. Sometimes they forget irregular past forms or how to produce a past continuous form. Also, they make numerous spelling mistakes, in both languages.

slide-70
SLIDE 70

Case study: automatic text correction

Most frequent errors:
“he are my father”
“I was talk to a very curios person”
“I swimmed two miles yesterday”

slide-71
SLIDE 71

Case study: automatic text correction

Work in small groups (2-3 people). You have 5 minutes to think of how you would address this problem. Then we will work through it together.