Machine Learning for NLP: Ethics and Machine Learning, Aurélie Herbelot - PowerPoint PPT Presentation



SLIDE 1

Machine Learning for NLP

Ethics and Machine Learning

Aurélie Herbelot 2019

Centre for Mind/Brain Sciences, University of Trento

1

SLIDE 2

Today

  • 1. Predicting or not predicting? That is the question.
  • 2. Data and people: personalisation, bubbling, privacy.
  • 3. The problem with representations: biases and big data.
  • 4. The problem with language.

2

SLIDE 3

Predicting or not predicting?

3

SLIDE 4

Brave New World

Artificial Intelligence and Life in 2030

https://ai100.stanford.edu/sites/default/files/ai_100_report_0831fnl.pdf

“Society is now at a crucial juncture in determining how to deploy AI-based technologies in ways that promote rather than hinder democratic values such as freedom, equality, and transparency.”

4

SLIDE 5

Brave New World

“As cars will become better drivers than people, city-dwellers will own fewer cars, live further from work, and spend time differently, leading to an entirely new urban organization.”

“Though quality education will always require active engagement by human teachers, AI promises to enhance education at all levels, especially by providing personalization at scale.”

“As dramatized in the movie Minority Report, predictive policing tools raise the specter of innocent people being unjustifiably targeted. But well-deployed AI prediction tools have the potential to actually remove or reduce human bias.”

5

SLIDE 6

Cambridge Analytica

  • The ML scandal of the last two years...
  • Used millions of Facebook profiles to (allegedly) influence US elections, the Brexit referendum, and many more political processes around the world.
  • Provided user-targeted ads after classifying profiles into psychological types.
  • Closed and reopened under the name Emerdata.

6

SLIDE 7

Palantir Technologies

  • Named after Lord of the Rings’ Palantír (all-seeing eye).¹
  • Two projects: Palantir Gotham (for defense and counter-terrorism) and Palantir Metropolis (for finance).
  • Billion-dollar company accumulating data from every possible source, and making predictions from that data.

¹https://www.forbes.com/sites/andygreenberg/2013/08/14/agent-of-intelligence-how-a-deviant-philosopher-built-palantir-a-cia-funded-data-mining-juggernaut/

7

SLIDE 8

Predictive policing

  • RAND Corporation: a think tank originally created to support US armed forces.
  • RAND Report on predictive policing:²

“Predictive policing – the application of analytical techniques, particularly quantitative techniques, to identify promising targets for police intervention and prevent or solve crime – can offer several advantages to law enforcement agencies. Policing that is smarter, more effective, and more proactive is clearly preferable to simply reacting to criminal acts. Predictive methods also allow police to make better use of limited resources.”

²https://www.rand.org/pubs/research_briefs/RB9735.html

8

SLIDE 9

ML and predicting

  • ML algorithms are fundamentally about predictions.
  • What is the quality of those predictions? Do we even want to make those predictions?
  • If the possible futures of an individual become part of the representation of that individual here and now, what does it mean for the way they are treated by institutions?
  • Remember: you too are a vector.

9

SLIDE 10

Data and people: personalisation, bubbling, privacy

10

SLIDE 11

Big data = quality

  • One argument for needing big data is that it is the only way to provide quality services in applications.
  • This is true when comparing a big data representation with aggregated human answers.
  • For instance, similarity-based evaluation of semantic vectors.

11

SLIDE 12

Similarity-based evaluations

Human output

sun sunlight 50.000000
automobile car 50.000000
river water 49.000000
stair staircase 49.000000
...
green lantern 18.000000
painting work 18.000000
pigeon round 18.000000
...
muscle tulip 1.000000
bikini pizza 1.000000
bakery zebra 0.000000

System output

stair staircase 0.913251552368
sun sunlight 0.727390960465
automobile car 0.740681924959
river water 0.501849324363
...
painting work 0.448091435945
green lantern 0.383044261062
...
bakery zebra 0.061804313745
bikini pizza 0.0561356056323
pigeon round 0.028243620524
muscle tulip 0.0142570835367
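To see how such a comparison works in practice, here is a minimal sketch: system cosines and human judgements live on different scales, so the standard move is to compare their rankings. The use of Spearman correlation here is the usual choice for this task, and the rounded system scores are taken from the lists above; everything else is a plain stdlib implementation.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# The ten pairs from the slide (system scores rounded):
human = [50, 50, 49, 49, 18, 18, 18, 1, 1, 0]
system = [0.727, 0.741, 0.502, 0.913, 0.383, 0.448, 0.028, 0.014, 0.056, 0.062]
rho = spearman(human, system)  # high rank correlation despite different scales
```

On this small sample rho comes out around 0.81: the system broadly reproduces the aggregated human ranking even though the raw scores are incomparable.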

12

SLIDE 13

The job of the machine

  • Setup 1: supervised setting. The system is trained on a subset of the above data, trying to replicate human judgements.
  • Human judgements are means, aggregated over participants. The system is never required to predict the tail of the distribution.
  • Setup 2: unsupervised setting. Vectors are simply gathered from corpus data. The data is an aggregate of what many people have said about a word.
  • In both cases: reproduction of majority opinion / majority word usage.

13

SLIDE 14

The need for personalisation

  • Safiya Noble: the black hair example.
  • Black hair can mean 1) hair of a black colour or 2) hair with a texture typical of black people.
  • If the representation of black is biased towards the colour, results for 2) will not be returned.
  • NB: this is a compositionality issue. More on this later!

14

SLIDE 15

Personalisation

  • A centralised view of decentralisation: if many people give their private data, ML can learn how to give personalised results.
  • A double-edged sword: the need for personalisation goes against the need for privacy.

15

SLIDE 16

Bubbling

Personalisation also often goes with bubbling – it is hard to find a happy middle ground.

16

SLIDE 17

Bubbling

16

SLIDE 18

The algorithm’s fault?

Yes, algorithms built for big data will require big data. But small data algorithms are hard to produce, and not so attractive to large companies. Also, speaker-dependent data is hardly ever publicly available.

17

SLIDE 19

The problem with representations

18

SLIDE 20

Biases in cognitive science

Decision-making: two systems (Kahneman & Tversky, 1973).
  • System 1 (automatic): fast, parallel, associative, slow-learning.
  • System 2 (effortful): slow, serial, controlled, rule-governed, flexible.
Over 95% of our cognition gets routed through System 1. We need to consciously override System 1 through System 2 to stop ourselves from acting according to stereotypes.

Credit: Yulia Tsvetkov. https://docs.google.com/presentation/d/1499G1yyAVwRaELO9MdZFIHrACjzeiBBuMKpwdPafneI/

19

SLIDE 21

Biases in cognitive science

20

SLIDE 22

Constructivism in philosophy

  • The main claim of constructivism is that discourse has an effect on reality.
  • People do not necessarily learn how things are ‘in fact’, but also integrate the linguistic patterns most characteristic of a certain phenomenon. This, again, has tremendous effects on reality – so-called ‘constructive’ effects.

21

SLIDE 23

Bias in image search

  • Search engines are averaging machines.
  • Big data algorithms necessarily reproduce social biases.
  • In fact, they even amplify those biases.

22

SLIDE 24

Bias in text search

23

SLIDE 25

Bias in search

  • Say the vector for EU is very close to unelected and undemocratic.
  • Say this is the vector used by the search algorithm when answering queries about the EU.
  • Returned pages will necessarily be biased towards critiques of the EU. Data reinforces System 1’s automatic associations, which will be activated most of the time.

24

SLIDE 26

Bias in machine translation

Hungarian does not have explicit marking of gender on verbs. How will Google Translate add the corresponding pronoun?

https://link.springer.com/article/10.1007/s00521-019-04144-6

25

SLIDE 27

The revelation...

(Duh...)

26

SLIDE 28

Datasets are biased

Zhao et al, 2017 - http://markyatskar.com/talks/ZWYOC17_slide.pdf

27

SLIDE 29

Datasets are biased

Zhao et al, 2017 - http://markyatskar.com/talks/ZWYOC17_slide.pdf

27

SLIDE 30

Datasets are biased

A system trained on biased data: behaviour after training.

Zhao et al, 2017 - http://markyatskar.com/talks/ZWYOC17_slide.pdf

27

SLIDE 31

Three main questions

  • Where are the biases? (Tomorrow)
  • How to erase them from representations? (Thursday)
  • How to ensure models don’t amplify biases? (Today)

28

SLIDE 32

Bias amplification

  • Supervised learning learns a function that generalises over the data.
  • Imagine a standard regression line across some data. Can you see how it might accentuate problems?
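A toy illustration of the point (the data and the 'female'/'cooking' dimensions are invented for this sketch, not taken from the slides): an ordinary least-squares line fitted to mostly-stereotypical points will pull the one counter-stereotypical point towards the trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# x: a hypothetical 'female' dimension, y: a hypothetical 'cooking' dimension.
# Six points sit on the stereotype diagonal; the last one does not.
xs = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.1]
ys = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.9]  # last point: low 'female', high 'cooking'

a, b = fit_line(xs, ys)
pred = a * 0.1 + b  # what the fitted line predicts for the outlier
# pred lands around 0.36, far below the true 0.9: the counter-stereotypical
# point is 'normalised' by the regression line.
```

The fitted line is dominated by the majority pattern, which is exactly the normalisation effect discussed on the next slide.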

29

SLIDE 33

Bias amplification

The point marked by an arrow is fairly ‘non-female’ and high on the ‘cooking’ dimension, but it gets normalised by the regression line.

30

SLIDE 34

Bias amplification

Still from Zhao et al, 2017 - http://markyatskar.com/talks/ZWYOC17_slide.pdf

31

SLIDE 35

What are those gender ratios?

32

SLIDE 36

Preventing bias amplification

  • Can we train a system so that:
  • we prevent bias amplification;
  • we don’t decrease performance (warning: we don’t want to overfit!)?
  • NB: we are not actually removing bias from the original data, just making sure it does not get worse.

33

SLIDE 37

Preventing bias amplification

34

SLIDE 38

Remember SVMs?

  • When implementing an SVM, we have to tune the hyperparameter C, which controls how many datapoints can violate the margin.
  • Similarly, we can set a constraint on the learning problem so that

|Training ratio − Predicted ratio| ≤ margin

  • That is, the solution to our regression problem should not emphasise the bias present in the corpus.
  • The technique is ‘safe’ from a performance point of view because the system still has to find the best possible solution to the regression problem.
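The constraint idea can be sketched in a few lines. Note this is a much-simplified stand-in: Zhao et al. enforce the constraint with Lagrangian relaxation over structured predictions, whereas this toy version (with invented numbers) just post-processes binary decisions, flipping the least confident positives until the predicted ratio falls back within the margin of the training ratio.

```python
def constrain_ratio(scores, train_ratio, margin):
    """scores: per-instance P(label = 'woman'); returns constrained decisions.

    Only handles upward amplification (predicted ratio > training ratio),
    which is the direction discussed on the slides.
    """
    preds = [s >= 0.5 for s in scores]          # unconstrained argmax decisions
    ratio = sum(preds) / len(preds)
    for i in sorted(range(len(scores)), key=lambda i: scores[i]):
        if ratio <= train_ratio + margin:
            break                               # constraint satisfied
        if preds[i]:
            preds[i] = False                    # flip least confident positive
            ratio = sum(preds) / len(preds)
    return preds, ratio

# Illustrative setup: 60% of training 'cooking' images have a female agent,
# but the unconstrained model predicts 90%. The constraint pulls it back.
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.2]
preds, ratio = constrain_ratio(scores, train_ratio=0.6, margin=0.05)
# ratio ends at 0.6, inside the allowed band [0.55, 0.65]
```

Flipping the lowest-confidence decisions first is what keeps the technique 'safe': the model gives up only the predictions it was least sure about.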

35

SLIDE 39

Results from Zhao et al, 2017

36

SLIDE 40

Where are the biases?

  • Research concentrates on gender / race / disability bias. Methods such as removal of bias amplification act upon the system as a whole, which is positive.
  • But of course, other aspects of life can be biased. (See the example of the EU vector previously.)
  • Examples: propaganda, commercially-biased texts...

37

SLIDE 41

Debiasing the data

  • This is equivalent to ‘fixing the world’.
  • Why do people talk the way they talk? Why do certain kinds of people contribute more to Web content than others? How are datasets sampled and constructed?
  • Who is to say what ‘unbiased’ data should look like? (More on Thursday!)

38

SLIDE 42

The problem with language

39

SLIDE 43

Language: inherently biased?

  • Interestingly, the way language is structured and acquired lends itself to bias creation.
  • Language evolved to satisfy particular constraints related to conceptualisation and communication.
  • Today, we will look at two such constraints: productivity and efficiency.

40

SLIDE 44

Language: inherently biased?

  • Composition and productivity: language makes use of the compositionality principle, which lets us be infinitely productive using finite means. But is it the case that Comp(A, B) = AB?
  • Efficiency: certain constructions are more ‘innate’ than others. They make language generation and interpretation efficient, but they are not the most discriminative...

41

SLIDE 45

Commercial search

https://www.google.com/about/datacenters/gallery/#/tech/

42

SLIDE 46

Commercial search

3.6 billion searches every day ... trillions of pages (???)

43

SLIDE 47

Commercial search

3.6 billion searches every day ... over 45 billion (contentful) pages

43

SLIDE 48

Commercial search

Does it work?

43

SLIDE 49

Searching for good films

44

SLIDE 50

Understanding taxation

45

SLIDE 51

Being a good human: speak Searchenginese

46

SLIDE 52

Intersectionality

Kimberlé Crenshaw (1991): the combination (intersection) of various forms of inequality makes a qualitative difference not only to the self-perception/identity of social actors, but also to the way they are addressed through politics, legislation and other institutions.

  • Founding case: a lawsuit that African American women filed against the hiring policy of General Motors (DeGraffenreid v. General Motors, 1977).
  • Crenshaw made the case for a reform of US anti-discrimination law.
  • Her work was further influential in the drafting of the equality clause in the South African Constitution.
  • The concept black woman is not the addition of black and woman.

47

SLIDE 53

Intersectionality in linguistic terms

  • Distributional compositional semantics: the intersective composition of two elements should return a new vector.
  • Let’s take two old-fashioned models:
  • models that emulate the vector of the phrase itself, as it would be observed given a large enough corpus (Guevara, 2010 and 2011; Baroni and Zamparelli, 2010). Trained and evaluated against phrases’ distributions.
  • models which only focus on the composition operation, independently from the phrasal distribution (Mitchell and Lapata, 2010). Task-based evaluation.
  • We will call the former phrasal models and the latter intersective models.

48

SLIDE 54

Intersectionality in linguistic terms

  • The intersective model par excellence is pointwise multiplication.
  • Reminder from formal semantics: the intersection between sets is what belongs to both sets.
  • Vector multiplication implements this by zeroing any dimension that is 0 in either vector.
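The operation itself is tiny. In this sketch the context dimensions and all the weights are invented for illustration; the point is only the zeroing behaviour of the pointwise product.

```python
def compose_mult(u, v):
    """Pointwise product: any dimension that is 0 in either vector is zeroed."""
    return [a * b for a, b in zip(u, v)]

# Hypothetical context dimensions and weights:
dims = ["colour", "hair", "racism", "feminist"]
black = [0.9, 0.4, 0.6, 0.0]
woman = [0.1, 0.5, 0.0, 0.7]

black_woman = compose_mult(black, woman)
# The 'racism' and 'feminist' dimensions both vanish, because each is 0 in
# one of the two component vectors - even if a phrasal vector for the
# phrase itself would rank such contexts highly.
```

This is the formal-semantics intersection transplanted into vector space, and it previews the problem two slides ahead: non-intersective meaning cannot survive the product.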

49

SLIDE 55

The meaning of phrases

Is intersection enough? A big city: just a city which is big? It may also be related to loud, underground, advertisement, crowd, show, sightseeing, gentrification...

  • There is more to composition than intersection (Partee, 1994).
  • There may be ‘extra’ (non-intersective) meaning which can be clearly observed in phrasal vectors and which is ‘hidden’ in vectors that are the result of a purely intersective operation.

50

SLIDE 56

The vector of black woman

Multiplicative model: stripes, makeup, pepper, hole, racial, white, woman, spots, races, women, whites, holes, colours, belt, shirt, african-american, pale, yellow, wears, powder, coloured, wear, wore, colour, dressed, racism, leather, colors, hair, colored, trim, shorts, silk, throat, patch, jacket, dress, metal, scarlet, worn, grey, wearing, shoes, purple, native, gray, breast, slaves, color, vein, tail, hat, painted, uniforms, collar, dark, coat, fur, olive, bear, boots, paint, red, lined, canadiens, predominantly, slavery

Phrasal model: racism, feminist, women’s, slavery, negro, ideology, tyler, filmmaker, african-american, ain’t, elderly, whites, nursing, patricia, abbott, gloria, freeman, terrestrial, shirley, profession, julia, abortion, diane, possibilities, argues, reunion, hiv, blacks, inability, indies, sexually, giuseppe, perry, vince, portraits, prevention, beacon, gender, attractive, tucker, fountain, riley, beck, comfortable, stern, paradise, twist, anthology, brave, protective, lesbian, domestic, feared, breast, collective, barbara, liberation, racial, rosa, riot, aunt, equality, rape, lawyers, playwright, white, argued, documentary, carol, isn’t, experiences, witch, men, spoke, slaves, depicted, teenage, photos, resident, lifestyle, aids, commons, slave, freedom, exploitation, clerk, tired, romantic, harlem, celebrate, quran, interred, stargate, alvin, ada, katherine, immense

Herbelot et al (2012). Most characteristic contexts for black woman. Multiplicative and phrasal model.

51

SLIDE 57

So what should we do when we compose?

  • Phrasal vectors are expensive to obtain. We need to store and update extra target vectors in our semantic space.
  • They may well suffer from data sparsity. (Remember the issue with larger n-grams in language modeling!)
  • Composed vectors may not express the full meaning of the phrase. They include whichever biases were included in their component vectors.
  • And which composition operation is the best one? (Not just from the point of view of performance!)

52

SLIDE 58

Moving on... Generalised quantifiers

  • Quantifiers have a restrictor and a scope.
    All cats are mammals. Some cats are ginger.
  • Simple interpretation: set overlap.
  • The logic selects individuals over which to quantify: ∃x, ∀x, etc.
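The set-overlap interpretation can be sketched directly with Python sets (the toy individuals are invented for illustration): the restrictor picks out one set, the scope another, and each quantifier is a condition on how they overlap.

```python
# Restrictor and scope as sets of individuals:
cats = {"felix", "tom", "whiskers"}
mammals = {"felix", "tom", "whiskers", "rex", "dumbo"}
ginger = {"tom", "garfield"}

# 'All cats are mammals': the restrictor is a subset of the scope.
all_cats_are_mammals = cats <= mammals       # True
# 'Some cats are ginger': restrictor and scope have a non-empty overlap.
some_cats_are_ginger = bool(cats & ginger)   # True
# 'No cats are ginger' would require an empty intersection.
no_cats_are_ginger = not (cats & ginger)     # False
```

Each line is the direct set-theoretic reading of ∀ and ∃; the next slide shows the quantifiers for which no such clean condition exists.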

53

SLIDE 59

Beyond ∃ and ∀

  • no: monotone decreasing.
  • most: what is most? More than half? Nearly all?
  • many: Many cars have a GPS, Many dogs have three legs.
  • the, a: The cat sleeps, The cat is a mammal, A cat sleeps, A cat is independent, Have you fed the fish?
  • ∅: generic bare plurals: Cats are mammals, Ducks lay eggs, Mosquitoes carry malaria; existential bare plurals: Students came this morning (Carlson, 1977).
  • ...

54

SLIDE 60

The psychology of quantifiers

  • Children acquire quantifiers after generics (Hollander et al 2002).
  • Children acquire numerical abilities (counting) after the Approximate Number Sense (ANS) (Mazzocco et al 2011).
  • Adults make quantification ‘mistakes’ linked to over-generalisation: (All) ducks lay eggs (Leslie et al 2011).

55

SLIDE 61

Non-grounded quantification

  • All cats are mammals, Most cats have four legs, We had profiteroles for dessert (at the restaurant last night).
  • In non-grounded quantification, it is often unclear what exactly the restrictor’s set consists of. E.g. no one knows the exact composition of the set of cats.
  • Often, the set will anyway be too large to count: Most ants have six legs.

56

SLIDE 62

Quantification biases

  • Women like cooking, Immigrants receive money from the state = few, some, most, all?
  • Generics are efficient constructions which don’t require a commitment to a quantifier and can be left ‘vague’.
  • Because of the over-generalisation bias, people are likely to interpret such statements as universals.
  • (Machines don’t even bother with quantification.)

57

SLIDE 63

Can machines repair language?

0.042 seussentennial, 0.041 scaredy, 0.035 saber-toothed, 0.034 un-neutered, 0.034 meow, 0.034 unneutered, 0.033 fanciers, 0.033 pussy, 0.033 pedigreed, 0.032 sabre-toothed, 0.032 tabby, 0.032 civet, 0.032 redtail, 0.032 meowing, 0.032 felis, 0.032 whiskers, 0.032 morphosys, 0.031 meows, 0.031 scratcher, ...

1 walks, 1 purrs, 1 meows, 1 has-eyes, 1 has-a_heart, 1 has-a_head, 1 has-whiskers, 1 has-paws, 1 has-fur, 1 has-claws, 1 has-a_tail, 1 has-4_legs, 1 an-animal, 1 a-mammal, 1 a-feline, 0.7 is-independent, 0.7 eats-mice, 0.7 is-carnivorous, 0.3 is-domestic, ...

58

SLIDE 64

Conclusion

59

SLIDE 65

Be good

  • Low quality of algorithms: much reliance on big data, mostly implementing ‘System 1’ of decision-making.
  • Reproduction of social biases: the machine seems to have learnt all that is bad from the data.
  • Centralisation of data: how this relates to the type of algorithms that are used.
  • (Lack of) personalisation: a double-edged sword.

60

SLIDE 66

Be good

  • Be involved in small data!
  • Understand language and its inherent biases.

61