SLIDE 1

Data and Ethics in NLP

Sharon Goldwater 14 November 2016

Sharon Goldwater Data and Ethics 14 November 2016

SLIDE 2

NLP requires different types of data

  • Most NLP systems are supervised
    – Training data is annotated with tags, trees, word senses, etc.
  • Increasingly, systems are unsupervised or semi-supervised
    – Unannotated data is used alone, or in addition to annotated data
  • All systems require data for evaluation
    – Could be just more annotated data, but could be judgements from human users: e.g., on fluency, accuracy, etc.

SLIDE 3

Where does the data come from?

  • Annotated data: annotators usually paid by research grants (government or private) or by companies
  • Unannotated data: often collected from the web
  • Human evaluation data: collected in labs or online; again, usually paid by research grants

All of these raise ethical issues which you need to be aware of when using or collecting data.

SLIDE 4

Intellectual property issues

Annotation is expensive and time-consuming, so annotated data is usually distributed under explicit user/licensing agreements.

  • Paid licenses: e.g., the Linguistic Data Consortium (LDC) uses this model.
    – Researchers/institutions pay for individual corpora or buy an annual membership.
    – Edinburgh has had membership for many years, so you can use corpora like the Penn Treebank (and treebanks in Arabic, Czech, Chinese, etc.), Switchboard, CELEX, etc.
    – But you/we may not redistribute these outside the Uni (which is why we put them behind password-protected webpages).

SLIDE 5

Intellectual property issues

Annotation is expensive and time-consuming, so annotated data is usually distributed under explicit user/licensing agreements.

  • Freely available corpora: e.g., the Child Language Data Exchange System (CHILDES) uses this model.
    – Anyone can download the data (corpora in many languages donated by researchers around the world).
    – If used in a publication, must cite the CHILDES database and the contributor of the particular corpus.
    – Redistribution/modification follows a Creative Commons license.
  • Other free corpora may have different requirements, e.g., registering on a website, specific restrictions, etc.

SLIDE 6

Privacy issues

To build NLP systems for spontaneous interactions, we need to collect spontaneous data. But...

  • Are individuals identifiable in the data?
  • Is personal information included in (or inferrable from) the data?
  • What type of consent has been obtained from the individuals involved?

The answers to these questions will determine who is permitted access to the data, and for what.

SLIDE 7

Example: CHILDES database

Many of the corpora are recordings of spontaneous interactions between parents and children in their own homes.

  • Usually 1-2 hours at a time, at most once a week.
  • Parents must sign a consent agreement, including information about who will have access to the data.
  • In some cases, only transcripts (no recordings) are available, often with personal names removed.
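The last point, releasing transcripts with personal names removed, can be sketched as a minimal redaction pass. This is an illustration only: CHILDES and similar projects follow their own anonymisation conventions, and the function and name list below are hypothetical.

```python
import re

def redact_names(transcript: str, names: list[str]) -> str:
    """Replace each listed personal name (whole words, case-insensitive)
    with a placeholder token."""
    for name in names:
        transcript = re.sub(rf"\b{re.escape(name)}\b", "[NAME]",
                            transcript, flags=re.IGNORECASE)
    return transcript

print(redact_names("Anna gave the ball to anna.", ["Anna"]))
# -> [NAME] gave the ball to [NAME].
```

Real anonymisation is harder than this sketch suggests: nicknames, misspellings, and indirect identifiers (addresses, school names) are not caught by a simple word list.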

SLIDE 8

Example: Human Speechome Project

Deb Roy (MIT researcher) instrumented his own home to record all waking hours of his child from ages 0 to 3 (starting around 2006).

Image: http://www.media.mit.edu/cogmac/projects.html

SLIDE 9

Example: Human Speechome Project

Deb Roy (MIT researcher) instrumented his own home to record all waking hours of his child from ages 0 to 3 (starting around 2006).

  • Huge project involving massive storage and annotation issues; incredible effort and expense.
  • Huge potential to study language acquisition in incredible detail.
  • But for privacy reasons, “there is no plan to distribute or publish the complete original recordings”. Roy may consider “sharing appropriately coded and selected portions of the full corpus.”

SLIDE 10

Example: Twitter data

Lots of NLP researchers want to use it for sentiment analysis, event detection, sociolinguistics, etc.

  • Twitter allows downloads of a 1% sample of Tweets for free.
  • But subject to many restrictions (e.g., no redistribution, must delete any Tweets when users delete them, etc.)
  • This course uses data from a set of Tweets collected here, but you may not copy it onto your own computer or redistribute it.

The licensing agreement protects both Twitter’s IP and users’ privacy (though it also makes reproducing research results trickier).
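A hedged sketch of what the deletion requirement implies in practice: if you hold a local corpus keyed by Tweet ID, any IDs reported as deleted must be dropped before further use. The data structures and function names here are hypothetical illustrations, not part of any Twitter API.

```python
def prune_deleted(corpus: dict[int, str], deleted_ids: set[int]) -> dict[int, str]:
    """Return a copy of the local corpus with every deleted Tweet removed,
    as the licensing terms require."""
    return {tid: text for tid, text in corpus.items() if tid not in deleted_ids}

corpus = {101: "first tweet", 102: "second tweet", 103: "third tweet"}
print(prune_deleted(corpus, {102}))
# -> {101: 'first tweet', 103: 'third tweet'}
```

In a real pipeline this pruning would run regularly against the deletion notices Twitter sends, not once against a static set.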

SLIDE 11

Use of existing data sets

Usually straightforward to follow legal and ethical guidelines.

  • Don’t redistribute data without checking license agreements.
    – This includes modified versions of the data.
  • In most cases, you may store your own copy of data licensed by Edinburgh to use for University-related work; if not, we’ll tell you.

  • If in doubt, check with your instructor or project supervisor.

SLIDE 12

Collection of new data

Creating a new corpus, getting human evaluations of a system, etc.

  • Any work involving human participants, or personal or confidential data, requires ethical approval.
  • (Sometimes) subtle distinction: annotators vs. participants.
    – Annotators are recruited (and/or trained) for their expert knowledge, and are not subjects of the study.
    – Participants are recruited as non-experts, and may themselves be subjects of study.
  • Heightened approval requirements if participants include children, people with disabilities, etc.

SLIDE 13

Why ethical approval?

  • Image from http://www.prisonexp.org/ (the Stanford Prison Experiment), where you can find details, movie, etc.

SLIDE 14

How is it enforced?

  • Funding agencies and journals normally require universities to have ethics approval procedures, and researchers to follow them.
  • Companies must follow privacy laws; they are also self-policing based on public relations.
    – Though sometimes data which purports to be “anonymized” can still be identifiable... see for example:
      https://www.wired.com/2010/03/netflix-cancels-contest/
      https://en.wikipedia.org/wiki/AOL_search_data_leak

SLIDE 15

Example: Evaluating a system

You develop a machine translation system and want people to rate the output of the system for fluency and accuracy.

  • If you bring people into your lab to do this, you will need to get ethical approval.
  • If you use people on the Internet to do this, you will still need to get ethical approval.

Generally, cases like this only require a signed self-assessment confirming no further issues.

SLIDE 16

Example: language use on Twitter

Real paper: case study of one Twitter user’s use of spelling to indicate regional pronunciation.

  • The relevant data from the user is already public.
  • But that isn’t the same as giving informed consent to participate in a research study.
  • Username, profile information, example tweets, and results of the study are all described in the paper (i.e., personally identifying information).
  • Requires further ethical consideration: presumably the researcher contacted the individual for approval (I hope!).

SLIDE 17

Example: anti-spambot

Real student project: develop a system to automatically respond to spammers, trying to engage them in email conversation for as long as possible.

  • The person on the other end of the spam is still a person.
  • This project involves human participants, and ones who cannot give informed consent.
  • Requires further ethical consideration.

SLIDE 18

Example: user localization from audio

Real student proposal: learn what an individual’s daily patterns are using always-on audio recording from a mobile phone.

  • Plans to avoid needing subjects’ consent by running the data collection on own phone. (No ethical approval required for self-experimentation.)
  • Only plans to use non-speech audio data.
  • However, always-on recording will still capture other people’s speech.
  • Requires further ethical consideration.

SLIDE 19

What you need to know

  • Your supervisor should be aware of the School ethics procedures, and will help you fill out forms if required.
  • However, CS researchers are sometimes less aware than they should be!
  • So don’t be afraid to ask your supervisor if you think there might be an issue.
  • More information at the School website: http://www.ed.ac.uk/informatics/research/ethics

SLIDE 20

Summary

Use and collection of data for NLP requires consideration of

  • Intellectual property
  • Privacy
  • Other potential ethical issues.

Usually not difficult, but important to be aware.
