Information Retrieval for Development Hussein Suleman Digital - - PowerPoint PPT Presentation



SLIDE 1

Information Retrieval for Development

Hussein Suleman
Digital Libraries Laboratory @ Centre for ICT4D
Department of Computer Science, University of Cape Town
January 2019

SLIDE 2

Digital Libraries Lab @ Centre for ICT4D

Key Research Question

How do we use Information Retrieval / Data Mining /... to support Development in Africa?

SLIDE 3

Outline of Talk

• What is Development
• What is ICT for Development
• Challenges in IR 4 Development
• Collection Development
• African Language IR
• Low Resource Environments
• Development Interventions
• Where to next?

SLIDE 4

What is (Human/Socio-economic) Development?

SLIDE 5

Development Agendas

• UN Millennium Development Goals
• UN Millennium Declaration
• UN Sustainable Development Goals
• South Africa
  • National Development Plan (2012)
  • Growth, Employment and Redistribution (1996)
  • Reconstruction and Development Programme (1994)
• Africa-wide
  • New Partnership for Africa's Development (NEPAD)
  • ...

SLIDE 6

UN Millennium Development Goals

SLIDE 7

SLIDE 8

SA National Development Plan 2012-2030

The creation of jobs and the development of the economy

Development of the economic infrastructure: coal and gas, water, electricity and telecommunications

Environmental sustainability and management of environmental resources

Development of an inclusive rural economy

Regional and international trade

Housing and urban/rural planning

Education and training

Medical care

Safety and security

Building capacity for a developmental state

Fighting corruption

Nation building for a unified society

SLIDE 9

Programme of the Austrian Federal Govt 2008-2013

SLIDE 10

Nigeria Vision 20:2020

SLIDE 11

Zambia 7th National Dev Plan

SLIDE 12

The Decolonisation Debates

• How do we decolonise African society?
  • Different knowledge systems? ICT? Do we do ICT differently?
  • Do we need a programming language with keywords in isiZulu?
  • Do we teach programming in isiZulu?
  • Public intellectuals or universal scholars?
  • Excellence vs. Local Relevance
• Why is AFIRM mostly run by people from the Northern Hemisphere?
• What do they say: Ngũgĩ wa Thiong'o, Mahmood Mamdani, ...

SLIDE 13

What is ICT for Development

SLIDE 14

What is ICT4D: Example 1/4

SLIDE 15

What is ICT4D: Example 2/4

SLIDE 16

What is ICT4D: Example 3/4

SLIDE 17

What is ICT4D: Example 4/4

SLIDE 18

The Big Question

• Can we use ICT to aid human development?
• Can we use IR/DM to aid human development?

SLIDE 19

Challenges: IR for Development

SLIDE 20

Goal: IR for Human Development

• Human Dignity
  • Promote the status of local languages.
  • Create tools that support local languages.
  • Increase presence of local languages.
• IR4D
  • IR for employment, governance, health, etc.

SLIDE 21

Challenge 1: IR algorithms

• Little algorithmic support in IR/NLP.
• Are there language-specific tools/algorithms in African languages?
  • How well do they work?
  • How many languages are supported?

SLIDE 22

Challenge 2: Data

• Very little and noisy data.
• <1000 Wikipedia documents for some African languages.
• How much electronic content do we produce?

SLIDE 23

Challenge 3: Fuzziness

• Unclear language boundaries.
• How many languages are there?
  • How many have been clearly defined?
  • How many are managed?
• What is a language and what is a dialect/accent?

SLIDE 24

Challenge 4: Digital Divide

• Access / Knowledge
• How many people understand how to search?
• How many people use search?
• Do people even have Internet access?

SLIDE 25

Challenge 5: Many Languages

• Multilingualism is the norm.
• How many languages do people use?
• Are documents/queries in one language or are they mixed?

SLIDE 26

Challenge 6: Resource Limits

• We do not have the resources.
• Limited skills among researchers.
• Limited bandwidth to access data.
• Limited skills among users.
• Limited funding for anything.

SLIDE 27

Collection Development

SLIDE 28

Corpora

• Corpora for African Language IR are rare.
• There are limited corpora for speech recognition, speech synthesis, MT, etc.
• Very few documents online.
• Wikipedia has <1000 (poor quality) pages in many Bantu languages!
• Lots of out-of-vocabulary (OOV) words, loan words, mixed texts, etc.

SLIDE 29

Corpora: Language Detection

Meluleki Dube, U/G

• Can we successfully determine the language, from among a group of 9 related African languages, of a piece of text?
  • Web page?
  • Tweet?
• Trigram modelling and model alignment distance gives up to 92% accuracy.
• Incorrect predictions scatter by language similarity.
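
A minimal sketch of trigram-based language identification, for illustration only. It assumes character-trigram rank profiles compared with an out-of-place rank distance, which stands in for the "model alignment distance" named on the slide; the project's actual method may differ. The function names and the tiny training strings are hypothetical, and real profiles would need far more text per language.

```python
from collections import Counter


def trigram_profile(text, top_n=300):
    """Return the text's most frequent character trigrams, most frequent first."""
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_n]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; unseen trigrams get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))


def detect_language(text, lang_profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = trigram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]))


# Toy training snippets (assumed for illustration, far too small for real use).
training = {
    "isiZulu": "ngiyabonga kakhulu sawubona unjani ngiyaphila",
    "English": "thank you very much hello how are you i am fine",
}
profiles = {lang: trigram_profile(text) for lang, text in training.items()}
print(detect_language("hello how are you", profiles))
```

With closely related languages the profiles overlap heavily, which is one plausible reason misclassifications would fall mostly on the nearest relatives.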

SLIDE 30

Corpora: Crowdsourcing

Sean Packham, MSc

• Parallel corpus in isiXhosa-English.
• Will people contribute if money paid is varied or there is no money but only gamification?
• Payment is the only criterion!

SLIDE 31

Corpora: SALANG

Andreas von Holy, Osher Shuman, Alon Bresler, BSc(Hons)

• Create a central portal for documents in any SA Bantu language, with gamification, multilingual search, etc.

SLIDE 32

Corpora: Long-term effects

Jackson Moji, MSc (current)

• Does gamification for corpus creation work in the long term?
  • Will people lose interest?
  • Will they continue to contribute?
  • How is intrinsic motivation affected by time?
• Extension of SALang project.

SLIDE 33

African Language IR

SLIDE 34

Mixed Language IR

Mohammed Mustafa Ali, PhD

• Noted that Google is language unaware.
• Poor results for mixed queries (queries in multiple languages).
  • Dominant languages are dominant in results.
  • Mixed language use is very popular in Africa.
• Solution: Examine queries and rerank based on language-based collection weights.
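
One way the reranking idea could look in code, as a rough sketch rather than the method from the thesis: estimate how much of the query belongs to each language, then scale each result's relevance score by the weight of its language. The lexicons, document ids, scores and function names below are all hypothetical.

```python
def language_weights(query_terms, lexicons):
    """Estimate the share of recognised query terms belonging to each language."""
    counts = {lang: sum(term in words for term in query_terms)
              for lang, words in lexicons.items()}
    total = sum(counts.values())
    if total == 0:
        # No term recognised: fall back to equal weights.
        return {lang: 1 / len(lexicons) for lang in lexicons}
    return {lang: n / total for lang, n in counts.items()}


def rerank(results, weights):
    """Reorder (doc_id, language, relevance) tuples by language-weighted relevance."""
    return sorted(results,
                  key=lambda r: r[2] * weights.get(r[1], 0.0),
                  reverse=True)


# Toy lexicons and results (assumed for illustration).
lexicons = {
    "English": {"weather", "news"},
    "isiZulu": {"izindaba", "namuhla"},
}
query = ["izindaba", "namuhla", "weather"]
results = [
    ("d1", "English", 0.9),
    ("d2", "isiZulu", 0.6),
    ("d3", "isiZulu", 0.4),
]
weights = language_weights(query, lexicons)
ranking = [doc_id for doc_id, _, _ in rerank(results, weights)]
print(ranking)
```

Here two of the three query terms are isiZulu, so the isiZulu document d2 moves above the English document d1 even though d1 has the higher raw score.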

SLIDE 35

Bantu Language IR

• Search engines in Bantu languages, especially South African languages (isiZulu, isiXhosa, etc.).
• Many core IR algorithms are unchanged but some language-specific algorithms needed:
  • Language identification
  • Text pre-processing and normalization
  • Ranking and reranking

SLIDE 36

Bantu Language IR: AfriWeb

Nkosana Malumba, Katlego Moukangwe, BSc(Hons)

• Zulu Search Engine.
• High accuracy in identifying isiZulu vs. English+Italian.
• Simple morphological parser outperformed simple stemmer in IR results.
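
To make the "simple stemmer" baseline concrete, here is a toy prefix-stripping sketch. It is not the AfriWeb implementation: the prefix list is a small, incomplete sample of isiZulu noun-class prefixes chosen for illustration, and `light_stem` is a hypothetical name.

```python
# Hypothetical, incomplete sample of isiZulu noun-class prefixes (illustration only).
PREFIXES = ["umu", "aba", "imi", "ama", "isi", "izi", "ubu", "uku"]


def light_stem(word, min_stem=3):
    """Strip one known prefix, provided at least min_stem characters remain."""
    word = word.lower()
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            return word[len(prefix):]
    return word


# Singular and plural forms conflate to the same index term.
for w in ["umuntu", "abantu", "isikole", "izikole"]:
    print(w, "->", light_stem(w))
```

A stripper like this knows nothing about word structure, so it will also cut any word that merely happens to begin with one of these strings; that kind of error is what a morphological parser is meant to avoid.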

SLIDE 37

Bantu Language IR: Transfer?

Nyasha Katemauswa, U/G

• Shona Search Engine.
• Can we adapt the isiZulu framework to get better results in chiShona?

Michael Kyeyune, U/G

• Xhosa Search Engine.
• Can we adapt the isiZulu framework to get better results in isiXhosa?

SLIDE 38

Bantu Language IR: Similar Language IR

Catherine Chavula, PhD (current); Sinead Urisohn, Andre Lopes, BSc(Hons)

• Exploit language similarity for those who can read multiple languages.
  • Reranking to emphasize language similarity in addition to relevance.
  • Universal language group text pre-processing, such as stemming.

SLIDE 39

Bantu Language IR: kiSwahili

Joseph Telemala, PhD (current)

• How do we support Swahili speakers?
  • Professionals want English for work.
  • Everyone wants kiSwahili for play.
• Who you are and what you are doing dictates query/result expectations.

SLIDE 40

IR in Low Resource Environments

SLIDE 41

Bantu Language IR: Speech UI

Morebodi Modise, MSc

• Speech-driven mobile search interface in isiXhosa.
• Works well, but educated people want English!

SLIDE 42

|Xam IR

• Extinct Khoisan language.
• Language used in documenting early South African history/culture (25000 pages of stories).
• No Unicode representation.

SLIDE 43

Digital Bleek and Lloyd Collection

SLIDE 44

Bleek and Lloyd: Low Resource IR

• IR engine within the browser: no network needed.
• Only simple transcriptions supported.

SLIDE 45

Bleek and Lloyd: Dictionary

Lebogang Molwantoa, Sanvir Manilal, Kyle Williams, BSc(Hons)

• Visual dictionary: pictures of words.
• Find meanings of words in stories by image search.

SLIDE 46

Bleek and Lloyd: Transcription

Kyle Williams, MSc; Ngoni Munyaradzi, MSc

• Using machine learning to transcribe |Xam.
• Training data manually generated.
• 45% accuracy at best.
• Crowdsourcing had 10% better performance.
  • Answer determined by agreement among 3 amateur transcribers.
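
The agreement step could be sketched as a simple vote. The slide says only that the answer comes from agreement among 3 transcribers; the majority rule (at least 2 of 3), the exact-match comparison and the function name below are assumptions, and the strings are placeholders rather than real |Xam text.

```python
from collections import Counter


def agreed_transcription(transcriptions, min_agree=2):
    """Return the transcription given by at least min_agree contributors, else None."""
    if not transcriptions:
        return None
    # Compare on whitespace-trimmed text; ties resolve to the first one seen.
    text, votes = Counter(t.strip() for t in transcriptions).most_common(1)[0]
    return text if votes >= min_agree else None


# Placeholder strings standing in for three volunteers' transcriptions of one line.
print(agreed_transcription(["abc", "abc ", "abd"]))
print(agreed_transcription(["abc", "abd", "abe"]))
```

Lines with no agreement return `None` and could be sent to further volunteers or to an expert.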

SLIDE 47

Bleek and Lloyd: Text Input

Sunkanmi Olaleye, MSc

• Inputting |Xam is non-trivial.
• Diacritics above, below and both; single and multiple characters.
• Custom Android keyboards for predictive and directed text entry in |Xam.

SLIDE 48

IR/DM for Development

SLIDE 49

IR for Development

Gina Paihama, PhD (current)

• How can we give users directed results to address unemployment?
• Relevance is more specific here:

SLIDE 50

DM for Development

Selvas Mwanza, PhD (current)

• Can we use Twitter data to evaluate developmental measures in society (e.g., level of free speech)?
• We have found an association between what people discuss (politics vs. entertainment) and how they discuss it.

SLIDE 51

What next?

SLIDE 52

Where we are

• Some early successes but:
  • Too many languages, with
  • Too few documents,
  • Too few resources (money/users), and
  • Too much mixing of languages in queries and documents.
• Lots of work still needed
• Lots of opportunities for research

SLIDE 53

questions, comments, ...

http://dl.cs.uct.ac.za/

enkosi, hamba kakuhle
thank you and go well