COMMUNITY TRANSLATION IN AFRICA


  1. COMMUNITY TRANSLATION IN AFRICA
     DENIS GIKUNDA, LOCALIZATION PRG MANAGER
     W3C: The Multilingual Web: Where are we?
     Friday, October 29, 2010
     Agenda: Google in Africa; Local language content; Tools; Methodology (x 3)

  2. GOOGLE IN AFRICA: WHAT, WHO, WHERE
     Google confidential & proprietary
     • Making the internet an integral part of everyday life in Africa
     • Access, Relevance, Sustainability
     • Product Development, Engineering, Localization, Business Development, Marketing, PR, Sales*.
     +San Francisco, Zurich, London, New York, Dublin, Tel Aviv, Haifa

  3. AFRICAN LANGUAGES
     Landscape
     • Highest language density in the world [2k+ languages]
     • Over 100 languages with 1M+ speakers
     • 12 - 15 macro languages reach ~60% of indigenous language speakers
     • Most use Latin script with extended diacritics, with the exception of Amharic (ET).
     Policy
     • English/French/Portuguese predominantly used as the official language or the language of instruction in education
     • Exceptions are Amharic (ET), Swahili (TZ), Setswana (BW), and 11 South African local languages.
     • Large policy formulation gaps with respect to language/education/ICT, hence low demand for local language services. Potential partners are UNESCO, ANLOC, IDRC
     Status
     • African languages have remained a largely oral, informal phenomenon. Very few books, newspapers, or publications have been developed due to cost.
     • Oral literature, indigenous knowledge, cultural novelty, and creativity remain unamplified, and are lost over generations.
     • The internet presents an opportunity to bootstrap the written form of African languages.

  4. Native speakers online (M) vs. Wikipedia articles (K)
     Sources: http://www.internetworldstats.com/stats7.htm, http://stats.wikimedia.org/EN/
     Languages shown: Amharic (am), Swahili (sw), Arabic (ar), Russian (ru), Chinese (zh), English (en)
     • Negligible African language content relative to speakers online
     • Stunted organic growth of content relative to user growth
     • Some efforts show promise of impact
     [Charts: native speakers online (M, axis 0-600) vs. Wikipedia articles (K, axis 0-4000); new articles per day by language, 2006-2010 (axis 0-3,000)]
     New articles per day: am 2; sw 29; ar 61; ru 529; zh 185; en 1351; all langs 8457
     Internet user growth (columns labeled 2000-2009 and 2000-2010; two further percentage columns are unlabeled in the transcript):
     am 2810%, 13%, 22%; sw 247.8%, 42%, 106%; ar 1545%, 165%, 143%; ru 1125.8%, 239%, 220%; zh 894.8%, 246%, 213%; en 226.7%, 124%, 110%; all langs 342.2%, 226%, 202%

  5. USER GENERATED CONTENT
     • Users first generate content, or content that draws in users?
     [Timeline, years marked 2001, 2005, 2007, 2009, 2009: Google in Your Language; Google Translate (MT); Community Translation Program; Google Translator Toolkit; Google Translate (MT) Afrikaans & Swahili; Voice Search]

  6. TOOLS
     • Automatic translation between 2,500+ language pairs
     • Human translation between 100,000+ language pairs
     • WYSIWYG display for MediaWiki text (not just Wikipedia)
     • Direct publish to Wikipedia (preview mode only)
     Google Sponsored Projects
     • Indic languages: 10MM+ words
     • Arabic: 5MM+ words
     • Swahili: 1MM+ words

  10. COMMUNITY TRANSLATION
     In a nutshell
     • Use a toolkit that combines MT, Glossary matching & global TM, and allows online collaborative work.
     • Quality is vetted by local language specialists, journalists, publishers.
     • Google Web Search Interface in top 100 African languages.
     • Translation Party model - a fun, collaborative & social 2 day workshop involving students studying CS & language.
     Outcomes
     • 300+ volunteers, 10+ Universities
     • 24 language UIs launched.
     • Surge in search queries
     Approach
     • Prioritize against internet penetration, usage status, and content available.
     • Inheritance, blind test
     Challenges
     • Locale selection & disambiguation
     • Terminology harmonization, release.
     • Glossary development
     • Incentive / Reward. Short term: Certificate, Training, Social, curriculum centered. Long term: recognition, paid work.
     • Internet Access
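
The toolkit idea on slide 10 (MT combined with glossary matching and a shared TM) can be illustrated with a small sketch. This is not the Google Translator Toolkit's actual logic, which the slides do not describe. The function name `suggest`, the `cutoff` value, the dictionary-based TM and glossary, and the sample strings are all assumptions made for illustration.

```python
import difflib


def suggest(source, tm, glossary, machine_translate, cutoff=0.8):
    """Propose a draft translation for one source segment.

    tm                -- dict of previously translated segments (source -> target)
    glossary          -- dict of agreed terms (source term -> target term)
    machine_translate -- any callable str -> str standing in for an MT engine
    cutoff            -- hypothetical similarity threshold for reusing a TM entry
    """
    # 1. Prefer a translation-memory entry that is close enough to the source.
    matches = difflib.get_close_matches(source, list(tm), n=1, cutoff=cutoff)
    if matches:
        draft, origin = tm[matches[0]], "tm"
    else:
        # 2. Otherwise fall back to machine translation as a starting point.
        draft, origin = machine_translate(source), "mt"

    # 3. Surface glossary terms found in the source so the translator
    #    can check that the draft uses the harmonized terminology.
    hits = {
        term: target
        for term, target in glossary.items()
        if term.lower() in source.lower()
    }
    return {"draft": draft, "origin": origin, "glossary_hits": hits}


# Illustrative sample data only.
tm = {"Search the web": "Tafuta kwenye wavuti"}
glossary = {"web": "wavuti"}
fake_mt = lambda text: "[MT] " + text  # placeholder, not a real MT call

from_tm = suggest("Search the web", tm, glossary, fake_mt)
from_mt = suggest("Open settings", tm, glossary, fake_mt)
```

The point of the ordering is that shared human translations are reused before MT is consulted, which is one plausible reading of how terminology and TM sharing lower the barrier for volunteers.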

  11. [Chart: Usage of African language interfaces over 5 years (search queries). Marker A: SSA community translation program begins]
     As the internet expands into low-penetration regions, demand for local language services & content grows.

  12. In a nutshell
     • Wikipedia: #3 content property globally (Alexa). 60% referrals from Google.
     • Contest: grow Swahili Wikipedia articles by 500K words. Translate/author preselected, high traffic, substantive, relevant articles, using Google Translate/Google Translator Toolkit.
     • Partners: 7 Universities in Kenya, Tanzania over 6 week duration.
     • Prizes: Netbooks, Internet modems, phones, and Google Schwag.
     Outcomes
     • Sw wiki pages: 3/10 - 9/10
     • +1600 Articles (+14%) | 7000 Articles in 10 months | 1.9M words (100% CAGR), 800 registrants | 10 active contributors
     Approach
     • Content structure part of quality metric. Online training, using videos.
     • MT as an enabler, prevent publishing with <50% human translation.
     • Contest model. Partnership with decentralized Wikipedia Communities. Content focus (entertainment, local knowledge, sports)
     Challenges
     • Process: Quality review, reversions, line by line translation.
     • Technical: Published MT, markup
     • Sustained contribution
     • References become multilingual?
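
The "<50% human translation" rule on slide 12 amounts to a simple publish gate. The slides do not say how the human share was measured, so the sketch below assumes a word-weighted share of segments that differ from the raw MT output. The names `Segment`, `human_share`, and `can_publish` are hypothetical.

```python
from dataclasses import dataclass

# From the slide: publishing is blocked below 50% human translation.
HUMAN_SHARE_THRESHOLD = 0.5


@dataclass
class Segment:
    machine_text: str  # raw MT output for this segment
    final_text: str    # text the contributor wants to publish


def human_share(segments):
    """Share of words that sit in segments a human changed after MT.

    Assumption: a segment counts as human-translated if its final text
    differs from the MT output at all.
    """
    total = sum(len(s.final_text.split()) for s in segments)
    if total == 0:
        return 0.0
    edited = sum(
        len(s.final_text.split())
        for s in segments
        if s.final_text.strip() != s.machine_text.strip()
    )
    return edited / total


def can_publish(segments):
    """Allow publishing only when the human share meets the threshold."""
    return human_share(segments) >= HUMAN_SHARE_THRESHOLD


half_edited = [Segment("a b", "a b"), Segment("c d", "x y")]
mostly_mt = [Segment("a b c", "a b c"), Segment("d", "e")]
```

Here `can_publish(half_edited)` passes because two of four words come from an edited segment, while `can_publish(mostly_mt)` is blocked at one word in four. A stricter gate could measure edit distance per segment instead of treating any change as human work.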

  13. sitescontent.google.com/healthspeaks
     In a nutshell
     • Background: High quality health information is particularly scarce in foreign languages, affecting arguably the most needy users.
     • Volunteer effort driven by Google.org. Participants are mainly medical student/faculty communities. Google matches every word in $1 of funding towards local health organization.
     • Targeting Hindi, Arabic, Swahili users
     Outcomes
     • >2000 registrants
     • ~1000 articles claimed
     • <10% published
     • >22,000 page views
     Approach
     • Seed with paid translations, and professionally developed terminology to maximize TM leveraging in Google Translator Toolkit.
     • Find partners with vested interest in the content.
     • Continue to work closely with decentralized communities -> Submit to talk page.
     Challenges
     • Audience/expertise disparity
     • Overwrites
     • Sustained Contribution

  14. WHERE ARE WE?
     Community
     • The community needs to be center stage for content to happen organically. Content will grow around communities' needs.
     Incentive / reward mechanisms
     • Should vary based on audience, content type, and short/long term. Short term: Contest prizes, accreditation, social networking. Longer term: Job opportunities, paid translation work.
     Access
     • The cost of reliable PC-based internet access is a real inhibitor to access. Will mobile be an enabler?
     Tools / Platforms / APIs
     • Terminology & TM sharing via tools lowers the barrier for translation and allows more people to participate.
     Standards
     • Still lacking for African languages with respect to (i) variant/dialect classification (ii) term harmonization

  15. Discussion
     • dgikunda@google.com
     • @kariithi
