Assignment 1 - Corpus Creation



SLIDE 1

Assignment 1 - Corpus Creation

Marko Ložajić

University of Tübingen

June 5, 2019

Marko Ložajić (University of Tübingen) Assignment 1 June 5, 2019 1 / 15

SLIDE 5

General remarks

- Great job!
- Please include honor code
- Please do not commit to assignment repository after deadline
- Reminder: worst lab doesn't count!

SLIDE 6

General (Python) comment

    # Recommended way to open a file in Python 3:
    with open(file, "r") as f:
        ...

    # As opposed to:
    f = open(file, "r")
    ...
    f.close()

    # As well as:
    import io
    with io.open(file, "r") as f:
        ...
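A minimal sketch of why the `with` form is the recommended one: the file is closed when the block exits, even if an exception is raised inside it. The temporary file and all variable names here are made up for illustration.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

with open(path, "w", encoding="utf-8") as f:
    f.write("hello\n")

handle = None
try:
    with open(path, "r", encoding="utf-8") as f:
        handle = f
        first_line = f.readline()
        raise ValueError("simulated failure while reading")
except ValueError:
    pass

# The context manager closed the file despite the exception.
closed_after_error = handle.closed
os.remove(path)
```

With the explicit `open` / `close` pair, the same exception would skip `f.close()` unless it were wrapped in `try` / `finally`.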

SLIDE 7

Recap

1. Construct distinctive word lists for English and German
2. Collect tweets containing words in both languages
3. Have langdetect tell you how you did

SLIDE 8

Part 1 - Getting the most frequent words (sketch)

Goal: We want the 5000 most frequent English words and the 2000 most frequent German words, such that no word within the 20000 most common words of the other language is selected.

    import gzip
    from collections import Counter

    def frequent_words(corpus):
        c = Counter()
        with gzip.open(corpus, "rt", encoding="utf-8") as f:
            for line in f:
                for word in line.split():
                    if len(word) > 3:
                        c[word] += 1  # or word.lower()
        # return all words, not just first 20,000!
        return [word for word, _freq in c.most_common()]
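The counting step can be tried end to end on a toy corpus. This is only a sketch: the two-line corpus, the temporary file, and the `min_len` parameter are invented for the example, and lowercasing is applied here although the slide leaves it optional.

```python
import gzip
import os
import tempfile
from collections import Counter


def frequent_words(corpus_path, min_len=4):
    """Return all words of at least min_len characters, most frequent first."""
    counts = Counter()
    with gzip.open(corpus_path, "rt", encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                if len(word) >= min_len:
                    counts[word.lower()] += 1
    return [word for word, _freq in counts.most_common()]


# Toy corpus, invented for this example.
fd, path = tempfile.mkstemp(suffix=".gz")
os.close(fd)
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("this house is that house\n")
    f.write("That house over there\n")

ranked = frequent_words(path)
os.remove(path)
```

Here `ranked` starts with "house" (three occurrences) followed by "that" (two, after lowercasing); "is" is dropped by the length filter.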

SLIDE 9

Part 1 - Saving word lists (sketch)

    common_words = set(en_words[:20000]) \
        & set(de_words[:20000])
    en_wordlist = []
    collected = 0
    for word in en_words:
        if word not in common_words:
            en_wordlist.append(word)
            collected += 1
            if collected == 5000:  # 2000 for German
                ...  # write words to file and break
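The same selection can be written as a small function and checked on toy data. Note that this sketch is a variant, not the slide's exact code: it blocks every word in the other language's top list directly instead of building the intersection of both top lists. The function name, parameters, and word lists are hypothetical.

```python
def distinctive_words(own_ranked, other_ranked, n_wanted, n_excluded):
    """Pick the n_wanted most frequent own words that are absent from
    the top n_excluded words of the other language."""
    blocked = set(other_ranked[:n_excluded])
    picked = []
    for word in own_ranked:
        if word not in blocked:
            picked.append(word)
            if len(picked) == n_wanted:
                break
    return picked


# Toy frequency-ranked lists, invented for this example.
en_ranked = ["that", "this", "hand", "with", "house", "have"]
de_ranked = ["nicht", "hand", "haus", "eine", "with", "auch"]

en_list = distinctive_words(en_ranked, de_ranked, n_wanted=3, n_excluded=4)
de_list = distinctive_words(de_ranked, en_ranked, n_wanted=3, n_excluded=4)
```

In the toy data, "hand" is skipped on both sides because it sits in the other language's top four, while "with" stays in the English list because its German rank falls outside that cutoff.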

SLIDE 10

Part 1 - Saving word lists (Option 2)

    en_wordlist = set(en_words[:5000])
    de_wordlist = set(de_words[:2000])
    intersec = en_wordlist & de_wordlist
    words_discarded = 0
    while len(intersec) > 0:
        en_wordlist -= intersec
        de_wordlist -= intersec
        # refill both lists with the next most frequent words;
        # update() merges a sequence, add() would expect one hashable element
        en_wordlist.update(
            en_words[5000 + words_discarded:
                     5000 + words_discarded + len(intersec)])
        de_wordlist.update(
            de_words[2000 + words_discarded:
                     2000 + words_discarded + len(intersec)])
        words_discarded += len(intersec)
        intersec = en_wordlist & de_wordlist
    ...  # write word lists to files

SLIDE 14

Part 2 - Getting code-switched tweets

- Search API vs Stream API
- Fighting rate limits
- Fetching unique tweets
- Speeding up tweet hunt
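One possible way to handle the rate-limit item above is to wait and retry when a request is refused. The sketch below is library-independent: `RateLimited`, `call_with_wait`, and the 900-second wait are hypothetical stand-ins, and a real client library will have its own error type and possibly its own waiting mechanism.

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised by a fetch function when the quota is used up."""

    def __init__(self, retry_after):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def call_with_wait(fetch, max_attempts=3, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait as instructed and try again."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise
            sleep(err.retry_after)


# Demo with a fake fetch that fails once, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited(retry_after=900)
    return ["tweet"]


waits = []
result = call_with_wait(flaky_fetch, sleep=waits.append)  # record waits instead of sleeping
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would actually pause.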

SLIDE 19

Part 2 - Getting code-switched tweets

Some concrete possibilities for the search API:
- query? "-filter:retweets", part of German word list
- count? 100
- max_id?
- geocode? "48.6,11.5,400km", list of countries
- tweet_mode? "extended"
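The `max_id` parameter is what makes paging backwards through results possible: each request asks only for tweets at or below a given ID, so the next request can start just below the smallest ID seen so far. The sketch below shows that loop together with de-duplication by ID. It makes no real API calls: `fetch_page`, `fake_fetch`, and `FAKE_TWEETS` are hypothetical stand-ins for whatever client is used.

```python
def collect_unique(fetch_page, pages):
    """Page backwards through results and keep each tweet ID once.

    fetch_page(max_id) is a hypothetical callable returning a list of
    dicts with an integer "id"; max_id=None means "start at the newest".
    """
    seen = {}
    max_id = None
    for _ in range(pages):
        batch = fetch_page(max_id)
        if not batch:
            break
        for tweet in batch:
            seen.setdefault(tweet["id"], tweet)
        # max_id is inclusive, so step just below the oldest tweet seen.
        max_id = min(t["id"] for t in batch) - 1
    return list(seen.values())


# Fake backend, newest first, standing in for a search endpoint.
FAKE_TWEETS = [{"id": i, "full_text": f"tweet {i}"} for i in range(10, 0, -1)]


def fake_fetch(max_id, count=4):
    eligible = [t for t in FAKE_TWEETS if max_id is None or t["id"] <= max_id]
    return eligible[:count]


tweets = collect_unique(fake_fetch, pages=5)
```

Keying on the tweet ID also guards against the same tweet coming back from overlapping queries, for example when several word-list chunks are searched separately.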

SLIDE 20

Part 2 - Getting code-switched tweets (sketch)

    en_wordset = read_words("en.words")
    de_wordset = read_words("de.words")
    for tweet in tweets:
        if len(tweet.full_text) < 50:
            continue
        en, de = False, False
        for word in tweet.full_text.split():
            if word in en_wordset:
                en = True
            elif word in de_wordset:
                de = True
        if en and de:
            ...  # dump to file, check if 50 tweets reached
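The filter above can be tested in isolation on plain strings. In this sketch the word sets are tiny hand-picked stand-ins for the real word lists, and the regex tokenization and lowercasing are assumptions added so that punctuation and capitalization do not hide matches.

```python
import re


def is_code_switched(text, en_words, de_words, min_chars=50):
    """True if text is long enough and has a word from each word set."""
    if len(text) < min_chars:
        return False
    tokens = re.findall(r"\w+", text.lower())
    has_en = any(tok in en_words for tok in tokens)
    has_de = any(tok in de_words for tok in tokens)
    return has_en and has_de


# Tiny stand-in word sets, invented for this example.
en_wordset = {"still", "high", "like", "excuse"}
de_wordset = {"nichts", "gegen", "bitte", "wieso"}

mixed = ("Like Italy is a 10 und Schweiz naja 7 oder so. "
         "Still high aber like nichts gegen Italien bitte")
german_only = ("Erste Vertragsgespräche werden nach dem Pokalspiel "
               "gegen Worms geführt.")
too_short = "wieso like"

results = [
    is_code_switched(mixed, en_wordset, de_wordset),
    is_code_switched(german_only, en_wordset, de_wordset),
    is_code_switched(too_short, en_wordset, de_wordset),
]
```

Only the first text passes: the second has no English hit, and the third contains both languages but falls under the length threshold.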

SLIDE 24

Gallery

- Like Italy is a 10 und Schweiz naja 7 oder so. Still high aber like nichts gegen Italien bitte
- Die midlife crisis des millennial startet bei der Überlegung sich beim poetry slam anzumelden
- Excuse me, wieso steht dieser komplett creepy aussehende Bär da neben dem Altar?!
- Lots of sports ball fans excited for their sports ball and lots of polizei should they get too excited.

SLIDE 31

Gallery...?

- Erste Vertragsgespräche werden nach dem Pokalspiel gegen Worms geführt.
- Ok, dann muss ich halt Bill Gates werden, um von älteren angesehen zu werden lol
- OMG wasn los??? Leider von hier aus keine Hilfe möglich...
- Ja abwarten aber das ist der perfekte Mann für die defensive gewesen schon mal.
- I had told Liz Johnston that their tote bag was going places
- Bei mir hats auch gedauert..
- Jamie si tu vois ce tweet sache que je fais ça juste pour toi

SLIDE 32

Part 3 - Using langdetect

- Read in JSON file containing tweets
- Tokenize tweets using TweetTokenizer
- Report number of code-switched tweets
- Save all tokens with corresponding languages to file

SLIDE 33

Part 3 - Saving langdetect results (sketch)

    for tweet in tweets:
        text = tweet_tokenizer.tokenize(tweet['full_text'])
        en, de = False, False
        for token in text:
            f.write(token + "\t")
            if not token.isalpha():
                f.write("OTHER\n")
                continue
            detected_languages = detect_langs(token)
            for i, lang in enumerate(detected_languages):
                if lang.lang == "en":
                    en = True
                    f.write("en\n")
                    break
                elif lang.lang == "de":
                    ...  # analogous to above
                elif i == len(detected_languages) - 1:
                    f.write("OTHER\n")
        if en and de:
            code_switched_count += 1
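The labeling logic can be exercised without installing langdetect by passing the detector in as a function. This is a sketch under that assumption: `label_tweet`, `fake_detect`, and the `FAKE_GUESSES` table are hypothetical, and the guesses in the table are invented, not real langdetect output.

```python
import io


def label_tweet(tokens, detect, out):
    """Write one "token<TAB>label" line per token.

    detect(token) is assumed to return language codes ordered from most
    to least probable. Returns True if both "en" and "de" were assigned.
    """
    seen = set()
    for token in tokens:
        label = "OTHER"
        if token.isalpha():
            for lang in detect(token):
                if lang in ("en", "de"):
                    label = lang
                    seen.add(lang)
                    break
        out.write(f"{token}\t{label}\n")
    return {"en", "de"} <= seen


# Stand-in detector with invented guesses.
FAKE_GUESSES = {
    "excuse": ["en", "fr"],
    "wieso": ["de"],
    "creepy": ["en"],
    "Altar": ["pt", "de"],
}


def fake_detect(token):
    return FAKE_GUESSES.get(token, ["xx"])


out = io.StringIO()
tokens = ["excuse", "wieso", "creepy", "Altar", "?!"]
switched = label_tweet(tokens, fake_detect, out)
lines = out.getvalue().splitlines()
```

With the real library, `detect` could wrap `detect_langs` and return the `.lang` attribute of each result. Setting `DetectorFactory.seed` makes langdetect's output repeatable, and `detect_langs` can raise `LangDetectException` for input it cannot analyze, so a wrapper may want to catch that and return an empty list.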

SLIDE 34

Haben Sie questions?
