
Python basics for NLP

Vincent Claveau IRISA-CNRS

Programming environment

Is Python installed?

– Type which python3 in the command prompt (Applications -> Utilities -> Terminal)
– It should answer something like: /usr/bin/python3

Text file editor like emacs, gedit...

– Create a new document named hello.py, save it as plain text (not rtf, doc...) and type:

    #!/usr/bin/python3
    #coding: utf-8
    print('hello world\n')

Programming environment

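Go to the directory where you saved the file

– Windows: cd Desktop, cd .. (cd changes the current directory)
– Unix/Mac: cd works the same way; ls lists the files, pwd prints the current directory

Make hello.py executable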

– Type chmod u+x hello.py in the command prompt

Run your program

– ./hello.py

Python basics

Data structures

  • Scalar: var

– integer: var = 3
– float: var = 7.456
– string: var = 'I love Python'
– boolean: var = True, var = False

  • List (table, array...)

    t_misc = ['titi', 3.1415, var]
    t_misc.append('toto')
    t_misc[0] = 'tutu'

  • Dictionary (associative array, hash...)

    h_misc2 = { 'pi': 3.1415, 12: 'December' }
    h_misc2['vincent'] = 'claveau'
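The three kinds of structures above can be tried in one short script. This is only a minimal sketch: the values come from the slide, and the two print calls at the end are added for illustration.

```python
var = 'I love Python'            # a string scalar

# A list keeps its items in order and can mix types
t_misc = ['titi', 3.1415, var]
t_misc.append('toto')            # add an item at the end
t_misc[0] = 'tutu'               # replace the first item

# A dictionary maps keys to values; keys can be strings or numbers
h_misc2 = {'pi': 3.1415, 12: 'December'}
h_misc2['vincent'] = 'claveau'   # add a new key/value pair

print(len(t_misc), t_misc[-1])        # number of items, last item
print('pi' in h_misc2, h_misc2[12])   # key test, lookup by key
```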

Python basics

Programming structures

  • Conditional structures (note the indent)

    if nb == 5:
        …
    elif line == "blabla\n" or 3 in t_prime:
        …
    else:
        …
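A runnable sketch of the conditional structure, assuming toy values for nb, line and t_prime (these values and the message variable are hypothetical, chosen only so that each test can be evaluated). The indented line under each test is the block that runs when that test succeeds.

```python
nb = 7
line = "blabla\n"
t_prime = [2, 3, 5, 7]

if nb == 5:
    message = 'nb is five'
elif line == "blabla\n" or 3 in t_prime:
    # reached because nb is not 5 and the second test is true
    message = 'second test succeeded'
else:
    message = 'no test succeeded'

print(message)
```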

Python basics

Programming structures

  • Loops

    while nb < 100:
        …

    for i in range(0, len(t_word)):
        …

    for my_key in h_word2count:
        …
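The three loop forms can be sketched on toy data as follows. The word list and the loop bodies are hypothetical; only the loop headers come from the slide.

```python
t_word = ['the', 'cat', 'saw', 'the', 'dog']   # hypothetical toy data
h_word2count = {}

# while loop: repeats as long as the condition holds
nb = 1
while nb < 100:
    nb *= 2

# for loop over the indices of a list
for i in range(0, len(t_word)):
    w = t_word[i]
    h_word2count[w] = h_word2count.get(w, 0) + 1

# for loop over the keys of a dictionary
for my_key in h_word2count:
    print(my_key, h_word2count[my_key])
```

When the index itself is not needed, for w in t_word: is the more common way to write the second loop.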


Python basics

  • Regular expressions (regex)

    import re

    m = re.search(r'^(D|d)upon.', line)
    if m is not None:
        …

    m = re.search(r'^[^\t]*\t(.+)$', line)
    if m is not None:
        h_count[m.group(1)] += 1

    # raw strings (r'...') keep backslashes intact for the regex engine
    var = re.sub(r'[ \t]{2,8}([A-Z])', r'\t\g<1>', var)
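To see what these three patterns do, here is a small sketch on made-up input. The "word<TAB>tag" lines and the string passed to re.sub are assumptions for illustration, not data from the course.

```python
import re
from collections import defaultdict

h_count = defaultdict(int)
t_line = ['Dupont\tNNP', 'dupond\tNNP', 'the\tDT']   # hypothetical "word<TAB>tag" lines

for line in t_line:
    # Lines starting with Dupon or dupon, followed by any one character
    if re.search(r'^(D|d)upon.', line) is not None:
        print('matched:', line)
    # Capture everything after the first tab and count it
    m = re.search(r'^[^\t]*\t(.+)$', line)
    if m is not None:
        h_count[m.group(1)] += 1

# Replace a run of 2 to 8 spaces/tabs before a capital letter by a single tab
var = re.sub(r'[ \t]{2,8}([A-Z])', r'\t\g<1>', 'word    NN')
print(repr(var), dict(h_count))
```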

Python basics

  • About lists

    t_integer = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    t_even = [num for num in t_integer if num % 2 == 0]
    t_decreasing = sorted(t_integer, key=lambda a: -a)

    t_word_count = [('toto', 2), ('titi', 32), ('tata', 12)]
    # key receives one item (a pair): sort by decreasing count, then by word
    t_dec_pair = sorted(t_word_count, key=lambda pair: (-pair[1], pair[0]))
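sorted() calls the key function with a single argument, the list item itself, so a (word, count) pair has to be taken apart by index inside the lambda. A minimal sketch, assuming one extra pair ('tutu', 12) is added to show how ties are broken:

```python
t_word_count = [('toto', 2), ('titi', 32), ('tata', 12), ('tutu', 12)]

# Sort by decreasing count; words with equal counts come out in alphabetical order
t_dec_pair = sorted(t_word_count, key=lambda pair: (-pair[1], pair[0]))
print(t_dec_pair)

# List comprehension with unpacking: keep the words seen at least 10 times
t_word = [w for (w, c) in t_dec_pair if c >= 10]
print(t_word)
```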

Python basics

  • About dictionaries

    from collections import defaultdict

    h_word2count = { 'toto': 2, 'titi': 32, 'tata': 12 }
    for (w, c) in h_word2count.items():
        …

    # nested dictionary: missing keys are created on first use, with count 0
    h_doc2word2count = defaultdict(lambda: defaultdict(lambda: 0))
    h_doc2word2count['doc_450']['toto'] += 1
    t_doc = list(h_doc2word2count.keys())
    t_count = list(h_doc2word2count['doc_450'].values())
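A sketch of how the nested defaultdict fills itself while counting, assuming a hypothetical two-document corpus stored in h_doc2text (a name not in the slide). defaultdict(int) is used here as a shorter equivalent of defaultdict(lambda: 0).

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text
h_doc2text = {'doc_450': 'toto titi toto', 'doc_451': 'tata toto'}

h_doc2word2count = defaultdict(lambda: defaultdict(int))
for (doc, text) in h_doc2text.items():
    for w in text.split():
        h_doc2word2count[doc][w] += 1   # missing keys start at 0 automatically

t_doc = sorted(h_doc2word2count.keys())
t_count = sorted(h_doc2word2count['doc_450'].values())
print(t_doc, t_count)
```

With a plain { }, the += 1 line would fail on the first occurrence of each document and each word, which is why defaultdict is convenient for counting.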

Python basics

  • Functions

    from math import log

    def myScore(tf, df, N=1):
        if df > 0:
            score = tf**2 * log(N / df)
        else:
            print('something strange happens here\n')
            score = 0
        return score

    num = 540
    for w in h_word2info:
        for (doc_id, freq) in h_word2info[w]:
            print(myScore(freq, h_word2df[w], num))
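To run the function end to end, the two dictionaries it relies on need some content. The sketch below assumes hypothetical values for h_word2df (document frequency per word) and h_word2info (list of (doc_id, freq) pairs per word); only the function and the loop come from the slide.

```python
from math import log

def myScore(tf, df, N=1):
    """Squared term frequency weighted by log(N / df)."""
    if df > 0:
        return tf**2 * log(N / df)
    print('something strange happens here')
    return 0

# Hypothetical toy data
h_word2df = {'toto': 54, 'titi': 540}
h_word2info = {'toto': [('doc_1', 2)], 'titi': [('doc_1', 3), ('doc_2', 1)]}
num = 540

for w in h_word2info:
    for (doc_id, freq) in h_word2info[w]:
        print(w, doc_id, myScore(freq, h_word2df[w], num))
```

A word that appears in every document (df equal to N) gets a score of 0, since log(1) is 0.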

Python examples

Read an (existing) file my_file.txt

    fh = open('my_file.txt', 'r')
    # your operations on file handle fh
    fh.close()

  • Or

    for line in codecs.open('my_file.txt', 'r', 'utf-8'):
        …

  • Or

    with open(...) as fh:
        for line in fh:
            …

Write (create) a file output.txt

    fh = codecs.open('output.txt', 'w', 'utf-8')
    fh.write('I write strings !\n' + 'I convert if needed ' + str(10/3) + '\n')
    fh.close()
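A write-then-read round trip can be sketched with the with form only. This assumes Python 3, where the built-in open() accepts an encoding argument, so codecs.open is not required; the temporary directory is a hypothetical location used to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical location for the output file
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# 'with' closes the file automatically, even if an error occurs inside the block
with open(path, 'w', encoding='utf-8') as fh:
    fh.write('I write strings !\n' + 'I convert if needed ' + str(10 / 3) + '\n')

with open(path, 'r', encoding='utf-8') as fh:
    t_lines = [line.rstrip('\r\n') for line in fh]

print(len(t_lines), 'lines read')
```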

Python examples

Explore a file: count lines

    line_nb = 0
    for line in open('my_file.txt', 'r'):
        line_nb += 1
    print('There are', line_nb, 'lines\n')


Python examples

Explore a file: put lines in an array, then print the file in reverse order

    t_lines = []
    for line in open('my_file.txt', 'r'):
        t_lines.append(line.rstrip('\r\n'))
    for i in range(len(t_lines)):
        print(t_lines[len(t_lines) - 1 - i])

Python examples

Explore a file: count lines with a dictionary (frequency of the first letter of each line)

    import re
    from collections import defaultdict

    h_line2count = defaultdict(lambda: 0)   # better idea than { }
    re_letter = re.compile('^[^A-Za-z]*([A-Za-z])')
    for line in open('my_file.txt', 'r'):
        ## h_line2count[ line.rstrip('\r\n') ] += 1
        m = re_letter.search(line)
        if m is not None:
            h_line2count[m.group(1)] += 1
    # sort by decreasing count, then by letter
    print(sorted(h_line2count.items(), key=lambda cv: (-cv[1], cv[0])))
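The same counting logic can be checked without a file by looping over a list of strings. The lines below are hypothetical, and the names h_letter2count and t_sorted are introduced here only for the illustration.

```python
import re
from collections import defaultdict

# Hypothetical lines standing in for the content of my_file.txt
t_lines = ['The cat', '  "the dog"', 'A bird', '42', 'The end']

h_letter2count = defaultdict(int)
re_letter = re.compile(r'^[^A-Za-z]*([A-Za-z])')
for line in t_lines:
    m = re_letter.search(line)
    if m is not None:                  # '42' contains no letter and is skipped
        h_letter2count[m.group(1)] += 1

# Most frequent letter first; equal counts are ordered by letter
t_sorted = sorted(h_letter2count.items(), key=lambda cv: (-cv[1], cv[0]))
print(t_sorted)
```

Upper and lower case are counted separately here; calling .lower() on the captured letter would merge them.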

Python typical headers

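Load common libraries

    from __future__ import division   # no-op in Python 3, where / is already true division
    import os, subprocess, codecs, sys, glob, re, getopt, random, operator, pickle
    from math import *
    from collections import defaultdict

    # reload(sys) and sys.setdefaultencoding('utf-8') exist only in Python 2;
    # Python 3 strings are Unicode by default, so they are not needed

    prg = sys.argv[0]

    def P(output=''):
        # debug point: pause until the user presses ENTER
        input(output + "\nDebug point; Press ENTER to continue")

    def Info(output=''):
        # write a status message to stderr
        sys.stderr.write(output + '\n')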

Python typical headers

Read arguments from command line

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--train", dest="file_train",
                        help="file containing the training docs", metavar="FILE")
    parser.add_argument("-s", "--stop", dest="file_stop",
                        default='common_words.total_fr.txt',
                        help="FILE contains stop words", metavar="FILE")
    # -v switches status messages on (they are off by default)
    parser.add_argument("-v", "--verbose", action="store_true", dest="verbose",
                        default=False, help="print status messages to stdout")
    args = parser.parse_args()
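The parsed values end up as attributes of args, named after each dest. A minimal sketch: passing a list to parse_args() simulates a command line, and 'train.txt' is a hypothetical file name.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-t", "--train", dest="file_train", metavar="FILE",
                    help="file containing the training docs")
parser.add_argument("-s", "--stop", dest="file_stop", metavar="FILE",
                    default='common_words.total_fr.txt',
                    help="FILE contains stop words")

# Equivalent to running: ./my_prg.py -t train.txt
args = parser.parse_args(['-t', 'train.txt'])

print(args.file_train)   # value given on the command line
print(args.file_stop)    # falls back to the default
```

Called without arguments, parse_args() reads sys.argv instead, and -h prints the help messages.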

NLTK

Natural Language Toolkit

– Contains corpora, resources
– Contains basic tools: tagger, chunker...

Use a local easy_install

– easy_install --install-dir monRepPython nltk (monRepPython: your local Python directory)
– Check your PYTHONPATH

Install the needed data/tools

    import nltk
    nltk.download()

Python exercises

  • Count the number of words in the PoS tagged corpus → ~4400

  • Count how many times each word appears → the:202

  • Count the average number of occurrences of common nouns (NN, NNS, NNP) → 2.61099

  • Count word co-occurrences (2 words occurring in the same sentence)
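As a starting point for the counting exercises, here is a sketch on a two-sentence toy corpus. The word/TAG token format and the sentences are assumptions, since the course corpus is not shown here, so the numbers it prints are unrelated to the expected answers above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: one sentence per item, tokens written word/TAG
t_sentence = ['the/DT cat/NN sleeps/VBZ', 'the/DT dogs/NNS bark/VBP']

h_word2count = Counter()
h_pair2count = Counter()
for sentence in t_sentence:
    # keep the word, drop the tag after the last '/'
    t_word = [token.rsplit('/', 1)[0] for token in sentence.split()]
    h_word2count.update(t_word)
    # every unordered pair of distinct words in the sentence
    for pair in combinations(sorted(set(t_word)), 2):
        h_pair2count[pair] += 1

print(sum(h_word2count.values()), 'words')
print(h_word2count.most_common(1))
print(len(h_pair2count), 'distinct co-occurring pairs')
```

Sorting the words before pairing makes ('cat', 'the') and ('the', 'cat') count as the same pair.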


Python debugging

Compiling errors

First focus on the first syntax error

– syntax error at ./my_prg.py line 1209, near "}"
– Common causes: wrong indent or paren ('(', '{') mismatch

Then process the other errors as they appear

– Global symbol "df" unknown
– Common causes: forgot to declare/initialize a variable (toto = 0)
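In Python 3 these two situations show up as a SyntaxError (reported before the program runs) and a NameError (raised when the line executes). The sketch below triggers both on purpose; the two broken snippets are hypothetical, and compile()/exec() are used only to catch the errors inside one script.

```python
# Two deliberately broken, hypothetical snippets
t_source = ["print('hello'",      # closing parenthesis is missing
            "score = tf * 2"]     # tf was never given a value

t_error = []
for source in t_source:
    try:
        # compile() checks the syntax; exec() then runs the code in an empty namespace
        exec(compile(source, '<example>', 'exec'), {})
    except SyntaxError as err:
        t_error.append('SyntaxError')
        print('syntax error on line', err.lineno)
    except NameError as err:
        t_error.append('NameError')
        print(err)

print(t_error)
```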

Python debugging

Runtime errors

Common causes

– Uninitialised variables: h_occ['Le'], t_count[502]
– Division by 0, log 0
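Each of these causes raises a specific exception in Python 3. A minimal sketch, assuming a small dictionary and list that do not contain the requested key or index (the contents of h_occ and t_count are hypothetical):

```python
from math import log

h_occ = {'la': 3}       # hypothetical counts; 'Le' is absent
t_count = [1, 2, 3]     # only 3 items, so index 502 does not exist

t_error = []
for risky in (lambda: h_occ['Le'],     # missing dictionary key
              lambda: t_count[502],    # index past the end of the list
              lambda: 1 / 0,           # division by zero
              lambda: log(0)):         # log is undefined at 0
    try:
        risky()
    except (KeyError, IndexError, ZeroDivisionError, ValueError) as err:
        t_error.append(type(err).__name__)

print(t_error)

# A safer lookup: get() returns a default instead of raising KeyError
n_le = h_occ.get('Le', 0)
```

Using a defaultdict, or testing df > 0 before dividing as myScore does, avoids most of these errors.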