python basics for nlp
play

Python basics for NLP Type : which python3 in the command prompt - PDF document

Programming environment Is Python installed Python basics for NLP Type : which python3 in the command prompt (applications->utilitaires->Terminal) Vincent Claveau It should answer: /usr/bin/python3 IRISA-CNRS Text file


  1. Programming environment  Is Python installed Python basics for NLP – Type : which python3 in the command prompt (applications->utilitaires->Terminal) Vincent Claveau – It should answer: /usr/bin/python3 IRISA-CNRS  Text file editor like emacs, gedit... – Create a new document named hello.py, save it as pure txt (not rtf, doc...) and type: #!/usr/bin/python3 #coding: utf-8 print('hello world\n') Python basics Programming environment Data structures  Scalar: var  Go in the directory where you saved the file – – Windows: cd Dekstop cd .. integer: var = 3 float: var = 7.456 – – Unix/Mac: ls pwd string: var = 'I love Python'  Make hello.py an executable – boolean: var = True var = False  List (table, array...) – Type chmod u+x hello.py in the command prompt t_misc = ['titi', 3.1415, var] t_misc.append('toto')  Run your program t_misc[0] = 'tutu'  Dictionary (associative array, hash...) – ./hello.py h_misc2 = { 'pi': 3.1415, 12:'December' } h_misc2['vincent'] = 'claveau' Python basics Python basics Programming structures Programming structures • • Conditional structures (note the indent) Loops if nb == 5: while nb < 100: … … elif line == “blabla\n” or 3 in t_prime: for i in range(0,len(t_word)): … … else for my_key in h_word2count: … …

  2. Python basics Python basics • • Regular expressions (regex) About lists t_integer = [1,2,3,4,5,6,7,8,9,10] t_even = [ num for num in t_integer if num%2 == 0 ] m = re.search( '^(D|d)upon.', line) t_decreasing = sorted( t_integer, key=lambda a: -a) if m is not None: … t_word_count = [ ('toto',2), ('titi',32), ('tata',12) ] m = re.search( '^[^\t]*\t(.+)$', line) t_dec_pair = sorted( t_word_count, key=lambda a,b: (-b,a) ) if m is not None: h_count[ m.group(1) ] += 1 var = re.sub( '[ \t]{2,8}([A-Z])', '\t\g<1>', var) Python basics Python basics • • About dictionary functions def myScore( tf , df , N = 1): h_word2count = { 'toto': 2, 'titi': 32, 'tata': 12 } if df > 0: for (w,c) in h_word2count.items( ) : …. score = tf**2 * log( N / df ) else: h_doc2word2count = defaultdict(lambda: defaultdict(lambda: 0) ) print('something strange happens here\n') h_doc2word2count[ 'doc_450' ][ 'toto' ] += 1 score = 0 t_doc = h_doc2word2count.keys( ) return score t_count = h_doc2word2count[ 'doc_450' ].values( ) num = 540 for w in h_word2info: for (doc_id, freq) in h_word2info[w]: print myScore(freq, h_word2df[w],num) Python examples Python examples  Read an (existing) file my_file.txt  Explore a file : count lines fh = open('my_file.txt', 'r') line_nb = 0 your operations on file handle fh for line in open('mon_fichier.txt', 'r'): fh.close() line_nb += 1 or for line in codecs.open('my_file.txt', 'r', 'utf-8'): … or with open(...) as fh: print('There are' , line_nb , 'lines\n') for line in fh: ….  Write (create) a file output.txt fh = codecs.open('my_file.txt', 'w', 'utf-8') fh.write('I write strings !\n' + 'I convert if needed ' + str(10/3) + '\n') fh.close()

  3. Python examples Python examples  Explore a file: put lines in an array – reverse  Explore a file : count lines with a dictionary – printing of the file frequency of beginning letter of line t_lines = [ ] h_line2count = defaultdict(lambda: 0) # better idea than { } for line in open( 'my_file.txt', 'r' ): re_letter = re.compile('^[^A-Za-z]*([A-Za-z])') t_lines.append( line.rstrip('\r\n') ) for line in open('my_file.txt', 'r' ) : ## h_line2count[ line.rstrip('\r\n') ] += 1 for i in range(len(t_lines)): m = re_letter.search(line) print t_lines[len(t_lines)-1-i] if m is not None: h_line2count[ m.group(1) ]+=1 print sorted(h_line2count.items(), key=lambda (c,v): (-v,c)) Python typical headers Python typical headers  Load common libraries  Read arguments from command line import argparse from __future__ import division parser = argparse.ArgumentParser() import os, subprocess, codecs, sys, glob, re, getopt, random, operator, pickle parser.add_argument("-t", "--train", dest="file_train", from math import * help="file containing the training docs", metavar="FILE") reload(sys) parser.add_argument("-s", "--stop", dest="file_stop", sys.setdefaultencoding('utf-8') default='common_words.total_fr.txt', help="FILE contains stop words", metavar="FILE") from collections import defaultdict parser.add_argument("-v", "--verbose", action="store_false", prg = sys.argv[0] dest="verbose", default=True, help="print status messages to stdout") def P(output=''): input(output+"\nDebug point; Press ENTER to continue") args = parser.parse_args() def Info(output=''): sys.stderr.write(output+'\n') NLTK Python exercises  Count the number of words in the PoS tagged  Natural Language Tool Kit corpus → ~4400 – Contains corpus, resources  Count how many times each word appears → – Contains basic tools: tagger, chunker... the:202  Use a local easy-install  Count the average occurrence of common nouns – easy-install --instal-dir monRepPython (NN, NNS, NNP) → 2.61099 –  Count word co-occurrences (2 words occurring Check your PYTHONPATH in a same sentence)  Install the needed data/tools import nltk nltk.download()

  4. Python debugging Python debugging Compiling errors Runtime errors  First focus on the first syntax error  Common causes – syntax error at ./my_prg.py line 1209, near "}" – Uninitialised variables: h_occ['Le'], – common causes: wrong indent or paren ('(','{') mismatch t_count[502]  Then process the other errors as they – Division by 0, log 0 appears – Global symbol "df" unknown – Common causes : forgot to declare/initialize a variable (toto = 0)

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend