Handling sequential data - Natural Language Generation in Python - PowerPoint PPT Presentation

SLIDE 1

Handling sequential data

NATURAL LANGUAGE GENERATION IN PYTHON

Biswanath Halder

Data Scientist

SLIDE 2

Natural language generation

Generation of texts in a certain style. Machine translation. Sentence or word auto-completion. Generation of textual summaries. Automated chatbots.

SLIDE 3

Introduction to sequential data

Any data where the order matters. Examples - text data, time series data, DNA sequences. Models should take order information into account.

SLIDE 4

Text or language data

Data used in spoken or written language. Specific order amongst words or characters. Change of order - different meaning or gibberish. "I am learning Mathematics" - Correct. "learning am Mathematics I" - Doesn't make sense.

SLIDE 5

An example of text dataset

Dataset of people's names. Each word is a name, e.g. john, william, james, charles, george. Each name is an independent word. However, the order of the characters inside the name matters. Name - sequence of characters, e.g., 'j', 'a', 'm', 'e', 's'. Our goal - generate names such as these.
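
A minimal sketch of loading such a dataset with pandas (the file name names.csv is an assumption for illustration, not from the slides):

import pandas as pd

# Hypothetical CSV with one lowercase name per row in a 'name' column
names = pd.read_csv('names.csv')
print(names.head(5))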

SLIDE 6

Names Dataset

names.head(5)

      name
0     john
1  william
2    james
3  charles
4   george

SLIDE 7

Word delimiters

Specify the start and end of a name using start and end tokens. One special character specifies the start - the start token. Another special character specifies the end - the end token. Start token - \t. End token - \n.

SLIDE 8

Insert start token

Start token in front of the name.

data['name'] = data['name'].apply(lambda x : '\t' + x)

        name
0     \tjohn
1  \twilliam
2    \tjames
3  \tcharles
4   \tgeorge

SLIDE 9

Append end token

End token at the end of the name.

data['target'] = data['name'].apply(lambda x : x[1:len(x)] + '\n')

        name     target
0     \tjohn     john\n
1  \twilliam  william\n
2    \tjames    james\n
3  \tcharles  charles\n
4   \tgeorge   george\n

SLIDE 10

Vocabulary for names dataset

Vocabulary - set of all unique characters used in the dataset.

def get_vocabulary(names):
    # Define vocabulary as a set and include start and end tokens
    vocabulary = set(['\t', '\n'])
    # Iterate over all names and all characters of each name
    for name in names:
        for c in name:
            # If character is not in the vocabulary, add it
            if c not in vocabulary:
                vocabulary.add(c)
    # Return the vocabulary
    return vocabulary
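
A quick usage sketch (assuming the names sit in the name column of the names DataFrame shown earlier):

vocabulary = get_vocabulary(names['name'])
print(sorted(vocabulary))  # ['\t', '\n', 'a', 'b', 'c', ...]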

SLIDE 11

Character to integer mapping

Sort the vocabulary and assign numbers in order. Character \t mapped to 0, \n to 1, a to 2, b to 3, etc.

ctoi = {char: idx for idx, char in enumerate(sorted(vocabulary))}

{'\t': 0, '\n': 1, 'a': 2, 'b': 3, 'c': 4, ...}
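
As a small illustration, a name can be encoded as a list of integers with this mapping (the exact values assume all 26 lowercase letters occur in the dataset):

encoded = [ctoi[c] for c in 'james']
# [11, 2, 14, 6, 20] under the mapping shown above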

SLIDE 12

Integer to character mapping

Integer to character mapping. Integer 0 to \t, 1 to \n, 2 to a, 3 to b, etc.

itoc = {idx: char for idx, char in enumerate(sorted(vocabulary))}

{0: '\t', 1: '\n', 2: 'a', 3: 'b', 4: 'c', ...}

SLIDE 13

Let's practice!

NATURAL LANGUAGE GENERATION IN PYTHON

SLIDE 14

Introduction to recurrent neural network

NATURAL LANGUAGE GENERATION IN PYTHON

Biswanath Halder

Data Scientist

SLIDE 15

Feed-forward neural network

SLIDE 16

Introducing recurrence

SLIDE 17

RNN for baby name generator

Generate next character given current. Keep track of the history so far. Generate name john. Sequence - \t, j, o, h, n, \n. Time-step 1: input \t, output j. Time-step 2: input j, output o. State remembers \t and j seen so far. Continue till end of sequence.
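
A minimal sketch of the per-time-step input/target pairs for the name john (plain Python, just to make the unrolling concrete; not from the slides):

sequence = ['\t', 'j', 'o', 'h', 'n', '\n']
# At each time-step the network reads one character and should predict the next one
for t, (current, nxt) in enumerate(zip(sequence[:-1], sequence[1:]), start=1):
    print(f"Time-step {t}: input {current!r}, target {nxt!r}")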

SLIDE 18

Encoding of the characters

Character to integer mapping.

{'\t': 0, '\n': 1, 'a': 2, 'b': 3, 'c': 4, ...}

One-hot encoding of the characters.

'\t' = [1, 0, 0, 0, ..., 0]
'\n' = [0, 1, 0, 0, ..., 0]
'a'  = [0, 0, 1, 0, ..., 0]
'b'  = [0, 0, 0, 1, ..., 0]
...
'z'  = [0, 0, 0, 0, ..., 1]
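
A minimal sketch of producing such a one-hot vector from the ctoi mapping with NumPy (the helper name one_hot is an assumption, not from the slides):

import numpy as np

def one_hot(char, ctoi):
    # Zero vector with a single 1 at the character's index
    vec = np.zeros(len(ctoi), dtype='float32')
    vec[ctoi[char]] = 1.0
    return vec

print(one_hot('a', ctoi))  # all zeros except a 1 at index ctoi['a']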

SLIDE 19

Number of time steps

Time-step: Length of the longest name.

import numpy as np

def get_max_len(names):
    # Collect the length of every name
    length_list = []
    for l in names:
        length_list.append(len(l))
    # The longest name determines the number of time-steps
    max_len = np.max(length_list)
    return max_len

max_len = get_max_len(names)

Each name as a sequence of length max_len

SLIDE 20

Input and target vectors

SLIDE 21

Initialize the input vector

Create 3-D zero vector of required shape for input.

input_data = np.zeros((len(names.name), max_len+1, len(vocabulary)), dtype='float32')

Fill the vector with data.

for n_idx, name in enumerate(names.name):
    for c_idx, char in enumerate(name):
        input_data[n_idx, c_idx, char_to_idx[char]] = 1.

SLIDE 22

Initialize the target vector

Create 3-D zero vector of required shape for target.

target_data = np.zeros((len(names.name), max_len+1, len(vocabulary)), dtype='float32')

Fill the target vector with data.

for n_idx, name in enumerate(names.target):
    for c_idx, char in enumerate(name):
        target_data[n_idx, c_idx, char_to_idx[char]] = 1.

SLIDE 23

Build and compile recurrent neural network

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, TimeDistributed

model = Sequential()
model.add(SimpleRNN(50, input_shape=(max_len+1, len(vocabulary)), return_sequences=True))
model.add(TimeDistributed(Dense(len(vocabulary), activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam')

SLIDE 24

Check model summary

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
simple_rnn_1 (SimpleRNN)     (None, 13, 50)            3950
_________________________________________________________________
time_distributed_1 (TimeDist (None, 13, 28)            1428
_________________________________________________________________
time_distributed_2 (TimeDist (None, 13, 28)            0
=================================================================
Total params: 5,378
Trainable params: 5,378
Non-trainable params: 0
_________________________________________________________________

SLIDE 25

Let's practice!

NATURAL LANGUAGE GENERATION IN PYTHON

SLIDE 26

Inference using recurrent neural network

NATURAL LANGUAGE GENERATION IN PYTHON

Biswanath Halder

Data Scientist

SLIDE 27

Understanding training

Neural network: a black box. Input-target pair (x, y): ideal output y for input x. For input x, the network produces an output, say z. Goal: reduce the difference between the actual output z and the ideal output y. Training: adjust the internal parameters to achieve this goal. After training, the actual output is more similar to the ideal output.

SLIDE 28

Input and target vectors for training

SLIDE 29

Train recurrent network

Train recurrent network.

model.fit(input_data, target_data, batch_size=128, epochs=15)

Batch size: number of samples after which the parameters are adjusted. Epoch: number of times to iterate over the full dataset.

SLIDE 30

Predict first character

Initialize the first character of the sequence.

output_seq = np.zeros((1, max_len+1, len(vocabulary)))
output_seq[0, 0, char_to_idx['\t']] = 1.

Probability distribution for the next character.

probs = model.predict_proba(output_seq, verbose=0)[:,0,:]

Sample the vocabulary using the probability distribution.

first_char = np.random.choice(sorted(list(vocabulary)), replace=False, p=probs.reshape(28))

SLIDE 31

Predict second character using the first

Insert the first character in the sequence.

output_seq[0, 1, char_to_idx[first_char]] = 1.

Sample from the probability distribution.

probs = model.predict_proba(output_seq, verbose=0)[:,1,:]
second_char = np.random.choice(sorted(list(vocabulary)), replace=False, p=probs.reshape(28))

SLIDE 32

Generate baby names

def generate_baby_names(n):
    for i in range(0, n):
        stop = False
        counter = 1
        name = ''
        # Initialize first char of output sequence
        output_seq = np.zeros((1, max_len+1, 28))
        output_seq[0, 0, char_to_idx['\t']] = 1.
        # Continue until a newline is generated or max no of chars reached
        while stop == False and counter < 10:
            # Get probability distribution for next character
            probs = model.predict_proba(output_seq, verbose=0)[:,counter-1,:]
            # Sample vocabulary to get most probable next character
            c = np.random.choice(sorted(list(vocabulary)), replace=False, p=probs.reshape(28))
            if c == '\n':
                stop = True
            else:
                name = name + c
                output_seq[0, counter, char_to_idx[c]] = 1.
                counter = counter + 1
        print(name)

SLIDE 33

Cool baby names

generate_baby_names(10)

leannad
elfrey
lisse
artima
revel
geletha
ortone
rorental
berne
raypha

SLIDE 34

Let's practice!

NATURAL LANGUAGE GENERATION IN PYTHON