GenerativeTF


Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs), provide a mechanism for language modeling: they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for generative tasks, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture we discussed in the previous unit, each RNN unit produced the next hidden state as an output. However, we can also add another output to each recurrent unit, which allows us to output a sequence equal in length to the original one. Moreover, we can use RNN units that do not accept an input at each step, but instead take some initial state vector and then produce a sequence of outputs.

In this notebook, we will focus on simple generative models that help us generate text. For simplicity, let's build a character-level network, which generates text letter by letter. During training, we need to take some text corpus and split it into character sequences.

[1]

Building character vocabulary

To build a character-level generative network, we need to split the text into individual characters instead of words. The TextVectorization layer that we have been using before cannot do that, so we have two options:

  • Manually load text and do tokenization 'by hand', as in this official Keras example
  • Use Tokenizer class for character-level tokenization.

We will go with the second option. Tokenizer can also be used to tokenize into words, so switching from character-level to word-level tokenization should be quite easy.

To do character-level tokenization, we need to pass the char_level=True parameter:

[2]
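The code for this cell is not included in this export; a minimal sketch of what it likely looks like follows. The two-title corpus here is a hypothetical stand-in for the news dataset used in the lesson:

```python
from tensorflow import keras

# Hypothetical mini-corpus standing in for the lesson's news titles
titles = ['Wall St. Bears Claw Back Into the Black (Reuters)',
          'Carlyle Looks Toward Commercial Aerospace (Reuters)']

# char_level=True makes every character (not every word) a separate token
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(titles)

# Index 0 is reserved for padding, so the vocabulary is one larger
# than the number of distinct characters
vocab_size = len(tokenizer.word_index) + 1
```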

We also want to use one special token to denote the end of a sequence, which we will call <eos>. Let's add it to the vocabulary manually:

[3]
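This cell's code is also elided; a sketch of the manual addition, assuming the fitted char-level tokenizer from the previous cell (reproduced here on a hypothetical corpus so the snippet stands alone):

```python
from tensorflow import keras

# Assumed setup: a char-level tokenizer fitted on some corpus
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['hello world'])

# Give <eos> the next free index and grow the vocabulary accordingly
eos_token = len(tokenizer.word_index) + 1
tokenizer.word_index['<eos>'] = eos_token
vocab_size = eos_token + 1
```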

Now, to encode text into sequences of numbers, we can use:

[4]
[[48, 2, 10, 10, 5, 44, 1, 25, 5, 8, 10, 13, 78]]
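The encoding call is most likely texts_to_sequences; a sketch on a hypothetical string (the exact numbers above depend on the lesson's corpus and will differ here):

```python
from tensorflow import keras

tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['hello, world'])

# Each string becomes a list of per-character token numbers
seq = tokenizer.texts_to_sequences(['hello, world'])
```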

Training a generative RNN to generate titles

The way we will train the RNN to generate news titles is as follows: on each step, we take one title and feed it into the RNN, and for each input character we ask the network to generate the next output character:

Image showing an example RNN generation of the word 'HELLO'.

For the last character of our sequence, we will ask the network to generate the <eos> token.

The main difference of the generative RNN we are using here is that we will take an output from each step of the RNN, not just from the final cell. This can be achieved by passing return_sequences=True to the RNN cell.

Thus, during training, the input to the network is a sequence of encoded characters of some length, and the output is a sequence of the same length, but shifted by one element and terminated by <eos>. A minibatch consists of several such sequences, and we need to use padding to align them.

Let's create functions that will transform the dataset for us. Because we want to pad sequences at the minibatch level, we will first batch the dataset by calling .batch(), and then map it to perform the transformation. Thus, the transformation function will take a whole minibatch as a parameter:

[5]

A few important things that we do here:

  • We first extract the actual text from the string tensor
  • texts_to_sequences converts the list of strings into a list of integer tensors
  • pad_sequences pads those tensors to their maximum length
  • We finally one-hot encode all the characters, and also do the shifting and <eos> appending. We will soon see why we need one-hot-encoded characters
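The steps above can be sketched as follows. This is a reconstruction, not the lesson's exact code: the function name title_batch and the two-title setup corpus are assumptions:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Assumed setup from the earlier cells (hypothetical mini-corpus)
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['hello world', 'hello there'])
eos_token = len(tokenizer.word_index) + 1
tokenizer.word_index['<eos>'] = eos_token
vocab_size = eos_token + 1

def title_batch(x):
    # 1. extract the actual Python strings from the string tensor
    x = [s.decode('utf-8') for s in x.numpy()]
    # 2. convert the list of strings into a list of integer sequences
    z = tokenizer.texts_to_sequences(x)
    # 3. pad those sequences to the maximum length in the minibatch
    z = keras.preprocessing.sequence.pad_sequences(z)
    # 4. target = input shifted left by one element, terminated by <eos>
    eos_col = np.full((len(z), 1), eos_token, dtype=z.dtype)
    shifted = np.concatenate([z[:, 1:], eos_col], axis=1)
    # 5. one-hot encode both the input and the output characters
    return tf.one_hot(z, vocab_size), tf.one_hot(shifted, vocab_size)
```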

However, this function is Pythonic, i.e. it cannot be automatically translated into a TensorFlow computational graph. We will get errors if we try to use it directly in the Dataset.map function, so we need to enclose this Pythonic call using the py_function wrapper:

[6]
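A sketch of how the wrapper and the resulting pipeline might look. The transform is repeated here (on an assumed mini-corpus) so the snippet is self-contained; the name title_batch_fn is a guess:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Assumed setup (see the previous cells); the corpus is a hypothetical stand-in
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['hello world', 'hello there'])
eos_token = len(tokenizer.word_index) + 1
tokenizer.word_index['<eos>'] = eos_token
vocab_size = eos_token + 1

def title_batch(x):
    x = [s.decode('utf-8') for s in x.numpy()]
    z = keras.preprocessing.sequence.pad_sequences(tokenizer.texts_to_sequences(x))
    eos_col = np.full((len(z), 1), eos_token, dtype=z.dtype)
    shifted = np.concatenate([z[:, 1:], eos_col], axis=1)
    return tf.one_hot(z, vocab_size), tf.one_hot(shifted, vocab_size)

def title_batch_fn(x):
    # py_function lets the Pythonic transform run inside the tf.data pipeline
    return tf.py_function(title_batch, inp=[x], Tout=(tf.float32, tf.float32))

# batch first, then map, so that padding happens at the minibatch level
ds = tf.data.Dataset.from_tensor_slices(['hello world', 'hello there']) \
       .batch(2).map(title_batch_fn)
```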

Note: Differentiating between Pythonic and TensorFlow transformation functions may seem a little too complex, and you may be questioning why we do not transform the dataset using standard Python functions before passing it to fit. While this definitely can be done, using Dataset.map has a huge advantage, because the data transformation pipeline is executed using the TensorFlow computational graph, which takes advantage of GPU computations and minimizes the need to pass data between CPU and GPU.

Now we can build our generator network and start training. It can be based on any of the recurrent cells we discussed in the previous unit (simple RNN, LSTM, or GRU). In our example we will use LSTM.

Because the network takes characters as input and the vocabulary size is pretty small, we do not need an embedding layer; the one-hot-encoded input can go directly into the LSTM cell. The output layer is a Dense classifier that converts the LSTM output into one-hot-encoded token numbers.

In addition, since we are dealing with variable-length sequences, we can use a Masking layer to create a mask that ignores the padded part of the string. This is not strictly needed, because we are not very interested in anything that goes beyond the <eos> token, but we will use it for the sake of getting some experience with this layer type. input_shape is (None, vocab_size), where None indicates a sequence of variable length, and the output shape is (None, vocab_size) as well, as you can see from the summary:

[7]
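The model definition is missing from this export; based on the layer summary, it can be reconstructed roughly as below. The softmax activation and categorical cross-entropy loss are typical choices for this setup, not confirmed by the export, and vocab_size = 84 is read off the summary:

```python
from tensorflow import keras

vocab_size = 84  # from the summary; in practice comes from the tokenizer

model = keras.models.Sequential([
    # Masking skips all-zero (padded) timesteps downstream
    keras.layers.Masking(input_shape=(None, vocab_size)),
    # return_sequences=True yields an output at every step, not just the last
    keras.layers.LSTM(128, return_sequences=True),
    # per-step classifier over the character vocabulary
    keras.layers.Dense(vocab_size, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
```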
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking (Masking)            (None, None, 84)          0         
_________________________________________________________________
lstm (LSTM)                  (None, None, 128)         109056    
_________________________________________________________________
dense (Dense)                (None, None, 84)          10836     
=================================================================
Total params: 119,892
Trainable params: 119,892
Non-trainable params: 0
_________________________________________________________________
15000/15000 [==============================] - 229s 15ms/step - loss: 1.5385
<tensorflow.python.keras.callbacks.History at 0x7fa40c1245e0>

Generating output

Now that we have trained the model, we want to use it to generate some output. First of all, we need a way to decode text represented by a sequence of token numbers. To do this, we could use tokenizer.sequences_to_texts function; however, it does not work well with character-level tokenization. Therefore we will take a dictionary of tokens from the tokenizer (called word_index), build a reverse map, and write our own decoding function:

[10]
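A sketch of the reverse map and decoder (the tokenizer setup here is a hypothetical stand-in). One small deviation from the lesson: using .get with a default sidesteps a KeyError on ids that never occur in word_index, such as the padding id 0:

```python
from tensorflow import keras

# Assumed setup: a char-level tokenizer fitted on some corpus
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['hello world'])

# word_index maps char -> number; invert it to map number -> char
reverse_map = {val: key for key, val in tokenizer.word_index.items()}

def decode(x):
    # skip ids without a character (e.g. the padding id 0)
    return ''.join(reverse_map.get(t, '') for t in x)
```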

Now, let's do generation. We will start with some string start, encode it into a sequence inp, and then on each step we will call our network to infer the next character.

The output of the network, out, is a vector of vocab_size elements representing the probabilities of each token, and we can find the most probable token number using argmax. We then append this character to the list of generated tokens and proceed with generation. This process of generating one character is repeated size times to generate the required number of characters, and we terminate early when eos_token is encountered.

[12]
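A self-contained sketch of such a generate function. The tokenizer corpus is hypothetical, and an untrained toy model stands in for the trained one purely so the snippet runs; real output like the title below requires training first:

```python
import tensorflow as tf
from tensorflow import keras

# Assumed setup (hypothetical corpus; untrained stand-in model)
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['today strike for the lead'])
eos_token = len(tokenizer.word_index) + 1
tokenizer.word_index['<eos>'] = eos_token
vocab_size = eos_token + 1
reverse_map = {val: key for key, val in tokenizer.word_index.items()}

def decode(x):
    return ''.join(reverse_map.get(t, '') for t in x)

model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None, vocab_size)),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.Dense(vocab_size, activation='softmax'),
])

def generate(model, size=100, start='today '):
    inp = tokenizer.texts_to_sequences([start])[0]
    chars = []
    for _ in range(size):
        # feed the whole sequence so far; read the distribution at the last step
        out = model(tf.expand_dims(tf.one_hot(inp, vocab_size), 0))
        nc = int(tf.argmax(out[0, -1]).numpy())  # most probable next token
        if nc == eos_token:
            break                                # terminate early on <eos>
        chars.append(nc)
        inp = inp + [nc]
    return start + decode(chars)
```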
'Today #39;s lead to strike for the strike for the strike for the strike (AFP)'

Sampling output during training

Because we do not have any useful metrics such as accuracy, the only way we can see that our model is getting better is by sampling generated strings during training. To do this, we will use callbacks, i.e. functions that we can pass to the fit function, which will be called periodically during training.

[13]
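One way to do this is with keras.callbacks.LambdaCallback; a sketch follows, with a trivial stand-in for the generate function and model defined earlier so the snippet is self-contained:

```python
from tensorflow import keras

# Hypothetical stand-in for the generate() function and trained model
def generate(model, size=100, start='Today '):
    return start + '...'

model = None  # the real trained model would go here

# run the sampler at the end of every epoch
sampling_callback = keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(generate(model)))

# usage: model.fit(ds, epochs=3, callbacks=[sampling_callback])
```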
Epoch 1/3
15000/15000 [==============================] - 226s 15ms/step - loss: 1.2703
Today #39;s a lead in the company for the strike
Epoch 2/3
15000/15000 [==============================] - 227s 15ms/step - loss: 1.2057
Today #39;s the Market Service on Security Start (AP)
Epoch 3/3
15000/15000 [==============================] - 226s 15ms/step - loss: 1.1752
Today #39;s a line on the strike to start for the start
<tensorflow.python.keras.callbacks.History at 0x7fa40c74e3d0>

This example already generates some pretty good text, but it can be further improved in several ways:

  • More text. We have only used titles for our task, but you may want to experiment with full text. Remember that RNNs are not too great at handling long sequences, so it makes sense either to split them into shorter sentences, or to always train on sequences of a fixed predefined length num_chars (say, 256). You may try to change the example above into such an architecture, using the official Keras tutorial as an inspiration.
  • Multilayer LSTM. It makes sense to try 2 or 3 layers of LSTM cells. As we mentioned in the previous unit, each layer of LSTM extracts certain patterns from the text; in the case of a character-level generator, we can expect the lower LSTM level to be responsible for extracting syllables, and higher levels for words and word combinations. In Keras, this is implemented by stacking several LSTM layers, each with return_sequences=True.
  • You may also want to experiment with GRU units to see which perform better, and with different hidden layer sizes. A hidden layer that is too large may result in overfitting (e.g. the network will learn the exact text), while a smaller size might not produce good results.
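For the multilayer variant, a sketch of the stacked architecture (layer sizes here are hypothetical):

```python
from tensorflow import keras

vocab_size = 84  # hypothetical; should match your tokenizer

model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None, vocab_size)),
    # lower layer: expected to pick up syllable-like character patterns
    keras.layers.LSTM(128, return_sequences=True),
    # higher layer: expected to pick up word- and phrase-level patterns
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dense(vocab_size, activation='softmax'),
])
```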

Soft text generation and temperature

In the previous definition of generate, we always took the character with the highest probability as the next character in the generated text. This often caused the text to "cycle" through the same character sequences again and again, as in this example:

today of the second the company and a second the company ...

However, if we look at the probability distribution for the next character, it may be that the difference between the few highest probabilities is not large: e.g. one character can have probability 0.2, another 0.19, etc. For example, when looking for the next character in the sequence 'play', the next character can equally well be either a space or e (as in the word player).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest might still lead to meaningful text. It is wiser to sample characters from the probability distribution given by the network output.

This sampling can be done using the np.random.multinomial function, which implements the so-called multinomial distribution. A function that implements this soft text generation is defined below:

[33]
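A self-contained sketch of generate_soft. As before, the corpus and untrained model are stand-ins so the snippet runs; raising the probabilities to the power 1/temperature and renormalizing is one common way to implement temperature (equivalent to dividing the logits by it), and decode uses .get to avoid the KeyError: 0 visible in the traceback below:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Assumed setup (hypothetical corpus; untrained stand-in model)
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(['today strike for the lead'])
eos_token = len(tokenizer.word_index) + 1
tokenizer.word_index['<eos>'] = eos_token
vocab_size = eos_token + 1
reverse_map = {val: key for key, val in tokenizer.word_index.items()}

def decode(x):
    return ''.join(reverse_map.get(t, '') for t in x)

model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None, vocab_size)),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.Dense(vocab_size, activation='softmax'),
])

def generate_soft(model, size=100, start='today ', temperature=1.0):
    inp = tokenizer.texts_to_sequences([start])[0]
    chars = []
    for _ in range(size):
        probs = model(tf.expand_dims(tf.one_hot(inp, vocab_size), 0))[0, -1].numpy()
        # T < 1 sharpens the distribution, T > 1 flattens it
        probs = probs.astype(np.float64) ** (1.0 / temperature)
        probs = probs / np.sum(probs)
        # sample the next token instead of always taking the argmax
        nc = int(np.argmax(np.random.multinomial(1, probs)))
        if nc == eos_token:
            break
        chars.append(nc)
        inp = inp + [nc]
    return start + decode(chars)
```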

--- Temperature = 0.3
Today #39;s strike #39; to start at the store return
On Sunday PO to Be Data Profit Up (Reuters)
Moscow, SP wins straight to the Microsoft #39;s control of the space start
President olding of the blast start for the strike to pay &lt;b&gt;...&lt;/b&gt;
Little red riding hood ficed to the spam countered in European &lt;b&gt;...&lt;/b&gt;

--- Temperature = 0.8
Today countie strikes ryder missile faces food market blut
On Sunday collores lose-toppy of sale of Bullment in &lt;b&gt;...&lt;/b&gt;
Moscow, IBM Diffeiting in Afghan Software Hotels (Reuters)
President Ol Luster for Profit Peaced Raised (AP)
Little red riding hood dace on depart talks #39; bank up

--- Temperature = 1.0
Today wits House buiting debate fixes #39; supervice stake again
On Sunday arling digital poaching In for level
Moscow, DS Up 7, Top Proble Protest Caprey Mamarian Strike
President teps help of roubler stepted lessabul-Dhalitics (AFP)
Little red riding hood signs on cash in Carter-youb

--- Temperature = 1.3
Today wits flawer ro, pSIA figat's co DroftwavesIs Talo up
On Sunday hround elitwing wint EU Powerburlinetien
Moscow, Bazz #39;s sentries olymen winnelds' next for Olympite Huc?
President lost securitys from power Elections in Smiltrials
Little red riding hood vides profit, exponituity, profitmainalist-at said listers

--- Temperature = 1.8
Today #39;It: He deat: N.KA Asside
On Sunday i arry Par aldeup patient Wo stele1
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-33-db32367a0feb> in <module>
     18     print(f"\n--- Temperature = {i}")
     19     for j in range(5):
---> 20         print(generate_soft(model,size=300,start=words[j],temperature=i))

<ipython-input-33-db32367a0feb> in generate_soft(model, size, start, temperature)
     11             chars.append(nc)
     12             inp = inp+[nc]
---> 13         return decode(chars)
     14 
     15 words = ['Today ','On Sunday ','Moscow, ','President ','Little red riding hood ']

<ipython-input-10-3f5fa6130b1d> in decode(x)
      2 
      3 def decode(x):
----> 4     return ''.join([reverse_map[t] for t in x])

<ipython-input-10-3f5fa6130b1d> in <listcomp>(.0)
      2 
      3 def decode(x):
----> 4     return ''.join([reverse_map[t] for t in x])

KeyError: 0

We have introduced one more parameter called temperature, which indicates how strongly we should stick to the highest probability. If the temperature is 1.0, we do fair multinomial sampling; as the temperature goes to infinity, all probabilities become equal and we select the next character at random. In the example above we can observe that the text becomes meaningless when we increase the temperature too much, and it resembles the "cycled" hard-generated text when the temperature gets closer to 0.