EmbeddingsPyTorch

Embeddings

In our previous example, we operated on high-dimensional bag-of-words vectors of length vocab_size, explicitly converting low-dimensional positional representations into a sparse one-hot representation. This one-hot representation is not memory-efficient; in addition, each word is treated independently of the others, i.e. one-hot encoded vectors do not express any semantic similarity between words.

In this unit, we will continue exploring the AG News dataset. To begin, let's load the data and get some definitions from the previous notebook.

[1]
Loading dataset...
d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\train.csv: 29.5MB [00:01, 18.8MB/s]                            
d:\WORK\ai-for-beginners\5-NLP\14-Embeddings\data\test.csv: 1.86MB [00:00, 11.2MB/s]                          
Building vocab...
Vocab size =  95812

What is embedding?

The idea of an embedding is to represent words by lower-dimensional dense vectors that reflect the semantic meaning of the word. We will later discuss how to build meaningful word embeddings, but for now let's just think of embeddings as a way to lower the dimensionality of a word vector.

So, an embedding layer takes a word as input and produces an output vector of the specified embedding_size. In a sense, it is very similar to a Linear layer, but instead of a one-hot encoded vector, it takes a word number as input.

By using an embedding layer as the first layer in our network, we can switch from the bag-of-words model to an embedding bag model, where we first convert each word in our text into the corresponding embedding, and then compute some aggregate function over all those embeddings, such as sum, average or max.

Image showing an embedding classifier for five sequence words.

Our classifier neural network consists of an embedding layer, then an aggregation layer, and a linear classifier on top:

[2]
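The model in this cell can be sketched as follows (a minimal version; the exact layer names and hyperparameters in the notebook may differ):

```python
import torch

class EmbedClassifier(torch.nn.Module):
    """Embed each token, average over the sequence, then classify."""
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)      # (batch, seq_len, embed_dim)
        x = torch.mean(x, dim=1)   # aggregate over the sequence dimension
        return self.fc(x)          # (batch, num_class)
```

The mean over the sequence dimension is what makes this an "embedding bag": the order of words is discarded, only their average embedding matters.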

Dealing with variable sequence size

As a result of this architecture, minibatches for our network need to be created in a certain way. In the previous unit, when using bag-of-words, all BoW tensors in a minibatch had the same size vocab_size, regardless of the actual length of the text sequence. Once we move to word embeddings, we end up with a variable number of words in each text sample, and when combining those samples into minibatches we have to apply some padding.

This can be done using the same technique of providing a collate_fn function to the data source:

[3]
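A possible implementation of such a collate function, assuming each sample is a (label, list-of-token-ids) pair (the exact sample format in the notebook may differ):

```python
import torch

def padify(batch):
    """Collate function: pad all token sequences in a minibatch to equal length.
    Assumes each sample is a (label, list_of_token_ids) pair."""
    labels = torch.LongTensor([label for label, _ in batch])
    sequences = [torch.LongTensor(tokens) for _, tokens in batch]
    # pad with zeros up to the longest sequence in the batch
    padded = torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True)
    return labels, padded
```

It would then be passed to the loader as `DataLoader(train_dataset, batch_size=16, collate_fn=padify)`.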

Training embedding classifier

Now that we have defined a proper data loader, we can train the model using the training function we defined in the previous unit:

[4]
3200: acc=0.6415625
6400: acc=0.6865625
9600: acc=0.7103125
12800: acc=0.726953125
16000: acc=0.739375
19200: acc=0.75046875
22400: acc=0.7572321428571429
(0.889799795315499, 0.7623160588611644)

Note: We are only training on 25k records here (less than one full epoch) for the sake of time, but you can continue training, write a function to train for several epochs, and experiment with the learning rate parameter to achieve higher accuracy. You should be able to reach an accuracy of about 90%.
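A multi-epoch training helper along the lines suggested in the note might look like this (a self-contained sketch; the `train_epoch` function from the previous unit is not shown here, so an explicit loop is used instead):

```python
import torch

def train_epochs(net, dataloader, epochs=5, lr=0.01):
    """Hypothetical helper: run several passes over the dataloader.
    Assumes the loader yields (labels, features) pairs, as produced by padify."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        net.train()
        for labels, features in dataloader:
            optimizer.zero_grad()
            loss = loss_fn(net(features), labels)
            loss.backward()
            optimizer.step()
    return loss.item()   # loss of the last minibatch
```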

EmbeddingBag Layer and Variable-Length Sequence Representation

In the previous architecture, we needed to pad all sequences to the same length in order to fit them into a minibatch. This is not the most efficient way to represent variable-length sequences - another approach is to use an offset vector, which holds the offsets of all sequences stored in one large vector.

Image showing an offset sequence representation

Note: In the picture above, we show a sequence of characters, but in our example we are working with sequences of words. However, the general principle of representing sequences with an offset vector remains the same.

To work with the offset representation, we use the EmbeddingBag layer. It is similar to Embedding, but it takes a content vector and an offset vector as input, and it also includes an aggregation step, which can be mean, sum or max.

Here is modified network that uses EmbeddingBag:

[5]
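A sketch of the EmbeddingBag-based network (naming is illustrative; the notebook's actual class may differ):

```python
import torch

class EmbedClassifierBag(torch.nn.Module):
    """Classifier using EmbeddingBag: embedding and mean aggregation in one layer."""
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = torch.nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = torch.nn.Linear(embed_dim, num_class)

    def forward(self, text, offsets):
        # text: 1-D tensor of all token ids concatenated
        # offsets: 1-D tensor holding the start index of each sample in text
        return self.fc(self.embedding(text, offsets))
```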

To prepare the dataset for training, we need to provide a conversion function that will prepare the offset vector:

[6]
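One way to write such a conversion function, again assuming (label, list-of-token-ids) samples:

```python
import torch

def offsetify(batch):
    """Collate function producing (labels, flat_text, offsets) for EmbeddingBag.
    Assumes each sample is a (label, list_of_token_ids) pair."""
    labels = torch.LongTensor([label for label, _ in batch])
    sequences = [torch.LongTensor(tokens) for _, tokens in batch]
    # offsets[i] is the index where sample i starts in the concatenated vector
    lengths = [0] + [len(s) for s in sequences[:-1]]
    offsets = torch.LongTensor(lengths).cumsum(dim=0)
    text = torch.cat(sequences)
    return labels, text, offsets
```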

Note that, unlike in all previous examples, our network now accepts two parameters: a data vector and an offset vector, which are of different sizes. Similarly, our data loader provides us with 3 values instead of 2: both the text and offset vectors are provided as features. Therefore, we need to slightly adjust our training function to take care of that:

[7]
3200: acc=0.6153125
6400: acc=0.6615625
9600: acc=0.6932291666666667
12800: acc=0.715078125
16000: acc=0.7270625
19200: acc=0.7382291666666667
22400: acc=0.7486160714285715
(22.771553103007037, 0.7551983365323096)
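The adjusted training function can be sketched as follows (a simplified version that reports average loss and accuracy; the notebook's actual function may differ):

```python
import torch

def train_epoch_emb(net, dataloader, lr=0.01):
    """One training pass for a network that takes (text, offsets) as features."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    total_loss, acc, count = 0.0, 0, 0
    net.train()
    for labels, text, offsets in dataloader:
        optimizer.zero_grad()
        out = net(text, offsets)          # two feature tensors instead of one
        loss = loss_fn(out, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        acc += (out.argmax(dim=1) == labels).sum().item()
        count += len(labels)
    return total_loss / count, acc / count
```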

Semantic Embeddings: Word2Vec

In our previous example, the model's embedding layer learnt to map words to vector representations; however, these representations did not have much semantic meaning. It would be nice to learn a vector representation in which similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance (e.g. Euclidean distance).

To do that, we need to pre-train our embedding model on a large collection of text in a specific way. One of the first ways to train semantic embeddings is called Word2Vec. It is based on two main architectures that are used to produce a distributed representation of words:

  • Continuous bag-of-words (CBoW) — in this architecture, we train the model to predict a word from the surrounding context. Given the n-gram (W_{-2}, W_{-1}, W_0, W_1, W_2), the goal of the model is to predict W_0 from (W_{-2}, W_{-1}, W_1, W_2).
  • Continuous skip-gram is the opposite of CBoW: the model uses the current word to predict the surrounding window of context words.

CBoW is faster, while skip-gram is slower, but does a better job of representing infrequent words.
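The CBoW idea can be sketched in a few lines of PyTorch (a toy illustration on synthetic data, not the original Word2Vec implementation; real training would also use techniques such as negative sampling):

```python
import torch

class CBoW(torch.nn.Module):
    """Predict the center word from the average of its context word embeddings."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.out = torch.nn.Linear(embed_dim, vocab_size)

    def forward(self, context):
        # context: (batch, 2*window) of surrounding word ids
        return self.out(self.embedding(context).mean(dim=1))

# toy data: context windows (W-2, W-1, W1, W2) -> center word W0
vocab_size, window = 20, 2
model = CBoW(vocab_size, embed_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
context = torch.randint(0, vocab_size, (32, 2 * window))
center = torch.randint(0, vocab_size, (32,))
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(context), center)
    loss.backward()
    optimizer.step()
# after training, model.embedding.weight holds the learned word vectors
```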

Image showing both CBoW and Skip-Gram algorithms to convert words to vectors.

To experiment with a word2vec embedding pre-trained on the Google News dataset, we can use the gensim library. Below we find the words most similar to 'neural'.

Note: When you first create word vectors, downloading them can take some time!

[8]
[9]
neuronal -> 0.7804799675941467
neurons -> 0.7326500415802002
neural_circuits -> 0.7252851724624634
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923246383666992
synaptic -> 0.6699118614196777
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314064025879
neuronal_activity -> 0.6531826257705688

We can also extract the vector embedding of a word, to be used in training a classification model (we only show the first 20 components of the vector for clarity):

[10]
array([ 0.01226807,  0.06225586,  0.10693359,  0.05810547,  0.23828125,
        0.03686523,  0.05151367, -0.20703125,  0.01989746,  0.10058594,
       -0.03759766, -0.1015625 , -0.15820312, -0.08105469, -0.0390625 ,
       -0.05053711,  0.16015625,  0.2578125 ,  0.10058594, -0.25976562],
      dtype=float32)

A great thing about semantic embeddings is that you can manipulate the vector encoding to change the semantics. For example, we can ask to find a word whose vector representation is as close as possible to the words king and woman, and as far away as possible from the word man:

[10]
('queen', 0.7118192911148071)
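The underlying arithmetic can be illustrated with cosine similarity on toy vectors (hand-made 3-D values for illustration only; the real computation uses gensim's most_similar with positive and negative word lists on 300-dimensional vectors):

```python
import numpy as np

# toy 3-D "embeddings" (made-up values for illustration only)
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.1, 0.9, 0.1]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'apple': np.array([0.0, 0.5, 0.0]),
}

def most_similar(target, exclude):
    """Return the stored word whose vector has the highest cosine similarity."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = {w: cos(v, target) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

# king - man + woman should land near queen
result = most_similar(vectors['king'] - vectors['man'] + vectors['woman'],
                      exclude={'king', 'man', 'woman'})
```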

Both CBoW and skip-gram are "predictive" embeddings, in that they only take local contexts into account; Word2Vec does not take advantage of global context.

FastText builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pre-training, it enables word embeddings to encode sub-word information.

Another method, GloVe, leverages the idea of a co-occurrence matrix and uses neural methods to decompose it into more expressive and non-linear word vectors.

You can play with the example by changing the embeddings to FastText or GloVe, since gensim supports several different word embedding models.

Using Pre-Trained Embeddings in PyTorch

We can modify the example above to pre-populate the matrix in our embedding layer with semantic embeddings such as Word2Vec. We need to take into account that the vocabularies of the pre-trained embedding and our text corpus will likely not match, so we will initialize the weights for the missing words with random values:

[11]
Embedding size: 300
Populating matrix, this will take some time...Done, found 41080 words, 54732 words missing
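The matrix-populating step can be sketched as follows (using a plain dict of made-up vectors in place of the actual loaded word2vec model, which supports the same word-in-model lookup):

```python
import numpy as np
import torch

def populate_embedding_matrix(vocab, pretrained, embed_dim):
    """Build an embedding weight matrix: pretrained vector when the word is
    known, random initialization otherwise. `pretrained` maps word -> vector
    (e.g. a loaded word2vec model); here it can be any dict-like object."""
    weights = np.random.normal(0, 0.1, (len(vocab), embed_dim)).astype(np.float32)
    found = 0
    for i, word in enumerate(vocab):
        if word in pretrained:
            weights[i] = pretrained[word]
            found += 1
    return torch.tensor(weights), found, len(vocab) - found

# toy example with a made-up vector in place of real word2vec
pretrained = {'cat': np.ones(4, dtype=np.float32)}
weights, found, missing = populate_embedding_matrix(['cat', 'zzz'], pretrained, 4)
```

The resulting matrix is then copied into the weights of the network's embedding layer.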

Now let's train our model. Note that the time it takes to train the model is significantly larger than in the previous example, due to the larger embedding layer size and thus a much higher number of parameters. Also, because of this, we may need to train our model on more examples if we want to avoid overfitting.

[12]
3200: acc=0.6359375
6400: acc=0.68109375
9600: acc=0.7067708333333333
12800: acc=0.723671875
16000: acc=0.73625
19200: acc=0.7463541666666667
22400: acc=0.7560714285714286
(214.1013875559821, 0.7626759436980166)

In our case we do not see a huge increase in accuracy, which is likely due to quite different vocabularies. To overcome the problem of different vocabularies, we can use one of the following solutions:

  • Re-train word2vec model on our vocabulary
  • Load our dataset with the vocabulary from the pre-trained word2vec model. Vocabulary used to load the dataset can be specified during loading.

The latter approach seems easier, especially because the PyTorch torchtext framework contains built-in support for embeddings. We can, for example, instantiate a GloVe-based vocabulary in the following manner:

[14]
100%|█████████▉| 399999/400000 [00:15<00:00, 25411.14it/s]

Loaded vocabulary has the following basic operations:

  • vocab.stoi is a dictionary that allows us to convert a word into its dictionary index
  • vocab.itos does the opposite - it converts a number into a word
  • vocab.vectors is the array of embedding vectors, so to get the embedding of a word s we need to use vocab.vectors[vocab.stoi[s]]
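The pattern behind these three operations can be illustrated with a toy stand-in vocabulary (real code would use torchtext.vocab.GloVe, which downloads the vectors):

```python
import torch

class ToyVocab:
    """Toy stand-in for a loaded torchtext vocabulary, illustrating the
    stoi / itos / vectors access pattern."""
    def __init__(self, words, dim=4):
        self.itos = words                                # index -> word
        self.stoi = {w: i for i, w in enumerate(words)}  # word -> index
        self.vectors = torch.randn(len(words), dim)      # embedding matrix

vocab = ToyVocab(['the', 'cat', 'sat'])
s = 'cat'
embedding = vocab.vectors[vocab.stoi[s]]   # embedding of word s
```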

Here is an example of manipulating embeddings to demonstrate the equation king - man + woman = queen (I had to tweak the coefficient a bit to make it work):

[15]
'queen'

To train the classifier using those embeddings, we first need to encode our dataset using GloVe vocabulary:

[16]

As we have seen above, all vector embeddings are stored in the vocab.vectors matrix. This makes it very easy to load those weights into the weights of the embedding layer by simple copying:

[17]
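The copy itself is one line (sketched here with a random matrix standing in for vocab.vectors):

```python
import torch

# random values stand in for vocab.vectors, which has shape (vocab_size, dim)
vocab_size, embed_dim = 50, 16
vectors = torch.randn(vocab_size, embed_dim)

embedding = torch.nn.Embedding(vocab_size, embed_dim)
embedding.weight.data = vectors          # simple weight copy
# optionally freeze the pretrained weights so they are not updated:
# embedding.weight.requires_grad = False
```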

Now let's train our model and see if we get better results:

[18]
3200: acc=0.6271875
6400: acc=0.68078125
9600: acc=0.7030208333333333
12800: acc=0.71984375
16000: acc=0.7346875
19200: acc=0.7455729166666667
22400: acc=0.7529464285714286
(35.53972978646833, 0.7575175943698017)

One of the reasons we are not seeing a significant increase in accuracy is that some words from our dataset are missing in the pre-trained GloVe vocabulary, and thus they are essentially ignored. To overcome this, we can train our own embeddings on our dataset.

Contextual Embeddings

One key limitation of traditional pretrained embedding representations such as Word2Vec is the problem of word sense disambiguation. While pretrained embeddings can capture some of the meaning of words in context, every possible meaning of a word is encoded into the same embedding. This can cause problems in downstream models, since many words, such as the word 'play', have different meanings depending on the context they are used in.

For example, the word 'play' in these two sentences has quite different meanings:

  • I went to a play at the theater.
  • John wants to play with his friends.

The pretrained embeddings above represent both of these meanings of the word 'play' in the same embedding. To overcome this limitation, we need to build embeddings based on a language model, which is trained on a large corpus of text and knows how words can be put together in different contexts. Discussing contextual embeddings is out of scope for this tutorial, but we will come back to them when talking about language models in the next unit.