What you’ll learn
- how to use a neural network to learn a useful representation of text documents
- what doc2vec is
- how unsupervised learning can help you during a supervised learning task
- Check the NOTEBOOK for the full script!!!
In the last post, we saw that (with the right tools) it was quite easy to build a spam classifier achieving both high precision and high recall, even with an unbalanced data set. In this post, we will investigate the patterns in our data set without explicitly using the labels in the algorithm we use. In other words, we will adopt an unsupervised learning approach to this problem of spam classification, and see to what extent it is feasible to classify those messages without using the labels.
A word on unsupervised learning
Unsupervised learning can be defined as the task of finding useful representations in unlabeled data or learning functions that describe interesting patterns in unlabeled data sets. This field is very exciting and is a hot topic among researchers in the machine learning/AI community. First of all, unlabeled data is widespread, unlike labeled data. Indeed, labeled data sets must be labeled by some kind of intelligence, which is not a trivial task, while unlabeled data is present everywhere. Think for example about the millions of pictures or unlabeled documents you can find on the internet.
Now let’s talk about a very important concept in machine learning: data representation. More specifically, unstructured data such as text documents and images needs to be represented in a smart way before you can learn something about it. For instance, in the field of deep learning, convolutional neural network classifiers are generally composed of two parts:
- the first part is made of convolutional layers (among other things such as activation layers, pooling layers, dropout layers, …) and acts as a feature extractor
- the second part is a classifier (generally fully-connected layers)
The first part of the network learns a representation of the images that can be understood by some classifier.
Another example is text document representation. In the last post, we talked about the tf-idf representation of text messages. This basic representation turned out to be useful (even though it is sparse and high-dimensional), as a basic SVM classifier did the trick for spam detection.
So, the question we ask in this post is
Can we learn a representation of our data set so that it would be possible to differentiate between spams and non-spams without using the labels, or at least help a classifier to do the job?
As we will see, we could find some answers…
doc2vec for document embedding
doc2vec is an unsupervised learning method based on this paper.
I won’t give an extensive explanation of how it works here, but you can check this Quora post for a very nice description of what happens behind the scenes!
This approach is an extension of the word2vec method. The idea is to build a neural network with one hidden layer, give it some words (the context) as input, and ask it to predict the missing word. For example, with one context word on each side, “The parrot is near the window” would generate the following training samples:
x = [The, is], y = parrot
x = [parrot, near], y = is
x = [is, the], y = near
It is possible to change the window size in the model parameters.
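To make the windowing concrete, here is a minimal sketch in plain Python of how such (context, target) samples could be generated. The function name context_target_pairs and its window argument are my own, for illustration only; this is not how gensim builds its samples internally. Unlike the three samples listed above, the sketch also emits samples for the first and last word, with a shorter context.

```python
def context_target_pairs(tokens, window=1):
    """Return (context, target) pairs for every position in `tokens`.

    `window` is the number of context words taken on each side of the target.
    Words near the edges simply get a shorter context.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip the degenerate case of a one-word sentence
            pairs.append((context, target))
    return pairs


sentence = "The parrot is near the window".split()
pairs = context_target_pairs(sentence, window=1)
for context, target in pairs:
    print("x =", context, ", y =", target)
```

Calling it with window=2 would give up to four context words per sample, which is what changing the window size amounts to.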
The hidden layer of this network contains k neurons, where k is the dimensionality of the final embedding.
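Hence, we turn this unsupervised problem into a supervised one. However, we are not interested in the predictions, only in the learned weights of the hidden layer! This is the representation we are looking for. In doc2vec, each document additionally gets its own vector, which is fed to the network alongside the context words and learned at the same time; that vector is the document embedding. Note that if k is set to 2 or 3, it is possible to immediately visualize the embedded documents.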
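Why are the hidden-layer weights the embedding? A small sketch in plain Python may help. It assumes one-hot encoded inputs and a hidden layer without bias or activation, and the weights are random rather than trained, so the numbers themselves mean nothing. The point is only that multiplying a one-hot vector by the weight matrix picks out one row, and that row is the k-dimensional vector attached to the input.

```python
import random

random.seed(0)

vocab = ["the", "parrot", "is", "near", "window"]
k = 2  # dimensionality of the embedding (number of hidden neurons)

# Input-to-hidden weights: one row of k values per vocabulary entry.
# Random placeholders here; training would adjust them.
W = [[random.uniform(-0.5, 0.5) for _ in range(k)] for _ in vocab]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def hidden_activation(x, weights):
    """Compute x^T W for an input vector x and a weight matrix W."""
    n_hidden = len(weights[0])
    return [sum(x[i] * weights[i][j] for i in range(len(x)))
            for j in range(n_hidden)]


idx = vocab.index("parrot")
h = hidden_activation(one_hot(idx, len(vocab)), W)

# The hidden activation is exactly the row of W that belongs to "parrot".
print(h == W[idx])
```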
Let’s code it!
As we’ll see, it only takes a few lines of code to train the doc2vec neural net and get a useful representation of our text messages. I’ll explain here the logic and main steps of the code, but check the NOTEBOOK for the whole script!
According to the gensim documentation, it is possible to train the model with a generator, so that we do not have to load all the documents into RAM at the same time (in practice the object must be restartable, since the model makes several passes over the corpus). Even though our data set is relatively small (approximately 5000 text messages), we will use this trick, so that you can reproduce it for larger data sets. Our generator has to yield TaggedDocument objects, which are then processed by the doc2vec model. The TaggedDocument objects are instantiated like this:
doc = gensim.models.doc2vec.TaggedDocument(words=txt, tags=[label])
The text is a document data point, which can be pre-processed with the gensim method
The label is just a unique id for each document. It can be basically anything, but these ids must be unique! That’s it for the preprocessing. As you can see we don’t do anything fancy here, but there is most certainly room for improvement.
A doc2vec model is instantiated in this way:
model = gensim.models.Doc2Vec(vector_size=2, min_count=5, epochs=200)
Here is an explanation of the parameters (these are the gensim 4 names; older releases called them size and iter):
- vector_size: dimension of the embedded vectors, 2 means that the documents will be projected into a 2-dimensional Euclidean space
- min_count: each word whose total number of occurrences in the whole corpus is lower than this threshold is left out of the vocabulary. Typical values are from 5 to 100. The larger the corpus, the larger it should be.
- epochs: the number of iterations (epochs) over the corpus to train the neural network
The model gets trained very quickly and as seen in the next section, the results are quite good!
Results and discussion
The following plots are self-explanatory! As we can see, we managed to successfully embed high-dimensional vectors into 2- and 3-dimensional Euclidean space, so that messages of the same class are grouped together. However, there is a worrying issue with this representation. In this case, we had the labels and hence we could plot the positive embedded data points (spams) in red and the negative ones (non-spams) in blue. But what would that plot look like if we had not highlighted the two classes with colours?
The answer is not difficult to imagine. It would have been very difficult to cluster the two classes, as all the points lie in a single region. Therefore, the doc2vec method was able to group spams together in an unsupervised way, but unable to separate them from the non-spams. An interesting extension to this model would be to find a way to separate the different classes, so that an (unsupervised) clustering algorithm would be able to do the classification job on its own.
I tried different parameters for this model, including higher-dimensional embeddings followed by t-SNE (t-Distributed Stochastic Neighbor Embedding), but unfortunately I could not manage to separate the two classes… Feel free to try and to share your results if you manage to do it!
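If you want to try the t-SNE route yourself, the projection step could look roughly like this with scikit-learn. This is only a sketch: the random vectors below are a placeholder for real higher-dimensional doc2vec vectors, and the perplexity value is an arbitrary choice that you would need to tune.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Placeholder for real doc2vec vectors: 60 "documents" in 50 dimensions.
vectors = rng.normal(size=(60, 50))

# Project down to 2 dimensions for plotting.
# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
projected = tsne.fit_transform(vectors)

print(projected.shape)  # (60, 2)
```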
As we can see, unsupervised learning can be very useful for solving a supervised problem. A low-dimensional representation of the data can be visualized (if the dimension is 2 or 3) and can significantly help you with the classification/regression task. Indeed, in the first figure we can even manually draw the separating line between the two classes. Without dimensionality reduction, there is no way to manually draw a hyperplane when your vectors have some 1500 dimensions…