In Part 3 of this technology.org series on Dialog Systems, we introduced the basic idea of how to convert a piece of text into features that a machine can work with. There, we described two of the most straightforward techniques, one-hot encoding and bag-of-words encoding.

In this part, we will explain the internals of the TF-IDF technique, an abbreviation you have probably heard many times before. We will also summarize what is common to all the basic text representation techniques we have learned thus far. This will reveal their major drawbacks and hint at what is needed for the better numerical representations of text that we have yet to learn about.

Towards the end of Part 4, we will also introduce the idea of word embeddings and a powerful algorithm for converting vectors of word frequencies into a new space of vectors with far fewer dimensions.

Dialog systems use specific algorithms for converting text into machine-friendly features. Image credit: Pxhere, CC0 Public Domain

In case you missed the first three articles, you may want to read the following posts before starting with the fourth part:

How to Make Your Customer Happy by Using a Dialog System?

AI | Dialog Systems Part 2: How to Develop Dialog Systems That Make Sense

AI | Dialog Systems Part 3: How to Find Out What the User Needs?

More on Feature Extraction in Dialog Systems

TF-IDF Encoding

In both vectorization approaches we learned in Part 3, all the words in a document are regarded as equally important; none is given more weight than the others. There is another technique, known as TF-IDF, that aims to change this by measuring the importance of each word relative to the other words in the corpus. The abbreviation TF-IDF stands for Term Frequency–Inverse Document Frequency.

The essence of TF-IDF can be stated like this: if word X shows up frequently in document Y, but does not appear often in the other documents of the corpus, then X must be of great importance to Y. So we have Term Frequency (TF) to quantify how frequently a word appears in the document concerned.

As the corpus will probably contain documents of different lengths, a given word is likely to appear more frequently in a longer document than in a shorter one. To have a normalized metric for TF, the number of occurrences of word X in document Y is divided by the length of Y (i.e., the total number of words in Y). Meanwhile, Inverse Document Frequency (IDF) is meant to assess the importance of a given word across the entire corpus.

Now we come across a small issue. Ordinary computation of TF assumes equal weights (importance) for all the words. That would introduce a bias toward stop words such as “a”, “the”, “is”, etc.: they would tend to look more important simply because they appear so often. To mitigate this, IDF reduces the weights of words that are common across the corpus, while boosting the weights of rare words.

We compute IDF of word X using the following procedure:

  • Count the number of documents containing X;
  • Divide the total number of documents in the corpus by this count;
  • Take the log base 2 of the ratio.

Now, it’s time to combine TF and IDF into a single TF-IDF score by simply calculating their product: TF-IDF(X, Y) = TF(X, Y) × IDF(X).

In the case of the tiny example corpus we introduced in Part 3 (repeated below for convenience), some of our six words appear in only one of the four sentences, others appear in two different sentences, and some even show up in three of them.

S1:       Cat chases dog.

S2:       Dog chases cat.

S3:       Cat drinks milk.

S4:       Dog drinks water.

Let’s take the word “cat”. It appears once in each of three sentences, i.e., three times among the 12 words of the corpus. Computing TF over the whole corpus for simplicity, its TF score is 3/12 = 0.25, and its IDF score is log2(4/3) = 0.4150. Therefore, the combined TF-IDF score for “cat” becomes 0.25 * 0.4150 = 0.1038. You can easily obtain the TF-IDF scores for the other five words in the example corpus by following this template.

Then, to get the vector representation of any sentence, simply collect the TF-IDF scores of each vocabulary word, using zero for words that do not appear in the sentence. For S1 this gives:

cat        chases     dog        drinks     milk       water
0.1038     0.1667     0.1038     0          0          0
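
If you prefer to verify these numbers programmatically, here is a minimal Python sketch that reproduces the scores above. Note that it follows the simplified scheme of this worked example (term counts and the word total taken over the whole corpus, IDF with log base 2); it is an illustration, not a production implementation.

```python
import math

# Tiny corpus from Part 3 (lowercased, punctuation stripped)
corpus = [
    "cat chases dog",    # S1
    "dog chases cat",    # S2
    "cat drinks milk",   # S3
    "dog drinks water",  # S4
]
docs = [s.split() for s in corpus]
vocab = sorted({w for d in docs for w in d})
total_words = sum(len(d) for d in docs)   # 12 words in the corpus
n_docs = len(docs)                        # 4 documents

def tf(word):
    # Simplified TF used in this example: occurrences across the whole
    # corpus divided by the total number of words in the corpus.
    return sum(d.count(word) for d in docs) / total_words

def idf(word):
    # Count the documents containing the word, divide the total number of
    # documents by that count, and take log base 2 of the ratio.
    df = sum(1 for d in docs if word in d)
    return math.log2(n_docs / df)

def tfidf_vector(sentence):
    words = sentence.split()
    return [tf(w) * idf(w) if w in words else 0.0 for w in vocab]

print(vocab)  # ['cat', 'chases', 'dog', 'drinks', 'milk', 'water']
print([round(x, 4) for x in tfidf_vector("cat chases dog")])
# [0.1038, 0.1667, 0.1038, 0.0, 0.0, 0.0]
```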

As with the BoW technique, we can use the TF-IDF vectors to compute the similarity between two documents. The TF-IDF representation is often used in common applications such as information retrieval and text classification. Even with the current expansion of deep-learning techniques, TF-IDF remains a favorite representation approach for many NLP tasks, particularly when developing an early version of a solution.
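
As a quick illustration of the similarity computation, the sketch below uses scikit-learn (assumed installed). Keep in mind that TfidfVectorizer applies a smoothed, natural-log IDF and L2-normalizes each vector by default, so its raw scores differ from the hand calculation above, even though document similarities behave the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["Cat chases dog.", "Dog chases cat.", "Cat drinks milk.", "Dog drinks water."]

# Fit the vectorizer on the corpus and produce a 4 x 6 sparse TF-IDF matrix.
X = TfidfVectorizer().fit_transform(corpus)

print(cosine_similarity(X[0], X[1]))  # S1 vs S2: identical bags of words -> 1.0
print(cosine_similarity(X[0], X[3]))  # S1 vs S4: only "dog" shared -> much lower
```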

Bad News about the Three Techniques

Now that we have learned three approaches to representing text numerically, it’s time to point out the major shortcomings they all share:

  • All three representations are discrete in nature, which means they consider words as atomic units. This implies reduced capacity in capturing relationships between words.
  • All three representations are high-dimensional, as the number of dimensions grows with the size of the vocabulary. Meanwhile, the feature vectors contain almost exclusively zeros, resulting in a sparse representation. This sparsity is likely to hinder learning due to overfitting. Also, high-dimensional representations consume too many computational resources.
  • None of the techniques can deal with the out-of-vocabulary problem. This is the situation when the model encounters a word that was not included in the training data; the model then simply has no way to represent that word, as the short sketch after this list illustrates.
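
To make the out-of-vocabulary issue concrete, here is a small sketch (using scikit-learn’s CountVectorizer, assumed installed) showing that a vectorizer fitted on our tiny corpus silently drops any word it has never seen; the TF-IDF representation behaves the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["Cat chases dog.", "Dog chases cat.", "Cat drinks milk.", "Dog drinks water."])

# "parrot" and "eats" were never seen during fitting, so they are ignored;
# the new sentence is represented only by the known word "cat".
print(vectorizer.get_feature_names_out())
# ['cat' 'chases' 'dog' 'drinks' 'milk' 'water']
print(vectorizer.transform(["Parrot eats cat."]).toarray())
# [[1 0 0 0 0 0]]
```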

To overcome the limitations of the basic vectorization techniques discussed, new approaches have been invented that work with low-dimensional representations of text. The use of these approaches, collectively known as distributed representations or embeddings, gained momentum only fairly recently, in the past eight years or so. In the next section, we will look at the major reason for their popularity.

From Plain Word Statistics to Word Vectors

Now that you have some understanding of how to turn natural language into numbers, it’s time to work some magic on the user input. This is the first time we will look at a machine’s ability to grasp what words actually mean. After all, you can think of a dialog system as a kind of semantic search engine.

That’s why you will definitely want to know how to translate your TF-IDF vectors into topic vectors, which in turn allow you to do many fascinating things. Among them, your future dialog systems will be able to do semantic search, i.e., search for documents based on their meaning. This way your dialog system can figure out which texts to retrieve from its knowledge base and use as a template for responding to the user in line with the context of the conversation.

If you try adding or subtracting a couple of TF-IDF vectors, you get only plain statistics about which words were used in the texts represented by those vectors. This will not give you any clue about the meaning all these words convey. Hence, we should find some way to extract meaning from simple word counts. Then, we should figure out how to create another, more compact vector to represent that meaning.
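
One classic way to compress TF-IDF vectors into a small number of “meaning” dimensions is latent semantic analysis, i.e., a truncated SVD of the TF-IDF matrix. The sketch below only shows the shape of this transformation on our toy corpus; real applications use thousands of documents and a few hundred topic dimensions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Cat chases dog.", "Dog chases cat.", "Cat drinks milk.", "Dog drinks water."]
tfidf = TfidfVectorizer().fit_transform(corpus)   # 4 documents x 6 words

# Project the 6-dimensional TF-IDF vectors onto 2 "topic" dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
topic_vectors = svd.fit_transform(tfidf)

print(topic_vectors.shape)   # (4, 2): each document is now a compact topic vector
```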

Moreover, the basic text representation techniques we have covered so far neglect the information about the words surrounding a target word. In other words, these techniques ignore the impact that the words in the vicinity of the target word have on its meaning. Instead, all the words from each sentence are simply thrown together into a statistical bag.

Now it’s time to learn to construct much smaller bags of words from sequences of only a few neighboring words. This time it will be important to make sure that these neighborhoods don’t leak into the sentences adjacent to the current one. This way the training of your word vectors will stay focused on the appropriate words.

Finding a suitable numerical representation for the meaning of words and entire documents can be complicated. This is particularly true for languages like English, which contain plenty of words with multiple meanings. If that is a challenge for a human learner, it is no less of a challenge for a machine learning model.

Doing Vector-Oriented Reasoning with Word2vec

While training a neural network to predict the occurrences of words close to each target word, Tomas Mikolov figured out how to encode their meaning using a relatively small number of vector dimensions. In 2013, while on the Google team, Mikolov released the code for building these word vectors, known as Word2vec [Mikolov et al., 2013].

The power of Word2vec lies in its use of unsupervised learning, through which a machine can learn directly from data without any help from humans. That is, you don’t have to worry about organizing your training data or labeling it by hand. Since there is an abundance of unlabeled and unstructured natural language text on the internet, Word2vec is perfectly suited to this type of data. All you need is a corpus big enough to mention the words you are interested in a sufficient number of times, so the neural network can learn which other words usually appear next to the target words.
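
If you want to try this yourself, the gensim library (assumed installed) provides a widely used implementation of Word2vec. The sketch below trains a tiny model on our toy corpus merely to show the API; meaningful vectors require a corpus with millions of words.

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list. A real corpus would contain
# millions of sentences scraped from the web or another large text source.
sentences = [
    ["cat", "chases", "dog"],
    ["dog", "chases", "cat"],
    ["cat", "drinks", "milk"],
    ["dog", "drinks", "water"],
]

# sg=1 selects the skip-gram variant; window controls how many neighboring
# words count as the context of each target word.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["cat"].shape)          # (50,): a dense 50-dimensional word vector
print(model.wv.most_similar("cat"))   # neighbors ranked by cosine similarity
```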

The Word2vec algorithm was developed to support doing ordinary math with word vectors, so that you get a reasonable answer after translating these vectors back into words. Adding and subtracting word vectors can answer analogy questions such as: “The Timbers are to Portland as what is to Seattle?” If you have taken care to train your neural network well enough, you can expect this math on word vectors to spit out “Seattle_Sounders” as the answer.

So, with this new representation of words, you can answer analogy questions using the ordinary vector algebra you learned in high school.

Image from [Hobson Lane, Cole Howard, and Hannes Max Hapke. Natural Language Processing in Action. Manning Publications, 2019]

Since the Word2vec model encodes information about the relationships between words, it is able to deduce that “Portland” and “Portland Timbers” are about the same distance apart as “Seattle” and “Seattle Sounders”.

It’s true that the vector you get after adding and subtracting the original word vectors will rarely be exactly equal to a vector readily available in your vocabulary of word vectors. Still, if you search your vocabulary for the entry closest to the calculated target vector, you will most likely get the answer to your NLP question. For the current question about sports teams and cities, the word associated with this nearest vector is the answer.
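
In practice, this nearest-vector lookup is a one-liner with gensim’s pretrained vectors. The sketch below assumes the Google News model is available through gensim’s downloader and that the multi-word tokens “Portland_Timbers” and “Seattle_Sounders” are in its vocabulary; most_similar() adds and subtracts the vectors and returns the vocabulary entries closest to the result.

```python
import gensim.downloader as api

# Loads the pretrained Google News vectors (a large download on first use).
wv = api.load("word2vec-google-news-300")

# "The Timbers are to Portland as what is to Seattle?"
# Vector arithmetic: Portland_Timbers - Portland + Seattle
answers = wv.most_similar(positive=["Portland_Timbers", "Seattle"],
                          negative=["Portland"], topn=3)
print(answers)   # "Seattle_Sounders" is expected to rank near the top
```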

We will have to stop here and continue with this exciting Word2vec topic in the next article of the series.

Wrapping Up

This was the fourth article in the technology.org series on Dialog Systems, where we examined the workings of the popular TF-IDF technique and summarized the major drawbacks of basic text representation techniques that rely on simple counts of words in a corpus.

We also learned how reasoning with word vectors can help solve some notably delicate problems, such as answering questions based on analogies. The power of the Word2vec algorithm lies in transforming vectors of word frequencies into a new space of vectors with far fewer dimensions. You can do ordinary vector algebra in this lower-dimensional space and then return to the natural language space whenever you need. Just imagine the benefits this can bring to your dialog system!

In the next part of the technology.org series, we will finish the topic of word embeddings by demonstrating how to use popular NLP libraries for quickly accessing some key functionality of pretrained word vector models.

Author’s Bio

Darius Miniotas is a data scientist and technical writer with Neurotechnology in Vilnius, Lithuania. He is also Associate Professor at VILNIUSTECH where he has taught analog and digital signal processing. Darius holds a Ph.D. in Electrical Engineering, but his early research interests focused on multimodal human-machine interactions combining eye gaze, speech, and touch. Currently he is passionate about prosocial and conversational AI. At Neurotechnology, Darius is pursuing research and education projects that attempt to address the remaining challenges of dealing with multimodality in visual dialogues and multiparty interactions with social robots.

References

  1. Andrew R. Freed. Conversational AI. Manning Publications, 2021.
  2. Rashid Khan and Anik Das. Build Better Chatbots. Apress, 2018.
  3. Hobson Lane, Cole Howard, and Hannes Max Hapke. Natural Language Processing in Action. Manning Publications, 2019.
  4. Michael McTear. Conversational AI. Morgan & Claypool, 2021.
  5. Tomas Mikolov, Kai Chen, G.S. Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. Sep 2013, https://arxiv.org/pdf/1301.3781.pdf.
  6. Sumit Raj. Building Chatbots with Python. Apress, 2019.
  7. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. Practical Natural Language Processing. O’Reilly Media, 2020.