
A Gentle Intro to Natural Language Processing: The Ultimate Text Tool


As machine learning and AI continue to develop at a rapid pace, some of the most exciting progress is being made by researchers working on Natural Language Processing (NLP).

NLP is the application of mathematical algorithms and computational techniques to the analysis of natural language, speech, and text. The premise of NLP dates back as far as the 1950s, and John Searle’s Chinese room thought experiment [1], published in 1980, captured a central question of the field: could a computer emulate natural language from rules set out in a Chinese phrasebook? Since the 1990s, statistical NLP has grown through the use of machine learning algorithms, an early example being machine translation from one language to another.

Despite its name, NLP has plenty of mathematics around the algorithms used within it. However, this article intends to give only a brief overview of some of the methods used in the discipline, as well as how they could be useful to businesses.

To understand why NLP is important, we first need to understand the two types of data:

  • Quantitative: Continuous or discrete numerical data.
  • Qualitative: Nominal or ordinal (limited) text data.

Whilst qualitative data is technically text data, its values are not unique to each record. An example would be the colour of a set of cars, where there is a finite number of colours that a car could be. A long string of text, such as a sentence, would not fit into either of the above categories.

This presents a new type of data to investigate and analyse, which is not only very common, but can contain large amounts of information. NLP can be applied to all sorts of documents, from articles to legal contracts, which have useful insights to be extracted, making NLP an important tool in a modern data-driven world.


Using a Computer’s Language

Whilst humans communicate with words, algorithms operate on numbers, which presents the first obstacle to overcome in NLP: converting text to a numeric format. This is known as word embedding, first theorised by Gerard Salton[2]; without it, algorithms are unable to extract meaning from the text.


Whilst there are many ways to do this conversion, only a couple will be shown here.

     1. One-Hot Encoding/Bag of Words

One-hot encoding is the simplest form of word embedding. Each word in the vocabulary is assigned its own position in a vector (a collection of numeric values), and a word is represented by a 1 in its position and 0 everywhere else. Counting how often each word appears in a document and storing those counts in the same positions gives a vector known as a bag of words, which can be fed into the algorithms. Whilst a vector may not mean much to the human eye, NLP algorithms can make good use of it to extract insights from a document.

A bag of words reflects how common a word is within a document, so that more frequent words are given a higher value. This rests on the assumption that words used more frequently are more important; however, that is not always the case…
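
As a rough illustration of the idea, the sketch below builds a vocabulary from a two-sentence toy corpus and produces a one-hot vector for a single word and a bag-of-words count vector for a whole document. The function names, the whitespace tokenizer, and the example sentences are assumptions made for this sketch, not part of any particular library.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Map each distinct word in the corpus to a fixed vector position."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def one_hot(word, vocabulary):
    """Vector of zeros with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

# Hypothetical toy corpus, used only for illustration.
documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)  # cat, dog, mat, on, sat, the

print(one_hot("dog", vocabulary))            # [0, 1, 0, 0, 0, 0]
print(bag_of_words(documents[0], vocabulary))  # [1, 0, 1, 1, 1, 2]
```

Note that 'the' receives the highest count in the first document even though it carries little meaning, which is the weakness described above.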

     2. TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF reflects how important a word is to a document in a corpus (a collection of documents), and originates from the first proposal of term weighting by Hans Peter Luhn[3]. Essentially, it is the number of times the word appears in a document (the term frequency), scaled down according to the number of documents in the corpus in which the word appears; the standard formulation multiplies the term frequency by the logarithm of the total number of documents divided by the number of documents containing the word.

Each word in the document is then given a corresponding TF-IDF value, where the larger the value, the more important the word is in the document. These can be collected into a vector and fed into algorithms, similarly to a bag of words.

The benefit that TF-IDF has over the one-hot encoding/bag of words approach is that it down-weights very common words, which often carry little meaning, such as ‘and’, ‘the’ and ‘of’.
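
A minimal sketch of the calculation is given below, assuming one common textbook variant: term frequency is the word count divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the word. Libraries often apply additional smoothing or normalisation, so their values will differ. The function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: score} dictionary per document.

    Assumes tf = count / document length and idf = ln(N / df),
    with no smoothing (an illustrative choice, not the only one).
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
scores = tf_idf(corpus)

# Show the words of the first document, most important first.
for word, value in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{word}: {value:.3f}")
```

In this toy corpus 'the' appears in every document, so its score is zero, whereas 'mat' appears in only one document and scores higher than 'cat', which appears in two.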

There are more advanced and powerful word embedding algorithms, such as Word2Vec, but they rely on neural networks that must first be trained (or obtained pretrained). Neural networks are often described as black boxes, meaning their internal workings are difficult to interpret, which makes them hard to describe and puts them out of scope for this article.

Word embedding is just one step in pre-processing text; other steps include tokenization, stemming/lemmatization, and stopword removal.
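
To give a feel for these other steps, here is a small sketch that tokenizes a sentence, removes stopwords, and applies a very crude suffix-stripping stemmer. The stopword list and the stemming rules are deliberately tiny and are assumptions made for this example; real projects would normally use an established stemmer or lemmatizer instead.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to"}

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOPWORDS]

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(token) for token in remove_stopwords(tokenize(text))]

print(preprocess("The cats are chasing the birds and jumped on the mat."))
# ['cat', 'chas', 'bird', 'jump', 'mat']
```

The stem 'chas' shows that stemming does not always produce a real word; lemmatization aims to return a dictionary form instead, at the cost of needing more linguistic knowledge.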


Reading Between the Lines

Now that text documents can be understood by machine learning algorithms, let’s review the NLP techniques that can help us derive insight from them.

1. Sentiment Analysis

In its most basic form, sentiment analysis is a tool that classifies the polarity of a text (whether the tone is positive, negative, or neutral), and was first described by Volcani and Fogel[4]. Whilst sentiment analysis can go deeper, looking at specific emotions in a text, the basic form can still allow for some useful analysis.

Some examples of sentiment analysis’ uses are:

  • Confirming the neutrality of audit reports,
  • Determining emotion in a transcript of a telephone conversation,
  • Monitoring a company’s brand using social media.

Sentiment analysis typically uses supervised regression algorithms (ones that require labelled data to learn from) to assign each text a score between -1 and +1, where -1 indicates negative sentiment and +1 indicates positive sentiment.
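
The sketch below is a toy stand-in for such a supervised approach, not a production method. It learns a weight for each word as the average label of the training texts containing that word, then scores a new text by averaging the weights of the words it recognises. The training sentences, labels, and function names are all hypothetical, and a real system would use a proper regression or classification model trained on far more data.

```python
from collections import defaultdict

def train(labelled_texts):
    """Learn one weight per word: the mean label of the training texts containing it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, label in labelled_texts:
        for word in set(text.lower().split()):
            totals[word] += label
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}

def score(text, weights):
    """Average the weights of known words; return 0.0 (neutral) if none are known."""
    known = [weights[word] for word in text.lower().split() if word in weights]
    return sum(known) / len(known) if known else 0.0

# Hypothetical labelled data: +1.0 is positive, -1.0 is negative.
training_data = [
    ("great service and friendly staff", 1.0),
    ("friendly and helpful", 1.0),
    ("terrible service and rude staff", -1.0),
    ("rude and slow", -1.0),
]
weights = train(training_data)

print(score("friendly staff", weights))         # 0.5
print(score("rude and slow service", weights))  # -0.5
print(score("completely unseen", weights))      # 0.0
```

Because every label lies between -1 and +1, the averaged scores stay in the same range. Words that appear equally in positive and negative examples, such as 'service' here, end up with a weight of zero.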

2. Topic Modelling

The aim of topic modelling is to reveal semantic structure within a group of documents and group them by this structure. The most common and important words can then be pulled from these groups and used to define the overarching topic of each group.

Topic modelling was first described in 1998[5] and provides an explainable way to group texts within a corpus. This can be useful in situations such as:

  • Grouping customer feedback into common issues which can be given automated responses,
  • Analysing legal documents to find dominant themes and abnormal text,
  • Grouping news articles based on specific risks to quantify threat.

Topic modelling uses unsupervised algorithms (ones that do not require labelled data) such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorisation (NNMF).
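
As a hedged sketch of one of these algorithms, the code below implements a bare-bones Non-Negative Matrix Factorisation using the well-known multiplicative update rules, written in plain Python for readability rather than speed. It factorises a document-by-word count matrix into a document-by-topic matrix and a topic-by-word matrix. The toy documents, the choice of two topics, the iteration count, and all function names are assumptions for illustration; in practice an established library implementation would be used.

```python
import random
from collections import Counter

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise v (documents x words) into w (documents x topics) and h (topics x words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    # Start from positive random values; the updates keep everything non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update rule: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_words)]
             for i in range(n_topics)]
        # Update rule: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n_docs)]
    return w, h

def reconstruction_error(v, w, h):
    """Sum of squared differences between v and its approximation w x h."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

# Hypothetical toy corpus with two loosely defined themes.
documents = [
    "cat dog pet cat",
    "dog pet vet",
    "stock market bank",
    "bank stock stock trade",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})
matrix = [[Counter(doc.split())[word] for word in vocabulary] for doc in documents]

w, h = nmf(matrix, n_topics=2)

# Describe each topic by its three highest-weighted words.
for topic, row in enumerate(h):
    top = sorted(zip(row, vocabulary), reverse=True)[:3]
    print(f"topic {topic}:", [word for _, word in top])

# Assign each document to the topic it loads on most heavily.
for doc, row in zip(documents, w):
    print(f"{doc!r} -> topic {row.index(max(row))}")
```

With vocabularies as well separated as these, one would expect the pet-related and finance-related documents to load on different topics, although the exact outcome depends on the random starting values. The highest-weighted words in each topic are what make the grouping explainable.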

Outlined above are just a few NLP techniques that are commonly used to extract information from text. Others include: Text Summarization, Named Entity Recognition, and Relationship Extraction.


Takeaways

This article has outlined the general concept of NLP, the challenges that need to be overcome whilst using it, and a selection of common techniques with accompanying examples of where they could be, and have been, used in a business setting.

There is plenty of scope for these techniques to be used across our clients’ issues, and internally, and they are worth bearing in mind any time you encounter a large number of documents that need to be reviewed or analysed.

NLP is an area that will continue to grow as research advances, with an eventual aim of having a computer generate text that is indistinguishable from text written by a human. To an extent this is already happening, with OpenAI’s GPT-3 text generator[6], which can generate coherent documents, from emails to video scripts, and Google’s PaLM[7], which can answer text-based questions and even explain jokes! These are incredibly interesting projects which are being incorporated into numerous businesses, and both are worth reading about.


References

[1]: Minds, brains, and programs | Behavioral and Brain Sciences | Cambridge Core

[2]: Some experiments in the generation of word and document associations | Proceedings of the December 4-6, 1962, fall joint computer conference (acm.org)

[3]: A Statistical Approach to Mechanized Encoding and Searching of Literary Information | IBM Journals & Magazine | IEEE Xplore

[4]: United States Patent: 7136877 (uspto.gov)

[5]: Latent semantic indexing | Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems

[6]: GPT-3 (gpt3-openai.com)

[7]: Google AI Blog: Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance (googleblog.com)
