Why and how should you leverage your textual data with natural language processing?

Textual data is omnipresent in business. Stored in documents (Word, PowerPoint, PDF, etc.), in mailboxes or even in browser histories, it holds our knowledge and the history of our digital actions.

Exploiting enterprise data is one of the pillars of corporate performance. Textual data is commonly estimated to represent about 80% of corporate data and therefore concentrates most of a company’s information, which makes it essential to put it to good use. In recent years, Artificial Intelligence has transformed text analysis, notably through the development of high-performance natural language processing (NLP) methods. Automating processes, searching for information easily and flexibly across applications, automatically organizing documents: the applications of text analysis are numerous.

Why leverage your text data with NLP?

Exploiting your textual data will enable you to organize your data, explore it and transform your processes:

  • Organize your textual data: most companies have already set up platforms to store textual data. However, employees are often at a loss when faced with the diversity of data sources and the differences in their structure and organization. To bring order to your knowledge bases, there are algorithms capable of grouping and classifying similar documents. Bringing structure to your data helps your employees understand what kinds of documents exist and take ownership of them.
  • Explore your data more efficiently: chances are that a large proportion of your company’s knowledge is concentrated in text files stored in data warehouses. Enabling your entire workforce to rapidly explore your textual data ensures a continuous flow of knowledge, and thus the development of your teams’ skills. Implementing an intelligent search engine connected to all your textual data sources also reduces search times, and minimizes the time spent understanding all the different storage platforms.
  • Transform your processes: many of your processes already depend on textual data (ticketing systems, support mail, document classification, etc.). These processes are often repetitive and require human intervention. Automating them with NLP lets your teams concentrate on higher value-added tasks, and the time saved translates into profit for the company.

How can NLP be used to leverage your text data?

The first and most important step: surround yourself with the right people. Working with NLP experts will save you precious time, as text analysis research and development is very expensive. This point will be developed further in the rest of this article.

Once your ideation team has been formed, you’ll need to identify use cases. The best way to do this is to meet the teams you wish to support and ask them to describe the data-related obstacles they encounter on a daily basis. From this stage onwards, it’s important to involve NLP experts, who can quickly identify the data projects that address these bottlenecks. For example, a search engine connected to a ticketing database and a support mailbox is an appropriate response to employees who complain that IT incidents take too long to resolve.

When your list of use cases is long enough, take the time to place each one in a contribution/complexity matrix. To begin with, give priority to projects with a high contribution for employees and low complexity. This lets your teams become familiar with Data Science projects at their own pace, and builds legitimacy for subsequent projects. NLP experts alone are not enough to estimate the complexity of a project: consider inviting architects and IT security managers as well, where appropriate.
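As a rough illustration of this prioritization, the sketch below scores a few use cases on both axes and keeps the high-contribution, low-complexity quadrant. The use cases, the 1-to-5 scale, the thresholds and the function name are all assumptions made for the example, not a prescribed method.

```python
from collections import namedtuple

# contribution and complexity are scored from 1 (low) to 5 (high).
UseCase = namedtuple("UseCase", ["name", "contribution", "complexity"])


def quick_wins(use_cases, min_contribution=4, max_complexity=2):
    """Return the high-contribution, low-complexity quadrant of the matrix,
    highest contribution first, then lowest complexity."""
    selected = [
        u for u in use_cases
        if u.contribution >= min_contribution and u.complexity <= max_complexity
    ]
    return sorted(selected, key=lambda u: (-u.contribution, u.complexity))


# Hypothetical use cases and scores, for illustration only.
backlog = [
    UseCase("Search engine over tickets and support mailbox", 5, 2),
    UseCase("Automatic contract clause extraction", 4, 5),
    UseCase("Duplicate document detection", 2, 1),
    UseCase("Routing of incoming support emails", 4, 2),
]

first_projects = [u.name for u in quick_wins(backlog)]
```

In a real workshop the scores would come from the business teams for contribution, and from NLP experts, architects and security managers for complexity.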

Textual data: what are the limits of NLP methods?

Despite what you may read on the internet, no, you can’t do everything with text analysis. Some tasks are well solved, while others remain highly experimental. Before embarking on the analysis of your text data, make sure that someone you work with knows these limitations and can help you define realistic projects.

Two families of models are mainly used in text analysis projects encountered in companies:

  1. Statistical models based on word or character frequency.
  2. Semantic models that exploit the meaning of the text (e.g. BERT).

– Statistical models

Statistical models mainly exploit word frequency. Take, for example, a text classification task: a company wants to separate its documents into confidential and non-confidential. To identify confidential documents, statistical algorithms rely mainly on keywords such as “confidential”, “restricted” or “internal”. To make these keywords easier to detect, the algorithm can be given a set of confidential documents, from which it determines by itself the terms specific to confidential documents. Once these words have been identified, the algorithm looks for them in new files in order to label those files as confidential. Statistical algorithms are computationally efficient: they remain fast even on tens of millions of documents.

Statistical methods are not simply word matching methods. They also associate weights with words: the more important a word is for detecting confidential documents, the higher its weight.

One limitation of these methods is that a document containing none of the words in the detected list cannot be labeled confidential: the algorithm does not understand the meaning of the texts it processes. Furthermore, this type of algorithm only looks at which words are present in a text, and pays no attention to their order.
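The word-order limitation is easy to see on a small example. The sketch below builds a bag-of-words representation (word counts only) of two sentences with opposite meanings; the sentences and the helper name are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count the words of a text, discarding their order."""
    return Counter(text.lower().split())


a = bag_of_words("the client sued the supplier")
b = bag_of_words("the supplier sued the client")

# Both sentences contain exactly the same words the same number of times,
# so a purely frequency-based model cannot tell them apart.
same_representation = a == b
```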

– Semantic models (Transformers)

The other category of algorithms, semantic algorithms, specializes in understanding the meaning of sentences. These models are generally based on the Transformer neural network architecture developed by Vaswani et al. (Google) in 2017. This architecture is used in many state-of-the-art models. The best known of these is the BERT model developed by Devlin et al. (Google) in 2018. Numerous models based on the Transformer architecture have followed, such as RoBERTa (published by Facebook), ALBERT, FlauBERT and others. These very powerful models are used to understand the semantics of texts. Typically, in our example of confidential document detection, it is possible to match two confidential documents even if they have no words in common.
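To make this matching concrete: semantic models turn each text into a vector (an embedding), and two texts are compared through their vectors rather than through their words. The sketch below uses cosine similarity, a common choice for this comparison. The three vectors are hand-written placeholders, not real model outputs; in practice they would be produced by a Transformer encoder and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Placeholder embeddings invented for this example (NOT produced by a model).
embeddings = {
    "Do not share this salary grid": [0.9, 0.1, 0.2],
    "Keep the pay scale strictly private": [0.8, 0.2, 0.1],
    "Cafeteria menu for the week": [0.1, 0.9, 0.3],
}

texts = list(embeddings)
# The first two texts have no word in common, yet their vectors are close.
sim_related = cosine_similarity(embeddings[texts[0]], embeddings[texts[1]])
sim_unrelated = cosine_similarity(embeddings[texts[0]], embeddings[texts[2]])
```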

However, models based on the Transformer architecture do have their limitations: they are very large models, with hundreds of millions of parameters. As a result, they are much slower than statistical models. To give an order of magnitude: where a statistical algorithm can process around 10,000 texts per second, a Transformer neural network processes around ten.
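A quick calculation shows what this difference means at scale. The sketch below applies the two throughput figures quoted above to a corpus of 10 million documents; the corpus size is an assumption chosen for the example, and real throughput depends heavily on hardware and text length.

```python
def processing_time_seconds(n_documents, docs_per_second):
    """Time needed to process a corpus at a constant throughput."""
    return n_documents / docs_per_second


N_DOCUMENTS = 10_000_000   # assumed corpus size
STATISTICAL_RATE = 10_000  # texts per second (order of magnitude quoted above)
TRANSFORMER_RATE = 10      # texts per second (order of magnitude quoted above)

statistical_s = processing_time_seconds(N_DOCUMENTS, STATISTICAL_RATE)
transformer_s = processing_time_seconds(N_DOCUMENTS, TRANSFORMER_RATE)

statistical_minutes = statistical_s / 60   # about 17 minutes
transformer_days = transformer_s / 86_400  # about 11.6 days
```

Under these assumptions, the same corpus takes minutes with a statistical model and more than a week with a Transformer on a single machine.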

In practice, neither type of algorithm is always the better choice. Newcomers are often tempted to jump straight to Transformer models, which perform very well on classification tasks, for example, but the analysis should go one step further and weigh the trade-off between quality and speed. Hybrid methods are often the right choice, since they combine the speed of statistical methods with the relevance of the results obtained by semantic methods. Once again, calling in a specialist is the best way to ensure that the solution adopted is the most appropriate for your specific use case.
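One possible way to combine the two families, among others, is a cascade: the fast statistical score decides the clear-cut cases, and only the uncertain documents are sent to the slower semantic model. In the sketch below, the thresholds, the function names and both scoring functions are hypothetical stand-ins, not a prescribed architecture.

```python
def classify_hybrid(text, fast_score, slow_model, low=0.2, high=0.8):
    """Two-stage cascade: cheap score first, expensive model only if unsure.

    fast_score(text) -> float in [0, 1], a confidentiality score
    slow_model(text) -> bool, True if the text is judged confidential
    Returns (label, stage), where stage names the model that decided.
    """
    p = fast_score(text)
    if p >= high:
        return True, "statistical"
    if p <= low:
        return False, "statistical"
    return slow_model(text), "semantic"


# Hypothetical stand-ins so that the sketch runs end to end.
def keyword_score(text):
    words = text.lower().split()
    if "confidential" in words:
        return 0.95
    if "menu" in words:
        return 0.05
    return 0.5  # no strong evidence either way


def fake_semantic_model(text):
    # Placeholder for a fine-tuned Transformer classifier.
    return "salary" in text.lower()


results = [
    classify_hybrid(t, keyword_score, fake_semantic_model)
    for t in ["confidential merger plan", "cafeteria menu", "salary grid 2024"]
]
```

Only the third text reaches the semantic stage here; the wider the gap between the two thresholds, the more documents are sent to the slow model.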

The open source trap

Although attractive to businesses, relying on open source without the right expertise is often a bad idea. As artificial intelligence consultants, we are often asked: why pay a consulting firm to install open source algorithms in my IT environment? Many companies assume that using open source to create and adapt products to their needs is simple and inexpensive. In the case of text analytics, numerous models are indeed available online. So the question is: what can a consulting firm contribute to building a solution for a company?

Let’s take the example of classifying documents as confidential or non-confidential. The aim is to classify documents stored in a data lake in real time. To achieve this, we use a BERT-type model. These models are pre-trained on publicly available data: for example, one of the largest datasets is automatically extracted from Wikipedia and contains several tens of gigabytes of text. This corpus contains texts with general-purpose vocabulary. As a result, the pre-trained model is not adapted to the company’s context and vocabulary.

– Let’s test the BERT model to classify this textual data:

In order to adapt the BERT model to the corporate context, it is first necessary to create a database of confidential and non-confidential documents. The creation of this database is time-consuming and tedious for an untrained person:

  1. The first step is to extract the texts from the documents to be classified and clean them up – this is known as data preprocessing.
  2. The next step is to label the data, i.e. to indicate which documents are confidential and which are not. In Data Science, efficient labeling tools can be used to create training databases. Certain algorithms can also be used to artificially increase the size of databases.
  3. Once the training base has been created, the model needs to be trained on it. Since the model has already been pre-trained on the Wikipedia corpus, its re-training on the enterprise data is called “fine-tuning”. Fine-tuning a BERT model requires calibrating a number of parameters (learning rate, decay rate, number of epochs, etc.). A good understanding of how these parameters interact is necessary to avoid lengthy tests: fine-tuning a BERT-type algorithm on a database of a few thousand texts can take several hours. Consequently, testing different parameter configurations can take days or even weeks.
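To see why an exhaustive search quickly becomes expensive, the sketch below enumerates a small grid over the three parameters mentioned in step 3 and multiplies the number of configurations by the duration of one run. The candidate values, the parameter names and the three-hour run time are assumptions chosen for the example, not recommendations.

```python
from itertools import product

# Hypothetical search space: the values are placeholders, not recommendations.
search_space = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "weight_decay": [0.0, 0.01, 0.1],
    "num_epochs": [2, 3, 4],
}
HOURS_PER_RUN = 3  # assumed duration of one fine-tuning run

# One dictionary per combination of values (3 x 3 x 3 = 27 configurations).
configurations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

total_hours = len(configurations) * HOURS_PER_RUN
total_days = total_hours / 24
```

With only three values per parameter, the full grid already takes more than three days of computation on a single machine, which is why experience in narrowing the search matters.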

Even in our simple case using a single classification model, we can see that development time can be very long. In more complex projects (as is generally the case), other algorithms need to be added and assembled. Although open source, text analysis technologies require a great deal of experience to be mastered and deployed in data projects.

Conclusion

There’s no doubt that your working environment could be improved with text analytics. Automating processes saves your teams time and saves you money. What’s more, your staff will spend less time on repetitive tasks and more time on stimulating ones, which contributes to quality of working life in your company. Finally, it’s important to remember that Data Science expertise cannot be improvised! To save money on your AI transition, surround yourself with Data Science consultants who have experience in the field. You’ll avoid finding yourself at a dead end after several months of effort!

Article written by Clément Gueneau – NLP Division Supervisor


  • VASWANI, Ashish, SHAZEER, Noam, PARMAR, Niki, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, vol. 30.
  • DEVLIN, Jacob, CHANG, Ming-Wei, LEE, Kenton, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • YANG, Zhilin, DAI, Zihang, YANG, Yiming, et al. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 2019, vol. 32.
  • PENNINGTON, Jeffrey, SOCHER, Richard, and MANNING, Christopher D. GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. p. 1532-1543.
  • AIZAWA, Akiko. An information-theoretic perspective of tf-idf measures. Information Processing & Management, 2003, vol. 39, no. 1, p. 45-65.
  • ROBERTSON, Stephen and ZARAGOZA, Hugo. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.
  • ADHIKARI, Ashutosh, RAM, Achyudh, TANG, Raphael, et al. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398, 2019.