Text Mining: Machine Learning on text data

7 June 2023 | 6 min reading

1. What is it?

Text mining is an integral part of data science, and therefore of AI.

Text mining is a set of methods, linguistic analysis techniques, and tools used to manipulate and process textual data. This data is mainly unstructured: it is not organized into the fields of a database, so machines cannot interpret it directly. Textual data comes in many forms: written texts, Word documents, emails, PowerPoint presentations, etc.

This technology is also known as text analysis. However, some people draw a distinction between the two terms. Text analysis refers to the application of text mining techniques to sort data sets.

The development of Big Data platforms and Deep Learning now makes it possible to analyze massive sets of unstructured data. This has made text mining more practical for data scientists and other users.

A long-established technology

The use of computers to apply text analysis techniques is not new. Automated text summarization, for example, was the subject of an article published in 1958, even before the term Business Intelligence became widespread: “The Automatic Creation of Literature Abstracts” by Hans Peter Luhn. This article describes how an IBM 704 computer (released in 1954) could be used to create a summary of an article. The approach relied on statistical measures that are still in use today, such as word frequency.
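To make the word-frequency idea concrete, here is a minimal Python sketch of frequency-based extractive summarization. It is only loosely inspired by the approach described above and is not a reconstruction of Luhn's method: the stop-word list, the scoring rule (average frequency of a sentence's content words), and the names `tokenize` and `summarize` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "for", "on", "are"}

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Frequency of every content word across the whole text.
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    def score(sentence):
        # Average frequency of the sentence's content words.
        words = [w for w in tokenize(sentence) if w not in STOP_WORDS]
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

sample = (
    "Text mining turns raw text into data. "
    "Text mining relies on word frequency. "
    "The weather was pleasant yesterday."
)
print(summarize(sample, n_sentences=1))
```

In this toy text, the sentences that repeat the most frequent words ("text", "mining") score highest, while the off-topic sentence about the weather scores lowest.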

Together, text mining and text analytics help organizations find potentially valuable business information in corporate documents, customer emails, call center logs, text survey comments, social media posts, medical records, and other text data sources. It is also increasingly common to use text mining capabilities in AI chatbots and virtual agents. Companies are using these tools to provide automated responses to customers as part of their marketing, sales, and customer service operations.

2. How does it work?

Text mining is similar in nature to data mining (a term often used when talking about Big Data). The difference is that it focuses on text rather than more structured forms of data. One of the first steps in the text mining process, however, is to organize and structure the data so that it can be analyzed both qualitatively and quantitatively.

This generally involves the use of NLP (natural language processing) algorithms. These algorithms apply the principles of computational linguistics to analyze and interpret data sets.

Initial work includes categorizing, grouping and tagging text, summarizing datasets, creating taxonomies, and extracting information about things like word frequency and relationships between data entities. Analytical models are then run to generate results that can help drive business strategies and operational actions.
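As an illustration of this structuring step, the following Python sketch turns raw documents into a table of word counts, often called a document-term matrix. The function names and the sample documents are hypothetical, and the tokenizer is deliberately simplistic (lowercase alphabetic words only).

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

docs = [
    "The delivery was late",
    "Late delivery, late refund",
]
vocabulary, rows = document_term_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is now a fixed-length numeric vector, which is the kind of structured input that downstream analytical models can work with.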

In the past, NLP algorithms relied on hand-written rules or statistical models that specified what to look for in data sets. In the mid-2010s, however, Deep Learning models that need less manual supervision emerged. They offer an alternative approach to text analysis and other advanced analytics applications involving large datasets. Deep Learning uses neural networks that learn patterns iteratively from the data itself, which makes them more flexible than conventional machine learning built on hand-crafted features.

As a result, text mining tools are now better equipped to discover underlying similarities and associations in text data. For example, an unsupervised model could group documents or emails into topic clusters without any help from an analyst.
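The idea of grouping texts without labels can be sketched in a few lines of Python. Note that this is not a Deep Learning model: it swaps in a much simpler technique, a single-pass grouping based on the cosine similarity of bag-of-words vectors. The stop-word list, the `threshold` value, and the sample emails are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "my", "is", "was", "for", "to", "and", "of", "on"}

def vectorize(text):
    """Bag-of-words vector as a Counter, without stop words."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose seed is similar enough, else start a new one."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of document indices; the first index is its seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

emails = [
    "Invoice payment overdue",
    "Password reset request",
    "Payment received for invoice",
    "Reset my account password",
]
print(cluster(emails))
```

No topic labels are supplied anywhere: the billing emails and the account emails end up together only because they share vocabulary.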

3. Use cases

Sentiment analysis, also called opinion mining, is one of the most widely used text mining applications. It tracks what customers think about a company by extracting text from online reviews, social networks, emails, and other data sources, then identifying patterns that indicate whether customer sentiment is positive or negative. This information can be used to resolve product issues, improve customer service, or plan new marketing campaigns.
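A very simple way to approximate this is to count words from a positive and a negative word list. The Python sketch below assumes two tiny hand-made lexicons (`POSITIVE` and `NEGATIVE`) and a hypothetical `sentiment` function; production systems typically use much larger lexicons or trained models.

```python
import re

# Hand-made lexicons for illustration only (assumptions, not a standard resource).
POSITIVE = {"great", "good", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "late", "rude"}

def sentiment(text):
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great product and fast delivery",
    "The support was rude and slow",
    "It arrived on Tuesday",
]
for review in reviews:
    print(sentiment(review), "|", review)
```

A word-counting approach like this ignores negation ("not good" still counts as positive) and sarcasm, which is one reason more sophisticated models are used in practice.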

Other common uses of text mining include screening job applicants based on the wording of their resumes, blocking spam, classifying website content, flagging potentially fraudulent insurance claims, analyzing descriptions of medical symptoms to aid diagnosis, and reviewing corporate documents as part of e-discovery processes. Text mining software also offers similar information retrieval capabilities to search engines and enterprise search platforms. However, this is usually only one element of higher-level text mining applications, not a use in itself.
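Spam blocking is a classic text classification task. As a rough illustration, the Python sketch below trains a multinomial Naive Bayes classifier with add-one (Laplace) smoothing on four made-up messages. The class name `NaiveBayes`, the training data, and the choice of this particular algorithm are assumptions for the example; real spam filters are trained on far larger corpora and often use other models.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.word_counts = {label: Counter() for label in self.labels}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.labels:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior of the class plus log likelihood of each known word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = [
    "Win a free prize now",
    "Free money claim your prize",
    "Meeting agenda for Monday",
    "Please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("Claim your free prize"))
print(model.predict("Project meeting on Monday"))
```

The classifier labels a new message by comparing how likely its words are under each class, so messages full of words seen mostly in spam are flagged as spam.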

Chatbots answer questions about products and handle basic customer service tasks. They do this using Natural Language Understanding (NLU), a sub-category of NLP that helps software interpret human language so that it can respond appropriately.

Natural Language Generation (NLG), another related technology, takes documents, images, and other data as input and produces plain text. For example, these algorithms are used to write descriptions of neighborhoods for property advertisements. They can also generate explanations of the key performance indicators tracked by business intelligence systems.

4. Benefits

Sentiment analysis can help companies detect product and business problems early, so that they can be resolved before they grow and affect sales. Mining customer reviews and communications can also reveal which new features customers want, which helps improve product offerings. This improves the overall customer experience, which can in turn lead to increased revenues and profits.

Text mining can also help predict customer churn. Companies can then take steps to retain customers who are at risk of leaving for a competitor. Fraud detection, risk management, online advertising, and web content management are other functions that can benefit from the use of text mining tools.

In the health sector, text mining can help diagnose patients’ conditions on the basis of reported symptoms.

5. A major challenge

Text mining can be complicated because the data is often vague, inconsistent, and contradictory. Analyses can be disrupted by ambiguities that result from differences in syntax and semantics, as well as the use of slang, sarcasm, regional dialects, and technical language specific to individual vertical industries. As a result, text mining algorithms are constantly evolving to deal with these ambiguities and inconsistencies during analysis.

Thanks to its expertise in Data Science methods, Headmind Partners is able to support its customers in their textual data and Artificial Intelligence projects at every level of business and technical expertise.
