Real-World Machine Learning Case Study: Clustering Transactions Based on Text Descriptions


We are living in the era of digital technologies. When was the last time you walked into a store that didn't have PayTM or BHIM UPI? These digital payment technologies have quickly become a crucial part of our daily lives.

And not just at an individual level: these digital technologies are at the core of every financial institution. Carrying out a payment transaction or fund transfer has become extremely smooth, with multiple options available (internet banking, ATM, credit or debit cards, UPI, POS machines, etc.) and reliable systems running at the backend.

For every transaction we make, a corresponding description message is generated against it.

In this article, we'll talk about a real-world use case of a financial institution using clustering (a popular machine learning technique) to customize its product offerings for its customer base.

Social media platforms like Twitter, WhatsApp, Facebook, etc. have become primary sources of information for profiling a customer's interests and preferences. A financial institution typically incurs huge costs for procuring data from third-party sources. Even then, it is very difficult to map a social media account to a unique customer.


Motivation Behind this Case Study.

For a financial institution, it is always important to engage the existing customer base with personalized offers based on their varying interests. Capturing the ideal 360-degree view of a customer is a significant challenge for any financial institution.

So how do we solve this?

We can cluster the transactions performed by a customer into discrete categories based on the transaction description message. This approach can be used to flag whether a transaction was made for Food, Sports, Clothes, Bill Payment, Household, Others, and so on. If most of a customer's transactions fall in a particular category, we get a better estimate of his/her preferences.

Here's the Approach We Took.


A partial solution to the above problem is to use the internal transaction data available with the institution.

Let's understand how we approached this problem statement and the key steps we took to arrive at a solution.

Topic modelling is a method for unsupervised classification of documents that finds natural groups of items even when we're not sure what we are looking for. It primarily uses Latent Dirichlet Allocation (LDA) for fitting a topic model.

We start the process with all transactions, with their description messages mapped to each customer. First, we have the important task of finalizing the number of clusters (or categories, or topics). To achieve this, we use topic modelling.


LDA treats each document (i.e. transaction) as a mixture of topics, and each topic as a mixture of words. Here's an example: the word "budget" may occur both in a topic about movies and in a topic about politics. The underlying assumption of LDA is that every observation in the sample comes from an arbitrary unknown distribution that can be explained by a generative statistical model.

Determining the Number of Topics.

Let us apply this approach to our problem. We assume there exists a generative statistical model that produced all the words in the transaction descriptions, drawn from an unknown arbitrary distribution (i.e. unknown groups or topics). We try to estimate a statistical model that predicts the probability of a word belonging to a particular topic.
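As a compact way of writing the assumption above, the probability of a word in a transaction description under LDA can be expressed as a mixture over topics. The symbols here follow the usual textbook notation and are not taken from the original project: K is the number of topics, \theta_{d,k} is the share of topic k in document d, and \phi_{k,w} is the probability of word w under topic k.

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \theta_d \sim \mathrm{Dirichlet}(\alpha),
\quad \phi_k \sim \mathrm{Dirichlet}(\beta)
```

Fitting the model means estimating \theta and \phi from the observed words, which gives both the topic mixture of each transaction and the word distribution of each topic.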

Topic Coherence.

Topic coherence is applied to the top N words from the topic. It is defined as the average/median of the pairwise word-similarity scores of the words in the topic. A good model will generate coherent topics, i.e., topics with high topic coherence scores.

We initially fixed the total number of topics by manually looking at the top keywords across topics. This can be somewhat inconsistent, so we need an objective way to evaluate the appropriate number of topics. We use the topic coherence measure to identify the right number of topics.

Good topics are topics that can be described by a short label, and this is what the topic coherence measure aims to capture.
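To make the idea of "average pairwise word-similarity" concrete, here is a minimal sketch of one common coherence variant, a UMass-style score based on document co-occurrence counts. The article does not say which coherence measure was used, so the choice of measure, the function name `umass_coherence` and the toy documents are all assumptions for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """Average pairwise UMass-style score for one topic's top-N words.

    top_words: the topic's top words, ordered most probable first.
    documents: tokenised documents (lists of words).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # the higher-ranked word never occurs; skip this pair
        # +1 smoothing keeps the log finite when the pair never co-occurs.
        scores.append(math.log((doc_freq(earlier, later) + 1) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Made-up tokenised transaction descriptions.
docs = [
    ["dominos", "pizza", "order"],
    ["pizza", "subway", "order"],
    ["electricity", "bill", "payment"],
    ["phone", "bill", "payment"],
]
coherent_score = umass_coherence(["pizza", "order"], docs)
mixed_score = umass_coherence(["pizza", "bill"], docs)
```

Words that tend to appear in the same descriptions score higher than words that never co-occur. Repeating this for models fitted with different numbers of topics, and comparing the average score across topics, is one way to pick the number of topics.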


Time for Clustering!

Word count, digit count, special symbol count.
Longest digit sequence length, digit-to-character ratio.
Average and maximum word lengths, etc.
Week, day and month of transaction, is date present, is weekend transaction, etc.
Transaction made in the last 5 days or first 5 days of the month.
Public holiday and festival transactions, etc.
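The features listed above could be computed along the following lines. This is only a sketch: the function name `basic_features`, the tokenisation rule, the date pattern, the definition of the digit-to-character ratio (digits over non-space characters) and the sample description are assumptions, not the project's actual implementation. Holiday and festival flags are left out because they need an external calendar.

```python
import calendar
import re
from datetime import date

# Assumed pattern for "is date present": strings like 05/01/2020 or 5-1-20.
DATE_PATTERN = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")


def basic_features(description, txn_date):
    """Return simple text and calendar features for one transaction."""
    tokens = re.findall(r"[A-Za-z0-9]+", description)
    digit_runs = re.findall(r"\d+", description)
    non_space = [c for c in description if not c.isspace()]
    digit_count = sum(c.isdigit() for c in non_space)
    word_lengths = [len(t) for t in tokens] or [0]
    days_in_month = calendar.monthrange(txn_date.year, txn_date.month)[1]
    return {
        "word_count": len(tokens),
        "digit_count": digit_count,
        "special_count": sum(not c.isalnum() for c in non_space),
        "longest_digit_run": max((len(r) for r in digit_runs), default=0),
        "digit_char_ratio": digit_count / len(non_space) if non_space else 0.0,
        "avg_word_len": sum(word_lengths) / len(word_lengths),
        "max_word_len": max(word_lengths),
        "date_present": bool(DATE_PATTERN.search(description)),
        "week_of_year": txn_date.isocalendar()[1],
        "day_of_week": txn_date.weekday(),  # Monday = 0, Sunday = 6
        "month": txn_date.month,
        "is_weekend": txn_date.weekday() >= 5,
        "in_first_5_days": txn_date.day <= 5,
        "in_last_5_days": txn_date.day > days_in_month - 5,
    }


# Made-up description and date, for illustration only.
feats = basic_features("UPI/123456789012/DOMINOS PIZZA/PAYTM", date(2020, 2, 29))
```

Each transaction yields one dictionary, which can later be turned into a numeric row of the feature matrix used for clustering.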


We have now fixed the total number of topics/clusters (7 topics in our case). Next, we must start assigning each of the transaction description messages to a topic. Topic modelling alone may not yield accurate results when assigning a document to a topic.


Here, we use the output of topic modelling along with a few more features to cluster transaction description messages using K-Means clustering. We'll focus on building a feature set for K-Means clustering.

Lookup Features: Top brand names in the industry and common nouns are used as lookup names. We count the number of words in the transaction description related to a particular industry.

Basic Features.

Food: Vegetables, Dominos, FreshDirect, Subway, etc.
Bills & EMI: Policy, power, statement, schedule, withdrawal, phone, etc.
E-Commerce: Amazon, Walmart, eBay, Ticketmaster, etc.
Others: Uber, Airbus, packagers, etc.
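A lookup feature of this kind can be sketched as a per-category count of matching words. The dictionary below contains only some of the example names given above (the full lists are not shown in the article), and the category keys, the function name `lookup_features` and the sample description are hypothetical.

```python
import re

# Partial lookup lists built from the examples in the text; the real lists
# are longer and are not given in the article.
LOOKUPS = {
    "food": {"vegetables", "dominos", "freshdirect", "subway"},
    "bills_emi": {"policy", "power", "schedule", "withdrawal", "phone"},
    "ecommerce": {"amazon", "walmart", "ebay", "ticketmaster"},
    "others": {"uber", "airbus", "packagers"},
}


def lookup_features(description):
    """Count how many words of the description appear in each lookup list."""
    tokens = re.findall(r"[a-z]+", description.lower())
    return {
        category: sum(token in names for token in tokens)
        for category, names in LOOKUPS.items()
    }


# Made-up description, for illustration only.
counts = lookup_features("POS PURCHASE SUBWAY 0421 AMAZON PAY")
```

The resulting counts become one numeric feature per category for each transaction description.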

Ravindra Reddy Tamma, Data Scientist (Actify Data Labs).


Topic Modelling Features.

Results show that observations close to the cluster centers are mostly labelled with the right topic. A few observations far from the cluster centers have been assigned the wrong topic label. Out of 350 manually reviewed transaction descriptions, around 240 (~69% accuracy) are correctly labelled with the appropriate topic.

Around 30 features are created for every transaction description, and we perform K-Means clustering to assign each transaction description to one of the 7 clusters.
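For readers who want to see what the clustering step does, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm). It is an illustration under stated assumptions, not the project's code: the function names, the random initialisation from data points and the toy 2-dimensional data are made up, whereas the article uses about 30 features and 7 clusters. In practice, features on different scales would usually be standardised first.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, centers = kmeans(points, k=2)
```

The distance from each point to its assigned center is also a useful by-product, since points far from every center are the ones most likely to be mislabelled.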

About the Author.



Ravi has also developed a national-level application fraud model for unsecured lending in India using unstructured credit bureau header information. In addition to credit risk, Ravi has deep expertise in OCR, image analytics and text mining. Ravi also brings deep know-how in automating production data pipelines and deploying machine learning models as scalable APIs.

Now we have at least a basic estimate of our in-house customers' interests and preferences. We can send personalized offers and options to keep them engaged and improve business.

Ravi is a machine learning expert at Actify Data Labs. His expertise spans credit risk analytics, application fraud modelling, OCR, text mining and deployment of models as APIs. He has worked extensively with lenders on developing application, behaviour, and collection scorecards.

Final Thoughts.

Perform topic modelling on document-term matrices (DTMs) generated using the TF-IDF measure for unigrams and bigrams. For each transaction description, we get 2 sets of 7 probabilities (one probability per topic): one set from the unigram DTM and one from the bigram DTM.
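The sketch below shows how unigram and bigram TF-IDF document-term matrices could be built before being passed to a topic model. It uses one common weighting, term count multiplied by (ln(N / document frequency) + 1), which is an assumption since the article does not state the exact formula. The function names and the toy descriptions are also hypothetical, and the LDA fitting step itself is not shown.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_dtm(token_docs, n):
    """Build a TF-IDF document-term matrix over n-grams.

    Returns (vocab, matrix), where matrix[d][t] is the weight of vocab[t]
    in document d.
    """
    doc_terms = [Counter(ngrams(doc, n)) for doc in token_docs]
    vocab = sorted({term for counts in doc_terms for term in counts})
    n_docs = len(doc_terms)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for counts in doc_terms for term in counts)
    idf = {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in vocab}
    matrix = [[counts[term] * idf[term] for term in vocab] for counts in doc_terms]
    return vocab, matrix


# Made-up tokenised transaction descriptions.
token_docs = [
    ["upi", "dominos", "pizza"],
    ["upi", "amazon", "pay"],
    ["upi", "dominos", "order"],
]
uni_vocab, uni_dtm = tfidf_dtm(token_docs, 1)
bi_vocab, bi_dtm = tfidf_dtm(token_docs, 2)
```

A term that occurs in every description, such as "upi" here, gets the lowest weight, while rarer terms that help distinguish categories are weighted up.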


While the technique of using a topic model is relatively novel, the method of using transactions to classify customer interests has long been in use, mostly by credit card companies. Such interest graphs not only categorize transactions into major groups like food, travel, etc. but also create micro-segments like Thai food lovers, wildlife enthusiasts, etc.


