Prediction of Late Payments Using Decision Tree-Based Algorithms

Arthur Flor
5 min read · Nov 25, 2021

Divide and conquer

Here I am, another post on decision trees. Unlike my first experience, this project was exciting and full of ideas throughout, and, of course, more motivating. I'm currently at the beginning of my PhD, and this project was carried out in a class earlier this year (2021), but it deserves a post. If you are interested, here are the links to the project and paper:

What are Decision Trees?

Just for an overview: a decision tree is a data structure in which a combination of conditions (internal nodes) leads to a decision (leaf node). Machine learning is then used to build the best decision tree from the data you have.
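To make that concrete, here is a minimal sketch with scikit-learn. The data is entirely invented (two made-up features, `amount` and `days_until_due`) just to show how conditions lead to a leaf decision:

```python
# A decision tree learns nested conditions (nodes) that end in a decision (leaf).
# The toy data below is invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [invoice_amount, days_until_due]
X = [[100, 30], [2500, 5], [300, 25], [4000, 3], [150, 40], [3200, 7]]
y = [0, 1, 0, 1, 0, 1]  # 1 = paid late, 0 = paid on time

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned conditions as readable if/else rules
print(export_text(tree, feature_names=["amount", "days_until_due"]))
print(tree.predict([[2800, 4]]))  # classifies a new invoice
```

`export_text` is a handy way to see that the "model" really is just a small set of if/else conditions ending in a class.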

What is Invoice-to-Cash?

The project context is late payments in the business area. One of the processes I had the opportunity to read about was Invoice-to-Cash. This process covers everything a company does to collect payment for an invoice. It is therefore essential to the financial stability of any company, since collecting accounts receivable is its main activity. So imagine we have the following questions:

How long will the customer take to pay me? Will the customer pay me after the due date or sooner? Will the customer pay me?

With that in mind, it’s common for managers to want some answer and then take action involving the company’s capital.

Dataset

Unfortunately, the dataset is private. However, I can mention that it brings information about customer purchases, such as dates, values, customers, and related companies. In total, there were 2.7 million records initially, considering January 2018 to February 2021.

The step I consider most important is the processing and transformation of data (Fig. 1). Here you can adapt/improve according to your scenario.

First, we have the dataset and data extraction. In this step, I made some filters related to the business rule and removed features that didn’t add to the project context.

Second, outlier detection. It is necessary to carefully analyze possible outliers and, at the same time, validate with the business team.

Third, grouping the data. This step is dataset-specific (I think), as a customer can make multiple purchases on the same day. So we want to group customer invoices by day.

Fourth, historical data. At this point, I generate features related to customer history for each purchase day. In other words, for every purchase, new features describing the customer's past were added. This step is essential because it adds relevant information, such as the customer's financial situation, average delay, purchase frequency, etc.

Fifth, time-related information. Finally, the date features are converted to numerical values to find gaps between the purchase and due date, days left until the end of the month, etc.

Fig. 1: Pre-processing steps
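Since the dataset is private, here is a sketch of the grouping, history, and date steps on a tiny invented frame (all column names are my own, not the project's):

```python
# Sketch of pre-processing steps 3-5 on invented data.
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "A", "A", "B"],
    "purchase_date": pd.to_datetime(["2020-01-05", "2020-01-05", "2020-03-01", "2020-02-10"]),
    "due_date": pd.to_datetime(["2020-02-05", "2020-02-05", "2020-04-01", "2020-03-10"]),
    "payment_date": pd.to_datetime(["2020-02-10", "2020-02-10", "2020-04-01", "2020-03-20"]),
    "value": [100.0, 50.0, 200.0, 80.0],
})

# Step 3: group invoices issued to the same customer on the same day.
grouped = (df.groupby(["customer", "purchase_date", "due_date", "payment_date"],
                      as_index=False)
             .agg(value=("value", "sum"), n_invoices=("value", "size")))

# Step 5: convert date columns into numeric gaps (days).
grouped["delay_days"] = (grouped["payment_date"] - grouped["due_date"]).dt.days
grouped["term_days"] = (grouped["due_date"] - grouped["purchase_date"]).dt.days

# Step 4: history features - for each row, statistics over that customer's
# *previous* rows only (shift avoids leaking the current invoice's outcome).
grouped = grouped.sort_values(["customer", "purchase_date"])
grouped["avg_past_delay"] = (grouped.groupby("customer")["delay_days"]
                                    .transform(lambda s: s.shift().expanding().mean()))
print(grouped[["customer", "value", "delay_days", "avg_past_delay"]])
```

The `shift()` before `expanding().mean()` is the important detail: without it, the history feature would include the very delay we are trying to predict.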

Experiment

As a proposed solution for the project, I split the problem into 3 steps (Fig. 2). The first is to identify the invoices that will be paid late. The second is to identify, among those, the invoices that will be paid after the due month. The third is to estimate how many days late the payment will be after the due month. In this way, we bring more information into the results, improving decision-making.

Fig. 2: Steps for the experiment
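The cascade of steps can be sketched as a small routing function. The function name, the dict output, and the stub model interface are my own illustration, not the project's actual code; each model only sees the records the previous step flagged:

```python
# Hypothetical three-step cascade: late? -> after due month? -> how many days?
def predict_cascade(X, late_clf, after_month_clf, days_reg):
    """Route each record through the three models.

    late_clf / after_month_clf: classifiers with .predict() returning 0/1.
    days_reg: regressor with .predict() returning days late.
    """
    results = []
    for x in X:
        if not late_clf.predict([x])[0]:
            # Step 1 says on time: stop here.
            results.append({"late": False})
        elif not after_month_clf.predict([x])[0]:
            # Step 2 says late, but within the due month.
            results.append({"late": True, "after_month": False})
        else:
            # Step 3 estimates how many days after the due month.
            results.append({"late": True, "after_month": True,
                            "days": days_reg.predict([x])[0]})
    return results
```

Each downstream model is trained only on the subset its predecessor selects, which keeps every step focused on one question.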

We can say that for each defined step, the complete experiment was carried out: selection of features, search for hyperparameters (grid search), training with cross-validation, testing, and evaluation (Fig. 3).

Fig. 3: Overview of the experiment performed
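A minimal version of one step's experiment loop looks like this in scikit-learn. The parameter grid, synthetic data, and model choice are illustrative, not the project's actual configuration:

```python
# One step's experiment: grid search with cross-validation, then a held-out test.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the private invoice dataset.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=5,            # training with cross-validation
    scoring="f1",    # metric used to pick hyperparameters
)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(classification_report(y_te, grid.best_estimator_.predict(X_te)))
```

The same loop is repeated per step and per model; only the estimator and the target change.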

Finally, 8 decision tree-based models were used. The important thing to note is that classification models were used for steps 1 and 2, and regression models for step 3. The 8 models were: (i) Bagging; (ii) Balanced Bagging; (iii) Random Forest; (iv) Balanced Random Forest; (v) AdaBoost; (vi) Gradient Boosting; (vii) RUSBoost; and (viii) XGBoost.

Results

In step 1, a total of 13,123 records were tested (Fig. 4). Just to remind you, step 1 refers to predicting the invoices that will be paid late. In a way, the models got pretty close results, but XGBoost did better throughout the experiment.

Fig. 4: Step 1 results (P = Precision, R = Recall, F = F1-Score)

In step 2, a total of 5,693 records were tested (Fig. 5). About 85% of the invoices paid after the due month were identified. Here, Balanced Random Forest had the best result. It's worth mentioning that AdaBoost achieved higher precision but low recall.

Fig. 5: Step 2 results (P = Precision, R = Recall, F = F1-Score)

Finally, in step 3, a total of 1,493 records were tested (Fig. 6). For this step, the models had to predict the number of days late after the due month.

Well, this step deserves a comment: the metric. Initially, I didn't know how to present regression results to people outside the computing context. I know MAE, MSE, RMSE… but they didn't make sense to other people. So I organized the results as accuracy under day-variation tolerances. That is, for variation 0, the model has to predict the exact number of days; for variation 1, a prediction within ±1 day counts as correct… and so on. I know… I learned about the MAPE and MAAPE metrics a while after this project.
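That tolerance-based accuracy is simple to compute. A minimal sketch (the function name and example numbers are my own):

```python
# Accuracy within +/- k days: a prediction counts as correct if its rounded
# value is at most k days away from the true delay.
import numpy as np

def accuracy_within(y_true, y_pred, k):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.abs(np.round(y_pred) - y_true) <= k))

y_true = [10, 35, 41, 33]   # true days late
y_pred = [12, 35, 40, 30]   # model predictions

print(accuracy_within(y_true, y_pred, 0))  # exact match only -> 0.25
print(accuracy_within(y_true, y_pred, 1))  # within +/-1 day  -> 0.5
print(accuracy_within(y_true, y_pred, 2))  # within +/-2 days -> 0.75
```

Sweeping `k` from 0 upward gives exactly the kind of curve a business audience can read: "the model is right to within k days x% of the time."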

In general, AdaBoost stood out a lot in this step (between 60–84% accuracy), despite its weaker results in the previous steps.

Fig. 6: Step 3 results (A = Accuracy, ± Day variation)

Conclusion

In the end, I saw that dividing the problem into 'subproblems' helped achieve the proposed objective. The results carry more information, which only helps those who will use the models daily.

I would like to mention that this project is not a secret formula, only one possible way to solve the problem addressed. Also, there are several points I would improve today, step 3 in particular. For example, we could have approached it as a classification step, dividing the days late into ranges.

It’s funny to see old projects from another perspective, with more knowledge and self-criticism. For anyone who has read my post about decision trees and signatures, well, I changed my mind about decision trees. Anyway, I hope this journey helps someone and also me in the future. haha
