
Navigating the Challenges of Data Drift and Data Bias in AI Adoption

October 15, 2024


Nowadays, adopting AI technologies is no longer optional for organizations that want to innovate and remain competitive, because these technologies can shed light on risks and opportunities you may not otherwise see. Their effectiveness, however, depends primarily on the quality of your data: you will not gain the full benefits of AI without feeding your models high-quality data. In this article, we illustrate two main issues that significantly impact your data, and how to deal with them to obtain accurate results: data drift and data bias.

AI technologies depend primarily on data to learn, and any change in that data (data drift) will affect their performance. There are three main strategies for handling data drift: static handling, instance weighting, and dynamic handling.

Static handling is performed by manually retraining the AI model: after some time, the model is retrained on an updated dataset so that new instances are included in the training process. The second strategy, instance weighting, trains on instances according to their weights, as ensemble learning algorithms do. The third strategy, dynamic handling, first detects the change and then retrains the model based on the change in the dataset. The change in new instances can be detected through the error rate on test sets or through a drop in the model's performance.
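To make the dynamic strategy concrete, here is a minimal sketch of error-rate-based drift detection. The window size, the error threshold, and the `monitor_and_retrain` helper are illustrative assumptions, not prescribed values; in practice these parameters must be tuned per application.

```python
import numpy as np
from collections import deque

# Illustrative values -- window size and threshold must be tuned per application.
WINDOW_SIZE = 500       # number of recent predictions to monitor
ERROR_THRESHOLD = 0.15  # retrain when the windowed error rate exceeds this

def monitor_and_retrain(model, stream):
    """Score a stream of (x, y) pairs; retrain when the recent error rate drifts.

    `model` is any fitted classifier with fit/predict (e.g. a scikit-learn model).
    """
    errors = deque(maxlen=WINDOW_SIZE)
    X_buffer, y_buffer = [], []
    for x, y in stream:
        y_pred = model.predict(np.asarray(x).reshape(1, -1))[0]
        errors.append(int(y_pred != y))
        X_buffer.append(x)
        y_buffer.append(y)
        # Detection step: the windowed error rate serves as the drift signal.
        if len(errors) == WINDOW_SIZE and np.mean(errors) > ERROR_THRESHOLD:
            model.fit(np.array(X_buffer), np.array(y_buffer))  # retrain on buffered data
            errors.clear()
    return model
```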

Several considerations should be kept in mind when applying one of the previous strategies:

  • The static handling method has the risk of including two or more different drift types in the same training process, which can affect the model's performance.
  • The instance weighting method faces the difficulty of finding the optimal weights for the instances, which adds to the tuning effort when training the model.
  • The dynamic handling method can increase the computational cost by adding a detection method, and it may not be able to handle recurring or seasonal changes.

Data bias arises when the classes of instances are not balanced; as a result, the model's predictions become biased toward the majority class. Credit card fraud detection is a typical example, where fraudulent transactions form a small minority of all records. Using classical machine learning or deep learning algorithms to classify such instances can cause a bias toward the majority class, which leads to overfitting on the majority class or a failure to generalize to new instances, because the model treats the minority class as noise and discards it during training. As a result, these AI models misclassify instances at test time. There are three methods to deal with data bias due to imbalanced datasets: data-level, metrics-level, and class-level.

Data Level:

The data-level method handles the imbalanced dataset issue by resampling the instances of the training dataset, balancing the number of instances in each class so that the AI model does not treat minority instances as noise. There are two main resampling techniques: under-sampling the majority class and oversampling the minority class. The goal is to create a balanced training set, which reduces the overfitting issue.

The main techniques of this method are random oversampling or under-sampling, cluster-based resampling, and the Synthetic Minority Over-sampling Technique (SMOTE). Random oversampling or under-sampling is performed by randomly duplicating instances from the minority class or randomly eliminating instances from the majority class. The cluster-based technique applies a clustering algorithm to the minority- and majority-class instances to identify the clusters in each class and then resamples the identified clusters; the number of instances in each class may differ depending on the clusters found. SMOTE, also called informed oversampling, adds synthetic instances to the training set that are interpolated from a subset of the minority class. Its goal is to reduce the overfitting that plain oversampling can cause.
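As a concrete illustration, these resampling techniques are implemented in the widely used imbalanced-learn library. The synthetic dataset and its roughly 10% minority fraction below are assumptions chosen for demonstration only.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced dataset: ~10% minority class (illustrative choice).
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
print("original:      ", Counter(y))

# Random under-sampling: drop majority instances until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("under-sampled: ", Counter(y_under))

# Random oversampling: duplicate minority instances until the classes balance.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("oversampled:   ", Counter(y_over))

# SMOTE: synthesize new minority instances by interpolating between neighbors.
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:         ", Counter(y_smote))
```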

However, some caveats apply when using data-level methods. They can suffer from overfitting if the oversampling technique is used, or underfitting if the under-sampling technique is used. SMOTE does not consider instances from other classes when generating synthetic samples, which may cause classes to overlap and, as a result, increase noise in the dataset.

Metrics Level:

Many metrics can be used to measure the performance of an AI model. However, some of them, such as accuracy and the overall error rate, can be misleading when the dataset is imbalanced, because they treat all classes equally and therefore mask poor performance on the minority class.
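A quick sketch of why accuracy misleads: on a dataset with a 95/5 class split (an assumed ratio for illustration), a baseline that always predicts the majority class scores about 95% accuracy while never detecting a single minority instance.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 95/5 class split -- an illustrative ratio.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# A "model" that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("accuracy:       ", accuracy_score(y, y_pred))  # ~0.95, looks great
print("minority recall:", recall_score(y, y_pred))    # 0.0, useless in practice
```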

The main metrics for evaluating models on imbalanced datasets are precision, recall, and the area under the Receiver Operating Characteristic curve (ROC AUC). The ROC AUC captures the trade-off between the true positive rate (sensitivity) and the false positive rate (which equals 1 − specificity) at different classification thresholds. The goal of this metric is to show the model's ability to distinguish between classes.

Recall is the true positive rate, or sensitivity, of the model, while precision is the positive predictive value: the fraction of predicted positives that are truly positive. Precision and recall are the more appropriate metrics when one class has a higher priority. The ROC AUC is preferable when the classes have similar priorities, while precision-recall metrics are preferable when we are interested mainly in one of the classes.
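Here is a minimal sketch computing these metrics with scikit-learn. The logistic regression model, the train/test split, and the class proportions are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset and a simple model for demonstration.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_score = model.predict_proba(X_te)[:, 1]  # class-1 probabilities for ROC AUC

print("precision:", precision_score(y_te, y_pred))  # correctness of positive calls
print("recall:   ", recall_score(y_te, y_pred))     # coverage of actual positives
print("ROC AUC:  ", roc_auc_score(y_te, y_score))   # separability across thresholds
```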

Class Level:

The class-level approach penalizes the AI model's learning algorithm during training based on misclassification cost: the algorithm incurs a higher penalty when it misclassifies instances from the minority class. Many studies have shown that learning from an imbalanced dataset can be accomplished by employing cost-sensitive methods.

These cost-sensitive methods can be applied in three main ways:

  • Applying cost-sensitive functions to find the best class distribution for the training dataset.
  • Applying a cost-minimizing technique in the learning step of ensemble algorithms, where many classifiers are trained and combined to find the optimal ones.
  • Including cost-sensitive features or functions with the learning algorithm, such as cost-sensitive decision trees and cost-sensitive neural networks.

Using a cost-sensitive method requires finding the optimal cost-sensitive function and a way to incorporate it into the learning algorithm. This approach is typically used when sampling techniques cannot be applied to the dataset, as it can be more challenging to implement than the previous techniques.
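One common, lightweight form of the class-level approach is class weighting, which many scikit-learn estimators expose directly. The sketch below assumes a logistic regression model, and the 10:1 penalty ratio is an illustrative value; as noted above, the optimal costs must be tuned per problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=2)

# 'balanced' reweights misclassification penalties inversely to class frequency.
auto_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit costs: misclassifying the minority class (1) is 10x more expensive.
# The 10:1 ratio is an assumption; optimal costs must be tuned per problem.
cost_sensitive = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```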

In summary, adopting AI technologies is crucial for innovation and competitiveness, and the effectiveness of these technologies depends heavily on the quality of the data used to feed the AI models. Two key issues can significantly degrade data quality and, consequently, the accuracy of AI model results: data drift and data bias due to imbalanced datasets. This article has illustrated these two issues and offered guidance on how to address them to obtain accurate results from AI models, underscoring the critical importance of high-quality data for successful AI adoption.