Developing Models with Low Resources? Not an Issue with Citibeats’ Technology!
In this blog post, we’d like to introduce you to one of the day-to-day tasks of our Data Science team, which is also one of its major challenges. In particular, we will describe the development process of our multilingual complaint detector for social media texts and how we solved some issues related to its implementation.
*DISCLAIMER* We have tried to keep the language non-technical, so that anyone can read and understand the article in full. It goes without saying that data scientists, analysts, developers, or anyone with some knowledge of machine learning will be able to appreciate all the subtleties.
The Citibeats algorithm is used by various clients worldwide (for instance in Latin America and the Caribbean, or for the WHO COVID observatory) and supports more than 50 languages with precision and recall higher than 75%.
But before getting to the production phase, the data science team has to face several challenges that are well known among machine learning experts. One of them is: how to get a good labelled dataset to train the detector in various languages?
To highlight the challenge, let’s take a narrower but simpler problem: we will follow the workflow of creating a complaint detector for English tweets only. We will not reach the results of our production model, as we have simplified the problem; the objective is to give an overview of the potentially hard tasks hidden in an a priori simple data science problem.
When we analyzed the literature around complaint detection in social media, we saw that researchers generally formalize the problem as a classification task. In other words, we would like a decision system that takes a text as input and returns 1 if the text carries a complaint, and 0 otherwise. So far, it looks like a classic machine learning problem.
If we look at some papers that have already tackled the problem, for instance Preotiuc-Pietro et al. (2019), we see that they generally used a supervised algorithm, i.e. an algorithm that learns from labelled examples. In the case of Preotiuc-Pietro et al., around 2,000 labelled examples were used to train their final classifier. And here lies the main issue: how do we quickly get good labelled data to train our classifier in a supervised way?
In this specific case, thanks to the authors, who freely released their dataset, one can quickly train one’s own classifier. But in most cases, for instance if we want to do the same in a lower-resource language such as Catalan or Swahili, open-source datasets are scarce and we have to find other ways of getting labelled data. That is the subject of the following section.
Labelled Data Creation
There are several ways to obtain labelled data even when no labelled dataset already exists. Here are some methods you can use – the list is not exhaustive:
We have mentioned only some methods, but of course there are many more: for example, few-shot learning to adapt pre-trained models to a new task, or data augmentation to increase coverage and make the model focus on the words or phrases that carry the meaning of the complaint.
Another method worth mentioning (we did not take it up here, but will very likely analyze it in a future blog post) is the prompting strategy. For instance, in a blog post, Schick describes a prompting approach that outperforms GPT-3 with 99.9% fewer parameters.
In a nutshell, the classification task is framed as a cloze question: we ask an already trained (very large) language model to process some text and fill in the blank in a question such as ‘this text is about ___’. The label provided by the model is then used for your downstream classification problem.
In the next section, we will detail how we created the dataset to overcome the lack of labelled data. Then, we will provide some results from training a complaint detector with the resulting datasets.
*DISCLAIMER* For the sake of comparison, we used a pre-existing manually annotated dataset of nearly 1.5k texts, around 600 of which are complaints. This provided us with a test set to estimate the performance of the model depending on the training dataset we feed it.
To compare the experiments’ results depending on the training set, we decided to apply the same training pipeline to all the trials.
We fitted a TF-IDF + Logistic Regression pipeline to all the experimental datasets (without any cross-validation step, to keep things simple). We used an NLTK lemmatizer in the TF-IDF tokenizer, considered only words appearing at least twice in the training dataset, and used n-grams up to length 4.
The first experiment was to manually annotate around 450 texts with labels ‘Complaint’ and ‘No Complaint’. We called the resulting dataset of this experiment v0.
The second experiment consisted of gathering a list of hashtags that would be associated with a complaint in a text. For instance, we considered the hashtags ‘#badbusiness’ or ‘#nevergain’ to be associated with ‘Complaints’. We did the same for ‘No Complaints’ with random hashtags such as ‘#dogecoin’ or ‘#NationalBurgerDay’. We called the resulting experimental dataset HT.
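The pseudo-labelling rule behind HT can be sketched like this; the hashtag sets are shortened illustrations, not the full lists used in the experiment:

```python
# Illustrative hashtag lists (the real experiment used longer, curated lists)
COMPLAINT_TAGS = {"#badbusiness", "#nevergain"}
NO_COMPLAINT_TAGS = {"#dogecoin", "#nationalburgerday"}

def pseudo_label(tweet):
    """Return 1 (Complaint), 0 (No Complaint), or None if no known hashtag."""
    # Collect hashtags, normalising case and trailing punctuation
    tags = {tok.lower().rstrip(".,!?") for tok in tweet.split() if tok.startswith("#")}
    if tags & COMPLAINT_TAGS:
        return 1
    if tags & NO_COMPLAINT_TAGS:
        return 0
    return None  # tweet cannot be pseudo-labelled

label = pseudo_label("Waited two hours for nothing #badbusiness")  # → 1
```

Tweets matching neither list are simply dropped, so the HT dataset only contains texts whose hashtags already hint at the label.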
The third experiment was the one that generated the most enthusiasm. We tried the implementation from Schick et al. (2021). The idea, explained in this blog post, is to give clear instructions, such as “Write a complaint about XXX:”, to a big pre-trained model like GPT-3 (a model trained on many gigabytes of data) in order to create a dataset of generated, labelled texts from scratch.
To adapt the method to our own problem, we changed the instructions. And we did it twice, with two different approaches: first, using the exact same implementation; second, adapting an open-source GPT-3-style model trained on The Pile, a big open-source dataset. In the end, we got two datasets of texts labelled ‘Complaint’ and ‘No Complaint’. We called the two experiments DINO and DINO-GPT3.
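The generation loop can be sketched roughly as follows. Here `generate` is a hypothetical stub standing in for the language-model call, and the instruction templates are illustrative, not the exact prompts we used:

```python
# Instruction templates keyed by the label the generated text should carry
INSTRUCTIONS = {
    1: "Write a complaint about {topic}:",        # 1 = Complaint
    0: "Write a neutral comment about {topic}:",  # 0 = No Complaint
}

def generate(prompt):
    # Stub: a real run would query GPT-3 or an open-source equivalent here
    return "<generated text for: " + prompt + ">"

def build_generated_dataset(topics):
    """Produce (text, label) pairs from scratch, one per label per topic."""
    rows = []
    for topic in topics:
        for label, template in INSTRUCTIONS.items():
            rows.append((generate(template.format(topic=topic)), label))
    return rows
```

Because both the texts and the labels come from the instructions, no manual annotation is needed at any point, which is what made this approach so appealing.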
We also considered a last experiment: adding a layer on top of the other attempts. Once we had trained the whole pipeline on a dataset, we applied it to unlabelled data in order to find new “confident” data. Then, we mixed this newly labelled data with the previous training set and retrained the whole pipeline.
To define “confident” data, after applying the already trained pipeline to unlabelled data, we looked at the predicted probability of each unlabelled text carrying a complaint. We labelled all texts with a probability higher than 0.9 as ‘Complaint’, and all texts with a probability lower than 0.1 as ‘No Complaint’. We applied this procedure twice in a row, and we called this experiment BStrap.
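One BStrap round could be sketched like this. The toy pipeline and texts are illustrative (the real experiment used the TF-IDF pipeline described earlier and ran the procedure twice):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bootstrap_round(pipeline, texts, labels, unlabelled, hi=0.9, lo=0.1):
    """Pseudo-label confidently classified texts, then retrain the pipeline."""
    proba = pipeline.predict_proba(unlabelled)[:, 1]  # P(text is a complaint)
    keep = (proba >= hi) | (proba <= lo)              # "confident" data only
    new_texts = [t for t, k in zip(unlabelled, keep) if k]
    new_labels = (proba[keep] >= hi).astype(int).tolist()
    texts, labels = texts + new_texts, labels + new_labels
    pipeline.fit(texts, labels)                       # retrain on enlarged set
    return pipeline, texts, labels

# Toy illustration: train once, then run one bootstrap pass
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
texts = ["awful service", "awful delay", "great day", "great food"]
labels = [1, 1, 0, 0]
pipe.fit(texts, labels)
pipe, texts, labels = bootstrap_round(pipe, texts, labels,
                                      ["really awful service", "what a great day"])
```

Texts whose predicted probability falls between the two thresholds are left out entirely, which is what keeps the added labels relatively trustworthy.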
To establish a solid baseline for our experiments, we considered the performance of a random classifier (a Bernoulli random variable with p = 0.5) and trained the same pipeline on the open-source data provided by Preotiuc-Pietro et al. (2019), from now on called the Preotiuc-Pietro dataset.
Here is a summary of the datasets we used and the number of complaints after applying each dataset construction:
All the results below, and the conclusions we draw from them, are not universal and apply only to this case. Making them more general would require pushing the analysis further.
The Four Experiments
Firstly, we considered each experiment ignoring the bootstrap step:
We saw that all the models achieved an F1 higher than the random baseline (which is lower than 0.5, as the test dataset is not balanced). We also noticed that the dataset provided by Preotiuc-Pietro et al. (2019) gave the best results, with an F1 of 65%.
Secondly, the results highlighted that with only 450 labelled texts (v0), we reached a precision equivalent to that of the best model (trained on the whole Preotiuc-Pietro dataset). For this pipeline, then, the learning curve could be pretty flat from 450 to 2,000 labelled texts as far as precision is concerned: additional data mainly improves coverage. Precision was noticeably higher with the manually labelled dataset than with the other methods.
Thirdly, the results from the DINO implementations were pretty disappointing. Even though the prospect of creating our dataset from scratch was clearly appealing, the results with this pipeline were poor, with a precision equivalent to that of the random classifier.
However, we did not generate a dataset as big as the one in Timo Schick’s blog post (~20k examples), and we will try to improve this pipeline in the future. Moreover, the prospect of automating such a pipeline would be invaluable for many binary classification problems.
Finally, the best outcomes came from the HT dataset. With this pseudo-labelled dataset, we came close to the best results, those achieved by the pipeline on the bigger manually labelled dataset. Even though precision was much lower, coverage was high, and the prospect of using this technique at some point to improve results looked promising.
Adding a Bootstrap Layer
In a second analysis, we looked at how the models behaved when adding data labelled by the BStrap strategy: we fitted a first pipeline, collected ‘confident’ data to enlarge the first version of the dataset, and retrained the pipeline on the new, bootstrapped dataset.
In the chart below, we show the differences in metrics before and after applying the bootstrap methodology:
A few things are worth noticing. First, recall increased for two out of three datasets. Second, it decreased only for the HT dataset, which had the best recall by a large margin before the bootstrap step. Finally, precision went the other way: the higher the gain in recall, the higher the loss in precision.
So, if we consider F1, the decrease in precision was compensated by the larger increase in recall, so the results looked better on this metric. The original dataset sizes may also have played a role in the amplitude of the changes.
This experiment seemed to confirm that the F1 metric can be improved with some bootstrapping process. Nevertheless, the trade-off was clear: any gain in recall is echoed by a loss in precision. Whether to apply the bootstrapping process to your classifier depends on your goals for your product.
Adding Pseudo-Labels to Labelled Data
Finally, as the pseudo-labelled dataset (HT) looked promising, we tried concatenating it with manually annotated data. We thought that a mix of both could offer a good trade-off between annotating data and using an automatic procedure to label texts.
Below, you can see the results with the original manually annotated datasets and with the hashtag-extracted pseudo-labels added:
As you can see, we improved the F1 results for both datasets, thanks especially to a higher recall (+20 pts for v0 and +5 pts for the Preotiuc-Pietro dataset), while precision remained stable (less than a 1 pt decrease for both datasets).
When adding the pseudo-labels, we noticed a clear difference in the size of the improvement between the two manually annotated datasets. This difference could depend on the initial volume of each dataset, which is higher for the Preotiuc-Pietro dataset.
Finally, looking at the F1 results, we noticed that the difference between v0 + HT and Preotiuc-Pietro + HT was lower than 1 pt: they had more or less the same precision and differed mainly on recall. Yet the Preotiuc-Pietro dataset contains 10 times more annotated data than v0.
In short, the combination of a small batch of annotations and pseudo-labels beat the results of the classifier trained on the much bigger manually annotated dataset.
In this blog post, we have presented one big challenge we faced at Citibeats, and a very common one in the data science industry: how to develop a model with low resources, or in other words, how to overcome the lack of good labelled data to train a classifier.
As we’ve seen, part of the research at Citibeats is about finding ways to solve this problem: for example, using tricks like pseudo-labelling or unsupervised learning methods to grow a small batch of labelled data, or implementing more advanced machine learning techniques like text generation to create a labelled dataset from scratch. In this case, they were really helpful for training a complaint detector.
But this is just one of the many tasks we perform at Citibeats. Every day we face new, inspiring challenges, such as adapting methods that work for English to more than 100 other languages, or reducing the bias of the results (because the final training sets may be poisoned with latent biases), and so on. We will talk about these and other topics in the future, so stay tuned! We look forward to hearing your comments, and let us know if you’re interested in a follow-up article!