Using Machine Learning to Calibrate Online Opinion Bias

Every day at Citibeats, we strive to gain a deeper understanding of people’s opinions with the intention of having their voices be heard – by companies, governments and other decision-makers that are best positioned to address their needs. 

We use many text sources such as Twitter, forums and blogs. But opinions on online platforms are not always representative of the global population. For instance, in 2020, women represented 26.2% of Twitter users in France, whereas Insee reported that women made up 51.7% of the French population. Clearly, there is a discrepancy. So, we’ve made it one of our main goals to remove bias from people’s opinions and calibrate the results before delivering them to our clients.

Here we will explain why reporting results without calibration may lead to false interpretations and how we deal with this issue at Citibeats. The methodology we use is in line with the findings made by Ricardo Baeza-Yates et al. (2020) in a case study conducted in Chile and Argentina about representativeness in the abortion legislation debate. We thank Ricardo for his insights into the subject and for inspiring our approach.

The issue of accurate representation in opinions

The best way to illustrate the bias issue is with an example. The government of Skotoprigonievsk, an imaginary city-state with a population of 20 million with an equal 1:1 male to female sex ratio, is using Citibeats to understand people’s opinion on a particular subject. To do so, we collected 10 million users’ opinions on online platforms. The raw results can be seen below:

Illustration 1: Raw results on subject opinions

At first glance, such results look great. However, none of the characteristics of the users we collected the opinions from were taken into account. With Citibeats’ technology, it becomes possible to infer such characteristics with precision.

On platforms like Twitter, a significant percentage of posts come from NGOs, firms, famous figures or bots. Thus, as we pointed out above, the demographics may not be representative of reality. Below, we present the demographics of the collected opinions and how they change once they have been calibrated.

Illustration 2: Results before and after calibration

In actuality, 20% of the collected opinions came from firms or NGOs. Moreover, after discarding those institutions, 80% of the remaining opinions came from men. So, the results needed to be calibrated to make sure that the opinions were not skewed, and that they were representative of Skotoprigonievsk’s population. Being presented with the raw results versus the calibrated ones would change the Skotoprigonievsk government’s conclusions considerably.
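One standard way to perform this kind of calibration is post-stratification: reweighting each demographic group so that its share matches the real population. Here is a minimal sketch; the support rates and shares below are illustrative numbers, not the actual figures from the illustration, and this is not Citibeats’ exact pipeline.

```python
# Post-stratification sketch: replace the (skewed) observed group shares
# with the true population shares when averaging per-group support rates.

def calibrate(support_by_group, observed_share, population_share):
    """Return (raw, calibrated) overall support rates."""
    raw = sum(support_by_group[g] * observed_share[g] for g in support_by_group)
    calibrated = sum(support_by_group[g] * population_share[g] for g in support_by_group)
    return raw, calibrated

support_by_group = {"men": 0.70, "women": 0.40}   # hypothetical per-group support
observed_share   = {"men": 0.80, "women": 0.20}   # sample skew after removing institutions
population_share = {"men": 0.50, "women": 0.50}   # census 1:1 ratio

raw, calibrated = calibrate(support_by_group, observed_share, population_share)
print(f"raw: {raw:.2f}, calibrated: {calibrated:.2f}")  # raw 0.64 vs calibrated 0.55
```

Reporting the raw 64% instead of the calibrated 55% is exactly the kind of distortion the example above illustrates.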

This example highlights the challenge that every opinion survey faces: ensuring that a representative sample of the population is polled, in order to avoid misleading people and exacerbating common misconceptions about Internet data. 

If you want to know more about this topic, read our article: Calibrating for gender bias in online data.

The Citibeats approach

We worked on demographic segmentation for opinion representativeness. We assess users’ gender but also their age or location. Plus, we identify institutions and bots in order to reduce the noise that could interfere with an accurate portrayal of people’s true opinions. By using this method, internet opinions can be used to collect representative or calibrated data, making them comparable to carefully prepared surveys – only with the advantage of being in real-time, and at enormous scale.

We used several steps to identify demographics in order to calibrate our results – including collecting labeled data and modeling gender probability. We used Data Science and Machine Learning algorithms and did all development and model training from scratch.

The following two sections include a highly technical explanation of our methodology. If you are not interested in the technical details, you can skip directly to the results.

Data collection

First, we had to collect labeled data – in other words, user account information whose gender (or institution status) is easily identified. As a first pass, we used a priori gender markers:


Illustration 3: Collecting users and identifying gender from scratch

We randomly collected users on Twitter along with their name, username and biography. We only used that data to train our models, without storing it, in order to protect users’ privacy. To extract the labels, we used the names and descriptions to count obvious gender markers. In the second example above, Esther is a female marker and the emojis give one male and one female gender marker. Since the female markers outnumber the male ones, we classified this user as female.
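The marker-counting heuristic can be sketched as follows. The marker sets here are tiny illustrative samples (the real dictionaries are much larger and per-language), and the majority-vote tie-breaking rule is an assumption:

```python
from collections import Counter

# Illustrative a-priori marker sets; not Citibeats' actual dictionaries.
FEMALE_MARKERS = {"esther", "maria", "mom", "mother", "👩", "🙋‍♀️"}
MALE_MARKERS   = {"john", "pedro", "dad", "father", "👨", "🙋‍♂️"}

def label_from_markers(name, bio):
    """Count obvious gender markers in the name and biography; return a
    label only when one gender strictly dominates, else None (unlabeled)."""
    tokens = (name + " " + bio).lower().split()
    counts = Counter()
    for tok in tokens:
        for marker in FEMALE_MARKERS:
            if marker in tok:
                counts["female"] += 1
        for marker in MALE_MARKERS:
            if marker in tok:
                counts["male"] += 1
    if counts["female"] > counts["male"]:
        return "female"
    if counts["male"] > counts["female"]:
        return "male"
    return None

# Name marker + one female and one male emoji: female wins 2 to 1.
label_from_markers("Esther", "👩 👨")  # → 'female'
```

Accounts with no markers, or a tie, are simply left unlabeled rather than guessed.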

This method enabled us to collect a high variety of examples: we succeeded in labeling more than 50k institution accounts and more than 100k users for each of the two main genders, in 4 languages (English, French, Portuguese & Spanish).

A note on privacy: at Citibeats, we only display gender data at the aggregate level, so that it is not possible to know the gender of any given individual.

Second, we tried several approaches:

  1. A rule-based model with the a priori gender markers, often used as a baseline for comparison.
  2. A Bag-of-Words representation combined with a Logistic Regression.
  3. A Deep Learning model inspired by Wang et al. (2019).
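For approach 2, the Bag-of-Words features can be built from character n-grams of the name and description, so that a linear model can pick up gendered prefixes and suffixes. A minimal sketch – the n-gram sizes and padding scheme are assumptions, not our exact configuration:

```python
from collections import Counter

def char_ngrams(text, n_sizes=(2, 3)):
    """Return a bag (Counter) of padded character n-grams for one input
    field; '#' marks word boundaries so prefixes/suffixes are distinct."""
    bag = Counter()
    for word in text.lower().split():
        padded = f"#{word}#"
        for n in n_sizes:
            for i in range(len(padded) - n + 1):
                bag[padded[i:i + n]] += 1
    return bag

bag = char_ngrams("Esther")
# Suffix n-grams such as 'er#' are exactly the kind of feature a
# logistic regression can weight toward one gender.
```

The resulting sparse counts are then fed to the logistic regression classifier.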

Modeling the genders’ probability

The final model takes as inputs the user name, screen name and description. The output is a probability distribution across genders. We mask, in the inputs, all the gender markers that helped us determine the gender or organisation status of a user.

First, we train an independent deep learning model for each input to predict the gender (bottom chart in illustration 4). For instance, using only the name input, we train a deep learning model (a bi-directional long short-term memory recurrent neural network – LSTM RNN) to predict whether the name is more likely to belong to a man, a woman or an organization.

Illustration 4: Deep Learning model architecture. Top: the global architecture. Bottom: the detailed architecture of the deep learning models

Second, we take the three trained models back, but discard the final softmax layer of each. We then concatenate the last ‘Concat’ layers (bottom chart in illustration 4) of the three models and add a new softmax layer on top (the new classifier).

We train this architecture in two steps (top chart in illustration 4). During the first one, the warm-up step, we freeze all layers but the one on top, the final softmax layer, and train the model. The second step consists of unfreezing all layers and training them together.
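Schematically, the two phases differ only in which parameter groups receive gradient updates. A toy sketch – the layer names are illustrative, and in practice this is done via the deep learning framework’s trainable flags:

```python
# Named parameter groups standing in for the layers of the architecture.
LAYERS = ["name_branch", "screen_name_branch", "description_branch", "new_softmax"]

def trainable_layers(phase):
    """Warm-up: only the new classifier on top is updated.
    Fine-tuning: every layer is unfrozen and trained together."""
    if phase == "warm_up":
        return ["new_softmax"]
    if phase == "fine_tune":
        return list(LAYERS)
    raise ValueError(phase)
```

Warming up the new classifier first avoids large early gradients destroying the pretrained branch representations.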


We used bidirectional LSTM RNNs to learn the best representation of each input. Concretely, learning the best representation of an input means extracting the information from it that best helps the classifier (the softmax layer) produce accurate gender probabilities.

Illustration 5: vanilla RNN architecture

The advantage of LSTMs is that they learn long-distance relationships. For instance, at the first layer of the LSTM, each block receives a letter as input, along with a hidden state written by the previous recurrent block.

In fact, each block has learnt what to forget from the previous hidden state, and what to write based on the input and what it has read from that hidden state. In other words, the 4th block has learnt what to forget from ‘arn’ and what to write as a new hidden state from ‘arn’ and the new input ‘a’.
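To make the gating mechanics concrete, here is one step of a deliberately simplified, scalar LSTM cell. The real model uses learned vector-valued weights; the numbers below are arbitrary illustrative values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w, u, b):
    """x: current letter (encoded as a number); h_prev, c_prev: previous
    hidden and cell state; w, u, b: per-gate parameters (f, i, g, o)."""
    f = sigmoid(w["f"] * x + u["f"] * h_prev + b["f"])    # forget gate: what to drop from c_prev
    i = sigmoid(w["i"] * x + u["i"] * h_prev + b["i"])    # input gate: how much new content to write
    g = math.tanh(w["g"] * x + u["g"] * h_prev + b["g"])  # candidate content from x and h_prev
    o = sigmoid(w["o"] * x + u["o"] * h_prev + b["o"])    # output gate: what to expose as h
    c = f * c_prev + i * g                                # keep part of the old state, add the new
    h = o * math.tanh(c)
    return h, c

params = ({g: 0.5 for g in "figo"}, {g: 0.3 for g in "figo"}, {g: 0.0 for g in "figo"})
h, c = 0.0, 0.0
for letter in "arna":                     # feed the letters one by one
    h, c = lstm_step(ord(letter) / 128.0, h, c, *params)
```

After each step, the hidden state h is what the next block ‘reads’, so information from early letters can survive many steps.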

The blog post Written memories: understanding, deriving and extending the LSTM, is an excellent resource to understand the LSTMs.


Results of the modeling

To evaluate our results, we benchmarked against the results of Wang et al. as the state of the art (SOA).
We should mention three main differences between Wang et al.’s model and ours: they trained their network to predict ages, they used profile pictures (a feature we don’t want to use), and they trained on 32 languages with a dataset 200 times larger.

1. Institution Identification

First, we look at the results of institution detection. We are below the state-of-the-art standard, but considering that we trained our models on just over 100k users, without profile pictures, versus their 24 million users, the results are quite good.

Illustration 6: Institution identification Results

The deep learning model clearly outperforms the baseline, but the linear model reaches equivalent results. This suggests that we should increase the volume of our training dataset to get the full potential out of the deep learning architecture.

Nevertheless, our methodology shows signs of going in the right direction: with results this close on far less data, we may reach or beat the state of the art with more data.

2. Gender Identification

When we look at the gender differentiation, we are almost at the level of the SOA – again, with much less data.

Illustration 7: Gender Identification results

We have good results across languages: all of them reach an F1 over 90% – even English, where gender is more difficult to detect than in the Latin languages.

Illustration 8: Male Vs Female - All languages

We have evidence that our model learned how to effectively differentiate women and men. First, we computed the empirical probabilities associated with some features, such as diminutive names (top table in illustration 9): Cris has a 75% probability of being associated with women, since it can also be used by men.

We also noticed an interesting fact: men use fewer emojis, and the probability that a user with a man-head emoji is a man is not 100% but 89%. Indeed, in the data collection section above, we saw with the Esther example that women may use emojis to describe the sex of their children.

Illustration 9: Some examples of learned features. Top: some names or diminutives. Bottom: emojis we found.
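Probabilities like these can be read directly off the labeled training set as conditional frequencies. A toy sketch with a made-up dataset – the real counts (and hence the 75% and 89% figures) come from our 100k+ labeled accounts:

```python
from collections import Counter

# Made-up (feature, gender) observations; in practice, one entry per
# labeled account bearing the feature.
labeled = [("👨", "male"), ("👨", "male"), ("👨", "male"), ("👨", "female"),
           ("cris", "female"), ("cris", "female"), ("cris", "female"), ("cris", "male")]

def empirical_probs(labeled_features):
    """For each feature, the share of accounts bearing it per gender."""
    totals, by_gender = Counter(), Counter()
    for feature, gender in labeled_features:
        totals[feature] += 1
        by_gender[(feature, gender)] += 1
    return {f: {g: by_gender[(f, g)] / totals[f] for g in ("male", "female")}
            for f in totals}

probs = empirical_probs(labeled)
# e.g. probs["cris"]["female"] == 0.75, the kind of figure cited above
```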

In fact, those results are even more promising because our simple data collection method introduced some labeling mistakes – and when we look at the predictions on the training set, the deep learning algorithm corrects them.

Illustration 10: Example of the algorithm correction

Such corrections can help us apply a bootstrap methodology: classifying many new users to increase the size of our training dataset – and perhaps beat the state of the art.

To validate our results, as recommended by Ricardo Baeza-Yates, we applied the algorithm to datasets we collected in South America and compared our Twitter demographics estimates with Hootsuite’s surveys. Our figures are close to Hootsuite’s assessments for many countries; we highlight four of them in the following illustration. This partly validates our approach to estimating Twitter demographics.

Illustration 11: Comparison with Hootsuite surveys

Gender calibration is one example of the features we develop at Citibeats to assess the demographics of populations on online platforms. We also work on other user characteristics, such as age and location, to sharpen our demographic assessments.

Final Thoughts

At Citibeats, we dedicate a lot of effort to unskewing the results derived from the data we collect. We can thus be more representative of people’s opinions, not only by discarding noise from corporations, brands and NGOs, but also by assessing the demographics of the collected opinions to calibrate the final results.

One of Citibeats’ main concerns is using AI for good, or Ethical AI. Therefore, our data collection and processing methodology takes special care of people’s privacy, making sure that the data is securely stored and deleted when it is no longer needed. Also, all gender segmentation data is only ever shown at the aggregate level, so that it is not possible to know the gender of any given individual.
