Etiq AI Blog

Which Of The Publicly Available NLP Corpora: BERT, GloVe, Word2Vec, Have Bias Issues?



Various industry applications use Natural Language Processing (NLP), and specifically word embeddings, which are trained on a variety of widely available corpora such as Wikipedia, Twitter and Google News. Often, problems that are outside the sphere of traditional NLP applications (customer service, sentiment analysis etc.) can be solved with advanced pipelines containing some NLP-based components. Since Etiq are incorporating NLP into some of our pipelines, we needed to understand exactly which corpora have ‘bias’ issues, as well as which have been ‘debiased’ and the approach used. As we were unable to locate answers to either of these questions, we undertook our own analysis to examine various publicly available corpora for bias. Our findings are detailed below. We hope this initial work will inspire similar efforts and lead to a shared resource that can be used by practitioners.


Word embedding is a popular method for text analysis, where words are mapped to a vector that represents a point in a multi-dimensional space. To illustrate the process of word embeddings, consider a simple scenario where we have three documents. A straightforward embedding in this situation involves counting the number of times a certain word w appears in each document. This gives three numbers that can then form a three-element vector to represent w, where each document corresponds to a dimension. The intuition here is that similar words will appear in the same documents and therefore, be closely positioned in the three-dimensional space, making it relatively easy to identify words of a similar meaning. For further information about word embeddings, see Jurafsky and Martin (2021).
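As a minimal sketch of the counting scheme described above, the snippet below builds three-element vectors from three made-up documents. The documents and the helper name count_embedding are illustrative assumptions and are not part of our analysis.

```python
# Toy count-based embedding: one dimension per document.
# The three documents below are invented purely for illustration.
documents = [
    "the nurse helped the doctor in the hospital",
    "the doctor and the nurse work in a hospital",
    "the striker scored a goal in the match",
]


def count_embedding(word, docs):
    """Return a vector with the number of times `word` occurs in each document."""
    return [doc.split().count(word) for doc in docs]


for w in ["nurse", "doctor", "goal"]:
    print(w, count_embedding(w, documents))
```

In this toy example ‘nurse’ and ‘doctor’ receive the same vector because they occur in the same documents, whereas ‘goal’ is positioned elsewhere in the three-dimensional space.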

There are various embedding packages available that have differing approaches to calculating the vector representation of words. Similar to how we considered three specific documents in our explanation above, these packages look at different training corpora, such as Wikipedia or Twitter texts, which lead to varying outputs. Again in line with the explanation above, vector representations of words with a similar meaning are commonly found a small distance apart in the multi-dimensional space defined by the embeddings. Moreover, computing the differences between word vectors can reveal how different word pairings relate to one another, e.g. Man − Woman ≈ Uncle − Aunt (see Mikolov et al. (2013) for further details). These particular aspects of word embeddings can also help in detecting gender bias in word embedding packages. For example, Bolukbasi et al. (2016) demonstrate that the word2vec package trained on the corpus of Google News contains gender bias such that ‘man’ and ‘computer programmer’, as well as ‘woman’ and ‘homemaker’, are strongly linked, that is: Man − Woman ≈ Computer Programmer − Homemaker.
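To illustrate how vector differences can capture a relationship between word pairs, here is a small sketch using hand-made three-dimensional vectors. The numbers are invented so that the second dimension plays the role of a ‘gender’ direction; real embeddings are learned from text and the relationship only holds approximately.

```python
# Hand-made 3-dimensional vectors, invented for illustration only;
# real embeddings have hundreds of dimensions and are learned from a corpus.
toy_vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "uncle": [0.3, 0.0, 0.9],
    "aunt":  [0.3, 1.0, 0.9],
}


def difference(word_a, word_b, vectors):
    """Element-wise difference between the vectors of two words."""
    return [x - y for x, y in zip(vectors[word_a], vectors[word_b])]


man_minus_woman = difference("man", "woman", toy_vectors)
uncle_minus_aunt = difference("uncle", "aunt", toy_vectors)

# Both differences point along the same (toy) gender direction.
print(man_minus_woman)
print(uncle_minus_aunt)
```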

Recent work has proposed techniques to remove bias from word embeddings, such as those by Bolukbasi et al. (2016) and Zhao et al. (2018). However, Gonen and Goldberg (2019) suggest that these two methods are only partially successful, since gendered information persists in the vector representations: previously ‘biased’ words still cluster together in either ‘female’ or ‘male’ groups even after these debiasing mechanisms are applied. Consequently, it is likely that most word embeddings contain some degree of (residual) bias derived from the training corpora, even if some bias mitigation method has been applied.

We explore this possibility in more depth here by looking at three embedding methods, BERT, GloVe and Word2Vec, and investigating each of these for hidden gender bias. Different training corpora are also analysed with the same embedding method; for example, both GloVe: Common Crawl and GloVe: Wikipedia and GigaWord are examined.

To detect bias, three different measures are used: cosine similarity, a measure of similarity between two vectors; Euclidean distance, the distance between two vectors; and principal component analysis (PCA), where a gender subspace is generated with PCA and the distance between two vectors projected onto this subspace is measured. Cosine similarity and Euclidean distance are both well-known vector measures in mathematics, whilst the technique of using a gender subspace to detect bias comes from the literature (for example, see Bolukbasi et al. and Babaeianjelodar et al. (2020)). The Appendix provides a detailed explanation of these three measures, as well as some example code. Here, these measures are applied to calculate the relationship between word embeddings. If these relationships are found to significantly support gender stereotypes, then the package is classified as biased. Table 1 in Section 2 summarises our findings.
Note that the purpose of this article is to take a quick dive into a very complex topic in order to present some interesting results and promote some conversations. A brief discussion of our findings is provided and we are interested to hear your thoughts to develop this further.


Table 1 lists the packages and their training corpora that we examined, with the measures used to test for gender bias. Two groups of 16 ‘stereotypical’ female words (group 1) and 16 ‘stereotypical’ male words (group 2) are tested. Consistent with the approach taken by Babaeianjelodar et al., these test words consist mostly of ‘stereotypically’ male or female professions, e.g. nurse, actress, secretary (female group); doctor, truck driver, businessman (male group). We calculate the percentage of the female group that are classified as biased in the female direction (red figures in Table 1), as well as the percentage of the male group that are classified as biased in the male direction (blue figures in Table 1), according to the measures. Refer to the Appendix for a detailed description of how a word is classified as biased in a particular direction based upon a certain metric. Each box is coloured green if the female and male percentages are both at least 50%, which indicates that there is substantial bias detected in both groups and, therefore, that this particular package is believed to contain significant gender bias in both directions. In addition, boxes are coloured yellow if any form of gender bias is identified. For further details on computing these percentages, refer to the Jupyter notebook found here, where the figures given in Table 1 are calculated explicitly. Note that a zero percentage in the female (red figures in Table 1) or male group (blue figures in Table 1) indicates that of the 16 words tested, none were classified as biased in the female or male direction respectively.
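The percentage calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name percent_biased and the scores are made up, and the notebook referred to above remains the reference for how the figures in Table 1 were actually produced.

```python
def percent_biased(scores, direction, higher_is_closer=True):
    """Percentage of words in a group that sit closer to `direction`.

    scores: dict mapping word -> (score_vs_woman, score_vs_man)
    direction: "female" or "male"
    higher_is_closer: True for a similarity (e.g. cosine similarity),
                      False for a distance (e.g. Euclidean distance).
    """
    biased = 0
    for score_woman, score_man in scores.values():
        if higher_is_closer:
            closer_to_woman = score_woman > score_man
            closer_to_man = score_man > score_woman
        else:
            closer_to_woman = score_woman < score_man
            closer_to_man = score_man < score_woman
        if direction == "female" and closer_to_woman:
            biased += 1
        elif direction == "male" and closer_to_man:
            biased += 1
    return 100.0 * biased / len(scores)


# Made-up cosine similarities for four 'stereotypical' female words.
female_group = {
    "nurse": (0.80, 0.60),
    "actress": (0.90, 0.40),
    "secretary": (0.70, 0.50),
    "teacher": (0.50, 0.60),
}
print(percent_biased(female_group, "female"))
```

With these invented scores, three of the four words are closer to ‘woman’, so 75% of the group would be classified as biased in the female direction.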

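Table 1: Percentage of test words classified as biased. Each cell gives the female group percentage followed by the male group percentage.

Package and training corpus             | Cosine similarity | Euclidean distance | PCA
BERT - BookCorpus and English Wikipedia | 0%, 0%            | 56%, 0%            | 75%, 75%
BERTweet                                | 0%, 0%            | 88%, 0%            | 6%, 6%
BERT - CNN/DailyMail                    | 50%, 13%          | 88%, 13%           | 81%, 50%
GloVe - Common Crawl                    | 81%, 38%          | 75%, 81%           | 75%, 88%
GloVe - Wikipedia and GigaWord          | 63%, 38%          | 69%, 75%           | 63%, 94%
Gensim Word2Vec - Google News           | 81%, 44%          | 50%, 81%           | 69%, 94%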


From the results in Table 1, it is apparent that some form of gender bias was detected during most of our tests. GloVe and Word2Vec consistently exhibited bias according to our metrics, largely independent of the type of measure applied or the corpora used for training. Note, however, that cosine similarity did not detect significant bias in the male group. In comparison, BERT, BERTweet and BERT CNN/DailyMail had a more varied set of results. Firstly, BERT showed no bias using cosine similarity and some in the female group with the Euclidean distance measure, while a clear gender bias was detected with PCA. This suggests that BERT has possibly undergone some form of debiasing, which potentially made the first two measures mostly ineffective. We attempted to confirm whether BERT has been debiased and for which training corpora, but we were unable to obtain this information. In contrast to BERT, the results for BERTweet reveal that a definitive gender bias was not observed using any of the three measures. More specifically, whilst some words indicated that there was hidden gender bias in this package, others were biased in an unexpected direction, such as ‘king’ being closer to ‘woman’ than ‘man’. This could suggest that this package has been quite heavily manipulated, although more analysis is needed to understand these results fully. Lastly, BERT CNN/DailyMail did reveal some gender bias with all three measures, with a significant amount identified using PCA. Similar to BERT, these mixed results may indicate that some debiasing has been undertaken, although, as previously discussed, we have been unable to confirm this. Note that cosine similarity and Euclidean distance have sometimes detected a substantial amount of bias in the female direction, whilst a negligible amount is found in the male direction.
These results suggest that, according to the metric applied, the bias associated with that particular NLP corpus is restricted to the female group. It is also important to note that, for each NLP corpus analysed, the results vary quite significantly across the three metrics used. This is most likely due to each of these three methods using quite distinct approaches to bias detection. The third measure, PCA, does appear to be the most successful at identifying gender bias in our example. This possibly suggests that this technique is superior to the other two and that it could be used in isolation going forward; however, further work would be needed to support this. We have presented these preliminary findings with the aim of encouraging further thoughts and conversations in this area. We also hope to raise awareness of the potential gender bias issues linked with possibly many publicly available NLP corpora. We are interested to hear from you to expand upon this initial work. In particular, if you are trying any other corpora, let us know and we’ll add them to the list!


Babaeianjelodar, M., Lorenz, S., Gordon, J., Matthews, J. and Freitag, E., 2020. Quantifying Gender Bias in Different Corpora. Companion Proceedings of the Web Conference 2020. Association for Computing Machinery, New York, NY, USA, 752-759.

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V. and Kalai, A., 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., Red Hook, NY, USA, 4356–4364.

Gonen, H. and Goldberg, Y., 2019. Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them. In Proceedings of NAACL-HLT, 609-614.

Jurafsky, D. and Martin, J.H., 2021. Vector Semantics and Embeddings. In Speech and Language Processing. Draft of December 29, 2021. Accessed 07 Jan 2022.

Mikolov, T., Yih, W.-t. and Zweig, G., 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, 746-751. Accessed 07 Jan 2022.

Zhao, J., Zhou, Y., Li, Z., Wang, W. and Chang, K.-W., 2018. Learning gender-neutral word embeddings. In Proceedings of EMNLP, 4847-4853.

Appendix: Measures for Gender Bias

Cosine similarity

For two vectors $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$ of length n, the cosine similarity is measured as:

$$\cos \theta = \frac{a \cdot b}{\lvert a \rvert \lvert b \rvert} \qquad (1)$$

where $\theta$ is the angle between the two vectors. This is a similarity measure for two vectors such that the closer a score is to 1, the greater the similarity. The score (1) is applied here to see if a particular word is closer to ‘man’ or ‘woman’, i.e. when the word is compared to ‘man’ and ‘woman’, which gives the higher score? A gendered noun such as ‘king’ may be shown to be more similar to ‘man’ than to ‘woman’, whilst ‘queen’ may be more similar to ‘woman’. Conversely, ‘monarch’ is gender-neutral, so greater similarity to ‘man’ than to ‘woman’ can also signal a degree of bias. Here, if a ‘stereotypical’ male word, e.g. ‘king’, is found to be more similar to ‘man’, then this is perceived as an indicator of bias in the word embedding, and it is classified as biased in the male direction. Similarly, if a ‘stereotypical’ female word, e.g. ‘queen’, is shown to be more similar to ‘woman’, it is classified as biased in the female direction. Of course, what is considered ‘stereotypically’ male or female is very much subject to debate, but at the moment this seems to be the main approach for bias identification in NLP in the relevant literature. An example of the Python code applied to test BERT word embeddings for hidden bias is listed in Figure 1.

# Cosine similarity
# Import the Python modules
import logging

import transformers
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# Silence the transformers log output
transformers.logging.get_verbosity = lambda: logging.NOTSET

# Feature-extraction pipeline that returns one vector per token
feature_extraction = pipeline('feature-extraction', model="bert-base-uncased", tokenizer="bert-base-uncased")

# Word embeddings for woman and man. [0][0] selects the vector at the first
# token position of the single input sequence (for BERT, the [CLS] token)
em_w = feature_extraction("woman")[0][0]
em_m = feature_extraction("man")[0][0]

# Words to be tested
test_words = ["mom", "queen", "nurse", "dad", "king", "cardiologist"]

# Computing the cosine similarity for the test word embeddings.
# cosine_similarity expects 2D inputs, so each vector is wrapped in a list
for item in test_words:
    em_t = feature_extraction(item)[0][0]
    print('similarity woman and ', item, cosine_similarity([em_w], [em_t])[0][0])
    print('similarity man and ', item, cosine_similarity([em_m], [em_t])[0][0])
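For readers who want to check the cosine similarity formula by hand, the following dependency-free sketch implements it directly and applies the ‘which gives the higher score?’ rule. The two-dimensional vectors and the helper names are made up for illustration; they are not BERT embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closer_gender(word_vec, woman_vec, man_vec):
    """Return 'woman' if the word is more similar to woman_vec, otherwise 'man'.

    A tie is reported as 'man' here; that choice is arbitrary.
    """
    if cosine(word_vec, woman_vec) > cosine(word_vec, man_vec):
        return "woman"
    return "man"


# Invented toy vectors, chosen so the arithmetic is easy to follow.
woman_vec = [1.0, 0.0]
man_vec = [0.0, 1.0]
word_vec = [3.0, 4.0]

print(cosine(word_vec, woman_vec))  # 3 / 5
print(cosine(word_vec, man_vec))    # 4 / 5
print(closer_gender(word_vec, woman_vec, man_vec))
```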

Euclidean distance

For two vectors $a = (a_1, a_2, \ldots, a_n)$ and $b = (b_1, b_2, \ldots, b_n)$ of length n, the Euclidean distance is defined as:

$$d(a,b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \qquad (2)$$

We use (2) to assess if a word is closer in distance to ‘man’ or ‘woman’, i.e. when the word is compared to ‘man’ and ‘woman’, which gives the lower score? If a ‘stereotypical’ male word, e.g. ‘king’, is closer in distance to ‘man’, or a ‘stereotypical’ female word, e.g. ‘queen’, is closer in distance to ‘woman’, then this indicates a degree of bias in the word embedding, and we classify this word as biased in the male or female direction respectively. A continuation of the example Python code applied to BERT that tests for bias with Euclidean distances is listed in Figure 2.

import numpy as np

# Computing the Euclidean distance for the test word embeddings
for item in test_words:
    em_t = feature_extraction(item)[0][0]
    print('distance woman and ', item, np.linalg.norm(np.array(em_w) - np.array(em_t)))
    print('distance man and ', item, np.linalg.norm(np.array(em_m) - np.array(em_t)))
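One possible reason the first two measures can disagree is that cosine similarity ignores vector length, whereas Euclidean distance does not. The sketch below shows this with two made-up vectors that point in the same direction but differ in length; the helper names are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Cosine similarity, included here only for comparison."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Invented vectors: b is simply a scaled by 2.
a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]

print(cosine(a, b))     # same direction, so the similarity is maximal
print(euclidean(a, b))  # but the points are still some distance apart
```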

Principal Component Analysis (PCA)

This technique, which we refer to as PCA, involves defining a gender subspace in order to test word embeddings for hidden bias. To form this subspace, we follow Babaeianjelodar et al. (2020). Firstly, ‘stereotypical’ male words, e.g. 'him', are paired with their female equivalent, e.g. 'her', and their word embeddings are computed. We use approximately ten pairs of these types of words. The centroid of each pair is calculated and subtracted from each word within the pairing. This attempts to highlight the gendered information within the word embedding. Next, using principal component analysis, these vectors can be reduced to one or two vectors, which represent the gender subspace. The pairs of male/female words can then be projected onto this space, where a cluster of male words and a cluster of female words should be roughly formed.

The level of bias associated with a new word embedding can be assessed using its projection onto this 'gender' space, and determining whether the word is located closer to the male or female cluster. Firstly, we compute the centroid of each cluster, which we refer to as centroid_female_projection (female cluster) and centroid_male_projection (male cluster). Then, the word under investigation is projected onto the gender subspace, labelled as new_word_projection. Next, the following Euclidean distances are calculated:

dist_female = ||centroid_female_projection - new_word_projection||, dist_male = ||centroid_male_projection - new_word_projection||.


Lastly, a probability score is assigned to each word, which represents the likelihood that the word embedding has a female bias or a male bias such that

P(female)=dist_male/(dist_female+dist_male), P(male)=1-P(female). (3)

If a ‘stereotypical’ female word has P(female)>0.5, or a ‘stereotypical’ male word has P(female)<0.5, then a hidden bias in this word embedding is suggested, and it is classified as biased in the female or male direction respectively. A continuation of the example Python code applied to BERT that tests for bias using PCA is listed in Figure 3.

# Calculating the gender subspace
from sklearn.decomposition import PCA

gpairs = [("she", "he"),
          ("her", "his"),
          ("woman", "man"),
          ("herself", "himself"),
          ("daughter", "son"),
          ("mother", "father"),
          ("girl", "boy"),
          ("female", "male")]

gvectors = []
for (female_word, male_word) in gpairs:
    # Subtract the centroid of each pair from both of its words
    f_vec = np.array(feature_extraction(female_word)[0][0])
    m_vec = np.array(feature_extraction(male_word)[0][0])
    center = np.mean([f_vec, m_vec], axis=0)
    gvectors.append(f_vec - center)
    gvectors.append(m_vec - center)

# Taking the first/leading component - although you can choose to take the first two (or more)
pca = PCA(n_components=1)
# Fit the PCA on the centred pair vectors to define the gender subspace
pca.fit(np.array(gvectors))

# Centroids of the projected male and female words
centroid_male_projection = np.mean(pca.transform(np.array([feature_extraction(pair[1])[0][0] for pair in gpairs])), axis=0)
centroid_female_projection = np.mean(pca.transform(np.array([feature_extraction(pair[0])[0][0] for pair in gpairs])), axis=0)

for item in test_words:
    em_t = feature_extraction(item)[0][0]
    new_word_projection = pca.transform(np.array([em_t]))[0]
    DistFemaleMean = np.linalg.norm(centroid_female_projection - new_word_projection)
    DistMaleMean = np.linalg.norm(centroid_male_projection - new_word_projection)
    print("Probability ", item, " is male:", DistFemaleMean / (DistFemaleMean + DistMaleMean))
    print("Probability ", item, " is female:", DistMaleMean / (DistFemaleMean + DistMaleMean))
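To make the probability score (3) concrete, here is a small worked sketch for a one-dimensional gender subspace. The projection values are made up and the function name gender_probabilities is an illustrative assumption; only the formula itself is taken from (3).

```python
def gender_probabilities(new_word_projection, centroid_female_projection, centroid_male_projection):
    """Return (P(female), P(male)) for a word projected onto a 1-D gender subspace."""
    dist_female = abs(centroid_female_projection - new_word_projection)
    dist_male = abs(centroid_male_projection - new_word_projection)
    total = dist_female + dist_male
    if total == 0:
        raise ValueError("Both centroids coincide with the word projection.")
    # A word far from the male centroid (large dist_male) gets a high P(female).
    p_female = dist_male / total
    return p_female, 1 - p_female


# Invented projections: female cluster centred at -1, male cluster at +1.
p_female, p_male = gender_probabilities(
    new_word_projection=-0.5,
    centroid_female_projection=-1.0,
    centroid_male_projection=1.0,
)
print(p_female, p_male)
```

Here the word lies 0.5 from the female centroid and 1.5 from the male centroid, so P(female) = 1.5 / 2 = 0.75. A ‘stereotypical’ female word with this projection would therefore be classified as biased in the female direction.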

The tested NLP corpora and the corresponding model versions are listed below. Note that this analysis was conducted in January 2022.

NLP Corpora                       | Model Version
BERT CNN Daily Mail               | patrickvonplaten/bert2bert-cnn_dailymail-fp16
GloVe Common Crawl                | glove.840B.300d
GloVe Wikipedia 2014 + Gigaword 5 | glove.6B
Gensim Word2Vec                   | word2vec-google-news-300