Algorithmic Bias

Algorithmic Bias. It’s a hot topic in any data talk, but few think about the role humans play in bringing it to life.

A lot of bias emerges when we leap from descriptive analysis to prescriptive analysis.

Descriptive analysis has been documented in statistics for a long time and covers things like means, variances, and thresholds. Its biases are also largely known, for example the difference between a sample mean and the population mean. This new conversation about bias mainly arises from the recent introduction of prescriptive analysis: getting the data to tell you what to do in the future, not just what has happened. Data describes multiple things: the reality itself, the way the data was collected, and who it was collected by, amongst others. All of this is useful for finding out what happened, but we might not want all of it to influence our decisions.
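As a concrete (and entirely made-up) illustration of that well-documented descriptive gap, here’s a minimal Python sketch of a small sample’s mean drifting away from the population mean. The distribution and numbers are assumptions for illustration only.

```python
# A minimal, illustrative sketch (all numbers are made up): the mean of a
# small sample will generally differ from the population mean it estimates.
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is the full population we care about.
population = rng.normal(loc=50, scale=10, size=100_000)

# A small sample drawn from it, as we'd usually have in practice.
sample = rng.choice(population, size=30, replace=False)

print(f"population mean: {population.mean():.2f}")
print(f"sample mean:     {sample.mean():.2f}")  # close, but not identical
```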

One of my favourite advancements in NLP, and one that provides quite a clear example of prescriptive bias, is Word Embeddings. They essentially assign a high-dimensional vector of numbers to each word in such a way that the vector relates to the meaning of the word. In short, they let you do maths on words. With these embeddings we can create an algorithm to answer logical analogies: “Man is to King as Woman is to <?>”, which would be filled in with Queen. All good, right? What if we try “Man is to Professor as Woman is to <Associate Professor>”? Oh. That’s no good, we must have a biased algorithm. Let’s scrap the project.
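If you want to see what that analogy arithmetic looks like in practice, here’s a minimal sketch using the gensim library and its downloadable GloVe vectors. The library and model are my own illustrative choices, not necessarily the setup behind the examples in this post.

```python
# A minimal sketch of the analogy task with pre-trained word embeddings.
# The library (gensim) and model (GloVe) are illustrative assumptions.
import gensim.downloader as api

# Download and load pre-trained 100-dimensional GloVe vectors.
vectors = api.load("glove-wiki-gigaword-100")

# "Man is to King as Woman is to <?>", computed roughly as
# vector("king") - vector("man") + vector("woman").
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns 'queen' as the top match.

# Note: most_similar leaves the query words out of its results,
# a detail that matters later in this post.
```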

Looking into the ‘why’ a bit more, we can guess that the data it has learnt this association from associates men with professors more strongly than women, perhaps the Wikipedia pages of famous scientists? Unfortunately, this is most likely an accurate representation of academic posts. It’s fine for us to use this for descriptive analysis: it’s an accurate bias. But it’s not all right to use it for prescriptive analysis, e.g. trying to decide who should get the new post (looking at you, Amazon CV screener).

Actually, I don’t know if the data has this skewed association. The process I used to fill in the analogy prevents you from getting the same word back out, as “Man is to King as Woman is to <King>” doesn’t quite have the same ring to it. I introduced this bias in how the answers were extracted from the data. A human error.
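One way to check whether the skew really comes from the data or from the extraction step is to rank candidates without excluding the input words. This is a hedged sketch reusing the vectors loaded above; since similar_by_vector takes a raw vector, it has no query words to filter out.

```python
# A sketch of separating "bias in the data" from "bias in the extraction".
# Reuses the `vectors` object loaded in the earlier snippet.
query = vectors["professor"] - vectors["man"] + vectors["woman"]

# similar_by_vector ranks every word against the raw query vector,
# so "professor" itself is allowed to appear in the results.
print(vectors.similar_by_vector(query, topn=3))

# If "professor" tops this list, the <Associate Professor> answer was an
# artefact of filtering out the input word, not a gendered association
# learnt from the data.
```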

The moral of the blog post, because every blog post seems to need one? If we don’t understand what we’re doing we will make mistakes, mistakes which are now less transparent than ever and usually end up being blamed on “biased data” or “biased algorithms”. Maths and code aren’t biased by themselves, but we can use them to say biased things; the same goes for words: no word is biased on its own, but we can all think up some very biased sentences.