Another issue was that the algorithm learned to conflate toxic comments with nontoxic comments containing words related to gender, sexual orientation, religion, or disability. For example, one user reported that simple, neutral sentences such as “I am a gay black woman” or “I am a woman who is deaf” received high toxicity scores, while “I am a man” received a low score.

Following these concerns, the Conversation AI team invited developers to train their own toxicity-detection algorithms and enter them into three competitions (one per year) hosted on Kaggle, a Google subsidiary known for its community of machine-learning practitioners, public data sets, and challenges. To help train the models, Conversation AI released two public data sets containing more than one million toxic and nontoxic comments from Wikipedia and from a service called Civil Comments.

Annotators rated each comment for toxicity: a “Very Toxic” label indicated “a very hateful, aggressive, or disrespectful comment that is very likely to make you leave a discussion or give up on sharing your perspective,” while a “Toxic” label meant “a rude, disrespectful, or unreasonable comment that is somewhat likely to make you leave a discussion or give up on sharing your perspective.” Some comments were seen by many more than 10 annotators (up to thousands), owing to sampling and to strategies used to enforce rater accuracy.
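Concretely, each released comment ends up paired with ratings from multiple annotators, which are aggregated into a per-comment toxicity score. The toy sketch below shows one way such an aggregation can work in pandas; the column names, example comments, and 0.5 threshold are our own illustrative assumptions, not the exact schema of the released data sets.

```python
# Toy sketch: aggregate per-annotator ratings into a per-comment toxicity score.
# Column names and the 0.5 threshold are illustrative assumptions.
import pandas as pd

# Hypothetical per-annotator ratings: 1 = rated toxic/very toxic, 0 = not toxic.
ratings = pd.DataFrame({
    "comment_id": [1, 1, 1, 2, 2, 2],
    "comment_text": ["you are an idiot"] * 3 + ["thanks for the helpful edit"] * 3,
    "is_toxic": [1, 1, 0, 0, 0, 0],
})

# The toxicity score is the fraction of annotators who flagged the comment.
scores = (
    ratings.groupby(["comment_id", "comment_text"])["is_toxic"]
    .mean()
    .rename("toxicity")
    .reset_index()
)

# Binarize with a simple majority threshold for training a classifier.
scores["toxic_label"] = (scores["toxicity"] >= 0.5).astype(int)
print(scores)
```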

The goal of the first Jigsaw challenge was to build a multilabel toxic-comment classification model with six labels: “toxic”, “severe toxic”, “threat”, “insult”, “obscene”, and “identity hate”. The second and third challenges focused on more specific limitations of their API: minimizing unintended bias toward predefined identity groups, and training multilingual models using English-only training data.
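To make the multilabel setup concrete, here is a small baseline sketch that trains one independent binary classifier per label on TF-IDF features with scikit-learn. It is not one of the competition solutions; it assumes the first challenge’s train.csv, with a comment_text column and one binary column per label, has been downloaded from Kaggle.

```python
# A minimal multilabel baseline: one logistic-regression classifier per label
# on top of shared TF-IDF features. Illustrative only -- not a Kaggle solution.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

train = pd.read_csv("train.csv")  # assumed local copy of the challenge data

vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
X = vectorizer.fit_transform(train["comment_text"])
Y = train[LABELS].values  # one binary column per label

# One-vs-rest fits an independent classifier per label, so a single comment
# can be flagged as, say, both "toxic" and "insult" at the same time.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)

# Per-label probabilities for a new comment.
probs = clf.predict_proba(vectorizer.transform(["what a ridiculous, idiotic take"]))
print(dict(zip(LABELS, probs[0].round(3))))
```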

Although the challenges led to some clever ways of improving toxic-language models, our team at Unitary, a content-moderation AI company, found that none of the trained models had been released publicly.

For that reason, we decided to take inspiration from the best Kaggle solutions and train our own algorithms with the specific intent of releasing them publicly. To do so, we relied on existing “transformer” models for natural language processing, such as Google’s BERT. Many such models are available in the open-source Transformers library from Hugging Face.
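As a rough illustration of what that looks like in practice, the sketch below loads a pretrained BERT checkpoint with a multi-label classification head through the Transformers library. The checkpoint name, label set, and inference step are illustrative assumptions, not the exact code behind our models.

```python
# Sketch: a pretrained BERT checkpoint wrapped with a multi-label
# classification head via Hugging Face's transformers library.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid/BCE head, one output per label
)

# Before fine-tuning, the classification head is randomly initialized, so these
# scores are meaningless; after training on the Jigsaw data they become
# per-label toxicity probabilities.
inputs = tokenizer("I am a gay black woman", return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits).squeeze(0)
print(dict(zip(LABELS, probs.tolist())))
```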