Classifying Hate Speech: an overview

A brief look at label classification and hate speech

By Jacob Crabb, Sherry Yang, and Anna Zubova (May 28).

What is hate speech?

The challenge of wrangling hate speech is an ancient one, but the scale, personalization, and velocity of today’s hate speech make it a uniquely modern dilemma.

While there is no exact definition of hate speech, in general, it is speech that is intended not just to insult or mock, but to harass and cause lasting pain by attacking something uniquely dear to the target.

Hate speech has been especially prevalent in online forums, chatrooms, and social media.


Hatebase.org, a Canadian company that created a multilingual dictionary of words used in hate speech, uses the following criteria for identifying hate speech (source): it is addressed to a specific group of people (ethnicity, nationality, religion, sexuality, disability, or class), and there is malicious intent.

There is one main problem with hate speech that makes it hard to classify: subjectivity.

In the United States, apart from a few narrow exceptions to First Amendment protection, hate speech has no legal definition and is not punishable by law.

For this reason, what is and isn’t hate speech is open to interpretation.

A lot depends on the domain and the context, according to Aylin Caliskan, a computer science researcher at George Washington University (source). A seemingly neutral sentence can be offensive to one person and not bother another.

But since humans can’t always agree on what can be classified as hate speech, it is especially complicated to create a universal machine learning algorithm that would identify it.

Moreover, the datasets used to train models tend to “reflect the majority view of the people who collected or labeled the data”, according to Tommi Gröndahl of Aalto University, Finland (source).

One more complication is that it is hard to distinguish hate speech from merely offensive language, even for a human.

This becomes a problem especially when labeling is done by random users based on their own subjective judgment, as in this dataset, where users were asked to label tweets as “hate speech”, “offensive language”, or “neither”.

So when designing a model, it is important to follow criteria that will help to distinguish between hate speech and offensive language.

It is worth pointing out that Hatebase’s database can be very useful in creating hate speech detection algorithms.

It is a multilingual vocabulary in which each word is labeled from “mildly” to “extremely offensive”, depending on how likely it is to be used in hate speech.

While there are different opinions on whether hate speech should be restricted, some companies like Facebook, Twitter, Riot Games, and others decided to control and restrict it, using machine learning for its detection.

Model Sensitivity

One more problem with detecting hate speech is the sensitivity of the machine learning algorithms to text anomalies.

A model is “the output of the algorithm that trained with data” (source).

Machine learning algorithms, or models, can classify text for us but are sensitive to small changes, like removing the spaces between words that are hateful.

This change can drastically reduce the negative score a sentence receives (Source).

Learning models can be fooled into labeling their inputs incorrectly.
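To make this sensitivity concrete, here is a minimal sketch of a word-level scorer. The lexicon, its weights, and the `toxicity_score` function are all made up for illustration and do not come from any real moderation model. The sketch only shows why a system that looks up whole tokens loses track of a hateful word once the space next to it is removed.

```python
import re

# Hypothetical mini-lexicon with made-up toxicity weights (not from a real model).
TOXIC_WEIGHTS = {"trash": 0.6, "stupid": 0.7, "hate": 0.8}


def toxicity_score(text: str) -> float:
    """Sum the weights of known toxic tokens found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(TOXIC_WEIGHTS.get(token, 0.0) for token in tokens)


original = "you are stupid trash"
evasive = "you are stupidtrash"  # same message with one space removed

original_score = toxicity_score(original)
evasive_score = toxicity_score(evasive)

# The fused token "stupidtrash" is not in the lexicon, so it contributes nothing.
print("original:", original_score)
print("evasive: ", evasive_score)
```

Real classifiers are more sophisticated than a lookup table, but any model whose features are built from whole tokens can be affected in a similar way.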

A crucial challenge for machine learning algorithms is understanding the context.

The insidious nature of hate speech is that it can morph into many different shapes depending on the context.

Discriminatory ideas can be hidden in many benign words if the community comes to a consensus on word choice.

For example, the names of children’s toys and hashtags can be used as monikers for hateful ideas.

Static definitions that attribute meaning to one word in boolean logic don’t have the flexibility to adapt to changing nicknames for hate.

“What A.I. doesn’t pick up at this point is the context, and that’s what makes language hateful,” says Brittan Heller, an affiliate with the Berkman Klein Center for Internet and Society at Harvard University (Source).

Every company that lets users publish on its platform faces the challenge that the speech becomes associated with its brand.

Wired’s April issue describes how relentless growth at Facebook has raised a major question: “whether the company’s business model is even compatible with its stated mission [to bring the world closer together].” (Source) We may not be able to bring people together and make the world a better place by simply giving people more tools to share.

In the next section, we will look at a case study of how another company, Riot Games, faced the challenge of moderating hate speech.

A Case Study: League of Legends by Riot Games

For those who don’t know, League of Legends is a competitive game where two teams of five players each attempt to destroy the opposing team’s base.

On release in 2009, Riot Games was, of course, looking to create a fun environment where friendly competition could thrive.

This was the original goal.

League of Legends has a built-in text chat, and as the game’s popularity and user base grew, the competition intensified.

Players began to use the chat to gloat about their in-game performance or tear down the enemy team’s futile attempts to stop inevitable defeat.

This was still within the company’s goals.

Soon, however, it devolved until it was commonplace to see things such as: “your whole life is trash”, “kill yourself IRL”, or numerous other declarations of obscene things the victors would do to the losers.

This became so commonplace that the players gave it a name: Toxicity.

Players became desensitized to the point where even positive, upstanding players would act toxic without thought.

Riot Games saw that of their in-house player classifications (negative, neutral, and positive), 87% of the Toxicity came from neutral or positive players.

The disease was spreading (Source).

In 2011, Riot Games released an attempt at a solution called “The Tribunal” (Source).

The Tribunal was designed to work with another in-game feature called “Reporting” (Source).

At the end of a game, if you felt another player had been toxic, reporting was a way of sending those concerns to Riot Games for review.

Riot Games would then turn the report over to The Tribunal.

The Tribunal was a jury-based judgment system made up of volunteers.

Concerned players could sign up for The Tribunal, then view game reports and vote on a case-by-case basis whether someone had indeed acted toxic or not.

If your vote aligned with the majority, you would be granted a small amount of in-game currency and the offending toxic player would be given a small punishment, which would increase for repeat offenders.

Riot Games also introduced small bonuses for players who were non-toxic.

These combined efforts saw improvements in the player base, and Riot found that just one punishment from The Tribunal was enough to curb toxic behavior in most players.

This system had two main problems. First, it was slow and inefficient: manual review meant pulling chat logs out to The Tribunal website, waiting for enough players to respond, and only then deciding on a penalty. Second, it was at times wildly inaccurate, especially before Riot removed the reward per “successful” penalty, which had led to a strong built-in bias in the system (Source).

Riot closed down The Tribunal in 2014.

It had worked for a while, but toxicity was still a problem.

A Machine Learning Solution

After that, though, Riot Games took its approximately 100 million Tribunal reports (Source) and used them as training data to create a machine-learning system that detected questionable behaviors and offered a customized response to them (based on how players voted in the training data’s Tribunal cases).

While The Tribunal had been slow and inefficient, sometimes taking days or a week or two to hand down a judgment (long after a player had forgotten about their toxic behavior), this new system could analyze and dispense judgment in 15 minutes.

Players were seeing nearly immediate consequences to their actions (Source).

“As a result of these governance systems changing online cultural norms, incidences of homophobia, sexism and racism in League of Legends have fallen to a combined 2 percent of all games,” … “Verbal abuse has dropped by more than 40 percent, and 91.6 percent of negative players change their act and never commit another offense after just one reported penalty.” (Source)

Machine Learning methodology

Machine learning approaches have made a breakthrough in detecting hate speech on web platforms.

In this section, we will talk about some techniques that are traditionally used for this task as well as some new approaches.

Preprocessing Data

Natural language processing (NLP) involves converting human words into numbers and vectors the machine can understand: a way of working between the world of the human and the world of the machine.

Naturally, this requires quite a lot of data cleaning.

Typically, cleaning means tokenization, removing stop words, stemming, and applying Term Frequency-Inverse Document Frequency (TF-IDF) weighting, which gives more weight to distinctive words than to words like “the”, which are penalized for adding little meaning (Source). Lemmatization is a more computationally expensive alternative to stemming that reduces each word to its root form (Source).

Model Implementation

Once the data is clean, we can use several methods for classification.
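Before turning to the classifiers themselves, the cleaning steps described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the pipeline used by the authors. The stop-word list and the three example documents are made up, stemming and lemmatization are left out (in practice a library such as NLTK or spaCy would handle them), and the weighting uses the textbook unsmoothed formula, whereas libraries such as scikit-learn apply smoothed variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list and corpus (both made up for this sketch).
STOP_WORDS = {"the", "a", "is", "of", "for", "that", "it", "i"}

documents = [
    "I hate that kind of food",
    "I love the food for lunch",
    "The lunch is the best",
]


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


vectors = tf_idf(documents)
for doc, vec in zip(documents, vectors):
    print(doc, "->", {term: round(w, 3) for term, w in vec.items()})
```

In the first document, “hate” appears in only one of the three texts while “food” appears in two, so “hate” ends up with the larger weight.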

Common text classification tasks include “sentiment analysis, topic labeling, language detection, and intent detection.” (Source) Algorithms used for these tasks include Naive Bayes, bagging, boosting, and random forests.

Each method can have a recall, precision, accuracy, and F1 score attached to how well it classifies.
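As a reminder of what these four scores measure, here is a small sketch that computes them from scratch for a binary task, where 1 stands for “hate speech” and 0 for everything else. The function name and the example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up ground truth and predictions for eight comments.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Precision answers “of the comments flagged as hate speech, how many really were?”, while recall answers “of the real hate speech, how much did we flag?”. F1 is the harmonic mean of the two.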

Then we want to test these methods over and over.

As awesomely accurate as our artificial intelligence can be on the data it was trained on, it can be equally unreliable on test data.

We need to make sure our model is not overfit to our training data in a way that makes it excellent at classifying the data it has already seen but poor at accurately classifying future data.

Below are three further ways to deal with challenges in classifying text.

Multilabel Classification

The baseline multilabel classification approach, called the binary relevance method, amounts to independently training one binary classifier for each label (Source).

This approach treats each label independently from the rest.

For example, suppose you were trying to classify the sentence ‘I hate that kind of food, let’s not have it for lunch.’ with the labels lunch talk, love talk, and hate talk. Your classifier would go through the data three times, once for each label.

For the data and labels below, (after preprocessing) the binary relevance method will make individual predictions for each label.

(We will be using a Naive Bayes classifier, which is explained quite well here.)

    # The data-loading call is truncated in the source ("data = pd. ..."), so it is
    # not reproduced here; `data` holds the preprocessed comments and their labels.
    data.head()

    # using binary relevance
    from skmultilearn.problem_transform import BinaryRelevance
    from sklearn.naive_bayes import GaussianNB

    # initialize binary relevance multi-label classifier
    # with a gaussian naive bayes base classifier
    classifier = BinaryRelevance(GaussianNB())

    # train
    classifier.fit(x_train, y_train)

    # predict
    predictions = classifier.predict(x_test)

    # print the keywords derived from our text
    # along with the labels we assigned, and then the final predictions
    print(data.comment_text[1], '.', data.comment_text[3], '.', y_test, '.', predictions, '.')

Output:

    hate love kind food okay lunch
    food hate love get lunch
       lunch_talk  love_talk  hate_talk
    1           1          0          1
    3           1          1          1
      (0, 0)    1
      (1, 0)    1
      (1, 1)    1
      (0, 2)    1
      (1, 2)    1

So in this simple example, binary relevance predicted that the spot in the first row in the first column (0,0) was true for the label “lunch_talk”, which is the correct label based on our original inputs.

In fact, in this very simple example, binary relevance predicts our labels perfectly.

Here’s a link to this example on Github if you’d like to see the steps in more detail.

Or better yet, check out this blog on the subject, which has more detail, and a link to the Github page I used as a starting point.

Transfer Learning and Weak Supervision

One bottleneck in machine learning models is a lack of labeled data to train our algorithms for identifying hate speech.

Two solutions are transfer learning and weak supervision.

Transfer learning means reusing already existing models for new tasks, which is extremely helpful not only in situations where lack of labeled data is an issue, but also when there is a potential need for future relabeling.

The idea that a model can perform better if it does not learn from scratch but rather from another model designed to solve similar tasks is not new, but it wasn’t used much in NLP until Fastai’s ULMFiT came along (source).

ULMFiT is a method that starts from a language model pre-trained on a large corpus of Wikipedia articles and fine-tunes it for a specific task.

This fine-tuned model is later used to create a classifier.

This approach is impressively efficient: “with only 100 labeled examples (and giving it access to about 50,000 unlabeled examples), [it was possible] to achieve the same performance as training a model from scratch with 10,000 labeled examples” (source).

Another advantage is that this method can be used for languages other than English since the data used for the initial training was from Wikipedia pages available in many languages.

Some other transfer learning language models for NLP are: Transformer, Google’s BERT, Transformer-XL, OpenAI’s GPT-2, ELMo, Flair, StanfordNLP (source).

Another paradigm that can be applied when labeled data is scarce is weak supervision, where we use hand-written heuristic rules (“labeling functions”) to create “weak labels” that can be applied instead of labeling data by hand.

Within this paradigm, a generative model based on these weak labels is established first, and it is then used to train a discriminative model (source).
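The idea of labeling functions can be illustrated with a short, self-contained sketch. Everything in it is hypothetical: the placeholder lexicon terms, the three heuristics, and the example texts are invented for illustration. Note also that the votes are combined with a simple majority vote here, which is a deliberate simplification swapped in for the generative label model described above.

```python
from collections import Counter

HATE, NOT_HATE, ABSTAIN = 1, 0, -1

# Placeholder tokens standing in for entries from a real hate speech lexicon.
LEXICON_PLACEHOLDERS = {"slur_a", "slur_b"}


def lf_contains_lexicon_term(text):
    """Vote HATE if any lexicon term appears as a whole word."""
    return HATE if LEXICON_PLACEHOLDERS & set(text.lower().split()) else ABSTAIN


def lf_targets_group(text):
    """Vote HATE if the text addresses a group with the phrase 'you people'."""
    return HATE if "you people" in text.lower() else ABSTAIN


def lf_friendly_phrase(text):
    """Vote NOT_HATE if the text contains an obviously friendly phrase."""
    return NOT_HATE if "thank you" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_lexicon_term, lf_targets_group, lf_friendly_phrase]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    if list(counts.values()).count(top_count) > 1:  # tie between labels
        return ABSTAIN
    return top_label


examples = [
    "thank you for the lunch",
    "you people are all slur_a",
    "nice weather today",
]
for text in examples:
    print(weak_label(text), "<-", text)
```

Texts on which every function abstains simply stay unlabeled, which is why weak supervision is usually paired with a large pool of unlabeled data.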

An example of using these two approaches is presented in this article.

Abraham Starosta, a Master’s student in AI at Stanford University, shows how he used a combination of weak supervision and transfer learning to identify anti-semitic tweets.

He started with an unlabeled set of about 25,000 tweets and used Snorkel (a tool for weak supervision labeling) to create a training set by writing simple labeling functions.

Those functions were used to train a “weak” label model in order to classify this large dataset.

To apply transfer learning to this problem, he fine-tuned the ULMFiT language model by training it on generalized tweets.

He then trained this new model on the training set created with weak supervision labeling.

The results were quite impressive: the author was able to reach 95% precision and 39% recall (probability threshold of 0.63), whereas without the weak supervision technique the model reached only 10% recall at 90% precision.

The model also performed better than logistic regression from sklearn, XGBoost, and Feed Forward Neural Networks (source).

Voting Based Classifications

Voting-based methods are a form of ensemble learning for classification that helps balance out individual classifier weaknesses.

Ensemble methods combine individual classifiers, such as decision trees, through techniques such as bagging (or bootstrap aggregating) and boosting.

If we think about a linear regression or one line predicting our y values given x, we can see that the linear model would not be good at identifying non-linear clusters.

That’s where ensemble methods come in.

An “appropriate combination of an ensemble of such linear classifiers can learn any non-linear boundary.” (Source) Classifiers that each have unique decision boundaries can be used together.

The accuracy of the voting classifier is “generally higher than the individual classifiers.” (Source)

We can combine classifiers through majority voting (also known as naive voting), weighted voting, and maximum voting.

In majority voting, “the classification system follows a divide-and-conquer approach by dividing the data space into smaller and easier-to-learn partitions, where each classifier learns only one of the simpler partitions.” (Source) In weighted voting, we can count models that are more useful multiple times (Source). We can use these methods for text in foreign languages to get an effective classification (Source).

On a usability note, voting-based methods are good for optimizing classification but are not easily interpretable.
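The difference between majority and weighted voting can be shown with a minimal sketch. It assumes that each classifier has already produced a label for the same comment. The function names, the example labels, and the weights are made up for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    If two labels tie, the one that appeared first in `predictions` wins.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def majority_vote(predictions):
    """Majority voting is weighted voting with every classifier weighted equally."""
    return weighted_vote(predictions, [1.0] * len(predictions))


# Labels predicted for one comment by three hypothetical classifiers.
predictions = ["hate_speech", "neither", "neither"]

# Suppose the first classifier has proven more reliable, so it counts three times.
weights = [3.0, 1.0, 1.0]

print("majority:", majority_vote(predictions))
print("weighted:", weighted_vote(predictions, weights))
```

With equal weights the two “neither” votes win, while the weighted version lets the more trusted classifier override them. This also hints at why such ensembles are harder to interpret: the final label depends on how the votes are weighted as well as on the text.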

Awareness of how powerful machine learning can be should come with an understanding of how to address its limitations.

The artificial intelligence used to remove hate speech from our social spaces often lies in a black box.

However, with some exploration of natural language processing and text classification, we can begin to unpack what we can and cannot expect of our A.I.


We don’t need to be part of a tech giant to implement classifiers for good.

