The Definite Guide For Creating An Academic-Level Dataset With Industry Requirements And Constraints

A Review and Benchmark Evaluation article published recently in ACM Transactions on Management Information Systems, which benchmarks 28 top academic and commercial systems on 5 distinctive Twitter data sets, shows an overall average sentiment classification accuracy of 61% across systems and domains.

Maximum accuracy was measured at 76.99%, as seen in Figure 1.

In another study, using deep learning (Figure 2), we are seeing up to 87.5% classification accuracy.

These figures give us a relatively clear picture of the state of the art (SOTA), and show that using various algorithms on different domains will not produce the same results.

Figure 1: sentiment analysis results using 28 algorithms on 5 distinct datasets.

Figure 2: sentiment analysis results using machine and deep-learning algorithms using binary and 5-class sentiment-based datasets.

Criteria and Workflow

Going over relevant literature, it seemed like the important points for an academic-level annotation project were:

Per-post majority vote.

Simple definition and short task description.

Routinely measure annotator performance and agreement between annotators.

Sample appropriately and create the dataset.

Carefully choose an annotation solution.

Use a simple annotation interface to help annotators focus on a single, clear task.

Per-sample majority-vote

Deciding on a majority-vote-based project was an important first step. It is a critical point in any annotation project because it adds complexity to the project: spending more money on additional annotators, creating an infrastructure that supports multiple annotators, developing scripts that track their progress, and investing time in daily analysis to make sure that each annotator is on track.

We don’t want a single annotator to influence, add bias or introduce noise into our new dataset.

For example, annotators who share the same room can influence each other if they re-interpret the instructions while talking with each other.

An annotator can also wake up one day and completely change their methodology. It is therefore advisable to have a monitoring mechanism and to fine-tune their understanding when that happens.
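A minimal sketch of what a per-sample majority vote could look like in Python. The strict-majority rule, the `majority_vote` helper and the example votes are assumptions made for illustration; how ties are resolved is not specified here, so this sketch simply leaves tied posts unlabeled.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by a strict majority of annotators, or None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Require more than half of the votes; otherwise there is no majority.
    return label if count > len(labels) / 2 else None


# Hypothetical votes from five annotators for three posts.
votes = {
    "post_1": ["negative", "negative", "neutral", "negative", "neutral"],
    "post_2": ["positive", "neutral", "negative", "neutral", "positive"],
    "post_3": ["neutral"] * 5,
}
final_labels = {post: majority_vote(v) for post, v in votes.items()}
```

Here `post_2` has no strict majority, so it would be left out of the dataset or sent back for more votes, depending on the policy you choose.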

Simple definition and short task description

We also wanted to replace our 4-page instruction manual.

The idea was to decide on a new city-related sentiment definition that focused on our clients’ needs and to create a very short task description, i.e., a single-line definition for the new sentiment and less than half a page of instructions.

The final definition was “an indication of resident satisfaction (positive) or dissatisfaction (negative) with anything happening in the city”, as can be seen in Figure 3.

Figure 3: the new sentiment definition and annotator instructions.

As you can see, the definition is very clear, and the instructions are short.

The following examples show that the neutral example might be considered negative in terms of classical sentiment. According to the new definition, however, it is neutral, because the writer doesn’t explicitly say whether they are satisfied with the issue they are writing about.

Please note that the negative and positive examples clearly follow the new definition.

Neutral: A Bexar County grand jury decided Tuesday not to indict the ex-San Antonio police officer accused of beating his then-girlfriend with a rock outside her apartment.

Negative: Not cool. Package thieves are already on Santa’s naughty list. This guy took a package off the porch from a home in the 9999 block of X lane. If you recognize him, please contact Det. Blanco at 999–999–99999 or email her at

Positive: The Arlington Police Department is proud to employ many current and former members of the United States Military. We are thankful for their service, and we are honored they chose to work for our department.

Measuring performance and agreement (or disagreement)

During the project, we needed to measure several metrics in order to understand and control various aspects of the annotation process, such as:

Self-agreement, which helps us identify low-quality annotators. We do this by inserting repeating posts every K samples.

Inter-agreement, which is a good estimator of the objective difficulty of the task. For more than two annotators we use Fleiss’ kappa, and for two annotators we use Cohen’s kappa. For us, a value higher than 0.55 was considered a good estimate.

Percentage of agreement between all annotators. It is another estimator for inter-agreement. For us, a value of 0.65 was highly correlated with Fleiss’ kappa and was considered a good estimate for inter-agreement.

Ground-truth validation. We inserted samples that were internally labeled by us, using the same majority-vote process, into the dataset. Keep in mind that because we created the new definition, we are the experts. Therefore, if we measure each annotator against a small, well-labeled dataset, we can see whether the annotators agree with us.
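To make the two inter-agreement metrics concrete, here is a small pure-Python sketch of Fleiss’ kappa and of percentage agreement. Treating "percentage of agreement" as the mean share of agreeing annotator pairs per post is an assumption on my part, and the function names and example ratings are made up for illustration.

```python
from collections import Counter


def pairwise_agreement(ratings):
    """Mean share of annotator pairs that agree, averaged over posts."""
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every post needs the same number (>= 2) of labels")
    pairs = n_raters * (n_raters - 1)
    per_post = [
        (sum(c * c for c in Counter(r).values()) - n_raters) / pairs
        for r in ratings
    ]
    return sum(per_post) / len(per_post)


def fleiss_kappa(ratings):
    """Fleiss' kappa: observed agreement corrected for chance agreement."""
    observed = pairwise_agreement(ratings)
    totals = Counter(label for r in ratings for label in r)
    n_labels = sum(totals.values())
    expected = sum((count / n_labels) ** 2 for count in totals.values())
    if expected == 1.0:
        return 1.0  # only one label was ever used: treat as full agreement
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from three annotators on four posts.
ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neu", "neu", "neu"],
]
agreement = pairwise_agreement(ratings)  # 5/6, about 0.833
kappa = fleiss_kappa(ratings)            # 35/47, about 0.745
```

On this toy example both values sit above the 0.65 and 0.55 levels mentioned above.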

Please note that there are many other metrics that one can create and measure, but these were enough for us.

They helped us understand on a daily basis, per annotator, whether the newly labeled data could be used and whether we needed to fine-tune our annotators’ understanding of the new labeling scheme.
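The per-annotator checks can be sketched with a single helper, since self-agreement and ground-truth accuracy are both "share of matching labels over a set of posts". The `agreement_rate` function, the post ids and the labels below are hypothetical, and this is only one simple way to compute these numbers.

```python
def agreement_rate(labels_a, labels_b):
    """Share of posts, among those present in both dicts, with equal labels."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        return None  # nothing to compare
    return sum(labels_a[p] == labels_b[p] for p in shared) / len(shared)


# Self-agreement: the same annotator sees repeated posts twice.
first_pass = {"p1": "neg", "p2": "neu", "p3": "pos", "p4": "neu"}
second_pass = {"p1": "neg", "p2": "neg", "p3": "pos", "p4": "neu"}
self_agreement = agreement_rate(first_pass, second_pass)  # 0.75

# Ground-truth validation: compare against internally labeled posts.
ground_truth = {"g1": "pos", "g2": "neg"}
annotator_labels = {"g1": "pos", "g2": "neu", "p9": "neu"}
ground_truth_accuracy = agreement_rate(annotator_labels, ground_truth)  # 0.5
```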

You can use the following package to measure disagreement among annotators.

Sampling

Now that we know what to measure, we need to create our dataset. This is another point where many decisions contribute to the quality of the final dataset.

Several questions arise:

Which data to sample?

Which sampling distribution to follow, and when?

Which sampling algorithm to use?

How many samples to take?

In our case we decided to sample human-approved data that was already labeled. Naturally, we didn’t use the labels themselves.

We understood that the best option was to follow a distribution for each use case.

The following are several guidelines that worked for us. They should provide a good start for your own data:

1. Ground-truth samples should be sampled uniformly, per class. Imagine having only a few samples in one of the classes, i.e., you starved one class: with such a small sample you will never know for certain how well your annotators perform in that class.

2. Your main dataset should be sampled according to how your production data is distributed, which will allow your model to learn the same distribution.

Keep in mind that these rules should be changed if you feel they do not meet your requirements or do not fit your data or needs.
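The two guidelines can be sketched as follows. This is an illustration under assumptions: the ground-truth pool is taken to be a list of `(text, label)` pairs that were already labeled internally, and a plain random sample is used for the main dataset because it preserves the production distribution in expectation. The function names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def sample_uniform_per_class(labeled_pool, per_class, seed=0):
    """Ground truth: draw the same number of posts from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in labeled_pool:
        by_class[label].append(text)
    return {
        label: rng.sample(texts, min(per_class, len(texts)))
        for label, texts in by_class.items()
    }


def sample_like_production(production_posts, size, seed=0):
    """Main dataset: a simple random sample keeps production proportions."""
    rng = random.Random(seed)
    return rng.sample(production_posts, min(size, len(production_posts)))


# Toy pool that is heavily skewed towards the neutral class.
pool = (
    [(f"neutral post {i}", "neutral") for i in range(90)]
    + [(f"negative post {i}", "negative") for i in range(8)]
    + [(f"positive post {i}", "positive") for i in range(2)]
)
ground_truth_sample = sample_uniform_per_class(pool, per_class=2)
main_sample = sample_like_production([text for text, _ in pool], size=50)
```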

Creating the dataset

Now that we know which data we are sampling and which distribution we should follow, it’s time to decide how we will do the sampling.

You can probably think of many methods, whether simple or more complex, but the basic ones we tried were:

1. Using keywords, defined by someone who understands the domain.

2. Using an algorithm. It may be an off-the-shelf algorithm, a quick-and-dirty algorithm that you can easily create, or, alternatively, some kind of transfer learning from a dataset close to your domain. In our case we adopted a dataset of user reviews that is somewhat similar to our domain. We chose reviews that were starred as most positive and most negative and used them to train a sample-selection classifier.

Both methods worked well.

However, in our case, it seems that the keywords approach yielded better results.
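As a rough illustration of the keywords approach, a candidate filter can be as simple as a case-insensitive whole-phrase match. The keyword list below is invented for this example (a domain expert would define the real one), and `build_keyword_filter` is a hypothetical helper.

```python
import re


def build_keyword_filter(keywords):
    """Return a function that tells whether a post contains any keyword."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return lambda text: bool(pattern.search(text))


# Hypothetical keywords; not the list used in the project.
matches = build_keyword_filter(["not cool", "thankful", "proud"])

posts = [
    "Not cool. Package thieves are already on Santa's naughty list.",
    "We are thankful for their service.",
    "A Bexar County grand jury decided Tuesday not to indict the officer.",
]
candidates = [post for post in posts if matches(post)]
```

Only the first two posts are selected as candidates for annotation; the third contains none of the keywords.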

Now it’s time to decide on the number of samples that we need for our project.

Obviously, the more we have, the better our model.

I found a nice calculator, as seen in Table 1, originally created for surveys, that helped me estimate the number of samples needed for a known population size and the confidence level that you want to keep.

In our case, it seemed that for an estimated population size of 100,000 unique writers with a confidence level of 95%, around 10,000 samples were good enough to start with.
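Survey calculators of this kind are commonly based on Cochran’s formula with a finite-population correction. Whether the calculator in Table 1 uses exactly this formula is an assumption, and the margin of error is not stated above, so the values below are only a sketch.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin_of_error=0.05,
                proportion=0.5):
    """Cochran's sample size with a finite-population correction."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinite population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink it for a known, finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population))


n_at_5_percent = sample_size(100_000, 0.95, 0.05)  # 383
n_at_1_percent = sample_size(100_000, 0.95, 0.01)  # 8763
```

Under these assumptions, a 1% margin of error for 100,000 writers lands in the same range as the roughly 10,000 samples mentioned above.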

The number of samples directly affects our budget and choice of algorithm.

Let’s say we have 5 annotators and we need 10,000 samples from each: we then need a total of 50,000 annotated samples.

In some services, as we’ll see later, this amounts to a lot of money.

Luckily, my company was willing to invest money, time and effort into building several of these datasets.

We understand that high-quality annotated datasets contribute directly to better and more accurate models and ultimately to the value we give our clients.

The number of samples also affects our choice of algorithms. In our case we tried classic machine-learning and deep-learning algorithms.

However, it seemed that our dataset was still small for all deep-learning algorithms that we tried.

Table 1: sample size calculation for a set population size based on a confidence level.

Annotation Solutions

There are many annotation solutions to choose from. Some are cheaper than others, and some will give you a complete solution while others will just give you the infrastructure or the crowd.

However, each has its own pros and cons.

First, I’ll list a few known ones and then we’ll dive into a few notable ones.

You can choose one of the following services:

1. Amazon’s Mechanical Turk (AMT)

2. Figure-Eight (Crowd Flower)

3. io

4. Crowdsource / Workforce / Onespace

5. Jobboy

6. Samasource

7.

Additionally, you can choose one of the following tools:

1. Prodigy

2. Brat

The following (Table 2) represents my initial calculations for a total of 50,000 annotations.

I chose to compare AMT, premium and outsourcing companies as they represent the services and tools that were available to us.

Table 2: various features for each annotation solution.

As you can see, this is another good example of how many options we have.

In general terms AMT is the cheapest option. You can use the calculator to figure out how much it will cost you, but bear in mind that you may not get picked up by any annotators if you pay them the basic 1 cent per task.

In terms of reliability, I was told by many researchers that it’s a roulette, even if you pay a premium.

Nothing is guaranteed because anyone can work for AMT.

It’s worth noting that the documentation is confusing and unclear, and if you want to use an external tool you would need a data engineer to help you.

Finally, the experiment is fully managed and the control you have on the process depends on you.

On the other hand, premium services manage nearly everything in collaboration with you.

Some claim that they have experienced annotators, and they maintain “trust” by promising over 80% ground-truth confidence.

You provide them with your task description and they train the annotators until you are satisfied with the results.

The best of both worlds is probably outsourcing to an external annotation company. They have full-time employees that you can communicate with on a regular basis.

They have redundancy plans, i.e., if an employee is sick, someone else will take their place.

However, you have to sign contracts and commit to hiring them for a certain period and once you start, you can’t pause.

You also have to develop or provide a solution for the annotation if you want to have optimal results.

Annotation Tools:

I found several off-the-shelf annotation tools, but the most promising ones were Brat and Prodigy.

Brat is open source and seemed to be sparsely maintained. It mostly fits tasks similar to POS or NER.

Prodigy, on the other hand, is relatively cheap ($390) for a single commercial user with infinite annotators and a lifetime license, and it works on web and smartphone.

We chose Prodigy because it was simple to use, highly configurable and it worked in a “factory” mode, focusing on a single task and supercharging speed with keyboard shortcuts.

It’s worth noting that Prodigy supports single-user active learning. However, if you plan on doing multi-user active learning you will have to synchronize between all annotators, and that feature doesn’t exist yet.

Additionally, the concept of a ‘user’ doesn’t exist. Therefore, support for managing a team of annotators and saving their work to a single database needs a data engineer.

Prodigy also lacks performance monitoring and the documentation is wanting.

However, there is active support on their forums and that is a huge plus.

Figure 4 shows Prodigy’s interface. It is simple and easy to work with, and allows annotators to focus on their task with minimal interruptions, i.e., “factory-style”.

Figure 4: Prodigy’s user interface.

Results:

Our project yielded 14,000 majority-voted samples, which means that our annotators labeled around 75K posts, exceeding our expectation of 10,000 posts per person.

In Figure 5, we show the correlation between inter-agreement and Fleiss’ kappa, each with its own threshold.

The threshold helped determine whether the daily data could be trusted and whether we should include it in the final dataset.

In Figure 6, we see a sensible correlation between the ground-truth metric and the amount of work the annotators did on a daily basis: the less they did, the higher the ground-truth accuracy was.

We can also see the team’s average ground-truth accuracy over time in Figure 7.

We chose a threshold of 70% agreement to account for the various reasons that the ground-truth accuracy will not be a perfect 100%. One of the main reasons is that some ground-truth posts had a majority that relied on a single vote, i.e., 2:3 vs 3:2, and may be a mistake in the ground truth.

However, that is the process and that is why we have thresholds.
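A daily gate built from these thresholds could look like the sketch below. The threshold values come from the text (0.55 for kappa, 0.65 for percentage agreement, 70% for ground truth), but requiring all three at once, the `DailyMetrics` structure and the example days are assumptions for illustration.

```python
from dataclasses import dataclass

KAPPA_MIN = 0.55         # Fleiss' kappa: "higher than 0.55"
AGREEMENT_MIN = 0.65     # percentage of agreement
GROUND_TRUTH_MIN = 0.70  # ground-truth accuracy


@dataclass
class DailyMetrics:
    day: str
    kappa: float
    agreement: float
    ground_truth_accuracy: float


def keep_day(metrics):
    """Keep a day's annotations only if every metric clears its threshold."""
    return (
        metrics.kappa > KAPPA_MIN
        and metrics.agreement >= AGREEMENT_MIN
        and metrics.ground_truth_accuracy >= GROUND_TRUTH_MIN
    )


# Hypothetical daily measurements.
days = [
    DailyMetrics("day_1", kappa=0.61, agreement=0.72, ground_truth_accuracy=0.81),
    DailyMetrics("day_2", kappa=0.48, agreement=0.66, ground_truth_accuracy=0.75),
    DailyMetrics("day_3", kappa=0.58, agreement=0.70, ground_truth_accuracy=0.64),
]
trusted_days = [d.day for d in days if keep_day(d)]
```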

Figure 5: a correlation between inter-agreement and Fleiss’ kappa, showing the threshold we chose for keeping the daily annotated data.

Figure 6: a comparison of sample count vs ground-truth accuracy.

Figure 7: average annotator accuracy over time.

Using the final dataset, we created features using various methods such as TF-IDF, LSA, W2V, Deep-Moji, and ELMo, and augmented the data with methods such as back translation.

We trained quite a few classification algorithms such as Calibrated SGD, ULMFiT, Random Forest, XGBoost, LightBoost, CatBoost, LSTM and 1DCNN, and also tried stacking several of them.

Our highest-scoring cross-validated model achieved an accuracy of 89% using TF-IDF and Calibrated SGD (SVM), with the metrics shown in Table 3.

Table 3: precision and distribution figures for each class.

Data Analysis:

We tested the model on 245,469 samples from 22 American cities.

The negative sentiment dropped by 6%, the positive sentiment dropped by 19%, and the neutral sentiment increased by 25%, as seen in Figure 8.

We expected these figures to change according to the new definition. We effectively cleaned the negative and positive classes of unrelated samples, such as ‘alert’-like posts, accidents, traffic, emergencies, and others, which now belong to the Neutral class unless user dissatisfaction appeared in the post.
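The shifts above sum to zero, so they read most naturally as changes in each class’s share of all posts, in percentage points. A comparison of that kind can be sketched as follows; the helper names and the toy labels are hypothetical and are not the project’s data.

```python
from collections import Counter


def class_shares(labels):
    """Share of each class, as a percentage of all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}


def shift_in_points(old_labels, new_labels):
    """Percentage-point change per class between two labelings."""
    old, new = class_shares(old_labels), class_shares(new_labels)
    return {
        label: new.get(label, 0.0) - old.get(label, 0.0)
        for label in sorted(set(old) | set(new))
    }


# Toy labels for the same ten posts under the old and new definitions.
old_labels = ["negative"] * 5 + ["positive"] * 3 + ["neutral"] * 2
new_labels = ["negative"] * 4 + ["positive"] * 2 + ["neutral"] * 4
shifts = shift_in_points(old_labels, new_labels)
```

In this toy case the negative and positive shares each drop by 10 points and the neutral share rises by 20.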

Figure 8: distribution statistics for the old sentiment vs the new sentiment.

Looking back, the process worked well: we found a workflow for creating quality datasets that worked for us.

Please note that the model we trained is only valid for a single point in time and must be retrained periodically.

For this, I recommend using multi-annotator active-learning with the same majority-vote process.

However, our journey is not finished yet. We still need to analyze and validate the model in a production-ready environment and make sure it aligns with our clients’ expectations.

I would like to thank my fellow co-workers from Zencity, who were a crucial and integral part of this project: Samuel Jefroykin (research), Polina Sklyarevsky (new definition & data), Alon Nisser (engineering), Yoav Talmi, Eyal Feder-Levy, Anat Rapoport, Ido Ivry and Gali Kotzer.

I would also like to thank several friends and colleagues whom I consulted with prior to starting the project: Dr. Orgad Keller, Dr. Eyal Shnarch, Dr. Hila Zarosim, and Netanel Davidovits.


Ori Cohen has a Ph.D. in Computer Science with a focus on machine learning. He leads the research team at Zencity.io, trying to positively influence citizen lives.

