The importance of annotating the right dataset

High-quality data is the foundation of any successful machine learning project. Annotating that data is also a time-consuming task, so it pays off to be sure you have the right dataset before the annotation process starts.

We all know how frustrating it is when a model doesn't work as expected. We have seen this happen many times with our clients over the last few months, and in each case the problem appeared at the very beginning of the process, even before annotation: they did not have the right dataset!

So, to help improve the accuracy of our clients' models across the board, we have added some brand-new features that focus specifically on the quality of the dataset.

We are happy to present the Data Quality features for datasets!



Now you can run data quality checks on your dataset. For each check, you will see the results and an evaluation: Satisfactory, Good or Unsatisfactory. Satisfactory means the dataset is good enough for its intended use. Good means there are some errors in the content that might hinder the model training process. Unsatisfactory means we recommend reviewing the dataset before annotating it.

This evaluation is informative only, because it depends on the objective of your model. For example, if you want to train a machine learning algorithm to process tweets, grammatical errors will not pose much of a problem, since they can be ignored as irrelevant noise in that particular case. However, if your dataset contains tweets longer than 280 characters, that could be a real problem for your training process.

What kind of data quality checks can you run today on the M47AI platform?

Grammatical errors: Grammatical errors should be avoided as much as possible, as they can cause major problems for machine learning models. For example, if a dataset contains lots of spelling or grammar mistakes, the resulting model will fail badly when applied to another set with cleaner data. We need good-quality input while still preserving natural writing.
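As a rough illustration of this kind of check (a minimal sketch using the pyspellchecker package, not the implementation behind the M47AI platform), you can estimate how many tokens a spell checker fails to recognise:

```python
# Illustrative sketch only, not the M47AI implementation.
# Requires: pip install pyspellchecker
from spellchecker import SpellChecker

def misspelling_rate(sentences):
    """Return the fraction of tokens the spell checker does not recognise."""
    spell = SpellChecker(language="en")
    tokens = [w.strip(".,!?;:\"'").lower()
              for s in sentences for w in s.split() if w.strip(".,!?;:\"'")]
    if not tokens:
        return 0.0
    unknown = spell.unknown(tokens)
    return sum(1 for t in tokens if t in unknown) / len(tokens)

sample = ["The quick brown fox jumps over the lazy dog.",
          "Ths sentnce has sevral speling erors."]
print(f"misspelling rate: {misspelling_rate(sample):.0%}")
```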

Lexical diversity: Lexical diversity is an important measure for the success of any machine learning project. If it is too low, the dataset is made up of phrases with very repetitive vocabulary; if it is too high, the dataset is full of synonyms, which makes training models more difficult (and maybe even impossible).
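A common way to quantify lexical diversity is the type-token ratio, the number of unique words divided by the total number of words. The sketch below is illustrative only and is not necessarily the metric used on the platform:

```python
# Minimal type-token ratio sketch, assuming simple word tokenisation.
import re

def type_token_ratio(text: str) -> float:
    """Unique words divided by total words (closer to 1 means less repetition)."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

low_diversity = "the cat sat on the mat the cat sat on the mat"
high_diversity = "every sentence introduces entirely different vocabulary today"
print(type_token_ratio(low_diversity))   # low: many repeated words
print(type_token_ratio(high_diversity))  # high: little repetition
```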

Unknown words: Evaluate the detected words and decide whether you need to fix them. Not all unknown words are a problem: a word from another language or a proper name, for example, will be detected as an unknown word even though it is not necessarily an error.
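For illustration, the sketch below lists out-of-vocabulary tokens for manual review, using pyspellchecker's word list as a stand-in vocabulary (not the platform's detector):

```python
# Illustrative sketch: list out-of-vocabulary words for manual review.
# Requires: pip install pyspellchecker
from spellchecker import SpellChecker

def unknown_words(sentences, language="en"):
    """Return the set of tokens not found in the spell checker's word list."""
    spell = SpellChecker(language=language)
    tokens = {w.strip(".,!?;:\"'").lower()
              for s in sentences for w in s.split() if w.strip(".,!?;:\"'")}
    return spell.unknown(tokens)

sample = ["Yesterday I met François in the old town.",
          "The sistem crashed twice."]
# 'françois' may be flagged even though it is a proper name, not an error;
# 'sistem' is a genuine typo worth fixing.
print(unknown_words(sample))
```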

Sentence length: With this feature you can detect outliers caused by errors when the dataset was created, and also evaluate lengths against the type of model you want to train. Sentence length matters for translation tasks because it affects the complexity of the sentences. For sentiment analysis, on the other hand, a mix of different lengths provides varied context and can actually be beneficial during training.
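One simple way to spot such outliers (a sketch, not the platform's method) is to flag sentences whose word count is far from the average length of the dataset:

```python
# Sketch: flag sentence-length outliers with a simple z-score rule.
import statistics

def length_outliers(sentences, z_threshold=2.0):
    """Return sentences whose word count is more than z_threshold standard
    deviations away from the mean length."""
    lengths = [len(s.split()) for s in sentences]
    mean = statistics.mean(lengths)
    stdev = statistics.stdev(lengths) if len(lengths) > 1 else 0.0
    if stdev == 0:
        return []
    return [s for s, n in zip(sentences, lengths)
            if abs(n - mean) / stdev > z_threshold]

# Seven ordinary sentences plus one suspiciously long one.
corpus = ["I liked the film a lot."] * 7 + [
    "This sentence keeps going and going, piling clause on clause until it is "
    "far longer than anything else in the dataset, which usually points to a "
    "segmentation error or to records that were concatenated by mistake."]
print(length_outliers(corpus))
```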

Relevant terms. Term Frequency - Inverse Document Frequency (TF-IDF): The key to getting the most relevant data for training your model is making sure the input terms represent what is actually in the dataset. Otherwise, it will just be a bunch of words with no meaning, which cannot help the model find patterns.
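As a minimal illustration with scikit-learn (not the platform's pipeline), TF-IDF scores can surface the terms that best characterise each document:

```python
# Minimal TF-IDF sketch with scikit-learn, illustrative only.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The delivery was late and the package arrived damaged.",
    "Great delivery service, the package arrived early and intact.",
    "Customer support never answered my emails about the refund.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

# Print the three highest-scoring terms per document.
for i, doc in enumerate(docs):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]
    print(doc)
    print("  top terms:", [terms[j] for j in top])
```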

Zipf’s Law - Distribution of Frequency: Zipf’s Law states that in natural text a small number of words are used very frequently, while most words appear only rarely. This check looks for unusual patterns in the frequency distribution of terms, which can indicate that the text was generated artificially by computer programs rather than written naturally. It is not definitive proof, but it is a useful signal worth reviewing in the context of where those terms appear in the text.
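As a rough sketch (assuming whitespace tokenisation and a simple log-log fit, which is not necessarily the test run on the platform), you can fit the rank-frequency curve of your corpus and see how its slope compares to the roughly -1 slope that Zipf's Law predicts for natural text:

```python
# Sketch: estimate the rank-frequency slope of a corpus in log-log space.
# Requires: pip install numpy
from collections import Counter
import numpy as np

def zipf_slope(text: str) -> float:
    """Fit log(frequency) ~ slope * log(rank) + intercept and return the slope."""
    counts = Counter(text.lower().split())
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

sample = ("the cat sat on the mat and the dog watched the cat while the mat "
          "stayed on the floor near the door")
# A real check would run on the whole dataset: natural-language corpora tend
# to give a slope near -1, while artificially generated or heavily templated
# text tends to deviate noticeably. This toy corpus is far too small to judge.
print(f"fitted slope: {zipf_slope(sample):.2f}")
```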

We strongly recommend that you check the quality of your dataset before creating an annotation project.

The M47AI Machine Learning team is always working on new ways to improve the quality of your datasets. Keep an eye out for new features coming soon!
