Should we remove duplicates from a data-set while training a Machine Learning algorithm (shallow and/or deep methods)?


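Mostly it depends on what your goals are and what your dataset looks like. There are two big divides here, each with two sides: the kind of data you have (structured or unstructured) and the kind of goal you have (modeling or understanding).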


  1. Structured Data – here, duplicates very much come with the territory. In this situation you’ve also likely got a lot of implicit ambiguity in your problem. Let’s say that you have 5 structured columns with 3 categories each, plus one target column. That allows at most 3^5 = 243 distinct input combinations, so if you have a million rows in your dataset, duplicate inputs are unavoidable, and there’s no way that your columns are totally predictive of your target column. In this case you need to understand how reliable a given outcome is for a given set of inputs. Duplicate inputs result in some distribution across your output, and you need to retain that distribution. Removing examples here is highly destructive and must be avoided.


  2. Unstructured Data – here, duplicates are weird. They’re drastically less common than in the structured space and, importantly, much more problematic. They typically represent strange edge cases, ETL issues, or other aberrations in data processing. If you see duplicates with any real frequency here, either your unstructured task is actually a structured task or (much more likely) you’ve got an ETL issue. As a general rule it’s a good idea to remove duplicate data in these cases, or at the very least to understand why it’s there and whether it is legitimate.


  3. Modeling – Maybe not the best term, but the idea here is that in a modeling context you have strong guarantees that the data you’re looking at is fully representative of the data you’ll be testing against. You have a clear sense of what metric you’re looking to optimize, and your goal really is just to optimize that metric within your closed system. If you have a very high degree of certainty that this is the case, then feel free to leave your duplicates in. These problems are common in academia and rare in industry. Important: this situation does not apply to most problems, and it is generally assumed to be more common than it actually is.


  4. Understanding – Here, quite simply, the goal is to create some piece of functionality (e.g. sentiment analysis) that operates in an open system: something with free user input, something where you can say with a reasonable degree of certainty that your training data is not fully representative of the eventual data your model will be applied to. Hopefully the distributions are close, but especially in the unstructured case your ability to make any assumptions around this is very limited. In these cases you absolutely should remove duplicate data, as it will otherwise give you an inflated sense of model efficacy. The high-level logic is simply that for general understanding no single example is worth more than another, and any memorization is not understanding.
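
To make the structured-data point concrete, here is a minimal sketch in plain Python. The function name `outcome_distributions` and the toy rows are made up for illustration and are not from the original answer. It estimates the outcome distribution for each distinct input and shows how dropping exact duplicate rows distorts that estimate.

```python
from collections import Counter, defaultdict

def outcome_distributions(rows):
    """Map each distinct input tuple to the empirical distribution of its targets.

    `rows` is an iterable of (inputs, target) pairs, where `inputs` is a tuple
    of categorical values. Duplicate rows are counted, not collapsed.
    """
    counts = defaultdict(Counter)
    for inputs, target in rows:
        counts[inputs][target] += 1
    return {
        inputs: {t: n / sum(c.values()) for t, n in c.items()}
        for inputs, c in counts.items()
    }

# Five columns with three categories each allow at most 3**5 distinct inputs,
# so a million-row table has to repeat inputs many times over.
max_distinct_inputs = 3 ** 5  # 243

# Toy data (hypothetical): one input appears four times with mixed outcomes.
rows = [
    (("a", "x"), "yes"),
    (("a", "x"), "yes"),
    (("a", "x"), "yes"),
    (("a", "x"), "no"),
    (("b", "y"), "no"),
]

full = outcome_distributions(rows)
deduped = outcome_distributions(set(rows))  # set() drops exact duplicate rows

print(full[("a", "x")])     # {'yes': 0.75, 'no': 0.25}
print(deduped[("a", "x")])  # 'yes' and 'no' now look equally likely
```

With the duplicates kept, the input `("a", "x")` leads to "yes" three times out of four. After deduplication that information is gone and the two outcomes look equally likely, which is the destructive effect described for structured data.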


Now, this is where things get tricky. If you’re reading carefully you’ll recall the first point I made on the structured side: that duplicate inputs can be important for informing the output distribution. This is much rarer on the unstructured side, and, importantly, it means something very different.

To use the example of sentiment analysis – this means that you have two identical pieces of text, one that has been tagged as positive, and the other that has been tagged as negative. This means that you have an issue with your problem framing. There are three main possibilities:
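
A quick way to surface such cases is to group examples by their text and look for groups that carry more than one label. The sketch below assumes the data is a list of `(text, label)` pairs. The helper names and the normalization rule (lowercasing and collapsing whitespace) are illustrative choices, not part of the original answer.

```python
from collections import defaultdict

def normalize(text):
    """Collapse case and whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def contradictory_duplicates(examples):
    """Return {normalized_text: sorted labels} for texts tagged with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        labels_by_text[normalize(text)].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

# Toy examples (hypothetical).
examples = [
    ("Great, thanks.", "positive"),
    ("great,  thanks.", "negative"),
    ("Never again.", "negative"),
    ("Never again.", "negative"),
]

conflicts = contradictory_duplicates(examples)
print(conflicts)  # {'great, thanks.': ['negative', 'positive']}
```

Duplicates that agree on the label ("Never again.") are not reported. Only the text tagged both ways shows up, and those are the examples worth reading by hand.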

  1. You’re asking the wrong question – this is the most likely. In the case of sentiment analysis the issue is that positive/negative actually makes little sense in most cases. Adding a neutral class, or further specifying what you mean by “sentiment” (e.g. deciding how to label “I love your customer support, but I don’t want to renew my subscription”), can help dramatically. Ideally you discover issues like this early on and settle on resolutions that heavily mitigate any disagreements on duplicates.


  2. The problem is intrinsically ambiguous – We like to believe that humans generally agree on common sense, but study after study shows that this just isn’t the case. Even for something as basic as sentiment there may not be a true sense of “positive” and “negative”; the concept itself is fuzzy. In this case a small number of duplicates is acceptable, but if you start seeing duplicate levels greater than a couple percent and there are frequent disagreements, then you’re back up at point 1.


  3. The problem is impossible – We like to joke at indico about the use case that every customer starts with: “Is there any chance that you could just scrape the internet and find us trading signal?” No joke, we get that ask every day. We once worked with a customer who asked us to, again, “scrape the internet” and then tell them which companies were about to go through M&A activity. After digging a bit deeper we found that they had an existing team of ~60 people who were already doing this. We looked at a small sample of data and found large distributional errors (and a high duplicate rate). So we asked: “Does this process work?” The answer was a resounding “no”. As a general rule, if humans can’t do something with unstructured data then computers sure as hell can’t.
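
The rates mentioned in the three points above can be measured directly. The following is a rough sketch, again assuming `(text, label)` pairs. The function name `duplication_report` and the exact definitions of the two rates are assumptions made for illustration, since the answer does not define them precisely.

```python
from collections import Counter, defaultdict

def duplication_report(examples):
    """Return the share of examples that are repeats and the share in conflict."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")

    text_counts = Counter(text for text, _ in examples)
    labels = defaultdict(set)
    for text, label in examples:
        labels[text].add(label)

    total = len(examples)
    # Examples whose text occurs more than once in the dataset.
    in_duplicate_group = sum(n for n in text_counts.values() if n > 1)
    # Examples whose text carries more than one distinct label.
    in_conflict = sum(n for t, n in text_counts.items() if len(labels[t]) > 1)
    return {
        "duplicate_rate": in_duplicate_group / total,
        "contradiction_rate": in_conflict / total,
    }

# Toy data (hypothetical): 10 examples, 4 of them in duplicate groups,
# 2 of those with conflicting labels.
examples = [
    ("ok", "positive"), ("ok", "negative"),
    ("bad", "negative"), ("bad", "negative"),
    ("t1", "positive"), ("t2", "negative"), ("t3", "positive"),
    ("t4", "negative"), ("t5", "positive"), ("t6", "negative"),
]

report = duplication_report(examples)
print(report)  # {'duplicate_rate': 0.4, 'contradiction_rate': 0.2}
```

If the duplicate rate is above a couple of percent and the contradiction rate is not negligible, that is the signal to revisit the problem framing rather than the model.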


The last point I’ll bring up is that, above all, academic integrity is what matters. In some cases including duplicate data makes sense; in others it does not. In some cases the mere presence of duplicate data might sour the entire use case. Duplicate data has very unintuitive effects on metrics of model efficacy, so even something as simple as an accuracy metric cannot be interpreted without a good understanding of the rates of duplication and contradiction in your dataset. Correspondingly, you must be certain to disclose these rates. Ideally you should even report efficacy metrics both with and without duplicates to shed a little more light on the generalizability of the model.
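
One possible way to report both numbers is sketched below. It reads “without duplicates” as “excluding test examples whose text also appears in the training set”, which is an assumption about what the answer intends. The function names and the toy memorizing predictor are hypothetical.

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree; NaN if there are none."""
    pairs = list(pairs)
    if not pairs:
        return float("nan")
    return sum(p == y for p, y in pairs) / len(pairs)

def report_with_and_without_duplicates(train_texts, test_set, predict):
    """Score `predict` on the full test set and on the part unseen in training."""
    seen = set(train_texts)
    all_pairs = [(predict(x), y) for x, y in test_set]
    unseen_pairs = [(predict(x), y) for x, y in test_set if x not in seen]
    return {
        "accuracy_all": accuracy(all_pairs),
        "accuracy_unseen": accuracy(unseen_pairs),
        "contaminated_fraction": 1 - len(unseen_pairs) / len(test_set),
    }

# Toy setup (hypothetical): a "model" that memorizes its training data
# and otherwise always answers "negative".
train = {"love it": "positive", "hate it": "negative"}
test_set = [
    ("love it", "positive"),   # also in train
    ("hate it", "negative"),   # also in train
    ("adore it", "positive"),  # unseen
    ("awful", "negative"),     # unseen
]

def predict(text):
    return train.get(text, "negative")

result = report_with_and_without_duplicates(train.keys(), test_set, predict)
print(result)
# {'accuracy_all': 0.75, 'accuracy_unseen': 0.5, 'contaminated_fraction': 0.5}
```

The gap between the two accuracies is the inflation that memorized duplicates contribute, and reporting the contaminated fraction alongside them makes the train/test overlap explicit.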

Maintain academic integrity, disclose curiosities of your data, and if you wittingly allow test/train contamination then you must disclose that. Everything else is secondary.
