A simpler approach—good data, SQL queries, if/then statements—often gets the job done.
It may be overly reductive to say, as data scientist Noah Lorang did years ago, that “data scientists mostly just do arithmetic.” But he is not far off, and he and Yan are certainly right that, however much we may want to complicate the process of putting data to work, it is often best to start small.
Data scientists are paid a lot, so it can be tempting to justify that paycheck by wrapping things like predictive analytics in complicated jargon and heavy models. Lorang’s view of data science is as true today as it was a few years ago: “There is a very small subset of business problems that are best solved by machine learning; most of them just need good data and an understanding of what it means.” Lorang recommends simpler methods, such as “SQL queries to get data, … basic arithmetic on that data (computing differences, percentiles, etc.), graphing the results, and [writing] paragraphs of explanation or recommendation.”
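To make Lorang's recipe concrete, here is a minimal sketch of "a SQL query plus basic arithmetic" using only Python's standard library. The `weekly_sales` table, its columns, and every number in it are invented for illustration; nothing here comes from Lorang or Yan.

```python
import sqlite3
import statistics

# Hypothetical data: the table name, columns, and values are invented.
ROWS = [
    ("2024-01-01", 120.0),
    ("2024-01-08", 135.0),
    ("2024-01-15", 128.0),
    ("2024-01-22", 160.0),
    ("2024-01-29", 171.0),
]


def build_demo_db():
    """Create an in-memory SQLite database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weekly_sales (week TEXT, revenue REAL)")
    conn.executemany("INSERT INTO weekly_sales VALUES (?, ?)", ROWS)
    return conn


def weekly_revenue(conn):
    """Step 1: a SQL query to get the data, ordered by week."""
    cur = conn.execute("SELECT week, revenue FROM weekly_sales ORDER BY week")
    return cur.fetchall()


def summarize(rows):
    """Step 2: basic arithmetic, here week-over-week differences and quartiles."""
    revenue = [r for _, r in rows]
    diffs = [later - earlier for earlier, later in zip(revenue, revenue[1:])]
    quartiles = statistics.quantiles(revenue, n=4, method="inclusive")
    return {"diffs": diffs, "median": quartiles[1], "largest_jump": max(diffs)}


if __name__ == "__main__":
    print(summarize(weekly_revenue(build_demo_db())))
```

The remaining steps Lorang lists, graphing the results and writing up a recommendation, are where the human judgment comes in, and no model is required for either.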
I am not saying that this is easy. I am saying that machine learning is not the place to start when it comes to extracting insights from data. Nor do you need large amounts of data. In fact, Eligible CEO Katelyn Gleason argues that it is important to start with small data, because eyeballing anomalies is what has led her to some of her best findings. Sometimes it is enough to plot distributions and look for obvious patterns.
Yes, that’s right: The data can be “small enough” that a person can recognize patterns and gain knowledge.
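As a rough illustration of eyeballing a distribution, the sketch below buckets a handful of values and prints a text histogram. The latency figures and the bin width are made up for the example; any small numeric column would do.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Bucket values into fixed-width bins and return sorted (bin_start, count) pairs.

    Bins with no values are simply absent from the result.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return sorted(counts.items())


def render(hist):
    """One line per non-empty bin; the bar length equals the count."""
    return "\n".join(f"{start:>5} | {'#' * count}" for start, count in hist)


# Hypothetical response times in milliseconds, invented for illustration.
latencies_ms = [12, 15, 11, 14, 13, 18, 16, 12, 95, 14, 17, 13]

if __name__ == "__main__":
    print(render(text_histogram(latencies_ms, bin_width=10)))
```

With data this small, the lone value in the 90s bin stands out immediately, which is exactly the kind of anomaly a person can spot without a model.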
So it’s no wonder that iRobot data scientist Brandon Rohrer cheekily suggests, “When you have a problem, build two solutions - a deep Bayesian transformer running on multicloud Kubernetes and a SQL query built on a stack of egregiously oversimplifying assumptions. Put one on your resume, the other in production. Everyone goes home happy.”
Again, this doesn’t mean you should never use ML, and it’s definitely not an argument that ML offers no real value. Far from it. It is simply an argument against starting with ML. Yan’s article on the subject is worth a look.
Humans getting to know data
First, Yan notes, it’s important to recognise just how hard it is to pull meaning from data, given the critical ingredients: “You need data. You need a robust pipeline to support your data flows. And most of all, you need high-quality labels.”
In other words, the inputs are complicated enough that it doesn’t make much sense to throw ML models at the problem first. Instead, get to know your data. Try to solve the problem manually or with heuristics (practical methods or shortcuts). Yan highlights this line of reasoning from Hamel Husain, a machine learning engineer at GitHub: “It will force you to become intimately familiar with the problem and the data, which is the most important first step.”
Assuming you’re dealing with tabular data, Yan says it’s worth starting with a sample of the data and running some statistics on it, beginning with simple correlations, and visualizing the data, possibly with scatter plots. For example, instead of building a complicated machine learning model for recommendations, one could simply “recommend high performing items from the previous period,” argues Yan, and then look for patterns in the results. This helps the ML practitioner become familiar with the data, which in turn helps them build better models when they are needed.
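The "recommend high performing items from the previous period" heuristic fits in a few lines. This is only a sketch under assumed inputs: the `(date, item_id)` event format, the function name, and the sample purchases are hypothetical, not something Yan specifies.

```python
from collections import Counter
from datetime import date


def top_items_previous_period(events, period_start, period_end, k=3):
    """Return the k most-purchased item ids with period_start <= day < period_end."""
    counts = Counter(
        item for day, item in events if period_start <= day < period_end
    )
    # most_common sorts by count, highest first.
    return [item for item, _ in counts.most_common(k)]


# Hypothetical purchase log: (purchase date, item id).
events = [
    (date(2024, 5, 1), "kettle"),
    (date(2024, 5, 3), "mug"),
    (date(2024, 5, 3), "kettle"),
    (date(2024, 5, 9), "teapot"),
    (date(2024, 5, 12), "mug"),
    (date(2024, 5, 20), "kettle"),
    (date(2024, 6, 2), "lamp"),
    (date(2024, 6, 2), "lamp"),
]

if __name__ == "__main__":
    # Recommend in June whatever sold best in May.
    print(top_items_previous_period(events, date(2024, 5, 1), date(2024, 6, 1), k=2))
```

A baseline like this is easy to explain and easy to debug, and its results give any later model something concrete to beat.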
When does machine learning become necessary or at least advisable?
According to Yan, machine learning starts to make sense when maintaining your non-ML heuristic system becomes too cumbersome, to the point that keeping it working takes more effort than building and deploying an ML-based system would.
Of course, there’s no hard science on when this happens, but when your heuristics are no longer practical shortcuts and instead keep breaking, it’s time to think about machine learning, especially if you have robust data pipelines and high-quality data labels, which indicate good data.
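One way to notice that heuristics "keep breaking" is to score them against a small hand-labeled sample from time to time. The sketch below assumes a made-up ticket-triage problem; the rules, field names, and labels are all hypothetical, and the article itself names no threshold for when to switch.

```python
def is_urgent(ticket):
    """Hand-written if/then rules; keywords and tiers are illustrative assumptions."""
    text = ticket["text"].lower()
    if "outage" in text or "data loss" in text:
        return True
    if ticket["customer_tier"] == "enterprise" and "cannot" in text:
        return True
    return False


def accuracy(rule, labeled):
    """Share of labeled examples where the rule agrees with the human label."""
    hits = sum(rule(ticket) == label for ticket, label in labeled)
    return hits / len(labeled)


# Hypothetical hand-labeled sample: (ticket, is it really urgent?).
labeled_sample = [
    ({"text": "Full outage in EU region", "customer_tier": "free"}, True),
    ({"text": "We cannot export reports", "customer_tier": "enterprise"}, True),
    ({"text": "How do I change my avatar?", "customer_tier": "enterprise"}, False),
    ({"text": "Billing page is down for everyone", "customer_tier": "free"}, True),
]

if __name__ == "__main__":
    print(f"rule accuracy: {accuracy(is_urgent, labeled_sample):.2f}")
```

Each miss, such as the "is down" ticket here, invites another special-case rule. When patching those rules takes more effort than the alternative, that is the signal Yan describes.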
Yes, it is tempting to start with complex ML models, but arguably one of the most important skills a data scientist can have is common sense: knowing when to rely on regression analysis or a few if/then statements instead of ML.