Errors can creep in when implementing machine learning. Unlike ordinary programming or computational errors, the nature of ML models can make the resulting problems or biases difficult to identify. To address this, Michael Lones, an Associate Professor at Heriot-Watt University, wrote a paper that walks through five phases of the machine learning process, describing how to reliably build and evaluate models, compare models fairly, and report results correctly.
While it is aimed at academic researchers studying ML, the paper also gives budding data scientists and business managers immersed in AI-centric solutions a solid overview of the key considerations when working with ML, including common challenges and the areas where errors tend to occur.
Start with quality data
It’s important to make sure the data comes from a reliable source and was collected using sound methodology, says Lones. Training a model on faulty data is likely to produce a faulty model (popularly summed up as “garbage in, garbage out”). Exploratory data analysis helps spot missing or inconsistent data before training a model.
It also makes sense to collect enough data before training, although how much is enough can be difficult to determine. According to Lones, this depends on the signal-to-noise ratio of the data set and may not become apparent until you start building models. The amount of data can also limit the complexity of the ML models you can use; for example, with a small data set, deep neural networks that require many parameters should be avoided.
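As a concrete illustration of those checks, here is a minimal exploratory-data-analysis sketch in Python with pandas. The file name and the "age" column are hypothetical, and the specific checks are our own illustration rather than anything prescribed in Lones's paper.

```python
# Minimal exploratory-data-analysis sketch (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("measurements.csv")  # assumed input file

# Basic shape and per-column summary statistics.
print(df.shape)
print(df.describe(include="all"))

# Count missing values per column to spot incomplete records.
print(df.isna().sum())

# Flag obviously inconsistent values, e.g. negative ages (hypothetical column).
if "age" in df.columns:
    print(df[df["age"] < 0])

# Check for duplicate rows that could inflate the apparent sample size.
print(f"duplicate rows: {df.duplicated().sum()}")
```

None of this guarantees the data is trustworthy, but it surfaces the kinds of gaps and inconsistencies that are worth resolving before any model is trained.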
Speak with experts
If you are applying ML to problems in a particular domain, be sure to reach out to the experts who work in that area. Domain experts can help you understand the data and point out which features are likely to be predictive, says Lones. They can also help identify problems that are genuinely worth solving, a consideration that pairs naturally with choosing the most relevant business problems to work on.
Use the right model
While it can be fun to experiment with multiple approaches to see what sticks, this can lead to a disorganized mess of experiments that is difficult to justify. An organized approach, with proper optimization of hyperparameters (the numbers and settings that control a model’s configuration), is ideal. A common problem is that test data can leak into configuration, training, or model selection, Lones writes (a mistake often made by medical researchers applying AI to medical imaging). “The leakage of information from the test set into the training process is a common reason ML models fail to generalize.”
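To make the leakage point concrete, here is a minimal sketch of one way to keep the test set out of tuning and model selection, using scikit-learn. The data set, model, and parameter grid are illustrative assumptions, not taken from the paper.

```python
# Sketch: hyperparameter tuning that never lets test data influence training.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set first; it plays no part in scaling, tuning, or selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# A pipeline keeps preprocessing inside cross-validation, so the scaler is
# fit only on each training fold, never on validation or test data.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative hyperparameter grid searched with cross-validation on training data only.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# The test set is touched exactly once, for the final performance estimate.
print("best params:", search.best_params_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

The key design choice is that every decision about the model, from feature scaling to the choice of hyperparameters, is made using the training data alone, and the held-out test set is used only once at the end.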
In closing
It’s worth noting that a higher accuracy figure doesn’t necessarily mean a better model. Lones points out that the difference could simply be due to the use of different data sets or different hyperparameter settings; a sketch of one fairer comparison setup follows below. Ultimately, the ML space keeps evolving, with ever more powerful AI processing tools and capabilities that can be accessed in the cloud or deployed locally. The increasing adoption of AI will no doubt lead to new types of errors that require new strategies to guard against them.
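As an illustration of a fairer comparison, here is a minimal sketch that evaluates two models on identical cross-validation splits, so score differences reflect the models rather than different data partitions. The data set and models are illustrative assumptions, not examples from the article.

```python
# Sketch: comparing two models fairly on the same data and the same CV splits.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Fixing the folds ensures both models see identical train/validation splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```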
“You have to approach machine learning like any other aspect of research: with openness, the willingness to keep up with the latest developments, and the humility to accept that you don’t know everything,” he sums up.