AI/ML techniques have made great advances and are a great exploratory step. But for now, they have not matured enough to reach an inferential step.
The concepts in statistics and mathematics are the building blocks of the techniques and tools we use to gain deeper insights into structured and unstructured data. Statistical concepts lie at the heart of data science.
In an informative session at SkillUp 2021, a two-day event organised by Analytics India Magazine, Rajeeva Karandikar of Chennai Mathematical Institute presented a few examples from history to explain how to make the most of the available data and enormous computing power by combining statistical ideas with modern AI/ML tools.
Rajeeva Karandikar is the Director at Chennai Mathematical Institute. He is a Fellow of the Indian Academy of Sciences and Indian National Science Academy. His research interests include probability theory and stochastic processes, applications of statistics and cryptography.
Statistics quo
“Perhaps in 90% of the problems that need some decision based on available data, the standard tools in artificial intelligence or machine learning and statistics will yield the best or nearly the best answer. But the remaining 10% will need something more than just the tools,” said Karandikar.
He said not all data problems have the benefit of big data. Examples include opinion polls, quality control, vaccine identification and approval, and drug discovery and approval. Statistical ideas and techniques are definitely relevant in such cases.
Karandikar drew on instances from history to show the significant role of statistics and data. His first example was Sir Francis Galton, a cousin of Charles Darwin, who studied the inheritance of genetic traits.
“It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but always to be more mediocre than they - to be smaller than the parents, if the parents were large; to be larger than the parents if the parents were very small” - Galton
Karandikar explained that Galton then collected data on the heights of parents and their grown-up sons, choosing height because such data was easy to obtain. His analysis of the data confirmed his hypothesis.
“Today we can obtain data on the heights of a large number of individuals and their fathers’ heights (say, from the Indian passport database). It can be seen that any data-driven tool will confirm the conclusion reached by Galton. However, interchanging the roles of the heights of sons and fathers leads to an exactly opposite conclusion. This behaviour can also be seen in simulated data,” Karandikar said.
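The point about simulated data can be illustrated with a minimal sketch. It is not Karandikar's or Galton's own analysis: the mean height (170 cm), the standard deviation (7 cm) and the father-son correlation (0.5) are assumed values chosen only for illustration, and the `slope` helper is a hypothetical name. Heights are drawn from a bivariate normal distribution with identical means and spreads for both generations, and the regression is then run in both directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mean, sd, rho = 170.0, 7.0, 0.5  # assumed illustrative values, not real data

# Father and son heights (cm): same mean, same spread, correlation rho
cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
fathers, sons = rng.multivariate_normal([mean, mean], cov, size=n).T


def slope(x, y):
    """Least-squares slope of y regressed on x (hypothetical helper)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


son_on_father = slope(fathers, sons)   # predict sons from fathers
father_on_son = slope(sons, fathers)   # predict fathers from sons

# Condition on a tall group in each direction (more than one sd above the mean)
tall_fathers = fathers > mean + sd
tall_sons = sons > mean + sd

print(f"slope, son on father: {son_on_father:.2f}")
print(f"slope, father on son: {father_on_son:.2f}")
print(f"tall fathers average {fathers[tall_fathers].mean():.1f} cm, "
      f"their sons {sons[tall_fathers].mean():.1f} cm")
print(f"tall sons average {sons[tall_sons].mean():.1f} cm, "
      f"their fathers {fathers[tall_sons].mean():.1f} cm")
```

Under these assumptions both slopes come out well below 1. Sons of tall fathers are on average closer to the mean than their fathers, and fathers of tall sons are on average closer to the mean than their sons. Read naively, the first result says heights shrink towards the average across generations and the second says the reverse, even though the two generations were simulated with exactly the same distribution.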
Correlation and regression
Next, Karandikar discussed some important topics in statistics, such as correlation and regression. Most data-driven analysis tries to discover relationships among different variables, and this is what correlation and regression are all about. It is also important to understand that correlation does not imply causation.