A major problem with discussing the relationship of statistics to data science is agreeing on a definition of data science. I am punting on the definition of data science, but I argue that data science has two aspects that form two sides of the same coin, to use three worn-out clichés in one sentence.
Before moving to my main thesis, some initial observations about data science and statistics:
- Data science is constantly evolving in response to new applications, data characteristics and size, and computing technology, but the field itself is as old as statistics.
- Domain scientists and engineers arguing that they can apply statistical methods and models without a detailed understanding of the theory and without collaborating with statisticians is an old story. They can be right, but the consequences of uninformed application include loss of knowledge, wasted experimentation, misinterpretation, wrong conclusions, lost opportunities for insight, lack of reproducibility, and ignored uncertainty in conclusions.
- Statistics is built around inventing, theoretically and empirically assessing, and applying new methods and models, among its other research concerns like quantifying uncertainty. In other words, statistics is philosophically aligned towards innovation, notwithstanding the caution of statisticians towards adopting new methods without theoretical understanding.
Roughly speaking, data science has two components, which might be called algorithmic and inferential data science.
- Algorithmic data science is focused on algorithm development, with goals that include efficient calculation; correctness, efficiency, and accuracy of implementation; technical issues with implementation on specific platforms; and practical use of data science/statistics methods, models, and tools.
- Inferential data science is concerned with extracting knowledge and drawing conclusions from data reliably and robustly, with quantification of all uncertainties, understanding of the conditions for applicability and failure, and evaluation of the risk of incorrect scientific inferences.
- The interface between these parts is careful quantitative experimental computation, sometimes verifying theoretical understanding and sometimes augmenting what is understood. It is a mixture of algorithmic and inferential data science, and it is distinguished from the anecdotal evidence of one-off application successes that often characterizes algorithmic data science (a minimal sketch of such a computational experiment follows this list).
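To make the distinction concrete, here is a minimal sketch of the kind of careful quantitative computational experiment described above, as opposed to reporting a single successful run. The choice of distributions, estimators, sample size, and number of replications are assumptions made purely for illustration, not anything prescribed in this essay.

```python
# A minimal sketch of a quantitative computational experiment (illustrative only;
# the contaminated-normal setup, estimators, and sample sizes are assumptions).
# Rather than citing one successful application, the study repeats the experiment
# many times and reports aggregate accuracy, the kind of evidence that can verify
# or augment theoretical understanding.
import numpy as np

rng = np.random.default_rng(0)

def contaminated_normal(n, eps=0.1, scale=5.0):
    """Draw n samples from a standard normal with a fraction eps of heavy outliers."""
    clean = rng.normal(0.0, 1.0, size=n)
    noisy = rng.normal(0.0, scale, size=n)
    mask = rng.random(n) < eps
    return np.where(mask, noisy, clean)

def study(estimator, n=50, reps=5000, true_value=0.0):
    """Repeat the experiment reps times; report bias and root-mean-square error."""
    estimates = np.array([estimator(contaminated_normal(n)) for _ in range(reps)])
    bias = estimates.mean() - true_value
    rmse = np.sqrt(np.mean((estimates - true_value) ** 2))
    return bias, rmse

for name, est in [("mean", np.mean), ("median", np.median)]:
    bias, rmse = study(est)
    print(f"{name:>6}: bias = {bias:+.4f}, RMSE = {rmse:.4f}")
```

The point of the design is the repetition and the aggregate error summary: a single lucky run of either estimator tells us little, while the replicated study gives evidence about when and why one method outperforms another.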
These two parts of data science are fundamentally important. Ignoring algorithmic issues prevents the adoption of new methods and models and limits the amount and complexity of data that can be treated. Ignoring the full range of questions involved with scientific inference and, in particular, the measurement and impact of uncertainty, leads to poor and risky decision making. This, for example, is the root of the crisis of reproducibility that is plaguing a number of scientific fields.
Statistics encompasses inferential data science and quantitative experimental computation, and thus statistics is a fundamental part of data science and absolutely necessary for robust and accurate scientific inference.
The balance between algorithmic and inferential data science is always in flux, and there is often a significant imbalance on the frontiers of emerging research. As an example, consider how the balance evolves when a new method is introduced. Initially, it is all about the algorithm.
New methods are often introduced in the application domain, supported by heuristic reasoning, anecdotal evidence, and wishful thinking. Commonly, little is known theoretically and careful numerical studies have not been carried out. In the middle phase, a combination of conservatism on the side of statisticians and overselling of results in the application domain raises hurdles for theoretical study. But individuals pursue the challenges with careful computational experimentation, with a view towards computational validation in both applied and theoretical terms. This provides evidence about how methods and models perform and where they apply, and motivates further theoretical studies. Collaborations increase, stronger applied evidence is accrued, more careful numerical studies are conducted, and analysis catches up to practice. The limitations and applicability of the new methods are identified, while good implementations become available. Balance is restored.
The dynamics of balance can go the other way, e.g., when there is a promising theoretical framework that cannot be used until fundamental research on the algorithmic side is carried out.
The value added by considering the inferential part of data science includes:
- Understanding when and why algorithms work, and when they do not, and how to quantify performance and efficiency.
- Placing ideas in a general framework so that the methodology can be ported over to other applications. A statistics framework provides a coherent approach to data analysis and robustness.
- Determining models that allow for interpretability of analysis, which is an important goal in many applications. Many applications need more than simple prediction.
- Quantification of uncertainty and quantitative assessment of risk (see the sketch after this list).
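As one concrete illustration of the last point, here is a minimal sketch of quantifying uncertainty with a nonparametric (percentile) bootstrap. The synthetic data, the choice of statistic, and the percentile-interval construction are assumptions made for illustration only, not a recipe endorsed by this essay.

```python
# A minimal sketch of uncertainty quantification via the nonparametric bootstrap
# (illustrative only; the synthetic data and the percentile-interval choice are
# assumptions). The point is that the analysis reports an interval reflecting
# sampling variability, not just a single number.
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=80)   # stand-in for an observed sample

def bootstrap_ci(sample, statistic, n_boot=2000, alpha=0.05):
    """Percentile bootstrap confidence interval for a statistic of the sample."""
    n = len(sample)
    boot_stats = np.array([
        statistic(rng.choice(sample, size=n, replace=True))
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return statistic(sample), lo, hi

point, lo, hi = bootstrap_ci(data, np.mean)
print(f"estimated mean = {point:.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Reporting the interval alongside the point estimate is a small example of the difference between producing a prediction and supporting a decision whose risk can be assessed.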