We have come a long way since the 1960s, and since the CRISP methodology laid an early foundation for bringing method to the madness of data science.
Data science is the buzzword that has every technology company, big and small, wondering how to maximize the benefits from it. It took previously boring data, married it to analytical methods, and showed that the combination can significantly shape business strategy. But logisticians, scientists, statisticians and historians have been making sense of data since time immemorial. So what has changed now, and when did it all start?
Way back in the 1960s, the term data analysis described an empirical science concerned with the relationships between data points in different fields of science. Then we moved to data mining, the first stepping stone towards modern-day data science. During this time, two popular methodologies attempted to bring structure to the process of data mining: the Cross Industry Standard Process for Data Mining (CRISP-DM) and SEMMA (Sample, Explore, Modify, Model, Assess). Both define a set of sequential steps to guide the process, assigning specific tasks and defining the results expected at each stage.
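To make the idea of sequential, well-defined stages concrete, here is a minimal sketch of the five SEMMA steps chained together in plain Python. The function names, the toy dataset and the choice of a simple least-squares line are assumptions made purely for illustration; they do not come from any SEMMA or CRISP-DM tooling.

```python
import random
import statistics

# Hypothetical helpers: one function per SEMMA stage, each with a defined output.

def sample(rows, fraction, seed=0):
    """Sample: draw a reproducible random subset of the (x, y) rows."""
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    return rng.sample(rows, k)

def explore(rows):
    """Explore: summarize the data before changing anything."""
    xs, ys = zip(*rows)
    return {"n": len(rows), "mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}

def modify(rows, summary):
    """Modify: center x on its mean (a stand-in for real data preparation)."""
    return [(x - summary["mean_x"], y) for x, y in rows]

def model(rows):
    """Model: fit y = intercept + slope * x by ordinary least squares."""
    xs, ys = zip(*rows)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def assess(rows, fit):
    """Assess: mean squared error of the fit.

    Simplification: this scores the same rows the model was fitted on;
    a real project would assess on held-out data.
    """
    return statistics.mean(
        (y - (fit["intercept"] + fit["slope"] * x)) ** 2 for x, y in rows
    )

def run_semma(rows, fraction=0.5, seed=0):
    """Run the five stages in order, passing each result to the next stage."""
    subset = sample(rows, fraction, seed)
    summary = explore(subset)
    prepared = modify(subset, summary)
    fit = model(prepared)
    mse = assess(prepared, fit)
    return {"summary": summary, "fit": fit, "mse": mse}

if __name__ == "__main__":
    # Toy data lying exactly on the line y = 2x + 1.
    data = [(float(x), 2.0 * x + 1.0) for x in range(20)]
    print(run_semma(data))
```

Each stage consumes the previous stage's output and returns a specific result, which is the property both methodologies share. A real project would swap in its own sampling, preparation and modelling logic behind the same stage boundaries.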
The CRISP Methodology
From a business perspective, CRISP-DM was the more popular method. It provides an overview of the life cycle of a data mining project, much as software engineering does with the software development life cycle. CRISP was an all-encompassing method that took in many factors and brought about better business understanding while keeping customer needs in mind. The six-stage process treated each data mining effort as an independent project. CRISP-DM's strong points are on the “Analytical” side of implementing data mining and predictive analytics projects. However, it does not cover the infrastructure and operations side of such projects. Its early contributors and proponents came from proprietary technology companies, which helped make it practical. As times changed and big data concepts emerged, a new generation of engineers demanded a white-box approach moored in open source, and so CRISP-DM lost favour.
The ASUM Way
In 2015, IBM introduced a new methodology, the Analytics Solutions Unified Method for Data Mining/Predictive Analytics (ASUM-DM). It retained the “Analytical” activities and tasks of CRISP-DM but augmented the method with missing activities and tasks as well as templates and guidelines. ASUM gained relevance, but again with proprietary overtones. It was more enterprise-ready, agile, comprehensive and scalable, and it was created to accelerate time to value and lower risk by establishing consistent approaches and processes that increase implementation efficiency. Its most significant drawback was that it was licensed only to IBM customers, and its restrictive cost discouraged widespread use.
It should be noted that, over time, users specialized in particular software (e.g. Microsoft Excel, SAS) have made numerous efforts to document and propagate best practices for using that software to solve data mining problems. However, no innovation in the methodology itself has come from such efforts to date.
Microsoft Team Data Science Process
The Team Data Science Process (TDSP) is a framework developed by Microsoft that provides a structured methodology for building predictive analytics solutions and intelligent applications efficiently. It was made available as recently as late 2016 (https://blogs.technet.microsoft.com/machinelearning/2016/10/11/introducing-the-team-data-science-process-from-microsoft/) on Microsoft’s Azure cloud. Many of the steps will look familiar to those already comfortable with CRISP-DM, though its adaptation to cloud technologies and agile development methods makes it relevant in this day and age.
The Open Source Proponents
Once data science moved into mainstream business, data analytics libraries began to spring up in almost every open source programming language. Without prejudice to other tools, R and Python have certainly emerged as two of the most popular programming languages for modern-day data science over the last decade. Structured methodologies for carrying out data science have only recently taken hold (e.g. https://www.tidyverse.org/).
It is in this context that RoboticDataScience (RDS), our pioneering methodology for systematically automating data-driven decision-making, should be viewed. More about RDS is described at http://www.thedatateam.in/index.html, http://www.thedatateam.in/white-papers, and http://www.thedatateam.in/podcast.
About the author: