The development life cycle of a data science project differs from the traditional software development life cycle. Although development methodologies and practices vary across organisations, most of them follow similar processes. One well-known process is the Cross Industry Standard Process for Data Mining (CRISP-DM), and this blog presents a summarised version of it.
Data Science Life Cycle
The life cycle of a data science project is divided into six phases.
Business understanding – Understanding the business context and the short-term and long-term objectives
Data understanding – Understanding what data is available, and its quality and quantity
Data preparation – Preparing the right datasets, with data and feature engineering, for use in the models
Modeling – Choosing the right modeling techniques, algorithms and frameworks
Evaluation – Model evaluation, benchmarking and metrics
Deployment – Deployment of the final model
The following diagram shows a typical data science project life cycle.
In the business understanding phase, business requirements and goals are understood. This phase is about assessment, planning, and defining the governance model and the success criteria.
In the data understanding phase, data is acquired and examined. Data understanding can include exploratory data analysis, data visualisation, and assessing the quality and quantity of data.
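As a rough illustration of assessing the quality and quantity of data, the sketch below counts rows and the share of missing values per column for a small in-memory dataset. The `records` sample and the `summarise_quality` helper are hypothetical and exist only for this example; a real project would more likely rely on a data analysis library.

```python
def summarise_quality(records, columns):
    """Return the row count and the share of missing values per column."""
    total = len(records)
    missing = {col: 0 for col in columns}
    for row in records:
        for col in columns:
            value = row.get(col)
            # Treat None and empty strings as missing (an assumption for this sketch).
            if value is None or value == "":
                missing[col] += 1
    missing_ratio = {
        col: (missing[col] / total if total else 0.0) for col in columns
    }
    return {"rows": total, "missing_ratio": missing_ratio}


# Hypothetical sample data, not taken from any real dataset.
records = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 45, "income": None, "city": ""},
    {"age": 29, "income": 48000, "city": "Mumbai"},
]

report = summarise_quality(records, ["age", "income", "city"])
print(report)
```

A report like this gives an early signal on whether there is enough usable data to proceed, or whether more data needs to be acquired first.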
The data preparation phase is one of the most important phases in the data science project life cycle. Activities in this phase include determining the right datasets, data cleansing, levelling, and data and feature engineering.
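To make data cleansing and feature engineering a little more concrete, here is a minimal sketch that drops incomplete rows, rescales a numeric column and derives one new feature. The field names, the sample rows and the helpers `clean` and `min_max_scale` are assumptions chosen for illustration, not a prescribed method.

```python
def clean(records, required):
    """Drop rows where any required field is missing."""
    return [r for r in records if all(r.get(f) is not None for f in required)]


def min_max_scale(values):
    """Scale numbers to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw data with one incomplete row.
raw = [
    {"age": 20, "income": 30000},
    {"age": None, "income": 45000},
    {"age": 40, "income": 60000},
    {"age": 60, "income": 90000},
]

cleaned = clean(raw, ["age", "income"])
scaled_age = min_max_scale([r["age"] for r in cleaned])

# Build new rows rather than mutating the originals.
features = [
    {
        **row,
        "age_scaled": scaled,
        # Engineered feature (illustrative only): income per year of age.
        "income_per_year_of_age": row["income"] / row["age"],
    }
    for row, scaled in zip(cleaned, scaled_age)
]
print(features)
```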
Modeling is one of the most exciting phases in the life cycle. Datasets are usually split into training, validation and test sets. The algorithms to be used are determined. Models are built and assessed continuously. Results of different models are interpreted based on the success and test criteria. This is an iterative phase and continues until the results reach the expected benchmarks.
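The split into training, validation and test sets can be sketched as follows. This is a minimal example using only the Python standard library; the `split_dataset` helper, the 70/15/15 proportions and the fixed seed are assumptions for illustration, and the right proportions depend on the project.

```python
import random


def split_dataset(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle rows and split them into training, validation and test sets.

    Whatever remains after the training and validation shares becomes the test set.
    """
    if train <= 0 or validation <= 0 or train + validation >= 1:
        raise ValueError("train and validation must be positive and sum to less than 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


rows = list(range(100))  # stand-in for real records
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Keeping the test set untouched until the end gives a fairer estimate of how the chosen model behaves on unseen data, while the validation set is used during the iterative assessment of candidate models.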
The evaluation phase focuses mainly on assessing the model against the business objectives. This evaluation is different from that done in the previous phase, in which models are assessed technically. The overall evaluation involves validating and measuring the model against the success criteria and the metrics defined.
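Measuring a model against predefined success criteria could look roughly like the sketch below. The metric names, the thresholds and the `meets_success_criteria` helper are hypothetical, and the sketch assumes every metric is "higher is better".

```python
def meets_success_criteria(metrics, criteria):
    """Compare observed metrics with minimum thresholds.

    Returns (passed, failures), where failures maps each failing metric
    to its required and observed values.
    """
    failures = {}
    for name, minimum in criteria.items():
        value = metrics.get(name)
        # A metric that was never measured counts as a failure.
        if value is None or value < minimum:
            failures[name] = {"required": minimum, "observed": value}
    return (not failures), failures


# Hypothetical thresholds agreed during business understanding.
criteria = {"precision": 0.80, "recall": 0.75}
# Hypothetical results measured for a candidate model.
metrics = {"precision": 0.84, "recall": 0.71}

passed, failures = meets_success_criteria(metrics, criteria)
print(passed, failures)
```

Writing the criteria down as data makes it easier to trace the evaluation back to the success criteria defined at the start of the project.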
In the deployment phase, the model is deployed and made operational. Machine learning models are usually integrated and coupled with products and applications. These can be web, desktop or mobile applications. Machine learning models are also deployed on devices, an approach that is gaining adoption and popularity in the field of edge computing.
The content in this article references the CRISP-DM processes. There are other known processes for data science and data mining projects, such as SEMMA and Knowledge Discovery in Databases (KDD). With the wide adoption of Agile and Scaled Agile methodologies, most of these data science life cycle processes are tailored to meet specific business needs, with a focus on iterative and incremental development and visibility.