Migrating to Cloud: Inhouse Hadoop to Databricks


Migrating applications is a good thing. It forces the organization to clean up junk that is never used. It brings a lot of innovation and new ideas to your engineering teams. It builds confidence within your teams that future migrations need not be stressful, and it pushes teams to design flexible systems. It also sends a message to vendors that you are not bluffing about pulling the plug if you don’t see the results you expect.

Some of the benefits our customers achieved by migrating from an on-premise solution to Databricks include:

  • Tangible Benefits
    • Reduced commercial license and maintenance costs
    • Reduced cluster costs, as you can leverage Databricks auto-scaling (up and down) and spot instance pricing
    • Reduced labor cost of creating new infrastructure
  • Intangible
    • Access to cloud-based services (Azure Data Factory and Azure DevOps, for example) and cloud-native services such as Lambda, EKS, S3/AZFS, etc.
    • Reduced maintenance costs
    • Easier version upgrades
    • Improved performance due to Databricks file system innovations
    • Easier development with notebooks
    • The list goes on
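
As a rough illustration of the auto-scaling and spot-pricing point above, the sketch below builds a cluster specification of the kind accepted by the Databricks Clusters API on AWS. The field names (`autoscale`, `aws_attributes`, `availability`, `first_on_demand`, `autotermination_minutes`) follow that API as we understand it; the cluster name, runtime version, node type, and worker counts are placeholder assumptions, not recommendations.

```python
import json


def build_cluster_spec(name, min_workers, max_workers, idle_minutes=30):
    """Return a cluster spec dict with autoscaling and spot instances enabled."""
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {
        "cluster_name": name,
        # Placeholder runtime and node type; pick values valid in your workspace.
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        # Databricks adds or removes workers between these bounds based on load.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # Keep the first node on-demand; use spot for the rest,
            # falling back to on-demand if spot capacity is unavailable.
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
        },
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("migration-etl", min_workers=2, max_workers=8)
print(json.dumps(spec, indent=2))
```

The resulting JSON could be submitted when creating a cluster or a job cluster; on Azure the spot settings live under a different attribute block, so treat this as AWS-specific.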

However, it is also important that the migration delivers something tangible for the business. Keeping your business partners aware of the migration goals and expected results will enormously increase confidence in your capability and foster team spirit.

Following is the Knoldus Migration Framework that has been tried and tested, and covers the most important points of a typical migration:

Phase 1: Planning and Communication phase

In this phase you will achieve the following:

  • Just like the White House coronavirus task force, form a team of experienced project managers, architects, and business users. Ensure there is sufficient technical expertise, since this is primarily a technical project.
  • Establish a communication plan with the impacted teams. More often than not, migrations impact multiple organizational teams, which could be a group of application-owning teams and/or internal teams (security, infrastructure, database, etc.).
  • Collect inventory of applications with thorough details including application complexities, critical blackout periods that impact schedules, critical people needs, etc.
  • Publish a roadmap, with tentative dates that are subject to change based on the application complexities.
  • Establish the KPIs.
    • Business KPIs (e.g. accuracy of predictions)
    • Performance KPIs (Total run time)
    • Financial KPIs (Total monthly cost reduction)
    • Operational KPIs (Number of people required for maintenance)
  • Define the organization structure
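
The application inventory described above can be kept as structured data from day one, which makes the roadmap easier to regenerate as details change. The following is a minimal sketch: the `Application` fields, the 1-to-5 complexity scale, the sample applications, and the "simplest first" ordering are all illustrative assumptions rather than part of the framework.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Application:
    name: str
    owner_team: str
    complexity: int  # hypothetical scale: 1 (simple) to 5 (very complex)
    blackout_periods: list = field(default_factory=list)  # (start, end) date pairs

    def is_blocked_on(self, day):
        """True if the given day falls inside any blackout period."""
        return any(start <= day <= end for start, end in self.blackout_periods)


def migration_order(apps):
    """Order applications simplest first (an assumed scheduling policy)."""
    return sorted(apps, key=lambda app: (app.complexity, app.name))


# Hypothetical inventory entries.
inventory = [
    Application("sales-forecast", "data-science", 5,
                [(date(2024, 11, 20), date(2024, 12, 31))]),
    Application("store-etl", "data-eng", 2),
    Application("promo-report", "analytics", 3),
]

ordered = [app.name for app in migration_order(inventory)]
print(ordered)
```

Blackout periods can then be checked against tentative roadmap dates with `is_blocked_on` before the schedule is published.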

Establishing a team involves several different factors. For a large organization, we established the following structure; however, you should consider your own organizational factors before designing the migration team.

Central Migration Team

Sample Questions to Ask for a Cloudera-to-Databricks Migration

  • What is the key goal of this migration?
    • Sunsetting Cloudera to save license costs?
    • Improving pipeline performance (total end-to-end elapsed time)?
    • The Cloudera cluster needs more capacity, hence the need for a flexible resource model?
    • Leveraging other cloud services (for example Azure Data Factory)?
    • Better automation?
    • Ease of use for data scientists (i.e. new features using notebooks)?
    • Reduce infrastructure maintenance costs?
  • What is the size and nature of the data that needs to be migrated?
  • What are the high-level data ingress and egress needs?
  • Are the GitHub, Jenkins, Jira, and Confluence setups and locations identified?
  • Who has to approve the merge requests?

Phase 2: Architecture Detailing Phase

This is by far the most critical phase, and the success of the migration depends heavily on what happens here.

  • Engage an experienced ‘Target System Specialist’ to take a look at the current applications, from an architecture standpoint.
  • Identify mismatches in architecture.
  • Prescribe target architecture by collaborating with the target system vendor.
  • Define projects to re-engineer the current system, if that is required prior to migration.
  • Adjust and publish schedules back to the teams based on this detailed assessment. At this point, schedules tend to be much clearer and more detailed.

One of the most important decisions in the migration of any application is whether to make it ‘Cloud Native’, ‘Lift and Shift’, or something in between. This decision should be made after understanding the current application in detail.


One of our customers recently migrated from Cloudera to Databricks. The customer is a large, successful American grocer that needed to predict future sales from historic sales data and promotions. These predictions were made at the item-category level. The existing pipeline did this by running all the data for one category through a large R application, which is single-threaded and makes extensive use of memory.

There were two architectural choices. The first was to rewrite the code to use Spark's parallelized algorithms, which would mean rearchitecting the entire pipeline from the ground up. The second was to use lapply, a pseudo-parallelization construct in Spark that lets us run the code in its entirety in the native R runtime without rewriting it. After internal discussion, and because of time constraints, we decided to migrate without rewriting the code, although a rewrite would be the better choice in the long run.
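
The second option amounts to leaving the per-category function untouched and fanning it out over the list of categories. The customer's code was in R; the sketch below only shows the shape of that pattern in Python, with a local thread pool standing in for the cluster. `forecast_category`, its naive mean "model", and the sample data are hypothetical and are not the customer's code.

```python
from concurrent.futures import ThreadPoolExecutor


def forecast_category(category, history):
    """Stand-in for the unchanged single-threaded model: a naive mean forecast."""
    sales = history[category]
    return category, sum(sales) / len(sales)


def forecast_all(history, max_workers=4):
    """Run the per-category function once per category, concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda c: forecast_category(c, history), sorted(history))
        # Collect inside the `with` block, while the pool is still alive.
        return dict(results)


# Hypothetical sales history keyed by item category.
history = {"dairy": [100, 120, 110], "bakery": [40, 50], "produce": [75]}
print(forecast_all(history))
```

The function itself is still single-threaded, so each category's data must fit in the memory of whichever worker processes it; the fan-out shortens total run time without changing that constraint.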

The bottom line is that such decisions should be made well in advance, while you still have the luxury of expertise and time. Failing that, you put the team under extreme pressure, which may result in production failures and failed projects.

Lift and Shift

Far too often, companies under the stress of a migration resort to a lift and shift approach. Knoldus highly recommends a cloud-native approach, in which the application leverages the full potential of cloud-based architectures to achieve long-term customer delight and lower support costs.

Lift and Shift Migration

However, should you decide to go with lift and shift, consider the following.

  • Is the application intensive in incoming data or outgoing data? This has implications for data transfer costs.
  • Do you intend to plug in local or cloud-based monitoring systems?
  • How much intermediary storage is required?
  • How do you manage the application's configuration to tune its behavior?
  • What kind of integrations are necessary?

Sample questions to ask

  • What are the key components used?
    • ML
    • External libraries and Enrichment of data
    • ETL
    • Security / Data Redaction
    • Programming languages used
  • Are there any non-standard architectures or procedures used?
    • Single-threaded apps
    • High RAM requirements
    • Joins that are too large
    • Broadcasts that are too large
  • Observe the current Spark job output for high shuffle memory usage and task failures
  • Are applications CI/CD-enabled?
  • Do applications use logging extensively?
  • What parts of the code will be in notebooks vs. what parts in JARs?
  • Are there any monitoring or logging tools currently in use that also need migration?
  • Job Dependencies
  • Criticality of output
  • Common Errors

Phase 3: Pre Execution (Build Jira board)

Architecture detailing will give sufficient details to build the Jira board.

  • At Knoldus, we use the SAFe Agile process for managing multiple projects at the same time.
  • Conduct program increment planning, which identifies relationships and dependencies between multiple teams.
  • Break down overall goals into sprint goals
  • Identify EPICs, features, stories, and spikes
    • Document spikes and their potential scenarios. For example, if we want to convert a critical piece of logic from R to Scala, what will the plan be if it succeeds or fails?
  • Create your Jira board
  • Provide sufficient time for teams to understand their next 3-week sprint goals and discuss issues raised. Use the inputs to adjust the stories.
  • Some level of estimation is important for recognizing large tasks. Tasks that are too large need to be split so that they are manageable within the sprint.
  • Document key architectures and pipelines on Confluence. Do an architecture review with key stakeholders.
  • Document the environment strategy. Are clusters dedicated to testing, staging, and production?

Sample questions to ask:

  • What is the current collaboration design? For example, can multiple users execute the same job?
  • Is this collaboration transferable to Databricks notebooks?
  • What is the definition of done? Are CI/CD pipelines included?
  • How do we test the output accuracy? Do we need to write code to automatically test results on a new platform?
  • What is the testing process? Are test scripts prepared and ready?
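
For the question about automatically testing results on the new platform, one simple approach is to export the same output from both systems and compare it key by key with a numeric tolerance. The sketch below assumes each output can be reduced to a mapping from a record key to a number; the function name, the tolerance, and the sample values are illustrative assumptions.

```python
import math


def compare_outputs(legacy, migrated, rel_tol=1e-6):
    """Return a list of human-readable mismatches between two result sets.

    Both arguments map a record key to a numeric value.
    """
    problems = []
    for key in sorted(set(legacy) | set(migrated)):
        if key not in migrated:
            problems.append(f"{key}: missing on new platform")
        elif key not in legacy:
            problems.append(f"{key}: unexpected on new platform")
        elif not math.isclose(legacy[key], migrated[key], rel_tol=rel_tol):
            problems.append(f"{key}: {legacy[key]} != {migrated[key]}")
    return problems


# Hypothetical outputs from the old and new pipelines.
legacy = {"dairy": 110.0, "bakery": 45.0, "produce": 75.0}
migrated = {"dairy": 110.0000001, "bakery": 46.0}
for problem in compare_outputs(legacy, migrated):
    print(problem)
```

A relative tolerance matters because floating-point results rarely match bit for bit across runtimes; the acceptable tolerance is a business decision and belongs in the definition of done.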

Phase 4: Execution

This is the easy part. It’s time to just execute based on the Jira board.

  • Is the foundation laid well?
    • Are clusters deployed?
    • Is the security setup in place? Which notebook folders are open to which users? How do users share code and data?
  • Are users trained on the new technology?
  • Ensure Jira board updates are reflected on each team’s Jira boards.
  • The scrum master should check with other scrum teams whether the dependencies expected to be complete are on track, or whether they will impact the sprint deliverables.
  • Is unit testing being rigorously followed?
  • Are we following true agile, wherein some functionality is demonstrated in demos?
  • Are there any overlap issues in using the infrastructure?
    • For example, if a job is run by two different users, what is the damage?
  • Are we using Slack to effectively notify all teams of the potential shutdown?
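
Shutdown notices are easy to automate. The sketch below prepares a message for a Slack incoming webhook, which accepts a JSON body with a `text` field sent by HTTP POST. The webhook URL is a placeholder, and the function name, cluster name, and message wording are assumptions; the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_slack_request(webhook_url, cluster, shutdown_at):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    message = f"Cluster {cluster} will shut down at {shutdown_at}. Please save your work."
    body = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real webhook URL is issued by Slack for a specific channel.
request = build_slack_request(
    "https://hooks.slack.com/services/EXAMPLE", "legacy-hadoop", "18:00 UTC"
)
print(request.get_method(), json.loads(request.data)["text"])
# Sending would be: urllib.request.urlopen(request)
```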

Phase 5: Closure

  • Measure and understand if KPIs are met.
  • If not met, introspect, and identify what needs to be done.
  • Are basic essential KPIs met, so that we can go live and address the technical debt?
  • Identify and document all technical debt.
  • Define a plan to address technical debt.
  • Has the new system been up and running long enough to hand over to production support?
  • Celebrate.
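
The closure checks above can be expressed as a small report that compares each KPI with its target and separates go-live blockers from follow-up work. This is a sketch under assumptions: the `Kpi` fields, the `essential` flag, the sample targets and measurements, and the rule that unmet non-essential KPIs are logged as technical debt are ours for illustration.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    target: float
    actual: float
    higher_is_better: bool
    essential: bool  # must be met before go-live

    def met(self):
        if self.higher_is_better:
            return self.actual >= self.target
        return self.actual <= self.target


def closure_report(kpis):
    """Split unmet KPIs into go-live blockers and technical debt."""
    blockers = [k.name for k in kpis if k.essential and not k.met()]
    debt = [k.name for k in kpis if not k.essential and not k.met()]
    return {"go_live": not blockers, "blockers": blockers, "debt": debt}


# Hypothetical targets and measurements.
kpis = [
    Kpi("prediction accuracy", target=0.90, actual=0.92,
        higher_is_better=True, essential=True),
    Kpi("total run time (hours)", target=4.0, actual=5.5,
        higher_is_better=False, essential=False),
    Kpi("monthly cost (USD)", target=50000, actual=42000,
        higher_is_better=False, essential=True),
]
print(closure_report(kpis))
```

In this example the essential KPIs are met, so go-live can proceed while the missed run-time target is recorded as debt to address afterwards.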

Once you are in the cloud, you will have several tools, frameworks, and new architecture patterns at your disposal, which immensely increases your ability to respond to business needs.

Cloud managed services

We encourage you to work with experienced application architects and teams who have exposure to cloud-native and reactive architectures to continue the journey of digital transformation. We hope Knoldus can be a partner in your journey. Get in touch with us to schedule a call with our expert or drop us a line at hello@knoldus.com.

Written by 

Head of AI & Advisory Consultant