
How to avoid disaster during disaster recovery test (for HANA)?

Wednesday, 27 July 2016 09:03

Written by Prakash Palani (prakash.palani@basisondemand.com)

Introduction

As with any other database, disaster recovery is a very important aspect of the overall architecture of any HANA deployment. Many articles on the internet already cover the steps to set up disaster recovery for HANA, so this article focuses on how to perform a successful disaster recovery test. In our way of working, we spend a significant amount of effort on preparation so that execution goes smoothly. The article therefore describes how to create an execution plan that details the preparatory, execution and post-processing steps needed for a successful disaster recovery test.

Picture # 1 : Disaster Recovery Setup based on HANA System Replication [image: logreplay]

Preparation

The section below covers the topics that need attention as part of the preparation.

Identify Key Stakeholders

End-to-end responsibility for disaster recovery should not rest with a single person from the SAP Basis, Infrastructure or Database team. It is a combined effort across various teams, which is why it makes sense to have a DR coordinator or manager who can integrate multiple service lines to get the job done efficiently.

Since this article is written from the perspective of an SAP Basis / HANA administrator, it covers the actions carried out by the Basis team in detail.

Picture # 2 : Stakeholders

Define the scope of DR

In most cases, it is not possible to perform a disaster recovery test for the entire IT / SAP estate. As a first step, understand the criticality, impact and agreed SLA of the systems in the landscape to arrive at the scope of SAP systems to be included in the DR test. The scope should be signed off by management before you proceed to the next steps, as it plays a crucial role in identifying the key stakeholders.

Communication Plan

Clear and crisp communication about the disaster recovery test plan is key to making the whole exercise successful across the organization. Define a detailed plan covering the content, the frequency and the audience to be kept informed, so that the right information reaches the necessary stakeholders.

Identify technologies used for each of the systems in scope

Not every application in scope will necessarily follow the same disaster recovery strategy / technology. For example, a system with criticality rating 4 may have less stringent RTO/RPO requirements, so a backup/restore method may be sufficient for its disaster recovery. To avoid last-minute surprises, make sure to identify the DR technology used for each of the systems in the landscape.

Tips for HANA system replication

In the case of replication-based DR, identify the replication speed / duration. You will need this to decide how to perform the fail-back to the original site. With system replication you have two choices:

  1. Set up replication back from site-b to site-a as soon as the secondary becomes primary. Please note that a full sync needs to happen, so depending on the database size and the replication speed, it may take significant time to replicate back to the primary site. You may use the "HANA_Replication_SystemReplication_Bandwidth" SQL script from OSS Note 1969700 to understand the speed at which replication is happening.
  2. If the time estimated for system replication back to site-a / primary doesn't fit in your DR test schedule, then break the replication and bring the database up on the primary with its status from before the DR test. In this case, any change made at the DR site will be lost.
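The choice between the two options comes down to simple arithmetic. The sketch below is only an illustration, assuming you already know the database size and a sustained replication throughput (for example, from the bandwidth script mentioned above). The function names, the MB/s unit and the labels "replicate-back" and "break-replication" are hypothetical and not part of any SAP tool.

```python
def estimate_full_sync_hours(db_size_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to ship db_size_gb at a sustained throughput in MB/s."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (db_size_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def choose_failback(db_size_gb: float, throughput_mb_per_s: float,
                    window_hours: float) -> str:
    """Pick option 1 (replicate back) only if the full sync fits the test window.

    The returned labels are hypothetical names for the two options above.
    """
    hours = estimate_full_sync_hours(db_size_gb, throughput_mb_per_s)
    return "replicate-back" if hours <= window_hours else "break-replication"


if __name__ == "__main__":
    # Assumed figures: a 2 TB database and 100 MB/s sustained throughput.
    print(round(estimate_full_sync_hours(2048, 100), 2), "hours")
    print(choose_failback(2048, 100, window_hours=8))
```

With these assumed figures the full sync takes just under six hours, so it fits an eight-hour window but not a four-hour one. Real throughput varies during the day, so it is worth leaving a margin on top of the estimate.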

 

Identify Single Sign-On and Trusted Relationships

Interfaces with trusted relationships or SSO configurations may have to be handled with additional care. One of the most important aspects of DR is that you never know what to expect on the DR side until you actually perform a DR takeover. In most cases, the team that performed the DR setup is different from the team performing the actual DR / test. Hence, be prepared for what to expect on the DR side from the SSO and trusted relationship front as well.

Understand current monitoring setup

Identify the tools that monitor your SAP environment and trigger alerts. As with any other outage window, you need to know the alerting tools so that you can switch them off during the actual test.

Understand the network setup for end-users and interfaces

This is the most critical step, as this is the whole reason why the disaster recovery test is performed: it is imperative to understand how end-users and interfaces communicate with your SAP systems. This will indicate what kind of changes need to be made at the network layer. In most cases, a DNS switch for the virtual hostname is performed; in some cases, the IP addresses are changed manually at the time of the disaster recovery test. Get the inventory and location of interfaces and verify them well ahead of schedule.

It is also recommended to check which ports need to be opened for the systems to communicate successfully.
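A port check like this can be scripted ahead of the test. The following is a minimal sketch using only the Python standard library. The host name is a placeholder, and the two ports are just the usual defaults for instance number 00 (30015 for the HANA SQL port, 3200 for the SAP dispatcher); substitute the inventory of your own landscape.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(targets):
    """Map each (host, port) pair to its reachability."""
    return {(host, port): port_is_open(host, port) for host, port in targets}


if __name__ == "__main__":
    # Hypothetical DR host; ports assume instance number 00.
    targets = [("sap-dr.example.com", 30015), ("sap-dr.example.com", 3200)]
    for (host, port), ok in check_ports(targets).items():
        print(f"{host}:{port} {'open' if ok else 'NOT reachable'}")
```

Run it from the hosts where the end-users and interfaces actually sit, since a port that is reachable from the Basis jump host may still be blocked from another network segment.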

Identify the current status of stand-by systems

In this step, check the current status of the disaster recovery setup. For example, if HANA system replication is used, check the replication status and make sure that replication is happening as per the design. In the case of backup/restore, try a restore on a different host to be sure about the recovery. This should be done well ahead of the disaster recovery test schedule.

Identify HANA Profile Differences and fix them

Use the "HANA_Replication_SystemReplication_ParameterDeviations" SQL script from OSS Note 1969700 to quickly identify the parameter deviations.

Identify SAP Profile Differences and  fix them

You have to manually compare the profiles to identify the differences between the SAP instances at the primary site and the secondary site.
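The manual comparison can be made less error-prone with a small script. The sketch below assumes the profiles are plain text files with "name = value" lines and "#" comments, and that you have copied one profile from each site to a common location. The function names are hypothetical, and the parameter values in the example are made up.

```python
def parse_profile(text: str) -> dict:
    """Parse 'name = value' lines, skipping blank lines and '#' comments."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def diff_profiles(primary: dict, secondary: dict) -> dict:
    """Return {parameter: (primary value, secondary value)} for every mismatch.

    A parameter missing on one side is reported with None on that side.
    """
    diffs = {}
    for key in sorted(primary.keys() | secondary.keys()):
        a, b = primary.get(key), secondary.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


if __name__ == "__main__":
    # Made-up profile contents for illustration only.
    primary = parse_profile("# primary\nrdisp/wp_no_dia = 20\nPHYS_MEMSIZE = 65536\n")
    secondary = parse_profile("rdisp/wp_no_dia = 10\nPHYS_MEMSIZE = 65536\nrdisp/wp_no_btc = 4\n")
    for name, (a, b) in diff_profiles(primary, secondary).items():
        print(f"{name}: primary={a} secondary={b}")
```

Some differences are expected, such as host names and instance-specific paths, so treat the output as a list to review rather than a list of errors.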

Keep the logon credentials at all the layers handy

To avoid a last-minute rush, make sure that you have the necessary logon credentials for the operating system, SAP and HANA at both sites.

Check NFS Mountpoints, if any

This relates to interfaces: if they depend on specific NFS mount points, those should also be replicated to the DR site.

Identify the technical steps to be performed

  1. Make a note of last shipped log time
  2. Make a note of replication speed
  3. Make a note of last backup time stamp
  4. Make a note of when the backup was restored last time
  5. Make a list of other technical steps to be performed for the takeover and failback
  6. Get the lessons learned from last DR Test exercise
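The first item in the list is most useful when it is compared against the agreed RPO. The sketch below is a minimal illustration, assuming you have noted the last shipped log time by hand. The function names and the idea of checking it against an RPO value are assumptions for this example, not something read from HANA itself.

```python
from datetime import datetime, timedelta


def log_shipping_lag(last_shipped_log: datetime, now: datetime) -> timedelta:
    """Time elapsed since the last log was shipped to the secondary."""
    if last_shipped_log > now:
        raise ValueError("last shipped log time lies in the future")
    return now - last_shipped_log


def lag_within_rpo(last_shipped_log: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the data that would be lost right now stays within the RPO."""
    return log_shipping_lag(last_shipped_log, now) <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps noted just before the takeover.
    last_shipped = datetime(2016, 7, 27, 9, 0, 0)
    noted_at = datetime(2016, 7, 27, 9, 3, 0)
    print(log_shipping_lag(last_shipped, noted_at))
    print(lag_within_rpo(last_shipped, noted_at, rpo=timedelta(minutes=5)))
```

Both timestamps must come from the same clock or time zone, otherwise the lag is meaningless.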

 

Identify the technical tests to be performed

  1. Before takeover, try to perform some updates in the database (e.g. change the first name in transaction SU3)
  2. Test RFC Destinations
  3. Test inter-server communications
  4. Observe CPU and Memory Utilization
  5. Observe Response Time Trend
  6. Try a printer test

 

Define fallback plan for each of the systems

As with any other critical Basis action, be prepared with a fallback plan to handle the crisis if anything goes wrong. For example, if the failback doesn't complete within the stipulated timeframe (e.g. due to slow network speed), decide in advance what steps will be taken to bring the situation back under control.


Execution (Based on HANA System Replication - On the day of DR Test)

The list below contains the basic steps required to fail over to the DR site and fail back to the primary site. (In this scenario, it is assumed that the DR instance is hosted on the Quality environment, hence the references to the QA environment.)

Each step is followed by its command reference, where one applies.

  1. Stop SAP Instances: stopsap
  2. Stop HANA Database: HDB stop
  3. Stop QA Instances on DR Site: stopsap
  4. Stop QA Database on DR Site: HDB stop
  5. Failover to DR Site: hdbnsutil -sr_takeover (to be executed at the Secondary Site)
  6. Start SAP Instances at DR Site
  7. Enable Replication from DR Site to Primary Site: hdbnsutil -sr_register --name= --mode=async --remoteHost= --remoteInstance=00
  8. Monitor the replication process
  9. Perform Basis Tasks
  10. DNS Switch to forward the traffic to DR Site
  11. Perform Functional Tests
  12. Sign-off
  13. Failover back to Primary Site: hdbnsutil -sr_takeover (to be executed at the Primary Site)
  14. Start SAP Instances in Primary Site: startsap
  15. DNS Switch to forward the traffic to Primary Site
  16. Enable Replication from Primary Site to DR Site: hdbnsutil -sr_register --name= --mode=async --remoteHost= --remoteInstance=00
  17. Perform Basis Tasks
  18. Start QA Database in Secondary Site: HDB start
  19. Start QA SAP in Secondary Site: startsap
  20. Sign-off

Quick Tip for DR coordinator

Getting the job done across all the service lines is a tough task. If you are not from an SAP background, get someone from the SAP / HANA team to lead the exercise along with you. (I am not saying this just because I am from a Basis/HANA background; the Basis team is the only team that interacts with all the key stakeholders participating in the DR test, and it connects with the OS, Storage, Network, Application and Management layers very often.) In addition, put emphasis on documenting and centrally storing the entire DR test exercise. This may include:

  • Project Plan
  • Technical and Functional Steps Followed
  • Record Timing for each of the steps
  • Issues Faced and Related Solutions
  • Performance Benchmarking
  • Lessons Learned
  • Risks and Mitigation Plans
  • Contact Details
  • KPIs that help to declare whether the DR Test is a success or a failure
  • Clear Decision Tree Map (this will help the Management team while making decisions)
  • Evidence for Sign-Off
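Recording the timing for each step is easy to forget in the middle of a takeover, so it helps to have it captured automatically. The sketch below is one possible approach, not a standard tool: the StepLog class and its record layout are hypothetical, and the step body is a stand-in for whatever command or manual action you actually perform.

```python
import time
from contextlib import contextmanager


class StepLog:
    """Collects (step name, duration in seconds, status) for each DR test step."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the log can be tested without waiting
        self.records = []

    @contextmanager
    def step(self, name: str):
        start = self._clock()
        status = "ok"
        try:
            yield
        except Exception:
            status = "failed"
            raise  # the failure is recorded, then passed on to the caller
        finally:
            self.records.append((name, self._clock() - start, status))


if __name__ == "__main__":
    log = StepLog()
    with log.step("Stop SAP Instances"):
        time.sleep(0.1)  # stand-in for the real action
    for name, seconds, status in log.records:
        print(f"{name}: {seconds:.1f}s ({status})")
```

The resulting records can be pasted into the test documentation and compared with the timings from the previous DR test.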

Conclusion

As mentioned earlier, this article focuses only on the SAP Basis/HANA side of the disaster recovery test. It is strongly suggested to discuss the network, storage and OS aspects with the necessary stakeholders to make your disaster recovery test plan much stronger. All the best for your test.
