Of all the software VMware released back in 2008, my favorite after vSphere itself has always been SRM (Site Recovery Manager). Back then I installed it at many customer sites, and I even presented a session about it at VMworld 2009 covering one of the first production installations at a customer site. At that time, configuring the SRA (the “communicator / translator” between vCenter and the storage array) was a pretty difficult task, and I'm glad things have changed so drastically over the years. I'm also very happy to see that as of SRM 5.8 (and of course 6.0), it has been fully merged into the vCenter web interface, as seen below.
Another new feature is that there is no need to manage IP address changes at the individual VM level anymore (though those options remain if needed). IP addresses can now be mapped from one subnet to another and applied at the Site > Network Mapping level. You can also use both, e.g. subnet mapping for the subnet as a whole and individual mapping for specific VMs within that subnet.
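To make the subnet-level idea concrete, here is a minimal Python sketch. It is not SRM code and does not call any SRM API. The function name `map_ip`, the sample subnets, and the rule that the host portion of the address is carried over unchanged are all assumptions made for illustration. A per-VM override, when present, wins over the subnet rule.

```python
import ipaddress


def map_ip(ip, subnet_map, overrides=None, vm_name=None):
    """Translate a protected-site IP into a recovery-site IP (illustrative only).

    subnet_map: {"source CIDR": "target CIDR"}  (hypothetical structure)
    overrides:  {"vm name": "fixed recovery IP"} (hypothetical structure)
    """
    overrides = overrides or {}
    # An individual mapping for this VM takes precedence over the subnet rule.
    if vm_name in overrides:
        return ipaddress.ip_address(overrides[vm_name])

    addr = ipaddress.ip_address(ip)
    for src, dst in subnet_map.items():
        src_net = ipaddress.ip_network(src)
        dst_net = ipaddress.ip_network(dst)
        if addr in src_net:
            if src_net.prefixlen != dst_net.prefixlen:
                raise ValueError("mapped subnets must have the same prefix length")
            # Assumption: keep the host offset, swap only the network part.
            offset = int(addr) - int(src_net.network_address)
            return dst_net.network_address + offset
    raise LookupError(f"no subnet mapping covers {ip}")


subnet_map = {"10.10.1.0/24": "192.168.50.0/24"}
print(map_ip("10.10.1.25", subnet_map))
print(map_ip("10.10.1.30", subnet_map, {"db01": "192.168.50.200"}, "db01"))
```

The first call uses only the subnet rule, the second shows an individual mapping overriding it for one VM inside the same subnet.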
Also, as part of VMware's global initiative not to force you, the customer, to use MS-SQL or Oracle DB, you can now use the embedded vPostgres database option that is built into the SRM installer. It is an additional option beyond the currently available databases and is supported, though not tested, up to the SRM maximums. There is no way to convert or migrate an existing database to vPostgres.
•Site Recovery Manager is designed for virtual-to-virtual recovery for the VMware vSphere environment
•Built for a two-site scenario, but can protect bi-directionally. Can also protect multiple production sites and recover them into a single, “shared recovery site”.
•Site Recovery Manager integrates with third-party storage-based replication (also known as array-based replication) to move data to the remote site. Our focus in this post is the RecoverPoint / XtremIO SRA.
Site Recovery Manager is designed for the scenario that we see our customers most commonly implementing for disaster recovery: two datacenters. Site Recovery Manager supports both bi-directional failover and failover in a single direction. In addition, there is support for a “shared recovery site”, allowing customers to fail over multiple protected sites into a single, shared recovery site.
The key elements that make up a Site Recovery Manager deployment:
-VMware vSphere: Site Recovery Manager is designed for virtual-to-virtual disaster recovery. It works with many versions of ESX and ESXi (consult product documentation for more details). Site Recovery Manager also requires that you have a vCenter Server management server at each site; these two vCenter Servers are independent, each managing its own site, but Site Recovery Manager makes them aware of the virtual machines that they will need to recover if a disaster occurs.
-Site Recovery Manager service: the Site Recovery Manager service is the disaster recovery brain of the deployment and takes care of managing, updating, and executing disaster recovery plans. Site Recovery Manager ties in very tightly with vCenter Server — in fact, Site Recovery Manager is managed via a vCenter Server plug-in.
-Storage: Site Recovery Manager requires iSCSI, Fibre Channel, or NFS storage that supports replication at the block level. In our case, we support FC / iSCSI.
-Storage-based (also called array-based) replication: Site Recovery Manager relies on storage vendors’ array-based replication to get the important data from the protected site to the recovery site. Site Recovery Manager communicates with the replication via storage replication adapters that the storage vendor creates and certifies for Site Recovery Manager. VMware is working with a broad range of storage partners to ensure that support for Site Recovery Manager will be available regardless of what storage a customer chooses, so expect the list to continue to grow.
-vSphere Replication has no such restrictions on use of storage-type or adapters.
Users can manage both the protected and recovery SRM instances from a single interface, obviating the need to open multiple clients or run particular management tasks from a specific location.
This is completely independent of vCenter Linked Mode. Linked Mode is still helpful, because it will automatically migrate SRM licenses from site to site as VMs are migrated or failed over, and it also helps with standard non-SRM infrastructure management.
SRM 5.8 / 6.0 is fully supported with the vSphere Web Client and is no longer available for use with the vSphere Client.
In the case of RecoverPoint / XtremIO, there is a special UI that covers two very specific features. SRM itself can only fail over to the latest point in time, which isn't that helpful, especially in our case: the value of RecoverPoint / XtremIO is the ability to go to ANY point in time. So we can leverage the vCenter plugin to select the point in time you want to fail over to, and SRM then “thinks” it is the latest point in time (see the screenshot below).
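The idea of picking an arbitrary point in time can be sketched in a few lines of Python. This is only an illustration and does not use the RecoverPoint or SRM APIs. The function name `pick_point_in_time` and the list of snapshot timestamps are hypothetical. It returns the newest available copy that is not later than the moment you ask for.

```python
from bisect import bisect_right
from datetime import datetime


def pick_point_in_time(snapshots, requested):
    """Return the newest snapshot taken at or before `requested`, or None."""
    ordered = sorted(snapshots)
    # bisect_right gives the count of snapshots that are <= requested.
    idx = bisect_right(ordered, requested)
    return ordered[idx - 1] if idx else None


# Hypothetical journal of available copies, five minutes apart.
snaps = [
    datetime(2015, 3, 1, 10, 0),
    datetime(2015, 3, 1, 10, 5),
    datetime(2015, 3, 1, 10, 10),
]
print(pick_point_in_time(snaps, datetime(2015, 3, 1, 10, 7)))
```

In this sketch a request for 10:07 resolves to the 10:05 copy, which is the kind of choice the plugin lets you make before SRM runs the failover.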
The other special feature gives you, the vSphere admin or the storage admin, insight into which VMs are protected, which VMs aren't, and which VMs are only partially protected. This is done through the unique integration of the RecoverPoint GUI with vCenter (as seen below).
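One way to picture the three protection states is the small Python sketch below. It is not how the RecoverPoint plugin computes them. In particular, it assumes that “partially protected” means some, but not all, of the datastores a VM lives on are replicated. The function name and the datastore names are hypothetical.

```python
def protection_status(vm_datastores, replicated_datastores):
    """Classify a VM by how many of its datastores are replicated (assumption)."""
    used = set(vm_datastores)
    covered = used & set(replicated_datastores)
    if not covered:
        return "not protected"
    if covered == used:
        return "protected"
    # Some of the VM's disks sit on datastores that are not replicated.
    return "partially protected"


replicated = {"ds1", "ds2"}
for vm, stores in {"web01": ["ds1"], "db01": ["ds1", "ds3"], "dev01": ["ds4"]}.items():
    print(vm, protection_status(stores, replicated))
```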
If you are using storage-based replication, integration with the arrays' vendor-specific replication and protection engines is fundamental. This integration is provided via code written by the array vendors themselves. The SRA for RecoverPoint that supports XtremIO is version 2.0.2.
SRAs have advanced for SRM 5, improving the integration with array-replication software for functionality like reprotect/replication reversal and failback.
SRA information is enhanced within SRM 5 and shows not only information about paired remote devices, datastores, and relevant protection groups, but will also show an arrow indicating the direction of replication for each device.
This gives very quick visibility into what is being protected and to where. This is particularly important during reprotect and failback operations.
Installing & Configuring the XtremIO / RecoverPoint SRA
The installation itself is pretty trivial: just download the SRA from the VMware SRM web site and install the executable on both SRM servers (or, in my lab's case, on the vCenter servers, which also act as the SRM servers). Once the SRA has been installed, you will have to restart the SRM service.
Once everything is installed, you will have to configure the SRA using the SRM web interface. Configuring it is as simple as it gets: basically, you point the SRA to the RecoverPoint virtual management IP and give it the username / password used to manage the RPA cluster. You will then need to repeat this at the recovery site as well.
Lastly, in order for the SRM SRA to control RecoverPoint, you need to change the management of the consistency group (CG) to SRM. This allows RecoverPoint to be managed by an “external application”, which in our case is VMware SRM.
Let's take a look at the example above. At the protected site I have a couple of datastores, each of which can contain one or more VMs. Each datastore (a LUN at the storage level) can be part of a protection group. However, if you take a look at the purple example, a VM CAN span multiple datastores, and hence the protection group can span multiple datastores (LUNs).
Then, on the right side (my recovery site), I define the recovery plan, which is really a logical container for the protection groups I put inside of it.
By ensuring virtual machines are stored in a logical fashion on disk according to their protection group, administrators can minimize “shuffling” of VMs to fit optimal layouts for SRM.
VMDKs of a similar priority, or that will belong to the same protection group should be stored in the same datastores to minimize the amount of replication required to create efficient protection groups and thereby recovery plans.
Ensuring that your storage layout and VM placement have been organized with this in mind will mitigate many issues.
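The layout advice above can be checked with a small Python sketch. This is a planning aid written for this post, not anything SRM provides. The placement dictionary, the VM names, and both function names are hypothetical. It works out which datastores a protection group has to span and which unrelated VMs share those datastores and would therefore be replicated along with the group.

```python
def group_datastores(vm_placement, group_vms):
    """Datastores that must be replicated to protect every VM in the group."""
    needed = set()
    for vm in group_vms:
        needed.update(vm_placement[vm])
    return needed


def unrelated_vms(vm_placement, group_vms):
    """VMs outside the group that share a datastore with it."""
    needed = group_datastores(vm_placement, group_vms)
    members = set(group_vms)
    return sorted(
        vm
        for vm, stores in vm_placement.items()
        if vm not in members and needed & set(stores)
    )


# Hypothetical placement: db01 spans two datastores, like the purple VM.
placement = {
    "web01": ["ds1"],
    "db01": ["ds1", "ds2"],
    "test01": ["ds2"],
    "dev01": ["ds3"],
}
print(group_datastores(placement, ["web01", "db01"]))
print(unrelated_vms(placement, ["web01", "db01"]))
```

Here `test01` shows up as a candidate to move, since it sits on a datastore the group needs but is not part of the group.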
Workflows and Use Cases
A planned migration allows for a data synchronization as part of the process. It will stop on errors and allow you to resolve them before continuing. Since it shuts down the virtual machines being migrated, application-consistent VMs are recovered on the recovery side.
A disaster recovery run also allows for a data synchronization as part of the process, but it will not stop on errors. If the protected site is available, then the virtual machines being migrated will be application consistent at the recovery side. If the protected site is not available, the consistency state will be whatever was designed into the solution.
A test recovery allows for a data synchronization as part of the process, supports a recovery that uses a different network, and uses a clone or snapshot for the test.
A reprotect can be run following a successful recovery. It reverses the direction of replication and protects virtual machines back to the original site. This enables a failback to recover the environment back to the primary site.
A cleanup is done following a test recovery. It removes the snapshot or clone created during the test, powers off and deletes the test VMs, and recreates the shadow VM indicating protection of the relevant VM from the primary site. The cleanup creates its own history report. Following a cleanup, the relevant plan is once again ready to be run.
Recover from unexpected site failure, whether full or partial. This is the most critical but least frequent use case: unexpected site failures do not happen often, but when they do, fast recovery is critical to the business.
Anticipate potential datacenter outages, for example a forecast hurricane, floods, or a forced evacuation. Initiate a preventive failover for a smooth migration, with a graceful shutdown of VMs at the protected site. Leverage the SRM ‘planned migration’ capability to ensure no data loss.
The most frequent SRM use case: planned datacenter maintenance and global load balancing. Ensure smooth site migrations, test to minimize risk, and execute partial failovers. Use SRM planned migration to minimize data loss. Automated failback enables bi-directional migrations.
Running a Test Recovery Plan
SRM offers two UI buttons to run test recoveries, or a test may alternately be initiated through a call to the API. Note the “Synchronize storage” option. This ensures very current copies of the VMs for the test.
This is a test recovery ready for users to test. Cleanup would occur after testing is complete by simply pressing the “cleanup” button. The virtual machines run from the cloned / snapshot environment at the recovery site, and replication and protection of the protected environment is not impacted during tests.
Following a cleanup, there are no running virtual machines associated with the recovery plan that was tested, and the associated snapshots / clones created by the test plan have been eliminated.
Shadow VMs have been recreated on the recovery site to indicate those VMs that are protected on the primary site and will be instantiated on the recovery site when a recovery plan is run.
Running a Recovery Plan
Two different UI buttons can start recoveries, or alternatively a recovery may be started by an API call.
A recovery plan can be run as either a Planned Migration, or a DR event. Note that both types of execution will attempt to synchronize storage early in the recovery. The data synchronization attempt is to ensure application consistency, and will execute as an early initial step in a recovery plan after an attempt to shut down the protected VMs, to ensure data is recent and synchronized after the VMs are quiescent.
The difference between a Planned Migration and a Disaster Recovery is that a Planned Migration will automatically stop on errors and allow the administrator to fix the problem. A Planned Migration is designed to ensure maximum consistency of data and availability of the environment. A DR scenario is instead designed to return the environment to operation as rapidly as possible, regardless of errors.
If a Recovery Plan is run as a disaster recovery, the goal is an aggressive Recovery Time Objective, and SRM will not halt the plan from continuing regardless of any errors that might be encountered.
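The difference in error handling between the two run types can be sketched in Python as follows. This is a toy model, not SRM's engine. The step names, the `run_plan` function, and the mode strings `"planned"` and `"disaster"` are all made up for illustration. The only behavior it models is the one described above: a planned migration halts at the first error, while a disaster recovery run records the error and keeps going.

```python
def run_plan(steps, mode):
    """Run (name, callable) steps; mode is 'planned' or 'disaster'."""
    if mode not in ("planned", "disaster"):
        raise ValueError(f"unknown mode: {mode!r}")
    completed, errors = [], []
    for name, action in steps:
        try:
            action()
            completed.append(name)
        except Exception as exc:
            errors.append((name, str(exc)))
            if mode == "planned":
                # Planned migration: stop so the administrator can fix the issue.
                break
            # Disaster recovery: note the error and continue with the next step.
    return completed, errors


def ok():
    pass


def failing_sync():
    raise RuntimeError("sync failed")


# Hypothetical step list with one failing step in the middle.
steps = [("shutdown VMs", ok), ("synchronize storage", failing_sync), ("promote replica", ok)]
print(run_plan(steps, "planned"))
print(run_plan(steps, "disaster"))
```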
Running a Recovery Plan – Storage Layer
Notice that during a recovery plan execution, replication is interrupted. The mirror image, or replication destination datastore, is now promoted and made read/write. The virtual machines in it are registered in vCenter in place of the shadow VM placeholders.
Failback is a process of “Reverse Recovery”
Failback combines recovery plans and reprotect.
“Failback” is the capability of running a recovery plan *after* an environment has been migrated or failed-over to a recovery site, to return the environment back to its starting site.
After a failover has occurred, the environment can be reprotected back to the original environment once it is again safe. Following this reprotect the recovery plan can be run once more, moving the environment back to its initial primary site.
Next it is imperative to reprotect once more, to ensure the environment is once again protected and ready to failover.
With SRM 5, VMware introduced the “Reprotect” and failback workflows. These allow storage replication to be reversed automatically and protection of VMs to be configured automatically from the “failed over” site back to the “primary site”, so that a failover can then be run to move the environment back to the original site.
After running a *planned failover only*, the SRM user can now reprotect back to the primary environment:
Planned failover shuts down production VMs at the protected site cleanly, and disables their use via GUI. This ensures the VM is a static object and not powered on or running, which is why we have the requirement for planned migration to fully automate the process.
Once the reprotect is complete a failback is simply the process of running the recovery plan that was used to failover initially.
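The full failback cycle described in this section (fail over, reprotect, run the plan again, reprotect once more) can be written down as a tiny state machine in Python. The state and action names below are my own labels, not SRM terminology or API values, and the sketch only models the ordering of the steps.

```python
# Allowed transitions: (current state, action) -> next state.
# All names here are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("protected", "planned_failover"): "failed_over",
    ("failed_over", "reprotect"): "protected_reversed",
    ("protected_reversed", "planned_failover"): "failed_back",
    ("failed_back", "reprotect"): "protected",
}


def apply(state, action):
    """Return the next state, or raise if the action is not valid here."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} from state {state!r}") from None


def failback_cycle(state="protected"):
    """Walk the whole cycle and return the final state."""
    for action in ("planned_failover", "reprotect", "planned_failover", "reprotect"):
        state = apply(state, action)
    return state


print(failback_cycle())
```

The final reprotect is what brings the environment back to its starting state, which is why that last step should not be skipped.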
OK, if you have read to this point, you probably want to see it all in action. Please see the demo I made showing the integration of VMware SRM and XtremIO / RecoverPoint.