Dynamics CRM Online Availability and Recovery

Having worked with a number of customers over the last few years who use CRM Online, I have found that the same discussion often comes up during the project lifecycle: we have chosen the SaaS version of CRM, but what does that mean for high availability and disaster recovery? The easiest way to look at it is that you have outsourced the problem to Microsoft, who take it on for you and allow you to focus on the application part of the solution. Even so, you can be sure that at some point questions in the business continuity space will come up.

When implementing CRM on-premises the problem domain is well understood. People know they have servers, infrastructure, networking, storage and SQL Server, among other things, to think about, and in most organisations people are familiar with discussions around DR and HA in these spaces. When considering the cloud you may have outsourced the solution, but you can still face scenarios where availability is affected or disaster recovery is needed, so it's important to understand what Microsoft will do on your behalf and how it may affect you.

Whenever I have looked for information on this I have tended to find a number of fragmented sources, and it's not always easy to get a straight explanation of how things stack up. With this in mind, what follows is my interpretation of the information I have read. I am sure there may be the odd thing I have gotten wrong, so I welcome feedback so that I can update this article, and hopefully it will provide a useful source for others.

Top Level Microsoft Statements

The official statements from Microsoft include:

  • CRM Online has a financially backed SLA of 99.9% committed uptime.  If the SLA is not met then the customer receives service credits
  • Microsoft has operations staff monitoring its services 24x7
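
To put the 99.9% figure in context, it helps to turn it into a downtime budget. The sketch below is plain arithmetic rather than anything from Microsoft's SLA documents; the function name and the choice of a 30-day month and 365-day year are my own assumptions, and the actual SLA will define its own measurement period and exclusions.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime that still meet the given SLA over the period.

    Assumes the SLA is measured over the whole period with no exclusions,
    which is a simplification of how a real SLA is calculated.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    # Illustrative periods only; the real SLA defines its own window.
    for label, days in (("30-day month", 30), ("365-day year", 365)):
        minutes = allowed_downtime_minutes(99.9, days)
        print(f"{label}: {minutes:.1f} minutes ({minutes / 60:.2f} hours)")
```

On these assumptions a 99.9% commitment allows roughly 43 minutes of downtime in a 30-day month, or a little under 9 hours across a year.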

Dependency Tree

When considering the overall availability and recovery for CRM it is important to understand the dependency tree for CRM and its fellow services.

  • Dynamics CRM
    • Depends on Office 365
      • Depends on Azure Active Directory
        • [Optional] Can depend on ADFS
          • Would depend on your on-premises Active Directory
    • Can use SharePoint Online
    • Can use Azure custom components
    • Can use marketplace products

Dependency Considerations

Azure Active Directory
  • Azure Active Directory is a global service on Azure
  • 99.9% SLA for basic and premium (reported uptime for the last year was 99.99%)
  • The service is a geo-distributed, multi-tenanted, multi-tiered cloud service
  • Runs across 27 data centres
  • Microsoft states that Azure AD maintains a zero Recovery Time Objective (RTO) for token issuance and directory reads, and an RTO in the order of minutes (~5 mins) for directory writes. It also states a zero Recovery Point Objective (RPO), with no data lost on failover
Azure Custom Extensions
  • A Dynamics CRM solution may use other components hosted externally on Azure
  • These run out of process, so their availability and uptime may affect your solution but will not affect the availability of the core Dynamics CRM service
Office 365
  • Office 365 underpins some usage in CRM
  • If it went offline then some features may be affected, e.g. license allocation
SharePoint Online
  • Dynamics CRM is not fully dependent on SharePoint Online
  • SharePoint Online is only used for some features of CRM, which you may or may not use
Marketplace Add-ons
  • 3rd party or Microsoft apps you can add to Dynamics
  • Each may have additional dependencies
  • An outage is expected to have some impact on Dynamics functionality but not affect the overall service
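
One way to reason about the dependency tree is to estimate a combined availability for services that must all be up for a user to work. The sketch below is only a back-of-the-envelope model: it assumes the dependencies fail independently and that each one sits exactly at its committed SLA, neither of which is something Microsoft states. The function name is my own.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a chain of services that must ALL be up.

    Assumes failures are independent, which real services sharing
    infrastructure will not strictly satisfy.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Figures quoted above: CRM Online at 99.9% and Azure AD at 99.9%.
    chain = [0.999, 0.999]
    print(f"Combined availability: {composite_availability(chain):.4%}")
```

With only CRM Online and Azure AD in the chain this comes out at about 99.8%, which shows that each hard dependency you add (ADFS, on-premises Active Directory and so on) lowers the figure you can realistically expect end to end.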

Understanding some technical bits

When you set up Dynamics CRM you choose a region, and Microsoft will have two data centres in that region.  As an example, if we choose Europe we may have Dublin as the primary data centre, with Amsterdam acting as the secondary sister data centre.  In this scenario Microsoft would do the following with our data:

  • Two copies of your data are written to local storage in the primary data centre
  • An additional two copies of the data are written to storage in the secondary data centre
  • Additionally, a daily backup is taken by default and held in offsite storage

Under the hood Microsoft are using SQL Server Always On for the SQL instances which back CRM, which allows them to do these additional writes across the data centre pair.  This is important as it allows them to fail over from Dublin to Amsterdam, seamlessly to the customer, in a local data centre DR scenario.  If this occurred then it would be expected that both the RPO and RTO would be in the region of seconds to a small number of minutes.  We would expect almost zero data loss.

Recovery Time Objective

In terms of RTO we are very much in Microsoft’s hands.  If we have a scenario where our CRM primary data centre goes down and Microsoft flip the switch to transition us to the sister data centre then our recovery time would be the time it took for Microsoft to make that decision and to execute the failover.

I cannot currently find a clear statement on this in the form of an SLA, but I assume Microsoft would make the call on a case-by-case basis depending upon the issue.

If we felt that the downtime was going to last for an extended period, we would have the option to take our daily backup and deploy it to CRM in another region.  This could be an option if the downtime in our primary DC was going to be too long before the flip to the sister DC, or if both paired DCs were down.  If we took this option we would need to consider how we would move back to the original data centres when they are back online.  This would again be via a backup and restore, which would involve some downtime.  The URL for our CRM instance would also change in this scenario.

Recovery Point Objective

We can assume there are 3 possible recovery points which we can use on the cloud service:

  • The most common one will be based on the cross DC writes which CRM uses out of the box.  We would expect the recovery point to be within milliseconds/seconds of the state of the system at the time it went down
  • The next option would be if we choose to roll back to the last system backup.  This would put our recovery point anywhere between 0 and 24 hours back, depending on what time of day the failure happens and when the backup was taken
  • The final option would involve choosing to roll back to a custom backup.  In this case the recovery point would be the time the backup was taken

In the default scenario we expect that Microsoft would be recovering the CRM service and our recovery point would be almost exactly the time the system went down.
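
To make the backup-based recovery points concrete, the sketch below works out how much data would be lost if we restored from the most recent backup taken before a failure. The function names and the example times (a 02:00 daily backup, a custom backup at 09:00 and a failure at 16:30) are purely illustrative assumptions; I have not seen a published schedule for the system backups.

```python
from datetime import datetime, timedelta


def latest_backup_before(failure_time, backup_times):
    """Return the most recent backup taken at or before the failure."""
    candidates = [t for t in backup_times if t <= failure_time]
    if not candidates:
        raise ValueError("no backup exists before the failure time")
    return max(candidates)


def data_loss_window(failure_time, recovery_point):
    """Work done between the recovery point and the failure is lost."""
    if recovery_point > failure_time:
        raise ValueError("recovery point cannot be after the failure")
    return failure_time - recovery_point


if __name__ == "__main__":
    failure = datetime(2016, 3, 10, 16, 30)           # hypothetical outage
    daily_backup = datetime(2016, 3, 10, 2, 0)        # assumed schedule
    custom_backup = datetime(2016, 3, 10, 9, 0)       # taken before a deployment

    point = latest_backup_before(failure, [daily_backup])
    print("Daily backup only:", data_loss_window(failure, point))

    point = latest_backup_before(failure, [daily_backup, custom_backup])
    print("With custom backup:", data_loss_window(failure, point))
```

In this made-up example the daily backup alone would lose fourteen and a half hours of changes, while the custom backup reduces that to seven and a half hours. The cross DC writes remain the option we would expect to rely on, since they should lose seconds at most.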

What about Backups?

In addition to the Microsoft approach to multi-writes which offers a good recovery point solution, there are also system backups which are taken daily and kept for around 3 days.

It would be possible for us to use one of these backups as a restore point.

We expect that in most cases we would not use this approach and would instead rely on Microsoft to recover the service.  We would be more likely to use backups to restore to sandbox instances of CRM if we wanted to troubleshoot something.

There is also the option to create custom backups at specific times for the same purposes.

Deployment Considerations

Considering all of the above, our approach places a lot of trust in Microsoft.  However, it is possible that CRM could be working fine as a service but we could accidentally break our own solution.  Based on this we feel that a good practice would be to create a custom backup prior to any deployment activity so that we have a safe rollback if the deployment has problems.
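
As a rough illustration of that practice, the sketch below wraps a deployment in a "backup first, restore on failure" guard. It deliberately does not call any real CRM Online API: `take_backup`, `deploy` and `restore` are hypothetical callables you would supply yourself, for example wrappers around whatever manual or scripted steps your team uses to take and restore a custom backup.

```python
from typing import Callable


def deploy_with_rollback(
    take_backup: Callable[[], str],
    deploy: Callable[[], None],
    restore: Callable[[str], None],
) -> bool:
    """Take a backup, run the deployment, restore the backup if it fails.

    All three callables are placeholders (assumptions), not CRM APIs.
    Returns True if the deployment succeeded, False if it was rolled back.
    """
    backup_id = take_backup()  # safe restore point before any change
    try:
        deploy()
    except Exception as error:
        print(f"Deployment failed ({error}); restoring backup {backup_id}")
        restore(backup_id)
        return False
    return True


if __name__ == "__main__":
    def failing_deploy() -> None:
        raise RuntimeError("solution import failed")

    ok = deploy_with_rollback(
        take_backup=lambda: "pre-deploy-backup",
        deploy=failing_deploy,
        restore=lambda backup_id: print(f"Restored {backup_id}"),
    )
    print("Deployment succeeded:", ok)
```

The point of the pattern is simply that the backup is taken before anything changes, so the recovery point for a bad deployment is the moment just before it started.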

Useful Documents

CRM High Availability

Office 365

Azure AD High Availability

Summary

As I mentioned earlier, I have not come across another source discussing these considerations for choosing CRM Online, so I am hoping my interpretation of the materials out there is useful to others. Please also send feedback on anything I may have missed or misinterpreted.