Following an outage at one of the cloud vendors, the usual round of discussions came up again about how companies need to think about using multiple cloud vendors. While I generally agree with this, I think the real context is a much more complicated discussion, and unfortunately many organisations do not invest enough time to come up with an appropriate strategy around it.
The idea is that I can create a system that is resilient to an outage at my cloud provider if it can be hosted on multiple cloud providers at the same time.
In this article I wanted to put down some of my thoughts in this area and some of the things I think need to be considered.
What do you do on the cloud?
Most organisations these days are hybrid organisations, with some applications in the cloud and some applications on premise. The first consideration is how well you understand your application estate. If you have a good handle on your existing applications and their capabilities, then you are in a position to start considering what they offer and what constraints they place on this strategy.
If you do not understand your application estate well, then you should probably take a step back and build a clear catalogue of your applications and their capabilities. Otherwise you may make some bad decisions which you will not find out about until much later, when they could be more expensive to fix.
How do you measure value of the strategy?
How you decide which applications fall within the scope of a strategy like this is very important. Making an application highly available across regions and cloud vendors is likely to be quite expensive, and you need to be clear on what value is gained by doing so to ensure that your strategy adds value. An initiative like this also has many hidden costs that are not immediately obvious, so it is not as simple as adding up the cost of compute per hour on two cloud platforms.
Some of the costs involved include:
- Running costs of both cloud platforms
- Additional deployment costs to deploy to both clouds
- Time and Effort
- Increased management and operations overhead
- Extra testing of the solutions for performance and functional testing on both clouds
- Extra time taken to troubleshoot issues
- Managing compatibility between your code and multiple clouds
- Managing your team's knowledge of both clouds and how to use them
- Additional cost for non-production environments
- Additional vendor and account management costs
In addition to these costs you also need to understand the benefits for your application. Probably one of the most common ways is to work out how much money your business loses per hour if your systems are down. In the case of your HR system you might only be losing the productivity cost of a small team of HR people, but for your customer orders system a period of downtime may result in a whole call centre of people doing nothing and a large set of customers who are unable to place orders, which could be significantly more expensive.
The difficult part is how to create a comparison between the two. To achieve this it's best if you can create a per hour cost of the system being down and a per year cost of the additional overhead of the highly available system. You can then work out how many hours of downtime per year would balance out the additional overhead of the multi-vendor solution. An example may look like:
$ lost per hour during downtime | $ cost per hour of multi-vendor solution | Hours downtime to break even
With this measurement approach it's not easy to work out your costs, but you need to have some numbers to justify any decision you make. In the above example you can see that making the HR system highly available across vendors might be a lot cheaper than doing the same for the customer orders system, but in the HR case we would need 1000 hours of downtime to justify the extra expense, so it is probably a bad investment to use a multi-cloud strategy here.
One of the first things that comes up in discussions I have had in this space is the Reactive Manifesto, which talks about system design to increase tolerance to faults. I think the aims of this project are good, but I think it's important to consider that building a reactive system and building a reactive enterprise are two very different things. The Reactive Manifesto is evolving and is something to keep an eye on for useful information which may help you in the design and implementation of your strategy, but it is important to remember that it is not a silver bullet.
If you're interested in finding out more about the Reactive Manifesto implemented with AWS and Azure, check out Elton Stoneman's Pluralsight course about this.
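The break-even comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name and all of the figures are hypothetical assumptions chosen so that the HR case lands on 1000 hours, and the overhead is expressed per year as described in the text.

```python
def break_even_hours(annual_overhead: float, loss_per_hour: float) -> float:
    """Hours of avoided downtime per year needed to pay for the extra overhead."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return annual_overhead / loss_per_hour


# Hypothetical figures for illustration only; not taken from any real system.
systems = {
    "HR": {"loss_per_hour": 50.0, "annual_overhead": 50_000.0},
    "Customer orders": {"loss_per_hour": 20_000.0, "annual_overhead": 200_000.0},
}

for name, figures in systems.items():
    hours = break_even_hours(figures["annual_overhead"], figures["loss_per_hour"])
    print(f"{name}: {hours:.0f} hours of downtime per year to break even")
```

With these assumed numbers the multi-vendor overhead for HR is the cheaper of the two, yet it only pays for itself after 1000 hours of downtime a year, while the customer orders system breaks even after 10.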
Highly Available Systems vs Highly Available Enterprise
Creating a single application solution which is highly available in the cloud, and which also tolerates a fault where a cloud vendor's entire platform goes offline in all regions, is a tough enough problem, but it is solvable. Achieving this same level of service for all systems within your entire organisation is a very different thing.
In your strategy I think it is important to define the scope of what systems need to be within it. As an example, your customer-serving systems may be very important and need to fit into your strategy, but your accounting systems and back office systems may not benefit the business by being within it. Using the benefits measurement approach that you defined will help you weigh up the costs of being in the strategy versus the benefits you would gain.
For many organisations selecting a few key systems to include within this strategy and omitting many others will provide the best balance of cost vs benefit.
Understanding your Points of Failure
When considering your solutions for multi-vendor high availability it is important to be clear on the points of failure within your system. Does your organisation have a clear outline of what the system's points of failure are, and an appropriate risk assessment of them?
Imagine a scenario where you invest heavily in a multi-vendor approach but miss one of those points of failure because the system was poorly understood, and you suddenly have a component in the system which cannot fit into the strategy. This miss could be very expensive.
I think it's important to be very clear on the solutions you are considering for inclusion in your strategy and to understand their makeup.
Single Cloud H/A vs Multi-Cloud H/A
When considering your multi-vendor approach it is important to consider what you get now from your current cloud provider. Most cloud providers offer hosting in multiple regions. If you want to improve your system availability and tolerance to faults, then you need to consider whether you are using your current provider's regions effectively. As an example, if you are currently only using one region, then you might find that you could make a big improvement to your service by using multiple regions on the same provider for a much smaller cost than supporting multiple vendors.
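To get a rough feel for how much a second region could help, the sketch below uses the standard formula for redundant deployments. It rests on a strong assumption that is mine, not the article's: failures in each region are independent, which is optimistic because regions of one vendor can share services. The 99% figure and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent deployments where any one can serve traffic."""
    if not 0.0 <= single <= 1.0 or copies < 1:
        raise ValueError("single must be in [0, 1] and copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical example: each region is assumed to be available 99% of the time.
for regions in (1, 2):
    availability = combined_availability(0.99, regions)
    print(regions, "region(s):", round(expected_downtime_hours(availability), 2), "hours down per year")
```

Under that assumption, a second region on the same provider already removes most of the expected downtime, which is the comparison worth making before paying for a second vendor.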
Be Prepared to Rebuild
For many business critical solutions you may be considering moving an existing solution into your multi-vendor cloud strategy. With this kind of strategy being relatively new, you need to consider that your system probably wasn't designed with these kinds of requirements in mind. You should be prepared to accept that some parts of the solution may need to be changed to support running on two or more cloud platforms.
Data Locations & Regulations
For some companies there will be regulations which affect how you can use the cloud. This could have a big impact on your ability to implement a multi-vendor cloud strategy. An example could be a country which requires that your customers' data is hosted only within that country. This gives you a problem if the only other hosting partners available are located elsewhere.
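A residency rule like this can be checked mechanically before any design work starts. The sketch below is a minimal illustration; the vendor names, region names and catalogue are hypothetical placeholders, and real residency rules are usually more nuanced than a country code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    vendor: str
    name: str
    country: str  # ISO 3166-1 alpha-2 country code


def compliant_regions(regions, allowed_countries):
    """Regions where customer data may be hosted under the residency rule."""
    return [r for r in regions if r.country in allowed_countries]


def vendors_available(regions, allowed_countries):
    """Vendors that have at least one compliant region."""
    return {r.vendor for r in compliant_regions(regions, allowed_countries)}


# Hypothetical catalogue: only vendor-a has a region in the home country (IE).
CATALOGUE = [
    Region("vendor-a", "a-home-1", "IE"),
    Region("vendor-a", "a-abroad-1", "NL"),
    Region("vendor-b", "b-abroad-1", "NL"),
]

vendors = vendors_available(CATALOGUE, {"IE"})
print("Multi-vendor possible under the rule:", len(vendors) >= 2)
```

If fewer than two vendors survive the filter, the multi-vendor option is off the table for that data, whatever the rest of the architecture looks like.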
Data Centre Locations
This might sound rather obvious, but if I decide to support a multi-vendor approach I need to consider the proximity of the data centres provided by each company. Imagine a scenario where I chose the AWS data centre in Dublin and the Azure data centre in Dublin. I may have protected myself from a global outage at one provider, but what would happen if there was a natural disaster? You need to ensure that the vendor sites are a safe distance apart and in appropriate locations.
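A quick sanity check on site separation is to compute the great-circle distance between the two locations. The sketch below uses the haversine formula; the coordinates are approximate city centres rather than real data centre addresses, and the 300 km minimum is a hypothetical threshold, not a recommendation.

```python
import math

EARTH_RADIUS_KM = 6371.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(site_a, site_b, minimum_km):
    """True when the two (lat, lon) sites are at least minimum_km apart."""
    return distance_km(*site_a, *site_b) >= minimum_km


# Approximate city coordinates, used purely for illustration.
DUBLIN = (53.35, -6.26)
AMSTERDAM = (52.37, 4.90)
MINIMUM_KM = 300.0  # hypothetical policy threshold

print("Dublin + Dublin ok:", far_enough(DUBLIN, DUBLIN, MINIMUM_KM))
print("Dublin + Amsterdam ok:", far_enough(DUBLIN, AMSTERDAM, MINIMUM_KM))
```

Distance alone does not cover shared power grids, flood plains or network routes, so treat this as a first filter only.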
Active/Passive vs Active/Active
When you look to implement your multi-vendor cloud solution, how are you going to use these data centres? Are you going to have an active solution using two or more data centres at the same time, or will you fail over from one provider to another in the event of a failure? Both of these have significant impacts on how the solution is designed and on how it is operated.
If you implement an active/active solution then you will have the challenge of keeping the data in sync across the vendors, whereas in an active/passive scenario the act of pulling the lever to move from one provider to another would be a major operational process. How would you manage this or test it? And how would you decide when to fail back once things are good again?
You may also need to consider what events would constitute a cloud provider failure. Sometimes this is simple, such as everything just breaking, but what if 90% of the cloud provider was working OK and just one component was broken? This could still render your solution offline in that provider. Taking that a step further, imagine a solution made up of a website, a queue system and a database, hosted on both AWS and Azure. Imagine the database went offline in AWS and the queue system went offline in Azure. For both providers this would not be an entire failure of their platform, just one component of it. For your solution, however, this would be a complete failure in both providers. It's almost as if at every level in your application stack you need to be able to reroute traffic to each cloud provider. You cannot simply have a copy of the entire application stack in each cloud provider and trust it will just work.
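One possible way to make the fail over and fail back decision less ad hoc is to require several consecutive health check results before switching in either direction. The sketch below is only one option under that assumption; the class name, thresholds and the idea of a single boolean health signal are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    fail_after: int = 3      # consecutive failed checks before failing over
    recover_after: int = 10  # consecutive healthy checks before failing back
    active: str = "primary"
    _failures: int = 0
    _recoveries: int = 0

    def observe(self, primary_healthy: bool) -> str:
        """Record one health check of the primary and return the active provider."""
        if self.active == "primary":
            # Count consecutive failures; any healthy check resets the count.
            self._failures = self._failures + 1 if not primary_healthy else 0
            if self._failures >= self.fail_after:
                self.active, self._failures, self._recoveries = "secondary", 0, 0
        else:
            # Count consecutive healthy checks; any failure resets the count.
            self._recoveries = self._recoveries + 1 if primary_healthy else 0
            if self._recoveries >= self.recover_after:
                self.active, self._failures, self._recoveries = "primary", 0, 0
        return self.active


controller = FailoverController(fail_after=2, recover_after=3)
for healthy in (False, False, True, True, True):
    print(healthy, "->", controller.observe(healthy))
```

Making fail back slower than fail over is a deliberate choice here to avoid flapping between providers; the real operational process around data sync and testing is far larger than this decision logic.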
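The partial failure scenario above can be modelled in a few lines. The sketch below uses the same website, queue and database example; the health data is hypothetical, and the routing functions are an illustration of the idea rather than a real traffic manager.

```python
COMPONENTS = ("website", "queue", "database")

# Hypothetical health snapshot: the database is down in AWS, the queue is down in Azure.
health = {
    "aws": {"website": True, "queue": True, "database": False},
    "azure": {"website": True, "queue": False, "database": True},
}


def healthy_stacks(health):
    """Providers where every component of the stack is healthy."""
    return [p for p, comps in health.items() if all(comps[c] for c in COMPONENTS)]


def route_per_component(health):
    """For each component, pick the first provider where it is healthy (or None)."""
    return {
        c: next((p for p, comps in health.items() if comps[c]), None)
        for c in COMPONENTS
    }


print("Whole-stack fail over targets:", healthy_stacks(health))
print("Per-component routing:", route_per_component(health))
```

Whole-stack fail over finds no usable provider, while routing at each level of the stack still finds a working path. The price is that components now talk across clouds, with the latency, cost and data consistency questions that brings.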
Do your on premise systems support this aim?
Many solutions these days incorporate hybrid integration, making cloud applications and on premise applications work well together. However, in many of the cases where I have seen organisations raise multi-vendor cloud solutions in this context, the organisation was known to have a dubious availability story for its on premise systems. In this case you need to consider how effective a multi-vendor strategy can be if it is underpinned by a fragile on premise infrastructure.
One of the key use cases for cloud adoption is the improved time to market and agility you can obtain when implementing cloud based or SaaS based solutions. If you need to implement a multi-vendor strategy then you need to accept a trade off: you may be restricted in terms of which cloud features you can use. Imagine a case where you have chosen Microsoft Azure and have implemented an API hosted in the cloud. On Azure you may choose to put Azure API Management in front of your API to provide security and other features which give you an additional level of agility around the development of API solutions. If your other vendor was AWS you would have a problem in this space, as AWS does not offer a directly equivalent API Management product. You could of course add a 3rd party product, but then your solution varies on each cloud provider. You either use a reduced set of features, and therefore reduce the agility the cloud can offer you, or you accept some risk that the solution is different on each cloud.
Custom Development vs Out of the Box
Today most out of the box applications you will buy or rent are not built to run side by side on multiple clouds concurrently. Maybe it would be an interesting challenge to find an app that can, but it is certainly the exception rather than the rule. If your solution development approach is to build your own solutions using bought or open source components, then you have a lot of control over the system architecture and you will probably find it a lot easier to build a solution which supports a multi-vendor cloud approach. If you're buying or renting an application like Salesforce or Dynamics CRM Online (or server), then you will find these are not intended to be run side by side on multiple cloud platforms and cannot easily be changed or customized by you to support this. You might need to wait to see if this is something that the vendor would ever support.
Which Technologies should you Use?
If you decide to commit to a multi-cloud platform strategy, then the biggest constraint is likely to be the technologies that you use and how well they are supported on other cloud platforms. If you are creating solutions that use Infrastructure as a Service and Windows or Linux virtual machines, then you are probably in the best place to be able to support a multi-vendor strategy. Most cloud providers will support servers (either physical or virtual) and you could use networking technologies such as VPN to connect the virtual networks. One of the newer approaches is the use of a container technology such as Docker, which can be hosted on different cloud platforms. These provide a place to run your custom code, or potentially applications, which can support hosting on multiple clouds.
As you move up the technology stack towards the PaaS and SaaS space, you will find that cloud vendors tend to offer different products which are not necessarily built for interoperability or to be equivalents of each other. As an example, hosting an ASP.NET website on Windows Servers in both Azure and AWS would be lower risk than hosting the same website in Azure Websites and AWS Elastic Beanstalk. At the IaaS level the hosting solutions are almost identical, whereas at the PaaS level there are more unknowns and more differences.
In other places there are types of component which are not offered on some cloud platforms. As an example, I mentioned earlier that Azure offers API Management and AWS does not natively offer a direct equivalent. Also, something like Office 365 cannot be hosted on another cloud provider.
One of the biggest areas affected by your multi-vendor cloud strategy is going to be your operations or DevOps area. I have seen many strategies like these fail to consider the overall impact on the company. The organisational impact is quite big; it's not just a technical problem. Your operations team need to be skilled in all of the cloud platforms and also in how your solution will use them. This means twice as much training and probably more people in the operations area. Getting buy in from the operations area and leading or guiding them through this journey is important; you should not just dictate an approach and then walk away.
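One way to make these gaps visible early is to record which capabilities each platform offers and compare them with what a solution needs. The sketch below is a minimal illustration; the cloud names and capability names are generic, hypothetical placeholders rather than a statement about any real vendor's catalogue.

```python
# Hypothetical capability catalogue; names are placeholders, not real product lists.
CAPABILITIES = {
    "cloud-a": {"virtual-machines", "containers", "managed-sql", "api-management"},
    "cloud-b": {"virtual-machines", "containers", "managed-sql"},
}


def portable(required, catalogue):
    """True when every required capability is offered by every cloud."""
    common = set.intersection(*catalogue.values())
    return required <= common


def gaps(required, catalogue):
    """Capabilities missing per cloud, listing only clouds that have a gap."""
    return {
        cloud: sorted(required - offered)
        for cloud, offered in catalogue.items()
        if required - offered
    }


api_solution = {"containers", "api-management"}
print("Portable as designed:", portable(api_solution, CAPABILITIES))
print("Gaps to fill or design around:", gaps(api_solution, CAPABILITIES))
```

A gap does not have to rule a solution out, but it forces the choice described earlier: restrict yourself to the common set, or accept that the solution differs on each cloud.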
In summary, I believe that it is realistic to implement a multi-vendor cloud strategy, but it is a big job with lots to consider. It's an architecture approach that you need to specifically design for.
The problem is that many organisations want a one-click approach to this and do not put the effort into understanding how their strategy needs to be laid out. Organisations also need to accept that this kind of strategy could be expensive to implement, and that they need an approach for working out which parts of the architecture would benefit from the strategy at an acceptable cost and which areas should not be included.
This kind of strategy should not be an afterthought. With some thinking up front, you can steer the solutions you develop on the cloud towards technologies which will help you implement this kind of strategy now or in the future, without having to replace significant parts of your architecture.
I would love to hear feedback from others on their experiences in this space and any real world case studies which have implemented such solutions successfully.