Recently we had an issue with a WCF service hosted on Azure App Service. This led to a troubleshooting scenario that I thought would be interesting to share, because it showed some of the opportunities of the cloud and some of its challenges at the same time.
The scenario we had was similar to one of the common patterns for Service Bus Relay, where you use the relay as a secure channel between data centres. While our scenario was slightly more complex, the diagram below shows a common implementation of this pattern.
The pattern is that we have a WCF service listening on the Service Bus Relay using the HttpRelayBinding. A partner sends data to the relay endpoint, the data ends up at our WCF service, and it is written to a SQL Azure Database.
The Service Bus Relay acts as a secure channel between the partner and our WCF service. Nowadays you might consider using Azure API Management, but this architecture has been around for a while. Using Service Bus Relay gives the partner a channel secured with a key, and because our WCF service only includes the relay binding, the service is only accessible via the relay and is protected from the outside world.
This solution has been running nicely in our production and test environments for a few years now.
On the evening of 28th Feb our service suddenly broke in production. The service would no longer activate on the relay and was returning the error below:
No protocol binding matches the given address 'http://<Replaced>.azurewebsites.net/Service1.svc'. Protocol bindings are configured at the Site level in IIS or WAS configuration.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.
Exception Details: System.InvalidOperationException: No protocol binding matches the given address 'http://<Replaced>.azurewebsites.net/Service1.svc'. Protocol bindings are configured at the Site level in IIS or WAS configuration.
The first indication from the error message, and from googling it, is that the problem is something to do with the config file or IIS. Unfortunately, with Web Apps the equivalent of IIS is managed for us to a degree, so we were slightly restricted in the areas we could look at.
Upon initial investigation we were able to identify the following symptoms:
- The service in production was broken with the above error message
- The service in all test environments was working fine
- The components had not been changed in months
- The component in production was working fine until around 10:30pm
- Other WCF services in Web Apps on the same App Service plan were working fine
- We methodically compared the Web App settings between what worked in Test and what didn’t in Production and there were no differences
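A settings comparison like the one in the last point can be scripted. The following is a minimal Python sketch, assuming the app settings of each environment have already been exported into plain dictionaries. The function name `diff_settings` and the sample keys and values are hypothetical and are not our real configuration.

```python
def diff_settings(test, prod):
    """Return the settings that differ between two environments.

    Each differing key maps to a (test_value, prod_value) tuple. A setting
    that is missing on one side is reported as None for that side.
    """
    differences = {}
    for key in sorted(set(test) | set(prod)):
        test_value = test.get(key)
        prod_value = prod.get(key)
        if test_value != prod_value:
            differences[key] = (test_value, prod_value)
    return differences


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    test_settings = {"RelayNamespace": "test-ns", "LogLevel": "Info"}
    prod_settings = {"RelayNamespace": "prod-ns", "LogLevel": "Info"}
    for name, (test_value, prod_value) in diff_settings(test_settings, prod_settings).items():
        print(f"{name}: test={test_value!r} prod={prod_value!r}")
```

Differences you expect, such as environment-specific connection strings, can then be ignored by eye, leaving anything unexpected to stand out.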
We tried some of the following things:
- Resetting the web app had no effect
- Deploying previous versions of the component had no effect
- Changing the service bus relay instance it was pointing to had no effect
- Creating a new Web App (on a new plan or the existing one) had no effect
- Copying the code from the Web App to an Azure VM and running it (as is, with no changes) under IIS worked absolutely fine
This led us to zone in on the combination of the code using Service Bus and being hosted in a Web App as the root of the problem. We knew it was unlikely to be something easy to fix, so our approach was to restore the service asap and then look at how to identify and resolve the root cause.
Restoring the public service
We knew that the code worked on a VM, so we had the opportunity to get the production service working again by deploying the code to a VM in Azure and slightly changing the architecture for a short period while we fixed the problem. To do this we implemented the diagram below.
We created two VMs so we had some redundancy, added them to a Virtual Network, and set up an NSG. We then deployed the component to IIS, the services came online, and we flipped the relay listener from the Web App to the VMs. This change restored the production service and bought us some time to look at the root cause of the problem.
Because provisioning in the cloud is easy, we were able to restore service pretty quickly.
While we had restored the production service with the workaround above, the next night all of our test environments suddenly started having the same problem. This happened about 24 hours after the production environment was impacted.
We could use the same workaround for the non-production environments too.
Initially I wanted to check a few other things around our component so I tried the following:
- Moving the component to the latest version of .NET, in case running on an older version was causing the problem. Because the component had not changed in ages it was still running on .NET 4.5.2
- Updating all NuGet packages to the latest versions, to make sure none of them was the problem. Certainly the version of the ServiceBus dll was a good few versions old
None of these changes had any effect.
By this point I was pretty sure the problem was caused by something on the Azure side. Before raising a support call, though, I knew I needed to simplify the scenario to make the support team's job easier, allowing them to focus on the real problem without any of our code or logic getting in the way. My next step was to create a new WCF service and deploy it to the Web App to make sure the Web App worked fine with just vanilla WCF. This worked fine.
Next I added the Service Bus NuGet package and all of the extensions that go in the config file. I didn't add the Service Bus endpoint yet, but methodically worked through adding a bit and then deploying it to see if it still worked. I added the behaviours and the configuration for the bindings, and this was all good with each deployment right up to the point where I added the endpoint. As soon as I added the endpoint and deployed it, we got the error message on the Web App. This clearly pointed the finger at the activation of the WCF endpoint within the Web App, and now that I had a very vanilla solution we could be 100% sure it was nothing to do with our code.
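The check after each deployment in a loop like this can be partly automated. Below is a minimal Python sketch, assuming the service page is reachable over HTTP and that the failure shows up as the error text quoted earlier in the response body (which in turn assumes detailed errors are visible to the caller). The names `has_binding_error` and `probe_service` are hypothetical helpers, not part of any Azure tooling.

```python
import urllib.error
import urllib.request

# Stable part of the error message quoted earlier in this post.
ERROR_SIGNATURE = "No protocol binding matches the given address"


def has_binding_error(body):
    """Return True if the response body contains the activation error text."""
    return ERROR_SIGNATURE in body


def probe_service(url, timeout=30):
    """Fetch the .svc page and report whether it shows the binding error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as error:
        # An error status still carries a response body we can inspect.
        body = error.read().decode("utf-8", errors="replace")
    return has_binding_error(body)
```

Calling `probe_service` with the address of the deployed `.svc` page after each deployment would then show at which step the error first appears.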
At this point I reached out on the MVP network to see if anyone else was having similar problems. I'd checked earlier in the day and knew no one was reporting App Service problems, and with our scenario being slightly niche I wanted to be more confident that it wasn't our mistake before escalating. As often happens when you reach out on the MVP network, one of the members of the product team offered to have a look, in this case Apurva Joshi (AJ). He suggested that a recent update to App Service, made to a seemingly unrelated feature, may have caused this problem as a side effect.
The update Microsoft had applied was to prevent random cold starts of the Web App after a slot swap. He suggested that adding the setting below to the Web App settings would revert to the original behaviour:
WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG = 0
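An app setting like this can be added in the portal or from a script. Below is a minimal Python sketch that builds and runs the Azure CLI command `az webapp config appsettings set`. It assumes the Azure CLI is installed and logged in. The resource group and app names are placeholders, and `build_command` and `apply_setting` are hypothetical helper names.

```python
import shutil
import subprocess

# The setting suggested by the product team, in KEY=VALUE form.
SETTING = "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=0"


def build_command(resource_group, app_name):
    """Build the Azure CLI argument list that applies the app setting."""
    return [
        "az", "webapp", "config", "appsettings", "set",
        "--resource-group", resource_group,
        "--name", app_name,
        "--settings", SETTING,
    ]


def apply_setting(resource_group, app_name):
    """Run the command; requires the Azure CLI on the PATH and a logged-in session."""
    az_path = shutil.which("az")
    if az_path is None:
        raise RuntimeError("Azure CLI (az) was not found on the PATH")
    command = build_command(resource_group, app_name)
    command[0] = az_path  # use the resolved path so az.cmd is found on Windows
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(" ".join(build_command("my-resource-group", "my-web-app")))
```

Note that changing an app setting restarts the Web App, so it is worth trying on a test environment first.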
I implemented this on our broken test environments and they started working. We were then able to implement the change on our production Web App and turn off the VMs, and that is working too. Really great help provided by AJ!
Based on the error and the solution, I think this error could affect you if you are using Azure App Service (Web Apps) together with WCF and a binding that is not out of the box, in particular the Service Bus WCF bindings. It does not seem to affect you if your component is a sender to the relay, only if it is a listener on it.
The scenario really reminded me of a blog post I wrote a few years ago called “Keeping up with the Jones’s in the Cloud”. The point of that post was that the more of the solution you offload to a vendor in the cloud, the less work you give yourself, but the trade-off is that things are changing in areas you may not be aware of or keeping up to date with.
In this case we had the following changes:
- The component was on .NET 4.5.2 whereas the latest is .NET 4.7.1, so it was, let's say, two minor versions behind. That said, the Microsoft website indicates that the support lifecycle for .NET 4.5.2 follows the Windows Server 2012 R2 support lifecycle, so it's still under mainstream support until late 2018. Then again, our code is running in the cloud and not directly on Windows
- The NuGet packages we had installed were not the latest versions, and in particular the Service Bus package was a good few releases behind, but it was still within one major release and it's backwards compatible anyway
- You could also argue we should switch to the new Relay client on NuGet, but at the same time that has only one release out of preview and not that many downloads
- Azure App Service is constantly evolving. When we first started using it in this scenario, things like non-.NET apps weren't really much of a thing for Web Apps
- Linux is now an option and wasn't before
- There are more networking features that didn’t used to be there
While all of these things external to our solution have been evolving and changing, there has been no business requirement to change our component, so it hasn't been updated or deployed for a while. This is one of the risks you need to accept and try to mitigate with the cloud. In our case we use an automated build and deployment process to try to mitigate this, but in this particular case the staggered rollout of the Azure patch meant that we didn't catch the error before it hit our production service.
One of the good things, however, is that App Insights provides cheap or free and very good monitoring, which alerted us to this issue as soon as it happened.
While we were able to work around and fix this problem very quickly, I thought the journey of troubleshooting the issue might be interesting to people. This is one of the harder parts of IT. I think having a structured approach to ruling out areas and finding evidence that helps you zone in on the likely causes of the error is a good way to work, so you can be effective even when the problem is complex and possibly outside of your control.
AJ has published details of the fix at this link if anyone needs it: https://social.msdn.microsoft.com/Forums/azure/en-US/652bbf10-4f82-4993-b9f3-76543cce7482/wcf-applications-on-azure-appservice-fail-with-error-no-protocol-binding-matches-the-given-address?forum=windowsazurewebsitespreview