Service Bus HA with paired namespaces

Monday, August 29, 2016

Azure Service Bus is one of the oldest cloud services on Azure. As any service living in the cloud, it grows, iterates, and changes. Among various features that the service has, there's one that could deserve a little more attention: paired namespaces.

An application communicating over Azure Service Bus is usually utilizing a namespace. Services and sometimes even entire regions can go down. That makes a single namespace a single point of failure. A possible solution is to enable high availability of the service. This is where Service Bus client can help. It can do so by using two namespaces rather than one. A primary and a secondary namespaces. The client is using both namespaces to achieve high availability. The feature is called PairedNamespaces.

A quick recap of how it works

The client is normally sending to a primary namespace. In case primary namespace is no longer available, Azure Service Bus client will automatically switch to the secondary namespace. While sending messages to the secondary namespace, the client will also continue pinging the primary namespace behind the scenes to see if the primary namespace availability has changed and it's up again. Ping messages are short-lived messages configured to self destruct after 1 second. While primary namespace is not available, the sender will be sending the messages to the secondary namespace. The secondary namespace is solely used as storage until the primary namespace is back online. Once the primary namespace is back, messages from the secondary (temporary storage) namespace will be moved to the primary namespace.

The secondary namespace has a very rigid structure. The structure is always a queue with the following path <primary-namespacename>/x-servicebus-transfer/<backlog-queue-index>

enter image description here

Backlog queues are used to hold the messages and number of the queues is defined when paired namespaces are configured using MessagingFactory.PairNamespaceAsync(SendAvailabilityPairedNamespaceOptions) method.

In short, the primary namespace goes down; no message is lost, and all send operations are redirected to the secondary namespace. Once the primary namespace is back, all backed up messages are moved from the secondary to the primary namespace, to the appropriate queues/topics. To determine the right destination, each message is stamped with a x-ms-path custom property, indicating the destination entity for the message.

enter image description here

Except there are a few small details that need to be taken into consideration.

To test a failover scenario, the primary namespace has to be down. There's no API failover exposed. Therefore a workaround is to trick the client to think that namespace is down by pointing the primary namespace domain to some invalid IP address using the hosts files. According to documentation:

"Failures that trigger the failover timer are any MessagingException in which the IsTransient property is false or a System.TimeoutException.".

Unfortunately, this is not what's happening. During a send attempt, once primary namespace was not available and failoverInterval time elapsed, MessagingCommunicationException was thrown by the ASB client. Surprisingly, the exception is marked as transient. To get the code working and verify that failover took place, the message had to be retired. Once the send operation was retried, ASB client uses the secondary namespace, and everything worked as expected. It's unfortunate that the exception is marked as transient, where it shouldn't be. On the positive side, it's not a bad idea to retry the operation, even if default retry policy does it already. The retry will ensure the failed send operation gets executed against the secondary namespace.

Sample sender code

    static async Task MainAsync()
    { 
        var connectionString1 = "Endpoint=sb://primary-pairednamespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=[key]";
        var namespaceManager1 = NamespaceManager.CreateFromConnectionString(connectionString1);
        var messagingFactory1 = MessagingFactory.CreateFromConnectionString(connectionString1);

        var connectionString2 = "Endpoint=sb://secondary-pairednamespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=[key]";
        var namespaceManager2 = NamespaceManager.CreateFromConnectionString(connectionString2);
        var messagingFactory2 = MessagingFactory.CreateFromConnectionString(connectionString2);

        if (!await namespaceManager1.QueueExistsAsync("testing"))
        {
            await namespaceManager1.CreateQueueAsync("testing");
        }

        var sendAvailabilityPairedNamespaceOptions = new SendAvailabilityPairedNamespaceOptions(
            secondaryNamespaceManager:namespaceManager2,
            messagingFactory: messagingFactory2,
            backlogQueueCount: 3,
            failoverInterval: TimeSpan.FromSeconds(10),
            enableSyphon: true);

        await messagingFactory1.PairNamespaceAsync(sendAvailabilityPairedNamespaceOptions);

        var sender1 = await messagingFactory1.CreateMessageSenderAsync("testing");
        var receiver1 = await messagingFactory1.CreateMessageReceiverAsync("testing");
        var messageId = 1;

        while (true)
        {
            try
            {
                var message = new BrokeredMessage("testing") {MessageId = messageId++.ToString()};
                // Set a breakpoint here and modify hosts file to contain "8.8.8.8 primary-pairednamespace.servicebus.windows.net"
                await sender1.SendAsync(message);
                Console.WriteLine(".");
            }
            catch (Exception e)
            {
                var me = (MessagingException) e;
                Console.WriteLine(e.GetType());
                Console.WriteLine(me.Detail.ErrorCode);
                Console.WriteLine(me.Detail.ErrorLevel);
                Console.WriteLine(me.Detail.Message);
                Console.WriteLine(me.IsTransient);
            }

            await Task.Delay(2000);
        }
    }

Conclusions

The feature was documented and API was detailed. However, it is still not commonly used. Can't be right. How come a feature, allowing HA out of the box is so neglected? Would be a killer feature if not a few caveats.

###Failover API and testability

The failover API was baked into the native client and was not exposed. Changing hosts file is far from an ideal way of verifying the feature works.

###Feature Design

Lack of explicit contract when failover is about to happen. General MessagingCommunicationException could be replaced with a more intention revealing exception that would not require relying on the transient nature of the exception. Not to mention that in the test above it turned to be not quite as expected.

Additional substantial factor has to do with the fact that paired namespaces were designed with sender and receiver as two different parties. I.e., sender application vs receiver application where each part would perform one type of operation, but not both. To demonstrate what I'm talking about, add a receiving operation to the sender application. While failover will take place and send operation will be successful, the receive operations would fail. This is due to the fact that this is not an active-active HA, and entities required by the receiver only exist in the primary namespace and not the secondary namespace. Could have been addressed by the ASB client if receive operations would just return null messages when requested to fetch and the client was in failover mode. But again, design decisions are hard to get right. Especially on the first time.

###Cost

Each send operation would become several operations, tripling the cost of a message when failover is occurring:

Send to the secondary namespace to store when the primary namespace is down.
Retrieve from the secondary namespace when the primary namespace is back.
Re-send to the primary namespace.

Let's be honest, what's a higher cost when you can still send messages that are vital to your business? An acceptable cost.

Summary

If you require high availability today and can partition your application act as solely sender or receiver, paired namespaces feature could be a good option to use. If that's not your scenario, you could implement a more of an active-active approach, where entities are created and maintained on one or more namespace.

Credits

Special thank you To John Taubensee for help with testing this feature.

Hi,

Thanks for the article, it suits exactly the kind of scenario I'm testing at the moment.

What do you mean by "entities required by the receiver only exist in the primary namespace and not the secondary namespace"?

Also, about the three operations per message in case of a failover, I got the point for the first two: send to the secondary namespace, copy it back to the primary when it becomes available again, but what do yo mean by "Re-send to the primary namespace"? Are you talking about the same message?

Horia - Monday, January 16, 2017 2:52:30 AM

What I mean is that during failover, your application will be only able to send, but not receive. If you have receivers set up, they won't work, because non of the entities exist in the secondary namespace. Secondary namespace is solely used as a storage during failover.

When you get an exception while trying to send to the primary namespace, you'd have to re-send the same message after handling the exception to ensure it's stored on the secondary namespace. Once the primary ns is available, you don't need to do anything.

Frankly, I'd advise not to rely on this feature. It's very problematic and design is brittle. If you really need HA, perhaps NServiceBus with ASB could be an option: https://docs.particular.net/nservicebus/azure-service-bus/multiple-namespaces-support

sfeldman - Monday, January 16, 2017 1:49:47 PM

Excellent and very useful article as we're just looking at HA for Service Bus

Steve Culshaw - Wednesday, April 12, 2017 10:08:05 AM

@steve this HA method is only good when your endpoint is a sending only endpoint and the drawbacks I've outlined could be substantial. Alternative method would be to use multiple namespaces.

Sean Feldman - Sunday, April 16, 2017 6:39:20 PM

A quick recap of how it works

Sample sender code

Conclusions

Summary

Credits

4 Comments