Hi all, <<this is a heavily edited post of a white paper style article I wrote up at www.simpleit.tumblr.com>> Have a look and let me know your thoughts.

I came across a really tough and intricate issue at a customer site the other day that caused me and my colleagues a lot of pain. The issue may or may not be unique to the specific environments listed below, but it is worth mentioning nonetheless.

The basic issue was that my customer was losing mail at 8:00 AM on Monday mornings, two weeks in a row. There were no event logs; the only diagnostic information we had was that we could no longer ping their NLB cluster IP address.

After the second week the failures started happening every day, then multiple times a day. As a Managed Service Provider we are there to restore service, and with email being so mission critical we really needed to get the service back up as quickly as possible, which made the issue even harder to troubleshoot. The only thing that would restore service was rebooting the server, after which the issue went away.

The trouble with having no diagnostic information is: where do you even start troubleshooting?

Before I go into the troubleshooting steps we took, consider this a very rough sketch of the “standard” highly available Microsoft Exchange 2010 environments we have put together to date.

The Exchange environment:

The customer runs Microsoft Exchange 2010 across four servers: two CAS servers and two Mailbox servers.
The two CAS servers are set up with Microsoft's Network Load Balancing (NLB), operating in unicast mode.
The two CAS servers sit behind one DNS record and NLB takes care of the rest.

This isn’t the full explanation, but the simple facts of the build are (there is a rough sketch with made-up addresses just after this list):

- Each server has a DNS entry for its primary Ethernet adapter
- EXCH_01 and EXCH_03 are part of an NLB cluster and have a dedicated NLB Ethernet adapter
- EXCH_01 and EXCH_03 share an IP address as the second IP on that adapter
- CUSTDNS01 holds the DNS record for EXCH_CLUSTER
- Priorities are set on the NLB cluster to determine which server takes the traffic
- The two Mailbox servers sit on a different subnet and the IP addressing is considered correct
- The two Mailbox servers sit in a Database Availability Group cluster; EXCH_01 is the witness
- The relay IP addresses are irrelevant in this scenario; ignore them
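
To make that list a little more concrete, here is a rough sketch of the layout as a Python dictionary. Every IP address in it is made up purely for illustration, and the Mailbox server names (EXCH_02, EXCH_04) are my own placeholders; only EXCH_01, EXCH_03, CUSTDNS01, and EXCH_CLUSTER come from the build above.

# Rough sketch of the build described above; every address here is hypothetical
topology = {
    "EXCH_01": {                        # CAS server, NLB member, DAG witness
        "primary_ip": "10.0.1.11",
        "nlb_adapter_ip": "10.0.1.21",
        "nlb_cluster_ip": "10.0.1.20",  # shared with EXCH_03 as the second IP on the NLB adapter
    },
    "EXCH_03": {                        # CAS server, NLB member
        "primary_ip": "10.0.1.13",
        "nlb_adapter_ip": "10.0.1.23",
        "nlb_cluster_ip": "10.0.1.20",
    },
    "EXCH_02": {"primary_ip": "10.0.2.12"},      # Mailbox server, DAG member (placeholder name)
    "EXCH_04": {"primary_ip": "10.0.2.14"},      # Mailbox server, DAG member (placeholder name)
    "CUSTDNS01": {"EXCH_CLUSTER": "10.0.1.20"},  # DNS record clients resolve for mail
}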

It of course gets way more complicated than that, but for a high-level overview of a “highly available Microsoft Exchange 2010” setup, this level of understanding will get you by for the time being.

Underlying Architecture:

The servers sit across VMware ESX 4.1 hosts inside an HP blade system.

Networking equipment:

For the sake of this post, the core switch is an HP ProCurve 8212ZL.

The Microsoft NLB configuration:

Once we suspected the NLB configuration was the issue, we investigated exactly how it was set up. It was set up correctly according to Microsoft; the cluster was operating in unicast mode. For information on why this is problematic, keep reading.

The problem:

When operating two Microsoft Exchange CAS servers in an NLB cluster set to UNICAST mode, you have to meet a set of preconditions if you are running them virtualised on VMware. This Exchange environment was running in UNICAST mode and Exchange was crashing badly: clients would lose connectivity to Exchange, and you could no longer ping the NLB cluster address.

When it was happening we initially thought it was a certificate issue; nothing was being logged in the event logs, and there was no apparent storm of traffic on their switches. When we started to suspect the NLB setup, we didn't know where to begin. The only thing we had was that mail went down and we could no longer ping the cluster address.

When we finally looked at the NLB in UNICAST mode, we found an article describing what is required for proper CAS operation in UNICAST mode: for UNICAST mode to be supported on VMware, both CAS servers need to be on the SAME ESX HOST! How stupid is that? An NLB cluster of two CAS servers should be separated onto multiple ESX hosts - that is the whole point of the redundancy. So we had to get rid of UNICAST mode on the NLB cluster.
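
If you want to sanity-check that precondition in your own environment, here is a minimal sketch of how you might do it with Python and the pyVmomi library (the vCenter address, credentials, and placement of the VM names are assumptions on my part); it simply prints which ESX host each CAS VM is currently running on.

# Minimal sketch: report which ESX host each CAS VM is running on (pyVmomi assumed; names are placeholders)
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab convenience only
si = SmartConnect(host="vcenter.example.local", user="admin", pwd="secret", sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.name in ("EXCH_01", "EXCH_03"):
            # For unicast NLB to be supported on VMware, both CAS VMs would need to report the same host here
            print(vm.name, "is running on", vm.runtime.host.name)
finally:
    Disconnect(si)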

If you want to configure NLB in unicast mode, or if you already have and are having trouble, see this article.

What we thought was the fix:

We changed the NLB cluster to MULTICAST. Lo and behold, mail appeared to work properly and was stable for two whole days! Problem solved, right? Wrong. Whilst it fixed mail, it broke everything else. The Samsung PABX system was crashing, and the VoIP phones, which have a built-in in/out switch to save network ports, were dropping out.

What about their wireless network? Forget it - dead.

So we started doing what any good networking guys do. IGMP turned on? Check. VLAN configuration correct? Check. Everything checked out the way it should. The switches and the network were reporting little or no load, yet it was clear the network was being flooded by multicast traffic. But why?

We thought changing the cluster to IGMP multicast would fix the issue, but it absolutely didn't; it made things worse. We had to make a choice: no mail, or certain endpoints (half the entire site) not getting anything.
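
For what it's worth, if you want to confirm that kind of flooding yourself, a quick and dirty sketch along these lines (assuming Python with the scapy library; the interface name is a placeholder) will count how many frames arriving on a port that should be quiet are destined for multicast MAC addresses.

# Rough sketch: count multicast-destined frames seen on a port that should be quiet (scapy assumed)
from collections import Counter
from scapy.all import sniff, Ether

counts = Counter()

def tally(pkt):
    if Ether in pkt:
        dst = pkt[Ether].dst
        # A destination MAC is multicast if the least significant bit of its first octet is 1
        if int(dst.split(":")[0], 16) & 1:
            counts[dst] += 1

sniff(iface="eth0", prn=tally, store=False, timeout=30)
print(counts.most_common(10))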

As it turns out:

The network was flooding with multicast traffic; Exchange worked, but everything else started to fail. The strange thing was that it wasn’t all endpoints that failed. When the phones failed, the customer told us they had a splitter in them, so at first we thought it was just bad equipment. It turns out those phones contain a dumb two-port switch (dumb in terms of no IGMP filtering capability and no manageability); the poor devices didn’t stand a chance.

The wireless failing was also unfortunate. The APs were on their own subnet, so they were “isolated”, but they tunnelled back to a single wireless controller on VLAN 1, where the Exchange servers sat, so they were subjected to the same multicast flooding as everything else.

This was also hard to diagnose because it raised the question: why didn’t the multicast traffic kill everything else on the network? It turned out all the other endpoints were simply better equipped to cope than the wireless N cards and the Samsung phones.

Anyway, we had to understand what was happening with the multicast traffic on this network, and why IGMP wasn’t working; it was meant to filter out this very issue.

The reason it wasn’t working turned out to be a very serious problem, one I believe could be solved if Microsoft, HP, and VMware (hell, even Cisco) acknowledged this as a cross-vendor issue and actually sat down over coffee to discuss how to fix it.

Introducing the technical problem

Now I am sure someone will read this and shake their head because their understanding of it is better than mine, but this is basically what is meant to happen.

NLB unicast and multicast operation relies heavily on MAC addresses and ARP requests. When a client says “I want mail - who out there can service this?”, an ARP request goes out for the cluster IP, and the MAC address that comes back in the reply determines how the traffic is delivered to the two front ends, which then work out between themselves who services the request.

This is where it gets really tricky

We had to ask ourselves why unicast mode wasn’t flooding the network (while still crashing Exchange) but multicast mode was. The answer was that the unicast traffic was hitting the network too, but the HP ProCurve switch was handling it properly; it was just that the preconditions for running unicast NLB on VMware were never met, so it would crash Exchange as we discussed earlier. The problem here, I am afraid, was a Microsoft one.

Unicast and Multicast traffic, as well as ARP requests, are governed by RFC standards.

Microsoft NLB in unicast mode gives the cluster a virtual MAC address that looks something like 02:##:##:##:##:##. On HP ProCurve switches with the latest firmware this is not a problem: with IGMP turned on and IP-MULTICAST-REPLIES set to enabled, the ProCurve switch would sort it out and everything would run stable.

To explain the IP-MULTICAST-REPLIES rule, just know that when it is enabled, an HP ProCurve switch will happily deal with MAC addresses in the following range:

00-00-00-00-00-00 to 02-FF-FF-FF-FF-FF

It is more complicated than that, but all you need to know is that the ProCurve could read and happily reply to ARP requests involving these addresses. In unicast mode, the traffic would not bring down the network for that one reason.
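
As a rough illustration (my own sketch, not anything from HP's documentation), this is how you might check whether a given cluster MAC falls inside that range, and whether it is formally a multicast MAC. The two example MACs are hypothetical, built from a made-up cluster IP of 10.0.1.20.

# Illustrative only: classify a cluster MAC against the range described above
def first_octet(mac: str) -> int:
    return int(mac.replace("-", ":").split(":")[0], 16)

def in_procurve_reply_range(mac: str) -> bool:
    # 00-00-00-00-00-00 through 02-FF-FF-FF-FF-FF, per the rule above
    return first_octet(mac) <= 0x02

def is_multicast_mac(mac: str) -> bool:
    # The I/G bit (least significant bit of the first octet) marks a multicast MAC
    return bool(first_octet(mac) & 1)

for mac in ("02-BF-0A-00-01-14", "03-BF-0A-00-01-14"):
    print(mac, "in range:", in_procurve_reply_range(mac), "multicast:", is_multicast_mac(mac))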

So what happened when we enabled MULTICAST on the NLB?

Well in Microsoft’s infinite wisdom they changed the NLB cluster MAC address to

03-##-##-##-##-##

Is this a valid multicast MAC? Yes it is.  

Is it suitable for the network equipment on the market? Not at all

The HP ProCurve switch simply couldn’t respond to traffic for that address, so it would flood whatever VLAN the traffic was hitting. All the posts online told us to either change the MAC address when NLB was set to multicast mode, or separate Exchange onto a different VLAN.
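
For context, that MAC isn't random. As I understand Microsoft's documentation, NLB derives the cluster MAC from the cluster IP address, with the leading octets depending on the mode. Here is a small sketch of that derivation (the cluster IP 10.0.1.20 is made up for illustration):

# Sketch of how NLB derives its cluster MAC from the cluster IP, as I understand it (not official code)
def nlb_cluster_mac(cluster_ip: str, mode: str) -> str:
    octets = [int(o) for o in cluster_ip.split(".")]
    if mode == "unicast":        # 02-BF-<ip octets>
        parts = [0x02, 0xBF] + octets
    elif mode == "multicast":    # 03-BF-<ip octets>  <- the prefix the ProCurve couldn't answer for
        parts = [0x03, 0xBF] + octets
    elif mode == "igmp":         # 01-00-5E-7F-<last two ip octets>
        parts = [0x01, 0x00, 0x5E, 0x7F] + octets[2:]
    else:
        raise ValueError("mode must be unicast, multicast, or igmp")
    return "-".join(f"{p:02X}" for p in parts)

for mode in ("unicast", "multicast", "igmp"):
    print(mode, nlb_cluster_mac("10.0.1.20", mode))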

The solution

There wasn't an easy viable solution. 

We ended up deleting the NLB cluster altogether and retargeting DNS at a single Microsoft Exchange front end Client Access Server. We had to lose redundancy to gain stability, which is extremely unfortunate. There were a few proposals floating around on the internet to fix this issue, but they all either required scheduling significant downtime or simply weren’t viable.

One post said that changing the NLB multicast MAC address to one the HP switches could handle would work, but it seemed like too much of a band-aid fix.

The other option was to add a static ARP entry to the switch for the NLB multicast MAC address. This wasn’t an option because it can’t be done on HP ProCurve switches; it seems we drew the short straw choosing ProCurve here. Cisco devices don’t suffer the same fate.

Another option was to scrap the NLB cluster altogether and use hardware clustering. This is something we still might do, but we need time and money to figure it all out and do it properly; time is something we don’t have, and as an MSP I really don’t want $20,000 of hardware bills going to my customers where we can avoid it.

The last option, and the only really viable one, is to separate their Microsoft Exchange environment onto its own VLAN, recreate the cluster in multicast mode, and just let it flood its own segment. The backbone of the network on the Exchange side is 10 Gb anyway, so it should suffice. I am worried, though, that as the multicast traffic floods that link, buffers will overflow and it will just take longer to fail, but fail it will anyway.

Why this doesn’t sit comfortably with me

1. The idea of band-aid fixing an entire production environment doesn’t sit well with me

2. Microsoft and HP ProCurve don't appear to have spoken to each other about a known issue

3. There are MANY customers out there we now have to go back and check, hoping this isn’t an issue for them

4. The Exchange environment had been running stable in unicast mode, even in an unsupported configuration, until recently. What changed? We still don't know. Our only sound theory was a Windows patch. The only Exchange patch was SP1 Update Rollup 5, and apparently that made no change to DAG or NLB operations. Was it a VMware patch? I certainly didn't apply one. Was it HP ProCurve firmware? We certainly didn't apply any.

Conclusion:

I really can’t come to any conclusion other than: be careful. Hopefully, if this starts happening to you, or by the time you read this article, there are some ideas here you can take away from it. If you came across this article from my blog and have any questions, post comments. If you believe anything here is factually incorrect, please do drop me a message; I never set out to deliberately mislead an entire community, and I certainly am not professing to be an expert on everything here.

There are some more articles that helped me understand all of this; see them below. This was one of the harder, stranger jobs I have worked on!

Here are some links to help out

Links: 

Microsoft NLB and Procurve switches 1

Microsoft NLB and Procurve switches 2

HP Procurve firmware for the 8212ZL switches

Microsoft NLB in a VMware environment

Microsoft NLB in Unicast mode on VMware breaks

Exchange update rollup 5 for Service Pack 1 release notes

A really cool blog with some useful Exchange setups

A Wikipedia article explaining multicast addresses - really useful

A really cool website where you use your mouse to move a spider over the screen

 

 


