Hi all, <<this is a heavily edited version of a white-paper-style article I put up at www.simpleit.tumblr.com>> Have a look and let me know your thoughts. I came across a really tough and intricate issue at a customer site the other day that caused me and my colleagues a lot of pain. The issue may or may not be unique to the specific environment described below, but it is worth mentioning nonetheless.
The basic issue was that my customer was losing mail at 8:00 AM on Monday mornings, two weeks running. There were no event logs; the only diagnostic information we had was that we could no longer ping their NLB cluster IP address.
After the second week of failures, the failures started happening every day, then multiple times a day. As a Managed Service Provider, we are there to restore service; unfortunately, with email being so mission critical, we really needed to get the service back up as quickly as possible, which made the issue even harder to troubleshoot. The only thing we could do to restore service was reboot the server, and the issue went away.
The trouble with having no diagnostic information is: where do you start troubleshooting?
Before I go on about the troubleshooting steps we took, consider the following a very rough sketch of the "standard" highly available Microsoft Exchange 2010 environment we would have put together to date.
I have a customer with Microsoft Exchange 2010 on 4 servers: 2 CAS servers and 2 Mailbox servers. The two CAS servers are set up with Microsoft's Network Load Balancer, operating in Unicast mode. The two CAS servers were served by one DNS record, and NLB took care of the load balancing. This isn't the full explanation, but the simple facts of the build are:
- Each server has a DNS entry for its primary Ethernet adapter
- EXCH_01 and EXCH_03 are part of an NLB cluster and have a dedicated NLB Ethernet adapter
- EXCH_01 and EXCH_03 share an IP address as the second IP on their NLB Ethernet adapters
- CUSTDNS01 holds the DNS record for EXCH_CLUSTER
- Priority exists on the NLB cluster for which server takes the traffic
- The two mailbox servers sit on a different subnet, with their own IP addressing
- The two mailbox servers sit in a database availability group (DAG) cluster
- The relay IP addresses are irrelevant in this scenario; ignore them
It of course gets way more complicated than that, but for a high-level overview of a "highly available Microsoft Exchange 2010" setup, this level of understanding will get you by for the time being.
The servers sit across VMware ESX 4.1 hosts inside an HP blade system. For the sake of this post, the core switch is an HP ProCurve 8212zl.
The Microsoft NLB
Once we suspected the NLB configuration was the issue, we investigated exactly how it was set up. It was set up correctly according to Microsoft; the cluster mode was set to Unicast. For information on why this is problematic, keep reading.
When operating two Microsoft Exchange CAS servers in an NLB cluster set to UNICAST mode, you have to meet a set of preconditions if you are running them virtualised on VMware. This Microsoft Exchange environment was set to run in UNICAST mode, and Exchange was crashing badly. Clients would lose connectivity to Exchange, and you could no longer ping the NLB cluster IP address.
When it was happening, we initially thought it was a certificate issue; nothing was being logged in the event logs, and there was no apparent storm of traffic on their switches. When we started to suspect the NLB setup, we didn't know where to start. The only thing we had was that mail went down and we could no longer ping the cluster.
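Looking back, even a dumb poller would have helped us put timestamps against the outages instead of waiting for the phone to ring. Something like this rough Python sketch is about all we really had to go on anyway; the cluster address is a made-up placeholder, and you would run it from a workstation on the same subnet:

```python
import subprocess
import time
from datetime import datetime

CLUSTER_IP = "10.0.0.50"  # hypothetical NLB cluster IP; substitute your own

def cluster_alive() -> bool:
    """Send one ICMP echo to the cluster IP; True if it answers."""
    # "-n 1" is the Windows ping count flag; use "-c 1" on Linux/macOS.
    result = subprocess.run(
        ["ping", "-n", "1", CLUSTER_IP],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

was_up = True
while True:
    up = cluster_alive()
    if up != was_up:  # only log transitions, not every poll
        state = "REACHABLE" if up else "UNREACHABLE"
        print(f"{datetime.now().isoformat()} cluster {CLUSTER_IP} is {state}")
        was_up = up
    time.sleep(30)  # poll every 30 seconds
```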
When we finally looked at the NLB in UNICAST mode, we found an article describing the requirements for proper CAS operation in UNICAST mode. For UNICAST mode to be supported on VMware, both CAS servers needed to be on the SAME ESX host! How silly is that? An NLB cluster of two CAS servers should be separated onto multiple ESX hosts; that is the whole point of the redundancy. So we had to get rid of UNICAST mode on the NLB cluster.
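For what it's worth, this condition is easy to check before it bites you. We did it by eye in vCenter, but as an illustration (the vCenter address, credentials, and VM names below are placeholders, not our customer's), a quick pyVmomi script along these lines could tell you whether your two CAS guests have landed on the same ESX host:

```python
# Sketch: check whether two CAS VMs share an ESX host (pyVmomi).
# All connection details and VM names are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab only; verify TLS in production
si = SmartConnect(host="vcenter.example.local", user="readonly",
                  pwd="secret", sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    # Map each CAS VM to the ESX host it currently runs on.
    cas_vms = {vm.name: vm.runtime.host.name
               for vm in view.view if vm.name in ("EXCH_01", "EXCH_03")}
    print(cas_vms)
    if len(set(cas_vms.values())) == 1:
        print("Both CAS servers on the same ESX host: unicast NLB supported, "
              "but no host redundancy.")
    else:
        print("CAS servers on different hosts: redundant, but unicast NLB "
              "is unsupported here.")
finally:
    Disconnect(si)
```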
If you want to configure NLB in unicast mode, or if you already have and are having trouble, see this article.
What we thought was the fix:
We changed the NLB cluster to MULTICAST. Lo and behold, mail appeared to work properly and was stable for two whole days! Problem solved, right? Wrong. While it fixed mail, it broke everything else. The Samsung PABX system was crashing, and the VoIP phones, which had an in/out switch to save network ports, were dropping off the network.
What about their wireless network? Forget it - dead.
So we started doing what any good networking guys do. IGMP turned on? Check. VLAN configuration correct? Check. Everything checked out the way it should. The switches and the network were reporting little or no load, but it was clear the network was being flooded by multicast traffic; we just couldn't see why.
We thought changing the cluster to IGMP multicast would fix the issue, but it absolutely didn't; it made things worse. We had to make a choice: no mail, or select endpoints (half the entire site) not getting anything.
As it turns out:
The network was flooding with multicast traffic; Exchange worked, but everything else started to fail. The strange thing was that it wasn't all endpoints that started to fail. When the phones failed, the customer told us they had a splitter in them, so we thought it was just bad equipment. It turns out those phones contain a dumb two-port switch (dumb in terms of no IGMP filtering capability and no manageability); the poor devices didn't stand a chance. The wireless network failing was also unfortunate. The APs were on their own subnet, so they were "isolated", but they tunnelled back to a single wireless controller on VLAN 1, where the Exchange servers sat, so they were subjected to the same multicast flooding as everything else.
It was also hard to diagnose because we had to ask: why didn't the multicast traffic kill everything else on the network? It turned out all the other endpoints were better equipped to cope than the Wireless-N cards and the Samsung phones.
Anyway, we had to understand what was happening with the multicast traffic on this network, and why IGMP snooping wasn't working; it is meant to filter out this very issue.
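A bit of background on why IGMP snooping normally copes: standard IPv4 multicast groups map onto a well-known Ethernet range, 01:00:5e plus the low 23 bits of the group address (RFC 1112), and that range is what snooping switches are built to track. A few lines of Python show the standard mapping:

```python
import ipaddress

def multicast_ip_to_mac(group: str) -> str:
    """Map an IPv4 multicast group to its IANA Ethernet MAC (RFC 1112).

    The MAC is 01:00:5e followed by the low 23 bits of the group address,
    the range IGMP-snooping switches know how to track.
    """
    ip = ipaddress.IPv4Address(group)
    if not ip.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    low23 = int(ip) & 0x7FFFFF  # keep the low 23 bits
    return "01:00:5e:{:02x}:{:02x}:{:02x}".format(
        (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

print(multicast_ip_to_mac("239.255.255.250"))  # -> 01:00:5e:7f:ff:fa
```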
The reason it wasn't working turns out to be a very serious problem, one I believe could be solved if Microsoft, HP, VMware, hell, even Cisco, acknowledged this as a cross-vendor issue and actually sat down over coffee to discuss how to fix it.
Now I am sure someone will read this and shake their head because their understanding of it is better than mine, but this is basically what is meant to happen.
NLB unicast and multicast traffic relies heavily on MAC addresses and ARP requests. When a client says "I want mail; who out there can serve it?", ARP requests go out to work out who is most suitable to service the request between the two front ends.
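If you want to see that dance for yourself, a packet capture on the CAS subnet will show which MAC answers ARP for the cluster IP. Here is a rough scapy sketch of the idea (the cluster IP is a placeholder; you need scapy installed, admin rights, and a NIC on that subnet):

```python
# Sketch: watch which MAC answers ARP for the NLB cluster IP.
from scapy.all import sniff, ARP

CLUSTER_IP = "10.0.0.50"  # placeholder NLB cluster IP

def show_arp(pkt):
    arp = pkt[ARP]
    if arp.op == 2 and arp.psrc == CLUSTER_IP:  # op 2 = ARP reply ("is-at")
        # In NLB multicast mode this hardware address starts with 03:bf,
        # which is exactly what upsets switches expecting a unicast MAC.
        print(f"{arp.psrc} is-at {arp.hwsrc}")

sniff(filter="arp", prn=show_arp, store=False)
```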
This is where it gets really tricky
We had to ask ourselves why unicast traffic wasn't flooding the network (while still crashing Exchange) but multicast traffic was. The answer was that unicast traffic was flooding the network too; the HP ProCurve switch was just doing its thing properly. The real problem was that the preconditions for using unicast mode on VMware were never met, so it would crash Exchange, as we discussed earlier. The problem here, I am afraid, was a Microsoft one. Multicast traffic, as well as ARP requests, is governed by RFC standards.
The Microsoft NLB unicast traffic actually uses a multicast MAC address. That multicast address looks something like 01:##:##:##:##:##. On HP ProCurve switches with the latest firmware, this is not a problem: with IGMP turned on and IP-MULTICAST-REPLIES set to enabled, the HP ProCurve switch would sort it out and everything would run stable.
To explain the IP-MULTICAST-REPLIES rule, just know that when it is enabled, an HP ProCurve switch will happily observe, and reply to, ARP requests involving these multicast MAC addresses. It is more complicated than that, but all you need to know is that the ProCurve could read and happily reply to these ARP requests. In unicast mode, the traffic would not bring down the network for this one reason.
So what happened when we enabled MULTICAST on the NLB? In Microsoft's infinite wisdom, they changed the NLB cluster MAC address to one that looks like 03:bf:##:##:##:##.
Is this a valid multicast MAC? Yes, it is. Is it suitable for the network equipment on the market? Not at all. It is a multicast MAC that sits outside the well-known 01:00:5e range that switches map IP multicast onto, and the RFCs even tell devices not to trust an ARP reply that maps a unicast IP to a multicast MAC, which is exactly what NLB multicast mode asks of them.
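To make both answers concrete, here is a sketch of how NLB derives its cluster MAC from the cluster IP in each mode (the 02:bf, 03:bf, and 01:00:5e:7f prefixes are the documented NLB conventions; the IP below is a placeholder), along with the test for the multicast bit, the low-order bit of the first octet:

```python
import ipaddress

def nlb_cluster_macs(cluster_ip: str) -> dict:
    """Derive the MAC NLB uses for each cluster mode from the cluster IP.

    Unicast: 02:bf + all four IP octets (locally administered, unicast).
    Multicast: 03:bf + all four IP octets (multicast bit set, but NOT in
        the IANA 01:00:5e range that switches map IP multicast onto).
    IGMP multicast: 01:00:5e:7f + the last two IP octets.
    """
    o = ipaddress.IPv4Address(cluster_ip).packed
    return {
        "unicast": "02:bf:" + ":".join(f"{b:02x}" for b in o),
        "multicast": "03:bf:" + ":".join(f"{b:02x}" for b in o),
        "igmp_multicast": "01:00:5e:7f:" + ":".join(f"{b:02x}" for b in o[2:]),
    }

def is_multicast_mac(mac: str) -> bool:
    """A MAC is multicast if the least-significant bit of octet 0 is set."""
    return int(mac.split(":")[0], 16) & 0x01 == 1

for mode, mac in nlb_cluster_macs("10.0.0.50").items():  # placeholder IP
    print(f"{mode:>15}: {mac}  multicast bit: {is_multicast_mac(mac)}")
```

Note that the multicast-mode MAC has the multicast bit set but sits nowhere near the 01:00:5e range, which is why a snooping switch has no idea what to do with it.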
The HP ProCurve switch simply couldn't respond to the traffic, and it would flood whatever VLAN the traffic was hitting. All the posts online told us to either change the MAC address when NLB was set to multicast mode, or separate Exchange onto its own VLAN. There wasn't an easy, viable solution.
We ended up deleting the NLB cluster altogether and retargeting DNS at a single Microsoft Exchange front-end Client Access Server. We had to lose redundancy to gain stability, which is extremely unfortunate. There were a few proposals floating around on the internet to fix this issue, but they all required scheduling significant downtime, or they simply weren't viable.
One post said that changing the NLB multicast MAC address to one the HP switches could handle would work, but it seemed like too much of a band-aid.
The other option was to add a static ARP entry on the switch for the NLB multicast MAC address. This wasn't an option because it can't be done on HP ProCurve switches. It seems our luck ran out in choosing HP ProCurve switches here; Cisco devices don't suffer the same fate.
Another option was to scrap the NLB cluster altogether and use a hardware load balancer. This is something we still might do, but we need time and money to figure it all out and do it properly. Time is something we don't have, and as an MSP I really don't want $20,000 of hardware bills going to my customers where we can avoid it.
The last option, and the only viable one, is to separate their Microsoft Exchange environment onto its own VLAN, recreate the cluster in multicast mode, and just let it flood its own segment. The backbone of the network on the Exchange side is 10 Gb anyway, so it should suffice. I am worried, though, that as the multicast traffic floods that link, buffers will overflow and it will just take longer to fail, but fail anyway.
Why this doesn't sit comfortably with me
1. The idea of band-aid fixing an entire production environment doesn't sit well with me.
2. Microsoft and HP ProCurve haven't spoken to each other about a known issue.
3. There are MANY customers out there we now have to look at and hope it isn't an issue for.
4. The Exchange environment had been running stable in Unicast mode, even in an unsupported configuration, until recently. What changed? We still don't know. Our only sound theory was a Windows patch. The only Exchange patch was SP1 Update Rollup 5, and apparently that made no change to DAG or NLB operations. Was it a VMware patch? I certainly didn't apply one. Was it HP ProCurve firmware? We certainly didn't apply it.
I really can't come to any conclusion other than: be careful. Hopefully, by the time this happens to you, or you read this article because you have started to experience the issue first hand, there are some ideas you can take away from it. If you came across this article from my blog and have any questions, post comments. If you believe anything here is factually incorrect, please do drop me a message. I never set out to deliberately mislead an entire community, and I certainly am not professing to be an expert on everything here.
There are some more articles that helped me understand all of this; see them below. This was one of the harder, stranger jobs I have worked on! Here are some links to help out:
Microsoft NLB and ProCurve switches 1
Microsoft NLB and ProCurve switches 2
Firmware for the 8212zl switches
Microsoft NLB in a VMware environment
Microsoft NLB in Unicast mode on VMware breaks
Update Rollup 5 for Exchange 2010 Service Pack 1 release notes
A really cool blog with some useful Exchange setups
Wikipedia article explaining multicast addresses - really useful
A really cool website where you use your mouse to move a spider over the screen