Busy TMG array, ways of tweaking it?

Hey,

I have 5 TMG servers in an array. One of them is the master. All servers are the same pizza boxes: quad-core CPU, 8 GB RAM, mirrored disks, two onboard NICs and a dual-port NIC. I have Exchange 2013 published there. There are 9000 users, and during peak hours each TMG has 3.2-3.5k connections (in the TMG dashboard). I'm also collecting performance counters with SCOM and I see ~12-15k active firewall connections:

Occasionally during peak hours I receive authentication errors: HTTP 10013 and 10048. They come and go. They last for 5-10 minutes and afterwards everything works just fine.

I noticed that the errors are produced by specific servers, not all of them, just 2 out of 5. Which ones changes over time, but it is always just one or two servers out of 5.

I looked at Firewall - Available Worker Threads counter:

During this specific period server 1 and server 2 were producing issues. It seems that server 1 had 270 workers available and server 2 ~40. There are no drops in worker availability, which is weird.

I'm not sure how to explain this: why is server 1 failing even though it had so many workers available, while server 2, with far fewer, is failing in the same way?

Is there any way of tweaking this array to make it better? Disable CRL checking? Anything else?

January 14th, 2014 9:45pm

Hi,

Thank you for your patience and support.

I am trying to involve someone familiar with this topic to further look at this issue. There might be some time delay. Appreciate your patience.

Thank you for your understanding and support.

Best Regards

Quan Gu

January 15th, 2014 2:49am

Mindaugas, It sounds like you may have a bottleneck with your Domain Controllers which would cause the authentication prompts. Check to see if Backlogged Packets rises during this time.

This link explains them.

http://www.isaserver.org/articles-tutorials/configuration-general/Measuring-System-Performance-Forefront-Threat-Management-Gateway-TMG-2010-Firewall-Part2.html

Let me know.

January 20th, 2014 8:54am

Hello,

Thank you for the idea! I spent a few days analyzing the performance counters. I see some spikes, and the spikes correlate with the authentication errors.

I have six domain controllers with 4 vCPUs each. I thought that was enough to support 9000 users. Perhaps I should add more? The domain controllers do not seem to be very busy.

I still don't understand why some TMG servers do not have any backlogged packets though:

TMG1:

TMG2:

The rest is in the next message.

January 22nd, 2014 3:28pm

TMG3:

TMG4:

More..

January 22nd, 2014 3:29pm

TMG5:

As you can see, only TMG2 and TMG3 have backlogged packets, and both have been recording auth errors. TMG1 doesn't have any, and TMG4 and TMG5 have only a few spikes through the day.

All TMGs are load balancing the traffic, so why do some have backlogged packets and some not? Any
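One way to put a number on "the spikes line up with the auth errors" is to export the Backlogged Packets samples and the error timestamps per server and count how many errors fall close to a spike. The following is a minimal sketch, assuming the data has already been exported from SCOM and the TMG logs into Python lists. The function names, the threshold of 10, the 5-minute window and all sample values are made up for illustration and are not taken from the charts in this thread.

```python
from datetime import datetime, timedelta


def spike_times(samples, threshold):
    """Return the timestamps at which the counter value exceeds the threshold."""
    return [ts for ts, value in samples if value > threshold]


def errors_near_spikes(error_times, spikes, window=timedelta(minutes=5)):
    """Return (matched, total): how many errors occurred within `window` of any spike."""
    matched = sum(
        1 for err in error_times
        if any(abs(err - spike) <= window for spike in spikes)
    )
    return matched, len(error_times)


# Hypothetical sample data: one Backlogged Packets sample every 5 minutes for an hour.
t0 = datetime(2014, 1, 22, 9, 0)
backlog = {
    "TMG1": [(t0 + timedelta(minutes=m), 0) for m in range(0, 60, 5)],
    "TMG2": [(t0 + timedelta(minutes=m), 40 if 20 <= m <= 30 else 0)
             for m in range(0, 60, 5)],
}
# Hypothetical timestamps of 10013/10048 errors per server.
errors = {
    "TMG1": [],
    "TMG2": [t0 + timedelta(minutes=22),
             t0 + timedelta(minutes=27),
             t0 + timedelta(minutes=50)],
}

report = {
    server: errors_near_spikes(errors[server],
                               spike_times(backlog[server], threshold=10))
    for server in backlog
}
print(report)
```

If most errors on a server sit inside the window around its spikes, the two are probably symptoms of the same stall. If a server shows errors with no nearby spike, the backlog is less likely to be the whole story.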

January 22nd, 2014 3:32pm

In addition here is a chart of available workers on all TMG servers during same time period:

As you can see, TMG2 and TMG3 have the most available workers (in the 300-350 range). That makes some sense to me: the servers that have more workers are handling more clients and therefore have more backlogged packets. But why do the other servers have fewer workers, then?

Is there a way to load balance worker availability?

January 22nd, 2014 3:40pm

Mindaugas, What I have found from experience is that TMG is usually a victim of something else (overwhelmed DCs, network problems, other network devices, etc). When TMG needs to authenticate every request (as it does in your case), any delay in that authentication will have dire consequences especially when you are talking about that many users. What version of Windows Server are your DCs? Are they all consistent or is there a mix of say 2003 and 2008? We find that often to be the problem because TMG will use various DCs depending on where they are in relation to TMG (site). DCs using 2003 had a known issue where the MaxConcurrentAPI often needed to be tweaked. What delegation method are you using on your publishing rules?

You may want to enable Netlogon logging on your DCs.

http://support.microsoft.com/kb/326040/en-us

http://support.microsoft.com/kb/2688798/en-us
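For reference, the second article (KB 2688798) describes sizing MaxConcurrentApi from the Netlogon performance counters. As far as I recall it, the suggested value is (semaphore acquires + semaphore timeouts) x average semaphore hold time / length of the collection period. Please verify the formula and the permitted range against the article before changing anything. Below is a minimal sketch of that arithmetic; the function name and the sample numbers are made up for illustration.

```python
import math


def suggested_max_concurrent_api(acquires, timeouts, avg_hold_seconds,
                                 collection_seconds):
    """Estimate MaxConcurrentApi from Netlogon counter deltas.

    Formula as recalled from KB 2688798 (an assumption, check the article):
    (acquires + timeouts) * average hold time / collection length,
    rounded up to a whole number and never below 1.
    """
    if collection_seconds <= 0:
        raise ValueError("collection_seconds must be positive")
    estimate = (acquires + timeouts) * avg_hold_seconds / collection_seconds
    return max(1, math.ceil(estimate))


# Hypothetical counter deltas collected over a 90-second peak window.
value = suggested_max_concurrent_api(
    acquires=8000, timeouts=200, avg_hold_seconds=0.5, collection_seconds=90
)
print(value)
```

A non-zero Semaphore Timeouts count during the error windows is the more important signal here. The computed value is only a starting point and has to stay within whatever limits the article gives for your Windows version.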

Keith

January 23rd, 2014 3:31pm

Hello,

Thank you for response.

TMG is running on Windows Server 2008 R2 and is up to date; TMG is at SP2 RU4. I already worked with Microsoft and implemented many tweaks: registry changes, including MaxConcurrentAPI. All network drivers are up to date as well.

AD is running on Windows Server 2012 R2, and the domain functional level is Windows Server 2012. I believe Windows 2012 DCs do not have these limitations and should perform well.

What I don't understand is why 2 of my servers have backlogged packets all the time and the others do not. One of the other three does not have any. Does it mean that only two servers are doing authentication?

It is a pretty standard TMG deployment. I used all the available guides, and Exchange 2013 is also published according to the available documentation. Previously there was Exchange 2010, and the issues existed then too.

Delegation mode is set to: Basic authentication for all Exchange rules: Outlook Anywhere, OWA and ActiveSync.

My customer is demanding a solution; they are even looking into purchasing something else. I just want to give TMG one more chance and somehow fix it.

Would adding more TMG servers help? I'm not sure. It seems that only two servers are participating in authentication.

January 23rd, 2014 4:21pm

Mindaugas, What are they using for load balancing? If you are using Microsoft NLB then it leads me to believe that the issue lies elsewhere. If you consistently see only 1 or 2 of the TMG servers having a delay it would likely not be too much traffic. The more likely culprit is that those 2 TMG Servers have a longer delay when talking to your DCs. Are there any network devices that could possibly be the culprit? Did you enable Netlogon logging on your DCs?

Are you seeing any events in TMG related to setting up a secure channel to domain controllers? They should show up in your System Event Log. Those are a good indication that there is a problem with DCs or network communications.

January 30th, 2014 2:42pm

Keith,

Yes, the load balancing mechanism is native Microsoft NLB. Over the years I have tried both unicast and multicast, with the same results. I had my customer replace the switches in front of the array as well.

I agree it is somehow related to the load. There are NO issues after hours; it runs quick and smooth. It is not tied to a particular TMG server either. As I mentioned, the issue "travels" around. For a few days it is TMG2 and TMG4 recording the problem, another day it is TMG3 and TMG4. A week later it is TMG5.
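To check how the problem moves between nodes, the daily error counts can be tallied per server. The following is a minimal sketch, assuming the 10013/10048 counts have been exported as (day, server, count) records. The function name, the cut-off of 100 errors and every number in the sample are hypothetical.

```python
from collections import defaultdict


def servers_with_errors_by_day(records, min_errors=100):
    """Map each day to the sorted servers that logged at least min_errors errors."""
    counts = defaultdict(lambda: defaultdict(int))
    for day, server, count in records:
        counts[day][server] += count
    return {
        day: sorted(s for s, total in per_server.items() if total >= min_errors)
        for day, per_server in sorted(counts.items())
    }


# Hypothetical export: (day, server, number of 10013/10048 errors).
records = [
    ("2014-01-27", "TMG2", 3100),
    ("2014-01-27", "TMG4", 2900),
    ("2014-01-27", "TMG1", 12),
    ("2014-01-28", "TMG3", 3500),
    ("2014-01-28", "TMG4", 2600),
    ("2014-02-03", "TMG5", 5200),
]

affected = servers_with_errors_by_day(records)
for day, servers in affected.items():
    print(day, servers)
```

Lining this table up against which DC each node was authenticating against on those days would show whether the problem follows a TMG node or a domain controller.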

I understand that it could be something else, but I'm not sure what. I'm trying to find the reason so I could fix it.

The AD infrastructure is strong: 6 DCs with 4 cores each. I think it should be enough to support 8k users, even if they have two Outlook clients running and multiple mobile devices.

Sure, I can enable Netlogon logging. Which performance counters should I monitor on the domain controllers to catch the delay?

Can I pin each TMG to a particular DC? I have different DNS settings on each TMG:

TMG1 - Primary DNS: DC1, Secondary: DC2

TMG2 - Primary DNS: DC2, Secondary: DC3

TMG3 - Primary DNS: DC3, Secondary: DC4

and so on. So for DNS requests each TMG should connect to different servers. As for authentication, I believe it uses its own mechanism for selecting the DC to authenticate against?

Some time ago I tried adding more TMGs to the pool. Adding TMG6 and/or TMG7 made the situation even worse. It looks like Microsoft NLB works best with only a few hosts, but in my case 4 hosts is the minimum, and with 5 TMG hosts it runs optimally. It still records 5-8k 10013 and 10048 errors every day during peak times. That means some users see an auth window once in a while, connections take longer to complete, and so on. The customer has learned to live with it, but they are looking at other products. I, on the other hand, want to keep TMG for another few years, even though it is end of life.

January 30th, 2014 3:30pm

I've been looking at various performance charts. It is good that I have SCOM in place, so I can analyze them.

What do you think about the following?

Here is a chart of received packets per second on the internal gateway interface:

Pretty chaotic..

And here is a chart of received packets per second on the external interface:

Pretty smooth. Same five TMG servers are shown here.

Should I be concerned about these charts?

January 31st, 2014 11:53pm

Mindaugas, Unfortunately there is only so much we can accomplish on an open forum. Your issue seems pretty complex and we would ultimately need a lot of data to get to the root of the issue. My suggestion to you would be to open up a support incident with Microsoft. Keith
February 5th, 2014 9:36am

