Hey,
I have 5 TMG servers in an array. One of them is the master. All servers are the same pizza boxes: quad-core CPU, 8 GB RAM, mirrored disks, two onboard NICs and a dual-port NIC. I have Exchange 2013 published there. There are 9000 users, and during peak hours each TMG has 3.2-3.5k connections (in the TMG dashboard). I'm also collecting performance counters with SCOM, and I see ~12-15k active firewall connections.
Occasionally, during peak hours, I receive authentication errors: HTTP 10013 and 10048. They come and go: they last for 5-10 minutes, and afterwards everything works just fine.
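For reference, the two numbers match standard Winsock error codes. Below is a small Python lookup, just a sketch for annotating log lines. It assumes (I have not confirmed this) that the codes TMG logs are Winsock codes; the function name `describe_winsock_error` is made up for illustration.

```python
# Standard Winsock error codes and their documented meanings.
# Assumption: the numbers in the TMG log are Winsock codes.
WINSOCK_ERRORS = {
    10013: ("WSAEACCES",
            "Permission denied: an attempt was made to access a socket "
            "in a way forbidden by its access permissions"),
    10048: ("WSAEADDRINUSE",
            "Address already in use: only one usage of each socket address "
            "(protocol/network address/port) is normally permitted"),
}


def describe_winsock_error(code: int) -> str:
    """Return a readable description for a code, or mark it as unknown."""
    name, meaning = WINSOCK_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} {name}: {meaning}"


# Print the two codes seen during peak hours.
for code in (10013, 10048):
    print(describe_winsock_error(code))
```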
I noticed that the errors are produced by specific servers, not all of them: only one or two out of the 5 at any time, and which ones they are changes over time.
I looked at the Firewall - Available Worker Threads counter.
During this specific period, server 1 and server 2 were producing errors. It seems that server 1 had 270 workers available and server 2 ~40. There were no drops in worker availability, which is weird.
I'm not sure how to explain why server 1 is failing even though it had so many workers available, and why server 2 is failing in the same way.
Is there any way to tweak this array to make it perform better? Disable CRL checking? Anything else?