Around 450 RemoteApp Sessions limit on Windows Server 2012 R2 Remote Desktop Services Session Host without apparent bottleneck - RDS Broker scalability issue

Hi

We have an RDS farm with the following set-up (using Windows Server 2012 R2) to serve RemoteApps to our clients:

  • Two RDS Gateways
  • Four Session Hosts (24 physical processors and 512 GB RAM each)
  • User profile disks enabled
  • One RDS Licensing Server
  • Two RDS Broker Servers

The problem we are facing is that there seems to be a "magic number" of about 450 connections (fluctuating between 445 and 455) per Session Host.

Once this number is reached users start to report:

  • General session slowness (slow update of Remote App window contents)
  • Some users are unable to log in to their (new) session
  • Some users are connecting but presented with "empty" screen
  • Some users are getting (randomly?) disconnected 

When the issue happens, based on performance counters, CPU usage is around 30% and about 200 GB of RAM is free.

Processor Queue Length during the day is mostly below 2; about 30% of the time it rises intermittently (not consistently) to as high as 6, and about 1.5% of the time it exceeds 10. There is no continuous queue build-up, so our understanding is that this is not a CPU/RAM limitation.
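A breakdown like the one above can be reproduced from raw counter samples. Here is a minimal Python sketch, assuming the Processor Queue Length samples have been exported as a plain list of numbers. The `samples` values are made up for illustration (not measured data), and the boundaries 2 and 10 simply mirror the ranges quoted above.

```python
def queue_length_breakdown(samples, low=2, high=10):
    """Share of samples below `low`, between `low` and `high`, and above `high`."""
    if not samples:
        raise ValueError("no samples")
    total = len(samples)
    below = sum(1 for s in samples if s < low)
    above = sum(1 for s in samples if s > high)
    middle = total - below - above
    return {
        "below_low": below / total,
        "low_to_high": middle / total,
        "above_high": above / total,
    }


def longest_run_above(samples, threshold):
    """Length of the longest run of consecutive samples above `threshold`."""
    longest = current = 0
    for s in samples:
        current = current + 1 if s > threshold else 0
        longest = max(longest, current)
    return longest


# Hypothetical samples for illustration only.
samples = [0, 1, 1, 4, 0, 6, 1, 12, 0, 1]
print(queue_length_breakdown(samples))
print(longest_run_above(samples, 2))
```

A short longest run is one way to back up the "no continuous queue build-up" observation: spikes that do not persist across consecutive samples.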

To my knowledge, no limit on the number of concurrent sessions was configured on the Session Hosts on the software side. A review of the Application, System, and RdpCoreTs logs does not show anything suspicious at the time the limit is hit; the errors and warnings in the event logs do not correlate with the timing of the problem.

We've been investigating this issue for several weeks now and it's still unclear what could cause such a limitation. Maybe someone has experienced similar issues.

Any suggestions are welcome.



  • Edited by Igor Malin Friday, August 14, 2015 4:05 AM updated title
August 9th, 2015 11:38pm

Interesting observation:

After about ~300 sessions, the Task Manager UI degrades and looks like this:

Notice the fonts. Also taskmgr is very slow. If performance view is opened, the performance graph is either not drawn/"stuck" or rendered with artifacts:

Not sure if it's related to the experienced problem, or maybe Task Manager just cannot handle the increased number of open processes...


  • Edited by Igor Malin Monday, August 10, 2015 1:16 AM added screenshot
August 10th, 2015 12:20am

Hi,

1. Please confirm that when the issue occurs it happens on all RDSH servers at the same time.  This is a very key point.

2. What is the status of the RD Gateway servers when the issue occurs?

3. How often does the problem occur?  Daily?  The reason I ask is if you are seeing it frequently that is preferable because if you could show me when it is actually happening it is more likely I could narrow down the source.

If you would like you may contact me via email.  We could arrange a time for a brief Skype call and go over your configuration and the troubleshooting steps you have taken thus far in more detail.

Thanks.

-TP


August 10th, 2015 4:37am

Thanks everyone for sharing their experience on this.

@ TP and Jim:

Yesterday I had a good test run with WordPad as a remote app, just to put our app out of the equation.

The following is the behavior we observe: 

  1. There was a threshold (depends on the time of day :-\) after which clients were (intermittently) unable to establish new connections
  2. On Client side, behaviour for new sessions when limit was reached was:
    1. mstsc screen was active aka working (green "bar" running)
    2. remote app did not appear
    3. connection was eventually reset/lost (e.g. mstsc window just disappears)
    4. there were no error messages shown
  3. On Server side (Session Host), when the limit was reached:
    1. The number of connections was fluctuating (clients were still trying to establish new sessions). Sample history of the connection count: 501, 499, 496, 505, 504, 501
  4. It seemed that connections were being established but then reset
  5. The following messages appeared in the Session Host event logs at the moment the sessions were dropped:
    1. Application Log:
      1. Event 9009: The Desktop Window Manager has exited with code (0xd00002fe)
    2. System Log:
      1. Event 7002: User Logoff Notification for Customer Experience Improvement Program
    3. RdpCoreTs:
      1. Session XYZ has been disconnected, reason code 11
      2. Session XYZ has been disconnected, reason code 12
        1. https://msdn.microsoft.com/en-us/library/aa381339(v=vs.85).aspx
  6. Sometimes I saw the session log on and then log off (via the expanded client RDP screen) without reaching the RemoteApp
  7. Netsh trace analysis on the server side shows that the client drops/resets the connection some 30-40 seconds after establishing it
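To see which disconnect reason codes dominate when the limit is hit, the messages can be tallied. This is a minimal Python sketch, assuming the event messages have been exported to plain text lines. The export format and the `sample_lines` values are assumptions for illustration; only the message wording is taken from the log entries quoted above.

```python
import re
from collections import Counter

# Matches the message text quoted above; anything around it is ignored.
DISCONNECT_RE = re.compile(
    r"Session (\S+) has been disconnected, reason code (\d+)"
)


def tally_reason_codes(lines):
    """Count disconnect messages per reason code."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Hypothetical exported lines for illustration only.
sample_lines = [
    "Session 101 has been disconnected, reason code 11",
    "Session 102 has been disconnected, reason code 12",
    "Session 103 has been disconnected, reason code 12",
    "The Desktop Window Manager has exited with code (0xd00002fe)",
]
print(tally_reason_codes(sample_lines))
```

Running the same tally on logs from a quiet period and from a period at the limit would show whether the mix of reason codes actually changes.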

Important point: The observed behavior started to appear after UPDs were enabled on the test setup. This is a suspicion at this moment and requires an additional test run, which I will do today. I will share my results when done.

@ Jim:

Our understanding is that there's a set of scalability issues with RDS which result in overall "sloppiness" under load. There was a problem with SQL locks (probably as you described, but I don't have the exhaustive details atm) under an increased rate of new connections, which I think we were able to work around. Our goal is to fit ~3500 simultaneous users onto as few boxes as possible. The "RDP Performance Guidelines" from MS do say, I think, that the number of users you can fit does not grow linearly with the hardware. But then again you can have Hyper-V VMs on those powerful boxes... It would be good to have a quick chat (it seems we're after the same thing: higher user density on Session Hosts). If you're interested, you can reach me at: igor.malin [-at-] wisetechglobal.com.

And yes, we do see the issue mainly at peak logon/logoff times.

I'll keep you guys posted on the outcome of this.



  • Edited by Igor Malin Wednesday, August 12, 2015 10:15 PM
August 12th, 2015 10:09pm

Yesterday's "WordPad" test (with UPDs enabled, test without them is pending):

Graph is from the Session Host under test

The aim of the test was to push the number of sessions to 1000. It struggled :) And we saw connection resets...

  • Edited by Igor Malin Thursday, August 13, 2015 5:37 AM
August 13th, 2015 1:18am

To further narrow down the culprit, we ran a test WITHOUT the RD Session Broker, and the results were:

We were able to reach/attain 1000 simultaneous user sessions.
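For a rough sense of what this difference means for our goal of ~3500 simultaneous users, here is a small Python sketch of the arithmetic. It uses only figures from this thread (the ~450 ceiling seen with the Broker and the 1000 sessions reached without it) and deliberately ignores failover headroom, so treat it as a lower bound rather than a sizing recommendation. The function name is just illustrative.

```python
import math


def hosts_needed(total_users, sessions_per_host):
    """Minimum number of Session Hosts, with no failover headroom."""
    return math.ceil(total_users / sessions_per_host)


TARGET_USERS = 3500  # goal stated earlier in the thread

# ~450 sessions per host: the ceiling observed with the Broker in place.
print(hosts_needed(TARGET_USERS, 450))

# 1000 sessions per host: reached in the test without the Broker.
print(hosts_needed(TARGET_USERS, 1000))
```

In other words, the ceiling roughly doubles the number of boxes needed for the same user count.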

Can someone from Microsoft RDS Team comment on the observed behavior?


  • Edited by Igor Malin Friday, August 14, 2015 5:05 AM
August 14th, 2015 4:04am

BTW we previously suspected the RDS issues to be bound to the number of open handles. But during testing we were able to push the number of handles up to 5 million and the system was still functioning. Also, if you use "Process Controller" or "Process Hacker" after the Task Manager threshold is hit, you get similar results, e.g. rendering problems.
  • Edited by Igor Malin Friday, August 14, 2015 10:35 AM
August 14th, 2015 10:35am

More answers:

1. We are not using anything specific. As advised by our IT department, when a connection goes through the 2012 R2 RDS Gateway, the gateway chooses the Session Host to connect to (if no specific host was specified by the client).

2. Thanks! D'you know what happens if (by any means) the Broker is unable to service the RDSH request?

3. Some data from event logs you mentioned:

TerminalServices-RemoteConnectionManager - sometimes messages such as: Remote Desktop Services has taken too long to load the user configuration from server X for user Y. This happens regardless of whether we hit the limit.

TerminalServices-LocalSessionManager - a lot of "Session XYZ has been disconnected, reason code 12"

User Profile Service - Received user logoff notification on session XYZ. AND Finished processing user logoff notification on session XYZ.

4. I can share my testing scripts if needed.

5. There was another issue related to UPDLOCKS on the RD Broker database. It was apparent under increased login pressure. We were able to mitigate it by changing some RD Broker database stored procedures. I didn't notice this problem when testing in the clean environment.

The reason we are investigating the problem is that we want clarity about what is causing bottlenecks in our environment and clarity on the reliability of the components of our cloud offering. Managing failover capacity is understandable and we are not planning to load up our boxes to 100%, but it is unclear why the machine cannot accept any more sessions with CPU 30% busy and 60% of RAM (300 GB) free.

As for other scalability issues: yes, indeed, multiple different factors may contribute to scalability, and our investigation was meant to find them. The clean testing showed that the presence of the RD Broker correlates with the session limit appearing; without the RD Broker the problem goes away. RDSH and RD Broker are on the same machine in the testing environment.

Yes, I agree, virtualization is indeed a possible solution, but it doesn't answer the question of why the limit apparently exists. Accepting that would mean that for any production use, just to be safe, we should pay the cost of a hypervisor/VMs because the products may not handle load above some magic "N". And this is on top of the RDS team stating that the only limit to RDS is your hardware. :)



  • Edited by Igor Malin Monday, August 17, 2015 1:47 AM
August 17th, 2015 1:07am

Just had a production issue affecting 1500 users due to this (RDS Broker passing away under load).

It was caused by an increased number of incoming connections.

Now the connections are just getting dropped and we haven't reached even 50% of our regional capacity.

Not good.

We are moving away from using RDS Brokers ASAP.


  • Edited by Igor Malin Monday, August 24, 2015 2:29 AM
August 24th, 2015 2:01am

This is what happens when the user login rate increases and goes over some magical "X/sec" threshold (the graph represents the number of active sessions on the Session Hosts):

Users just cannot get in or are getting disconnected, while the farm is well under capacity.
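To put a number on the login rate at which this starts, the peak rate can be estimated from logon timestamps. This is a minimal Python sketch, assuming logon times are available as seconds since some start point. The `timestamps` values and the 10-second window are arbitrary choices for illustration; the actual "X/sec" threshold is not known from our data.

```python
from collections import deque


def peak_logon_rate(timestamps, window_seconds=10.0):
    """Peak logons per second over any sliding window of `window_seconds`."""
    window = deque()
    peak = 0
    for t in sorted(timestamps):
        window.append(t)
        # Drop logons that fall outside the window ending at `t`.
        while t - window[0] >= window_seconds:
            window.popleft()
        peak = max(peak, len(window))
    return peak / window_seconds


# Hypothetical logon times in seconds, for illustration only.
timestamps = [0, 1, 2, 2.5, 3, 20, 21]
print(peak_logon_rate(timestamps))
```

Comparing the peak rate on days with and without the issue would help confirm whether the rate, rather than the session count, is the trigger.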


  • Edited by Igor Malin Monday, August 24, 2015 9:41 PM
August 24th, 2015 9:40pm

Hi,

Do you have more details on your production issue?

I did some brief tests.  I used 2 VMs, first with AD DS (1,000 test accounts), RDS (RDCB, RDSH, RDWeb, RDL) with Notepad published as RemoteApp, second for test client. 

1st test run, with local profiles being created, set to connect new session every 3 seconds:

2nd test run, local profiles pre-existing, set to connect new session every 1 second:

-TP

August 26th, 2015 2:59am

