how to troubleshoot peoplepicker lag with NetMon (Network Steve Forum)

We have a significant issue with delay when performing a peoplepicker find/browse operation to apply security to sites/objects in MOSS 2007. This has been a bit of a pain since launch (24+ months ago) in both the prod and dev systems, but previously it was only a problem for admins. As we open provisioning to our general population this is becoming a major headache: the first search operation can take upward of a minute and a half (1:30) to return results, while subsequent queries are very responsive.

I have already ruled out configuration at the peoplepicker property level (i.e. -pn peoplepicker-searchadforests -pv "forest:myforest;domain:mytrusteddomainpartners"); those options do not resolve the issue for me. Please, no more suggestions for doing that. :-) I have also used nltest.exe and ldp.exe to test connectivity to the GCs, and those look good as well. And I've also run the search operation while capturing traffic with NetMon 3.4.

So this is what I need help with:

(1) How do I read through the NetMon capture file to verify I don't have a DC/GC that is causing response issues (not seen with the LDP tool)?

(2) The whole thing feels very much like an ASP.NET warm-up issue. If I load the page where I need to assign perms (http://mossserver/site/subsite/_layouts/aclinv.aspx) and click the BROWSE icon, the interface loads and a find/search/browse operation takes f-o-r-e-v-e-r. If I immediately change my query string and search again, the results are quite quick and zippy, and they remain responsive if I navigate away, return to the page and perform another query. However, if I perform an iisreset, re-navigate to the permissions assignment page and perform another browse operation, it is back to square one with tremendous delay (followed by subsequent searches being speedy again). This sure makes it "feel" much more like an issue with some .NET code needing to be warmed up than with communications to my GCs/DCs. Thoughts? Help? Guidance?

Please do help me understand how to read the capture file in (1), as regardless of the outcome from (2) I want to ensure that any laggy DCs/GCs are addressed with my infrastructure team. Thanks.

July 9th, 2010 10:17pm

So you have already restricted the people picker to search for users in a specific OU in AD with stsadm -o setsiteuseraccountdirectorypath -path <name of OU> -url <URL name> (http://technet.microsoft.com/en-us/library/cc263328(office.12).aspx) and it did not solve the problem. With the network traffic captured in NetMon, you can compare it with the diagrams at the following URLs to find out which step causes the lag:
http://msdn.microsoft.com/en-us/library/dd357076(v=PROT.13).aspx
http://msdn.microsoft.com/en-us/library/dd303522(PROT.13).aspx

Gu Yuming
TechNet Subscriber Support in forum. If you have any feedback on our support, please contact tngfb@microsoft.com

July 12th, 2010 5:11am

BTW, I just came across this interesting small tool: http://research.microsoft.com/en-us/projects/tcpanalyzer/

Could you please upload your NetMon capture to the following URL?
Workspace URL: (https://sftasia.one.microsoft.com/ChooseTransfer.aspx?key=763240c4-c583-468f-b58d-0aa3db4aaafe)
Password: 1O[HVKfl_$54{k

July 12th, 2010 8:31am

We are unable to set the -o SetSiteUserAccountDirectoryPath property in stsadm as we have multiple root OUs from which we are pulling users. Additionally, we query multiple domains, due to two-way trusts from M&A activity. (FYI: the lag issue has been consistent since before any trusts were enabled, so we know the issue is not linked to setting the -pn peoplepicker-searchadforests property either, and testing in our dev environment has proven this to be a correct assumption.) I'm aware of the two diagrams depicting the communications that occur during a peoplepicker operation; however, matching those diagrams to the communications captured in NetMon is what I don't know how to do, and would very much like to understand and learn. :-)

July 12th, 2010 6:21pm

Gu Yuming, the capture file has been uploaded. The TCPAnalyzer tool looks quite interesting, but I still lack the knowledge to read the capture, so I am rather hamstrung in my ability to use this additional visualization of the data. Maybe this is off-forum for SharePoint, but is there some nice MS documentation for NetMon on "how to read a packet trace to troubleshoot network communications"? Thank you so much for the help.

July 12th, 2010 6:24pm

Got the capture you uploaded. This is a basic guide on how to read a TCP trace: http://support.microsoft.com/kb/169292/EN-US
And here is "Understanding LDAP": http://download.microsoft.com/download/3/d/3/3d32b0cd-581c-4574-8a27-67e89c206a54/uldap.doc

TCPAnalyzer is not designed for the small amount of capture data you sent me. The capture contains about 90 seconds of TCP communication. I reviewed the conversations associated with W3WP and found that most of the LDAP conversation happened in about the last 10 seconds; only a small amount happened in the first 3 seconds. What did you do in the SharePoint interface while this communication was being captured?

Could you please send me more captures to compare? For example, a capture taken when the peoplepicker browse responds quickly on the second query, and captures that include the SQL Server TDS packets, so that I can see the communication between the WFE and SQL Server described in the illustrative diagram at the URL I posted before. At the same time, I will try to get some captures on my site.
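
One way to make the "a little traffic at the start, most of it at the end" observation concrete is to look for the longest silence between consecutive frames. The following is only a minimal sketch: it assumes the frame time offsets for the W3WP conversations have already been copied out of the capture into a plain list of seconds, and the sample values are made up for illustration, not taken from the actual capture.

```python
def longest_idle_gap(frame_times):
    """Return (gap_seconds, start, end) for the longest silence between frames.

    frame_times is a list of frame time offsets in seconds (any order).
    """
    times = sorted(frame_times)
    best = (0.0, None, None)
    # Compare every frame with the one that follows it.
    for earlier, later in zip(times, times[1:]):
        gap = later - earlier
        if gap > best[0]:
            best = (gap, earlier, later)
    return best


# Hypothetical offsets: a burst in the first 3 seconds, then the last 10 seconds.
sample = [0.2, 0.9, 2.7, 81.4, 83.0, 88.5, 90.1]
gap, start, end = longest_idle_gap(sample)
print(f"longest silence: {gap:.1f}s between t={start}s and t={end}s")
```

The frames on either side of that gap are the ones worth reading closely: the last frame before the silence shows what the process was waiting on, and the first frame after it shows what finally answered.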

July 14th, 2010 10:46am

(1) Type LDAP in the display filter and apply it. You should be able to eyeball the time taken between each query and its response, but both NetMon and Wireshark have tools to get the latency between the request and the response.

I would also run through steps 1-12 of article 216498, "How to remove data in Active Directory after an unsuccessful domain controller demotion" (http://support.microsoft.com/default.aspx?scid=kb;EN-US;216498), to see whether a non-existent DC might be present and causing problems.

Also, if the organization is geographically dispersed, make sure you are using Active Directory sites to limit the use of remote DCs. Reference: the "Understanding LDAP" whitepaper, http://download.microsoft.com/download/3/d/3/3d32b0cd-581c-4574-8a27-67e89c206a54/uldap.doc

The relationship with Active Directory is where you should focus.

(2) You could try a warm-up script. See Joel Oleson's blog for an example: http://blogs.msdn.com/b/joelo/archive/2006/08/13/697044.aspx

Fred Ellis - MSFT
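
If eyeballing the filtered frames gets tedious, the request/response pairing in (1) can be done in a few lines of script. This is a rough sketch, not a NetMon feature: it assumes the LDAP frames have been copied out of the capture into rows with a time offset, a label for the TCP conversation, the server name, the LDAP message ID and a request/response flag. The field names and the servers "gc01" and "gc02" are hypothetical. It relies on the fact that an LDAP response carries the same message ID as its request on the same connection, and it measures the time to the first response only.

```python
from collections import defaultdict


def ldap_latency_by_server(rows):
    """Return {server: worst latency in seconds} from request to first response."""
    pending = {}                    # (conversation, message ID) -> request time
    latencies = defaultdict(list)   # server -> list of observed latencies
    for row in sorted(rows, key=lambda r: r["time"]):
        key = (row["conn"], row["msg_id"])
        if row["is_request"]:
            pending[key] = row["time"]
        elif key in pending:
            # First response for this request: record the latency and stop tracking it.
            latencies[row["server"]].append(row["time"] - pending.pop(key))
    return {server: max(values) for server, values in latencies.items()}


# Hypothetical rows: gc01 answers in half a second, gc02 takes 75 seconds.
rows = [
    {"time": 0.0, "conn": "c1", "server": "gc01", "msg_id": 1, "is_request": True},
    {"time": 0.5, "conn": "c1", "server": "gc01", "msg_id": 1, "is_request": False},
    {"time": 1.0, "conn": "c2", "server": "gc02", "msg_id": 1, "is_request": True},
    {"time": 76.0, "conn": "c2", "server": "gc02", "msg_id": 1, "is_request": False},
]
worst = ldap_latency_by_server(rows)
for server, seconds in sorted(worst.items()):
    print(f"{server}: worst LDAP response time {seconds:.1f}s")
```

A DC or GC that is slow to answer would stand out with a worst-case value far above the others. If every server answers quickly, the delay is probably being spent somewhere other than in the LDAP round trips.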

July 14th, 2010 10:25pm

The new capture has quite a bit more data to work with (170MB nicely zipped to 31MB) and has been posted back as before. Here's what I did to generate the trace.

//**
1. perform IISRESET on system
** [w3wp.exe appPool processes are PID:3048, PID:8264 and PID:8520] **
2. open root of site http://server.corp
3. go to SITE SETTINGS > ALL SITE SETTINGS > ADVANCED PERMISSIONS > ADD USER > click address book browse icon to start peoplepicker dialog box
4. [1:25:42] start NM capture
5. [1:28:10] click FIND for "smith" in "Select People and Groups" dialog box
6. [1:29:26] results returned to dialog box
7. [1:30:30] click FIND for "jones" in "Select People and Groups" dialog box
8. [1:30:32] results returned to dialog box
9. [1:33:30] click FIND for "garcia" in "Select People and Groups" dialog box
10. [1:33:32] results returned to dialog box
** [did not close browser window(s) or peoplepicker dialog box] **
11. [1:35:00] perform IISRESET on system
12. [1:35:35] http://server.corp w3wp.exe appPool process [new PID: 5292] re-starts automatically
13. [1:36:50] http://mysites.corp w3wp.exe appPool process [new PID: 8900] re-starts automatically
14. [1:37:26] http://server.corp:10000 (central administration) w3wp.exe appPool process [new PID: 9008] re-starts automatically
15. [1:42:00] click FIND for "smith" in "Select People and Groups" dialog box
16. [1:43:25] results returned to dialog box
17. [1:43:45] click FIND for "jones" in "Select People and Groups" dialog box
18. [1:43:47] results returned to dialog box
** [reverse the lookup just for good measure, showing appropriate lookup speed even on larger result set] **
19. [1:44:10] click FIND for "garcia" in "Select People and Groups" dialog box
20. [1:44:11] results returned to dialog box (17 matches)
21. [1:45:02] click FIND for "jones" in "Select People and Groups" dialog box
22. [1:45:08] results returned to dialog box (58 matches)
23. [1:45:40] click FIND for "smith" in "Select People and Groups" dialog box
24. [1:45:42] results returned to dialog box (62 matches)
25. stopped NM capture

Disclaimer: All the times above reference the Windows clock on the same system performing the capture and running the MOSS instance, but are simply to show the timeframes associated to the steps and do not necessarily precisely map to the traffic seen in the capture.
**//

There's plenty of communication between this dev WFE and the dev SQL instance in the capture. As I analyze this diagram (http://msdn.microsoft.com/en-us/library/dd357076(PROT.13).aspx) I don't see any relevant communications from the SQL instance to the DCs (or elsewhere) so I believe the capture here should suffice for reviewing what is happening between the WFE and SQL. If I have that incorrect please do let me know and I'll get my db team involved to also run a trace at the SQL instance while also running one at the WFE. Thanks.
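
For reference, the elapsed time of each FIND in the steps above can be worked out directly from the listed clock times. The snippet below is just that arithmetic, using the timestamps as posted. The labels are added here for readability, and the times carry the same caveat as the disclaimer above.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# (label, FIND clicked, results returned), taken from steps 5-10 and 15-24.
searches = [
    ("smith, first search after IISRESET", "1:28:10", "1:29:26"),
    ("jones", "1:30:30", "1:30:32"),
    ("garcia", "1:33:30", "1:33:32"),
    ("smith, first search after second IISRESET", "1:42:00", "1:43:25"),
    ("jones", "1:43:45", "1:43:47"),
    ("garcia", "1:44:10", "1:44:11"),
    ("jones", "1:45:02", "1:45:08"),
    ("smith", "1:45:40", "1:45:42"),
]


def elapsed_seconds(start, end):
    """Seconds between two clock times given as H:MM:SS strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


durations = [(label, elapsed_seconds(start, end)) for label, start, end in searches]
for label, seconds in durations:
    print(f"{seconds:3d}s  {label}")
```

The two searches that follow an IISRESET take 76 and 85 seconds; every other search comes back in 1 to 6 seconds.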

July 16th, 2010 1:18am

Both NLTEST and LDP have been used from the system to verify connectivity to the domain controllers and lookup to the correct site. All is good there. Additionally, AD's records for the DCs, GCs, etc. are all very clean, as no DC has ever been removed from the existing forest since it was brought up 5+ years ago. The production and development systems reside in the same site, and the site has more than one GC servicing it. Response time for all other GC and LDAP lookups is spectacular, and as can be seen with the second and later lookup attempts with the peoplepicker, lookups from these MOSS systems are otherwise good as well.

It so very much feels like a warmup-related issue, but... we're using warmup scripts in both our production and development environments, even warming up the SSPAdmin sites. To test again here I downloaded and ran the 'Application Pool Manager' from www.harbar.net to be able to fully control the warmups and more easily monitor which w3wp process belonged to which appPool. Even with granular control of warmups there is no difference. I have been so hoping it was as simple as "add the http://server/_layouts/bin/picker_warmup.aspx page to your warmup scripts". But no luck so far.

Thank you so much for the help. Lots of good ideas, and I'm slowly starting to understand what I'm looking at in NM.

July 16th, 2010 1:28am

This is looking like there is something specific to your particular environment causing this, which would need a more in-depth level of support in order to pursue. Please visit the link below to see the various paid support options that are available to better meet your needs.
Link: http://support.microsoft.com/default.aspx?id=fh;en-us;offerprophone

Fred Ellis - MSFT

July 20th, 2010 5:40pm

Oddly, that makes me feel better, knowing now that it's not something simple and obvious and we're just being thick-headed. Thanks Fred, and Gu Yuming.

July 20th, 2010 6:55pm

Hi, Ratfink, I built a one-way domain trust in my test environment and set the peoplepicker-searchadforests property, and I can reproduce the initial delay you described. I have not investigated the NetMon capture yet (lack of expertise in the authentication communication and LDAP). As Fred described, the "warm-up" or remedy for this problem may fall into the paid support category if this issue is urgent and important for you; otherwise, try another forum. Sorry.

July 21st, 2010 4:38am

Gu Yuming, YEH! Thanks for confirming you can reproduce the error. Again, it validates we're not just being silly here. We are opening a case with Premier today and will be including reference to this forum/posting as we do so. If needed will the Premier tech be able to contact you for any of your findings? Thanks again for all the help. - Michael

July 21st, 2010 9:21pm

Hi, Ratfink, I reproduced the phenomenon you described; however, I cannot call it an error at this point. It is outside my expertise. You cannot call the initial poor performance of web application loading an error or a bug, although it is something we should try to improve. We have many areas that need improvement, so prioritization comes into play. Thanks for your understanding.

As for the current process: if you open a case with Premier today, Premier support has no obligation to contact me or notify me of the result. They can contact me through Microsoft's internal communication system if they feel the need. They will log the case in Microsoft's internal knowledge base, and I can search it by keyword. It would be appreciated if you could share the improved "warm-up" script in the forum.

July 22nd, 2010 4:33am

Hi Ratfink, Just curious to know how did the support ticket with Microsoft go? Is the people picker resolving any faster on the first use now? I'm also in the same situation as you were so wondering if there were any findings from the MS call that you can share which would be helpful for us. I'm considering opening a ticket too - knowing your experience on this would be beneficial. Many thanks in advance! Blue BlueSky2010

October 8th, 2010 5:53pm

This topic is archived. No further replies will be accepted.