Server 2012 R2 File Server Stops Responding to SMB Connections

Hi There,

Massive shot in the dark here but I am struggling with a pretty major issue atm.  We have a production file server that is hosted on the following:

Dell MD 3220i -> iSCSI -> Server 2008 R2 Hyper-v Cluster -> Passthrough Disk -> Server 2012 R2 File Server VM

Three times now, roughly a month or so apart, the file server has stopped accepting connections.  During this time, the server is perfectly accessible through RDP and responds to a simple ping.  I can browse the files on the server directly, but no one appears to be able to access the shares over SMB.  A reboot of the server fixes the issue.

As per a KB article, I removed NOD antivirus from the server after the second fault to rule out a conflicting filter driver.  Sadly, it happened again yesterday.

The only relevant errors in the server's log files are:

SMB Server Event ID 551

SMB Session Authentication Failure

Client Name: \\192.168.105.79
Client Address: 192.168.105.79:50774
User Name: HHS\H6-08$
Session ID: 0xFFFFFFFFFFFFFFFF
Status: Insufficient server resources exist to complete the request. (0xC0000205)

Guidance:

You should expect this error when attempting to connect to shares using incorrect credentials.

This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.

This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled

and

SMB Server event ID 1020
File system operation has taken longer than expected.

Client Name: \\192.168.105.97
Client Address: 192.168.105.97:49571
User Name: HHS\12J.Champion
Session ID: 0x2C07B40004A5
Share Name: \\*\Subjects
File Name: 
Command: 5
Duration (in milliseconds): 176784
Warning Threshold (in milliseconds): 120000

Guidance:

The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.
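For reference, the hexadecimal values in the Status lines of these events are NTSTATUS codes. Below is a minimal Python sketch for splitting one into its fields. The lookup table only covers the two codes quoted in this thread, and the helper name `describe_ntstatus` is made up for illustration.

```python
# Symbolic names for the two NTSTATUS values quoted in this thread.
# This table is deliberately tiny; it is not a complete list.
KNOWN_STATUS = {
    0xC0000205: "STATUS_INSUFF_SERVER_RESOURCES",
    0xC000006D: "STATUS_LOGON_FAILURE",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def describe_ntstatus(value):
    """Split a 32-bit NTSTATUS into name, severity, facility and code."""
    return {
        "name": KNOWN_STATUS.get(value, "unknown"),
        "severity": SEVERITY[(value >> 30) & 0x3],  # top two bits
        "facility": (value >> 16) & 0x0FFF,         # bits 16-27
        "code": value & 0xFFFF,                     # low 16 bits
    }


if __name__ == "__main__":
    for status in (0xC0000205, 0xC000006D):
        print(hex(status), describe_ntstatus(status))
```

Note that the Status on the 551 event above (insufficient server resources) does not match the generic guidance text about bad credentials, so the Status line is probably the more specific clue of the two.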

I have checked the underlying disk, iSCSI, network and Hyper-V cluster for any other errors or issues, but as far as I can tell everything is fine.

Is it possible that something else is left over from the NOD antivirus installation?  

Looking for suggestions on how to troubleshoot this further.

Thanks


December 11th, 2013 2:14pm

Check the cabling, the switch, and the number of dropped packets. In general, you have a network failure. Do a stress test, find the faulty component and replace it. Also make sure you deploy NTttcp and iPerf, as your config needs to reach close to wire speed for TCP.

P.S. Run multiple physical networks and all the components at least duplicated for production. 

December 11th, 2013 4:22pm

Hi There,

Thanks for the quick lesson on iSCSI best practice.  As stated, I have already checked the underlying storage, networking, iSCSI, MPIO etc. and there are no problems at all.  The same iSCSI/cluster has been running production VMs for 4 years now without any issues.

I find it weird that when the SMB service manages to get locked up like this, I can still browse the files fine on the server.  That would rule out any underlying physical storage issue, surely?

One theory I had is that the cause could be the use of an iSCSI pass-through disk from the 2008 R2 host to the 2012 R2 guest.  This is the only thing unique to this VM; all other guest VMs use VHD files on CSVs.


December 12th, 2013 8:44am

Hello Dan,
 
Thank you for your question.

I am trying to involve someone familiar with this topic to further look at this issue. There might be some time delay. Appreciate your patience.

December 16th, 2013 7:51am

Hi Dan,

When the issue occurs again, please restart the Server service to test whether that resolves the issue. We have to verify whether the Server service is corrupted at that moment.

Regards,

December 16th, 2013 9:02am

Thanks for the responses.  

I will try restarting the Server service next time it occurs.  Sadly, as this only happens once a month or so, it may be a while until the condition occurs again.

Just for some extra background on the server and its setup:

  • The pass-through disk is configured as a single ntfs volume at around 9TB in size.
  • The volume is presented as a drive letter and then each share (around 8-10 of them) is a subfolder on the disk.
  • The volume does have de-duplication enabled.  It's currently 6TB deduped down to 3TB.
  • The server was upgraded from Server 2012 to Server 2012 R2 before being deployed in production.

As a long term solution, it may just be a case of building a fresh server to move over to.

I'll message again next time it happens.


December 16th, 2013 10:13am

Hi, can you also check the memory usage of the virtual server? I have seen a similar issue where the memory was full and I had to use a Sysinternals tool to clear the memory down. It could be a memory leak issue.

In our case this was to do with the backup software we were using, not a native Windows issue.

December 16th, 2013 4:28pm

Hi, 

How's everything going? If you need any further assistance, please do not hesitate to respond back.

Regards,
December 27th, 2013 4:03am

Hi There,

I checked the graphs that we log for RAM, CPU etc. and nothing was out of the ordinary at all the last time it failed.  The VM has 8 GB of static RAM and it was only using around 2.5 GB.

At present it has not yet failed again, so until it does I'm just waiting.  I will post up as much info as possible once that happens.

December 27th, 2013 10:55am

Hi Dan,

I'd be very interested if you have found a solution to this, as it sounds like we are having the exact same problem with 2 of our file servers.  Both are Server 2012 (we haven't upgraded them to R2 yet) and are virtual machines accessing storage with a pass-through fibre connection.  As in your case, when the problem occurs the servers are completely responsive, pingable and can be connected to over RDP, where I can access the storage directly with no problem.  One of the servers has an application providing AFP support for our Apple Macs, and the Macs continued to access their home directories when this happened, so it's definitely not storage related; only clients that are connecting via SMB are affected.  Also, a quick server reboot fixes the problem.  I will try restarting the Server service when it next happens.

Annoyingly there is nothing in the logs around the time that this happens so the error messages you've posted may not be related to the problem?!

regards

January 15th, 2014 5:07pm

Hi Ray,

We've yet to see this issue again, but equally we have not done anything to fix it.  For us, I must stress that it would only happen once a month, if that, which makes it very hard to diagnose.

Interesting you didn't see anything in your logs.  It could very well be a different issue or mine are unrelated as you said!

January 15th, 2014 6:30pm

Hi Dan,

Yeah we upgraded to 2012 in September and have had this happen only once on each of our 2 file servers, the second occasion being yesterday (hence why I was looking on the internet as when something happens twice it ceases to be a fluke!).  If it happens again and I find out any more I will let you know.

January 16th, 2014 9:02am

Hello All,

I too am experiencing this issue on the latest and "greatest" Windows Server OS. I have tested this on 2012 and 2012 R2 and experienced the issue on both builds. I am running the servers on Hyper-V 2012 R2 and have SR-IOV enabled on the server NICs to rule out the Microsoft Hyper-V networking stack (although this did occur with the VMQ-enabled NICs too).  Today I made one change and I will see if it helps: I removed any hidden NICs from Device Manager. Please keep me posted if you make any progress on resolving this issue on your servers.

Thank you,

Fred

January 20th, 2014 2:53pm

New update! It happened again!

So this is the fourth confirmed case now.  Being a little more clued up I observed the following this time:

  • Random clients were disconnected or could not connect.  Others were still connected fine.
  • No errors were being logged in the event log.
  • No storage or cluster errors were apparent.
  • Tried restarting the Server service.  It failed to restart and just hung at "stopping".  After telling it to stop, a lot of new messages were logged.
  • Being in production, I had to restart the server to get our files working again.  Much as I would love to pore over it and troubleshoot for a few hours, my phone wouldn't stop ringing.
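For anyone wanting to record whether the Server service really is stuck in "stopping" before rebooting, here is a rough Python sketch that wraps the built-in `sc query` command. It assumes the Server service's short name is `LanmanServer` and that Python is available on the file server; the function names are invented for this example.

```python
import re
import subprocess


def parse_sc_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOP_PENDING) from `sc query` text."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def server_service_state(service="LanmanServer"):
    """Run `sc query` locally; returns None if the command is unavailable."""
    try:
        result = subprocess.run(
            ["sc", "query", service], capture_output=True, text=True, check=False
        )
    except OSError:
        return None  # not on Windows, or sc.exe not on the PATH
    return parse_sc_state(result.stdout)


if __name__ == "__main__":
    # A STOP_PENDING state that never changes would match the hang described above.
    print(server_service_state())
```

Logging this value every few seconds while the stop is in progress would show whether the service ever leaves STOP_PENDING.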

The new error message:

Event ID 2012 - Source: srv

While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration.  The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="srv" />
    <EventID Qualifiers="32768">2012</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-01-22T13:32:18.553037300Z" />
    <EventRecordID>91405</EventRecordID>
    <Channel>System</Channel>
    <Computer>FS-02.HHS.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\LanmanServer</Data>
    <Binary>0000040001002C0000000000DC07008000000000840100C0000000000000000000000000000000004F060000</Binary>
  </EventData>
</Event>
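The event text says the error status code is contained in the returned data. As a rough sketch, assuming the Binary payload is a sequence of little-endian 32-bit words (an assumption about the layout, not something stated in the event), the Python below splits the hex string from the event above and picks out any word that looks like an NTSTATUS error. The function names are invented for illustration.

```python
import struct

# Hex payload copied from the Binary element of the event above,
# written as 32-bit groups for readability.
EVENT_2012_BINARY = (
    "00000400" "01002C00" "00000000" "DC070080" "00000000" "840100C0"
    "00000000" "00000000" "00000000" "00000000" "4F060000"
)


def little_endian_dwords(hex_text):
    """Return the payload as a list of unsigned 32-bit little-endian words."""
    raw = bytes.fromhex(hex_text)
    usable = len(raw) - (len(raw) % 4)  # ignore any trailing partial word
    return [struct.unpack_from("<I", raw, off)[0] for off in range(0, usable, 4)]


def error_statuses(words):
    """Keep words whose top two bits are both set (NTSTATUS error severity)."""
    return [w for w in words if (w >> 30) == 0x3]


if __name__ == "__main__":
    words = little_endian_dwords(EVENT_2012_BINARY)
    print([hex(w) for w in words])
    print("error-severity words:", [hex(w) for w in error_statuses(words)])
```

Under that assumption, the only error-severity word is 0xC0000184, which corresponds to STATUS_INVALID_DEVICE_STATE, and the low 16 bits of 0x800007DC are the event ID (0x07DC is 2012).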

I'm going to have to build a new server now just to rule it out.  This is such a pain for us and is really knocking our confidence in the new 2012 R2 OS.

As a side note, we do have de-duplication enabled on the volume, I wonder if other people are in the same boat?

January 22nd, 2014 2:09pm

Greetings,

Finally found this thread after weeks of searching... I am having the same or similar issue.  Granted mine is just an enthusiast home setup, but here's what I'm seeing:

  • Originally was running Hyper-V Server R2
  • One of the guest OSs (also Win2k12 R2) was a file server with a pass-through 15TB array on an Areca 1280ML
  • The host VM disk is formatted NTFS, and the 15TB passthrough volume is ReFS.
  • All Intel NICs, though at this point they are probably 2-5 years old.
  • Supermicro X8STE with Xeon W3520 and 24 GB non-ECC RAM.
  • Netgear GS748Tv3 Switch
  • Many configurations of NIC Teaming, and even straight host serving of the files through single NIC.

Over the last several months I have ruled out everything I can think of, except for the server.  Since it's really only me using the servers, I'll mostly notice it when streaming content... I'll get a win32 I/O #59 error, suggesting a network failure.  It happens sporadically, but usually once an hour, but on occasion I won't see the issue for many hours.  Then, in Event Viewer, I see the 551 error described above:

SMB Session Authentication Failure

Client Name: \\192.168.0.11
Client Address: 192.168.0.11:50758
User Name:
Session ID: 0x240048000015
Status: The attempted logon is invalid. This is either due to a bad username or authentication information. (0xC000006D)

Guidance:

You should expect this error when attempting to connect to shares using incorrect credentials.

This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.

This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled

Things I've tried which seem to suggest an issue with the OS:

  • Wired up the server to a separate (cheap) switch directly with my client.  Problem was reproduced.
  • Reconfigured NIC teaming in every combination available, including disabling it.  Problem was reproduced.
  • Copied over a large library of streaming content to a Windows Standard R2 guest OS that is being  hosted on ESX 5.5.  Problem was NOT reproduced after 24 hours of testing (suggesting everything works fine when the host OS is not Windows).  The other box is a very similar setup hardware wise, minus the large storage.  The other box is also connected to the same Netgear switch.
  • Let's see... I also tried streaming music content from the guest file server TO a guest Windows 8.1 client both hosted on the same box.  Problem was reproduced (I was very surprised by this since my understanding is that it's the virtual switch that would have been doing the talking between the two).

I've read articles about how some of my NICs (like the 82574L), while supported in-box, have been found to have issues and can no longer have drivers written for them because of updated WHQL standards... but my test which reproduced the error on the virtual switch seems to disprove any relationship to the physical NICs.

It's truly to the point where I'm considering moving this machine to ESX as well.  However, I'd really prefer to stick with what I've got, as I'm tired of working on it.  I'll be bookmarking this and will be MORE THAN HAPPY to provide any additional details the community might need.

Thank you for your time.

John



January 25th, 2014 5:37am

I should also note that I didn't have this problem with Windows Server 2012.  I did a little more reading around this morning, and it sounds like others are suggesting it's an issue in SMB 3.02, which R2 uses that previous releases didn't.

I just happened across that tidbit this morning and thought I'd share.

January 25th, 2014 4:48pm

To test the theory, I reinstalled 2012... both Hyper-V Server and the File Server guest OS.  I continue to experience the issue, though I'm not seeing the SMB entries in Event Viewer.  So, it's either not R2 specific, or maybe I've got some type of hardware issue... but that just seems so unlikely.  May try 2008 R2 or something else to confirm.
January 26th, 2014 5:54pm

Well, I loaded up the host with ESX 5.5 and installed my 2012 R2 file server as a guest on it.  Configured it all the same as when it was a guest on Hyper-V.  Same problem...

At this point it sure seems like something in Windows 2012 and above, perhaps with SMB.  I haven't tested with 2008 R2 yet... I might try that next, but it was a heck of a lot of work just to get this far and I'm spent.

SMB Session Authentication Failure

Client Name: \\192.168.0.180
Client Address: 192.168.0.180:55373
User Name: 
Session ID: 0xC0000000065
Status: The attempted logon is invalid. This is either due to a bad username or authentication information. (0xC000006D)

Guidance:

You should expect this error when attempting to connect to shares using incorrect credentials.

This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.

This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled

January 30th, 2014 3:30am

Yesterday I rebuilt the server with all new hardware... well, new to me.  Dual Xeon, LGA771 on a Supermicro board, all Intel NICs, and fully buffered ECC Kingston RAM.  Loaded it with ESX 5.5 and the same Windows file server guest (Windows 2012 R2 Datacenter Eval).

Same issue.

Having ruled out every piece of hardware on my network... I guess I'm left with the possibility that there's some sort of authentication problem between the file server and the domain?  The Event Log message seems to suggest that, at least, and this problem doesn't exist while transferring or streaming files from the virtual guest DC on a different host.

Not sure what my next step is.  I'm debating converting the guest file server to Ubuntu, which would at least prove or disprove the problem is isolated to the Windows guest.

February 4th, 2014 4:08pm

I hope this is being somewhat helpful and that I'm not just having a conversation with myself :-).

Here's what I found over the last 24 hours.  As I mentioned, I rebuilt the server with all new hardware... which at this point totally eliminates hardware issues at all levels.  I tested again with 2k12 R2, same issue.  I reverted to 2k8 R2 today, and... same issue.

So now I'm beyond hardware issues and probably beyond "Windows" issues.  I haven't tried converting the server to Ubuntu yet, but I think my Win2k8 R2 test told me that the problem lies somewhere in configurations (and that the problem isn't related to SMB 3.02).  So, since streaming works perfectly fine when test media is located directly on a domain controller, and since Event Viewer entries suggest authentication issues in SMB, I began looking into reasons why any sort of authentication/domain chatter might fail.

I have all my servers virtualized, including my domain controllers... and currently it's all on ESX 5.5, as a result of troubleshooting this problem.  I looked into the network configuration of the primary domain controller, the one I was able to successfully stream from.  For its network I had it connected to the same virtual switch as the other guests, which has several NICS teamed together.  So then I ran across this...

http://social.technet.microsoft.com/Forums/windowsserver/en-US/68b5894a-3eeb-4090-abcb-78538f9379fa/teamed-nic-for-domain-controller?forum=winserverDS

Which suggests that NIC teaming for Domain Controllers is a no-no.  So, I carved off two vmnics to a new virtual switch, set one as active and one as standby, and am beginning to test again now.  I'll let you know how it goes :-).

February 6th, 2014 2:06am

Well, I think I figured it out.  Like most problems that take weeks to figure out, the solution appears to have been pretty simple.

I noticed a bunch of audit failures in the event viewer... and all were related to SMB sharing.  More research took me here...

http://social.technet.microsoft.com/Forums/windowsserver/en-US/ae9da10a-b4d2-4eda-ae6d-ad61b7b6ab79/audit-failure-event-id-4625?forum=winserversecurity

... the commands the guy recommended didn't really do anything for me, but I took the premise of the server's "channel" with the domain controller being corrupted and ran with it.  It made sense considering the number of times I've joined and unjoined the file server to the domain.

So I unjoined it, renamed it to something that's never been on the domain before, rejoined it... and for the last 2-3 hours I've been able to stream media without any interruptions or errors.  I'll let things run all night to be sure, but I think it fixed it.

So, Dan Kingdon, check out that link.  If his commands don't tell you anything, maybe look into removing the server from the domain, renaming it, then rejoining it.  If that's not feasible in your situation, maybe there's a way to fix the "channel" without doing all of that.

Hope this has helped.

John

February 7th, 2014 4:15am

I was wrong, that didn't fix it.  At this point I may just try rebuilding AD.
February 8th, 2014 11:30pm

Before going that route I decided to try a couple more things... as it just seemed so unlikely that Active Directory was the issue.  Also the many tests I did above, some of which included specific AD diagnostics all indicated that all was well with AD.

So I decided to disperse media throughout physical PCs in the house and run a 15 hour test streaming from them all.  Not a single error or failure from streams coming from 3 machines, two physical and one virtual on my other server. 

This test again suggests that the problem is something specific to the problematic server.  Furthermore since all versions of Windows I've tested show the same error, I'm led to believe AGAIN that it's hardware related.  Problem is, at this point I've replaced ALL the hardware in that machine... except for the Areca RAID controllers.

There are two Arecas in the box, a 1280ML and a 1222, so one each from two different generations.  The 1280 normally hosts all my data, and the 1222 my backups.  I moved media to the 1222, streamed from there for a couple of hours, and reproduced the error.

So then, I sat in a quiet corner and thought for about 20 minutes.  These two Areca controllers are ALMOST the only thing unique to this machine vs. all the others.  The other server also uses a 1280ML, but it's running VMware with the VMware driver.  Virtual machines on ESX aren't accessing the controller directly, which may be why I don't see the problem while streaming from guests on that machine.  And of course none of the PCs in the house have Arecas.  Given I've had issues with Areca controllers ever since the release of 2012, I'm thinking now that the culprit is the Areca drivers.  Looking back through all my tests, they all reproduced errors when the media was hosted on machines which used the Areca drivers, including the ESX guest with the Arecas passed through.

In one final test for this theory, I loaded up a Ubuntu 12.04 guest on the problematic Hyper-V machine, and passed through the NTFS partition that I normally pass through to Windows.  I shared out the media folder, and again within about an hour, I got the error.  The underlying Areca driver on the Hyper-V host is the one thing all the failure tests seem to have in common.

I'm going to pick up an Adaptec from eBay tonight and give that  a try this week. 

February 10th, 2014 1:10am

For what it's worth I too am running into the same exact problem.  I've been battling this since late December 2013.

  • Running Windows 2012
  • Hypervisor is ESXi 5.1
  • Server service hangs if I try to restart it during the issue
  • The problem, when it occurs, only affects random clients.  Meaning it continues to work fine for some, but not others.  We have multiple wiring closets and the problem is inter mixed across all of them.
  • I too can RDP / ping to the host from a problem client
  • Rebooting the server is the only solution that fixes it.
  • I did a packet capture from the server side, and from what I can see, an SMB negotiate is sent from the client, but the server never responds with the SMB protocol to use.  It ends up in a repeating loop until the client gives up.
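To separate "port 445 still accepts a TCP connection" from "the SMB server actually answers a negotiate", a small probe can send a bare SMB2 NEGOTIATE and wait for any SMB2 reply. This is only a sketch: the field layout reflects my reading of the SMB2 header and NEGOTIATE request structures, the function names and the 5-second timeout are arbitrary, and it is not the capture method used above.

```python
import socket
import struct
import uuid

SMB2_DIALECTS = (0x0202, 0x0210, 0x0300, 0x0302)  # SMB 2.0.2 through 3.0.2


def build_negotiate(dialects=SMB2_DIALECTS):
    """Build a Direct-TCP framed SMB2 NEGOTIATE request."""
    header = struct.pack(
        "<4sHHIHHIIQIIQ16s",
        b"\xfeSMB",    # ProtocolId
        64,            # StructureSize of the header
        0,             # CreditCharge
        0,             # Status
        0,             # Command 0 = NEGOTIATE
        1,             # CreditRequest
        0,             # Flags
        0,             # NextCommand
        0,             # MessageId
        0,             # Reserved
        0,             # TreeId
        0,             # SessionId
        b"\x00" * 16,  # Signature (unsigned request)
    )
    body = struct.pack(
        "<HHHHI16sQ",
        36,                  # StructureSize of the request
        len(dialects),       # DialectCount
        1,                   # SecurityMode: signing enabled
        0,                   # Reserved
        0,                   # Capabilities
        uuid.uuid4().bytes,  # ClientGuid
        0,                   # ClientStartTime
    )
    body += struct.pack("<%dH" % len(dialects), *dialects)
    payload = header + body
    # Direct TCP framing: one zero byte, then a 3-byte big-endian length.
    return struct.pack(">I", len(payload)) + payload


def smb_negotiate_replies(host, timeout=5.0):
    """Return True if the server answers NEGOTIATE with an SMB2 message in time."""
    reply = b""
    try:
        with socket.create_connection((host, 445), timeout=timeout) as sock:
            sock.sendall(build_negotiate())
            while len(reply) < 8:
                chunk = sock.recv(8 - len(reply))
                if not chunk:
                    break
                reply += chunk
    except OSError:  # refused, unreachable, or timed out
        return False
    return len(reply) == 8 and reply[4:8] == b"\xfeSMB"

# Example (hypothetical host name): smb_negotiate_replies("fileserver.example")
```

If this returns False while a plain TCP connect to port 445 still succeeds, that would be consistent with the negotiate never being answered, as seen in the capture.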

It's good to know I'm not the only one having this issue.

February 20th, 2014 5:32pm

This looks like a bug. We have been debugging this since October 2013.

I have found another topic which describes the same issue:

http://www.edugeek.net/forums/windows-server-2012/126721-server-2012-file-server-suddenly-stops-serving-requests-but-otherwise-looks-fine.html

February 21st, 2014 9:33am

I'm seeing the same issue on a Server 2012 R2 VM hosted on Server 2012.  We're using Dell EqualLogic iSCSI arrays.

I can reproduce the problem easily with Windows 8.1 workstations and loading roaming profiles off a 2012 R2 file server.

The key event log entry seems to be:

File system operation has taken longer than expected.

Client Name: \\[2001:630:]

Client Address: [2001:630:]:59115

User Name: DOMAIN\9999

Session ID: 0x34000C00003D

Share Name: \\*\Roaming Profiles

File Name: STUDENTS\9999.V4\NTUSER.DAT

Command: 16

Duration (in milliseconds): 77942

Warning Threshold (in milliseconds): 15000

Guidance:

The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.

I've also noticed Task Manager showing 100% disk utilisation but 0 read/write/response time.  I can still browse the disk locally but I can't copy files or make directories. It seems to completely lock up the disk.

Has anyone opened a PSS case? Can you post case numbers? I'm going to open a case and it might be helpful to link cases.  My case number is 114022411211053



  • Edited by DJL Monday, February 24, 2014 11:06 PM
February 24th, 2014 10:39pm

I think that sounds like a different issue from what we're describing.  For us, SMB stops working, but the disk subsystem is fine.  I can copy files without issue (when logged into the server).
February 25th, 2014 1:46pm

What's everyone's AV and processor?  I've read that there's a known bug with a specific processor.  I have the same processor family as the one mentioned here, but a different model.

http://social.technet.microsoft.com/Forums/en-US/7a625cfb-0252-407a-96bc-131a1a7f291a/intermittent-loss-of-unc-path-access-on-windows-server-2012?forum=winserver8gen

For our AV we use ESET.

February 25th, 2014 1:48pm

Well, several more weeks and lots more $$ later, I think (for reals) I've found the issue... though I still can't seem to resolve it.

It has never been hardware related.  Looking back I should have suspected that from the start.  I attempted streaming/copying continuously from the share directly from the IP address of the server... so, \\192.168.0.4\<share>.  Whenever I do that, there's never a problem.

What I'm seeing now is that the problem exists only while streaming from a share accessed through either an A record or a CNAME.  Even going directly to the server via the server name fails: \\<servername>\<share>.  I have removed all records of the server, rebooted it and let it re-register itself in DNS, and still the problem persists while accessing via DNS host names or aliases.

There are numerous articles like this one describing additional steps to be taken to accommodate sharing via DNS names, but they haven't worked for me either:

http://forums.techarena.in/windows-server-help/1195474.htm

So anyways, that's where I stand with it.  \\<IP>\<share> = good to go    \\<anythingdns>\<share> = no dice.

February 26th, 2014 1:38am

I'm not sure what this actually showed me, but last night I let a bunch of stuff stream to a Windows 7 PC in the living room using the FQDN share path, and everything ran beautifully all night.  On the other hand, on my 8.1 PC (which is where I've been doing the bulk of the troubleshooting), I streamed content all night long accessing the same share by IP address and surprisingly... I received errors all night (share becomes unavailable for a period then re-establishes itself).

Every test I do invalidates the last one.  This 8.1 machine streams fine from all other physical and virtual machines in the house... all either running Windows 7 or 2012 R2, and up until last night it streamed from the server by IP without issue.

February 26th, 2014 3:03pm

Let me amend my last post.  The 8.1 machine was actually accessing the content through a mapped drive (X:), which was MAPPED to \\<IP>\<share>.  Sitting here this morning with some ZZ Top playing while I work, everything disconnects briefly... and if I have any explorer windows open to the X: drive this window pops up:

[Image: One of the errors I get]

It then goes away and everything resumes.  Previously when I streamed without issue, I simply opened up \\<IP>\<share> in an explorer window and played content from there.  I'll try that again now instead of accessing through a mapped drive.

February 26th, 2014 3:20pm

No problems so far streaming while browsing directly to \\<IP>\<share>\

I'm going to try streaming by manually browsing to \\<FQDN>\<share>\, and not from a mapped drive, and I'll report back.

FYI, my drive maps are handled in Group Policy, and not by login scripts.  I did have problems in the past with Group Policy drive mappings in Windows 8+.  That problem was fixed by de-selecting "Reconnect".  As another test, perhaps I'll manually map a drive to \\<FQDN>\<share> and try that.

I think we're narrowing it down? 

February 26th, 2014 5:25pm

Alright, I have streamed all day successfully when accessing the share directly and NOT through a mapped drive.  At least in my scenario, I can pretty confidently say the problem exists only when interacting with a share over mapped network drives.  When interacting with the share directly, like through explorer by entering the UNC, the problem is not reproducible.

But since I WANT to be able to use mapped drives, I'm going to test more scenarios, like manually mapped drives and drives that are mapped by Group Policy login scripts instead of drive-map policies, and see what that shows me.

February 27th, 2014 2:10am

Here's the link to that other drive mapping issue from nearly 1.5 years ago now with the release of Windows 8...

http://social.technet.microsoft.com/Forums/en-US/7b033812-4ead-426d-a25b-aa5082859a25/cant-map-network-drive-with-login-script?forum=W8ITProPreRel

Not sure if there's a relation, but meant to put that link in my earlier post as I referred to it.

February 27th, 2014 4:04am

I honestly don't think it matters how you access the share, as Windows doesn't care whether you map by DNS name or IP, or whether you manually map a drive, do it via GPP or connect to a UNC path.  The issue, whatever it is, is related to the Server service and has nothing to do with the client.  Whatever the issue is, it is new to 2012 / 2012 R2.  So SMB 3.0 could be a culprit, as could a whole slew of other things.  It would be nice if MS would chime in.
February 27th, 2014 1:32pm

Sure seems like that would be the case... but my tests, as relatively non-technical as they are, do seem to suggest it's at least related to mappings.  Last night I streamed continuously throughout the night from the manually mapped drive and got failures and unavailability messages several times.  Tests from directly accessed shares never so far have the issue (and I can reproduce it reliably).

Is there some other mechanism involved in talking to shares through maps?  Extra authentication, DNS queries, anything?  Either from the client or host?  Intentional and periodic disconnects and reconnects?

I reproduced this problem as well with 2012, and even 2008 R2 (see above).  In all cases, I was using a 2012 R2 Domain Controller/ DNS server... and in all cases I was using a Windows 8.1 client.

John

February 27th, 2014 2:27pm

So, that test I did all night with the manually mapped X: drive (and all other drives disconnected) failed... but then I noticed something this morning while continuing to stream media from it.  The GPOs were re-applied and therefore re-mapped all the drives I had disconnected, presumably including the X: drive.  I realized this only because suddenly all the mapped drives I had manually DISconnected reappeared... without needing a reboot or re-login.

When this GPO was re-applied, at that very moment all the streaming became unavailable and the errors I've been seeing appeared.  So, as of right now, it seems like the drive mapping GPO perhaps gets reapplied periodically... refreshes... or something.  When this happens any network activity using the mapped drive is interrupted until the new connection is established.

To test this, I deleted all my mapped drive GPOs this morning and rebooted my client.  I then manually mapped the X:\ drive to \\<FQDN>\<share>, and so far for about 4.5 hours there hasn't been one failure.  My guess is because the GPO for mapping drives won't ever run.

February 27th, 2014 7:00pm

For us specifically, this is a server issue, not a client issue.  The clients can all access other shares just fine; the server (specifically the Server service) freezes and will not recover unless we reboot.
February 27th, 2014 7:55pm

I've been working through this with Microsoft over the last few days - it's proving to be a tricky one to pin down.

Those who are experiencing the problem, I'd be interested to know:

  • Is everyone seeing SMBServer Event ID 1020: File system operation has taken longer than expected?
  • What OS are your clients running?
  • What are your disk counters saying while it's happening? (Diskperf -Y -> Taskmgr -> Active Time, Read Speed, Write Speed? or Performance Monitor)
  • Does restarting the Server service solve the problem? (It does for us, but it takes ages to stop the service; then again, the machine also takes ages to shut down for the same reason)
  • Who is seeing SMB negotiation problems? (Either using Network Monitor 3.4 or PowerShell Get-SmbConnection.  We are seeing Windows 8.1 occasionally incorrectly negotiating SMB3 instead of SMB3.02)
February 27th, 2014 8:11pm

To answer your questions, since I think we're seeing the same issue:

  • No 1020 event in the System or Application log
  • Clients are a mix of Windows XP through Windows 8.  TMK, Windows 7 and Windows 8 are the only ones I've seen with the problem.  However, we have a very small XP population, so that may not be 100% accurate.
  • I didn't think to look at the disk counters, but will post the next time I see it (it's been a full week with no issues)
  • We try to restart the Server service, but it times out.  I've never waited to see if it would restart and simply rebooted the server.  Every time we've had the problem, though, the Server service is hung.  The reboot itself is actually quick.
  • We do see SMB negotiate problems and I have a capture of it.  Basically the negotiate packet comes in from the client to detect the dialect, and then the server never responds with the dialect.  There is TCP communication that's sent to the client.

Other info:

Server OS = Windows 2012 R1

AV = ESET (Nod32)

February 27th, 2014 8:51pm

Alright, I've streamed successfully all day with a manually mapped drive and no drive mapping GPO applied.  I think as far as my issue is concerned I'm good to go.  If any of you who are experiencing this problem use drive mapping GPOs to map drives, try mapping drives with login scripts instead.

I will say that never did I have a server freeze, requiring a reboot.  So, I'm not entirely sure my issue ended up being the same as the OP's.  Nonetheless, this is what I've found.  Hope it helps somebody out there.

John

February 28th, 2014 1:43am

  • We are seeing EventID 1020 but for us the problem is very rare.
  • 99% Windows 7 with a couple of Win 8/8.1
  • Haven't been able to check counters during the problem yet.
  • Restarting server service takes ages (i've never actually waited long enough for it to finish restarting).  Rebooting whole server is much quicker in our scenario.
  • I have not tried this during a failure yet.
February 28th, 2014 8:44am

Dan & Eric thanks for your replies.

Eric - the 1020 event log error is logged in the SMBServer log which can be found under: Event Viewer | Application and Service Logs | Microsoft | Windows | SMBServer | Operational
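For anyone who exports those SMBServer Operational events to text and wants to see how far over the warning threshold each one was, here is a rough Python sketch. It only assumes the "Name: value" layout of the Event ID 1020 message quoted at the top of this thread; the function names are made up for illustration and the sample text is that same quoted event.

```python
# Sample message text: the Event ID 1020 entry quoted at the top of this thread.
SAMPLE_EVENT = r"""File system operation has taken longer than expected.

Client Name: \\192.168.105.97
Client Address: 192.168.105.97:49571
User Name: HHS\12J.Champion
Session ID: 0x2C07B40004A5
Share Name: \\*\Subjects
File Name:
Command: 5
Duration (in milliseconds): 176784
Warning Threshold (in milliseconds): 120000
"""


def parse_event_fields(message):
    """Split the 'Name: value' lines of an event message into a dict."""
    fields = {}
    for line in message.splitlines():
        # Split on the first colon only, so values like 'host:port' stay intact.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def slow_operation_summary(message):
    """Report how long the operation took and how far it overran the threshold."""
    fields = parse_event_fields(message)
    duration_ms = int(fields["Duration (in milliseconds)"])
    threshold_ms = int(fields["Warning Threshold (in milliseconds)"])
    return {
        "client": fields["Client Address"],
        "share": fields["Share Name"],
        "duration_s": duration_ms / 1000,
        "over_threshold_s": (duration_ms - threshold_ms) / 1000,
    }
```

Calling slow_operation_summary(SAMPLE_EVENT) gives the client address, the share name and the two durations in seconds, which makes it easier to line several events up against the times users reported the shares as unavailable.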

We aren't running any AV on our file servers at the moment - took it off for troubleshooting.  We normally run System Center Endpoint Protection.

February 28th, 2014 9:28am

Forgot to mention with AV.  Ours did have ESET Nod32 but it has been uninstalled since the first occurrence of the problem.
February 28th, 2014 9:31am

Did MS have you do any kind of dump while the issue is occurring?  I would presume they'd be able to see the hang up if so. 

Also, can you provide your case number?  I'm hoping to open a case soon and I'd like to link to yours as well.

Finally, for whatever reason, my SMB server log is empty.

February 28th, 2014 1:49pm

Hi guys,

Out of interest, have you migrated the volumes on your file servers from other versions of Windows Server - i.e. did they use to be attached to Server 2008, or were they created new on 2012?

Can you check the status of 8.3 name creation on your volumes? Run fsutil 8dot3name query D:

Thanks

March 1st, 2014 11:50am

Hi DJL,

In our case the volume was originally created on Server 2012 and the whole server was upgraded to 2012 R2.

8dot3 naming is disabled on the volume giving difficulties.

Thanks

March 3rd, 2014 9:12am

In our case the data was robocopied (copyall) from a 2003 volume to a new 2012 volume.  8.3 is disabled for us too.

FWIW we also have a few Macs accessing our volume, although I suspect that's not related.

March 3rd, 2014 2:50pm

Hi,
I work for a Network Solutions provider and we have now seen this problem on at least 3 completely separate customer sites, all using different hardware, but all on 2012. We built a server on 2012 R2 last week in the hope it had been resolved, but the customer phoned today to say the server stopped serving files to clients and had to be rebooted. The only 100% fix we have found so far is to rebuild the server back to 2008R2, after which we have never seen the problem again.

We have logged the case with Microsoft and I will update if we get anywhere with it, but at the moment most of the blame seems to be on AV (Sophos), although it is fine under 2008R2 and other people with the same problem have tried without AV and the issue still exists.

I would be willing to work with anyone/share ideas to try and get this resolved for all of us.

Thanks

March 3rd, 2014 4:04pm

Keep us posted and let us know if you need any specifics from our environment.
March 3rd, 2014 5:19pm

Thanks all for your replies.

We've just managed to capture the information that Microsoft have been requesting.  Essentially they just wanted a network capture on the server and client at the same time while the problem was occurring, with the client trying to access a UNC on the affected server, and a few other bits thrown in.

Eric I think we are seeing exactly what you see:

  • The client sends an SMB Negotiate request to the server: SMB: C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12, SMB 2.002, SMB 2.???
  • We can see this being received by the server but it sends no SMB response back. We do see a TCP response on 445 but it's not SMB
  • The client resends the SMB Negotiate approx. every 20 seconds due to a lack of response from the server

We killed IOMETER (we were using it to stress the file server) and waited another 5 mins and the server recovered and eventually the client got a response to its SMB negotiate request and they negotiated SMB 3.02 correctly.
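If anyone wants to check for the "TCP answers on 445 but no SMB response" state without setting up a full capture, here is a rough Python sketch. It hand-builds a legacy SMB negotiate request carrying the same dialect strings as in the capture above, sends it, and reports whether anything SMB-shaped comes back within a timeout. The flag values, the 5 second timeout and the function names are my own assumptions, not anything Microsoft asked for, so treat it as a starting point only.

```python
import socket
import struct

# Dialect strings as listed in the negotiate request from the capture above.
DIALECTS = [
    "PC NETWORK PROGRAM 1.0", "LANMAN1.0", "Windows for Workgroups 3.1a",
    "LM1.2X002", "LANMAN2.1", "NT LM 0.12", "SMB 2.002", "SMB 2.???",
]


def build_negotiate_request(dialects=DIALECTS):
    """Build a legacy SMB negotiate request framed for direct TCP (port 445)."""
    # Each dialect is a 0x02 buffer-format byte plus a NUL-terminated string.
    payload = b"".join(b"\x02" + d.encode("ascii") + b"\x00" for d in dialects)
    header = struct.pack(
        "<4sBIBH12sHHHH",
        b"\xffSMB",     # protocol id
        0x72,           # command: negotiate
        0,              # status
        0x18,           # flags (assumed typical value)
        0xC801,         # flags2 (assumed typical value)
        b"\x00" * 12,   # PID high, signature, reserved
        0xFFFF,         # tree id
        0x0001,         # process id (arbitrary)
        0,              # user id
        0,              # multiplex id
    )
    # Word count of 0, then the byte count, then the dialect list.
    message = header + b"\x00" + struct.pack("<H", len(payload)) + payload
    # 4-byte session header: a zero byte followed by a 3-byte big-endian length.
    return b"\x00" + len(message).to_bytes(3, "big") + message


def probe_smb(host, port=445, timeout=5.0):
    """Return 'no-tcp', 'tcp-only' (connected, no SMB reply) or 'smb-ok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_negotiate_request())
            try:
                reply = sock.recv(1024)
            except socket.timeout:
                return "tcp-only"
    except OSError:
        return "no-tcp"
    # A reply starts with the session header, then an SMB1 or SMB2 protocol id.
    if len(reply) >= 8 and reply[4:8] in (b"\xffSMB", b"\xfeSMB"):
        return "smb-ok"
    return "tcp-only"
```

Running probe_smb against the file server's name from a client every minute or so and logging the result should show "tcp-only" during a hang if your symptoms match ours. It will not tell you why the server is stuck, and it needs the server to still accept the legacy multi-protocol negotiate.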


  • Edited by DJL Monday, March 03, 2014 5:28 PM
March 3rd, 2014 5:28pm

Hi Everyone,
I have spoken to another IT guy and he has a number of 2012 servers and has not seen this problem yet (or isn't aware of it) and the only difference we can think of is that he is running Datacenter not Standard. We have seen the problem on 2012 Standard and 2012R2 Standard, can you all confirm the versions you are using?

Also, DJL, are you saying you can reproduce the problem on demand? If so can you google the following article (I can't post a link at the mo) and try changing the timeout value to something lower to see if it stops the problem from occurring?
Microsoft network server: Amount of idle time required before suspending session

Thanks

Tom

March 4th, 2014 8:33am

We are running Server 2012 R2 Standard
March 4th, 2014 8:39am

Hi Tom

We're running 2012 R2 Datacenter.

Yes we can reproduce the problem on demand (pretty much) on two cleanly installed Windows Server 2012 R2 Core Datacenter virtual machines, no software other than Windows. 

I'll give that value a try at some point - Microsoft are having us run through various tests at the moment and they are quite specific on what we can/can't change, so I'll need to wait until we've finished those tests.


  • Edited by DJL Tuesday, March 04, 2014 12:35 PM
March 4th, 2014 12:32pm

Hi,
Are you able to explain how you can reproduce the problem (if it isn't too complicated/time consuming for you) so I can do some investigations on our networks?

Thanks

Tom

March 4th, 2014 12:40pm

Sure, I'll go into detail about our setup as well.

Our setup:

  • 4x Dell PowerEdge R610's (Intel Xeon X5560, 144GB RAM, Broadcom 1Gbps LOM and Intel 10Gbps X540-T2) running Windows Server 2012 Datacenter Core / Failover Cluster / Hyper-V
  • 3x iSCSI SAN's - 2x Dell EqualLogic PS6000 and 1x PS6110
  • The file servers we see the problem on are Windows Server 2012 R2 Datacenter Core.  Their system/boot disk are VHDX's on Cluster Shared Volumes and the file data is stored on SCSI Pass-through disks. 
  • The file servers have the IPv4 stack uninstalled - we run IPv6 only.
  • All hardware is running the latest firmware/drivers etc
  • Client workstations are running Windows 7 SP1 and Windows 8.1.  All latest updates from Windows Update/WSUS are installed

To reproduce the problem we:

  • Map a share on the server to a workstation.  Run IOMETER on the share to stress the server.  IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers.  This takes the disk activity up to 100%
  • We then log on to a number of Windows 8.1 workstations simultaneously - the users' roaming profiles are stored on the same server/volume.
  • We normally log in to about 40 machines at the same time to make sure the problem happens, but it can happen with as few as 1 or 2 machines.
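For those without IOMETER to hand, a very rough stand-in for the read load is sketched below in Python: 4 worker threads each doing random 512-byte reads against one large file on the mapped share. This is not IOMETER and it is an assumption on my part that it stresses the server in a similar way; in particular each worker has only one read in flight, so it does not reproduce the 400 outstanding IOs, and reads may be served from the client cache. The file path in the usage comment is hypothetical.

```python
import os
import random
import threading


def read_worker(path, block_size, reads, totals, lock):
    """Issue `reads` random block-aligned reads against the file at `path`."""
    blocks = max(1, os.path.getsize(path) // block_size)
    done = 0
    # buffering=0 avoids Python-level read-ahead; OS/SMB client caching still applies.
    with open(path, "rb", buffering=0) as handle:
        for _ in range(reads):
            handle.seek(random.randrange(blocks) * block_size)
            handle.read(block_size)
            done += 1
    with lock:
        totals[0] += done


def run_read_load(path, workers=4, block_size=512, reads_per_worker=1000):
    """Run the workers in parallel and return the total number of reads issued."""
    totals = [0]
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=read_worker,
            args=(path, block_size, reads_per_worker, totals, lock),
        )
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return totals[0]


# Example (hypothetical path on a mapped drive):
# run_read_load(r"X:\loadtest\bigfile.bin", reads_per_worker=1000000)
```

Create the test file first and make it large enough that most reads miss any cache, then watch the disk active time on the server to check the load is actually reaching it before trying the simultaneous logons.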

I'd be interested to know what processors you guys are using?  

March 4th, 2014 1:09pm

Xeon E5520 and Xeon 5620 

We also have a 4th cluster node with a brand new Xeon E5-2643 v2 but the file server has never really been hosted on that node.

Thanks for the info on how you have reproduced the error.  I may try the same with our file server out of hours and see if I can trigger the same result.  

March 4th, 2014 1:36pm

The latest server we have seen this on is using Dual Xeon E5-2620 running Citrix XenServer 6.2

Thanks for the details, very helpful, I will see if I can replicate the problem and post back when I can.

Tom

March 4th, 2014 4:16pm

For what it's worth, I'm encountering the same problems at my workplace. Setup:

  • Intel Xeon E5-2420
  • VMWare ESXi 5.5, build 1474528. 
  • Windows 2012 R2 Essentials
  • 2012 R2 as sole Domain Controller, running DNS, AD, DHCP and file/print sharing
  • No WSUS set up yet.

Problem seems to manifest most often during file saves in Office 2007, but 90% of our document shuffling is spreadsheets, so it would make sense that's where we see it most frequently. Once a single user starts having the issue, it starts to show up on others. All of our workstations are running XP or Windows 7.

Seems to happen most frequently on Windows 7 clients.

Disabling smb2/3 seems to have allowed me to pull an individual workstation out of the stall just by waiting for the network share to display properly again, but it's not a solution... it usually takes a few minutes for it to resolve the share contents properly. It's not a good approach, just a stopgap that allows (eventually) saving open files.

.tmp files with the file name as a random hex value show up anywhere we've had a workstation stall out during a save. The file itself is basically inaccessible in most cases until the next reboot of the server. Those temp files often can't be addressed, opened, deleted or used in any way without causing explorer to freeze.

Once the server is rebooted, I can collect and delete all of those .tmp files, or open them in their respective programs. (word/excel/etc) Once in a while, the original file is also corrupted and can't be opened/used/saved over/replaced/renamed until a server reboot. 

Not a lot to add to the discussion, just another instance of it happening. I've been following this thread and http://www.edugeek.net/forums/windows-server-2012/126721-server-2012-file-server-suddenly-stops-serving-requests-but-otherwise-looks-fine.html and hoping someone comes up with a solution sooner or later.

As it currently stands, I wouldn't recommend deploying either 2012 or 2012R2 as a file server in any circumstance. Works great for everything else, but this pretty much shuts our entire workplace down, sometimes multiple times a day, since our key software has data files hosted on the file share. 



  • Edited by bergmbe Tuesday, March 04, 2014 5:47 PM
March 4th, 2014 5:10pm

Setup:

AD Domain 2008R2, 2 VM DCs (08r2) 1 Physical (2012)

2 VM FS and 2 Physical.  They are set up as a cluster: clstr1 and 2 are physical, 3 and 4 are VMs.

\\fs is the file share server name.

Storage network ISCSI jumbo frames.  Shares on a Dell PS6500.  VMS hosted on MD3220 (same network)

When this occurs I can move the Node/FS Role to a new server and it comes back.  I have to reboot the original server if I want it to work again.

The problem starts with some people and spreads.  We do folder redirection, and people are quick to tell us about the issues.

We brought in consultants to help resolve the issue.  The packet capture was very interesting.  The server is very responsive to all other traffic, but SMB (1 and 2, no 3) shows crazy delays when negotiating the protocol.

So in simpleton terms: client says hey, server ACKs, client requests SMB access, server waits, 50 seconds later client says forget you, server says fine.

Problems are occurring more often because we thought XP (since a hotfix was installed on the server) was part of the issue and have migrated heavily to Windows 7.  Now crashes have gone from once a week to 1 to 2 times a day.

We increased the resources on the AD PDC (VM) yesterday, thinking it was pegged.  Today we had the typical issues right before outright failure.  We moved the role and services returned.  I will be taking this thread to the powers that be so we can start a downgrade (which I feel is the best solution).  Oh, and these are brand new builds (fresh installs).

We are tight on storage (the SAN presents a LUN to the servers over iSCSI), so we might have to build a new server and move the LUN to it.  Has anyone done this and had any issues?

  • Edited by tcgood Wednesday, March 05, 2014 12:03 PM
March 5th, 2014 4:20am

Also we use GPP to map drives - does anyone above have the issue with login scripts?


Just trying to find a common denominator.
  • Edited by tcgood Wednesday, March 05, 2014 4:01 PM
March 5th, 2014 4:00pm

Just an update:

We've reproduced the problem twice today.  Both times we captured SMBServer tracing, network capture and full memory dump from both the server and a client.

Microsoft support are now analysing these - hopefully they'll be able to pinpoint something!

March 5th, 2014 5:46pm

Team,

We are also seeing the same issue a couple of times per day with very similar symptoms.

Yesterday I changed the autodisconnect registry setting, which seemed to make the dropouts occur less often; however, the issue did re-occur today, once! I know this article doesn't relate directly to Server 2012 R2, but it could nonetheless have an effect. See KB297684 (I can't post links yet).

I would be interested to know if this makes a difference for those who can re-create the issue, as my dropouts occur without warning and I cannot re-create them.

DJL, please keep us updated on your contact with MS; hopefully they find the issue and release a patch.

March 6th, 2014 5:33am

 Can you give me your Case Number? Maybe we can link our cases.
March 6th, 2014 8:23am

My case number is 114022411211053.  Can you post yours and I'll try and link it from my side as well

March 6th, 2014 11:50am

I know I came in late with this, but my case is 114030511237424.  I really am trying to find a common denominator here with our configuration.
March 6th, 2014 2:32pm

Thanks - I've sent your case number to the engineer dealing with my case.

I'm not sure there is a common denominator other than Windows 2012/2012R2 - I think it's just a bug in the SMB Server or associated components. I'm seeing the problem with clean installs of Server 2012 R2 core - no av, backup, monitoring etc.

I'm going to stick Server 2012 R2 on a desktop tomorrow and see if I can reproduce the problem on that - if I can then it'll rule out any problems with virtual machines, iSCSI, passthrough disks etc


  • Edited by DJL Thursday, March 06, 2014 10:06 PM
March 6th, 2014 10:05pm

Microsoft says they are calling me but I get nothing on my phone.  I did a netstat from my DC and found that several computers were connected with hundreds of LDAP sessions.  Today we had issues with people gradually losing connections and with powershell I ran netstat -an | Select-String -pattern ":389" .

I found that the file server was no longer connected to a DC.  It is strange; it was like my AD was experiencing a DDoS on LDAP.  So I tracked down one of those PCs and ran netstat -b to find out why there were so many connections to the DC on 389.  svchost was running gpsvc.dll with tons of connections.
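If you want to see which clients are holding the most LDAP sessions rather than eyeballing the netstat output, a small Python sketch along these lines can count them from the saved text of netstat -an taken on the DC. The sample addresses are invented for illustration and the function name is hypothetical. Note that it matches the local port exactly, whereas a plain text search for ":389" would also pick up ports such as 3890.

```python
from collections import Counter

# Invented sample in the usual Windows `netstat -an` layout (addresses are made up).
SAMPLE_NETSTAT = """\
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:389            0.0.0.0:0              LISTENING
  TCP    10.0.0.10:389          10.0.0.51:49210        ESTABLISHED
  TCP    10.0.0.10:389          10.0.0.51:49211        ESTABLISHED
  TCP    10.0.0.10:389          10.0.0.52:50100        ESTABLISHED
  TCP    10.0.0.10:3890         10.0.0.53:50200        ESTABLISHED
  TCP    10.0.0.10:445          10.0.0.51:49300        ESTABLISHED
"""


def ldap_connections_per_client(netstat_text, port=389):
    """Count established TCP connections to the given local port per remote host."""
    counts = Counter()
    for line in netstat_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] != "TCP":
            continue
        local, remote, state = parts[1], parts[2], parts[3]
        if state != "ESTABLISHED":
            continue
        # Compare the port after the last colon so 3890 is not mistaken for 389.
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts
```

ldap_connections_per_client(text).most_common(10) then lists the ten busiest clients, which are the ones worth checking with netstat -b as described above.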

Anyway blah blah blah

http://support.microsoft.com/kb/2561285

Still verifying the fix will work - have to spend the week applying this hotfix to problem machines.  I will let everyone know if we are good for awhile. Oh, and this is a year old fix that is not part of updates for windows 7.

  • Edited by tcgood Friday, March 07, 2014 3:00 AM
March 7th, 2014 2:56am

I swear working with Microsoft Support is torturous sometimes! No progress here yet..
March 10th, 2014 1:16pm

Yeah it is.  Did you check your DC's netstat connections over 389?  We have hundreds of computers to attempt to apply the hotfix to.  Also, during my troubleshooting of the server I have found that I can move the role over to a working server and it comes back to life without an issue; after that, trying \\clstr2\ gets no response until a reboot.  At this point I am just ranting...
March 10th, 2014 8:17pm

Applied that hotfix to our Windows7 pc's, where we're experiencing most of the issues. We only have ~4 on site, so it was fast. Encountered the same issues about 2 hours after applying it, so no dice for our site at least. Let me know if it works for yours. Good luck with the rollout! That sounds like a nightmare. 

I had not been seeing the :389 issues on netstat that you are though, either before, during or after the fileshare issues manifest. 

  • Edited by bergmbe Monday, March 10, 2014 9:10 PM
March 10th, 2014 9:05pm

There just has to be a common piece to our issues.  I am pushing the patch using PsExec and a text file - it should be on them by the end of the day.  We had another outage today and sent our capture in to Microsoft.  I am really close to downgrading to 2k8r2.  For us it is happening daily now - and often several times in a day.

How many devices do you have on your network?  Does your FS have a lot of traffic?

March 11th, 2014 3:06pm

So for those with MS cases open, does MS have nothing to say yet?  I've seen at least one of you had generated a system dump.  You would think that's all that's needed for MS to figure it out.
March 11th, 2014 4:21pm

Afternoon all,

Give this registry setting a try:

reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing /t REG_DWORD /d 1 /f

I've just tested it on one of our servers and I can't reproduce the issue at the moment so it looks like it may solve the problem! (I'm trying to leave the office for the day so haven't tested extensively!)


  • Edited by DJL Wednesday, March 12, 2014 12:36 PM
March 11th, 2014 5:24pm

Can you tell me more about the key?  I hate to add a reg key and not know what it does.

Thanks

March 11th, 2014 6:04pm

@TCGood

Single server, ~14 workstations plus 5 additional devices. Not a ton of traffic, but at any given time we have 20-30 files open with locks. Mostly small files under 1mb. The busier days with lots of opens and saves seem to coincide with this problem manifesting more frequently. 

Then again, I came in to the issue this morning and no one had been using the network at all for 12 hours. Only 3 active logins when I arrived, and 2 of them were in the process of stalling out. All other active file share log ons were unresponsive. In the interest of getting people back to work, I just restarted the server.

I'm with you on being at the point where downgrading makes the most sense. But my business won't put up funds for a 2008 R2 license, so in lieu of that, I recently set up a second VM and installed Ubuntu LTS. I'll be setting it up as a simple file server later this week until I see some resolution from Microsoft. Not ideal, but it can't be helped at this point.

March 11th, 2014 7:02pm

that is bad, at least I am at 600 devices.  Traffic is between 5 and 100mb mostly.

1) Are you using quotas (File Server Resource Manager)?

2) Install disc: where/how was it acquired?

3) Any special settings, i.e. continuous availability, Volume Shadow Copy, ABE etc.?

I am still looking for a common factor.  MS has my captures and keeps asking the same question: "when the cluster goes down can you access the node?" NO! I can RDP and ping, and it otherwise responds.  The admin share and all SMB are unavailable.  I am going to spin up a VM with a different install disc to stress test.

March 11th, 2014 8:02pm

The registry key disables leasing.  More info on leasing can be found here: http://technet.microsoft.com/en-us/library/ff625695(v=ws.10).aspx

The following will be logged in the event log when you add the reg key:

File leasing has been disabled for the SMB2 and SMB3 protocols.  This reduces functionality and can decrease performance.

Registry Key: HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters
Registry Value: DisableLeasing
Default Value: 0 (or not present)
Current Value: non-zero

Guidance:

You should expect this event when disabling SMB 3 Leasing. Microsoft does not recommend disabling SMB Leasing. Once disabled, traffic from client to server may increase since metadata and data may no longer be retrieved from a local cache.

So far this seems to have solved the problem for us.  I'll try and get more info out of Msft later once I've fully confirmed it solves the problem for us.

March 12th, 2014 10:16am

was this key a MS recommendation or something you discovered?  Also in the past have you been able to consistently reproduce the problem? 
March 12th, 2014 2:58pm

When we hired our consultant, this was the first thought he had, without looking into the issue very far.  It is extremely reliable, but we lose the benefits of SMB 2.0 and 3 (which we don't use).  At 1pm I will have a call back from MS.
March 12th, 2014 4:03pm

Yes - the key was recommended by Microsoft support; apparently they have quite a few people reporting this issue at the moment ;) and yes, we have been able to reliably reproduce the problem.

Today was quite promising: we brought both our file servers back up to full load and we didn't see any problems once the registry key was set - we haven't managed 20 mins at full load since the problem surfaced so definitely better.  I'm still slightly on edge about it though as Wednesday is generally a quieter day for us - if we manage to make Friday afternoon without it resurfacing I'll be more confident.

tcgood - you won't lose all the benefits of SMB 2.0, 2.1, 3.00 and 3.02.  The reg key only disables leasing; all the other improvements will still be available and 3.02 will still be negotiated (client OS version dependent).

Having said that I don't consider the reg key a fix, merely a work around.  Once I'm confident our servers are stable with the reg key i'll push MS to see what they plan on doing about it permanently. 

The odd thing is file leasing is in SMB 2.1 which was available in Windows 7/Server 2008 R2 and as far as I'm aware this issue didn't affect 2008 R2.  I guess it could be directory leasing which was introduced in SMB 3...


  • Edited by DJL Wednesday, March 12, 2014 8:34 PM
March 12th, 2014 8:28pm

Good to know, thanks for the quick reply.  I'm going to wait till Friday before i try the key.  if you're still stable after then, i'm going to try it in our environment.  

What do you guys do to actually make the problem occur?  Would running IO Meter on a network share trigger it?

March 12th, 2014 10:13pm

Ok - never mind.  We've just had the problem reoccur - the reg key doesn't fix it :(
March 13th, 2014 10:12am

So now what does MS think?  It blows my mind that they can't figure this out after getting dumps from multiple folks.
March 14th, 2014 5:42pm

They want us to capture logs, memory dumps and tracing again! - so another late night tonight.

It's happening every f**king 20 mins on our server at the moment.  I'm royally pissed this morning!


Edit: Apologies, I'm getting very frustrated with this problem and being told there isn't a problem by PSS.  If it wasn't for the fact our users love work folders i'd be back on 2008 R2 by now.
  • Edited by DJL Monday, March 17, 2014 9:54 PM
March 17th, 2014 12:05pm

We've reproduced the problem again twice this afternoon and provided PSS with two new complete sets of memory dumps, SMB tracing, event logs, network captures and screen shots from both the server and client.

They now have 4 sets of this data from us.  Hopefully they'll find something this time!

This is the weird 100% active time symptom we're seeing in taskmgr:


  • Edited by DJL Monday, March 17, 2014 9:59 PM
March 17th, 2014 5:09pm

We're experiencing the same issue as everyone else on this forum, where our 2012 servers (not all, only two of the eight we have in our environment so far) will not accept SMB connections, but all other connections are fine.  Much like everyone else we've tried several things (listed below) and the only temp solution is restarting the server:

Actions taken so far with no success!

1.) Restarting the Server Service - The service doesn't start back up, which leads to a reboot anyway.

2.) Verified the following Rollups were installed.  

http://support.microsoft.com/kb/2883201/en-gb
http://support.microsoft.com/kb/2889784/en-gb

3.) Turned off Background Optimization for Data Deduplication

We are currently working with MS (Case #114020511159226) on this issue and they have no idea.  They are just having us collect logs and dumps.  We updated the case to a SEV A case, so hopefully we have something today.  So I thought it wouldn't hurt to post on this site as well, to see if anyone else had any thoughts.  I will keep you guys informed as we try things to come to a solution.

I came across a possible hotfix (http://support.microsoft.com/kb/2928360) that deals with a memory leak in SMB2 and SMB3.  I was wondering if anyone noticed during their issue whether the NonPaged Pool memory was exhausted.

March 18th, 2014 3:16pm

Hi - I came across that article a few days ago and have been monitoring our paged and non-paged pool. Both seem normal and don't change when the problem is occurring.

The following potential solution was posted on edugeek.  No help for me as we're running Hyper-V, but interesting nonetheless.

>> Solved the problem on my end. I don't know why it should matter, but I was using an E1000 network adapter on the VM's giving me this issue, I switched them to VMXNET 3 and have not seen the issue re-occur. This was almost a daily problem and it has not happened for the 8 days that I have been running the VMXNET 3 adapters. I know some of you are not using VMWare ESXI hosts, but for those that are, give this a shot! ESXI 5.1 Server 2012R2


  • Edited by DJL Wednesday, March 19, 2014 12:08 AM
March 19th, 2014 12:08am

@DJL. I saw that on the edugeek forum as well. I'm going to give it a try this weekend, when I can take down our VMs and make sure the changeover won't affect anything else. Like the guy that posted it, we're using VMWare's ESXI 5.5 and assigned our NICs as E1000 network adapters in the windows environment. I still have printers and a small subset of user files running on the share, even though a majority of our files are now hosted on a second VM running Ubuntu Server 12.04 (LTS release). That way I can still do some testing to see if solutions work for us. I'll post back if switching the NICs over helps at all. 

March 20th, 2014 9:57pm

RESOLVED!

OK people! I hope I can help everyone out with this solution MS have provided me. I have been working closely with the Network Team, in particular the most skilled escalation tech in Asia Pacific, who happened to be an expert in SMB. My system has been stable for around 10 days now.

My Environment

  • vmware ESXi 5.1.0 build 1123961
  • Server 2012 R2

My Symptoms

MS recommended a two-part fix. With only the VMware driver fix applied I was still experiencing the issue; it wasn't until I made the change to srv2.sys that the fix became permanent.

VMware driver: MS believe that this driver (vsepflt) could have been conflicting with the srv2.sys driver. To disable it, follow this article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2034490

Srv2.sys: In my opinion this is the actual change that fixed the issue. The srv2.sys driver handles SMB 2 traffic at the kernel level. In operating systems before Server 2012 R2 the driver is set to auto start; Microsoft changed it in Server 2012 R2 to start on demand, and it seems not to start gracefully when a request arrives over SMB 2 or above.

To change srv2.sys to auto start, open an elevated cmd prompt and run sc config srv2 start= auto (sc expects the space after start=). A reboot is required after running this command.
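
If you want to confirm the start type before and after the change, here is a minimal Python sketch. It only parses the text that sc qc srv2 prints. The sample text is an assumption about the usual sc qc layout, not output captured from an affected server, and parse_start_type is a hypothetical helper name.

```python
import re
import subprocess
import sys

# Illustrative sample of `sc qc srv2` output (assumed layout, not captured
# from an affected server).
SAMPLE_SC_QC = """\
[SC] QueryServiceConfig SUCCESS

SERVICE_NAME: srv2
        TYPE               : 2  FILE_SYSTEM_DRIVER
        START_TYPE         : 3   DEMAND_START
"""


def parse_start_type(sc_qc_output):
    """Return the START_TYPE name (e.g. 'DEMAND_START'), or None if absent."""
    match = re.search(r"START_TYPE\s*:\s*\d+\s+(\w+)", sc_qc_output)
    return match.group(1) if match else None


if __name__ == "__main__" and sys.platform == "win32":
    # On the file server itself, query the real driver configuration.
    result = subprocess.run(["sc", "qc", "srv2"], capture_output=True, text=True)
    print(parse_start_type(result.stdout))
```

If the change took effect, the same query should report AUTO_START instead of DEMAND_START.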

When I talked to the lead network tech about this fix, I referred him to this thread and told him how many people it was affecting. He advised that in all cases experiencing this issue, changing srv2.sys to auto start has worked 100% of the time. I asked why MS haven't released an official patch: because there have not been enough cases to warrant an official fix.

I'm very interested in whether this fixes your issues. Please mark this as an answer if I bring you success! Best of luck!



March 24th, 2014 5:20am

Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

As for the reg fix, I would rather not disable leases just yet!

March 24th, 2014 9:29am

I have been having the same issue running ESXi 5.5.0 1331820 with Windows Server 2012 R2 Essentials, and this is the only fix so far that has made my system stable. The system had been crashing multiple times a day, but it has now been running for 2 days continuously.

Thanks

March 25th, 2014 8:44am

>> Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

>> The reg fix I would rather not disable leases just yet!

What reg fix?  I know you mentioned one above, but I thought it wasn't working.
March 25th, 2014 4:27pm


>> When I talked to the lead network tech about this fix, referenced him to this article, how many people it was affecting etc, he advised that in all cases experiencing this issue changing the srv2.sys to auto start has worked 100% of the time. Why havent MS release an official patch I asked because there have not been enough cases to warrant an official fix.


Rant:

To be fair, the only reason I haven't officially reported/opened a case with Microsoft is because my company can't afford the service contracts they charge for issues like this. I understand the need for service contracts during personalized troubleshooting, but it seems counter-intuitive to me in a case like this where it's quite clearly their buggy SMB2 & 3 "upgrades" causing the problems.

On topic:

Once others post their experiences with this fix, I'll try it on our system as well. At the moment, I'm verifying stability after changing our network adapters from the VMware-assigned E1000 to VMXNET3. So far so good on that fixing our issues, but I had to move the bulk of our fileshare to an Ubuntu LTS release running Samba in order to get a reprieve, so I'm not sure I'm really testing it. If the change DJL mentioned from the edugeek thread doesn't fix the issue, I'll try your solution, which I'm extremely glad to have as a backup option. Thanks very much for your report on this. 


  • Edited by bergmbe Tuesday, March 25, 2014 11:36 PM
March 25th, 2014 11:36pm

>> Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

>> The reg fix I would rather not disable leases just yet!

>> What reg fix?  I know you mentioned one above, but I thought it wasn't working.

Just want to confirm that the registry fix for leasing does NOT work. At least 3 or 4 of us tried it and it didn't actually solve it. All of us that did it altered it back after we discovered it wasn't the fix. 
March 25th, 2014 11:37pm

I'll check in next week.  If the fix is still working, I'll move forward with making the change.

My next question for MS would be, should we make this the default for all new builds?  What's the downside?

March 26th, 2014 1:34pm

>> What reg fix?  I know you mentioned one above, but I thought it wasn't working.

EricCSinger follow these instructions to fix your issue (see full description in my post above), let me know how it goes.

MS recommended a two-part fix. With only the VMware driver fix applied I was still experiencing the issue; it wasn't until I made the change to srv2.sys that the fix became permanent.

VMware driver: MS believe that this driver (vsepflt) could have been conflicting with the srv2.sys driver. To disable it, follow this article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2034490

Srv2.sys: In my opinion this is the actual change that fixed the issue. The srv2.sys driver handles SMB 2 traffic at the kernel level. In operating systems before Server 2012 R2 the driver is set to auto start; Microsoft changed it in Server 2012 R2 to start on demand, and it seems not to start gracefully when a request arrives over SMB 2 or above.

To change srv2.sys to auto start, open an elevated cmd prompt and run sc config srv2 start= auto (sc expects the space after start=). A reboot is required after running this command.

When I talked to the lead network tech about this fix, I referred him to this thread and told him how many people it was affecting. He advised that in all cases experiencing this issue, changing srv2.sys to auto start has worked 100% of the time. I asked why MS haven't released an official patch: because there have not been enough cases to warrant an official fix.

I'm very interested in whether this fixes your issues. Please mark this as an answer if I bring you success! Best of luck!

March 27th, 2014 11:53pm

I applied the sc config srv2 start=auto command and will report back if it does the trick.  TBH, it might be weeks before I know for sure, just depends on what randomly triggers the freeze.  I went weeks with no issues, now it's happening three times a day.
March 28th, 2014 12:58pm

Thanks vitis_vinifera, I'll give sc config srv2 start=auto a try shortly.  I can reproduce the problem on demand, so it should be fairly easy to see if it works or not.

bergmbe, which country are you in?  In the UK you can pay for support on a case-by-case basis using a credit card.  It costs ~£240 per case and they only take payment if they fix the problem.  They also don't charge if it's a defect in their software.  Pretty good value, I think.



  • Edited by DJL Friday, March 28, 2014 5:30 PM
March 28th, 2014 5:27pm

>> I applied the sc config srv2 start=auto command and will report back if it does the trick.  TBH, it might be weeks before I know for sure, just depends on what randomly triggers the freeze.  I went weeks with no issues, now it's happening three times a day.

For me, SMB crashed again after the "fix".

I can't fathom why MS can't figure this out.  Seriously, it's their freaking code, and you guys have provided dumps. What else do they need?  For those that have cases, has anyone actually gotten this escalated to American tech support, in other words level 3?

March 28th, 2014 10:39pm

sc config srv2 start=auto isn't a fix - I can still reproduce the problem.

Eric, I share your frustration.  I've been escalated to level 2 and basically had to start from the beginning again with the new engineer.  After going through disabling all the advanced NIC features and blaming 3rd party software (there isn't any), we are back to memory dumps and tracing again.

I've had to migrate one of our file servers back to 2008 R2 (a 4.5 yr old OS!), but I'm stuck with 2012 R2 on others as we're using work folders.


March 30th, 2014 7:32pm

We also have this issue now on two customer sites. Both are running 2012 R2 Essentials. One has VMware 5.5 on Dell T320 hardware and the other has VMware 5.1 on an HP ProLiant ML350p Gen8. Setting start=auto didn't fix it, but with it the problem happens a little less often. No real solution yet.
March 31st, 2014 7:58am

How do you reproduce the problem?

On our fileservers this just happens about once every 10-14 days.

Is there a trick or can you give us a hint how we can reproduce this to speed up debugging?

March 31st, 2014 11:59am

I don't know how to reproduce the problem, but one thing I've done as a test is rip out my AV (ESET).  I'll let you know if things seem stable afterwards.  The problem is that the hangs are intermittent.  Sometimes they happen every few hours, sometimes it's weeks.
March 31st, 2014 4:20pm

So far things are good.  We noticed that two other file servers (Windows 2012) that have not had this problem have no AV installed.
March 31st, 2014 11:01pm

To reproduce the problem we:

  • Map a share on the server to a workstation. Run IOMETER on the share to stress the server. IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers. This takes the disk activity up to 100%.
  • We then log in a number of Windows 8.1 workstations simultaneously - the users' roaming profiles are stored on the same server/volume.
  • We normally log in to about 40 machines at the same time to make sure the problem happens, but it can happen with as few as 1 or 2 machines.

We have no AV on our servers, or any other software for that matter.



  • Edited by DJL Tuesday, April 01, 2014 9:55 AM
April 1st, 2014 9:55am

>> To reproduce the problem we:

>>   • Map a share on the server to a workstation. Run IOMETER on the share to stress the server. IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers. This takes the disk activity up to 100%
>>   • We then login a number of Windows 8.1 workstations simultaneously - the users roaming profile is stored on the same server/volume.
>>   • We normally login to about 40 machines at the same time to make sure the problem happens, but it can happen with a few as 1 or 2 machines.

>> We have no AV on our servers, or any other software for the matter



If you run fltmc at the command prompt, what shows up?
April 1st, 2014 12:50pm

C:\>fltmc

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
DfsDriver                               0       405000         0
Cbafilt                                 3       261150         0
Datascrn                                0       261000         0
Quota                                   0       125000         0
npsvctrig                               1        46000         0



  • Edited by DJL Tuesday, April 01, 2014 8:57 PM
April 1st, 2014 8:57pm

Hello,

I have the identical problem with our 2012 R2 failover cluster. The installation itself is completely standard with only one registry "tweak": NtfsDisable8dot3NameCreation.

Does everyone who is affected by this issue have this setting enabled?

Best regards

April 2nd, 2014 7:11am

>> Hello,

>> i have the identical problem with our 2012 R2 FailOver Cluster. The installation itself is completly standard with only one registry "tweak": NtfsDisable8dot3NameCreation.

>> Do all have this setting enabled who are affected by this issues?

>> Best regards


Okay, it's not the problem. After I re-enabled 8.3 name creation, the problem occurred within a few hours (maybe due to the higher load?!).
April 2nd, 2014 10:33am

>> C:\>fltmc

>> Filter Name                     Num Instances    Altitude    Frame
>> ------------------------------  -------------  ------------  -----
>> DfsDriver                               0       405000         0
>> Cbafilt                                 3       261150         0
>> Datascrn                                0       261000         0
>> Quota                                   0       125000         0
>> npsvctrig                               1        46000         0



Just to give you an idea, this is all I have.

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
npsvctrig                               1        46000         0

Knock on wood, I've been stable thus far.  You might want to try disabling file screening, quotas, etc. one at a time to see if things start behaving.
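
To make that comparison less error-prone across several servers, here is a rough Python sketch that diffs two fltmc listings. The two samples are the listings posted above; parse_fltmc and extra_filters are hypothetical helper names, and the parser assumes the four-column layout shown in this thread. It only tells you which filters differ, not which one (if any) is at fault.

```python
def parse_fltmc(output):
    """Map filter name -> (instances, altitude, frame) from `fltmc` text."""
    filters = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows have exactly four columns and a numeric instance count;
        # this skips the prompt, the header and the dashed separator.
        if len(parts) == 4 and parts[1].isdigit():
            name, instances, altitude, frame = parts
            filters[name] = (int(instances), altitude, int(frame))
    return filters


def extra_filters(affected, stable):
    """Filters loaded on the affected server but not on the stable one."""
    return sorted(set(parse_fltmc(affected)) - set(parse_fltmc(stable)))


AFFECTED = """\
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
DfsDriver                               0       405000         0
Cbafilt                                 3       261150         0
Datascrn                                0       261000         0
Quota                                   0       125000         0
npsvctrig                               1        46000         0
"""

STABLE = """\
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
npsvctrig                               1        46000         0
"""

print(extra_filters(AFFECTED, STABLE))
```

The names it prints are the candidates to unload one at a time.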

April 2nd, 2014 1:15pm

That's interesting - thanks Eric.  I'll give that a go.

I've just sent off another load of memory dumps, tracing, net capture etc to MS PSS!

April 2nd, 2014 4:00pm

Just tried disabling the filters one at a time until npsvctrig was the only one left - I can still reproduce the problem :(
  • Edited by DJL Wednesday, April 02, 2014 4:38 PM
April 2nd, 2014 4:38pm

Just a quick update on our PSS case:

We've managed to capture the required information and our case has been sent to the Microsoft Global Business Support - Windows Serviceability Team (GES was a shorter name!) for analysis.

I can't imagine it's going to be a quick response given the millions of lines of code etc. they'll have to sift through.

April 6th, 2014 9:47am

Any update?
April 7th, 2014 9:00pm

Looks like it may have struck again for us, although it happened while I wasn't in the office, so I can't verify for sure.  If that is the case, then the AV isn't the cause for us.
April 8th, 2014 10:18pm

sc config srv2 start=auto seemed to help for a few days. After the first crash I created a script that reboots the server every night. Now even that doesn't help: yesterday and today the shares stopped working even though the server had rebooted overnight. Has MS replied with anything on this bug?

April 15th, 2014 4:41am

No update yet - still waiting for the debugging team.  

The first memory.dmp we sent them turned out to be corrupt, so we had to recapture all the info for them again, hence the extra delay. 

Will update when I have more info.

April 15th, 2014 11:40am

I've just read the entire thread today. We're experiencing the problem on a Windows Server 2008 R2 virtual machine (Windows 2012 cluster). Is this the case for anyone else?
April 17th, 2014 5:08pm

Hi Everyone, 

I thought I'd jump in here too. I'm glad to have found this thread; I've been pulling out my hair for a while on this one. 

I have a Win 2012 (not r2) server having the same SMB issue described here. I've been struggling with it for some time. I did a complete  fresh install of a Win 2012 VM, it worked for a while but then the issue popped back up. 

Here's my situation: 

Every day or so (sometimes more, sometimes less) the LANMANSERVER service (SMB/SERVER service) stops responding to win7 clients. Accessing files from the console of the server via the volume drives letters (c, d, e etc) works fine, just the mapped SMB drives (M, S, H) do not work, accessing via \\IPADDR\share does not work either. 

- Win XP clients seem to be able to access fine, which suggests a problem with SMB 2/3. Trying to stop and restart the SERVER service does not work; it hangs at STOPPING. The only solution seems to be a full reboot. 
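
Since the server still answers ping and RDP while the shares are dead, a plain up/down monitor won't catch this. A minimal Python sketch of a share probe to run from a client, assuming the hang shows up as a directory listing that either errors out or never returns (share_responds is a hypothetical helper, and the UNC path in the usage note is a placeholder):

```python
import os
import threading


def share_responds(path, timeout_seconds=10.0):
    """Return True if listing `path` finishes without error within the timeout."""
    outcome = {}

    def probe():
        try:
            os.listdir(path)
            outcome["ok"] = True
        except OSError:
            outcome["ok"] = False

    # Daemon thread: a listing that hangs forever will not block interpreter exit.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    # No recorded outcome means the listing was still hanging at the timeout.
    return outcome.get("ok", False)
```

From a Windows client you would call something like share_responds(r"\\fileserver\share") on a schedule and alert when it returns False.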

Server setup:

Windows 2012 Standard server running on VMware ESXi 5.5

Direct attached storage, RAID 5 on a Lenovo R700 HW RAID controller. Windows disks are just regular vmdk files from VMware, no passthrough, no iSCSI etc. Saw the same issue on a previous RAID 10 array, with a different controller. 

Other (non win2012) VMs on the same datastore have no problems. 

There has never been an AV on this server. The clients are running ESET v5. 

I have a separate Win 2012 domain controller that remains online; "netstat -b" shows 2 connections on 389 to the DC from the file server before and during the issue. I've never seen the file services on the DC have trouble, but they're not doing much. 

Thoughts:

Like most here, I don't think it's a hardware issue - there are too many things that don't fit. I believe it's an issue with the SMB/LANMANSERVER/SERVER service or a related driver/service. It's hard to believe that after all this time MS doesn't seem to have a common patch.  

Fixes just tried:

I have just tried the following fixes suggested here and I'll report back on the success. 

--Change the Virtual network adapter from E1000 to VMXNET3

--Autostart SRV2 from an elevated CMD prompt: sc config srv2 start=auto

--unload driver from an elevated CMD prompt:  fltmc unload vsepflt

--Change:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\vsepflt\Start

to value "4"

reboot

***************************

Update: going on 3 weeks, SMB's still working well. I think I've licked it, I was rebooting 1-2 times a day before. 







  • Proposed as answer by CloudThomas Friday, April 25, 2014 7:40 PM
  • Edited by CloudThomas Sunday, May 11, 2014 3:41 AM
April 22nd, 2014 7:08pm

Hi all,

This is the response I had back after they analysed our last memory dump:

The SRV2 threads responsible for processing incoming SMB requests are stuck on NTFS lock, owned by  another thread trying to perform file system IO. The file system is hung as couple of IO requests containing 2043 packets to the device SCSI\Disk&Ven_EQLOGIC&Prod_100E-00\000000 has been blocked for over 17 minutes. This has caused many SRV2 threads to be hung to ensure serialized access to file system resources. With no more Threads available in the SRV2 queues, large number of SRV work items are queued up and system is unable to process new SMB requests.
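
To picture the mechanism described there, here is a toy Python simulation. It is not SRV2 itself, just a fixed-size worker pool where every worker is blocked behind one stalled IO, so new requests can only queue. All names (simulate_stall, the pool sizes, the grace period) are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def simulate_stall(pool_size, stuck_requests, new_requests, grace_seconds=0.5):
    io_finished = threading.Event()        # stands in for the blocked disk IO
    worker_blocked = threading.Semaphore(0)

    def stuck_request():
        worker_blocked.release()           # signal: this worker is now tied up
        io_finished.wait()                 # held until the stalled IO completes
        return "served late"

    def new_request():
        return "served"

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        stuck = [pool.submit(stuck_request) for _ in range(stuck_requests)]
        # Wait until every worker that can pick up a stuck request has done so.
        for _ in range(min(pool_size, stuck_requests)):
            worker_blocked.acquire()
        fresh = [pool.submit(new_request) for _ in range(new_requests)]
        done, pending = wait(fresh, timeout=grace_seconds)
        result = {"served_while_stalled": len(done),
                  "queued_while_stalled": len(pending)}
        io_finished.set()                  # the IO finally completes
    # Leaving the `with` block waits for all queued work to drain.
    result["served_after_recovery"] = sum(f.result() == "served" for f in fresh)
    return result


print(simulate_stall(pool_size=2, stuck_requests=2, new_requests=5))
```

With every worker stuck, none of the five new requests is served until the IO completes. Raising pool_size only helps for as long as the stalled requests don't fill the bigger pool too.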

Suggestion:

  1. To  engage vendor of SCSI device EQLOGIC in order  to  verify if there is any underlying issue with the disk.
  2. As a work around try increasing the number of SRV2 threads using the following registry key on the File server. Though this will delay the issue in current circumstances but will not guarantee the remediation of the issue.

http://technet.microsoft.com/en-us/library/cc957460.aspx

Key        : HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

Type      : DWORD

Value    : MaxThreadsPerQueue

Default value for MaxThreadsPerQueue is 20 you can try increasing to 1024.
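
For scripting this across several servers, a small Python sketch that builds the matching reg add command. It only prints the command rather than running it; reg_add_dword is a hypothetical helper, and the key, value name and data are the ones quoted above.

```python
PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"


def reg_add_dword(key, value_name, data):
    """Build the argument list for `reg add` that writes a REG_DWORD value."""
    if not 0 <= data <= 0xFFFFFFFF:
        raise ValueError("REG_DWORD data must fit in 32 bits")
    # /f overwrites an existing value without prompting.
    return ["reg", "add", key, "/v", value_name,
            "/t", "REG_DWORD", "/d", str(data), "/f"]


command = reg_add_dword(PARAMETERS_KEY, "MaxThreadsPerQueue", 1024)
# Printed rather than executed, so it can be reviewed before use.
print(" ".join(command))
```

Run the printed command from an elevated prompt on the file server.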

If issue still occurs please collect the kernel dump again as  before when issue occurs without running the IOMeter and send it for further analysis.

I haven't tried the registry key yet.  It looks like there is an issue elsewhere which is causing this problem.  I'm going to try and capture another memory dump without running IOMeter.  I can't see that the issue is with our storage system, as everyone else here is seeing it on widely varying hardware.



  • Edited by DJL Wednesday, April 23, 2014 10:58 AM
April 23rd, 2014 10:53am

I'm experiencing the same issue on a Windows 2008 Server running as a VM within a Windows 2012 Hyper-V cluster.   It's running off of Dell Servers and an Equallogic SAN.   Every week we experience the issue where no one can connect to that specific 2008 server's file shares.  When trying to stop the server service it hangs and won't shut down.  When shutting down the server it also hangs and I have to force a shut off.  After it restarts everything is fine.  I have not tried any of the fixes recommended so please report back if any of the fixes listed are continuing to work.   Thank you everyone for sharing.
April 23rd, 2014 12:54pm

For what it's worth, our SAN is Nimble Storage, so it's not the SAN vendor (unless they're both messed up).  I suspect in your case you have EQL mounted directly via a software initiator?  I bet if it was a VMware virtual disk, they'd be blaming VMware.  Regardless, the storage vendor is a red herring.  To me, it still points back to something messed up in the SMB stack.  

I was really hoping they were going to come back with something a little more solid (as I'm sure you were as well).  They're troubleshooting the symptom, not the problem.

April 23rd, 2014 6:54pm

We are having the same issue on a physical 2k8 R2 server with local storage. 
April 24th, 2014 3:29am

Our server didn't have the MaxThreadsPerQueue DWORD value, but I've created it now. Let's see if it helps.

Key        : HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

Type      : DWORD

Value    : MaxThreadsPerQueue

Default value for MaxThreadsPerQueue is 20 you can try increasing to 1024.

April 24th, 2014 4:15am

Just wanted to chime in that we are seeing this exact issue on Windows Server 2008 R2 Standard running on ESXi 5.0.0.

Although it happens a lot less frequently for us, maybe once a month or so.

The last time it happened we had no active AV and the only non-Windows software running is a Commvault backup agent.

I'll be following this thread closely and look into using IOMeter to recreate the problem.

April 24th, 2014 2:25pm

Eric - yes, definitely a red herring - they've tried to blame 3rd party software/hardware several times now.  We use the Microsoft iSCSI initiator, so literally the whole stack is Microsoft through from receiving/sending the SMB packets at guest level all the way through to iSCSI at the host level.  The only 3rd party code is Intel drivers and the Dell EqualLogic MPIO DSM on the hosts.

Regardless I spent a day checking and updating our storage arrays.  Dell took diagnostic logging and couldn't find anything wrong (surprise!) so at least that should keep Microsoft happy.

I've now got to try and capture the memory dump again, plus some new storage tracing, but without using IOMeter to recreate the problem. 

Is anyone else with open support cases getting anywhere?

April 28th, 2014 9:10am

I can confirm we see the issue too,

We are using an EMC VNX5200 SAN, FC to Server 2012 R2 Hyper-V cluster

The actual error is occurring on our primary file server, Server 2012 R2 Enterprise. We have fairly low IOPS so the problem has only occurred twice this year, but the fact it has recurred is worrying.

Like everyone else, the server responds to RDP and locally the drives are fine, it is a SMB access error.

I look forward to getting a solid solution in the future.

May 1st, 2014 1:07am

Microsoft are now analysing another memory dump and set of tracing from one of our servers...

May 6th, 2014 12:49pm

We had exactly the same problem on 4 Windows 2012 clustered VMs on ESXi and have opened a case with MS.

Before doing anything else, support asked us to update some components:

srv2.sys, srvnet.sys, srvsvc.dll, mrxsmb.sys and mrxsmb20.sys.

http://support.microsoft.com/kb/2899011

We updated these components without much hope, because the description of the patch does not match our problem, but since then we have gone more than one month without an issue...

If that helps...

May 7th, 2014 3:15pm

We are also experiencing the same issue.

Infrastructure running on Windows Server 2012 on VMware ESX 5.5 Update, IBM FLEX Chassis, Blade X240 and IBM V3700 Storage. Our AD is 2008 R2. FFL 2003 and DFL 2008 R2.

We are also running DFS Namespace on top of the file server.

Symptoms:

Intermittent disconnection of mapped drives

Cannot access share or slow to open

Server hangs at restart

Once hard rebooted, the server is up and running.

Just logged a call with Microsoft.

No relevant log on Windows, ESX or Storage.

Has anyone found a permanent fix?

May 8th, 2014 8:00am

I'm wondering if anyone has found this related to backups of the server?  It seems that about the time the server shares start to have issues is about the time our backups start.  Again it doesn't happen every time our backups take place but when it does happen, it is during that backup time window of the server.  We are using Symantec Backup Exec 2012 using the remote agents.
May 8th, 2014 1:55pm

I don't think our case is related to backup. One server is backed up with Veeam and another with Windows Backup. And sometimes the servers run for a week without issues, while backups run daily.
May 10th, 2014 8:48am

Hi Everyone, please take a look at my post above from Apr 22. 3 weeks on, no forced reboots!
May 11th, 2014 3:44am

Check the Windows Event logs for errors that might help with troubleshooting this.
May 13th, 2014 6:15pm

So my case has been escalated again "as it's more complex than normal" and I've also had the "we will only spend commercially reasonable efforts on this case going forward" disclaimer.

I urge anyone experiencing this problem to open a support case with Microsoft (if you don't have a support contract it'll cost you £240...stick it on a credit card).  The more cases they have reported, the more worthwhile it is for them to spend time fixing the problem.


I managed to make one of our SQL Servers fall over today - I tried to copy a backup of a database off the server using SMB...big mistake!  Yet it's quite happy when SQL hammers the drive with tens of thousands of iops and 300MB/s.  SMB is broken! I could be moving to Linux soon! :s
  • Edited by DJL Tuesday, May 20, 2014 6:18 PM
May 20th, 2014 6:15pm

@DJL

Moving to Linux is what I chose for the time being. The amount of time (translation: money) our company has wasted on troubleshooting this particular problem was enough to convince me that, at least for a simple file share, it made sense to switch over. For you, the time needed to set up a Linux server is likely more daunting given how many users you have. A simple Samba file share was sufficient for us for the time being. I was hoping "time being" meant 4-6 weeks for a patch, but now I'm thinking I'll be lucky if a patch is issued by 2015. I don't understand how this isn't a bigger issue. If it's a fundamental flaw in how 2012 is handling SMB, which it appears to be, I can only assume it's a much more widespread issue than they're admitting to. If I can convince my boss to open a case with M$, I will. We have certain industry applications that ONLY run on Windows, so at some point we're going to NEED this to be fixed. For now, workarounds are enough.

@LMosla

I think we're a little beyond the whole "windows event log errors" stage at this point. 

May 21st, 2014 6:16pm

I have had this issue since Jan 2014; the only solution is to go back to Server 2012 (not R2) or Server 2008 R2. All of the fixes mentioned here reduce the occurrences, but also appear to affect performance, which degrades over time. After changing back to 2008 R2 everything works fine!
June 3rd, 2014 8:57am

>> So my case has been escalated again "as it's more complex than normal" and I've also had the "we will only spend commercially reasonable efforts on this case going forward" disclaimer.

>> I urge anyone experiencing this problem to open a support case with Microsoft (if you don't have a support contract it'll cost you 240...stick it on a credit card).  The more cases they have reported, the more worth while it is for them to spend time fixing the problem.

>> I managed to make one of our SQL Servers fall over today - I tried to copy a backup of a database off the server using SMB...big mistake!  Yet it's quite happy when SQL hammers the drive with tens of thousands of iops and 300MB/s.  SMB is broken! I could be moving to Linux soon! :s

I'll be getting a case open soon.  We had the issue come back after 2 months.  There's clearly something particular that triggers it, but I have not been able to reproduce it manually.
June 4th, 2014 6:13pm

Hi all,

So my case has just been archived for the time being.  I have been told that Microsoft have no fix for this at the moment. Apparently they are aware of some issue with SMB 3.02, although I have no further information other than that.

Basically as soon as any updates to the relevant dll's are produced the engineer will let me know so I can see if they fix the problem.

So basically... that's it... I now just have to wait, or dump 2012 R2... not great news

I still think it's worth opening support cases if you can, as it will at the very least bring more attention to the problem, and they may just discover something that helps isolate it.

June 10th, 2014 10:19am

We disabled SMB2 today. Don't know what will happen. 
June 19th, 2014 7:29am

>> we have disabled smb2 today. don't know what will happen. 

No luck. It stopped responding again.
June 20th, 2014 6:05am

>> Hi all,

>> So my case has just been archived for the time being.  I have been told that Microsoft have no fix for this at the moment. Apparently they are aware of some issue with SMB 3.02, although I have no further information other than that.

>> Basically as soon as any updates to the relevant dll's are produced the engineer will let me know so I can see if they fix the problem.

>> So basically... that's it... I now just have to wait, or dump 2012 R2... not great news

>> I still think it's worth opening support cases if you can as it will at the very least bring more attention to the problem, and they may just discover something that help isolate the problem

I opened a case (finally).  114062611570165
June 26th, 2014 12:32pm

Just got off the phone with MS, has anyone tried this yet?

http://support.microsoft.com/kb/2957623/en-nz

They're pretty confident it fixes the problem.

June 26th, 2014 5:14pm

I had a 2012 R2 server experience "the issue" on 6/10.  It had the May rollup (KB2955164) installed and rebooted on 6/4, so I don't think it helps.  I think I will try the DisableLeasing registry key, it's listed as a workaround.  I have quite a few production 2012 R2 file servers out there, and so far 2 have experienced the SMB lockup.  I am unable to reproduce even with performing millions of scripted create/update/delete operations from a dozen clients.

REG ADD HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing /t REG_DWORD /d 1 /f
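
For anyone wanting to confirm the key actually landed before waiting weeks for the next lockup, here is a rough Python sketch. It only parses text you paste in from `reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing`. The function name and the assumed output layout (value name, type, hex data on one line) are my own assumptions, not anything from the KB article.

```python
import re

# Value name taken from the REG ADD command above.
VALUE = "DisableLeasing"

def leasing_disabled(reg_query_output: str) -> bool:
    """Return True if pasted `reg query` output shows DisableLeasing set to 1.

    Assumed line shape (hypothetical sample):
        DisableLeasing    REG_DWORD    0x1
    """
    pattern = re.compile(
        r"^\s*" + re.escape(VALUE) + r"\s+REG_DWORD\s+(0x[0-9a-fA-F]+)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(reg_query_output)
    if match is None:
        # Value not present in the pasted text, so the workaround is not set.
        return False
    return int(match.group(1), 16) == 1
```

This does not touch the registry itself, so it is safe to run anywhere; it only tells you whether the value is there, not whether the server has been rebooted since it was added.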

June 27th, 2014 9:54pm

We have had our 2012 R2 Standard file server VMs in production since February, but only started seeing this issue on June 22, after I installed regular Windows updates.  Restarting server service hangs, will not reboot - have to hard power off (but oddly don't get startup errors on power-on).  Same basic symptoms (no sharing, can see files on server via RDP, no pertinent errors in Error Log); here's our spec:

Dell 720 hosts, Compellent 10k fiber-attached storage

ESXi 5.5, VMXNET3 NICs (since build)

We are looking at downgrading to 2008 R2, but I'm worried when that OS will be EOL, forcing us to upgrade to 2012 - maybe they will have a fix by then? ;)

July 24th, 2014 12:28pm

Don't forget that if you have AV installed, it could be a suspect as well.  If you are still seeing the symptoms, and you've applied all patches (including the one I linked), then you should make sure your server has all software ripped out of it.  MS will make you do this anyway if you open a case with them.  

That said, open a case with them.  The more exposure, the higher the likelihood of getting a resolution.

Finally, so far (fingers crossed) our server has been stable since the patch.  Won't know for sure for at least a couple more weeks though.

July 25th, 2014 12:46pm

has anyone had any luck with the two KB articles that are mentioned?

we have suddenly been hit with this SMB issue and it's crippled our AutoCad\Revit 

July 28th, 2014 2:26pm

It appears we have this issue too but with Server 2008 R2.  Still doing some investigative work to confirm. 

Basically every couple weeks we lose access to our shares.

We can ping the server, RDP to the server and access the files on the server.

We can't hit the shares by IP, Name, FQDN.  Nothing is coming in any of the event logs that I have noticed.

To fix is just to reboot the server.  Have tried restarting the server service but that hangs and doesn't stop and start.

July 28th, 2014 7:30pm

No updates in 3 weeks? Please tell me microsoft has some news.

We have been experiencing this about every 3-4 weeks since we upgraded to 2012.  This is a VM on ESXi 5.1 and the user volumes are physical RDM disks. The srv2 appears to be already running, and likely was during the recurrence of the issue today.  And we do not have the vmtools filter driver installed.

August 21st, 2014 3:36pm

Just curious for anyone else having these issues, do you have shadow copies configured? I ask because the last two times we had this issue, all of the previous versions we had for our two data volumes were gone.  We generally have at least 2-3 weeks of versions (twice/day). 
August 21st, 2014 3:44pm

We've been experiencing this issue ourselves.  It seems to have started in the past few months and occurs about every 3-4 weeks on Server 2012 R2.  We have Deduplication and Volume Shadow Copies enabled, but I haven't been able to log in yet to verify whether the shadow copies are still present.  This is also running on vSphere 5.5 and EqualLogic PS6100, as many other people here seem to have as well.

This is the main File Server and I'm considering moving the files to a 2008 R2 server.  I don't believe it would be too difficult as I used DFS name spaces for all the shares and drive mappings.


August 26th, 2014 10:02pm

SMB2 is dying, I don't think dedupe or volume shadow is a factor.  However I will mention I am running dedupe on all of my 2012 R2 file servers.  I managed to perform a NotMyFault kernel dump with the issue occurring after the May rollup was applied and rebooted, I plan on opening a Microsoft Premier level 3 call.
August 27th, 2014 2:58pm

Hi,

Just an idea: please make sure that you have the VERY latest firmware and driver versions for your NICs, especially the Broadcom ones.

I have experienced similar issues on several setups. At the moment, not 100% sure (I need more time to be sure the issue does not happen anymore), but it seems that in my case it was an issue with VMQ being defective on Broadcom NICs. Latest drivers solve the issue, see http://support.microsoft.com/kb/2902166/en-us

In any case I would direct you towards checking that your network is completely healthy. Not sure if it applies to you, but in my case there is a CLIENT bug behind all of this (and an old one !). See this : http://social.technet.microsoft.com/Forums/windowsserver/en-US/9f93508c-71fa-4807-b41a-8f558563afe3/windows-server-2008-r2-loses-ability-to-connect-to-network-share?forum=windowsserver2008r2networking

In my case the issue is that I sometimes lose access to \\server\share, while access to \\server.fq.dn\share remains OK -other times, it's the other way around.

Not sure of course if it's the same bug you have. But the idea behind is that SMB maintains a list of "unreachable servers" on the client, this list should time out, but does not.

If for some network-related issue (in my case, the VMQ feature) you sometimes, even very shortly, lose the network connection to the server, seems that this list gets filled.

August 28th, 2014 11:31pm

Just so folks know, after the update, we have been rock solid.  One other note: we run VMware, and as part of applying the patch we also ensured we were running the latest VM version and tools (which include updated drivers) and all the latest MS patches.  We're also on the latest VMware version (5.5), for what it's worth.

That said, if you're still having issues, take your server down to bare bones, check your filter drivers and make sure only the base filters are loaded.  Finally, call MS if you're still having issues.  That's what I did and what ultimately pointed me to the patch.

August 29th, 2014 2:42pm

Hi,

We have a similar issue to everyone else in this thread and have tried everything that has been suggested here, e.g. disabling leasing, setting the srv2 service to auto start, and switching between SMB1 and SMB2, but as yet nothing has fixed the problem. We ran the same iSCSI file share on Windows 2008 R2, fully patched, for around 4 years with no issues, but decided it was time to upgrade to 2012 R2, and that seems to have been a bad choice! We also have a case open with MS, who have taken network traces and diagnostics and seem to think that it is a NIC teaming issue, but from what I have read online, other people have made similar changes after being asked to and it hasn't fixed the problem. 

Any help/fix for this would be great!

Thanks

October 3rd, 2014 8:26pm

Hi,

I am having a similar issue and curious if anyone has experienced the same. I have a mixed Windows environment with Server 2003 up to 2012R2. Same as above, our primary file server was Server 2008R2 and running without issue for years. I replaced it with a new Server 2012R2 install and migrated the shares. All Windows 7 and 2008R2 or older clients connected to the shares without issue. All of my Windows 8/2012 and newer clients instantly cannot access the file server "\\Server" by host name, but can access it by FQDN "\\Server.contoso.com" and IP "\\10.0.0.0". After a reboot of the client, it would successfully connect by host name without issue.

I am running the file server on a Hyper-V 2012 R2 cluster with teamed nics. The file server is fully patched and has been restarted.

If anyone has seen this please help.

Thanks

October 3rd, 2014 9:11pm

My issue was resolved and was linked to Active Directory.  We were imaging using WDS and (not me) the image had a bug.  We use GPP to map drives and printers.  The workstations would not let go of the port 389 connection.  We patched almost 300 workstations, all holding 100's of connections, and the problem went away.  The file server could not authenticate users.  Everyone above might be experiencing the same symptoms, just with different causes.

My recommendation is to lookup all possible hotfixes for your SMB and OS version.  Apply and verify.

  • Proposed as answer by tcgood Tuesday, October 07, 2014 3:11 AM
  • Edited by tcgood Tuesday, October 07, 2014 3:12 AM
October 7th, 2014 3:11am

Just a shot in the dark here: is anyone else who is still having the problem running Extreme-Z IP? 

We are having very similar issues to the OP, very similar config:

1. Server 2012 R2 VM running on local, fiber-attached storage

2. Fresh (never-before used) LUNs connected via iSCSI

3. File sharing periodically stops, but you can still RDP in and see all the LUNs

4. Restarting the server service hangs and you have to fat-finger the VM

5. Fresh workstation build of 8.1 completely updated, also cannot connect when sharing is down, and server itself is patched to current

We thought it might be shadow copies, as we see .vss errors, but even after disabling shadow copies we still have the issue every couple of weeks.  The only other thing that affects file sharing (although it shouldn't affect SMB) is Extreme-Z IP, but I didn't see anyone mention it here.  Before I uninstall, I thought I would check here :).


  • Edited by baronfunke2 Tuesday, October 07, 2014 9:10 PM
October 7th, 2014 9:09pm

I have a lot of servers that run e-zip, however the two lock ups I've had after applying the May rollup and rebooting, none of those systems are running e-zip.
October 8th, 2014 2:36pm

Hi vitis,

I have a similar issue to yours. May I know whether changing Srv2.sys to auto start really solved your issue? I am confused after seeing all the other replies that came after yours. Thanks!

October 9th, 2014 8:16am

TCGood,

Can you please elaborate on what the problem was exactly and how you made that determination?  And more importantly, what you did specifically to "patch" the clients?  We also use WDS (SCCM) to deploy our client OS images, plus GPP for drive mapping, and are wondering if this is similar.

We are seeing this issue with 2008R2 (Hyper-V 2012 guest) and my guess is that MS has updated SMB on it to match client features/patches.  Thus we are seeing periods where SMB is unresponsive or VERY slow. The effect is dramatic as we do folder redirection and the clients become unresponsive when SMB is slow.

In our case we see all network traffic go to effectively zero for a period, then it recovers on its own.  We have disabled leases and patched SMB with any and every hotfix we have found.   The issue has become less dramatic but is still occurring.

Thanks for the help


  • Edited by SteveLith Thursday, October 23, 2014 6:17 PM
October 23rd, 2014 6:15pm

Hi all,

I've just started rebuilding our cluster host to move from Windows Server 2012 to Windows Server 2012 R2 and updated all drivers, firmware etc in the process and I'm still seeing the same problem.

Very annoying!  I've heard nothing from Microsoft since June either, even though they said they'd inform us when new versions of srv2.sys etc were released.

Problem is happening with 2012 and 2012 R2 guests on either 2012 or 2012 R2 hosts

October 27th, 2014 1:43pm

Subscribe.

We have a 2012r2 hyper-v cluster on a Dell MD3200 shared sas array with teamed Intel nics.

One of our 2012r2 file server guests just stops responding to SMB connections at random (has happened twice now).  This server has been running fine for 2 years (Was upgraded from 2012 to 2012r2 almost a year ago).  Nothing but a reboot cures it. We have two other similarly configured guests on the same cluster and this doesn't happen.

October 27th, 2014 2:32pm

So, I logged into the DC and found 1000's of :389 connections when running netstat.  Our computers were being logged off and back on, but each logon would leave a connection to :389 behind.  Computers were logged into many many times.  So our FS could not authenticate users any longer.

http://support.microsoft.com/kb/2775511

and

http://support.microsoft.com/kb/2561285

I think these are the links you want, but I cannot remember the 2 I installed on the FS that regarded SMB.  Basically users cannot authenticate.  The 15 minute wait our users had when logging in was basically them falling back to using local cached credentials.  We were seeing outages several times a day with this.  Ran the patch on all Windows 7 machines in the environment (about 265) and bam, we are sailing smooth again.  I have since moved to another location and don't have the details.

Your situation might be different but the outcome is the same.  The only netstat info on the DC should be Exchange and FS.  All connections can happen and then disconnect.
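
If you want to check for the same pile-up of :389 connections on your own DC, a rough Python sketch along these lines could tally them per client. It assumes you have saved the output of `netstat -an` to text and that the lines follow the usual Windows column order (Proto, Local Address, Foreign Address, State). The function name is made up for illustration.

```python
from collections import Counter

def ldap_connections_by_client(netstat_output: str, port: int = 389) -> Counter:
    """Count ESTABLISHED TCP connections to the given local port, per remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip banners, the header row, and UDP rows (which have no State column).
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, foreign, state = fields[1], fields[2], fields[3]
        if state != "ESTABLISHED":
            continue
        # The port is whatever follows the last colon (also works for [IPv6]:port).
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        client = foreign.rsplit(":", 1)[0]
        counts[client] += 1
    return counts
```

Calling `ldap_connections_by_client(text).most_common(10)` would list the ten clients holding the most connections. What counts as "too many" is a judgment call; in the case described above it was hundreds per workstation.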

October 29th, 2014 4:44pm

We've just been doing some testing with fixed VHDXs and pass-through disks on Server 2012 R2.

For us the SMB issue doesn't appear to occur on file servers which use fixed VHDX disks on CSVs rather than pass-through disks.

Also, having tested the performance, we are seeing much better performance using VHDXs on a CSV compared to pass-through disks, which is the opposite of what we would expect!

We are seeing anything from a 20% to a 300% improvement in IOPS, which is crazy.  This is using the same server, same VM, same iSCSI SAN and same settings.

We're going to move a file server over to VHDX and see how that goes in production.

December 2nd, 2014 1:10pm

Have a look at this thread http://www.symantec.com/connect/forums/network-shares-stop-responding-randomly-windows-server-2008-r2. SEP 12.1.4100.4126 might also be the culprit. I am seeing our SEP admin to downgrade to 12.1.40.13.40
December 20th, 2014 9:38am

hi,

same problem here - virtual Windows Server 2012 on vmware esx - SMB stops responding with same errors in eventlog.

Additionally we have these kinds of Event Log entries:

Reopen failed.
Client Name: \\10.*****.*****.*****
Client Address: 10.*****.*****.*****:52948
User Name: Domain\Username
Session ID: 0x880000000E4D
Share Name: Users01$
File Name: *****\MSOFFICE\VORLAGEN2010\NORMALEMAIL.DOTM
Resume Key: {00000000-0000-0000-0000-000000000000}
Status: Object Name not found. (0xC0000034)
RKF Status: STATUS_SUCCESS (0x0)
Durable: false
Resilient: false
Persistent: false
Reason: Reconnect durable file

Guidance:

The client attempted to reopen a continuously available handle, but the attempt failed. This typically indicates a problem with the network or underlying file being re-opened.

Restarting server service helps. We already installed KB2955163 as suggested in https://support.microsoft.com/kb/2957623

http://support.microsoft.com/kb/2955163

Yesterday the problem occurred again and no shares were available on this file server. Restarting the server service did the job again.

Our server is now up to date and the problem still occurs ~ once a month.

Any other suggestions? I think we will open a case @ MS soon...


  • Edited by StefanHagl Thursday, January 08, 2015 9:24 AM
January 8th, 2015 9:22am

We are having the exact same issue with users losing connections to their shares. This is causing their TS sessions to lock up because we are using redirected roaming profiles. Doesn't seem like there is a solid answer to this issue. We are using XenServer 6.2 so we can't try the VMWare fixes. Subscribing to see if anything comes up.
January 14th, 2015 4:43am

Just wanted to pop in:

We've had the EXACT same issue:

Dell NX3200 NAS Windows Server 2012 Storage Server (Windows 2012 R1)

About once or twice a month, the LanManServer service becomes unresponsive and kicks off all SMB file shares, while the server remains active on RDP and iSCSI. You can ping the DNS name and it resolves, so it's not a networking issue, and the DCs see the server and are giving it permissions accordingly.

You cannot restart the LanManServer service manually, it locks up as 'Stopping' and you cannot Restart the file server as it hangs at 'Shutting Down' because it cannot kill the LanManServer service.

I've tried the Microsoft patches made for this issue, no dice. The only current fix I have is hard restarting the server itself. This is incredibly frustrating, even more so that Server 2012 R2 seems to be affected as well, so upgrading the OS will not do anything to resolve the issue.

This is affecting our business performance and even though downtime is minimal during these restarts, we are heavily dependent upon this file server. About ready to seek alternative SMB file server solutions.
  • Edited by CommieGIR Thursday, January 15, 2015 6:06 PM
January 15th, 2015 6:04pm

Hi There,

Thanks for the quick lesson on iSCSI best practice.  As stated I have already checked the underlying storage/networking/iscsi/mpio etc... and there are no problems at all. The same iSCSI/cluster has been running production vm's for 4 years now without any issues.  

I find it weird that when the SMB service manages to get locked up like this, I can still browse the files fine on the server.  That would rule out any underlying physical storage issue surely? 

One theory I had could be perhaps the use of an iSCSI passthrough disk in the 2008 R2 host to the 2012 R2 guest.  This is the only thing unique to this VM,  all other guest vm's use vhd files on CSV's.  


 Yeah, it does not appear to be related to iSCSI, unless the iSCSI service on 2012 is knocking out LanManServer. We have VMs hosted on our 2012 machine via iSCSI and when the SMB shares fail, the iSCSI shares remain responsive and operational.
January 15th, 2015 6:15pm

On my end, as mentioned it seems SEP was the root cause. No issue since downgrade to 12.1.40.13.4013 done on 20.12.14.
February 14th, 2015 6:33pm

I'm experiencing this too, with Server 2012 (not R2) on a fairly simplistic virtual machine on ESXi 5.1. I end up with the LanmanServer service not responding to stop requests and taskkill.exe hanging when trying to stop it that way.

The consensus seems to be to set srv2.sys to start automatically instead of demand-start, disable problematic anti-virus filter drivers, and change NICs on ESXi to use vmxnet3 instead of e1000e. But this doesn't seem to work for everyone. Further, there's been some finger-pointing going on, blaming external storage and such.

Has anyone tried building a generic PC with a generic ATA or SATA disk and supported NIC, and reproducing the problem on that? This should eliminate third-party external storage or iSCSI as the cause and tell MS to stop finger-pointing. On this same generic platform, what about comparing service pack levels or specific patches? I had this server running since mid-2013 but hadn't had this happen until about mid-2014.

My own symptoms don't include any event log 1020 entries in the SmbServer operational log. I get a lot of 1016 errors instead, and these seem to happen with a specific application that's run over the network and no other shares. My own filter list looks like this:

Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
VirtFile                                0       429999.280700    0
MpFilter                               67       328000         0
Cbafilt                                 3       261150         0
Datascrn                                0       261000         0
Dedup                                  65       180450         0
Quota                                   0       125000         0
npsvctrig                               1        46000         0

System Center Endpoint protection is installed, but currently has real-time file scanning turned off after I read about virus filter drivers being part of the problem. I do have deduplication turned on in one volume.
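
For comparing filter lists between servers in this thread, here is a rough Python sketch that parses a table like the one above and reports which filters actually have instances attached, highest altitude first. It assumes the four-column layout shown in my output (name, instance count, altitude, frame). The `Filter` type and function names are made up for illustration, and it makes no judgment about which filters are "base" ones.

```python
from decimal import Decimal
from typing import List, NamedTuple

class Filter(NamedTuple):
    name: str
    instances: int
    altitude: Decimal  # Decimal because altitudes can be fractional, e.g. 429999.280700
    frame: int

def parse_filter_table(text: str) -> List[Filter]:
    """Parse rows of a filter listing; header and separator rows are skipped."""
    filters = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # header row splits into more than four words
        name, instances, altitude, frame = fields
        try:
            filters.append(Filter(name, int(instances), Decimal(altitude), int(frame)))
        except (ValueError, ArithmeticError):
            continue  # the dashed separator row fails numeric conversion
    return filters

def attached_filters(text: str) -> List[str]:
    """Names of filters with at least one instance, sorted by altitude (highest first)."""
    active = [f for f in parse_filter_table(text) if f.instances > 0]
    return [f.name for f in sorted(active, key=lambda f: f.altitude, reverse=True)]
```

Run against the list above, this leaves MpFilter, Cbafilt, Dedup and npsvctrig as the filters with attached instances, which makes it easier to diff against a server that does not lock up.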

--

February 23rd, 2015 2:24pm

We're seeing the same problem with 2012 R2 backed by VMWare ESX 5.5. We've recently opened a case with Microsoft and will keep this thread updated on progress.

I can say adjusting the srv2 service to auto did significantly reduce the number of warnings (event 1020) present in our SMBServer->Operational log. Since the service adjustment last Friday only 1 event has been logged. The shares would go offline at the times these events were logged.

Previously we'd see 200+ events in that log on a bad day.

We already use the vmxnet3 adapters, they've been in place since the start of the issue and our VMWare tools are fully up-to-date.

Lastly, we see these warning messages across 3 VMs, one of which is running on completely different hardware/storage/network etc. It seems directly related to 2012/2012 R2 and not environmental.

Wish Microsoft had better answers. We migrated from a 2008 R2 system that never experienced these problems on the same underlying hardware.

February 27th, 2015 2:42pm

Going to chime in here as we have been experiencing this for almost a year.  Running a 2012 R2 VM on ESXi 5.5 with a mix of RDM and VMDK disks. It used to happen about once a month, not enough to even spend much time troubleshooting the issue as it was easier to power cycle the VM.  It gradually increased to every week, and is sporadic now. Sometimes we will go 3 weeks without the issue, then all of a sudden have it 3-4 times in one day.

We started on 2012, which was an upgrade from 2008. The VM had VMXNET3 NICs. We started with Symantec Endpoint Protection, removed that and used MS Endpoint, then removed AV altogether.  Last month we lifted the volumes and put them on a clean build of 2012 R2, using e1000 NICs, and no AV. We still see the issue.  We have given MS netstat, perfmon, and Wireshark packet captures.  No help.  Then they wanted a memory dump, but because our RDM LUN is greater than 2TB, we couldn't use VMware to snapshot the memory like they wanted.  So I figured out how to blue screen the VM using NMI; they said the dump doesn't look like a complete dump, and wanted us to use the vmss2core utility to turn a snapshot into a memory dump.  Well, they botched the instructions on that, because they just told us to take a disk-based snapshot, which is not what the utility needs. 

May 2014 rollup was already applied. Client side packet capture just shows SMB reset being sent after the 60 second default connection time lapses.  get-smbopenfile on the server still shows tons of files open, but  no one can connect.  I could initially connect still from a 2003 server, but its not always consistent.

We waste so much time with MS and they haven't even provided anything useful.

Does anyone have updated info from MS?

March 3rd, 2015 2:43pm

Problem is gone!

I installed http://support.microsoft.com/en-us/kb/2957623/en-nz Hotfix. This did not help... But then I added the registry key mentioned in the KB article as a workaround and since then our file server works without any error.

REG ADD HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing /t REG_DWORD /d 1 /f

(The Server needs a reboot after adding the key!)

The problem is gone now for about 2 months *knocking on wood and throwing salt over my shoulder*

Give it a try and post if this is working for you.

Cheers, Stefan

  • Proposed as answer by StefanHagl Thursday, March 26, 2015 6:33 AM
March 26th, 2015 6:33am

Tried the workaround above... it still doesn't help.
April 21st, 2015 1:38pm

Same Problem here. Is there no official Fix for this Problem at this Time from MS? I can't believe it...
April 30th, 2015 9:10am

I have the same Problem started on April 15th for an accounting firm trying to finish the taxes.

we removed all updates from that night but to no avail.

I have been pulling my hair out and now have to shave my head.

We're running Hyper-V and plan on moving the shared drive to another virtual machine to see if that helps.

May 2nd, 2015 7:19pm

We have the same problem on our file servers. I will try the srv2.sys service change on one of the file servers.
May 8th, 2015 11:47am

DJL,

I see you are using Work Folders in your environment. I'm from the team that delivered Work Folders, and I'd love to hear your feedback on the feature. Would you be interested in a 30 minute chat over the phone to give us feedback on Work Folders? Appreciate your time.

I can be reached at jianyan at microsoft dotcom.

Thanks

May 28th, 2015 9:26pm

Is Symantec in the topology?

https://support.symantec.com/en_US/article.TECH225417.html

May 28th, 2015 9:40pm

For what it's worth, I'm still struggling with this too. Still working around it with a samba file share on a second VM, but I have a few files shared on 2012R2 and still see this intermittently. It also screws up the print server when it rears its head. Can't stop or restart server service on 2012R2 so the only option is a reboot.

No Symantec in the topology, and it's a mostly clean install of 2012R2 as a VM on ESXi 5.5. Very little software has been installed besides that on the server, which leaves me pretty confident this is still a Microsoft problem. More specifically, it's an SMB related Microsoft problem. 

This link describes our issues almost exactly: https://support.microsoft.com/en-us/kb/2957623

....but neither the workaround nor the patch has helped, as it did for Stefan. At this juncture, I'm about ready to scrap this VM and start over. Hope that in the next iteration whatever is causing it doesn't show up. 

Did anyone with an open case ever get ANY resolution to this? 



  • Edited by bergmbe Friday, June 26, 2015 5:32 PM for clarity & version numbers.
June 26th, 2015 4:35pm

Since last February, I had this problem happen to me once. At that point I made the service change (srv2.sys to autostart), VM NIC change (swap E1000E with VMXNET3), and disabled System Center Endpoint's real-time checking. Those changes held up for a while, until I re-enabled SCEP real-time checking after routine updates and it happened about six weeks later.

After that, I re-disabled SCEP real-time checking. It held up for a while since. During the most recent update cycle I also disabled SMBv2 (and thus v3 as well) from a PowerShell command. I haven't noticed a major performance hit but now an unrelated problem has come up in that some applications won't open more than one file from the same network share on that server.

I'm going to migrate my file share volume from a 2012 VM to a 2012R2 VM after discovering that new, unrelated problem.

That intensively-used application I mentioned is on a 2012R2 file server already, and it hasn't locked up.
--

June 26th, 2015 7:04pm

Only solution for us was to migrate shares to another server and destroy VM that had this issue.
July 1st, 2015 3:33am

This topic is archived. No further replies will be accepted.
