Windows 2003 server drops network (Network Steve Forum)


Hi all, I recently installed a new server at one of my facilities to replace an old one. It's a Dell PowerEdge R610, running W2K3 Server R2, SP2, all current updates applied. Users are frequently getting disconnected from network drives that are mapped to the server, only to reconnect again a few seconds later. The server seems to drop at random intervals, for anywhere from 1 - 10 seconds.

I am not on-site, so it's been hard for me to troubleshoot the issue, but one thing that's relatively easy for me to monitor is a remote control session - either via RDP or using VNC. I continue to get disconnected at random intervals, which I assume coincide with the users losing network drives. I am running a continuous ping to the server as well and at the same time I lose my remote connection, the ping time-to-live (TTL) value drops from 124 to 60. However, the ping is not failing. This happens for anywhere from 2 - 20 ping packets, then goes back to 124, which is the expected value. As soon as it's back to 124, I can reconnect my remote session.

Here is a sample output from a ping (Note: there were actually 17 packets with the TTL value of 60, but I truncated it below):

    Reply from 192.168.16.10: bytes=32 time=36ms TTL=124
    Reply from 192.168.16.10: bytes=32 time=35ms TTL=124
    Reply from 192.168.16.10: bytes=32 time=36ms TTL=60
    Reply from 192.168.16.10: bytes=32 time=36ms TTL=60
    Reply from 192.168.16.10: bytes=32 time=35ms TTL=124
    Reply from 192.168.16.10: bytes=32 time=51ms TTL=124

Also, all of the return times for the ping during the "TTL=60" period are between 36 and 39 ms, so time-outs are not a problem. I have verified that this is not a WAN issue, as pinging the server from the same network also drops off from 128 to 64. So, again the TTL is dropping by exactly 64.

We are running Cisco 3560 switches and I have rebooted both switches. I have swapped out Ethernet cables and changed ports on the switch with no change. The only other thing I am planning to try is to move to a different adapter on the server (we are currently only using 1 of the 4 adapters on the server), which I am planning to do tonight after business hours.

Thanks in advance for any help.
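For anyone who wants to watch for the same symptom without staring at the console, here is a minimal Python sketch that scans saved ping output and reports every point where the TTL changes from one reply to the next. It assumes Windows-style reply lines like the ones quoted in the post; the function name `find_ttl_changes` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import re

# Matches the TTL field in a Windows-style ping reply line (assumed format).
TTL_PATTERN = re.compile(r"TTL=(\d+)", re.IGNORECASE)


def find_ttl_changes(lines):
    """Return (line_index, previous_ttl, new_ttl) for each reply whose
    TTL differs from the TTL of the reply before it."""
    changes = []
    previous = None
    for index, line in enumerate(lines):
        match = TTL_PATTERN.search(line)
        if match is None:
            continue  # not a reply line (timeout, blank line, header)
        ttl = int(match.group(1))
        if previous is not None and ttl != previous:
            changes.append((index, previous, ttl))
        previous = ttl
    return changes


# Sample taken from the ping output quoted in the post.
sample = [
    "Reply from 192.168.16.10: bytes=32 time=36ms TTL=124",
    "Reply from 192.168.16.10: bytes=32 time=35ms TTL=124",
    "Reply from 192.168.16.10: bytes=32 time=36ms TTL=60",
    "Reply from 192.168.16.10: bytes=32 time=36ms TTL=60",
    "Reply from 192.168.16.10: bytes=32 time=35ms TTL=124",
    "Reply from 192.168.16.10: bytes=32 time=51ms TTL=124",
]

for index, old, new in find_ttl_changes(sample):
    print(f"line {index}: TTL changed from {old} to {new}")
```

Redirecting a continuous ping to a text file and feeding its lines to this function would give a list of change points that could be lined up against the times users report losing their drives.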

October 14th, 2009 10:23pm

Is the Scalable Networking Pack (SNP) enabled? Some NICs do not work well with SNP, so disabling it may be a good first option. SNP is enabled by default in Windows Server 2003 when Service Pack 2 is applied. Broadcom NICs, or NICs that have a Broadcom chipset, are known not to work properly when SNP is enabled. Updated drivers can be found here: http://www.broadcom.com/support/ethernet_nic/netxtremeii.php


October 15th, 2009 4:29am

Do you have the latest drivers and firmware from your NIC vendor? Did you disable power management on the NIC's configuration tabs? Also, what is going on with your network devices? Can you check that the switch your server is attached to has no problems? For example, on Cisco switches it is good practice to enable STP PortFast; then recheck the speed and duplex settings on both sides. What do you have in the logs? The System log would be useful, since it will show whether your NIC loses communication. One more hint: if you plan to run Hyper-V, please don't use NIC teaming on the Hyper-V host.

October 17th, 2009 6:14pm

Hi Jackson54, have you managed to resolve this?


October 19th, 2009 9:04pm

Jackson54, it looks like you have done a good deal of troubleshooting on this, and have given great info on the problem.

I think it is important here to understand the use of the TTL in the ping application. What is displayed for any reply is what is left of the TTL set by the remote server. The sending server sets the TTL on the packet (in 2003 the default is 128), and every network hop along the way takes one away from this value. Typically only layer 3 (and above) devices take away from this TTL, but some modern devices span layers (like multilayer switches). The goal of the TTL is to prevent network packets from endlessly traversing your network, usually in a loop.

The comparison you did from local subnet to WAN is great. It lets us know that you only have 4 hops across your WAN, and that your problem is local to your subnet, where no device should ever take away from your TTL. This limits us to looking at the switch and the server itself. I would suggest, if this was not already the case, testing this connection to the server from a machine plugged into the same switch (since VLANs can span multiple switches). If you see the problem there, then we can be comfortable assuming that this is caused by the switch or the server.

From experience I can tell you TCP/IP on a server either works or does not work, so if anything it is a configuration issue. Switches are in the same boat; the occasion has been rare that I have seen two switches have the same issue, short of a bug in the code or hardware. So, in both cases I think it is likely that we are looking at a configuration issue.

- On the switch:
  - Ensure that spanning-tree portfast is enabled.
  - Ensure that speed/duplex match that of the server's NIC (usually set in Device Manager on the properties of the physical NIC).
- On the server:
  - Dell is now typically using HP NICs, which often use the HP teaming software. I cannot tell you how many problems I have seen with this. You may only be using one NIC, but it still can have this function enabled. The only sure way to get rid of it is on the properties of the NIC. Look for and uninstall the "HP Network Teaming Driver". See http://cbfive.com/blog/post/Considering-Network-Teaming.aspx for more information on network teaming.
  - Use "netstat -r" to view the local routing table. Ensure that the default route is correct and that there are no routes being injected into the table (if you are unsure, post a screenshot).

Finally, a set of network traces between the server and a local workstation would speak volumes. Close all connections from that client to the server, start the traces, wait for a failure and then stop the captures. For detailed instructions see: http://cbfive.com/blog/post/Taking-a-long-network-capture.aspx

If you need help with the network capture assessment, and you probably do not want to post the file here, I would be happy to take a look. Here is an email you can send it to: InitialAssist@cbfive.com

Don't forget to give credit where credit is due; vote this as helpful if it helped you.
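The TTL arithmetic described above can be worked through in a few lines of Python. This is only a sketch: it assumes the sender started from one of the commonly seen initial TTL values (64, 128 or 255) and picks the smallest one that is not below the observed value. The function name `guess_initial_ttl_and_hops` and that list of defaults are illustrative assumptions, not something taken from the original poster's environment.

```python
# Commonly seen initial TTL values (an assumption for this sketch;
# 128 is the Windows Server 2003 default mentioned in the post).
COMMON_INITIAL_TTLS = (64, 128, 255)


def guess_initial_ttl_and_hops(observed_ttl):
    """Return (initial_ttl, hops), assuming the sender used the smallest
    common initial TTL that is greater than or equal to the observed one."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    for initial in COMMON_INITIAL_TTLS:
        if observed_ttl <= initial:
            return initial, initial - observed_ttl


# The four TTL values reported in this thread.
for ttl in (128, 124, 64, 60):
    initial, hops = guess_initial_ttl_and_hops(ttl)
    print(f"observed {ttl}: initial {initial}, {hops} hop(s) away")
```

Under that assumption, 124 and 60 both work out to 4 hops and 128 and 64 both work out to 0 hops, so the hop count looks unchanged while the apparent starting TTL differs. That is only arithmetic, not a diagnosis; a network trace would be needed to see what is actually sending the replies.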

October 21st, 2009 11:37pm

Hi, I have to agree with Jared; it looks like he's on the right track. Possibly a packet storm. Check the simple, almost ridiculous stuff too, like an Ethernet cable looping around and back to the same switch, two uplink cables between the two switches, a user who has plugged both ends of the same Ethernet cable into two wall jacks (I have seen it done), two routes to the same destination, etc.

Miguel Fra / Falcon ITS
www.falconits.com


October 22nd, 2009 5:49am

This topic is archived. No further replies will be accepted.