Virtual Machine - high disk response time

Hi Everyone,

Got something strange happening in our lab at the moment and was wondering if anyone had experienced the same thing (and maybe has a solution).

Our lab environment in a nutshell:

2x Windows 2012 Hyper-V hosts in a cluster, connected to a "home-made" SAN based on the Windows 2012 iSCSI target.
Each Hyper-V host has two 1Gbps network cards to connect to the SAN via the Microsoft iSCSI initiator, with MPIO in load-balancing mode (least queue depth).

The SAN (a Windows 2012 server with the iSCSI target role) has 4x 1Gbps cards, teamed two by two, so it presents two IP addresses that each host connects to (via MPIO).
The disk subsystem on the Windows 2012 SAN is an external HP StorageWorks enclosure with 25x HP 500GB SATA disks, connected to the server via an Intel RAID controller with 2x 240GB SSDs providing read/write caching.

The iSCSI network is on a dedicated HP switch, with flow control and jumbo frames enabled (tested OK).

Now the problem:

I've built a few virtual machines on the two Hyper-V nodes, and I'm getting very bad disk response times as soon as disk traffic increases.
When the virtual server is doing very little I get a normal 6-8ms, but as soon as I increase the traffic (for example by copying a big file or installing an application), this figure shoots up to 200ms, 300ms and more!

So I first thought it was my disk subsystem (and the SAN server), but while the spikes are happening within the virtual machine, the disks on the SAN server sit at about 10ms, with occasional spikes to about 20ms (which is pretty good, and what I would expect to see within the VM given the SSD cache).

I then thought it could be the network, but during those periods of activity the network doesn't get saturated at all - barely 150Mbps to 200Mbps per link.
I even tried disabling MPIO and running everything across one Ethernet link, but got the same result.
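
For what it's worth, a quick back-of-the-envelope (a rough Python sketch; the IO size is just an assumption for illustration) suggests the wire itself can't explain latencies like these:

    # Rough sanity check: serialization delay of one IO on a 1 Gbps link.
    # Numbers are illustrative assumptions, not measurements from this lab.
    LINK_BPS = 1_000_000_000      # 1 Gbps iSCSI link
    IO_SIZE_BYTES = 64 * 1024     # a fairly large 64 KiB IO

    wire_time_ms = IO_SIZE_BYTES * 8 / LINK_BPS * 1000
    print(f"{wire_time_ms:.2f} ms per 64 KiB IO on the wire")  # ~0.52 ms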

Am I missing something here? Doing something wrong? Or is this expected behaviour?

Thank you,
Stephane

September 10th, 2013 10:05am

[ ... ]

The SAN (a Windows 2012 server with the iSCSI target role) has 4x 1Gbps cards, teamed two by two, so it presents two IP addresses that each host connects to (via MPIO).

[ ... ]

Here are two problems:

1) The Microsoft target has no cache, so you put a heavy load directly on the disk subsystem - replace it with something that has a cache to shock-absorb writes and reads

2) You should never team iSCSI networks - unteam them and use MPIO with round robin
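
To illustrate the shock-absorber idea with a toy model (all figures are made up to show the principle, not this SAN's real numbers):

    # Toy model of a write-back cache absorbing a write burst.
    CACHE_MB = 1024      # cache capacity
    BURST_MB = 600       # burst arriving at network speed
    DRAIN_MBPS = 100     # sustained disk throughput behind the cache

    if BURST_MB <= CACHE_MB:
        # Burst fits in cache: the writer sees cache latency while the
        # disks drain it in the background.
        print(f"burst absorbed; disks drain it in {BURST_MB / DRAIN_MBPS:.0f} s")
    else:
        # Cache full: the writer is throttled down to raw disk speed.
        print("cache full; writer drops to disk speed")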

September 10th, 2013 2:30pm

Hi,

VR38DETT is right. If you want more iSCSI bandwidth or redundancy, you must use MPIO. Multipath I/O (MPIO) is a feature that provides support for using multiple data paths to a storage device. Multipathing increases the availability of storage resources by providing path failover from a server or cluster to a storage subsystem.
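
Conceptually, round robin just rotates IOs across the available paths and skips failed ones. A minimal Python sketch of the idea (path names are hypothetical; the real MPIO DSM is a kernel-mode driver):

    from itertools import cycle

    # Minimal sketch of round-robin path selection with failover.
    paths = ["NIC1 -> portal A", "NIC2 -> portal B"]
    healthy = set(paths)
    rr = cycle(paths)

    def next_path():
        # Rotate through paths, skipping any that have failed.
        for _ in range(len(paths)):
            p = next(rr)
            if p in healthy:
                return p
        raise RuntimeError("all paths down")

    print(next_path())             # NIC1 -> portal A
    healthy.discard(paths[0])      # simulate a link failure
    print(next_path())             # IO fails over to NIC2 -> portal B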

More information:

Multipath I/O Overview

http://technet.microsoft.com/en-us/library/cc725907.aspx

Support for Multipath I/O (MPIO)

http://technet.microsoft.com/en-us/library/cc770294.aspx

Hope this helps.

September 11th, 2013 5:39am

Thank you for your reply.

Regarding the cache, the Intel RAID controller handles the SSD caching (2x 240GB SSDs in RAID 1), so read/write accesses are already cached.

On the iSCSI network, I agree - the teaming was a dumb idea I thought I'd try out. I'll remove it and let you know if that fixes the problem.

Thank you,
Stephane

September 11th, 2013 5:58am


1) These are different things... With caching done at the controller level you still make the hardware work, because the cache sits behind the PCIe bus. With caching done in the target application, you don't put any load on the hardware/software storage stack at all.

2) Yes, please remove the teaming and we'll continue. Note that Windows Server 2012 *does* support teaming for iSCSI, but for virtual networks only. See:

http://blogs.technet.com/b/askpfeplat/archive/2013/03/18/is-nic-teaming-in-windows-server-2012-supported-for-iscsi-or-not-supported-for-iscsi-that-is-the-question.aspx

Good luck!

September 11th, 2013 8:32am

So, last night I reconfigured the whole network setup:
I now have 4x 1Gbps cards in the Windows 2012 iSCSI target server, not teamed, all on the same IP subnet.
Each Hyper-V host has 2x 1Gbps cards (not teamed), and each card initiates two iSCSI connections, each to a unique portal address on the SAN (hope I'm making sense).
I then use MPIO in round-robin mode to load-balance the traffic.

I've also decided to enable CSV caching on each of the Hyper-V hosts (512MB), just for good measure.

Unfortunately, no change.

I can see good transfer rates, but crappy response times.
For example, copying a file from one disk to another within the same VM runs at about 60MB/sec, but during the copy the disk response time shoots up to 1000ms!
The same happens if I start generating any other disk access (with IOMeter, for example) - the response time goes crazy.

The bizarre thing here is that I don't see any disk latency increase on the Windows 2012 iSCSI target server.
If my disk subsystem were creating the latency within the VM, surely I would see the same increase on the SAN itself when I monitor those disks, right?

September 12th, 2013 12:22am

Hi,

CSV caching is mostly intended for read-heavy, write-light scenarios, so please disable the function and try again. Additionally, could you confirm your iSCSI initiator configuration, or post a screenshot of it?

Thanks.

September 18th, 2013 5:30am

Hi,

I would like to check if you need further assistance.

Thanks.

September 30th, 2013 3:19am

I want to keep this thread going. 

I have a very similar setup. Quick specs on mine are

2 HP DL360p G8, 2P 32-core, 256GB RAM, 4-port NC365T 1Gb NIC, 4-port 331FLR 1Gb NIC.

The Windows Server 2012 OS is on a RAID 1 of 300GB 15k SAS drives.

NIC Config:

1 - Domain  /8

2 - Live Migration /30

3 - Cluster Communication /30

4 - Not used

and then (4) ports configured for iSCSI with MPIO least queue depth on a /28 to my EMC VNXe 3300, which has (35) 600GB 15k SAS drives in a RAID 5. The HP ProCurve 5400zl switch is aware of each set of 4 ports on host servers A & B, and also on the VNXe on service processors A & B. I can get bandwidth around 3200Mbps, and I've seen it as high as 180MiB/s on the EMC appliance, so I know it can handle the traffic. It's very common to see about 20Mbps send & receive combined across the MPIO group.
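
For unit sanity (Mbps and MiB/s are easy to mix up - this is just conversion arithmetic on the figures above):

    # Converting the figures above between Mbps and MiB/s.
    mbps = 3200
    print(mbps * 1e6 / 8 / 2**20, "MiB/s peak")      # ~381 MiB/s

    mib_per_s = 180
    print(mib_per_s * 2**20 * 8 / 1e6, "Mbps")       # ~1510 Mbps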

I too see the exact same circumstances as you. However, I am running 165 VMs (Windows 8, 8.1) in a clustered RDS environment, and I will see anywhere from 0-100ms through the day on the clients. We have grown into this, but the numbers haven't changed since we started off with 10.

Just wanted to share my setup and my info, and give some feedback and input any way I can.


February 10th, 2014 2:25pm

Re: "I can see good transfer rate but crappy response time".

There is nothing wrong.  You will find your answer in Queueing Theory -- wherein Little's Law states that:

Response Time = Queue Depth / Throughput.

Re: "copying a file from one disk to another within the same VM, will run at about 60MB/sec, but during the copy, the disk response time will shoot up to 1000ms!" 

File copy operations are optimized for throughput, so why are you looking at response time? Your workload is generating a deep queue of IO requests - this is a GOOD thing, because it allows your intelligent RAID controller to optimize the workload. Windows is building a deep queue of IO requests (which means long average response times), but that is exactly what is giving you good throughput.
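
To put numbers on it (illustrative values picked to line up with the ~60MB/sec copy reported above):

    # Little's Law: response time = queue depth / throughput.
    io_size = 1024 * 1024            # assume 1 MiB IOs from the copy engine
    throughput = 60 * 1024 * 1024    # observed ~60 MB/sec
    queue_depth = 60                 # assume ~60 IOs outstanding

    iops = throughput / io_size              # 60 IOs per second
    response_ms = queue_depth / iops * 1000  # 60 / 60 = 1 second
    print(f"{response_ms:.0f} ms average response time")   # ~1000 ms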

March 24th, 2014 6:24pm

I'm using IOMeter on a 2012 R2 VM running under Xen 6.2SP1 with a Dell EqualLogic 6210X.

32KiB IO size, two workers, 100% sequential write on a bare drive presented through Xen, no iSCSI from inside the VM.

Windows Task Manager average response time is in the 800-900ms range.

IOMeter shows the average response time below 1ms.

XenCenter shows the average response time below 1ms.

Dell SANHQ shows under 1ms for the response time.


Almost seems like the decimal point needs to be moved three places to the left in Windows.
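
Purely as speculation, the arithmetic of a microseconds-vs-milliseconds mix-up would line up exactly:

    # If the underlying samples were microseconds but displayed with an
    # "ms" label, the digits would match what's described above.
    sample_us = 850    # what IOMeter/XenCenter/SANHQ show: just under 1 ms
    print(sample_us / 1000, "ms actual vs", sample_us, "ms displayed")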




June 24th, 2015 9:48pm
