2012R2 Storage Spaces - Enclosure redundancy

Hi,

We are currently testing redundancy with Storage Spaces and have run into a big problem.

Here is a description of our setup (I'll try to be as precise as possible):

Two HP DL360 Gen8 servers with 2x 10GbE Ethernet cards and 2 LSI 4 SAS external ports, each connected to 3 DataON JBOD enclosures via dual SAS paths (2 SAS cables per server going to 2 separate controllers on each enclosure).

The 2 10GbE Ethernet cards are set up on separate networks (10.0.0.0/16 and 192.168.0.0/16).

The 10.0.0.0/16 network is part of the Windows domain and hosts the DNS servers.

The 192.168.0.0/16 network is independent and only accessible by the above servers (no DNS defined, No default gateway).

I installed failover clustering and built a new cluster with those two servers, making sure to untick the option to add available storage in the wizard.

The cluster built successfully, so I proceeded to build the storage pool.

On one of those servers, I created a Storage Pool using all the disks from all 3 DataON enclosures (The disks are composed of 32x SAS HDD and 12x SAS SSD (Dual ports)).

On top of this Storage Pool, I created two virtual disks:

                One small 1GB virtual hard disk for the Quorum (non-tiered, enclosure awareness enabled, mirrored)

                One large 15TB virtual hard disk for the data (Tiered Storage, enclosure awareness, write-back cache and mirrored)

As a reference, here are the powershell commands I used to create the virtual disks and the storage pool:

$pooldisks = Get-PhysicalDisk | ? {$_.CanPool -eq $true }

New-StoragePool -StorageSubSystemFriendlyName *Spaces* -FriendlyName SP1 -PhysicalDisks $pooldisks

$tier_ssd = New-StorageTier -StoragePoolFriendlyName SP1 -FriendlyName SSD_TIER -MediaType SSD

$tier_hdd = New-StorageTier -StoragePoolFriendlyName SP1 -FriendlyName HDD_TIER -MediaType HDD

New-VirtualDisk -StoragePoolFriendlyName 'SP1' -FriendlyName 'VD1' -StorageTiers @($tier_ssd, $tier_hdd) -StorageTierSizes @(2212GB,13108GB) -ResiliencySettingName Mirror -NumberOfColumns 4 -WriteCacheSize 10GB -IsEnclosureAware $true

New-VirtualDisk -StoragePoolFriendlyName 'SP1' -FriendlyName 'Quorum' -Size 1GB -ResiliencySettingName Mirror -IsEnclosureAware $true

 

So far so good. I then added the storage pool to the cluster using the failover cluster manager, then added the two disks created above (after first creating a volume on each).

I then added the bigger disk to Cluster Shared Volumes.

I added the second (smaller) disk to the cluster as the quorum.

In the failover cluster manager, I added the Scale-Out File Server role (using the name 999SAN01P001 as the distributed server name), and created a highly available share on the Cluster Shared Volume (now appearing under c:\clusterStorage\Volume1\Shares\Hyper-V).

I can now access the share via \\999SAN01P001\Hyper-V without any problem and even run Virtual Machines on it.

Here is the problem:

If I eject a couple of disks from one of the enclosures, no problems, everything stays available.

If I however simulate an enclosure failure (by pulling the power), the Cluster Shared Volume becomes inaccessible!

The Cluster Virtual Disk status in the failover cluster manager shows as NO ACCESS.

The virtual disk in Server Manager (under File and Storage Services) shows as Degraded but is still accessible (not offline).

What am I doing wrong here?

With three enclosures, the system should be able to sustain the failure of a complete enclosure (and it does, as my virtual disks in Server Manager show as online but degraded), but my cluster cannot access it anymore (the Cluster Shared Volume has no access).

Thank you,

Stephane

March 6th, 2014 7:57am

Hi Stephane, what model are your LSI cards, and do they support SES?

https://social.technet.microsoft.com/wiki/contents/articles/11382.storage-spaces-frequently-asked-questions-faq.aspx

http://social.technet.microsoft.com/wiki/contents/articles/11382.storage-spaces-frequently-asked-questions-faq.aspx#Enclosure_Awareness_Support_Tolerating_an_Entire_Enclosure_Failing

The LSI cards need to support this protocol to allow for JBOD failover. Also, did you configure a 3-way mirror across your 3 JBODs?

Just wanted to rule this out as a cause of your issue.

regards

Mark



March 6th, 2014 8:40am

Do you have the enclosure management hotfix installed?

http://support.microsoft.com/kb/2913766/en-us

The first thing I would check is whether the enclosures are being picked up correctly. If you have the hotfix installed, you should have the Get-StorageEnclosure command, which should list exactly 3 enclosures (no more, no fewer). You can also use the "Get-StorageEnclosure <name> | Get-PhysicalDisk | ft FriendlyName, PhysicalLocation" PowerShell command to verify that each enclosure is picking up the correct number of disks that should belong to it. If there is any discrepancy here, enclosure awareness will not work as expected.
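As a rough sketch of that check, the following loops over every enclosure and counts the disks each one reports. It assumes the hotfix above is installed so that Get-StorageEnclosure is available; the output column names (Enclosure, DiskCount, SSD, HDD) are illustrative only.

```powershell
# Assumes the enclosure hotfix is installed so Get-StorageEnclosure exists.
Get-StorageEnclosure | ForEach-Object {
    $enclosure = $_

    # Physical disks that Storage Spaces associates with this enclosure.
    $disks = @($enclosure | Get-PhysicalDisk)

    # One summary row per enclosure (column names are illustrative).
    [pscustomobject]@{
        Enclosure = $enclosure.FriendlyName
        DiskCount = $disks.Count
        SSD       = @($disks | Where-Object { $_.MediaType -eq 'SSD' }).Count
        HDD       = @($disks | Where-Object { $_.MediaType -eq 'HDD' }).Count
    }
} | Format-Table -AutoSize
```

Each row should match the number of disks physically installed in that enclosure, and there should be exactly three rows.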

If you don't have the hotfix installed, you can still use the PhysicalLocation property of the physicaldisk to get the same information (use the Enclosure XXXX to group disks into enclosures, each unique XXXX is what storage spaces recognizes as a separate enclosure).

Another option is to try the same experiment, but with an external quorum witness instead of a storage space as a quorum witness.


  • Edited by Apamnapat (Microsoft employee), Thursday, March 06, 2014 3:24 PM. Correcting error: enclosure management vs. enclosure awareness
March 6th, 2014 3:08pm

Hi Mark,

Thank you for your reply.

Yes, the LSI card (9206-16e) does support SES.

In the Storage Pools (Server Manager), I can see the chassis ID for each disk.

Cheers,

Stephane

March 6th, 2014 10:02pm

I don't have this hotfix installed, that could explain it!

I'll get it on there today and will let you know how I go.

Thank you,
Stephane

March 6th, 2014 10:04pm

So, I installed the hotfix and ran the commands.

Everything seems to be fine.

The results of Get-StorageEnclosure show my three enclosures:

And for each enclosure, the disks are detected and look fine. For example, the results for the second enclosure (Enclosure1):



The bizarre thing is that the redundancy at the storage pool and virtual disk level seems to be working (I get a "degraded" status for both when I switch off the enclosure), but the Cluster Shared Volume sitting on top of this fails and my cluster volume (c:\ClusterStorage\Volume1) becomes inaccessible (the Volume1 folder disappears completely).


March 6th, 2014 10:27pm

I would suggest that you verify (via PowerShell) that the virtual disk and storage pool are actually online when you have turned off one of the enclosures. Server Manager might not have refreshed yet.
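A minimal sketch of such a check, run on the node that owns the pool while the enclosure is powered off. SP1 is the pool name from the original post; the selected columns are just a suggestion, and the last command assumes the FailoverClusters module is available on the node.

```powershell
# Pool and virtual disk state as Storage Spaces reports it.
Get-StoragePool -FriendlyName SP1 |
    Format-Table FriendlyName, OperationalStatus, HealthStatus, IsReadOnly -AutoSize

Get-VirtualDisk |
    Format-Table FriendlyName, OperationalStatus, HealthStatus, DetachedReason -AutoSize

# Disks reported as anything other than healthy while the enclosure is off.
Get-PhysicalDisk | Where-Object { $_.HealthStatus -ne 'Healthy' } |
    Format-Table FriendlyName, OperationalStatus, HealthStatus, PhysicalLocation -AutoSize

# The cluster's own view of the CSV, for comparison with the Storage Spaces view.
Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode -AutoSize
```

Comparing the Storage Spaces output with the cluster output shows whether the two layers disagree about the state of the disk.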

Also, did you try the experiment where the cluster quorum is not a virtual disk, but an SMB share?

March 7th, 2014 2:55am

The storage pool and virtual disk show as "degraded" in PowerShell too (but accessible).

I've just tried to move the quorum to an external SMB share, but no joy, same result.

When the enclosure is offline, here is a screenshot of what I see in the Cluster Shared Volume (which is the only thing that seems to be failing).

Note the highlighted section: instead of saying "Data (C:\ClusterStorage\Volume)", it now shows the GUID for the volume, and on the C drive the "Volume1" folder under "c:\clusterstorage" no longer appears.

March 7th, 2014 3:51am

I've also run the cluster validation report, which passed without warnings.

March 7th, 2014 3:54am

Some additional information:

As soon as I power the enclosure back on, even though the virtual disk is still flagged as "degraded", I get access to the Cluster Shared Volume straight away, while the virtual disk is "repairing".

March 7th, 2014 4:27am

Hi Stephane, do you have 3 DataON enclosures or 2?

Two HP DL360 Gen8 servers with 2x10 Gbe Ethernet cards and 2 LSI 4 SAS external ports, each connected to 3 DataON JBOD enclosures via dual SAS paths (2 SAS cables per servers going to 2 separate controllers on each enclosures).

Your original post is confusing, or it might be the way I have just read it :-)

If you have 3 enclosures, you should have 3 or 6 SAS cables per server, one going to each JBOD?

This is the POC from the DataON website. If it's not cabled correctly, Storage Spaces might not see the correct combination of disks, or the mirroring would not be correct.

Each server must connect to one controller on each JBOD, or have two connections to each JBOD for redundancy/resilience.

cheers

Mark





March 7th, 2014 11:26am

Hi Mark,

Sorry about this, I see how my post is confusing. We have 2x 4-port LSI cards per server, connected to 3x JBOD enclosures (DataON) via a total of 6 cables for each server (12 cables in total).

I'll create a diagram of our config on Monday, but it follows the diagram from DataON (except that we only have three enclosures as opposed to four): http://www.dataonstorage.com/images/PDF/Solutions/MX/DataON_MX-3240Q-16e_Windows_Server_2012_Storage_Spaces_Storage.pdf

Thank you,
Stephane
March 8th, 2014 12:28am

If you can reproduce the issue with tracing enabled, that would help us identify the root cause.

You will need tracelog (available in the Windows Driver Kit) http://msdn.microsoft.com/en-us/library/windows/hardware/ff552994(v=vs.85).aspx

.\tracelog.exe -start spetw -guid "#{929c083b-4c64-410a-bfd4-8ca1b6fce362}" -flag 0x7fffffff -level 0xff -f spaces.etl

<< Reproduce the issue >>

.\tracelog.exe -stop spetw

Get-ClusterLog -Destination <folderpath> -TimeSpan 15

You can send me the logs (both spaces.etl and the 15 minute cluster logs from the above command) at nandak at Microsoft dot com.

March 8th, 2014 2:17am

I'll go to the datacenter tonight to get the trace logs and will send them to you.

Thanks again for all your help,

Stephane

March 9th, 2014 9:47pm

While waiting for the logs, I've created a physical diagram of our server/storage connections.



March 10th, 2014 1:39am

You're not very clear about how you added physical disks. Are the disks of each pool evenly spread across the JBODs? Were they added that way originally? Both of the previous are required. Did you implement 2-way mirroring for the virtual disks? This would be required for the above config.
March 10th, 2014 7:49am

Apologies for the confusion.

The disks are evenly spread across all three enclosures (16x SAS and 4x SSD in each enclosure).

The storage pool was created with this original configuration and has not changed since.

A two-way mirror virtual disk was then created in that storage pool using all available SSD and SAS space. PowerShell command used:

New-VirtualDisk -StoragePoolFriendlyName 'SP1' -FriendlyName 'Quorum' -Size 1GB -ResiliencySettingName Mirror -IsEnclosureAware $true

Thank you,

Stephane

March 10th, 2014 7:55am

And the following PowerShell command for the second disk:

New-VirtualDisk -StoragePoolFriendlyName 'SP1' -FriendlyName 'VD1' -StorageTiers @($tier_ssd, $tier_hdd) -StorageTierSizes @(2212GB,13108GB) -ResiliencySettingName Mirror -NumberOfColumns 4 -WriteCacheSize 10GB -IsEnclosureAware $true

March 10th, 2014 8:00am

Stephane - Were you ever able to identify the source of the problem?
September 19th, 2014 11:06pm

Please do update us on the status of this issue.

Were you able to Live Migrate storage between SOFS nodes?

Was there a networking backend problem between the two SOFS nodes?

Do you need further assistance with troubleshooting?

October 27th, 2014 4:38pm

Hi Philip,

Apologies for the delay.

The problem was fixed and was related to the amount of free space within the storage spaces.
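The exact change is not recorded in this thread. Purely as an illustration of how free space in a pool can be inspected, here is a small sketch; SP1 is the pool name used earlier in the thread, and the rounded TB/GB columns are naming choices made here for readability, not part of the original fix.

```powershell
# Pool-level capacity: total, allocated, and what remains unallocated.
Get-StoragePool -FriendlyName SP1 |
    Select-Object FriendlyName,
        @{ Name = 'SizeTB';      Expression = { [math]::Round($_.Size / 1TB, 2) } },
        @{ Name = 'AllocatedTB'; Expression = { [math]::Round($_.AllocatedSize / 1TB, 2) } },
        @{ Name = 'FreeTB';      Expression = { [math]::Round(($_.Size - $_.AllocatedSize) / 1TB, 2) } } |
    Format-List

# Per-disk view, sorted by media type so the SSD and HDD tiers can be compared.
Get-StoragePool -FriendlyName SP1 | Get-PhysicalDisk |
    Sort-Object MediaType, FriendlyName |
    Format-Table FriendlyName, MediaType,
        @{ Name = 'FreeGB'; Expression = { [math]::Round(($_.Size - $_.AllocatedSize) / 1GB, 1) } } -AutoSize
```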

All good now.

Thank you all for your help,
Stephane

November 3rd, 2014 11:14pm

I know I'm a little late on this, but Stephane, can you describe how you fixed your issue?
April 16th, 2015 5:02pm

This topic is archived. No further replies will be accepted.
