Follow DeepStorage - Howard Marks on Feedspot



I’ve been hearing people declare tape to be dead for so long that it’s become cliché. While tape has seen its day as a direct backup medium, it remains the least expensive way to store large amounts of data for long periods of time, and modern tape systems are a lot more reliable than the memories of 8mm, DDS and DLT breakdowns would indicate.

Today it seems that the two companies that still make magnetic tape have decided to use their patent portfolios to keep each other from selling the latest LTO tape cartridges (LTO-7 and LTO-8) in one of the stupidest business moves I’ve seen in decades. Sony and Fujifilm have each claimed that the other’s LTO tapes violate one or more of their patents.

Well, that’s fair enough; people should get rewarded for their intellectual property. But the actions both companies have taken to protect their IP have resulted in findings that each vendor’s products infringe on the other’s patents, removing both vendors’ products from the market.
Industry standards organizations like the LTO alliance typically include language in their standards and membership documents requiring members holding patents essential to implementing the standard to license them to other members on a fair, reasonable and non-discriminatory (FRAND) basis. That means that Sony must license its patents to Fujifilm and Fujifilm must license its patents to Sony.

If the LTO agreement includes FRAND language, these two companies aren’t fighting to be the last tape vendor standing; they’re fighting over whose patents are most key to a $100 tape cartridge to decide whether Sony pays Fuji $3/tape or Fuji pays Sony $2.50/tape.

By not agreeing to let some third-party arbitrator decide this matter of a few dollars per tape while both vendors remain free to sell tapes, these two companies have put the whole tape market at risk. Tape libraries only make sense if I as a customer can rely on a steady supply of tape cartridges when I need them.

Fujifilm and Sony have just ended that steady supply of tape cartridges. If I were making a final decision between a new tape library and an object store or cloud storage, I’d have to start figuring tape supply risk into my calculations, and that can’t be good for Sony or Fujifilm in the long run.

I would call the whole situation a circular firing squad, but that would take more than two participants. In markets like tape, or Fibre Channel, that are down to a small handful of key vendors, those vendors can’t afford to be rough with each other over key patents without risking the whole enchilada.


Stretched Clusters – Data Center Level Resiliency


As we saw in our last installment, local RAID is a great solution for remote sites looking for high resilience with a low node count. Where local RAID improves resiliency by making each node more resilient, stretched clusters replicate data across multiple data centers, providing protection against data center failures.

HCI stretched clusters fill the same role as storage-array-based metro clusters. They not only replicate each write synchronously between storage in two locations but also make the storage available to workloads, which for HCI specifically means VMs and containers.

Storage arrays do this by synchronously replicating the data between two arrays and, as importantly, by presenting each LUN from both arrays with the same identity in both data centers. Should one or the other array fail, OS or hypervisor multipathing will automatically redirect I/O to the surviving array.

If you stretch the subnets your VMs are connected to across the two data centers, you can vMotion from one to the other with minimal impact. Stretched clusters are perfect for spanning two data centers on a campus or across town, allowing load balancing and maintenance fail-overs between the two without creating any downtime.

Latency limits distance

Synchronous replication ensures that any data written to storage in data center A is also written to storage in data center B before the write is acknowledged to the application, also known as RPO (Recovery Point Objective) zero. That means the write latency your applications see will depend not only on how fast the SSDs in your nodes are but also on the network latency between the two data centers.

Since the speed of light is a constant and writes to a local NVMe SSD can take as little as 150µs, any substantial distance will have a measurable impact on application performance. Most HCI stretched clusters will run over distances as long as 100 km or, more significantly, up to 10ms of round-trip latency. Personally, I would want to keep the two sites within 2 or 3ms of latency, making NY-Jersey City practical but Chicago-Milwaukee a bit much.
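
The propagation penalty is easy to estimate: light travels through fiber at roughly 200 km per millisecond, and every synchronous write needs at least one round trip. A quick sketch with illustrative numbers (not a vendor spec):

```python
# Back-of-the-envelope propagation delay for synchronous replication.
# Light covers roughly 200 km per millisecond in fiber; real fiber paths
# are rarely straight lines, so treat these as best-case figures.
SPEED_IN_FIBER_KM_PER_MS = 200

def replication_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay between two sites, in milliseconds."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

for km in (10, 50, 100, 300):
    print(f"{km:>4} km apart adds {replication_rtt_ms(km):.1f} ms to every write")
```

Even the best case of 100 km adds a full millisecond, many times the 150µs a local NVMe write can take, before any switch or protocol overhead.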

Stretched Clusters and HCI

If you had the 24-strand fiber loop around campus I recommended, you could simply put half your HCI nodes in the data center in building 47 and the other half in building 4. VMware HA, or is it FT, can restart VMs when one data center goes offline, and if you configure your HCI system’s rack-awareness it’s always writing one replica of your VMs’ data to hosts in each data center so your data’s available.

Back in the dark ages, we would have been very self-satisfied to build that kind of campus cluster, but it would have been a fragile thing. Four nodes in each data center would mean any data center failure would divide the cluster evenly, so we’d need a third location for a witness or quorum server. Even with an external witness, we’d have to expand the cluster one node in each data center at a time, or the smaller data center plus the witness would still not total a ½+1 quorum.

Even worse, for me as a steely-eyed storage guy, since our HCI system only supports two or three-way replication at least half the time a data center failure will drop the cluster to a single remaining copy.

HCI vendors have since come to our rescue by directly adding support for stretched clusters to their solutions, using a witness that can be a VM, an application or a cloud service. The cloud service is especially attractive as it doesn’t require the customer to maintain the witness in a third location.

That witness and the associated failover decision process is also site aware. Since the system “knows” there are supposed to be six hosts in site A and four hosts in site B it can force a failover without a true majority quorum.

Even better, my HCI stretched cluster will store multiple replicas, or erasure code stripes, in each site so the system remains resilient. A basic stretched cluster would replicate two copies in each site (2×2) while more sophisticated systems will let you use any combination of replication and erasure codes at the two sites.


Speaking of HCIchannels, do you like this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, went live Tuesday, January 22, and the final installment February 5. Sign up here






Replication and parity-based erasure coding may allow your HCI system to survive one or two node failures in a cluster. To get even greater resiliency, HCI vendors can layer another protection method under the HCI system’s basic node-to-node replication, with local RAID in each node, or over it, by replicating data to another cluster or failure domain, much like combination RAID levels.

Higher levels of resiliency become even more important when systems are in the remote sites where HCI is such a good fit. In your corporate data center, in a major metropolitan area, an HCI node that goes offline will get noticed in a few seconds to a few minutes. Once someone acknowledges the alert they can probably add another node to the degraded cluster in a few minutes, or call your server vendor who’s contractually required to send a tech in four hours.

When a server dies in the server closet of the SprawlMart in Truth or Consequences, NM at 4 PM on a Friday, it’s going to be down for the big Memorial Day barbecue grill sale even if SprawlMart’s server monitoring system alerts someone by 4:05. If it doesn’t, the store manager is going to be too busy to check that both of his servers are actually working until at least Tuesday.

Every hour a system spends running from a single copy of data is an hour during which even the smallest additional failure, like the failure of a drive to read a particular block, could trigger a total failure and, even worse, data loss. Larger HCI clusters address this problem by having enough nodes and storage capacity to rebuild their data, but a SprawlMart only runs half a dozen VMs; putting four hosts, or three hosts with an external witness, in every store so they can rebuild would get expensive.

Local RAID

If you want to increase the resiliency of any kind of cluster, you can either increase the number of replicas, which will require more nodes, or you can increase the resiliency of individual nodes, which local RAID does by implementing traditional RAID on each node.

Local RAID has been around since the days of the VSA. Since the early VSAs from VMware and StorMagic only supported two-node clusters, local RAID was the only way to have any resiliency when a node went offline. Many VSAs relied on the hypervisor to manage the RAID controller, simply consuming .VMDKs like any other VM. This allowed users to leverage RAID controller features like DRAM and SSD caches.

2-Node HCI with Local RAID

As HCI vendors added 3-way replication and double-parity erasure coding, local RAID fell out of favor with many vendors. SimpliVity continues to use local RAID and 2-way replication as its primary data protection scheme, and it continues to be the most cost-effective way to get higher resiliency from small clusters.

A 2-node cluster of small SimpliVity nodes would layer 2-way replication over 5+1 RAID5 for effective resiliency of N+3. Their large nodes use 10+2 RAID6, which is effectively N+5 protection. Both are 42% efficient at any cluster size if rebuild space isn’t considered. Since the system will remain resilient after a node failure, I would consider that acceptable for environments where failed nodes can be replaced in a small number of days.

While local RAID may sound like an old-fashioned solution, SimpliVity’s 42% efficiency is better than the 25-30% efficiency of 3-way replication while providing, by most measures, a higher level of resiliency.
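
The 42% figure is easy to reproduce: divide the local RAID group’s efficiency by the replica count. A quick sketch, ignoring rebuild space as above:

```python
# Storage efficiency of per-node RAID layered under N-way replication,
# compared with plain 3-way replication. Rebuild/spare space is ignored,
# matching the 42% figure quoted above.

def raid_plus_replication(data_drives: int, parity_drives: int, replicas: int = 2) -> float:
    """Usable fraction of raw capacity for local RAID under N-way replication."""
    raid_eff = data_drives / (data_drives + parity_drives)
    return raid_eff / replicas

small_node = raid_plus_replication(5, 1)    # 5+1 RAID5, two copies
large_node = raid_plus_replication(10, 2)   # 10+2 RAID6, two copies
three_way = 1 / 3                           # plain 3-way replication

print(f"5+1 RAID5 x 2 copies : {small_node:.0%}")   # 42%
print(f"10+2 RAID6 x 2 copies: {large_node:.0%}")   # 42%
print(f"3-way replication    : {three_way:.0%}")    # 33% before spare capacity
```

Both geometries land on the same 5/12 ratio, which is why the efficiency holds at any cluster size.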

Do you like this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, will go live Tuesday, January 22, and the final installment follows February 5. Sign up here


In this episode of As the Cluster Turns, we’re going to look at how HCI cluster size impacts the storage component of an HCI solution. While my Google Sheet from part one compared clusters with the same total SSD capacity, that analysis ignored data protection and the impact of cluster size on data protection efficiency.

Reviewing HCI protection

HCI systems protect data by either synchronously replicating data across two or three nodes or by striping data with single or double parity across devices on as many as a dozen nodes. While it may seem obvious that 2-way and 3-way replication require a minimum of two and three nodes respectively, I and my paranoid storage administrator brethren would sleep much better at night if the system could rebuild data after a node failure and maintain the N+1 or N+2 level of protection I want.

The ability to rebuild means not only that the system needs access to a node’s worth of free space across the cluster but also that it must be able to resolve any split-brain issues by communicating with a quorum or witness device. Some solutions manage with three nodes for 2-way mirroring or four nodes for 3-way, possibly with an external witness. Others require four nodes for a rebuildable 2-way replication cluster and five for minimum N+2 configurations.

Even after the cluster size exceeds the minimum, users must reserve one node’s capacity so the system always has enough space to rebuild. Unlike disk arrays, which have dedicated hot spares, most HCI solutions leave this to the system administrator. This rebuild capacity can be included in a vendor’s headspace recommendation (see below).

As a result, a cluster’s usable capacity is the capacity of n-1 nodes, where n is the number of nodes in the cluster. In small clusters, this can be significant: a four-node cluster loses 25% of its capacity to this spare space while a 16-node cluster only sacrifices 6¼%.
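
That spare-space overhead is simply one node’s share of the cluster:

```python
# Fraction of cluster capacity reserved as rebuild space when one node's
# worth of capacity is held back: overhead = 1/n.
def spare_overhead(nodes: int) -> float:
    return 1 / nodes

for n in (4, 8, 16, 32):
    print(f"{n:>2}-node cluster reserves {spare_overhead(n):.2%} for rebuilds")
```

The four-node case gives up a quarter of its capacity; by 16 nodes the tax is down to the 6¼% noted above.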

Calculating Storage Efficiency

This week’s Google Sheet calculates the storage efficiency of common HCI protection methods. Greyed-out cells represent cases where the number of nodes is less than the minimum for that protection method. Pink cells represent configurations that may not be rebuildable.

The sheet assumes a capacity of 4TB per node and populates the TB capacity and N-1 columns with the raw capacity of the cluster. It then calculates the usable capacity of a rebuildable cluster using 2 and 3-way replication, single-parity erasure coding with three and five data strips, and double parity with four and six data strips. Finally, it divides the usable capacity by the cluster’s raw capacity to get the efficiency of each method at each cluster size.
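
The core of that arithmetic looks roughly like the sketch below. This is my simplification, not the actual sheet, so its numbers may differ slightly from the sheet’s where the sheet accounts for extra overheads:

```python
# Simplified usable-capacity arithmetic for a rebuildable cluster:
# one node's capacity is reserved for rebuilds, then the protection
# scheme's overhead is applied to what remains.
NODE_TB = 4  # per-node capacity assumed by the sheet

def replicated_tb(nodes: int, copies: int) -> float:
    """Usable TB for 2-way or 3-way replication."""
    return (nodes - 1) * NODE_TB / copies

def erasure_coded_tb(nodes: int, data: int, parity: int) -> float:
    """Usable TB for a data+parity erasure code, e.g. 5+1 or 4+2."""
    if nodes - 1 < data + parity:  # can't place a full stripe and still rebuild
        return 0.0
    return (nodes - 1) * NODE_TB * data / (data + parity)

for n in (8, 16, 32):
    print(f"{n:>2} nodes: 2-way {replicated_tb(n, 2):.0f} TB, "
          f"5+1 EC {erasure_coded_tb(n, 5, 1):.1f} TB usable")
```

Dividing either result by the raw capacity (n × 4TB) gives the efficiency plotted in the chart.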

There are some columns off to the right for double protection but they’re the subject of a later blog post. What, you thought this was a trilogy?

Storage Efficiency vs Cluster Size for Common Protection Schemes

As you can see from the chart above, each protection method asymptotically approaches its theoretical efficiency as cluster size increases. The initial blip on each line is the cluster size that can accommodate the protection method but lacks enough resources to rebuild after a node failure.

The 5+1 line for example peaks at 80% efficiency for a six-node cluster and then re-approaches that efficiency level hitting 78% at a cluster size of 32.

Implementation Specific Factors

Of course, all we’ve considered so far are the impacts of an HCI system’s basic data protection method. Even before we consider data reduction, and go down the wormhole of how well vendor A’s system deduplicates vs vendor B’s, how the system uses storage should get factored in along with the underlying capacity. These factors include:

  • File/object system metadata

The distributed file system, object store, or whatever else a vendor wants to call it that runs the HCI solution has metadata to manage VMs, snapshots and the like. Typically this will be less than 5% of total storage.

  • Deduplication metadata

Any deduplication system has to store additional metadata, including a hash/block use-count table, typically 2-3% of capacity.

  • Deduplication realm scope

Data deduplication eliminates multiple copies of the data stored in whatever set of data makes up a deduplication realm. Breaking our servers, and therefore our VMs, into multiple clusters will reduce deduplication efficiency by creating multiple deduplication realms. If you have Windows servers on 20 clusters you’ll have 20 copies of Windows, one stored on each cluster.

Some HCI solutions create deduplication realms smaller than a cluster at a disk group or datastore/volume level while others offer global deduplication across a federation of clusters.

  • Performance tier cache

On many HCI systems, some or all of the performance tier or SSD will be used as a cache of one sort or another. Since data in a cache is a copy of data stored in a separate endurance tier, SSDs used as cache shouldn’t be counted towards system capacity.

  • Vendor headspace recommendations

To be perfectly honest most storage systems start to lose performance as they fill up. Hot spots emerge, the system has to do more garbage collection to create free space for the new data, and so on, and so on. Distributed systems have the additional complication of having to rebalance when individual components like nodes or drives fill up.

VMware’s vSAN rebalances data whenever components exceed 80% full. To prevent the I/O overhead created by rebalancing, VMware recommends the vSAN system not be filled beyond 70% of capacity.

  • Mixed Protection Levels

While my little spreadsheet will calculate the efficiency of some common data protection schemes, it assumes that an entire cluster uses one, and only one, data protection scheme. Most HCI systems actually use some combination of data protection schemes, both because the administrator has selected different policies for different VMs and because the HCI system may replicate data across its performance tier and only erasure-code data that hasn’t been accessed in 3, or 30, days.
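
When protection schemes are mixed, the overall efficiency is the capacity-weighted combination of each tier’s efficiency. The 20/80 split and 4+2 geometry below are my illustrative assumptions, not any vendor’s defaults:

```python
# Illustrative blend of two protection schemes in one cluster. The 20/80
# split and the 4+2 erasure code geometry are assumptions for the example.

def blended_efficiency(hot_fraction: float, hot_eff: float, cold_eff: float) -> float:
    """Overall usable/raw ratio when hot data uses one scheme and cold another.
    Raw capacity consumed per unit of usable data is fraction / efficiency."""
    raw_per_usable = hot_fraction / hot_eff + (1 - hot_fraction) / cold_eff
    return 1 / raw_per_usable

eff = blended_efficiency(hot_fraction=0.2, hot_eff=1 / 2, cold_eff=4 / 6)
print(f"20% 2-way replicated + 80% 4+2 erasure coded: {eff:.1%} efficient")  # 62.5%
```

The blend lands between the two tiers’ efficiencies, weighted toward wherever most of the data lives.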

The Economies of Scale

In part one we saw that clusters made of the biggest servers we could find would cost less per compute unit. Now we see that storage efficiency gets better with bigger clusters.

That means the HCI world is one where having enough scale to manage clusters of ten to sixteen 56-64-core servers will give you a cost advantage.

Enjoying this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, will go live Tuesday, January 22. Sign up here



In our last installment, I determined that both hardware pricing and software licensing made vSphere environments with bigger servers less expensive, for a given amount of compute power, than those with larger numbers of smaller servers. While I’m glad I did that math and learned that lesson, cost per core isn’t really the determining factor in laying out a data center.

As a practical matter, it’s not the number of cores per server but the number of servers in each cluster that’s the more significant factor. Sure, more cores per server reduce the cost of each core, but cluster size influences failure domain scope, application interaction as noisy, or nosy, neighbors, and storage efficiency, each of which may outweigh mere cost/core.

HCI Clusters define compute and storage boundaries

While I didn’t make it a Precept of HCI in my architecture post, the leading HCI solutions create clusters that define congruent storage and hypervisor/compute management units. While systems may support storage-only or compute-only nodes, a single node in an HCI system participates in one cluster of storage and compute resources.

Since a single cluster represents a management and fault domain for both storage and compute, we have to balance compute and storage factors to determine the appropriate size of a domain.

Clusters and Failure Domains

Larger clusters are more efficient, but efficiency isn’t the only factor in determining the right size for your clusters. A cluster not only serves as a boundary for DRS and distributed vSwitches but also as a failure domain. When the number of nodes offline exceeds the number of failures you’ve designed that cluster/VM to survive, some to all of the VMs in that cluster will go down and may suffer data loss.

While simple replication schemes may let some unpredictable set of VMs survive the failure of 3-4 nodes, more sophisticated systems that stripe data at finer granularity via replication or erasure coding will lose access to some of the data from all the VMs. As a result, we must plan for all VMs to go offline if the number of device failures in a cluster exceeds its design resiliency.

Some HCI solutions are rack-aware, replicating data to nodes in different racks. A rack, or failure domain, aware system can survive a failure of ½ or ⅓ of its nodes if they’re all in a single, defined failure domain like a rack. This eliminates circuit breaker trips and PDU failures as causes of cluster failures, but as Mr. Murphy taught us, everything fails.

Using smaller clusters lets users control the number of VMs affected by cluster failures. If we replace 1 cluster serving 1,000 VMs with 10 clusters serving 50-200 VMs each, when a cluster fails 1/10th as many irate users will call the helpdesk to complain. More importantly, 1/10th as many users will be irate, because 90% of the users will be unaffected.

Even better once you have multiple clusters, you can leverage the additional resiliency multiple clusters provide. Put load balanced web servers, Exchange DAGs (Database Availability Groups) and other VMs that provide duplicate services on different clusters.

Replicating data from one cluster to another with DAGs or other application-level replication lets administrators save a little bit by using N+1, rather than N+2, protection in each cluster. Replicating from one N+1 cluster to another provides N+3 protection making N+2 on each cluster a bit redundant if the short outage caused by failing over from one cluster to the other is acceptable.

Impacts Short of Failure

Even if the number of cluster members offline doesn’t exceed the design threshold, any time a node is offline, for maintenance or failure, the other nodes in the cluster must take up the slack, running relocated VMs and frequently rebuilding data.

The math in my last blog post may favor big servers, but a cluster of 4 servers with dual 28-core Xeons will lose 1/4th of its compute power every time I need to update NIC firmware. If I use six 20-core/CPU servers, I’ll get 240 cores, compared to 224 from the big servers, but when a host goes offline I’ll still have 200 cores while the 28-core/CPU cluster will be down to 168.
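
The arithmetic is simple enough to check:

```python
# Cores still available while one host is offline, for the two layouts above.
def cores_during_maintenance(hosts: int, cores_per_host: int) -> tuple[int, int]:
    total = hosts * cores_per_host
    return total, total - cores_per_host

big = cores_during_maintenance(4, 2 * 28)    # four dual 28-core hosts
small = cores_during_maintenance(6, 2 * 20)  # six dual 20-core hosts
print(f"4 x 56 cores: {big[0]} total, {big[1]} with one host down")      # 224 / 168
print(f"6 x 40 cores: {small[0]} total, {small[1]} with one host down")  # 240 / 200
```

The smaller-host cluster costs a few more cores up front but keeps 83% of its capacity during maintenance versus 75% for the big-host cluster.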

Next Time Storage Efficiency

Looking at HCI as a compute resource, bigger servers will give us more bang for the buck, but storage efficiency features like erasure coding are only available on clusters with more than a typical minimum of 6-12 nodes. Next time we’ll look at how cluster size impacts storage efficiency.

So tune in for our next exciting adventure in clustertech. Same HCItime, Same HCIchannel.

Speaking of HCIchannels, do you like this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session was December 11 and is, of course, available on demand to subscribers. Part 2, where I compare and contrast some leading HCI solutions, will go live Tuesday, January 22. Sign up here


As I was writing an upcoming blog post on sizing HCI clusters, I started thinking about optimizing not just the number of nodes in the cluster but also the size of each server in the cluster. Is the premium Intel charges for Xeon Platinum processors enough to offset the savings on vSphere and other software licensed by the processor socket? I decided to run the numbers and see.

The process was almost enough to make me wish for the old days when a server was basically a server. When I installed a DL380 G3, the difference between the minimum 2.4GHz and maxed-out 3.2GHz processor was minimal. Today the same server model can have anywhere from six to 28 cores per socket.

The first step was to price out a server. Since I planned to use these servers as HCI nodes, and I’m comfortable playing what-if on the Dell website, my victim was the 2U workhorse Dell PowerEdge R740. I configured a server the way I’d like to buy it, including 256GB of DRAM, the RAID1 M.2 SSD boot option and such which totaled $11,500.

I then built a spreadsheet using processors with 8-28 2.1GHz cores and the premium Dell charges for each over the minimum Xeon 3104. I picked processors with the same clock rate to eliminate one of the 5,000 possible variables.


[Table: Price Premium by processor option; data not preserved]

In terms of raw cost per compute power it looks like there’s a sweet spot around 16 cores with the cost per core rising rapidly for the high-end Xeon Platinum 8000 processors.

Once we start looking at the cost of a vSphere cluster, the bigger processors start looking more attractive. I decided, kind of arbitrarily, to see how much it would cost to build a cluster of hosts to provide 1200 cores of computing power and roughly 16GB DRAM per core. Again I pulled the 16GB/core number out of thin air, and some may argue that much DRAM is excessive but reducing the hardware cost just makes bigger hosts even more attractive.

[Tables: Hosts/1200 cores, Node Compute Hardware (Dell DRAM), Cluster DRAM, and Cluster hardware Total; data not preserved]

I then added in the cost of a vSphere Enterprise Plus license with 3 years of production support ($8,308 discounted price from the Dell configurator) for each node and discovered that those 16-core servers that looked like a sweet spot would still cost almost 1.5 times as much as roughly equivalent horsepower from a smaller number of bigger servers.

[Tables: Cluster vSphere, Cluster vSphere + Hardware, Price vs 28 core, and HW vs 28; data not preserved]


Since SSDs make up so much of an HCI cluster’s cost, I thought going to an HCI solution might shift the balance a bit toward the smaller servers. I added 1.6TB Dell NVMe mixed-use SSDs for the performance/cache layer and 3.84TB read-oriented SATA SSDs for the capacity layer to each server configuration, with 1 NVMe SSD and 2-4 SATA SSDs per disk group, bringing the total capacity layer to around 550TB. The low-end node has 1 NVMe SSD and 2 SATA SSDs while the high-end node has 2 NVMe SSDs and 7 SATA SSDs.

Once we add in the cost of vSAN Enterprise Edition ($3,995/socket) and 3 years of support, the relative costs of large vs. small servers remain the same.

(node configurations run from the smallest on the left to the 28-core node on the right)

vSAN Enterprise            $1,923,600  $1,442,700    $961,800    $730,968    $480,900    $403,956
SSDs                         $868,100    $651,075    $597,850    $578,854    $597,850    $570,990
Capacity layer raw (TB)           768         576         576      583.68         576      564.48
Disk groups / NVMe SSDs              1           1           1           1           2           2
SATA SSDs                            2           2           3           4           6           7
Total vSAN cluster         $4,772,500  $3,617,175  $2,727,650  $2,313,326  $1,878,750  $1,646,946
$/TB                           $6,214      $6,280      $4,736      $3,963      $3,262      $2,918
Price vs 28 core                 2.90        2.20        1.66        1.40        1.14        1.00
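
The derived rows follow directly from the raw totals; a quick check of the table’s arithmetic:

```python
# The raw rows of the table above, smallest configuration first.
totals_usd = [4_772_500, 3_617_175, 2_727_650, 2_313_326, 1_878_750, 1_646_946]
raw_tb = [768, 576, 576, 583.68, 576, 564.48]

# Recompute the two derived rows: cost per raw TB, and cost relative
# to the 28-core configuration in the last column.
dollars_per_tb = [round(t / tb) for t, tb in zip(totals_usd, raw_tb)]
vs_28_core = [round(t / totals_usd[-1], 2) for t in totals_usd]

print(dollars_per_tb)  # [6214, 6280, 4736, 3963, 3262, 2918]
print(vs_28_core)      # [2.9, 2.2, 1.66, 1.4, 1.14, 1.0]
```

Both recomputed rows match the table, so the $/TB advantage of the biggest nodes isn’t a transcription artifact.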

While I always knew per-socket pricing was encouraging bigger and bigger servers, I was surprised to see that even using 12 or 16-core servers could cost half again as much as maxing out my server configs.

What about AMD?

AMD’s Epyc processors promise the power of a dual-socket Xeon from a single-socket server by cramming up to 32 cores, and just as importantly 128 PCIe lanes, into a single processor. Using a single socket cuts the cost of vSphere and vSAN in half, but when I ran the Dell configurator for a PowerEdge R7415 with a 32-core AMD Epyc 7601 and 512GB of memory (the same as the 32-core Xeon server), the total cost was just nominally lower than the R740 with two 16-core Xeons.

Second Order Costs

My calculations are limited to the day one acquisition costs of a cluster, and to the discounts Dell offers anyone for a quantity-one server purchase (about 30%). I didn’t include:

  • Network costs (every server needs two 10Gbps ports or more)
  • Rack and data center space
  • Power, cooling
  • Administration costs
  • HCI storage efficiencies like erasure coding

All of these are basically per-server costs, so including them will push the cost equation further in the direction of bigger servers in smaller clusters.


All my numbers are the quantity-one price on the Dell site. Organizations that get bigger discounts on software, from colleges to megacorps, may find smaller servers a bit more affordable than the spreadsheet shows, though.

Cost isn’t the only consideration

While the math may say that maxed-out servers are cheaper, that doesn’t mean you should buy three or four 28-core servers for your next refresh. A 28-core server with a terabyte of memory would host over 100 virtual machines, creating demand on the 25Gbps network connections, especially when you evacuate that host, and its terabyte of memory’s worth of data, to perform maintenance.

Of course, the biggest reason not to put all your VMs in 2-4 hosts is the impact of a host failure on the remaining members of the cluster. Three or four hosts supporting the reboot of 30 or 40 VMs each will be overloaded somewhere, which slows apps and delays the reboots, which annoys users, which makes them call the help desk, which makes my phone ring, and you know I hate when my phone rings.

What’s the optimal vSphere host size? Like everything else, it depends, but I would start thinking hard about bigger servers, as opposed to bigger clusters, if I’d still have a minimum of 8-10 nodes in the cluster.

All my calculations are in a Google Sheet. Feel free to plug in your numbers and see how your clusters change costs with big, and little, servers.



Like this series on HCI? Do you want more? I’m presenting a six-hour deep dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session was live December 11 and is now available on demand. Part 2 goes live January 22nd. Sign up here

Dell, VMware and Intel have all been clients of DeepStorage, LLC on projects unrelated to this blog post.


While storage administrators and DBAs haven’t always been the closest of allies, I’ve always thought a good storage admin should have a working knowledge of the database engines they support. Until recently that meant relational SQL engines like Oracle and SQL Server. In the 2000s we were trying to consolidate down to one or two database engines; today enterprise data centers are filling with NoSQL, in-memory and high-performance databases that act very differently than Oracle.

I met Brian Bulkowski, our guest on this episode of Greybeards on Storage, at Intel’s Optane DC (3D XPoint on a DIMM) announcement. There we talked about Optane, how database engines were changing, and a bit about Aerospike. I enjoyed the conversation so much I invited him to be our guest.

Ray’s Show Notes and the audio are HERE

Even better subscribe to Greybeards on Storage on iTunes, Stitcher or wherever you like to get your podcasts.


In our last installment we saw that over the past three years, the storage industry (OK, most of them), VMware, and customers have laid the groundwork for a transition from VMFS to VVols. Here’s why you’ll want to make the switch.

VVols, VMware’s Virtual Volumes, store each virtual machine, in fact each virtual disk .VMDK file and a bit more, as an independent volume on your storage array.

Without VVols, vSphere creates a datastore by overlaying VMFS, a clustered file system that holds the .VMDK and other files for many VMs, on a large volume provisioned on a storage array.

While VMFS works pretty well, storing many VMs in one array LUN reduces the value of array services such as snapshots and replication. When vSphere needs a snapshot of a VM to enable application-consistent backups, it creates its own log-based snapshots in VMFS, which can have a significant impact on performance.

Because VVols are essentially just smaller volumes, or LUNs, vSphere can simply tell the array to take a snapshot of the group of VVols that make up a VM, and then use that snapshot to enable a backup, or as the first line of backups, replicated to another array across campus.

Because modern array-based snapshots are very efficient, with a barely measurable impact on performance, shifting from VMFS-based to array-based snapshots will boost performance, especially during backup operations.
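To make that efficiency argument concrete, here's a minimal sketch, in Python, of the redirect-on-write idea behind modern array snapshots (a conceptual model, not any vendor's actual implementation):

```python
# Conceptual sketch of a redirect-on-write snapshot: taking a snapshot
# preserves the current logical-to-physical block map, so no data is
# copied; the map is only versioned when a snapshotted block is rewritten.

class Volume:
    def __init__(self):
        self.block_map = {}      # LBA -> block contents
        self.snapshots = []      # frozen block maps

    def snapshot(self):
        # O(1) in data moved: just remember the current map.
        self.snapshots.append(self.block_map)

    def write(self, lba, data):
        # If a snapshot references the live map, version it first.
        if any(m is self.block_map for m in self.snapshots):
            self.block_map = dict(self.block_map)
        self.block_map[lba] = data

vol = Volume()
vol.write(0, "v1")
vol.snapshot()
vol.write(0, "v2")               # overwrite after the snapshot
print(vol.block_map[0])          # v2 -- the live volume sees new data
print(vol.snapshots[0][0])       # v1 -- the snapshot still sees old data
```

A log-based VMFS snapshot, by contrast, has to track every write made while the snapshot exists and consolidate them all when it's deleted, which is where the performance hit comes from.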

Greater Efficiency

While thin provisioning has saved uncounted petabytes of allocated-but-never-used storage space, its dirty little secret is that it was designed to let volumes grow as data was written, but not to return space to the free pool when objects were deleted.

To address this problem, the T10 committee added the UNMAP command to the SCSI command set, allowing an operating system to tell the underlying storage when data blocks are free.

In a VMFS or NFS environment, VMware has gotten better with each release about using VAAI to send UNMAP to arrays when .VMDKs or other VMFS files are deleted.

While earlier versions required manual intervention and/or didn't work with other important features like Changed Block Tracking, vSphere 6.7 finally gets UNMAP right.

You can now recover not just the space formerly occupied by deleted .VMDK files, but also the space formerly occupied by files deleted from the file system inside the .VMDK. Since each of a VM's disks is a VVol, arrays running VVols pass the UNMAP request directly from the guest operating system to the underlying storage system, further boosting efficiency.
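A toy model, with made-up numbers, shows why UNMAP matters to a thin pool: blocks are allocated on first write, and they only return to the free pool when the initiator unmaps them.

```python
# Hypothetical sketch of thin provisioning plus UNMAP: a thin pool
# allocates blocks on write, and only returns them to the free pool
# when it receives an UNMAP for that logical block address.

class ThinPool:
    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))   # unallocated pool blocks
        self.mapping = {}                      # volume LBA -> pool block

    def write(self, lba):
        # Allocate on first write (thin provisioning).
        if lba not in self.mapping:
            self.mapping[lba] = self.free.pop()

    def unmap(self, lba):
        # Without UNMAP, deleted data would stay allocated forever.
        block = self.mapping.pop(lba, None)
        if block is not None:
            self.free.add(block)

pool = ThinPool(total_blocks=1000)
for lba in range(100):
    pool.write(lba)          # guest writes 100 blocks
print(len(pool.free))        # 900 -- space consumed
for lba in range(100):
    pool.unmap(lba)          # guest deletes files, OS sends UNMAP
print(len(pool.free))        # 1000 -- space reclaimed
```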

Simplifying Storage Administration

With VMFS, an organization would have to manually create an array LUN and a separate VMFS data store for every combination of RAID level, snapshot and replication frequency, compression, deduplication, and performance level that VMs could run at, and then attach those LUNs to all their hosts.

It’s no wonder most organizations settled on Gold, Silver, and Bronze tiers with very few options.

With VVols, the storage team can create policies that combine protection, replication, QoS, and any other feature the array vendor offers. This way, the storage team can create guardrails that prevent admins from creating VMs with insufficient protection.

The storage team is also relieved of most of its provisioning work. They just create one VVols datastore and assign it a size limit. Since the VVols datastore is just an abstracted container, it doesn’t consume any real array resources.

The virtualization admins can then provision space directly to VM virtual disks by selecting an appropriate policy through vSphere’s SPBM (Storage Policy Based Management).

This is the very same interface they’d use to manage vSAN, though with policies matched to the capabilities of the array they’re using and the requirements of the application, taking the storage administrator out of the provisioning business.
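Under the hood, SPBM is a matchmaking exercise. This sketch (not VMware's actual API; the array names, policy names, and capability strings are all invented for illustration) shows the basic idea: a VM's policy is satisfied only by storage whose advertised capabilities cover the policy's requirements.

```python
# Illustrative sketch of policy-based placement: match a storage
# policy's required capabilities against the capabilities each
# array advertises (in the real world, via its VASA provider).

ARRAY_CAPABILITIES = {            # hypothetical arrays and features
    "array-a": {"replication", "dedupe", "snapshots"},
    "array-b": {"snapshots"},
}

POLICIES = {                      # hypothetical storage policies
    "gold":   {"replication", "snapshots"},
    "bronze": {"snapshots"},
}

def compliant_arrays(policy_name):
    """Return arrays whose advertised capabilities satisfy the policy."""
    required = POLICIES[policy_name]
    return sorted(name for name, caps in ARRAY_CAPABILITIES.items()
                  if required <= caps)

print(compliant_arrays("gold"))    # ['array-a']
print(compliant_arrays("bronze"))  # ['array-a', 'array-b']
```

The guardrail effect falls out naturally: if no array satisfies a policy, the VM simply can't be provisioned with insufficient protection.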

As of vSphere 6.7, the biggest piece of the VVols puzzle still missing is support for VMware’s SRM (Site Recovery Manager) tool, which automates the steps required to restart VMs that have been replicated from one data center to another.

This summer, VMware announced that SRM/VVols support was coming, which is a good thing, without saying exactly when, which was less satisfying.

The polarity of customer opinions of VMware SRM never fails to amaze. It’s either “we don’t use SRM” or “we will not buy/use any storage that cannot be controlled by SRM.” There seems to be no in-between.

For now, VVols users will have to script their fail-over themselves, while the loyal SRM group should plan for VVols when buying new storage kit.

Since VVols can deliver easier administration, better performance w/snapshots, and greater efficiency, it seems to me you’d be a bit of a Luddite to stick to the tried and true VMFS approach to VMware storage.

This blog post was sponsored by VMware and was posted to DeepStorage.net.

If you’re reading it at a site other than DeepStorage.net the owner of the site where you’re reading it has STOLEN this post and is a thief.



We tech industry pundits are a bloodthirsty lot. We declare old technologies like tape or hard disks dead when newer, faster, shinier alternatives come along.

We aren’t any easier on new technologies; we’re ready to label fresh products such as Intel’s 3DXpoint or VMware’s VVols a failure if they haven’t broken sales records in their first two years on the market.

While VVols haven’t yet set the world on fire, the next year or two will really determine if VVols end up being the way to manage shared or external storage for virtual machines.

Satyam Vaghani demoed VVols, and the per-VM data services VVols would deliver, way back at VMworld 2011. By VMworld 2015, VVols were a feature of vSphere 6.0, and I was on the big stage talking them up.

I had high expectations for VVols in 2015. However, if I had thought it through, I should have known it would take three to five years before VVols adoption really took off. Three to five years from 2015 is right around now, and according to the reports I’m getting from storage vendors, VVols are gaining traction with many customers.

Why So Long?

A vSphere user in 2015 who wanted VVols would probably have had to upgrade both their hypervisor (to version 6.0) and their storage array. Because vSphere 5.5’s feature set satisfied most users’ requirements, organizations put off the work of upgrading as long as possible.

So many users were still running 5.5 as it approached end-of-support in September of 2018 (I’ve seen estimates that north of 30% of vSphere customers with support in late 2017 were still running 5.5) that VMware mounted a major social media campaign urging people to upgrade before it was too late. They even sent customers little countdown clocks to impart a sense of urgency.

Storage Upgrades and Annoyances

If getting users to upgrade their hypervisors was an obstacle, getting storage with proper VVols support was an even bigger problem.

For most users that meant a new storage array, because very few of the storage arrays organizations bought between 2011 and 2015 could properly support VVols.

Version 1.0 is never that attractive

Those early VVols implementations were—and my VMware friends would rather I not mention this part—frankly, not ready for prime time. VVols 1.0 did give users control over storage array snapshots at VM granularity, but 1.0 didn’t support array-based replication.

Some array vendors also took the easy route, running their VASA provider in a VM with limited, if any, built-in high availability.

As a result, we got into a vicious chicken-and-egg spiral where customers weren’t demanding VVols support, so vendors (even ones like Pure Storage that recognized the value of VVols) pushed it down the development list.

HCI Stole The Story

As all this was going on, HCI systems such as VMware’s own vSAN matured to offer the simplicity of management and per-VM services that VVols had promised.

Customers could implement HCI without synchronizing storage upgrades, so HCI was, for many, easier to buy and faster to the floor than a VVols-enabled array.

Time for another look

VMware added replication support for VVols in vSphere 6.5, and storage vendors have vastly improved both their VASA providers and their arrays’ ability to handle the large number of small “volumette” objects that VVols create.

Now that you’ve upgraded your servers to vSphere 6.5 or 6.7, even if VMware forced your hand, your next storage system really should support VVols. As we’ll see in the next installment, using VASA and VVols to manage your storage will make your life easier.

This blog post was sponsored by VMware and was posted to DeepStorage.net.

If you’re reading it at a site other than DeepStorage.net the owner of the site where you’re reading it has STOLEN this post and is a thief.



In our last installment we saw how Hyperconverged Infrastructure emerged as a term, and a market segment, to describe products that integrate storage and storage management into the same server platform that runs VMs.

As all too often happens with terms that didn’t start with rigorous definitions, the marketing elves have stretched or shrunk their meaning to meet their needs and/or their products. I’ve lost count of the number of times vendor representatives have insisted to me that some feature of their system, and only their system, was crucial for a system to be called hyperconverged.

Today we’re going to further define HCI as an architecture with design precepts a system must follow for us to call it truly HCI.

Precept 1 – Shared Storage From Server

The one key element to HCI as an architecture is that HCI systems use software to transform SSDs and/or HDDs across multiple servers into a resilient, shared, distributed datastore. Storage services in an HCI system are provided by software running on the same server CPUs as user services. The storage software can run as part of the hypervisor or operating system, or as one or more virtual machines (Virtual Storage Appliances, or VSAs). This software layer, typically a distributed file system or object store, replicates or erasure-codes data across multiple nodes and provides whatever data services, like snapshots, the system has.
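As a rough illustration of the distributed-datastore idea (node names, replica count, and the hashing scheme are all invented for the example, not any product's placement algorithm), each object can be deterministically mapped to a set of distinct nodes:

```python
# Minimal sketch of the replication idea behind an HCI datastore:
# each data object is written to N different nodes so the cluster
# survives a node failure.

import hashlib

NODES = ["node1", "node2", "node3", "node4"]
REPLICAS = 2

def place(object_id, nodes=NODES, replicas=REPLICAS):
    """Pick `replicas` distinct nodes for an object, deterministically."""
    start = int(hashlib.sha256(object_id.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

placement = place("vmdk-42")
assert len(set(placement)) == REPLICAS   # copies land on distinct nodes
print(placement)
```

Erasure coding replaces the whole copies with data-plus-parity fragments spread across more nodes, trading some CPU for capacity efficiency.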

Are two servers running a simple VSA, or a half dozen servers running Linux and Gluster to create a clustered file system for KVM, an HCI cluster? They are, but they’re HCI the same way my Clariion CX-500 is a storage array: the CX-500 is an old, busted storage array, and these are HCI architectures, just not HCI architectures I’d recommend to anyone.

Precept 2 – Scale-Out Storage Makes Scaling Easier

The storage software in an HCI system should create a shared-nothing, scale-out storage system. Some proto-HCI VSAs were limited to mirrored pairs of nodes, and while there are many ROBO use cases best served by a two-node cluster, HCI systems must scale bigger than that.

Systems must allow expansion by adding nodes to existing clusters, expanding both the available storage and compute resources. Ideally, clusters should allow heterogeneous, storage-only, and compute-only nodes, but we’re not going to require that for our definition.

If SimpliVity didn’t rely on a PCIe card in each server for data reduction, I would go so far as to define HCI solutions as running a software-defined storage service across the same servers that run user workloads. But since the term hyperconverged was coined with SimpliVity in mind (see the last blog post), we can’t make being software-defined a requirement and exclude SimpliVity from our definition.

While we’re allowing the SimpliVity accelerator card, HCI is at its very core a software concept. A vendor like Broadcom could add a 100 Gbps Ethernet port to its RAID controller and use it to cross-connect multiple RAID controllers and distribute data across nodes. While this would have many of the advantages of HCI, it would be something different, not quite HCI. We’ll talk about a few of these almost-HCI solutions in later blog posts.

Precept 3 – Managed by a Hypervisor

Since the hyper in hyperconverged, by at least one derivation, comes from hypervisor, requiring that an HCI system be based on a hypervisor seems obvious. HCI systems use the same servers to run user workloads and storage management, and the first few generations of HCI solutions all used a hypervisor to run the user workloads.

More recently, vendors like Rubrik and Cohesity have built scale-out appliances where their data management services, rather than user VMs, present the CPU load.

Were these vendors stretching the definition of HCI? A little, but the primary and secondary use cases are so different I don’t think there was significant confusion created. If the backup data mover is the only user workload is that HCI? I’m not sure it is, but I’m also not insisting it’s not.

Would a system that ran containerized workloads and storage services across a pool of servers without a traditional hypervisor be HCI? I would have to say yes, being managed by a hypervisor may not be an absolute requirement.

Precept 4 – Integrated Management

To truly be hyperconverged, and not just uberconverged or superconverged, convergence must extend to the management interfaces as well. Administrators must be able to manage both storage attributes, like data protection levels, and virtual machines.

Architecturally, any system that scales out storage across the internal storage of VM hosts can properly be called HCI. Philosophically, HCI is all about simplicity, and separate management consoles for VMs and storage just aren’t simple.

HCI, however, isn’t primarily an architecture but a philosophy of simplicity, and a series of promises made by that philosophy; that’s a story for another blog post.

Disclosure: DellEMC, Rubrik, Cohesity, and Simplivity have all been clients of DeepStorage, LLC.

Like this series on HCI? Want more? I’m presenting a six-hour deep-dive webinar/class as part of my friend Ivan’s ipSpace.net site.

The first 2-hour session is December 11. Sign up here

This post was published at www.deepstorage.net if you are reading it anywhere else on the internet it has been stolen and the thief running the site you are reading it at was too stupid to edit this out. 
