
Hyper-converged + Cloud Computing Forum

[Principles & Architecture] Data Protection in RAID and Object Storage

Posted 2015-11-9 20:56:02


This post will discuss two examples below, with Part 2 discussing Hyper-converged solutions using Distributed File Systems.

1. Traditional RAID

2. Hyper-converged Object Based Storage

Let's start with traditional shared storage and the most common RAID level in my experience: RAID 5.

The diagram below shows 3 x 4TB SATA drives in a RAID 5 with a hot spare.

[Figure: 3-disk RAID 5 with hot spare]

Now let's look at a drive failure scenario. The hot spare activates and starts rebuilding, as shown below.

[Figure: 3-disk RAID 5 with hot spare, rebuilding]

So this all sounds fine, we’ve had a drive failure, and a spare drive has automatically taken its place and started rebuilding the data.

The problem is that even in this small, simplified example, we have 2 drives (roughly 200 IOPS' worth) trying to rebuild onto just a single drive. So the maximum rate at which the RAID 5 can restore resiliency is limited to that of a single drive, around 100 IOPS.

If this were an 8-disk RAID 5, we would have 7 drives (or 700 IOPS) rebuilding to, again, only a single drive capable of roughly 100 IOPS.
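The IOPS arithmetic above can be sketched quickly. The ~100 IOPS per SATA drive figure follows the text and is illustrative, not a benchmark:

```python
# Why widening a RAID 5 set doesn't speed up the rebuild: reads scale with the
# number of surviving drives, but every reconstructed block funnels into one spare.

def rebuild_bottleneck_iops(num_drives: int, iops_per_drive: int = 100) -> dict:
    surviving = num_drives - 1  # one drive failed; the rest supply rebuild reads
    return {
        "aggregate_read_iops": surviving * iops_per_drive,  # what the set can supply
        "write_iops_ceiling": iops_per_drive,               # what the spare can absorb
    }

print(rebuild_bottleneck_iops(3))  # 3-disk example: 200 read IOPS vs a 100 IOPS ceiling
print(rebuild_bottleneck_iops(8))  # 8-disk example: 700 read IOPS vs a 100 IOPS ceiling
```

However wide the set, the write ceiling stays fixed at a single drive's speed.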

There are multiple issues with this architecture.

  1. The restoration of resiliency of the entire RAID is constrained by the destination drive, in this case a SATA drive which can sustain less than 100 IOPS.
  2. A single subsequent HDD failure within the RAID will cause data loss.
  3. The RAID rebuild is a high-impact activity on the storage controllers, which can affect all storage they serve.
  4. The RAID rebuild is an especially high-impact activity on the virtual machines running on the RAID.
  5. The larger the RAID set, or the capacity of the drives in it, the longer the rebuild takes, the higher the performance impact, and the greater the chance of a subsequent failure leading to data loss.
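Point 5 can be made concrete with a rough rebuild-time estimate. The drive size and sustained write throughput below are illustrative assumptions:

```python
# Back-of-envelope RAID 5 rebuild time, assuming the rebuild is bottlenecked by
# the single destination (hot spare) drive and runs with no competing workload.

def raid5_rebuild_hours(drive_capacity_tb: float, dest_mb_per_s: float) -> float:
    """Hours to rewrite one full drive's worth of data onto the hot spare."""
    capacity_mb = drive_capacity_tb * 1_000_000  # TB -> MB (decimal, as drives are sold)
    return capacity_mb / dest_mb_per_s / 3600

# A 4TB SATA drive sustaining ~100 MB/s of rebuild writes:
print(f"{raid5_rebuild_hours(4, 100):.1f} hours")  # best case; production rebuilds are throttled further
```

Doubling the drive capacity doubles this figure, which is why large-capacity RAID 5 rebuilds stretch into days.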

Now I’m sure most of you understand this concept, and have felt the pain of a RAID rebuild taking many hours or even days. But with new hyper-converged technology this issue is no longer a problem, right?

Wrong!

It entirely depends on how data is recovered in the event of a drive failure. Let's look at an example of a hyper-converged solution using an object store. The diagram below shows a simplified example of hyper-converged object-based storage with 4 objects, represented as Objects A, B, C and D in black, and the 2nd replicated copy of each object represented as Objects A, B, C and D in purple.

Note: Each object in the object store can be hundreds of GB in size.

[Figure: hyper-converged object store, normal state]

Let’s take a look at what happens in a disk failure scenario.

[Figure: hyper-converged object store after a drive failure]

From the above diagram we can see a drive has failed on Node 1, which means the replicas of Object A and Object D have been lost. The object store will then replicate a copy of Object A to Node 4, and a copy of Object D to Node 2, to restore resiliency.
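This recovery pattern can be sketched as follows. Each lost replica is rebuilt by one source node streaming the entire object to one destination node, so per-object recovery speed is capped by that single node pair. The source nodes and throughput figure below are hypothetical, for illustration only:

```python
# Hedged sketch: object-store recovery moves whole objects between single
# node pairs. Placement and speeds here are assumptions, not a real product's.

def object_recovery_hours(object_gb: float, pair_mb_per_s: float) -> float:
    """Time for one source -> destination pair to stream a whole object."""
    return object_gb * 1000 / pair_mb_per_s / 3600

# (object, hypothetical source, destination per the diagram) after Node 1's drive fails
recoveries = [("A", "Node 2", "Node 4"), ("D", "Node 3", "Node 2")]

for obj, src, dst in recoveries:
    hours = object_recovery_hours(200, 100)  # 200GB object, ~100 MB/s per pair
    print(f"Object {obj}: {src} -> {dst} takes ~{hours:.2f} h at one pair's speed")
```

Note that if two recoveries share a source or destination node, as in the example further below, they contend for the same drives and both slow down.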

There are multiple issues with this architecture.

  1. Object-based storage can lack granularity, as objects can be 200GB+.
  2. The restoration of resiliency of any single object is constrained by the source drive or node.
  3. The restoration of resiliency of any single object is also constrained by the destination drive or node.
  4. The restoration of multiple objects (such as Objects A & D in the above example) may be constrained by the same drive or node, which results in contention and slows the process of restoring resiliency to both objects.
  5. The impact of the recovery is high on virtual machines running on the source and destination nodes.
  6. The recovery of each object is constrained by a single source node and a single destination node.
  7. Object stores generally require a witness, which is stored on another node in the cluster. (Not illustrated above)

It should be pointed out that where SSDs are used as a write cache, this can help reduce the impact and speed up recovery in some cases. But where data needs to be recovered from outside the cache, i.e. from a SAS or SATA drive, the fact that writes go to SSD makes no difference, as the recovery is constrained by the read performance of the source drive.

Summary:

Traditional RAID used by SAN/NAS and newer hyper-converged object-based storage both suffer similar issues when recovering from drive or node failures, which include:

  1. The restoration of resiliency is constrained by the source drive or node.
  2. The restoration of resiliency is constrained by the destination drive or node.
  3. The restoration is high impact on the destination.
  4. The recovery of one object is constrained by the network connectivity between just two nodes.
  5. The impact of the recovery is high on any data (such as virtual machines) running on the RAID or on the source/destination nodes.
  6. The recovery of a RAID or an object is constrained by a single part of the infrastructure: a RAID controller/drive, or a single node.

Now for Part 2. The diagram below shows a 4-node hyper-converged solution using a Distributed File System, with the same 4 x 4TB SATA drives and data protection via replication with 2 copies. (Nutanix calls this Resiliency Factor 2)

[Figure: 4-node Distributed File System, normal state]

The first difference you may have noticed is that the data is much more granular than in the hyper-converged object store example in Part 1.

The second, less obvious difference is that the replicated copies of the data on Node 1 (i.e. the data with purple letters) do not reside on a single other node, but are distributed throughout the cluster.
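A minimal sketch of this spreading property follows. The round-robin placement here is a stand-in for illustration only; actual NDFS placement logic is more sophisticated:

```python
# Illustrative placement: each 1MB extent on a node gets its replica on some
# *other* node, spread across the cluster rather than pinned to one partner.

NODES = ["Node 1", "Node 2", "Node 3", "Node 4"]

def place_replicas(extents: list[str], primary_node: str) -> dict[str, str]:
    """Assign each extent's replica to the other nodes in round-robin order."""
    others = [n for n in NODES if n != primary_node]
    return {ext: others[i % len(others)] for i, ext in enumerate(extents)}

placement = place_replicas(["A", "B", "C", "D", "E", "I", "M", "P"], "Node 1")
print(placement)  # replicas rotate across Nodes 2, 3 and 4 -- no single partner node
```

The consequence is that when Node 1 loses a drive, every surviving node holds some of the needed copies, so every node can contribute to the rebuild.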

Now let's look at a drive failure example.

Here we see Node 1 has lost a drive hosting 8 granular pieces of data, each 1MB in size.

[Figure: Distributed File System recovery after a drive failure]

Now the Distributed File System detects that the data represented by A,B,C,D,E,I,M,P has only a single copy within the cluster and starts the restoration process.

Let's walk through each step, although in practice these steps are performed concurrently.

1. Data “A” is replicated from Node 2 to Node 3
2. Data “B” is replicated from Node 2 to Node 4
3. Data “C” is replicated from Node 3 to Node 2
4. Data “D” is replicated from Node 4 to Node 2
5. Data “E” is replicated from Node 2 to Node 4
6. Data “I” is replicated from Node 3 to Node 2
7. Data “M” is replicated from Node 4 to Node 3
8. Data “P” is replicated from Node 4 to Node 3

Now the cluster has restored resiliency.

So what was the impact on each node?

[Table: per-node read/write I/O during recovery]

The above table shows a simplified representation of the workload of restoring resiliency to the cluster. As we can see, the workload (8 granular pieces of data being replicated) was distributed very evenly across the nodes.

Next, let's look at the advantages of a hyper-converged solution with a Distributed File System (which Nutanix uses).

  1. Highly granular distribution using 1MB extents, not large objects.
  2. The work required to restore resiliency after a drive (or node) failure is distributed across all drives and nodes in the cluster, leveraging every drive's and node's capability. (i.e. not constrained to the <100 IOPS of a single drive)
  3. The rebuild is a low-impact activity, as the workload is distributed across the cluster rather than depending on a single source/destination pair of drives or nodes.
  4. The rebuild has a low impact on the virtual machines running on the distributed file system, and consistent performance is maintained.
  5. The larger the cluster, the quicker and lower-impact the rebuild, as the same amount (GB) of restoration work is distributed across a higher number of drives/nodes.
  6. With Nutanix, SSDs are used not only as a read/write cache but as a persistent storage tier. Recovering data is therefore written to SSD, and even where the data being recovered is not in cache (the memory or SSD tiers), it may still reside in the persistent SSD tier, which dramatically improves the performance of the recovery.
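The scaling in point 5 can be illustrated numerically. The per-node throughput figure is an assumption, and for simplicity the node with the failed drive is assumed to sit out the rebuild:

```python
# Illustrative scaling of a distributed rebuild: the data needing re-replication
# is fixed (one failed drive's worth), but the nodes sharing the work grow with
# the cluster. Throughput numbers are assumptions for illustration only.

FAILED_DRIVE_TB = 4
NODE_MB_PER_S = 100  # assumed sustained rebuild throughput per node

def distributed_rebuild_hours(num_nodes: int) -> float:
    """Hours to re-replicate one drive's worth of data across the peers."""
    workers = num_nodes - 1  # simplification: the affected node sits out
    return FAILED_DRIVE_TB * 1_000_000 / (workers * NODE_MB_PER_S) / 3600

for n in (4, 8, 16):
    print(f"{n} nodes: ~{distributed_rebuild_hours(n):.2f} hours")
```

Contrast this with the RAID 5 and object-store cases, where adding drives or nodes does nothing for rebuild speed because the bottleneck is a fixed pair of devices.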

Summary:

As discussed in Part 1, Traditional RAID used by SAN/NAS and Hyper-converged solutions using Object based storage both suffer similar issues when recovering from drive or node failure.

Whereas the Nutanix hyper-converged solution, using the Nutanix Distributed File System (NDFS), can restore resiliency following a drive or node failure faster and with lower impact thanks to its highly granular and distributed architecture, meaning more consistent performance for virtual machines.

Welcome to nutanix.club, the largest Chinese-language hyper-converged & cloud computing community: www.nutanix.club

Posted 2016-4-6 18:08:02
Thanks

Posted 2016-4-26 07:21:01
Thanks for sharing

Posted 2016-12-31 19:14:53
Thanks for sharing

Posted 2017-1-3 10:21:33
Thanks, great article.

Posted 2017-1-8 21:11:54
Thanks for sharing.

Posted 2018-5-15 10:43:20
For anyone preparing to install the Nutanix CE edition: it requires at least 4 cores, 16GB of RAM, a 200GB SSD, and a 500GB HDD, plus an Intel NIC and an LSI RAID card. It can run as 1, 3, or 4 nodes.

Posted 2018-7-31 10:03:38
Learned a lot, thanks.

Posted 2018-9-28 19:49:17
Thanks for sharing.
