The purpose of this guide is to give the reader enough general understanding of ZFS and Nexentastor to be better equipped to design and implement a ZFS-based storage solution. The core storage framework of Nexentastor is a robust, enterprise-quality filesystem: ZFS. As is true of any filesystem on the market today, ZFS has its strengths and weaknesses. We want to discuss these important elements, as well as the considerations that should be made during your design process to achieve the most robust and scalable solution on your budget.
Before any storage solution is considered, whether traditional hardware RAID or software-based RAID like Nexentastor with ZFS, it is critical to determine requirements and accept the fact that there are always trade-offs between availability, storage capacity and performance. It is true that we can reconfigure storage after it has been deployed, but consider the challenges and costs of reconfiguring a production Storage Area Network, especially when service outages may not be possible in the environment without significant planning, online data migration and coordination with customers. It is critical to understand your storage needs before any purchasing decisions are made, to make sure those needs will be met when the system is deployed. Without well-defined requirements it will be difficult to know what trade-offs we should expect to make. There are always trade-offs in any technical decision, and having the right knowledge is essential to making good choices during the design phase of your Nexentastor solution.
There are several factors that ultimately all play a role in the final configuration of the pool, and we need to address each factor and the decisions that follow from analyzing it. A storage solution will be used very differently by different organizations, and one size never fits all. For example, we do not use a Ferrari to tow a boat out to the lake; likewise, we do not think of trucks as high-performance roadsters. The same applies to storage. When storing backups for periodic recovery, our main concern is data integrity, followed by having enough capacity to retain backups to satisfy the retention requirement imposed by the business, with the last concern being redundancy to properly protect our retained backup data. At the same time, if we are running a latency-critical stock-trading application, the effect of even moderate latency from improperly configured storage may be completely unacceptable to such an application. These factors play a huge role in our decisions about hardware and pool configuration with ZFS.
Complexity in Storage Area Networks is inherent and increases further whenever we begin to mix workloads. If we are using the same storage system for backups and for our highly latency-sensitive trading floor application, we have to make sure that we can maintain our latency requirements and have enough space to handle backup retention requirements, all at the same time. Feasible, sure, but it requires careful planning, gathering of requirements and, most critically, understanding what a suitable storage solution for this environment is. Of course, most modern SANs are used to provide storage back-ends for a multitude of applications with potentially very different requirements. Knowing this, we have to design a system, accepting certain trade-offs, capable of meeting these requirements as well as meeting growth projections, scaling with demand, handling natural spikes in demand, and tolerating hardware failures, which inevitably occur as disks and other computer equipment fail.
Before we dig any further, there are a few key concepts about ZFS and storage in general that we have to understand and accept, as they will make it easier to further analyze our current storage need and the solution necessary to satisfy this need.
We cannot forget that in all but a select few environments I/O will be generated by a multitude of sources, and there will be ongoing competition between applications to store or retrieve data.
There are many very different worlds that we have to unite when we design a centralized storage solution. Every application is in a sense its own world, because in most cases it is only aware of itself, be it a single node or a large cluster of machines accessing shared resources on the same SAN. Of course at the infrastructure level we are dealing with all applications as one bigger whole, running in parallel, demanding resources such as I/O, but having a mix of vastly different profiles, resulting in a dynamic mix of elements like recordsize, seeksize, numbers of work files, sizes of files, directory widths and depths, sizes of operations, types of operations (such as attribute lookups, deletions, creates, etc.), frequency of operations, caching, etc.
At any moment there may be several applications requesting that data be retrieved from or written to storage. Understanding the fact that we are working with highly dynamic systems that can, at a moment's notice, change from mostly generating large sequential (streaming) I/O to small, highly random I/O is critical, and the more data we can aggregate about our particular environment, the better equipped we will be in our ongoing research and design of the future storage solution. There are no tools available today that will adequately profile entire environments, as opposed to individual applications, and tell us exactly what we need to build.
The goal of this paper is to equip you with enough information to know what considerations to make and questions to ask during your research and analysis phase and is not intended to be a deep dive into ZFS. A more in-depth discussion about ZFS and its many unique features is strongly encouraged. This document barely scratches the surface of capabilities and intricacies of ZFS.
In general there are a number of key elements that we need to address before we can start to think about our final design. This is a high-level list of these elements, and we will address each in detail.
- ZFS and latency: what is latency and why do we need to think about it?
- Caching: read and write caching with the ZFS Intent Log.
- ZFS dynamic striping and improved parallelism.
- What problem are we trying to resolve with the planned storage solution?
- Striking a balance between capacity and IOPs.
- What do the applications in our environment do?
- One pool or multiple: advantages and disadvantages.
- Disaster recovery through ZFS replication.
Latency is a subject that could easily have books written about it, so without digging too deep, let's briefly address it here. At a high level, within the context of storage, we should think of latency as the total amount of time it takes to complete the sequence of I/O requests needed to satisfy a specific task issued by an application. Latency should be observed from the moment an I/O is issued by the application to the moment it is satisfied. It is a key concept which we have to grasp, as it is critical in many high-performance environments today. User experience could very well suffer if key applications experience high latency at the storage layer and do not have mechanisms to counter and obscure this latency.
Latency is the total time it takes for the I/O request to be structured, perhaps delivered via the network, sent to disk through the system bus and the HBA, acknowledged, handled, and the result returned along the reverse path back to the requesting application. Busy storage systems may be doing many thousands of I/O operations per second, and in most cases, as is so typical of a SAN environment, I/O from a number of applications has extremely diverse characteristics. These different I/Os are mixed together, resulting in a highly random pattern by the time they are delivered to the filesystem and ultimately the disks. The worst-case scenario, which is quite common with consolidated SANs, is a 100% random workload, which necessitates very frequent repositioning of the heads in a disk drive. As is typical of most SANs, this is happening to many disks at the same time. Latency increases as disks become busier. Data is typically queued in multiple places between the application (the source of the I/O) and the physical disks, and time spent in any of these queues adds to the total latency.
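To make the composition of latency concrete, here is a toy Python breakdown of a single cache-miss read; every number below is an assumed, illustrative value rather than a measurement of any particular system:

```python
# Illustrative only: the latency an application observes is the sum of the time
# spent in every stage and queue along the I/O path. The figures below are
# assumed, round numbers for one small read that misses all caches.
stages_us = {
    "application and OS overhead":       50,
    "network transit to the SAN":       200,
    "queueing inside the storage node": 300,
    "HBA and bus transfer":              50,
    "disk seek + rotation + read":     8000,   # dominant term for spinning media
    "return path back to the client":   250,
}

total_us = sum(stages_us.values())
print(f"total round-trip latency: {total_us} us ({total_us / 1000:.1f} ms)")
for stage, t in stages_us.items():
    print(f"  {stage:<35} {t:>6} us  ({t / total_us:5.1%})")
```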
Environments where virtualization is leveraged are even more susceptible to latency induced by storage. There are more points at which I/O contention can occur and where I/O could be queued. Virtual machines running on a host will send their I/Os to the host, and during peak times may in fact saturate the host with I/O requests, resulting in the host relying on its own, often deep, queues to hold on to I/Os that it is unable to service right away. This gets far more complicated when we begin to think about the effect of peak times in systems that host virtual machines and the amount of I/O that may be generated during these peaks. Imagine, for example, a boot storm generated by multiple systems all booting at the same time.
When disks are taxed with highly random I/O and they are struggling to keep up, a natural result is an increase in latency. Flooding disks with I/O, especially with modern SANs on multi-link 10GbE systems, is not difficult. Understanding what latency is and what its impact is on the environment is essential to achieving solid and fairly predictable performance from our storage solution.
To combat latency, caching is commonly used throughout the many layers of today's complex systems. In the real world, latency is observed through all the layers of a typical system. It takes time for an application to prepare and submit the I/O to the system. It takes time for that I/O to be delivered through the network stack over the network to the SAN, time to receive and assemble the I/O and deliver it to the filesystem, and time for the filesystem to process it, perform the actual I/O to physical storage and return results. I/O to physical storage in the case of ZFS may or may not go to the disks; for example, a read may be served directly from cache. However, when I/O requires blocks from disks, latency from the HBA and of course the disks themselves is another factor that we have to consider. By separating storage from the system where the application resides, be it a single bare-metal system or a virtual host with many VMs running on it, we also introduce latency by placing a physical network layer between storage and application.
Some applications deal with latency more gracefully than others and most environments will have a mix of applications, some of which handle it well, and some of which do not. We should always focus on the worst case scenarios, the outliers, because inevitably the worst case scenarios are those we are least prepared for. We will discuss this in more detail later. This subject could become infinitely complex and so we will try to address some of the basics of latency, but we have to remain cognizant of the fact that understanding latency and its impact to our environment is no trivial task.
Latency is inherent to spinning media because mechanical operations take more time to complete than it takes to move bits at nearly the speed of light. ZFS employs multiple mechanisms to cache data, utilizing system memory, and low-latency Solid State Devices.
The ARC (Adaptive Replacement Cache) is an impressive and complex mechanism which caches recently accessed data, frequently accessed data and data in physical proximity to blocks that are being used, in an effort to pre-cache data that has a higher chance of being touched in the very near future.
Historically, filesystems were not designed to allocate large caches for data to accelerate repeated retrieval of information, or to buffer writes such as those generated by databases, which are particularly difficult on storage systems due to the synchronous nature of their highly random transactions. With ZFS we try to maximize our ability to rapidly access data which has recently been accessed, or is being frequently accessed, by caching this data in one of two levels of cache. The ARC is by nature allowed to grow to almost the size of system memory, which effectively makes it a volatile but extremely responsive and vast cache. For this reason, systems with 96GB or more of system memory are commonly used to build ZFS-based storage.
In reality, most environments today are working with much larger working sets than typical server memory configurations permit. For this reason we attempt to employ flash-based storage devices as cache in front of main storage, which predominantly is spinning media. Flash devices do not, under normal circumstances, suffer from the effects of latency, and in fact are specifically built to combat latency at every turn. This technology is an extension of the ARC and is referred to as the L2ARC - Level 2 ARC. While nothing beats the performance of system memory, flash-based storage comes close.
What makes flash-based storage particularly compelling is its ability to handle random small I/O extremely well, with incredibly low latency. Because highly random I/O is so tough on spinning disks, pushing some of that work to cache is one of ZFS' primary strengths.
As always, realize that the more you pay, the better the performance you will get. Because storage systems are predominantly asked to handle highly random I/O, using flash-based storage is an ideal way to combat latency. With ZFS in particular we can of course build solutions entirely out of SSDs to deal with extreme low-latency requirements. But, more generally, we can use SSDs as cache, allowing us to buffer both reads and writes, reducing the number of physical I/Os to disks and resulting in much lower latency while handling large, very random workloads. More about this later.
Storage has historically struggled with applications that require consistent on-disk data and therefore write information synchronously. To address this, ZFS includes an Intent Log mechanism which essentially allows synchronous writes to be grouped and ordered along with asynchronous writes and sent to disk in a more organized, sequential stream, while rapidly acknowledging the application that is submitting synchronous write requests. This allows the application to continue working instead of waiting for storage to flush data to disk every time I/O is sent to the storage system.
The ability to hold I/O in RAM while protecting critical data in the Intent Log gives ZFS more time to group I/Os in order to flush them to disk sequentially, maximizing throughput and reducing on-disk fragmentation of data. This write cache is accomplished with high-performance NAND flash devices, such as SSDs, and non-volatile memory-backed RAM devices. These devices do not suffer from the latency inherent to spinning disks, and can respond very rapidly to highly random I/O, which at an appropriate time is flushed to disks as part of a larger transaction group flush. All writes to disks can be thought of as a transaction: they either all succeed or all fail. This fact, together with the presence of the Intent Log, assures data coherency and consistency.
Again, this is something that we need to think about when we analyze our storage requirements. Many environments today require this level of caching; it is heavily used any time virtualization, OLTP applications or essentially any database engine is deployed. If the environment has virtualized systems, databases or any applications that you know enforce data integrity by doing synchronous I/O, the addition of a write cache is practically a requirement. It will further benefit the storage solution by reducing the amount of I/O to the disks, allowing the disks to handle other competing I/O requests.
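As a rough illustration of how a dedicated log device is sized, the sketch below applies a commonly cited rule of thumb: the device only ever holds the synchronous writes of a couple of in-flight transaction groups. The throughput figure and the transaction-group interval used here are assumptions for illustration, not Nexentastor defaults:

```python
# Rough SLOG (dedicated ZIL device) capacity estimate. Assumption: the device
# only holds data that has not yet been committed to the pool, i.e. at most a
# couple of transaction groups' worth of synchronous writes.
sync_write_mb_per_s = 400      # assumed peak synchronous write throughput
txg_interval_s      = 10       # assumed transaction group commit interval
txgs_in_flight      = 2        # roughly two txgs may be outstanding at once

slog_capacity_gb = sync_write_mb_per_s * txg_interval_s * txgs_in_flight / 1024
print(f"ballpark SLOG capacity needed: {slog_capacity_gb:.1f} GB")
# Roughly 8 GB here; capacity is rarely the constraint for a log device,
# write latency and endurance are.
```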
Write throttling is another absolutely essential concept that is not native to many filesystems, but is a real factor when dealing with ZFS and Nexentastor in particular. Because latency is so critical, ZFS has mechanisms to make sure that writes do not completely overwhelm storage and badly impact read performance. In general, write operations are more difficult than read operations, for several reasons. The system is designed with the understanding that there should be a balance in performance between reads and writes, so during periods of heavy write activity it seeks a better balance by throttling writes. This can become rather problematic in environments where workloads shift drastically from being heavily write-biased to read-biased. Throttling of writes could in fact become so severe that a SAN may seem completely overrun, yet performance tools show a continuous reduction in the amount of I/O to disks. This is a factor which we have to take into account as we plan our environment. We have to make sure that we do not constrain it from day one, and design it for our specific workload footprint.
The write throttle is a mechanism designed to manage short periods of abnormally heavy writes; unfortunately, the side-effect is that prolonged write activity could be throttled so aggressively that eventually operations from the client begin to time out and fail. A good example of this is a repeated failure to complete a storage vMotion in a vSphere cluster when storage is not correctly configured to support the demands of the environment. Again, understanding our environment is key, and as we work to design our solution we need to think about whether we are going to be read- or write-biased and design accordingly. While we are not diving deeply into this topic, this should at least give you the knowledge to ask questions you did not know to ask prior to reading this. Ultimately our goal is to get you to think in terms of storage and in particular ZFS, because with traditional (legacy) storage we did not have these factors to consider.
It is difficult to summarize all the key features of ZFS in a few sentences, but the key concepts that we need to understand are touched upon here. ZFS first and foremost utilizes dynamic striping, which means that as we create more RAID groups inside our storage pool (or volume, in Nexentastor terms), we increase throughput by making a wider stripe, which essentially means that we can do more work in parallel. It is easy to visualize each RAID group, also referred to as a top-level VDEV, as a single disk. So, a pool that consists of, say, 3 top-level VDEVs, regardless of the number of disks in each VDEV, can be thought of as having the aggregate performance of 3 disks, before factoring in any caching. The math is linear; as we add top-level VDEVs we linearly scale performance. To further illustrate this notion, a pool with 3 top-level VDEVs will have 1/2 the throughput of a pool with 6 top-level VDEVs, 1/4 the throughput of a 12 top-level VDEV pool and 1/8 the throughput of a 24 top-level VDEV pool. As we can tell from this, more top-level VDEVs are better than fewer. These comparisons assume that we are using the same disks in each conceptual configuration. In reality, of course, performance varies based on many factors, including the type of disk, its rotational speed, average seek times, etc. The data on disk will also have some impact on performance.
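The scaling rule above can be written out in a few lines of Python; the per-VDEV IOPs figure is an assumed value for a single 10K SAS disk, and caching is ignored:

```python
# The dynamic striping rule described above, in code: treat every top-level
# VDEV as roughly one disk's worth of random IOPs (caching ignored).
iops_per_vdev = 140   # assumed random IOPs of a single 10K SAS disk

for vdev_count in (3, 6, 12, 24):
    print(f"{vdev_count:>2} top-level VDEVs -> ~{vdev_count * iops_per_vdev:>5} random IOPs")
# Doubling the number of top-level VDEVs doubles the aggregate IOPs,
# regardless of how many disks sit inside each VDEV.
```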
This may sound complicated now, but will begin to make sense as you learn more about Nexentastor and ZFS in particular. The key understanding here is that performance scales linearly with the number of top-level VDEVs. So, when we think about our storage solution in terms of performance (IOPs and latency, rather than capacity), we have to realize that if we are trying to match the IOP performance of 3 disks working in parallel, we have to have a pool with 3 top-level VDEVs. One is tempted to argue that the number of disks in each VDEV surely will make a difference; however, the key performance factor is parallelism of the workload, which we can only achieve by striping data across more top-level VDEVs.
With capacities today reaching 3TB and growing rapidly for enterprise-class drives, we tend to think about capacity and reduction of footprint, both of hardware and of energy consumption, over and above IOPs. These are important matters, but the fact is that nothing replaces spindles when we are dealing with I/O-hungry applications and low-latency demands. While disk capacity is continuously growing, the speed of disks, and consequently seek times (the amount of time to reposition the head to the location of data on disk), has not been keeping pace. Because seeks are costly, doing 100% random I/O is extremely expensive and something spinning disks are very poorly equipped to handle. A completely random workload translates to constant repositioning of the heads on the disks, which leads to latency associated with the time to move and position the heads. An increase in latency means lower throughput from the disks. Many disks today have about the same rotational speed and seek times as disks from 7 to 10 years ago. Although manufacturers have been improving disks with larger caches, more efficient placement of physical blocks on the platters and better servos to reduce seek times, these improvements are still insufficient. In all but a few isolated use cases IOP capacity is absolutely key, as its abundance will reduce latency, reduce the chance of write throttling, etc. For example, 20 mirrors of 7.2K 1TB drives are equal in capacity to 10 mirrors of 7.2K 2TB drives, but with 20 mirrors we have 2000 raw IOPs versus 1000 raw IOPs with 10 mirrors of 2TB drives. We will continue talking about the importance of IOPs throughout the rest of this document, because it cannot be overstated.
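The mirror comparison above, worked out in Python with the same assumed figure of roughly 100 random IOPs per 7.2K drive:

```python
# The mirror example from the text. We assume ~100 random IOPs per 7.2K drive
# and count one drive's worth of IOPs per mirrored pair (a conservative,
# write-oriented view; reads can do better since either side of a mirror can
# service them).
iops_per_72k_drive = 100

configs = {
    "20 mirrors of 1TB 7.2K drives": {"mirrors": 20, "drive_tb": 1},
    "10 mirrors of 2TB 7.2K drives": {"mirrors": 10, "drive_tb": 2},
}

for name, c in configs.items():
    usable_tb = c["mirrors"] * c["drive_tb"]        # one drive of each pair is a copy
    raw_iops  = c["mirrors"] * iops_per_72k_drive   # one mirror vdev ~ one drive of IOPs
    print(f"{name}: {usable_tb} TB usable, ~{raw_iops} raw IOPs")
# Same usable capacity, but twice the mirrored pairs means twice the IOPs.
```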
Planning for a new storage solution is typically driven by a growing need to resolve some existing problem. This problem may be the current storage solution's inability to handle growing storage demands, a decision to consolidate systems and storage into a more centralized configuration, the deployment of new systems with storage requirements that cannot be met by the existing storage architecture, etc. Defining the problem is critical, because it leads to other questions that have to be answered in order to properly scope and configure a new storage system.
It is common for the problem statement to be complex and multifaceted. There is usually not just one well-defined problem, but a combination of issues: for example, consolidation of physical systems into virtualized clusters in order to reduce physical footprint and associated costs, while at the same time retiring an aging backup solution that is no longer able to meet your SLAs for completion of backups, or perhaps a lack of capacity to retain the growing data produced by your applications and users.
To successfully define a problem statement we need to ultimately boil things down to the most basic of terms. Essentially, once we strip away most of the details, we are looking at a question of capacity, level of data protection and performance, and which two of these three are most important. Remember, we mentioned trade-offs at the beginning of the document, and with ZFS, just as with any other storage solution, we have to choose between capacity, performance and data protection. Without understanding our problem we cannot begin to make any decisions about the trade-offs that will need to be made.
For example, if the problem we are trying to resolve is mainly a lack of capacity to store fairly volatile data, the value of which diminishes considerably after 30 days, selecting double or triple parity, which of course means sacrificing capacity, may not be the best option. If capacity is the main driver and only non-latency-critical applications are going to rely on this solution, building a storage pool with only a few stripes may be completely acceptable. We may not even need to think about read or write caching: latency is not an issue, we do not have users interacting with the data or applications on this storage, and as long as we can achieve our capacity objective, with cost always being a factor, the benefit may not be worth the added expense.
Similarly, if we are trying to achieve a higher level of data availability and protection and are willing to sacrifice capacity, we may opt for a greater number of stripes. Having more stripes, or VDEVs, actually increases our redundancy, because each VDEV is essentially a RAID group, and with a choice of single or double parity we can afford to lose one or two disks, respectively, in each RAID group while still remaining operational. The compromise here is capacity, because as we increase the number of VDEVs, we increase the number of stripes and in turn the amount of raw capacity required to store parity data. We always have to make some trade-off. The added benefit here is an improvement in performance as we parallelize the workload. Finding a perfect balance is not easy, but it is critical.
A situation where we are supplying storage to a highly dynamic web site, which may consist of a multitude of virtualized systems running perhaps a trading floor application, or several such applications, with extreme sensitivity to latency and very high data-availability requirements, is completely different from being mainly a data repository, and requires a completely different approach to the design. Here, we may not consider mirroring or single parity, seeking instead the greater data protection provided by double or triple parity. In order to address our critical latency requirements we are likely going to increase the number of RAID groups in order to better streamline I/O, and leverage both read and write caching, thereby reducing the number of I/O requests that have to hit our spinning disks, because mechanical disk devices, while they have great capacities, all suffer from latency induced by positioning and re-positioning the heads. In some very specific use cases we may even consider an SSD pool, again sacrificing capacity for a higher price, but achieving outstanding low-latency numbers.
Please bear in mind that these are all extremely simplified examples to help us visualize the challenges with which we will be presented in our own environments. In reality, most of the time we will be dealing with a mixture of workloads: some may not be concurrent, for example only occurring during certain times of day, while others are highly concurrent.
We cannot ignore the simple fact that when we talk about storage we are not only talking about how much we can store (capacity), but also about IOPs (the number of operations performed per second). Clearly, capacity is an absolute number that we can fairly easily understand and derive. For example, if we are building a SAN to consolidate 100 servers, each of which on average uses 500GB of disk, we are looking at a 50TB solution, with perhaps another 20% overhead expectation, for a total of around 60TB. We are simplifying this conversation by not talking about compression, deduplication and other space-conserving measures. Those considerations should be made after we have a very good understanding of our data and our requirements, and the depth of that discussion is largely out of scope for this higher-level document. The takeaway here is that it is fairly trivial to understand capacity requirements. When we build conventional servers with local storage on which our applications run, we typically do not think about IOPs, because dedicated storage in the servers typically has fast enough drives, and enough of them, to support most typical workloads.
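The capacity arithmetic from this example, spelled out (the 20% headroom figure is simply the assumption used above):

```python
# The capacity side of the consolidation example: trivial arithmetic, shown
# here only to contrast it with the IOPs math that follows.
servers           = 100
avg_used_gb       = 500
overhead_fraction = 0.20    # headroom assumed in the text

raw_need_tb   = servers * avg_used_gb / 1000
with_headroom = raw_need_tb * (1 + overhead_fraction)
print(f"capacity target: {raw_need_tb:.0f} TB used -> ~{with_headroom:.0f} TB with headroom")
```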
However, as soon as we begin to talk about a SAN, we have to immediately recognize that we are going to be looking at a very different disk configuration, and likely a far different IOP number than the aggregated number from all the disks in the individual servers, for example the 100 servers that we are trying to consolidate. For the sake of argument, assume that we have 6 ~100GB drives in each of the 100 servers for which we are consolidating storage. Let's also assume that each physical disk is an industry-standard 10K SAS drive, capable of around 140 IOPs. Most servers are built with some level of RAID, so let's assume here that with a RAID controller these 6 drives per server are capable of ~500 IOPs. This is a conservative estimate, but it is purely to illustrate the subject matter; accurate numbers are not important. If we compute the total IOPs capability of the 100 servers, we come up with 100 * 500 (IOPs) = 50,000 (IOPs). This is an impressive number, but what does it mean? It tells us that at any given moment we may be doing as many as 50,000 IOPs if we look at the 100 servers running as one. The challenges with this picture are: a) do we know that any of the 100 servers ever actually hits the 500 IOPs number, and b) if they do, is there ever a chance that all servers will hit anywhere near 500 IOPs at the same time? These may sound like straightforward questions, but they are absolutely critical to the design choices we make as we build our SAN. Obviously, our worst-case scenario here is that we may need 50,000 IOPs at some point from the SAN. We should always build conservatively, aiming to meet our performance goals under worst-case scenarios, and ideally accounting for growth of both capacity and IOPs.
So how does this scenario translate to a ZFS-based storage solution like Nexentastor? Unfortunately, without a real in-depth analysis of this scenario several factors will remain vague, but at a high level the following is important to consider. Recall our mention of dynamic striping and the concept of a VDEV, which is a virtual representation of a group of physical disks in a stripe; this is roughly the same as a RAID group in a traditional RAID array. To simplify things, let's accept that as long as we are talking about the same 10K SAS disks, each VDEV is roughly equivalent in its IOPs capability to a single physical disk. For the moment, let's forget about the read and write caching of ZFS. If we take the average of 140 IOPs per VDEV, we quickly realize that we will need as many as 358 VDEVs to satisfy our hypothetical 50,000 IOPs requirement. If we take a typical Raidz configuration of 6 disks per VDEV, we are all of a sudden looking at 2,148 disks. Again, the takeaway here is that IOPs are critical to consider, because our requirement of 60TB could easily be achieved with far, far fewer drives. The number of disks is perhaps unrealistic here, but not the IOP requirement, which is where caching is effectively a buffering factor, allowing us to achieve the IOP goal within a reasonable latency. More on the application of caching in the following section.
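The same consolidation example carried through from IOPs to top-level VDEVs and raw disks, ignoring caching entirely:

```python
# The IOPs side of the consolidation example: how a 50,000 IOPs worst case
# translates into top-level VDEVs and raw disks if we ignore caching.
import math

servers         = 100
iops_per_server = 500          # assumed per-server worst case from the text
iops_per_vdev   = 140          # one 10K SAS disk's worth per top-level VDEV
disks_per_vdev  = 6            # the typical raidz group used in the example

required_iops = servers * iops_per_server                  # 50,000
vdevs_needed  = math.ceil(required_iops / iops_per_vdev)   # 358
disks_needed  = vdevs_needed * disks_per_vdev              # 2,148

print(f"required IOPs: {required_iops}")
print(f"top-level VDEVs needed (no caching): {vdevs_needed}")
print(f"raw disks at {disks_per_vdev} per VDEV: {disks_needed}")
# The 60TB capacity target could be met with a fraction of these disks; IOPs,
# not capacity, drive the spindle count here, which is where caching helps.
```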
The more we know about how our applications work and the relationships between applications, the easier it will be for us to design, from the ground up, an appropriate and scalable solution. Every environment is different, and in reality your systems may only be utilizing a small percentage of their total IOPs potential, only hitting their maximum IOPs capability 5% of the time. Knowing this is essential and its importance cannot be overstated. It is also not likely that all applications will hit their maximum IOPs potential at the same time, and in most environments it is typical to see a distribution of when different applications show maximum demand. Organizations with fairly defined business hours will typically show their highest IOPs during those hours. This may not be true in your organization, and you need to know this.
A number of times we have referred to caching, because it is a key differentiator between ZFS-based solutions and other traditional software RAID or hardware RAID solutions. Earlier we talked about an environment that could potentially have a maximum requirement of 50,000 IOPs, based on a back-of-the-napkin calculation. With disks alone, achieving this requirement may not be possible in our SAN solution. One answer to this is a robust caching mechanism that helps with caching both reads and writes. As has been pointed out already, ZFS aggressively caches recent data and frequently used data in one of two levels of cache. As with other traditional filesystems, RAM is used to cache both reads and writes. The difference is that with ZFS we treat almost ALL available RAM as cache, and because RAM is higher bandwidth and lower latency than SSDs or spinning disks, we try to consume almost the entire available RAM into what we call the ARC (Adaptive Replacement Cache). Expensive RAID controllers with their limited caches simply cannot compete with a modern x86 system equipped with perhaps 96GB of RAM or more. ZFS caches everything in the ARC; every bit of data that we read or write will pass through it. For every I/O that we can service out of ARC/L2ARC, we reduce the work on disks and improve the performance of I/O that does require access to disks. One key element to appreciate here is that matching our working set size (WSS) to the size of ARC/L2ARC will greatly improve performance. If our workload is highly read-biased and reading from ARC, we could easily achieve 150,000 IOPs, but as is most often the case, we will service part of the workload out of ARC, part out of L2ARC and part from disks. Generally, the larger the difference between the working set size and the size of ARC+L2ARC, the less effective the cache is going to be, because we will end up caching more data that is evicted from cache without ever being hit again.
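A simple way to see why the cache hit ratio dominates observed latency is to blend the service times of the three tiers; the hit ratios and latencies below are assumed, illustrative values only:

```python
# Why cache hit ratio matters so much: a back-of-the-napkin blend of service
# times. All latencies and hit ratios below are assumed, illustrative values.
tiers = {
    #             hit ratio, service time (microseconds)
    "ARC (RAM)":   (0.70,      10),
    "L2ARC (SSD)": (0.20,     300),
    "disk":        (0.10,    8000),
}

effective_us = sum(ratio * latency for ratio, latency in tiers.values())
print(f"blended read latency: {effective_us:.0f} us")          # ~867 us

# Shrink the ARC hit ratio (working set much larger than cache) and the
# blended latency degrades toward pure disk latency:
tiers_small_cache = {"ARC (RAM)": (0.30, 10), "L2ARC (SSD)": (0.20, 300), "disk": (0.50, 8000)}
print(f"with a cold/undersized cache: "
      f"{sum(r * l for r, l in tiers_small_cache.values()):.0f} us")   # ~4063 us
```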
Determining the appropriate amount of L2ARC (Level 2 Read Cache) is not simple, but there are some basic generalizations, such as: adding more L2ARC requires more metadata space in the ARC, which means more RAM, because we store references to L2ARC blocks in RAM. The amount of RAM again depends on the size of the system and, most critically, the total number of blocks allocated. It is always far easier to manage L2ARC on systems that primarily use a larger blocksize, but the reality is never this simple. It is also important to bear in mind that there is really only one ARC, no matter how many pools are running on a system, one or one hundred. Starting out with seemingly more RAM than we think is needed is not actually a bad idea. Some tuning is at times necessary to make sure that large RAM sizes are used correctly and efficiently, but tuning is far easier than not having enough RAM to cache a sufficient percentage of our working set. It is not common to see an ARC that is too large for the environment; in fact, quite the opposite. Of course, system sizes vary, but it is not unreasonable to start out with perhaps 96GB of RAM on a medium-sized system. Because the ARC is used both to store data to be written to the pool(s) and to cache recently or frequently accessed data, it is a highly dynamic and sought-after resource. A large ARC means more room to store blocks that would otherwise be read from L2ARC or disks. A working set size that is not significantly larger than the size of the cache means better use of the cache, resulting in a high percentage of hits against ARC or L2ARC, which in turn means extremely low-latency IOPs that required no physical access to the disks, again allowing operations that do require real IOPs to disks to be completed in less time. Because the ARC is higher bandwidth and lower latency than the L2ARC, we always strive for the majority of cache hits to come from the ARC, with the L2ARC being secondary, and of course spinning disks being the least desired option.
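As a hedged illustration of the RAM cost of L2ARC headers, the sketch below assumes a few hundred bytes of ARC metadata per cached block; the real per-block overhead depends on the ZFS version, but the relationship between L2ARC size, block size and RAM is the point:

```python
# Rough estimate of the ARC metadata (RAM) consumed by L2ARC headers. The
# bytes-per-block figure is an assumption; more L2ARC and smaller blocks both
# mean more RAM spent just tracking what is in the L2ARC.
def l2arc_header_ram_gb(l2arc_gb, avg_block_kb, header_bytes=200):
    blocks = l2arc_gb * 1024 * 1024 / avg_block_kb
    return blocks * header_bytes / (1024 ** 3)

for avg_block_kb in (8, 32, 128):
    ram = l2arc_header_ram_gb(l2arc_gb=600, avg_block_kb=avg_block_kb)
    print(f"600 GB of L2ARC at {avg_block_kb:>3} KB average blocks -> ~{ram:.1f} GB of ARC headers")
```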
There are a number of considerations to make about L2ARC, and sizing as well as the number of actual devices are two critical parts. A generally accepted best practice for L2ARC is to deploy more than one device per pool. This is first and foremost an availability consideration. SSDs are continually growing in capacity, and we are beginning to see nearly 1TB SSDs. However, having one large SSD means there is a single point of failure. Good systems engineering requires understanding failure and minimizing it through various means. L2ARC is not meant to be redundant, since the data in it is by nature volatile and effectively throw-away. However, losing an L2ARC device which may contain, for example, 600GB of cached data could be a severely crippling blow to a system that is highly dependent on a high cache hit ratio. On the other hand, having two smaller SSDs means 50% of the cache remains if one is lost. In the case of a system with two 300GB SSDs, the loss of one means 300GB of cache is still active. Split two into four, and now the loss of one means only a 25% impact to performance. The takeaway here is that we should always think about where the single points of failure are in the design of the solution, and attempt to eliminate them, or reduce them as much as is reasonable. In addition to reducing the risk of losing 100% of L2ARC with a single device, with an increased number of devices we are also increasing total bandwidth and IOP potential. L2ARC devices are effectively pooled, with each device acting as its own stripe. As is the case with spinning disks in the pool, more L2ARC devices allow for more parallelized performance. More concurrent I/O will of course translate into lower latency. The more devices we use for L2ARC, the more I/O channels we expose to handle larger amounts of I/O from consumers in any given period of time.
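The availability argument for splitting L2ARC across several devices reduces to simple arithmetic:

```python
# Failure impact of splitting the same amount of L2ARC across more devices:
# losing one device costs 1/N of the cache (plus one device's bandwidth).
total_l2arc_gb = 600

for devices in (1, 2, 4):
    per_device   = total_l2arc_gb / devices
    lost_on_fail = per_device / total_l2arc_gb
    print(f"{devices} x {per_device:.0f} GB devices -> "
          f"losing one removes {lost_on_fail:.0%} of the cache")
```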
Writes are typically grouped into two categories: synchronous and asynchronous. Asynchronous writes are normally buffered in the ARC, reorganized into larger sequences and, at regular intervals, flushed to disks in a sequential manner, maximizing throughput and reducing latency. The write cache, also referred to as the ZFS Intent Log (ZIL) when talking about dedicated devices, is used to absorb synchronous writes, which would normally require immediate I/O to disk to satisfy the requirements of applications that force direct I/O to ensure data consistency on disk. Forced flushes to disks from direct I/O will necessarily bust caches; the resulting too-frequent flushes diminish ZFS' ability to coalesce I/O and stream it to disks at longer, more regular intervals.
The result of having dedicated Intent Log (ZIL) devices is an increased ability to treat synchronous I/O like asynchronous I/O, which translates into much better coalescing of all I/O and better sequential streaming to disks. One thing that disks do extremely well is sequential I/O. ZFS is able to achieve this when it builds up a substantial enough buffer of coalesced I/Os and flushes that buffer to disks in one transaction. Not all environments will benefit from dedicated ZIL devices equally. In most cases, however, having a dedicated ZIL results in a very substantial performance improvement, from a significant reduction of busted caches and longer periods of time between flushes of buffers to disks. Depending upon the size and scale of the SAN being configured, multiple ZIL devices, striped and mirrored, may in fact be necessary. Dedicated ZIL devices can be added or removed on demand, making it easier to grow the system without any significant reconfiguration.
It is critical to understand that the operation of the ZIL is quite unique in that data is continuously written to the device and almost never read. The majority of solid state disks today experience wear of the cells in their memory modules from prolonged write cycles. To combat this, manufacturers introduce wear-leveling mechanisms which, while effective, may not be effective enough when the SSD is used as a dedicated ZIL device. There are certain devices which are not affected by wear due to their design, and they are recommended as the first choice for the ZIL. Another critical factor which we cannot overlook is the fact that the ZIL has to be extremely low-latency and high-IOP capable. Under no circumstances should we ever consider using conventional spinning disks for a dedicated ZIL. Doing so could in fact result in performance being worse than operating without a dedicated ZIL.
Yet another important consideration is the choice between one pool or multiple pools. There is no single correct answer, and the best answer is: "it depends". There are multiple reasons to consider having more than one pool, some of which we will discuss here. Size and scale are of course two key elements that will inevitably drive our decision. If we are planning to build a highly scalable solution, one we know will grow with time, we may choose to start with a single pool and, instead of growing that pool by adding more devices later, decide to build another pool instead. Why? Perhaps one of the concerns is elimination of a single point of failure. Even ZFS is not perfect, and pool corruption is still a possibility that we should consider when designing a highly redundant solution. Having one very large pool also means that, even with backups or a replicated configuration, it may take more time to recover than will be found acceptable. Protecting ourselves from a single point of failure is a very good reason to opt for building out pools instead of expanding a single large pool. Building out pools also means that we will need to define a flexible hardware strategy which allows us to add disks or entire JBODs (storage enclosures) to the SAN controller nodes in a way that will not cause I/O bottlenecks in the existing I/O channels and will not constrain us by locking us into a pool design which may not meet all of our needs.
On the other hand we may decide to build in tiers. Most environments will have data and applications that will fit into one of a few performance tiers. For example, we may want to build a very high-performance pool, perhaps utilizing only SSDs to host a large virtual environment, with thousands of VMs that all require a large number of very-low latency highly-random IOPs which may not be achievable with disks alone while cache is cold, and some subset of latency-critical data that this virtual environment uses regularly. Most data, as it ages, will see fewer and fewer accesses, resulting in continued decrease of its importance. As this happens, we may choose to move this data to slower, lower cost storage, and perhaps keep it there until it is completely useless to us, which in some cases may be never. Tiering of data is a good way of reducing costs by having a smaller very high-performance pool and larger mid-range pool(s) for aging data, with perhaps a third tier using very high-capacity disks for long-term or archival storage of data with few requirements for latency.
This could be expanded to address the needs of a large number of different environments. It is quite likely that a tiered approach will prove much less costly in the long term and will allow for more flexibility in terms of directing the most IOPs to the applications and data that need them most. Scaling is easier this way too, because scaling may at times mean more IOPs or much more capacity. It may not always make sense to grow a pool that uses 2TB drives in order to achieve higher IOPs, because the cost of the unneeded capacity proves to be a wasted expense, whereas growing a small all-SSD pool, increasing IOPs significantly with only a small amount of wasted capacity, may be far more appropriate. The point here is that use cases vary, and we should not lock ourselves into a monolithic design with a single pool that is expected to do everything.
Of course, there are extremes in both cases. Building lots of small pools really diminishes performance, and the aggregate performance of several small pools may prove to be far lower than that of a single large pool consisting of the same number of devices. Caching, both L2ARC and write cache, has to be dedicated to a pool, so having several pools means we have to have more SSDs if we need caching in each of the pools.
Because designs will vary based on the requirements of the environment, it is difficult to say that a particular pool design is superior to others, but because of various factors, such as preferred stripe size, time to resilver replacement disks, the value proposition of dynamic striping, etc., there are certain best practices in pool design. Since this is a high-level document and not an in-depth guide to pool design, we will quickly highlight things not to do.
We should avoid the following when building pools (a simple sanity-check sketch follows the list):
- Building a pool without selecting any level of redundancy, instead simply opting for each disk being a top-level vdev, where a failure of any one disk in the stripe means loss of the pool.
- Building a Raidz pool with fewer than 5 disks per vdev, or a Raidz2 pool with fewer than 6 disks per vdev.
- Building a Raidz pool with more than 9 disks per vdev, or a Raidz2 pool with more than 10 disks per vdev.
- Opting for uneven capacity or physical size of vdevs in the pool, for example mixing 500GB disks and 2TB disks, whether in the same vdev or not.
- Mixing slower and faster disks in the same vdev or pool, for example mixing 15K and 7.2K drives, resulting in odd asymmetric performance within vdevs, or between slower and faster vdevs.
- Mixing redundancy levels inside the pool, for example 1 vdev with 5 disks in Raidz and 1 vdev with 6 disks in Raidz2, or adding mirrored vdevs to a RaidzX pool to improve performance.
- Configuring pools with too few top-level vdevs, regardless of the number of disks in each vdev, resulting in a highly IO-bound pool with poor write characteristics.
- Choosing to use spinning disks, regardless of their RPM rating, for dedicated ZIL and L2ARC cache devices.
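The sketch below expresses the guidelines above as simple checks. The vdev description format is hypothetical and exists only for this illustration; it is not a Nexentastor or ZFS API:

```python
# A sketch of the guidelines above expressed as simple checks. The vdev model
# (a list of dicts) is hypothetical and only for illustration.
def check_pool_layout(vdevs):
    """vdevs: list of {'type': 'mirror'|'raidz'|'raidz2'|'disk',
                       'disks': int, 'disk_tb': float, 'rpm': int}"""
    warnings = []
    if any(v["type"] == "disk" for v in vdevs):
        warnings.append("non-redundant top-level vdev: one disk failure loses the pool")
    for v in vdevs:
        if v["type"] == "raidz" and not (5 <= v["disks"] <= 9):
            warnings.append(f"raidz vdev with {v['disks']} disks (guideline: 5-9)")
        if v["type"] == "raidz2" and not (6 <= v["disks"] <= 10):
            warnings.append(f"raidz2 vdev with {v['disks']} disks (guideline: 6-10)")
    if len({v["disk_tb"] for v in vdevs}) > 1:
        warnings.append("mixed disk capacities across vdevs")
    if len({v["rpm"] for v in vdevs}) > 1:
        warnings.append("mixed disk speeds (e.g. 15K and 7.2K) in the same pool")
    if len({v["type"] for v in vdevs}) > 1:
        warnings.append("mixed redundancy levels in the same pool")
    if len(vdevs) < 3:   # arbitrary threshold; "too few" really depends on the IOPs target
        warnings.append("very few top-level vdevs: expect poor random-write performance")
    return warnings

layout = [{"type": "raidz2", "disks": 8, "disk_tb": 2.0, "rpm": 7200}] * 2
print("\n".join(check_pool_layout(layout)) or "no obvious layout problems")
```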
It cannot be stressed enough how critical it is to have a properly configured pool: one where performance is fairly even between vdevs, with enough top-level vdevs capable of the IOPs necessary to meet the needs of the environment, and with a good balance between capacity, redundancy and performance. One of the current constraints of ZFS, which may be addressed in the future, is the inability to reconfigure a pool in any way other than adding new vdevs. Today, we cannot adjust the pool by shrinking or growing the number of disks in vdevs, or rebalance vdevs by replacing larger disks with smaller disks in pools where IOP performance is the primary goal. Effectively, we can only add vdevs, and increase overall capacity (expand) by replacing smaller-capacity disks with larger-capacity disks. We cannot change the RaidzX level of the vdevs in the pool once it has been configured. Because of all these constraints, we have to see pool design as the cornerstone of a ZFS-based SAN solution such as Nexentastor.
Programmed into ZFS is a robust mechanism for data replication. Unlike with most filesystems, replication in ZFS was a core requirement, and as the filesystem was designed, all aspects required for robust replication were built in. There are many reasons to replicate data, and use cases vary. We cannot begin to imagine all of the possible use cases, but there are a few very common reasons, such as fault tolerance and data protection.
The one obvious weakness in the design of ZFS is the non-distributed model of the pool, which ultimately is a single point of failure. Corruption of data or metadata in the pool, loss of redundancy, or out-of-order writes all have the potential to leave a pool in a state that is beyond recovery. Another reason for making informed and well-understood design choices when building the pool is failure. We have to consider what level of availability we are seeking and include replication in this model. The higher our availability requirements, the greater the need for replication will be. One of ZFS' greatest strengths is its commitment to data integrity, which is apparent in the use of strong checksums (fletcher4 by default, with sha256 available) on every block of data. Every time a block is read, its checksum is validated to make sure the block is still as expected, and if something has gone wrong we attempt to fix that block. But even with the best of integrity mechanisms, failures happen that may be beyond the scope of software, and we need to consider this as one of the chief reasons to replicate our data. While ZFS pool recovery is possible and in fact has proven extremely successful over time, the associated costs should be considered early on, and ZFS replication, which in Nexentastor is augmented further with a flexible proprietary transport mechanism as well as an ssh-based mechanism, should be part of any production system design.
It is important to mention that replication with ZFS is asynchronous, meaning our replica will lag behind the source. In most environments this is acceptable. The Recovery Point Objective is important to consider: the shorter it is, the more frequently replication should run. Nexenta's Autosync replication framework is designed to be flexible enough to allow for frequencies measured in minutes, hours, or days. The replication target has to be another Nexentastor system, but it does not have to have the same pool structure or pool configuration. For example, your production environment may have high-IOP and low-latency requirements, and its pool may consist of a large number of smaller-capacity drives, perhaps 600GB 15K drives in mirrored pairs, while your target's pool may be a raidz2 configuration with 12-disk vdevs utilizing 2TB 7.2K drives, resulting in a lower-performing pool with much higher latency on disks. But because this system is our replication target, we likely will not care, simply because there are no applications and consumers that depend on this storage. It is quite reasonable to build our replication target with far greater capacity and configure multiple Nexentastor systems with the same replication target.
Replication is achieved via ZFS' native snapshot functionality. A snapshot is created for each dataset, which is then sent via the network to the target system as a stream of bytes. It is possible to replicate on the same system by simply building another pool specifically for replication. However, having this second pool on the same system means the system itself remains a single point of failure, and this should be considered very carefully. It is a good practice to have another physical system, ideally not located in close proximity to your source. Obviously, every environment is unique and the design will have to suit the environment. It is not uncommon to replicate between datacenters, and in fact Nexentastor's proprietary replication mechanism allows a multi-destination replication scheme. Multi-destination allows for an even better data protection strategy, because multiple targets will contain copies of the data from one or multiple source systems.
Replication is an all-or-nothing (atomic) operation. This guarantees the integrity of the information. We have to achieve a complete transfer of all blocks in a snapshot, which means we are always consistent on the target as of the latest successfully transferred snapshot from the source. Snapshots are essentially deltas between the previous state and the state of the dataset when the snapshot is captured. ZFS is built in such a way that creation of snapshots is normally a very low-impact operation; there are caveats to this, but on a properly scaled and architected system they will be low-impact. A more frequent replication schedule will result in more snapshots being generated, but the more frequently we replicate, the more of the changed blocks will still be in ARC or L2ARC, resulting in low-cost reads and only marginal impact to pool performance.
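For illustration only, the snippet below shows the underlying snapshot-plus-incremental-send idea using the standard zfs send and zfs receive commands; it is not Autosync, and the dataset names, host name and snapshot labels are placeholders:

```python
# Conceptual illustration only: not Autosync, just the snapshot + incremental
# send idea expressed with the standard zfs commands. Dataset names, hostname
# and snapshot labels are placeholders.
import subprocess
from datetime import datetime, timezone

SOURCE_DS = "tank/vmstore"     # hypothetical source dataset
TARGET    = "backup-host"      # hypothetical replication target (reached via ssh)
TARGET_DS = "backup/vmstore"

def replicate(previous_snap=None):
    snap = f"{SOURCE_DS}@repl-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    # Full send the first time, incremental (-i) against the last common snapshot after that.
    send_cmd = ["zfs", "send", snap] if previous_snap is None \
               else ["zfs", "send", "-i", previous_snap, snap]
    recv_cmd = ["ssh", TARGET, "zfs", "receive", "-F", TARGET_DS]

    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(recv_cmd, stdin=send.stdout, check=True)
    send.stdout.close()
    if send.wait() != 0:
        raise RuntimeError("zfs send failed")
    return snap   # becomes the base for the next incremental run
```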
It is important to consider, however, that additional IOPs will be required to achieve our replication objective, and these should be included in our calculation of IOP requirements. The rate of change will of course be a profound factor in this calculation. Some systems may be highly read-biased, in which case deltas between snapshots may be quite small, resulting in replication having a negligible effect on overall pool performance. The other extreme could be a heavily write-biased system where deltas end up being quite substantial, and replication operations result in observable impact to overall pool performance. Having a properly sized ARC and L2ARC is critical in this case, because we want to avoid any unnecessary reads from disks during replication events. One of the benefits of an asynchronous replication mechanism is the ability to schedule replication events during off-peak hours. Most environments go through a few peak/valley cycles every 24 hours, and replication can be scheduled to occur during the low-volume periods, thus maximizing and balancing out available resources. When replicating from multiple sources to the same target, it is a good idea to build replication schedules that maximize these low-volume windows, while making sure that the target is not flooded with several streams of data from all source systems at the same time. Time and a trial-and-error approach will be needed to come up with a solid, workable design. It is worth mentioning that having a single target for several systems may not be a good enough design, because failure of the target will mean multiple source systems no longer have a recovery strategy. It is good design practice to scale the number of targets with the number of source systems. Again, the requirements of the environment will drive the decision.
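A quick, illustrative look at how rate of change and replication scheduling interact; all figures below are assumptions:

```python
# How rate of change and replication interval interact: illustrative numbers only.
daily_change_gb        = 200    # assumed data modified per day on the source
replication_interval_h = 1      # how often Autosync-style replication runs
offpeak_window_h       = 6      # assumed low-volume window available per day

delta_per_run_gb = daily_change_gb * replication_interval_h / 24
print(f"average delta per replication run: ~{delta_per_run_gb:.1f} GB")

# If instead we batch the whole day's delta into the off-peak window:
required_mb_per_s = daily_change_gb * 1024 / (offpeak_window_h * 3600)
print(f"sustained transfer rate to drain a day's delta in {offpeak_window_h}h: "
      f"~{required_mb_per_s:.1f} MB/s")
```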
There is a lot of flexibility in Autosync in selecting what to actually replicate. We can choose to replicate an entire pool, or only specific datasets. It is also possible to have a varying snapshot retention schedule. Remember, the space consumed by snapshots grows over time, which means additional space in the pool will be required to store them, both on the source system and on the target. This is another necessary consideration. The more snapshots we want to retain, and the longer we want that retention period to be, the more space in the pool will be consumed. The rate of change is a critical factor, and systems that are write-biased will need more space in the pool dedicated to snapshots. There are a number of controls that can be implemented in order to manage snapshot creation and the ratio of data in snapshots to live data. That discussion is out of scope for this document.
While this is a rather lengthy document, it only addresses topics at a superficial level. More research should be done on each important topic addressed in this document. The key things to take away are:
- ZFS is a filesystem strongly biased towards caching in system memory, and if permitted to it will use nearly all available system memory, resulting in a potentially extremely large primary cache.
- The level 2 cache (L2ARC) is an extension of the ARC. It is optional, but typically recommended, especially when environments are read-biased. High-performance SSDs are critical to achieving consistent and reliable L2ARC performance.
- Understanding how ZFS stripes data, following the outlined best practices and avoiding the outlined worst practices is really important if we hope to achieve the best possible performance.
- Reconfiguration of a ZFS pool is extremely constrained today: it mostly allows growing the size and number of vdevs, but does not allow for correcting design mistakes introduced at inception. As such, it is paramount to build a pool correctly from the start to assure proper growth and scaling.
- Observing and learning the behavior of the environment for which a solution is being configured is critical, because many facets of ZFS are driven by the environment and what is operating in it.
- Scaling up or scaling out in terms of pool design is an important consideration and should be decided based on an understanding of the environment, where the environment is headed in terms of projected growth and expansion, and the types of workload in the environment.