Making storage highly redundant has arguably been the single largest challenge for storage engineers for years. Keeping data in multiple places so that it survives unscathed when something goes wrong in the environment can cause a lot of heartache. Initially, simple RAID was enough, but then people wanted copies of their data in multiple places. Storage arrays added this functionality, but that meant a similar SAN often had to exist at each site, and it would often cost more than twice as much as a single copy. DFS added a way of doing this, but it came with its own problems. Finally, there were third-party programs such as Double-Take that would replicate on top of the OS for you, but they were costly and required extra software.
With Windows Server 2016, Microsoft has built storage replication into the OS, at the block level, per volume. On top of that, it is completely storage agnostic and runs over SMB3 with no special transport needs. Data can be replicated between the two locations in one of two ways: synchronously or asynchronously.
Synchronous replication keeps the two pieces of storage in lockstep with each other. Writes are stored locally in a log volume until they reach the destination side and are then committed to the storage on both sides. The write isn’t acknowledged to the host until the data is on both servers, which means that the link between the servers needs to be both high bandwidth and low latency. The official word from Microsoft is a round-trip latency of less than five milliseconds and a network of at least one gigabit, but the real requirement depends heavily on the workload you are trying to push between the two systems. The key, however, is that if a site suddenly disappears, no acknowledged write is lost, since it is still at the other location.
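The write path described above can be sketched as a toy model in Python. This is only an illustration of the ordering (log locally, ship, commit on both sides, then acknowledge), not how the feature is implemented; the names `Site`, `sync_write`, and `max_serial_writes_per_second` are made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """One side of the replication partnership (toy model, hypothetical names)."""
    name: str
    log: list = field(default_factory=list)    # stands in for the log volume
    data: dict = field(default_factory=dict)   # stands in for the data volume

    def log_write(self, block: int, payload: bytes) -> None:
        self.log.append((block, payload))

    def commit_log(self) -> None:
        # Apply logged writes to the data volume in order, then clear the log.
        for block, payload in self.log:
            self.data[block] = payload
        self.log.clear()


def sync_write(source: Site, destination: Site, block: int, payload: bytes) -> bool:
    """Acknowledge a write only after both sites hold it."""
    source.log_write(block, payload)        # 1. land in the local log
    destination.log_write(block, payload)   # 2. ship to the remote side (one round trip)
    source.commit_log()                     # 3. commit on both sides
    destination.commit_log()
    return True                             # 4. only now acknowledge to the host


def max_serial_writes_per_second(round_trip_ms: float) -> float:
    """Upper bound for one writer that waits for each ack before its next write."""
    return 1000.0 / round_trip_ms


site_a = Site("A")
site_b = Site("B")
acked = [block for block in range(3) if sync_write(site_a, site_b, block, b"x")]

# If site A disappears now, every acknowledged block is already on site B.
print(all(block in site_b.data for block in acked))
print(max_serial_writes_per_second(5.0))
```

The last line shows why latency matters so much here: assuming a single writer that waits on every acknowledgement and ignoring local disk time, a 5 ms round trip caps it at 200 writes per second, however much bandwidth the link has.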
Asynchronous replication keeps the two copies of the data in sync as closely as the link allows. It still isn’t designed for extremely low-bandwidth links. The process is basically the same as above, except that the write acknowledgement isn’t held until the second side responds: the write is acknowledged to the host immediately and then sent to the second side as soon as possible. The trade-off is that a write acknowledged just before a site failure may not have reached the other side yet. This mode is mostly designed for very high IO, for workloads that need low-latency access, or for slower connectivity between the two servers.
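A similar toy model shows what changes in the asynchronous case. Again, this is a sketch under simplifying assumptions rather than the real implementation; `AsyncReplicator`, `ship`, and `lost_if_source_fails` are hypothetical names chosen for illustration.

```python
from collections import deque


class AsyncReplicator:
    """Toy model: acknowledge locally, ship to the destination later."""

    def __init__(self) -> None:
        self.source = {}          # blocks held at the source site
        self.destination = {}     # blocks held at the destination site
        self.pending = deque()    # writes acknowledged but not yet shipped

    def write(self, block: int, payload: bytes) -> bool:
        self.source[block] = payload
        self.pending.append((block, payload))
        return True               # acknowledged without waiting for the remote side

    def ship(self, max_writes: int) -> int:
        """Send up to max_writes queued writes, as the link allows."""
        sent = 0
        while self.pending and sent < max_writes:
            block, payload = self.pending.popleft()
            self.destination[block] = payload
            sent += 1
        return sent

    def lost_if_source_fails(self) -> list:
        """Acknowledged blocks that exist only at the source right now."""
        return [block for block, _ in self.pending]


replica = AsyncReplicator()
for block in range(5):
    replica.write(block, b"x")

replica.ship(max_writes=3)             # the link only kept up with three writes
print(replica.lost_if_source_fails())  # blocks still waiting to be shipped
```

In this run, blocks 3 and 4 were acknowledged to the host but had not yet crossed the link, so they are the ones that would be missing at the destination if the source site failed at that moment. The host never waits on the link, which is what makes this mode attractive when the connection is slow or the workload is latency sensitive.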
This new feature can provide major benefits to a business’s redundancy and overall survivability. There are, of course, some caveats. Depending on the replication topology, the destination volume could be offline while the data is being replicated, and it will have to be brought online manually when the data is needed. It cannot be brought online as read-only on the destination side, either. If you’re running a stretch cluster with storage replication, though, failover can be coordinated well enough that the impact is nearly seamless. The log volumes mentioned earlier need to be on fast storage to keep up with the rate of change they receive, and replication can use a large amount of bandwidth between the two nodes.
You can also watch my webinar touching on a bunch of the new features.