Here are some quick questions (with answers) about EMC’s RecoverPoint that I get from time to time. Perhaps this will help put you in the know, as well as arm you with the right questions to ask when considering this product for your datacenter. Any discrepancies? Please feel free to politely comment.
What is RecoverPoint?
RecoverPoint is a continuous backup solution offered by EMC, capable of providing asynchronous and synchronous replication across heterogeneous arrays. As of today it supports both block-based storage protocols, Fibre Channel and iSCSI. Asynchronous replication takes place over standard IP, and synchronous replication over Fibre Channel. RecoverPoint handles all FC-to-IP conversions for asynchronous replication.
What does continuous backup mean?
Simple: every write is captured (split, or copied) and replicated in real time, synchronously or asynchronously depending on your requirements. In MOST situations every write is treated as a snapshot. These fine-grained recovery points allow you to roll back to any point in time. This is important, as you don’t have to worry about rolling corruption, as is the case with most mirror-based solutions. Continuous backup is the best of both worlds, folks: the ability to meet an RPO of zero as well as return to any point in time for recovery. This is the now and future of backup. I am with Mr. Preston on this: “..Things have got to change, people. We can't keep doing things the way we've been doing them..” <-Amen brother Curtis, Amen..
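To make the "any point in time" idea concrete, here is a minimal sketch in Python. It is a toy model and not how RecoverPoint is implemented: the `WriteJournal` class, its methods, and the block/sequence-number layout are all made up for illustration. It only shows why keeping every write lets you rebuild the volume as it looked just before a bad write landed.

```python
from dataclasses import dataclass, field


@dataclass
class WriteJournal:
    """Toy journal (hypothetical): every write is kept with a sequence number."""
    entries: list = field(default_factory=list)  # (seq, block, data) tuples

    def record(self, block: int, data: bytes) -> int:
        """Capture one write and return its sequence number."""
        seq = len(self.entries) + 1
        self.entries.append((seq, block, data))
        return seq

    def image_at(self, seq: int) -> dict:
        """Rebuild the volume image as it looked right after write `seq`."""
        image = {}
        for entry_seq, block, data in self.entries:
            if entry_seq > seq:
                break  # ignore everything written after the chosen point
            image[block] = data
        return image


journal = WriteJournal()
journal.record(0, b"good-v1")
last_good = journal.record(1, b"good-v2")
journal.record(0, b"CORRUPT")  # the corruption a plain mirror would also copy

# Roll back to the moment just before the corrupting write.
recovered = journal.image_at(last_good)
print(recovered)
```

A mirror only ever holds the latest state, so the corrupt block would have overwritten the good one. Here the earlier state is still recoverable because the history of writes is retained.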
How are my writes replicated?
Splitters make a copy of each write I/O and send it to the RecoverPoint appliances (RPAs) for local or remote replication. To split these writes you need a write splitter. Remember, this isn’t complicated. Let’s keep it simple…
There are three types of splitters..
- Host splitter-code that is installed on the host itself
- Fabric splitter-code that is installed within your FC fabric switches (Brocade and Cisco)
- Array splitter-code that is installed on your array (Clariion Only)
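Wherever the splitter lives, the idea is the same: the write goes to the production LUN as usual and a copy is handed to the RPAs. A rough Python sketch of that idea, with hypothetical names (`Splitter`, `production_lun`, `rpa_queue`) and none of the real ordering, acknowledgment, or error handling, might look like this:

```python
class Splitter:
    """Toy write splitter (hypothetical): one write in, two destinations out."""

    def __init__(self, production: dict, appliance_queue: list):
        self.production = production            # stands in for the production LUN
        self.appliance_queue = appliance_queue  # stands in for the path to the RPAs

    def write(self, block: int, data: bytes) -> None:
        # Hand a copy of the write to the appliance for replication...
        self.appliance_queue.append((block, data))
        # ...and let the original write continue to production storage.
        self.production[block] = data


production_lun = {}
rpa_queue = []
splitter = Splitter(production_lun, rpa_queue)
splitter.write(7, b"payload")

print(production_lun)
print(rpa_queue)
```

The host, fabric, and array splitters differ in where this copy step runs, not in what it does.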
The most widely used in my travels is the array or Clariion splitter. The assumption is..you guessed it..you have a Clariion. There are two components that are needed for the Splitting capabilities on the Clariion. They are..
- The Splitter engine (or driver, as it’s called) and the enabler package. As of today all current FLARE bundles include the Splitter driver/engine, so you just need to worry about enabling the splitter (.ena file). This is an NDU (non-disruptive upgrade) requiring a staggered reboot of your SPs. Take advantage of the Navisphere Service Taskbar for this..
- For CX3 arrays you will need to be at FLARE release 26 patch .029, .031 + RP splitter driver 03.26.003.6.012 for RP3.3 support
- For CX4 arrays you will need to be at FLARE release 29 patch .006 + RP splitter driver 04.29.006.6.003.
Opinion time: It doesn’t get much easier or less error prone than the Array (Clariion) Splitter. The Fabric Splitter is complicated, plain and simple, BUT it is capable of replicating your entire environment whether you are an EMC shop, an HP shop, an IBM shop, etc., all from a single pane of glass. So you will need to weigh both sides. And that’s right, I didn’t explicitly say it, but you don’t have to be an EMC shop to use RecoverPoint. In most cases replacing all your third-party arrays with Clariions just won’t cut the mustard. So make sure up front that the splitter you choose is appropriate for your situation. Check out this post (under Modus Operandi) for a snapshot of what’s involved with the Cisco Fabric Splitter (SANTap), now close your eyes and think of a happy place. No really, it’s not that bad, just make sure your chosen partner understands the pitfalls of said solution.
Is RecoverPoint just software or hardware as well?
RecoverPoint is intelligent software that runs on commodity hardware from Dell. The currently shipping hardware is the R610. This hardware operates as an appliance. Each site is capable of supporting 2-8 appliances in a single configuration. The appliances themselves are out of the data path and do not regulate or impede data flow in asynchronous configurations. As part of your purchase you will receive the Dell appliances loaded with the RecoverPoint software. It goes without saying, but I’m saying it anyway: make sure your implementer updates the appliances with the current code, and if you’re using, say, VMware Site Recovery Manager, make sure the current RP code is supported (see VMware’s Storage Partner Compatibility Matrix).
How is RecoverPoint licensed?
RP is licensed on a per-replicated-capacity basis. There are two flavors of RP: RP Full and RP/SE. As of the most current code, SE supports up to 150TB of replicated storage and RP Full supports up to 600TB. SE is licensed in increments of 4TB, 8TB, 16TB, 24TB, and 32TB, and then on a per-TB basis up to 150TB. SE only supports the host and array-based splitters. So for heterogeneous environments you will need to purchase RP Full, as well as meet any requirements set forth by your fabric splitter or host splitter.
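If it helps to see the sizing logic spelled out, here is a back-of-the-napkin Python sketch. The function names are mine, and it assumes you round up to the next SE increment and work in whole TB. Treat it as a way to frame the conversation with your reseller, not as EMC's actual pricing rules.

```python
SE_TIERS_TB = [4, 8, 16, 24, 32]  # SE increments listed above
SE_MAX_TB = 150                   # SE ceiling
FULL_MAX_TB = 600                 # RP Full ceiling


def pick_edition(replicated_tb: int, splitter: str) -> str:
    """Pick an edition from capacity and splitter type ('host', 'array', 'fabric')."""
    if replicated_tb <= 0 or replicated_tb > FULL_MAX_TB:
        raise ValueError("replicated capacity must be between 1 and 600 TB")
    # SE only covers the host and array splitters, and only up to 150TB.
    if splitter == "fabric" or replicated_tb > SE_MAX_TB:
        return "RP Full"
    return "RP/SE"


def se_license_tb(replicated_tb: int) -> int:
    """Licensed SE capacity, ASSUMING you round up to the next increment."""
    if replicated_tb <= 0 or replicated_tb > SE_MAX_TB:
        raise ValueError("RP/SE covers 1 to 150 TB of replicated storage")
    for tier in SE_TIERS_TB:
        if replicated_tb <= tier:
            return tier
    return replicated_tb  # above 32TB, licensed per TB


print(pick_edition(20, "array"), se_license_tb(20))
print(pick_edition(20, "fabric"))
print(pick_edition(200, "array"))
```

So 20TB behind a Clariion splitter would land in the 24TB SE bucket under these assumptions, while the same 20TB behind a fabric splitter pushes you to RP Full.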
RecoverPoint Terminology (Most Common)
- RPA-RecoverPoint Appliance- this is another name for the Dell commodity hardware that runs the RP software. Each appliance has 2 copper gigabit Ethernet ports and 4 (8G) FC ports. Word on the street is these will be virtualized soon (and by soon I mean sometime between now and Dec. 2012)
- Production Volume-this is what we all know and love: our server data. This is the data that is worth protecting, worth replicating, peachy..
- Repository Volume-Contains configuration information for your RP environment. A 3GB volume with light I/O. Vault drive worthy..
- Journal volumes- these volumes are the core behind replication in RP. A journal volume is the first stop for all writes during replication, local or remote. Each journal volume has a different function depending on which side it operates on.
- Production JVOL-responsible for tracking writes during periods of WAN connectivity loss. Secondarily and only during times of “failover and replicate back”, the PJVOL becomes the RJVOL.
- Recovery JVOL-All writes being replicated hit this volume first. FIFO scenario. Every write is maintained for as long as possible based on the available capacity. The longer the Recovery Point Objective for your PVOL, the more JVOL space you will need. Look to snapshot consolidation to help.
- Replica-Directly proportional to the size of the PVOL. As all incoming writes hit the RJVOL they are immediately rolled to the replica.
- Cluster-A cluster is a band of 2 to 8 RPAs that access a single repository volume for their configuration.
- Site-Geo locale. Each site can have multiple clusters but each cluster has a single repository volume.
- System-The source and destination endpoints in a RecoverPoint environment, whether it’s one site or two, make up a system.
- Consistency Group-Logical grouping of like volumes or a single volume with the intent of maintaining write order between volumes.
- Replication Set-Defined on a per CG basis. Mapping of ProdVol to Replica. It also includes the allocation of the PJVOL and RJVOLs for the CG in question. Remember you can have multiple Replication Sets within a single Consistency Group, with the assumption that they will share the same journal volumes.
- CDP-Continuous Data Protection. No write left behind. Every write is captured and sent to RecoverPoint synchronously over Fibre Channel. Prominently known as continuous backup.
- CRR-Continuous Remote Replication. Unidirectional asynchronous replication to a remote site.
- CLR-Concurrent Local and Remote Replication. Combination of local (CDP) and remote (CRR) replication for a single production volume.
- Splitters-How do you copy a write from a production LUN in midstroke for Replication? Splitters! See above..
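The Recovery JVOL behavior described above (FIFO, every write kept for as long as capacity allows) is easier to reason about with a small model. The Python sketch below is a simplified illustration with made-up names (`RecoveryJournal`, `protection_window`). It ignores snapshot consolidation and journal overhead, and simply shows how a fixed journal size caps how far back you can roll.

```python
from collections import deque


class RecoveryJournal:
    """Toy FIFO journal (hypothetical): oldest writes drop off when space runs out."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = deque()  # (timestamp_seconds, size_bytes)

    def append(self, timestamp: float, size_bytes: int) -> None:
        self.entries.append((timestamp, size_bytes))
        self.used_bytes += size_bytes
        # Evict the oldest writes until the journal fits in its capacity again.
        while self.used_bytes > self.capacity_bytes and len(self.entries) > 1:
            _, old_size = self.entries.popleft()
            self.used_bytes -= old_size

    def protection_window(self) -> float:
        """Seconds between the oldest and newest write still in the journal."""
        if not self.entries:
            return 0.0
        return self.entries[-1][0] - self.entries[0][0]


jvol = RecoveryJournal(capacity_bytes=300)
for second in range(5):
    jvol.append(timestamp=float(second), size_bytes=100)

window = jvol.protection_window()
print(len(jvol.entries), window)
```

With the same write rate, doubling `capacity_bytes` roughly doubles the window, which is the trade-off to keep in mind when sizing journal volumes.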
Is RP an HA/Scale Out solution, and how is that accomplished?
RP itself is out of the data path, meaning the absence or loss of the appliances themselves will not cause you to lose access to your data. In synchronous situations, RP can regulate the application in an attempt to control data flow, but the appliances themselves are NOT in the data path.
Because RP runs on Dell commodity hardware, HA is not built into each individual box. Instead, HA and scale out are achieved via appliance clustering. As with typical clustering, each RecoverPoint appliance has access to a shared cluster volume known as the repository volume. It is here that metadata specific to the cluster itself is maintained. This includes system and site information, consistency group and replication set specifics, etc. If an appliance is lost, replication will pause momentarily and then fail over to an existing appliance in the cluster. This simply couldn’t happen without a shared cluster environment.
There is NO redundancy built into the networking, such as NIC teaming. There is a copper port for LAN and a copper port for WAN, that’s it. RP will continually track all writes in a bitmap if the WAN interface is dropped. This bitmap is held within the Production JVOL. Once the interface is back up, all tracked block changes will be replicated. An important note on site control (or the Virtual IP for the cluster): the VIP is passed between the first two RPAs within a cluster, RPA1 and RPA2 for example. If RPA1 loses its LAN interface then it will pass site control to RPA2, and vice versa.
Each R610 has a single quad-port QLogic FC card. The single card implies a SPOF, but think of this environment as a redundant array of nodes, where most HA capabilities are not defined on a per node basis, but rather on the cluster as a whole. Standard Clariion zoning applies for HA, which I will go over in subsequent posts.
Beyond that, scale out is accomplished via the addition of more appliances, up to 8 per cluster. With the introduction of RecoverPoint 3.3, the idea of distributed consistency groups (DCGs) has surfaced. Traditionally a single consistency group has been tied to a single RecoverPoint appliance. With DCGs you now have the option of spreading the data load across up to four appliances, effectively increasing the throughput from under 100Mb per appliance/per CG to over 200Mb per CG.
In summary, redundancy extends beyond a single appliance. Keep your physical networking and storage infrastructure within your sights when designing such an environment. More to come..
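For anyone who hasn't run into bitmap tracking before, here is a minimal Python sketch of the general technique. The class name, the one-bit-per-block layout, and the 64-block volume are illustrative assumptions of mine, and RecoverPoint's real on-disk format is not something I'm describing here. The point is that during an outage you only need to remember WHICH blocks changed, not every write.

```python
class DirtyBitmap:
    """Toy dirty-block bitmap (hypothetical): one bit per block on the volume."""

    def __init__(self, volume_blocks: int):
        self.volume_blocks = volume_blocks
        self.bits = bytearray((volume_blocks + 7) // 8)

    def mark(self, block: int) -> None:
        """Flag a block as changed while the WAN link is down."""
        if not 0 <= block < self.volume_blocks:
            raise IndexError("block is outside the volume")
        self.bits[block // 8] |= 1 << (block % 8)

    def dirty_blocks(self) -> list:
        """Blocks that must be re-sent once the link is back."""
        return [
            block
            for block in range(self.volume_blocks)
            if self.bits[block // 8] & (1 << (block % 8))
        ]

    def clear(self) -> None:
        """Reset after a successful resynchronization."""
        self.bits = bytearray(len(self.bits))


bitmap = DirtyBitmap(volume_blocks=64)
for block in (3, 3, 17, 42):  # block 3 is written twice during the outage
    bitmap.mark(block)

to_resync = bitmap.dirty_blocks()
print(to_resync)
```

Notice that block 3 only shows up once. A bitmap costs very little space no matter how long the outage lasts, but it also means the individual writes made during the outage are not kept as separate recovery points, only the final state of each changed block.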