blog.virtualtacit.com


Data De-dupe on Primary storage not so Peachy


In case you guys missed it, NetApp stirred things up back in July with an announcement about the ability of their V-series line to de-dupe their competitors’ primary storage, noted here. I would have to agree with the general consensus that deduplication has its place, and its place ain’t on primary storage. Applying de-dupe to secondary and tertiary storage and to backup operations is really the meat behind this feature’s punch.

The goal here is to provide your production data with the means to achieve high-throughput, low-latency execution, right? With de-dupe on primary storage, what you are doing is permeating your critical business infrastructure with an operation that is known to degrade performance. Not to mention this doesn’t in any way, shape or form provide the customer with an end-to-end, low-maintenance, SUPPORTED solution. As a customer you must decide whether or not your investment in your array (whether EMC, HP, HDS, IBM, etc.) is all for naught.

Fronting your array with the V-series essentially strips all management capabilities from your array and reduces it to a JBOD. Your investment is no longer an investment. E-labs, EMC’s own interoperability entity, works with associated vendors to resolve issues for qualified configurations. These support agreements run deep with various vendors, but NetApp, particularly the V-series, is not one of them.


So here is what NetApp suggests:

· Support for de-duplication of primary data on third party storage arrays from EMC, HDS, HP, and others when connected to NetApp V-Series Virtualization systems.

· NetApp de-duplication, a feature of the Data ONTAP operating system provided at no cost on FAS systems, is now also offered free with the V-Series.

· End-to-end de-duplication including primary data, as opposed to other vendors’ de-duplication of only backup or archive environments.

· Improved business efficiency and reduced data management complexity using V-Series with non-NetApp storage arrays.

· De-duplication helps improve space efficiency and reduce raw storage requirements.

· By using V-Series with de-duplication, customers are able to better control their heterogeneous data growth while reducing costs and simplifying data management.

· More than 10,000 NetApp systems and 2,500 customers running NetApp de-duplication technology.

· All NetApp storage technologies will include de-duplication by the end of 2008.

Ok so here’s the rub when it comes to data de-dupe on NetApp Filers:

· Active snapshots? Sorry, you can’t de-dupe

· Severe volume limitations are imposed as part of the 3D process

· No de-dupe across FlexVols, aggregates, or filers.

· Backup to tape inflates data to pre-dedupe size

· Since de-duplication is a post-process operation, NetApp offers no reduction of capacity requirements for initial purchase of new systems.

· No reclamation of space in block based storage (FC and iSCSI)

· Scheduling complications are now a reality. Avoiding periods of snapshotting, replication, archiving and general heavy work loads can be difficult.

· NetApp says “If there is very little new data, run de-duplication infrequently, because it doesn’t make sense to unnecessarily consume CPU resources.” http://www.netapp.com/us/library/technical-reports/tr-3505.html

· De-duplication itself is free, but are SnapVault and SnapMirror? Should I remind you that nothing in life is free?
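To make the "post-process" point a bit more concrete, here is a rough Python sketch of what de-duping data that is already at rest looks like. This is not NetApp's implementation. The 4 KB block size, the SHA-256 fingerprinting, and the function name `dedupe_savings` are all assumptions chosen purely for illustration.

```python
import hashlib
from typing import Tuple

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def dedupe_savings(data: bytes, block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Return (total_blocks, unique_blocks) for data that is already on disk."""
    fingerprints = set()
    total_blocks = 0
    # Walk the data block by block, as a scheduled pass would after the writes landed.
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Blocks with the same fingerprint are treated as duplicates in this sketch.
        fingerprints.add(hashlib.sha256(block).digest())
        total_blocks += 1
    return total_blocks, len(fingerprints)


if __name__ == "__main__":
    # Three identical blocks plus one different block.
    sample = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
    total, unique = dedupe_savings(sample)
    print(f"{total} blocks written, {unique} kept after the pass")
```

Notice that all of the blocks have to be written before the pass can find the duplicates. That is why a post-process approach does nothing for the capacity you have to buy on day one, and why the pass itself has to be scheduled around everything else that hits the array.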

De-dupe, like every other storage feature, whether it’s EMC, NetApp, Data Domain, etc., has its positives and negatives. Just make sure you as a customer look beyond the marketing wooglie booglie and understand the technology you are depending on.

One last thought: if you do decide to turn on NetApp 3D, please take 10 minutes to fill out an “I told you so” form that releases NetApp from any wrongdoing when your performance dips below the equator, http://www.crn.com/storage/209901632, I kid you not….

Written by Joe Kelly

October 5, 2008 at 4:13 pm


2 Responses to 'Data De-dupe on Primary storage not so Peachy'


  1. There are definitely limitations to NetApp’s dedupe, but you’re mixing apples and oranges by saying that dedupe is not for primary storage. Just because one product has limitations does not mean the category is not important.

    The fact is that all file data (not databases, not block device data) has a fairly predictable histogram of access. Files that have not been modified for, say, two weeks have a very low likelihood of being modified again. It’s this histogram of access that makes dedupe for primary storage so compelling, and furthermore, that mandates that it be a post-process. You don’t want to introduce the overhead of compression or dedupe for primary when the file is being created (adding latency and risk to the write), or in the early days of a file’s life (when it is most likely to be modified again).

    However, if you compress and/or dedupe a file once it is a couple weeks or months old, then you pay no penalty on writes, you have no data loss risks (because it will already have been backed up by then at least once) and - if you pick the right dedupe solution - you will have no noticeable added latency on reads.

    Primary storage costs a fortune because of emphasis on performance (latency) and protection (mirroring, snapshots, synchronous replication, etc). However, those features really only apply to when data is being created. Most files that are stored on primary storage needed those features for the first few weeks of their life, and are now living on expensive primary storage as static, read-mostly files. Why not shrink them as much as you can?

    To give you an example of a solution that works with NetApp, and BlueArc, and HP, and Isilon, or any other standard NAS, take a look at Ocarina Networks. The Ocarina solution shrinks files where they sit, allows backups of compressed and deduped files, does not sit in the write path at all, and has an average of +4ms latency on read back for most file types. I would argue that if you haven’t opened a file for a month, then waiting an extra 4ms to open it is not going to bother (or even be noticed by) most users or applications.

    Carter

    6 Oct 08 at 6:49 pm

  2. I do agree somewhat with this post. In our implementation, all the critical VMs are run over a 4 gig fiber-connected EMC. On the development and QA side, they have been offloaded to an iSCSI (soon to be fiber) connected NetApp. ASIS is running fine on the VM iSCSI volumes and is getting a 59% dedupe rate.

    NAS> df -sh /vol/testVol

    Filesystem used saved %saved
    /vol/testVol/ 519GB 754GB 59%

    Saving 754 gig so far! ASIS (NetApp dedupe) with VMWare information can be found here:

    http://blog.colovirt.com/2008/10/23/netapp-deduplication-a-sis-and-vmware/

    We also use Data Domains that do dedupe in real time, but I definitely do not see that as an option to actually run VMs off of. At least with NetApp nightly deduplication, we only get the performance hit during its run time, instead of what would be a constant one with any real-time dedupe appliance. This has been taken into account and has saved us from expanding another tray of EMC drives to support non-critical VMs. Also, I hear the newer line of NetApp controllers will allow deduplication beyond the current limitation of 1 terabyte volumes (the version we are on).

    kcollo

    15 Nov 08 at 2:19 pm
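For anyone checking the numbers in the second comment, the 59% figure lines up with the used and saved columns if you assume `%saved` means saved space divided by the total logical data (used plus saved). That formula is an assumption inferred from the output shown, and the function name `percent_saved` is made up for this sketch.

```python
def percent_saved(used_gb: float, saved_gb: float) -> float:
    """Share of the logical (pre-dedupe) data removed by dedupe.

    Assumes %saved = saved / (used + saved), inferred from the df output above.
    """
    logical_gb = used_gb + saved_gb
    return 100.0 * saved_gb / logical_gb


if __name__ == "__main__":
    # Values taken from the df -sh output for /vol/testVol.
    print(round(percent_saved(519, 754)))
```

In other words, the volume holds roughly 1,273GB of logical data in 519GB of physical space.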
