Is Your Backup System Constipated?
Volume II in the Edborg-3-Step Process to Increase Confidence in Your Data Protection and Recovery Posture.
Okay, forgive the coarseness of the title. I hope I got your attention. This blog post is the second in a series I started a few days ago digging into the findings of EMC’s Global Data Protection Index. In Volume I, I talked about the confidence organizations had in their ability to restore data and recover. In this post, I dig into Step 1 of the Edborg-3-Step Process to Increase Confidence in Your Data Protection and Recovery Posture; Step 1 is all about archiving.
Why archiving? I chose archiving as the starting point because I believe it is the biggest sleeper cause of organizations’ lack of confidence in their ability to recover. Going back to the Global Data Protection Index I talked about in my last post, there were some clear practices that separated those ahead of the curve from those behind it. Slide 9 in the key findings of the index shows the archive practices of the organizations studied; for each of the four groups, here is what their archive systems looked like:

Leaders: Use an archiving application with retention policies
Adopters: Use an archiving application with offsite replication
Evaluators: Archive to disk
Laggards: Archive to tape
The Leaders and Adopters also had confidence in their data protection systems’ ability to recover, while the 87% behind the curve in the Evaluators and Laggards did not have complete confidence that they could recover. Further, based on the many backup assessments that EMC has run, most organizations mix backup and archive in the same service. Okay, time to be a heretic; and a proud one at that, if I may say. Backup and archive should be segregated: backup is for recovery; archive is for preservation. When the two are mixed, constipation occurs.
I’m sure you all have seen the surveys about the explosion of data growth; and part of the reason data grows geometrically (setting compliance aside for the moment) is that we love our data and do not want to part with it. And it’s not just at the corporate level; the behavior is rooted in each of us as individuals. I too love my data: when I first started at EMC a few years ago I got my first MacBook Air with 128GB of flash storage, my next one had 256GB, and my current one has 512GB, of which 80% is used. How did that happen? Not because of laziness, but because I could.
My addiction to data exists because I want to be more productive. I want to be able to quickly find and use data I created yesterday, last week, last month, and years ago. So I keep it all online, indexed, and searchable. I find myself doing searches several times a day, actually finding what I was looking for, and quickly being able to use it; my personal productivity is greatly enhanced.
But what was Newton’s Third Law of Motion that we learned in high-school science? For every action, there is an equal and opposite reaction. Newton’s law also applies to my love of data: it costs me. Sure, I have my data at my fingertips, but the opposite reaction is that my local backups take longer, and every year (even though I get a company discount) my offsite Mozy plan gets larger and more costly.
So take an individual’s data-love behavior and stratify it across your organization. For the 87% of organizations that do not have a leading-edge archive system or the confidence that they can recover, how do their practices affect their ability to recover? It manifests itself in a few ways: first, not-recently-accessed data occupies a lot of expensive online storage; second, it slows down backup and recovery; third, it unreasonably increases the cost of a recovery solution.
We in EMC Professional Services perform a lot of File System Analyses to help our clients understand the characteristics of the data that they have. What we find over and over again is that 65% to 75% of the files in the filesystems studied have not been accessed in over a year. Okay, so what’s the impact of that? I’ll give you a concrete example from an assessment we just completed. In that assessment, the customer had 10PB of online data protected across 26 data centers, and 55% (5.5PB) was non-OS/application-binary filesystem data. Conservatively applying the 65% metric to the 5.5PB of filesystem data, about 3.6PB of that data probably has not been accessed in over a year.
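If you want a quick feel for your own numbers before commissioning a formal assessment, a minimal sketch of this kind of file-age scan might look like the following (the root path is a placeholder; this is not the tool EMC uses, just an illustration of the idea):

```python
import os
import time

def cold_file_report(root, cold_age_days=365):
    """Walk a directory tree and report how many files (and bytes)
    have not been accessed in cold_age_days."""
    cutoff = time.time() - cold_age_days * 86400
    total_files = total_bytes = cold_files = cold_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip unreadable or vanished files
            total_files += 1
            total_bytes += st.st_size
            if st.st_atime < cutoff:  # last access older than the cutoff
                cold_files += 1
                cold_bytes += st.st_size
    return total_files, total_bytes, cold_files, cold_bytes

files, size, cold, cold_size = cold_file_report("/data/shares")
if files:
    print(f"{cold / files:.0%} of files ({cold_bytes / max(size, 1):.0%} "
          f"of bytes) are archive candidates" if False else
          f"{cold / files:.0%} of files ({cold_size / max(size, 1):.0%} "
          f"of bytes) are archive candidates")
```

One caveat: many filesystems are mounted with noatime or relatime, in which case st_atime is unreliable and st_mtime (last modification) is the more honest, if more conservative, signal.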
Okay, now let’s relate that back to backup and recovery. Ignoring generational backup copies to simplify the example, online protected data results in three copies of data: the online primary copy, the local generational backup copy, and a protected offsite copy of the backup. And if the online data were also replicated offsite, there would be four copies: the online primary copy, the online copy at the second site, the local backup copy, and the offsite backup copy. Let’s assume for a moment that the online copy of data has a loaded cost of $2/GB*. If the customer were able to pull 3.6PB of unused data off of the online filesystem and into an archive system, then for that one copy alone the customer could save or defer $7,200,000 in storage cost. And from an application-recovery perspective, less data would need to be recovered at the DR site, lowering the cost of recovery.
The customer also had issues completing their backups on time. Without even looking at the individual backup jobs for tuning, imagine what pulling 36% of the data out of the backup stream would mean for backup windows: constipation over!!! And if the customer could avoid $7.2M in primary storage cost, how many dollars in the IT budget would become available to improve the posture of a recovery solution?
Hopefully, I’ve convinced you of the value of archive vs. backup and of how you could fund the improvement of recovery. Now comes the bigger question: what technologies are available, and how do I realize the savings? There are a number of file archive products on the market, but let me tell you about one that my company makes and sells; it’s called EMC SourceOne for File. With SourceOne for File, policies can be set to migrate files to archive and to set retention periods; and even though everyone loves their data, at some point it would be nice to have an auto-cleanup system. The file archive is easily searchable, and even indexed to speed up the search, and the ability exists to leave a stub in the filesystem to allow automatic, transparent retrieval of an archived file.
SourceOne for File has a number of options for the archive repository; it can be your favorite flavor of storage system: a NAS system, a dedupe system, an object store, or even the cloud. My favorite is DataDomain. With DataDomain handling both backup and archive on the same system, you could potentially enjoy deduplication across archive and backup, especially when people like me tend to keep multiple versions of a file for historical purposes. A solution that’s a bit out there, so to speak, is to put the archive on Isilon and access the data using Hadoop, giving you the potential to offer Data Mining as a Service.
Okay, so how do you get there? There are a couple of choices: one I’ll cover in a future installment, and a couple of ideas here. Probably the most classic way to implement an archive system is to first define an Information Lifecycle Management (ILM) system. But there’s a cost associated with that, and the bureaucracy it creates may be more than what’s needed for a lot of use cases.
My favorite: bribery… told you I was a heretic. In the bribery scheme you need to first implement the archive service; then, for users, I would argue you could set up a set of fileshares, each with a separate archive and retention policy. One could be for online, regularly accessed data; one for auto-purge after a file has not been touched in over a year; and a third for indefinite retention. Now the bribery part. Offer your users a $100 gift certificate to the restaurant or store of their choice for each user that does “spring cleaning” on their filesystems and meets some preset objectives for moving data to archive-controlled shares. Certainly nobody wants to pay someone to do their job, but the customer with $7.2M in archive-eligible data could realize gains quickly by spending $1M on 10,000 users (and I’ll bet they could negotiate a volume discount with retailers and restaurants for the gift certificates). Talk about a morale improver!!!
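For the skeptics, the bribery-scheme arithmetic works out like this; it optimistically assumes every user hits their spring-cleaning objective and that all of the archive-eligible data actually moves:

```python
users = 10_000
incentive_per_user = 100                  # $100 gift certificate each
incentive_cost = users * incentive_per_user          # $1,000,000

deferred_storage_cost = 7_200_000         # from the earlier example

net_benefit = deferred_storage_cost - incentive_cost
roi = net_benefit / incentive_cost

print(f"incentive outlay: ${incentive_cost:,}")
print(f"net benefit: ${net_benefit:,}")
print(f"return on the incentive spend: {roi:.0%}")
```

Even if only half the eligible data moves, the incentive still pays for itself more than twofold, which is a far easier business case than most ILM programs.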
Okay, in summary: I hope I’ve convinced you to set up an archive service for files, shown you some of its benefits, and given you a couple of ideas on how to do it. In my next blog I’ll cover Step 2 in the Edborg-3-Step Process to Increase Confidence in Your Data Protection and Recovery Posture. Until then, play safe with your data!!!
*The $2/GB cost basis is an order-of-magnitude swag. It was derived using Amazon S3 volume list pricing, assuming a 5-year occupancy ($1.60/GB) with no price increases, plus a guesstimate of network costs and administration effort. Your results will vary.