
Recent Articles

24 Aug

Defending the indefensible

If there’s one thing law has taught us over the years it’s that the following defences aren’t often accepted:

I didn’t know what I was doing.

I was only following orders.

I wasn’t told to do it.

Backup and recovery is one of those areas where you hear IT people say those sorts of things an awful lot. Conversations will run like this:

Consultant: How long do you keep your backups for?

IT: 5 weeks

Consultant: What about long term backups?

IT: The business hasn’t told us we have to do it

Or like this:

Consultant: You’re only keeping your long term backups for 18 months?

IT: Yes. We’ve not received any direction from the business on how long they should be kept for.

When we’re custodians of a system, in the absence of direction otherwise, the onus is on us to be fully responsible for it. I frequently have conversations about data retention where the summary of the conversation is essentially “we don’t know what to do”.

That’s all well and good, but we’re also all grown-ups, and data recoverability and records retention are, whether we find them personally interesting or not, essential to business compliance. I’m not saying people are lazy, of course – but almost everyone is kept busy in their jobs, and if the company seems lackadaisical in its attitude towards data retention, there doesn’t seem to be much point in being personally concerned about it.

Legally, it frightens the bejesus out of me. If you’re responsible for backups, and you don’t have written evidence that you were specifically told not to implement particular retention policies, you’re leaving yourself wide open. Even if you have that written evidence, you may find yourself in hot water unless you’re able to kick the can down the road and into a manager’s lap.

I’ve had conversations with staff at academic institutions where they’ve said there are no clear records retention requirements – a claim disproven by a 5 minute search of their own extranet, let alone intranet. Staff at government bodies and private industries say the business doesn’t know how long data should be kept for, but a Google search of the relevant regulatory bodies for the industry or government vertical usually shows there to be clear, documented guidance on the topic.

Searching for a pile of documents that outline retention requirements, muddling through them to find the key details, then getting agreement from the business may sound like a lot of effort, but if you’re the backup administrator it’s part of your job. You may not make the final call – and in fact ideally you won’t make the final call – but unless your company has legal advisors providing that sort of comprehensive information, someone needs to take ownership of it.

In the end, you can choose to be a tape monkey or a data jockey. I know which I’d prefer. If you choose incorrectly, you may just find yourself trying to defend the indefensible.

 

22 Mar

You can bet on that

I come from a gambling family.

Every week, my parents buy their lotto tickets.

Every week, my father has his bets on the horses.

The only thing that’s changed since my parents retired is that the amount they place on their bets has reduced.

Many of my family members consider a good night out to be a night at a club where they can have a “flutter on the pokies”.

You might say gambling is in our blood.

Except

For me, it’s not. I’m not a gambler. The world-famous Melbourne Cup in November? Even now that I live in Melbourne, I can’t be bothered betting on it. I’m not a gambler. Not with money.

And not with backups.

One of the more common things I hear is “…but the chances of that happening are low.”

Unless you’re talking EM pulse, zombie attack, nuclear war or some other major catastrophe, the chances are that someone is basically saying “I like to gamble”.

Take the classic response to synthetic full backups. Don’t get me wrong – I like synthetic fulls. They have their place. But, if you’re going to position them as a solution for backing up remote sites, you have to – you have to – understand the risk that this poses should it become necessary to perform a full filesystem recovery. So when someone says “but the chances of doing a full filesystem recovery are pretty low”, it raises my hackles.

It’s not low if all you’re using for information lifecycle protection is RAID and backups, which is how the majority of those remote-office backup configurations are made. Under those circumstances, you’re not making an informed bet, you’re throwing all your money on Black 15 and hoping like hell it comes up.

To be sure, everything in backup is risk vs cost. I don’t joke when I say in my book that you could spend the entire IT budget of a company – hell, the entire budget of a company – for 5 years just on backup, and still not have a 100% guarantee of protection against every possible failure. That’s the nature of life. Instead, we plan around:

  • The scenarios most likely to occur, and
  • The scenarios that have at least a plausible chance of happening and would cause significant damage to the company if they did – where it’s cheaper to protect against the failure than to wear the cost of it.

As such, a suitable branch office backup policy is rarely, if ever, going to be “let’s just hope we never have to do a full filesystem restore”. Just ask the manager of the branch office. Or the payroll department who have to pay the users in the branch office to be unproductive in that scenario. Chances are, they’ll have a different perspective – and if they don’t, you’ll have at least done due diligence, and documented it.

Because if that branch office fileserver fails and you tell your manager “it’ll take us 5 weeks to recover it across the wire, but hey our backups were always fast”, she or he won’t be amused.
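To make that concrete, here’s a rough back-of-envelope sketch of the recovery window. The figures – fileserver size, link speed, usable bandwidth – are purely illustrative assumptions, not measurements; plug in your own.

```python
# Rough estimate of a full filesystem recovery across a WAN link.
# All figures are illustrative assumptions, not measurements.

def recovery_days(data_tb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Days to pull data_tb terabytes across a link_mbps WAN link,
    assuming only `efficiency` of the nominal bandwidth is usable."""
    data_bits = data_tb * 1e12 * 8            # TB -> bits
    usable_bps = link_mbps * 1e6 * efficiency
    return data_bits / usable_bps / 86400     # seconds -> days

# A hypothetical 3 TB branch-office fileserver over a 10 Mbit/s link:
print(f"{recovery_days(3, 10):.0f} days")     # ~35 days -- roughly 5 weeks
```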

You can bet on that.

4 Mar

Preparedness is not wastefulness

(Originally posted 30 August 2011 on the NetWorker Blog.)

This morning, Christopher Biggs, aka @unixbigot, tweeted a truth in IT that has always been a strong personal bugbear for me. He said:

“The reward for successful disaster preparation is always idiots decrying the wasted effort & resources.”

This is so, so very true.

While I’m sure it had been experienced by countless administrators beforehand, my first real experience of this was the year 2000 issue – Y2K. Post-Y2K there were hundreds of opinionated trashbag journalists and management consultants happy to jump up and slam the amount of money invested in addressing the issue. It’s a sad fact of life that there’s always going to be people who want to write negatively. (Those same trashbags would have equally written about the disgusting unprofessional nature of IT people had the skies really fallen in post-Y2K, after all.)

The work of system administrators is largely invisible, but the work of backup administrators is even more so. No-one cares about backup until something goes wrong, so an exceedingly common reaction in IT is for people to jump up and down and decry the amount of money or time spent on such activities as:

  • disaster recovery testing;
  • disaster recovery planning (trust me, they’re often done in this order…);
  • backup duplication;
  • high availability.

And why? Because in each case the end goal or the hope is that they’re not actually required.

It’s a tired, stupid meme that we, as a data protection industry, have to put to rest. It has to become accepted fact that all these activities are required for healthy business function, and you should be grateful that you don’t need to act on those plans and backups, rather than getting upset about the time and money taken.

Will we convince everyone? No. Then again, there’s still flat earthers out there. There’ll always be that small percentage who stubbornly cling to rampant stupidity as a shield against the real world.

Preparedness is not wastefulness.

Make it your mantra.

1 Feb

If you wouldn’t drink it, don’t cook with it

There’s a pertinent adage in cooking when it comes to using wine in recipes:

If you wouldn’t drink it, don’t cook with it.

It’s simple: if you don’t like the taste of it in a glass, what makes you think you’ll like the taste of food you’ve added it to?

There are two similar rules for backup, and they’re particularly important when it comes time to do those periodic hardware refreshes in your environment:

If it’s not good enough to run production, don’t use it for DR.

If it’s not good enough to run production, don’t use it for backup.

The way in which both of these come into play is quite simple:

  1. If it’s not good enough to run production, don’t use it for DR. I’ve seen companies with a hardware refresh cycle of “move production equipment to DR, buy new production equipment”. However, that equipment is invariably being pulled out of production because it’s lacking in either capacity or performance. It’s then being replaced in production with new equipment that has a planned usage time of (typically) 2-3 years. So let’s assume you get a year down the track – your in-use storage capacity has gone up, your processing load has increased, then there’s a major production fault and you have to fail over to DR. At which point, you’re trying to run your production environment on something that was sized to max out 12 months ago. Chances of it adequately running production? Minimal.
  2. If it’s not good enough to run production, don’t use it for backup. Another common mistake arises when, say, a storage array is pulled out of production and replaced with a new, faster array with more capacity. People invariably hate to see things go to waste, so someone suggests “let’s use the old array as {backup to disk | VTL | etc}”. Again, it sounds simple enough on the face of it, except the equipment was lacking in either performance or capacity. If it was lacking in performance, you’re copying off something that was purchased, at the outset, to be significantly faster than it. It’s similar with capacity – you’re trying to back up a very large bucket to a much smaller bucket.

Whether your company likes the idea of it or not, backup and disaster recovery are not areas that should be assigned “hand me downs” by the rest of the business. They require their own capital budget, and planning that allows for the following two factors:

  1. Performance should at least match the throughput on offer from production;
  2. Capacity should exceed your production capacity.

If either of these conditions is not met, your strategy is insufficient.
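As a sketch of how those two factors translate into a simple planning check – the function and the sample figures below are purely illustrative, not from any particular environment:

```python
# Check a proposed backup/DR target against production.
# The sample figures below are hypothetical.

def shortfalls(prod_throughput_mbs: float, prod_capacity_tb: float,
               target_throughput_mbs: float, target_capacity_tb: float) -> list[str]:
    """Return the ways a proposed DR/backup platform falls short of production."""
    problems = []
    if target_throughput_mbs < prod_throughput_mbs:
        problems.append("performance below production throughput")
    if target_capacity_tb <= prod_capacity_tb:
        problems.append("capacity does not exceed production capacity")
    return problems

# A hand-me-down array that was pulled from production a year ago:
print(shortfalls(prod_throughput_mbs=800, prod_capacity_tb=120,
                 target_throughput_mbs=450, target_capacity_tb=100))
```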

23 Jan

Peaches or pears


My grandmother turned 92 in November. For my 39 years, any time I’ve seen her, she’s told me a “peaches or pears” story. That’s where she’s actually telling me an entirely different story – possibly quite a large story, involving meeting someone she hadn’t seen for 20 years, or finding out someone had died, or having an incident with a less than charitable person. Anything ‘big’.

But they’d seemingly always start with her going shopping, and they’d always seem to involve her buying a tin of peaches. Or was it a tin of pears?

No, she already had a tin of pears, it must have been peaches.

But, Iris might have been coming to visit and she prefers peaches, it must have been buying extra peaches.

But Iris would normally take her shopping …

And so the story would hang, right at the start, on a totally irrelevant detail – whether she’d been shopping for peaches, or pears.

What’s this got to do with backup? Funnily enough, quite a bit.

My grandmother’s stories would often falter from the outset because she’d get distracted by details that aren’t relevant to the story – whether she was out to buy a tin of peaches or a tin of pears doesn’t really relate to whether her friend Beatrice had an emergency hip replacement while skiing through Austria.

So when I hear enterprise vendors telling everyone about how tape really needs to die, because of all the great features about backup to disk, all I hear is a peaches or pears story.

These days I’m not really in disagreement with them. The occasional time I stumble across a customer who is still backing up directly to tape usually leaves me a little flummoxed. Tape isn’t going to die any time soon, no matter how much people close their eyes and wish for it three times. Its purpose is changing, however. Those messages now about how disk backup is so much better than tape backup – they’re all relevant, but they’re talking to that small percentage of companies that are still doing direct-to-tape backup, not the bigger enterprises who are deliberately keeping tape in their long-term backup plans.

The reason the “tape must die” argument is failing is because it’s a peaches or pears argument. Usually the points are highly impressive:

  • When backing up to a disk array you get redundancy on the backups through RAID;
  • Disk is instant access;
  • Disk can be easily replicated;
  • Disk can be highly efficiently deduplicated;
  • etc.

There’s a myriad of reasons, all really good reasons, to pull tape out of primary, first-write use. (By first-write, I mean the first time the data is backed up.) I concur with a large number of them and advocate them to customers myself.

Yet for businesses shifting tape into a longer-term backup role, those arguments don’t carry much weight – businesses are buying tape in that scenario for the following reasons:

  • It’s cheap – really cheap, per unit of TB.
  • It’s totally, utterly, completely offline. Short of some massive EM pulse, nothing that affects their online backup environment will impact a tape cartridge stored in a secure, offsite facility.
  • It’s cheap – I mentioned that, right?
  • Redundancy comes from multiple copies.
  • It’s cheap enough for multiple copies – I mentioned that, too?

That’s why people are still buying tape – it’s cheap, cheap enough for multiple redundant copies, and it’s offline. It’s so totally offline, compared to near-line and online backups. Companies may like the notion of keeping months of backups online for easy recovery, but many companies (and rightly so) equally like the notion of keeping years of data offline – inviolate and quiesced. Disk backup just doesn’t provide for that.

It’s not about peaches or pears (or in this case, feeds and speeds), it’s about massive amounts of cheap, offline data. That’s why tape hangs around.

19 Jan

Zero error policy management

In the first article on the subject, What is a zero error policy?, I established the three rules that need to be followed to achieve a zero error policy, viz:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

As a result of various questions and discussions I’ve had about this, I want to expand on the zero error approach to backups to discuss management of such a policy.

Saying that you’re going to implement a zero error policy – indeed, wanting to implement one – and actually implementing it are significantly different activities. So, in order to properly manage a zero error policy, the following three components must be developed, maintained and followed:

  1. Error classification.
  2. Procedures for dealing with errors.
  3. Documentation of the procedures and the errors.

In various cases I’ve seen companies try to implement a zero error policy by following one or two of the above, but they’ve never succeeded unless they’ve implemented all three.

Let’s look at each one individually.

Error Classification

Classification is at the heart of many activities we perform. In data storage, we classify data by its importance and its speed requirements, and assign tiers. In systems protection, we classify systems by whether they’re operational production, infrastructure support production, development, Q&A, test, etc. Stepping outside of IT, we routinely do things by classification – we pay bills in order of urgency, or we go shopping for the things we need sooner rather than the things we’re going to run out of in three months time, etc. Classification is not only important, but it’s also something we do (and understand the need for) naturally – i.e., it’s not hard to do.

In the most simple sense, errors for data protection systems can be broken down into three types:

  • Critical errors – If error X occurs then data loss occurs.
  • Hard errors – If error X occurs and data loss occurs, then recoverability cannot be achieved.
  • Soft errors – If error X occurs and data loss occurs, then recoverability can still be achieved, but with non-critical data recoverability uncertain.

Here’s a logical follow-up from the above classification – any backup system designed such that it can cause a critical error has been incorrectly designed. What’s an example of a critical error? Consider the following scenario:

  • Database is shut down at 22:00 for cold backups by a scheduled system task
  • Cold backup runs overnight
  • Database is automatically started at 06:00 by a scheduled system task

Now obviously our preference would be to use a backup module, but that’s actually not the risk of critical error here: it’s the divorcing of the shutdown/startup from the actual filesystem backup. Why does this create a “critical error” situation, you may ask? On any system where exclusive file locking takes place, if for any reason the backup is still running when the database is started, corruption is likely to occur. (For example, I have seen Oracle databases on Windows destroyed by such scenarios.)
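One way to design that critical error out is to couple the shutdown, backup and startup into a single job, so the startup can never run while the backup is still going. A minimal sketch – the three commands are placeholders of my own, not real database or backup product syntax:

```python
# Couple shutdown, cold backup and startup so the database restart can never
# overlap a still-running backup. Command names are placeholders only.
import subprocess
import sys

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    print("running:", " ".join(cmd))
    return subprocess.run(cmd)

run(["db_shutdown.sh"])                                        # placeholder: stop the database
backup = run(["backup_client.sh", "--path", "/oracle/data"])   # placeholder: cold filesystem backup

# By the time we reach this line the backup command has returned, so the
# restart cannot collide with it -- whether the backup succeeded or not.
run(["db_startup.sh"])                                         # placeholder: start the database

if backup.returncode != 0:
    print("cold backup failed: treat as a hard error and re-run", file=sys.stderr)
    sys.exit(backup.returncode)
```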

So, a critical error is one where the failure in the backup process will result in data loss. This is an unacceptable error; so, not only must we be able to classify critical errors, but all efforts must be made to ensure that no scenarios which permit critical errors are ever introduced to a system.

Moving on, a hard error is one where we can quantify that if the error occurs and we subsequently have data loss (recovery required), then we will not be able to facilitate that recovery to within our preferred (or required) windows. So if a client completely fails to backup overnight, or one filesystem on the client fails, then we would consider that to be a hard error – the backup did not work and thus if there is a failure on that client we cannot use that backup to recover.

A soft error, on the other hand, is an error that will not prevent core recovery from happening. These are the most difficult to classify. Using NetWorker as an example, you could say that these will often be the warnings issued during the backups where the backup still manages to complete. Perhaps the most common example of this is files being open (and thus inaccessible) during backup. However, we can’t (via a blanket rule) assume that any warning is a soft error – it could be a hard error in disguise.

To use programming languages as an analogy: a syntax error is one which is immediately obvious, whereas a semantic error is one where the problem of meaning is not obvious. Thus, syntax errors cause an immediate failure, whereas semantic errors usually cause a bug.

Taking that analogy back to soft vs hard errors, and using our file-open example, you can readily imagine a scenario where files open during backup could constitute a hard or a soft error. In the case of a soft error, it may refer to temporary files that are generated by a busy system during backup processing. Such temporary files may have no relevance to the operational state of a recovered system, and thus the recoverability of the temporary files does not affect the recoverability of the system as a whole. On the other hand, if critical data files are missed due to being open at the time of the backup, then the recoverability of the system as a whole is compromised.

So, to achieve a zero error policy, we must be able to:

  1. Classify critical errors, and ensure situations that can lead to them are designed out of the solution.
  2. Classify hard errors.
  3. Classify soft errors and be able to differentiate them from hard errors.

One (obvious) net result of this is that you must always check your backup results. No ifs, no buts, no maybes. For those who want to automatically parse backup results, as mentioned in the first article, it also means you must configure the automatic parser such that any unknown result is treated as an error for examination and either action or rule updating.
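As a sketch of what that kind of parser can look like – the message patterns below are invented for illustration, and real savegroup output will differ:

```python
# Automated backup-report parser for a zero error policy: known patterns map
# to a severity, and anything unrecognised is escalated rather than ignored.
# The patterns here are illustrative only.
import re
from enum import Enum

class Severity(Enum):
    OK = "ok"
    SOFT = "soft"        # core recoverability not affected
    HARD = "hard"        # this backup cannot be relied on for recovery
    UNKNOWN = "unknown"  # not yet classified: treated as an error until reviewed

KNOWN_PATTERNS = [
    (re.compile(r"completed successfully", re.I), Severity.OK),
    (re.compile(r"file .* is open.*skipped", re.I), Severity.SOFT),
    (re.compile(r"client .* failed", re.I), Severity.HARD),
]

def classify(line: str) -> Severity:
    for pattern, severity in KNOWN_PATTERNS:
        if pattern.search(line):
            return severity
    return Severity.UNKNOWN   # the core rule: unknown output is an error

def parse_report(report: str) -> dict[Severity, list[str]]:
    results: dict[Severity, list[str]] = {s: [] for s in Severity}
    for line in filter(None, map(str.strip, report.splitlines())):
        results[classify(line)].append(line)
    return results
```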

[Note: An interesting newish feature in NetWorker was the introduction of the “success threshold” option for backup groups. Set to “Warning”, by default, this will see savesets that generated warnings (but not hard errors) flagged as successful. The other option is “Success”, which means that in order for a saveset to be listed as a successful saveset, it must complete without warning. One may be able to argue that in an environment where all attempts have been made to eliminate errors, and the environment operates under a zero-error policy, then this option should be changed from the default to the more severe option.]

Procedures for dealing with errors

The ability to classify an error as critical, hard, or soft is practically useless unless procedures are established for dealing with the errors. Procedures for dealing with errors will be driven, at first, by any existing SLAs within the organisation. I.e., the SLA for either maximum amount of data loss or recovery time will drive the response to any particular error.

That response however shouldn’t be an unplanned reaction. That is, there should be procedures which define:

  1. By what time backup results will be checked.
  2. To whom (job title), to where (documentation), and by when critical and hard errors shall be reported.
  3. To where (documentation) soft errors shall be reported.
  4. For each system that is backed up, responses to hard errors. (E.g., some systems may require immediate re-run of the backup, whereas others may require the backup to be re-run later, etc.)

Note that this isn’t an exhaustive list – for instance, it’s obvious that any critical errors must be immediately responded to, since data loss has occurred. Equally it doesn’t take into account routine testing, etc., but the above procedures are more for the daily procedures associated with enacting a zero error policy.
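To keep such procedures from drifting back into informality, they can even be recorded as data that a daily checklist or reporting job consumes. A small sketch – the times, roles and system names are hypothetical examples, not recommendations:

```python
# Hypothetical zero-error-policy procedure definitions; every value here is an
# example only.
DAILY_PROCEDURE = {
    "results_checked_by": "09:00",
    "critical_and_hard_errors": {
        "report_to": "Backup Operations Manager",
        "record_in": "issue register",
        "report_by": "09:30",
    },
    "soft_errors": {"record_in": "issue register"},
    "hard_error_responses": {
        "payroll-db": "re-run backup immediately",
        "branch-fileserver": "re-run backup after hours",
    },
}
```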

Now, you may think that the above requirements don’t constitute the need for procedures – that the processes can be followed informally. It may seem a callous argument to make, but in my experience in data protection, informal policies lead to laxity in following them up. (Or: if it isn’t written down, it isn’t done.)

Obviously when checks aren’t done it’s rarely for a malicious reason. However, knowing that “my boss would like a status report on overnight backups by 9am” is elastic – if we feel there are other things we need to look at first, we can choose to interpret that as “would like by 9am, but will settle for later”. If however there’s a procedure that says “management must have backup reports by 9am”, it takes away that elasticity. That matters because it actually helps with time management – tasks can be done in a logical, process-driven order, because there’s a defined importance for the activities within the role. This is critically important – not only for the person who has to perform the tasks, but also for those who would otherwise feel they can assign other tasks that interrupt these critical processes. You’ve heard that a good offence is a good defence? Well, a good procedure is also a good defence – against lower priority interruptions.

Documentation of the procedures and the errors

There are two acutely different reasons why documentation must be maintained (or three, if you want to start including auditing as a reason). So, to rephrase that, there are three acutely different reasons why documentation must be maintained. These are as follows:

  1. For auditing and compliance reasons it will be necessary to demonstrate that your company has procedures (and documentation for those procedures) for dealing with backup failures.
  2. To deal with sudden staff absence – it may be as simple as someone not being able to make it in on time, or it could be the backup administrator gets hit by a bus and will be in traction in the hospital for two weeks (or worse).
  3. To assist any staff member who does not have an eidetic memory.

In day to day operations, it’s the third reason that’s the most important. Human memory is a wonderfully powerful search and recall tool, yet it’s also remarkably fallible. Sometimes I can remember seeing the exact message 3 years prior in an error log from another customer, but forget that I’d asked a particular question only a day ago and ask it again. We all have those moments. And obviously, I also don’t remember what my colleagues did half an hour ago if I wasn’t there with them at the time.

I.e., we need to document errors because that guarantees us being able to reference them later. Again – no ifs, no buts, no maybes. Perhaps the most important factor in documenting errors in a data protection environment though is documenting in a system that allows for full text search. At bare minimum, you should be able to:

  1. Classify any input error based on:
    • Date/Time
    • System (server and client)
    • Application (if relevant)
    • Error type – critical, hard, soft
    • Response
  2. Conduct a full text search (optionally date restricted):
    • On any of the methods used to classify
    • On the actual error itself

The above scenario fits nicely with wiki systems, so that may be one good option, but there are other systems out there that can be used equally well.
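Even something as small as the following sketch would satisfy the bare minimum requirements above – classification fields plus a full text search. (The field names and structure are my own illustration, not from any particular product.)

```python
# Minimal error register: classify each entry, then full-text search across
# the classification fields and the error text itself.
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class ErrorRecord:
    timestamp: datetime
    server: str
    client: str
    error_type: str          # "critical", "hard" or "soft"
    message: str
    response: str
    application: str = ""    # optional

class ErrorRegister:
    def __init__(self) -> None:
        self.records: list[ErrorRecord] = []

    def add(self, record: ErrorRecord) -> None:
        self.records.append(record)

    def search(self, text: str, since: datetime | None = None) -> list[ErrorRecord]:
        """Full text search over every field, optionally date restricted."""
        text = text.lower()
        return [
            r for r in self.records
            if (since is None or r.timestamp >= since)
            and any(text in str(value).lower() for value in asdict(r).values())
        ]
```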

The important thing though is to get the documentation done. What may initially seem time consuming when a zero error policy is enacted will soon become quick and automatic; combined with the obvious reduction in errors over time under a zero error policy, the automatic procedural response to errors will actually streamline the activities of the backup administrator.

That documentation obviously, on a day to day basis, provides the most assistance to the person(s) in the ongoing role of backup administrator. However, in any situation where someone else has to fill in, this documentation becomes even more important – it allows them to step into the role, data mine for any message they’re not sure of and see what the local response was if a situation had happened before. Put yourself into the shoes of that other person … if you’re required to step into another person’s role temporarily, do you want to do it with plenty of supporting information, or with barely anything more than the name of the system you have to administer?

Wrapping Up

Just like when I first discussed zero error policies, you may be left thinking at the end of this that it sounds like there’s a lot of work involved in managing a zero error policy. It’s important to understand however that there’s always effort involved in any transition from a non-managed system to a managed system (i.e., from informal policies to formal procedures). However, for the most part this extra work mainly comes in at the institution of the procedures – namely in relation to:

  • Determining appropriate error categorisation techniques
  • Establishing the procedures
  • Establishing the documentation of the procedures
  • Establishing the documentation system used for the environment

Once these activities have been done, day to day management and operation of the zero error policy becomes a standard part of the job, and therefore doesn’t represent a significant impact on work. That’s for two key reasons: once these components are in place, following them really doesn’t take a lot of extra time, and the time it does take is actually factored into the job, so it can hardly be considered wasteful or frivolous.

At both a personal and ethical level, it’s also extremely satisfying to be able to answer the question, “How many errors slipped through the net today?” with “None”.

19 Jan

What is a zero error policy?

In my book, I recommend that all businesses should adopt a zero error policy in regards to backup. I personally think that zero error policies are the only way that a backup system should be run. To be perfectly frank, anything less than a zero error policy is irresponsible in data protection.

Now, the problem with talking about zero error policies is that many people get excited about the wrong things when it comes to them. That is, they either focus on:

  • This will be too expensive!

or

  • Who gets into trouble when errors DO occur?

Not only are these attitudes not helpful, but they’re not necessary either.

Having a zero error policy requires the following three rules:

  1. All errors shall be known.
  2. All errors shall be resolved.
  3. No error shall be allowed to continue to occur indefinitely.

You may think that rule (2) implies rule (3), and it does, but rule (3) gives us a special case/allowance for noting that some errors are permitted, in the short term, if there is a sufficient reason.

The actual purpose of the zero error policy is to ensure that any error or abnormal report from the backup system is treated as something requiring investigation and resolution. If this sounds like a lot of work, there’s a couple of key points to make:

  • When switching from any other policy to a zero error policy, yes, there will be a settling-in period that takes more time and effort, but once the initial hurdle has been cleared there should not be a significant ongoing drain of resources;
  • Given the importance of successful backups (i.e., being able to successfully recover when required), the work that is required is not only important, but arguably necessary and ethically required.

Let’s step through those three rules.

All errors shall be known

Recognising that there must be limits to the statement “all errors shall be known”, we take this to mean that if an error is reported it will be known about. The most simple interpretation of this is that all savegroup completion reports must be read. For the purposes of a NetWorker backup environment, any run-time backup error is going to appear in the savegroup completion report, and so reading the report and checking on a per-host basis is the most appropriate action.

There are some logical consequences of this requirement:

  1. Backup reports shall be checked.
  2. Recoveries shall be tested.
  3. An issue register shall be maintained.
  4. Backup logs shall be kept for at least the retention period of the backups they are for.

Note: By “…all savegroup completion reports must be read”, I’m not suggesting that you can’t automatically parse results – however, there’s a few rules that have to be carefully followed on this. Discussed more in my book, the key rule however is that when adopting both automated parsing and a zero error policy, one must configure the system such that any unknown output/text is treated as an error. I.e., anything not catered for at time of writing of an automated parser must be flagged as a potential error so that it is either dealt with or added to the parsing routine.

All errors shall be resolved

Errors aren’t meant to just keep occurring. Here’s some reasonably common errors within a NetWorker environment:

  • System fails backup every night because it’s been decommissioned.
  • System fails backup every night because it’s been incorrectly configured for inclusive backups and a filesystem/saveset is no longer present.
  • File open errors on Windows systems.
  • Errors about files changing during backup on Linux/Unix systems.

There’s not a single error in the above list (and I could have made it 5x longer) that can’t be resolved. The purpose of stating “all errors shall be resolved” is to discourage administrators (either backup or individual system administrators) from leaving errors unchallenged.

Every error represents a potential threat to the backup system, in one of two distinct ways:

  1. Real errors represent a recovery threat.
  2. Spurious errors may discourage the detection of a real error.

What’s a spurious error? That’s one where the fault condition is known. E.g., “that backup fails every night because one of the systems has been turned off”. In most cases, spurious errors are going to either come down to at best a domain error (“I didn’t fix that because it’s someone else’s problem”) or at worst, laziness (“I haven’t found the <1 minute required to turn off the backup for a decommissioned system”).

Spurious errors, I believe, are actually as bad, if not worse, than the real errors. While we work to protect our systems against real errors, it’s a fact of life and systems administration that they will periodically occur. Systems change, minor bugs may surface, environmental factors may play a part, etc. The role of the backup administrator therefore is to be constantly vigilant in detecting errors, taking preventative actions where applicable, and corrective actions where necessary.

Allowing spurious errors to continually occur within a backup system is however inappropriate, and runs totally contrary to good administration practices. The key problem is that if you come to anticipate that particular backups will have failures, you become lax in your checking, and thus may skip over real errors that creep in. As an example, consider the “client fails because it has been decommissioned” scenario. In NetWorker terms, this may mean that a particular savegroup completes every day with a status of “1 client failed”. So, every day, an administrator may note that the group had 1 failed client and not bother to check the rest of the report, since that failed client is expected. But what if another administrator had decommissioned that client? What if that client is no longer in the group, but another client is now being reported as failed every day?

That’s the insidious nature of spurious errors.

No error shall be allowed to continue indefinitely

No system is perfect, so we do have to recognise that some errors may have a life-span greater than a single backup job. However, in order for a zero error policy to work properly, we must give time limits to any failure condition.

There are two aspects to this rule. One is the obvious, SLA-style aspect: how long an error is allowed to persist before it is escalated and/or must be resolved. (E.g., “No system may have 3 consecutive days of backup failures”.)

The other aspect to this rule, which can be more challenging to work with, is dealing with those “expected” errors. E.g., consider a situation where the database administrators are trialling upgrades to Oracle on a development server. In this case, it may be known that the development system’s database backups will fail for the next 3 days. In such instances, to correctly enable a zero error policy, one must maintain not only an issues register, but an expected issues register – that is, noting which errors are going to happen, and when they should stop happening*.
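Here’s a sketch of how such an expected issues register might be checked against each night’s failures – the client names and dates are invented examples:

```python
# Compare overnight failures against an expected-issues register. Anything
# failing that isn't expected, or that is still failing past its expiry date,
# needs action. All entries are invented examples.
from datetime import date

# client -> date until which the failure is expected and tolerated
EXPECTED_ISSUES = {
    "dev-oracle-01": date(2013, 1, 22),   # DBAs trialling an Oracle upgrade
}

def triage(failed_clients: list[str], today: date) -> list[str]:
    """Return the failures that require action today."""
    needs_action = []
    for client in failed_clients:
        expiry = EXPECTED_ISSUES.get(client)
        if expiry is None or today > expiry:
            needs_action.append(client)   # unexpected, or the expected window has lapsed
    return needs_action

print(triage(["dev-oracle-01", "branch-fs-03"], date(2013, 1, 20)))   # ['branch-fs-03']
```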

Summarising

Zero error policies are arguably not only a functional but an ethical requirement of good backup administration. While they may take a little while to implement, and may formalise some of the work processes involved in the backup system, these should not be seen as a detriment. Indeed, I’d go so far as to suggest that you can’t actually have a backup system without a zero error policy. That is, without a zero error policy you can still get backups and recoveries, but with less certainty – and the more certainty you can build into a backup environment, the more it becomes a backup system.

[Ready for more? Check out the next post on this topic, Zero Error Policy Management.]


* In the example given, we could in theory use the “scheduled backup” feature of a client instance to disable backups for that particular client. However, that feature has a limitation in that there’s no allowances for automatically turning scheduled backups on again at a later date. Nevertheless, it’s a common enough scenario that it serves the purpose of the example.

2 Jan

In, out and off


There’s lots of different ways we can classify data. By its operational type (production, development, QA, test, etc), by priority (mission critical, important, unimportant), by arbitrarily named categories (platinum, gold, silver, bronze), and so on.

One way which doesn’t get a lot of attention which I think is useful for categorising data is “in, out, and off”, viz:

  • In(side) the datacentre;
  • Out(side) the datacentre;
  • Off(site).

While the world of cloud computing and shared datacentres can sometimes blur the physical attributes of the above, the logical attributes will remain relatively constant. “In” refers to data which is housed in the server room(s) for the business. It’s either on direct attached storage, or SAN/NAS storage. “Out” refers to data which is housed outside of the server room(s) for the business. It’s the data on user desktops and laptops, and any other assorted pieces of equipment outside the server room but still within the physical locations of the business.

“Off”, as you’ll have intuited by now, refers to data which is owned by the business but is outside the physical location of the business. It can refer to data collected by mobile devices for some organisations, but it’ll equally refer to company laptops that have been taken out from the physical business premises, and the growing plethora of smart phones, tablets, etc., which may house company data.

Where it’s useful to contemplate data as “in”, “out” and “off” is that each type of data in these circumstances will have significantly different backup options and requirements. Consider one of the most basic options for backup – how the backup is started:

  • “In” data should be able to be backed up automatically, outside of primary production hours, so as to minimise the impact of any backup processing requirements on the business and its users.
  • “Out” data should also be able to be backed up automatically, but more likely than not during primary production hours, when the devices are most likely to be turned on and accessible to the backup mechanism.
  • “Off” data will typically need to be backed up by having the data-host itself poll for backup.

Of course, it’s not always going to be that black and white, but it does set reasonable expectations – particularly when discussing with non-IT management why data protection is complex. In fact, you could alter the classification to also show the difficulty of automation, thusly:

  • Inside the data centre – easy
  • Outside the data centre – hard
  • Offsite – hardest

Some may raise an eyebrow at calling inside-datacentre backups easy, but in reality, they are. The implementation may be a challenge depending on the complexity and amount of data, but the techniques for in-datacentre backups are well established and straightforward. They become problems of scaling the architecture and operational management. As soon as the data moves outside of the datacentre, protecting it becomes more challenging. For instance, the “outside” data – data inside the offices, but not hosted directly on the servers – raises all sorts of questions, such as:

  1. Do you have operational policies that prohibit the creation of such data as much as possible?
  2. Do you have IT functional policies that prevent the creation of such data unless an exception is created?
  3. Do you allow the data to be created, and back it up?
  4. Do you allow the data to be created, and replicate it to servers that are backed up?

Rarely is there an “all or nothing” answer (except for the unacceptable “5. Let the users go crazy, don’t do anything about it” scenario too often encountered). So immediately the number of potential ways of dealing with the data (either for lifecycle management or lifecycle protection) increases considerably because there’s more randomness (i.e., humans) being introduced.

Once the data moves further away – once it’s offsite, the problem is exacerbated because questions 1-4 above still apply, but you also have the introduction of further complexities:

  1. The data is unlikely to be guaranteed to be connected at a given time;
  2. The bandwidth to the data is likely to be highly constrained and highly costed;
  3. There may need to be both a local and a remote backup option (i.e., remote to the user with the data, and remote back to the datacentre);
  4. Backups may be sent over insecure communications channels.

That’s just a few potential challenges, again. Hence, classifying the backups of inside data as “easy”, outside data as “hard” and offsite data as “hardest” should no longer seem odd, but reasonably logical.
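Restated as data – with the trigger and difficulty values simply echoing the lists above – the classification might look something like this sketch:

```python
# In / out / off proximity classification, restated as data per the lists above.
DATA_PROXIMITY = {
    "in":  {"where": "inside the datacentre",
            "backup_trigger": "scheduled centrally, outside production hours",
            "difficulty": "easy"},
    "out": {"where": "inside the offices, outside the server room",
            "backup_trigger": "scheduled centrally, during production hours",
            "difficulty": "hard"},
    "off": {"where": "outside the physical business premises",
            "backup_trigger": "self-polled by the device holding the data",
            "difficulty": "hardest"},
}
```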

We don’t solve all potential problems when it comes to data protection by employing a single classification scheme, so you’ll still need to look at mechanisms such as priority and operational type of the data, too. That being said, without being able to also classify data into its proximity to the datacentre – the inside, outside and offsite options, you won’t be able to properly formulate data protection strategies.

20 Nov

Are we hiring or firing?


Let’s imagine your company has had a serious issue which has resulted in data loss. Not just a document or a spreadsheet, but significant data loss. This might be:

  • Multiple days of data lost from a transactional database;
  • Entire departmental or workgroup fileserver data lost (or a month or more of data lost from it);
  • Irretrievable loss of substantial archive data.

A typical scenario to follow this will be a disaster recovery (or in some cases, business continuity) review. This should, at very minimum:

  • Determine a timeline of the failure;
  • Analyse what went wrong;
  • Determine actions to be taken as a result.

This isn’t going to be cheap. There’ll be multiple people involved, and indeed, there’ll be multiple people involved from multiple departments – not just IT.

One thing of critical importance – how much time and effort you’re going to allocate to this – is determined by a simple question:

Are you hiring or firing?

Are you firing? If you’re looking for a brief review with “decisive” actions that can be taken immediately and wanting to “remove an issue”, you need to acknowledge that you’re pretty much in a firing mood. You either want to fire an employee (or multiple employees), or a piece of technology.

…or are you hiring? If you’re looking to perform a comprehensive review and fix the root causes (both technical and procedural), then you’ve likely accepted that the net result of your review will be to spend money. That may be hiring a document writer for 6 months to clean up the procedures, or getting additional staff trained, or it might be a deeper problem and therefore a deeper fix – e.g., starting from scratch to develop the business requirements, write the documentation and test the processes.

And let me be blunt about that last point – if you’re not prepared to test said procedures, either in whole, or with appropriately third-party audited “dry runs” with component testing, you’re deluding yourself into thinking you’ll actually have a DR solution, let alone a BCP solution.

It’s no surprise that ISO 22301 includes sections for testing – it’s probably the most important aspect of a functional business continuity plan, and equally the easiest to leave out, either deliberately or accidentally. Deliberately because it seems too hard, or accidentally because it seems too easy. Getting it properly tested is most certainly non-trivial, but if the appropriate planning has been rigorously performed, it shouldn’t be too hard either.

So, if you have experienced a site disaster, make sure when you’re planning that review session at the end of it you know what your objective is. Unless it was demonstrably, easily identifiably a case of gross incompetence or gross negligence (in which case, a review session isn’t actually needed to determine the root cause), you need to scope and go into the review process accepting a very likely outcome will be additional budgetary requirements.

22 Oct

Information Lifecycle Policies vs Backup Policies

Periodically, I talk about backup being just a part of a broader set of strategies that I refer to as Information Lifecycle Protection (ILP). This is distinct from Information Lifecycle Management (ILM), and its components span RAID, replication, snapshots and backup.

A common mistake within an organisation, sometimes triggered by not having merged Backup, Storage and Virtualisation administration, is to approach all backup requirements and challenges only from a backup perspective. When approached from just a backup technology perspective, sometimes it doesn’t matter how elegant your solution is – it just may not be optimal.

Optimal solutions sometimes require extending the umbrella. A classic example of this is NAS. Consider for instance an enterprise environment that has a NAS in the production datacentre, replicating to a disaster recovery datacentre:

[Figure: Replicated NAS]

This is a fairly standard strategy, yet NAS often presents significant challenges to backup environments. Even with NDMP in place, coming up with a nightly data protection strategy for fileservers presenting tens of millions of files is not easy. Various NDMP techniques may allow the backup process to be sped up via block level strategies, but file level recovery from these styles of backup tends to be challenging at best, and not even possible in the worst case scenario.

As is always the case, whether you can even get a backup done is irrelevant if you can’t recover the data in an appropriately usable way.

What’s more, unstructured data doesn’t really lend itself well to more frequent backups than every 24 hours. While database logs can be captured on an almost continual basis, if it takes 8 hours to do an incremental walk of a highly dense filesystem for traditional backup, but the business requires a Recovery Point Objective (RPO) of just 1 hour, your traditional nightly-incremental strategy just doesn’t cut it.

So, we turn to other aspects in ILP.

The first step is to start using snapshots:

[Figure: NAS and snapshots]

Once configured at the storage layer, NAS snapshots happen pretty much automatically. If the business requires an RPO of 1 hour, then the most obvious protection strategy is to have the NAS take a snapshot every hour. These copy-on-write style snapshots are typically browsable by end-users, and in that situation they have an added advantage – if users can browse a snapshot and find the file they want, they don’t need to ask the backup team to recover the file(s) they need.

However – snapshots on their own represent a poor data protection strategy, since they’re only as safe as the array they’re sitting on, and relying solely on snapshots to protect data on an array, when the snapshots are also on that array, is … well, insane.

So, we have to make use of that replication strategy, and ensure that the snapshots are replicated as well:

[Figure: Replicated snapshots]

So at this point, we’ve got:

  • Snapshots providing an hourly RPO;
  • Snapshots providing a user-directed (self-service) recovery process;
  • Replication providing protection for snapshots in case of total array failure.

Now, some storage manufacturers would like to suggest that at this point you’ve got a valid backup solution. Not so fast, though! It’s only a valid backup solution if you’re prepared to burn through money to buy enough storage to provide long-term recoverability from snapshot. It’s around this point that you’ll want a backup product inserted into the protection strategy.

However, we don’t just insert a daily backup and leave it at that; if the NAS snapshots are configured correctly we can extend that convenience factor for end-users whilst still getting a copy out to offline storage. In this scenario, we might end up with a solution such as the following:

[Figure: Snapshots with daily backup]

In this scenario, hourly snapshots are kept for 24 hours, with the final snapshot of each day kept in turn as the “daily” backup for n days. In many businesses this will extend to more than a week – e.g., 28 or 31 days. In the above example, those “daily” snapshots are each written out to tape. Keep in mind that we’re still replicating the NAS and its snapshots from one site to another, so we hit a new benefit of combining snapshot, replication and backup into a comprehensive ILP strategy – when the traditional backup is run, it can be run from the replicated data, offloading the impact of the backup from the production NAS:

[Figure: Replica snapshot backups]

Of course, this isn’t the only way the backup strategy can work. If sufficient protection is available on both the production and replica NAS units, and the filesystems are large enough, only weekly backups might get output to tape:

[Figure: Snapshots with weekly backups]

With that strategy, no incremental backups of the NAS are ever written to tape – just weekly fulls.
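As a sketch of the schedule logic behind those diagrams – the retention figures are illustrative only, and a real array or backup product would express this in its own policy syntax:

```python
# Illustrative schedule: hourly snapshots kept 24 hours, the last snapshot of
# each day promoted to a "daily" kept for n days, and one snapshot per week
# written out to tape. All retention values are examples only.
from datetime import datetime, timedelta

HOURLY_RETENTION = timedelta(hours=24)
DAILY_RETENTION_DAYS = 31       # "n days": often 28 or 31
WEEKLY_TAPE_OUT_DAY = 6         # Sunday (Monday == 0)

def classify_snapshot(taken: datetime, now: datetime) -> dict[str, bool]:
    last_of_day = taken.hour == 23   # final snapshot of the day becomes the "daily"
    return {
        "keep_as_hourly": now - taken <= HOURLY_RETENTION,
        "keep_as_daily": last_of_day and (now - taken).days <= DAILY_RETENTION_DAYS,
        "write_to_tape": last_of_day and taken.weekday() == WEEKLY_TAPE_OUT_DAY,
    }

print(classify_snapshot(datetime(2012, 10, 21, 23, 0), datetime(2012, 10, 22, 9, 0)))
```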

Nothing in the above data protection strategy is particularly complex – but equally, none of it is really possible when considering backups in isolation. As soon as backups are considered alongside the other activities in ILP (RAID, replication and snapshots), advanced and flexible strategies such as the above become available.

So before you design your approach to your next data protection challenge, ask yourself the following question:

Does this need a backup strategy, or does it need an Information Lifecycle Protection strategy?