Disaster Recovery Plan Development
There are several steps, policies, procedures and processes which should be developed in support of a disaster recovery plan. This document will outline the steps and processes to develop a disaster recovery plan. Depending on the size and type of business being analyzed, this disaster recovery plan development document may need more or less detail.
This document attempts to follow guidelines set forth by NIST and specifically is based on the NIST Special Publication 800-34 titled Contingency Planning Guide for Information Technology Systems.
The following project outline is provided solely as a guide. It is only intended to be "one example" of requirements for a disaster recovery project plan. It is not, by any stretch of the imagination, the only way to set up a project plan. Examples in this document are not meant to provide a working technical solution but to briefly provide an example of how to write a plan to cover some items.
The purpose of this document is to provide some guidance for performing disaster recovery planning and for starting the plan. This document only provides examples up to the beginning of the plan development and shows a brief business impact analysis, risk assessment, and budgeting cost justification process.
1.0 Disaster Recovery Plan Policy Statement
The purpose of this statement is to define what your organization wants from a disaster recovery plan. Providing this statement and defining the scope of the project makes it much easier to keep the project on track.
2.0 Business Impact Analysis - Analyze Business Processes
Analyze your organization's business processes. List them from the most important to the least important and identify those that are essential for your business. Estimate how long your business could survive without each process, and how long it could go without the process before serious damage or loss would occur. In addition, try to estimate known losses at a per hour or per day rate for the loss of each primary business process.
For example, if the main function of the business is sales and your business cannot sell items for a period of time, how long would it take before the loss of revenue had a serious negative impact on the business, such as requiring a member of your staff to be laid off? How long would it take before you lost a substantial percentage of your customers, which could cause the business to close?
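As a rough illustration of the ranking described above, the following Python sketch orders processes by how long the business can tolerate losing them and then by hourly loss. The process names come from the list in this section, but every dollar figure and hour value is a made-up assumption, and the names `BusinessProcess`, `rank_processes`, and `outage_loss` are hypothetical. Your own Business Process Impact Analysis Form may rank on different criteria.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    loss_per_hour: float        # estimated loss in dollars per hour of outage (assumed figure)
    max_tolerable_hours: float  # hours before serious damage or loss occurs (assumed figure)


def rank_processes(processes):
    """Return processes most critical first: shortest tolerable outage, then highest hourly loss."""
    return sorted(processes, key=lambda p: (p.max_tolerable_hours, -p.loss_per_hour))


def outage_loss(process, hours_down):
    """Estimated loss for an outage of the given length, using a flat hourly rate."""
    return process.loss_per_hour * hours_down


# Illustrative values only; replace with your own estimates.
processes = [
    BusinessProcess("Human resources", 50.0, 120.0),
    BusinessProcess("Sales", 1200.0, 8.0),
    BusinessProcess("Customer service", 300.0, 24.0),
]

ranked = rank_processes(processes)
for p in ranked:
    print(p.name, p.max_tolerable_hours, outage_loss(p, 4))
```

A flat hourly rate is a simplification. In practice the loss often grows faster the longer an outage lasts, for example once customers start leaving.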
Depending on the type of business, potential business processes may include:
- Customer or public relations
- Customer service
- Finance and treasury
- Accounts Payable
- Accounts Receivable
- Manufacturer production line
- Human resources
- Facilities management and maintenance
- Quality control
- Research and Product Development
- Engineering and Product Design
- Legal services
At this point, do not list support services such as IT and telephone unless the business is primarily an IT or telephone business. Complete the Business Process Impact Analysis Form.
Complete the Business Process Dependency form. In this part of the business process analysis, determine what business infrastructure is critical to the operation of your business. Some of these parts of the infrastructure may include:
- IT Services
- Telephone Services
- Electric Power
- Sales floor
- Office space
These components of infrastructure may include specific offices at specific locations.
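One way to record the dependencies described above is a simple mapping from each business process to the infrastructure it needs. The Python sketch below is only an illustration: the mapping itself is invented for the example, and `affected_processes` is a hypothetical helper, not part of any form referenced in this document.

```python
# Which infrastructure each business process depends on.
# The entries are illustrative assumptions, not recommendations.
dependencies = {
    "Sales": {"IT Services", "Telephone Services", "Electric Power", "Sales floor"},
    "Customer service": {"Telephone Services", "Electric Power", "Office space"},
    "Accounts Payable": {"IT Services", "Electric Power", "Office space"},
}


def affected_processes(dependencies, failed_component):
    """List the processes that depend on the failed infrastructure component."""
    return sorted(
        name for name, needs in dependencies.items() if failed_component in needs
    )


print(affected_processes(dependencies, "Telephone Services"))
print(affected_processes(dependencies, "Sales floor"))
```

Looking at the mapping from the infrastructure side shows which single components, such as electric power here, would interrupt every listed process at once.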
3.0 Risk Assessment - Analyze Threats
Performing this risk assessment will not only help you with your disaster recovery plan, but will also help you develop security policies and procedures that improve your computer security and save money. For example, if you determine that virus incidents are costing your organization significant downtime or loss of productivity, you may decide to change the file attachment policy on your mail server to limit the sending of certain attachment types. This action can cut down on the number of virus incidents in your organization, and it works well if you keep your users advised of the changes and educate them about workarounds.
During the risk assessment, you should analyze threats to all business processes, including the processes that primary processes depend on. These include threats to IT infrastructure, communications, building services and others. The threats should be analyzed against both the processes and their supporting systems. This risk assessment should cover all risks to the organization, including security risks, whether or not they will be addressed in the disaster recovery plan. Possible threats are listed in the document titled "Organizational Threats".
To perform a risk assessment and calculate the possible loss amount, the annual loss expectancy (ALE) should be determined.
The formula used by SANS for IT risk assessment is Risk = Threat * Vulnerability * Asset Value
Basically, you would want to get an estimate of the number of incidents expected per year and the overall cost of each incident.
From a computer security viewpoint, SANS says that the Vulnerability is the weakness in a system that could be exploited, and that the Threat is any event that can cause an undesirable consequence. Since the real goal in calculating Threat * Vulnerability is to determine the expected number of incidents per year, I would estimate the number of events per year by type of threat. Since there are many unknown vulnerabilities and they are especially hard to estimate, you may want to base your calculations on a combination of one or more of the following:
- The number of those event types last year in your organization.
- Adjustments based on organizational changes, improvements, or new vulnerabilities recently materializing.
- Your perceived vulnerability in that threat area times the number of events worldwide.
- Insurance company estimates of the risk of occurrence for a business of your type and in your area. For instance, what percentage of businesses in your area were hit by a tornado in the last few years, and how many had a fire?
To calculate cost per incident, instead of using an exact asset value, I would rather calculate the damage done per incident and use that as the asset value. Damage may be in several forms including:
- Loss or damage to equipment
- Loss of productivity
- Loss of staff time to fix the problem.
- Loss of revenue due to the incident such as loss of sales.
- Loss of sensitive information
These types of losses per incident should be quantified and used to calculate the total loss per incident.
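The calculation described in this section can be sketched in a few lines of Python: add up the damage per incident, then multiply by the expected number of incidents per year to get the annual loss expectancy (ALE). All dollar amounts and incident counts below are invented for illustration, and the function names are hypothetical.

```python
def cost_per_incident(equipment, productivity, staff_time, revenue, information):
    """Total damage for one incident, summing the loss types listed above (in dollars)."""
    return equipment + productivity + staff_time + revenue + information


def annual_loss_expectancy(incidents_per_year, cost):
    """ALE = expected incidents per year * cost per incident."""
    return incidents_per_year * cost


# Assumed example: six virus incidents per year, mostly lost productivity.
virus_cost = cost_per_incident(
    equipment=0, productivity=1500, staff_time=600, revenue=400, information=0
)
virus_ale = annual_loss_expectancy(6, virus_cost)

# Assumed example: a complete server failure expected once every two years.
server_cost = cost_per_incident(
    equipment=8000, productivity=5000, staff_time=2000, revenue=5000, information=0
)
server_ale = annual_loss_expectancy(0.5, server_cost)

print("Virus ALE:", virus_ale)
print("Server failure ALE:", server_ale)
```

Rare events are handled by using a fraction for incidents per year, so an event expected once every two years is entered as 0.5.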
4.0 Recovery Strategy
Once you have determined your threats and have quantified the potential losses, you need to determine cost-effective methods that reduce the chance of the threat happening and/or reduce the amount of damage. One solution mentioned in the earlier example is to have some additional computer equipment or a complete server available for quick build up in case one of the servers is completely destroyed. Other methods include clustered, mirrored, or load balanced servers.
The available technologies and costs will help drive the decision toward the best recovery solution. At this point, you would want to list possible solutions for each scenario which could cause damage to the business. In this example we did not consider server down time but this can be reduced or prevented with some of the following solutions:
- Providing an Uninterruptible Power Supply (UPS) to keep servers running in the case of a power failure.
- Providing an external generator to keep servers and the business running in the case of a long term power failure.
- Use a Redundant Array of Inexpensive Disks (RAID) as a solution to keep servers running and preserve data integrity in the event of a single hard drive failure on a server.
- Use servers with two power supplies in the event of a power supply failure in a server.
- Use clustered, mirrored, or load balanced servers so if one server fails, the other can continue operations and will have a copy of required data in the event of a server failure.
5.0 Establish a Budget
When attempting to establish a budget, it may not be completely obvious how much money should be spent. This is mainly because the threat may never materialize, or it could happen tomorrow. It is somewhat of a gamble. However, it is possible to calculate the real value in dollars per year of some threats to an organization.
At this point, you can decide what amount of money you want to spend. Don't forget to consider maintenance costs associated with the solution you decide to use. Many of these factors are not considered in this document for simplicity. Once a decision is made, management will approve the budget and you will begin to develop the disaster recovery plan.
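As a minimal sketch of the cost justification, the Python below compares the yearly cost of a solution, including maintenance, with the reduction in annual loss it is expected to buy. Spreading the purchase price evenly over the equipment's useful life is my simplifying assumption, and every number is invented. The function names are hypothetical.

```python
def annual_solution_cost(purchase_price, useful_life_years, annual_maintenance):
    """Yearly cost of a solution: purchase price spread evenly over its life, plus maintenance."""
    return purchase_price / useful_life_years + annual_maintenance


def net_annual_benefit(ale_before, ale_after, solution_cost_per_year):
    """Reduction in annual loss expectancy minus what the solution costs per year."""
    return (ale_before - ale_after) - solution_cost_per_year


# Assumed example: a UPS that costs $3000, lasts five years,
# and needs $200 per year in battery maintenance.
ups_cost = annual_solution_cost(3000, 5, 200)

# Assumed example: power failures cost $4000 per year now and $500 with the UPS.
ups_benefit = net_annual_benefit(4000, 500, ups_cost)

print("UPS cost per year:", ups_cost)
print("Net benefit per year:", ups_benefit)
```

A positive result suggests the solution pays for itself on average. A negative result does not automatically rule it out, since some losses, such as losing sensitive information, are hard to price.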
6.0 Develop the Plan
At this point, a budget and a list of threats we want to protect against or mitigate should exist. It is time to develop the disaster recovery plan.
6.1 Plan Objectives and Scope
To begin, the scope and objectives of the plan must be defined including organizations that are covered and affected by the plan. Any related policies or procedures should also be referenced by the plan including:
- Backup plan or policy.
- Emergency contact information and plan.
- Network documentation policies.
The plan should address four phases of a disaster or covered event including:
- Notification
- Recovery
- Reconstitution
- Business continuity
The plan should provide contingencies for the loss of:
Situations and conditions covered or not covered by the plan should be defined including lengths of disruptions and locations that are covered. Systems covered by the contingency plan should be identified. Teams and task responsibilities should be identified by the plan.
6.1.1 Example Scope
This plan provides for reduction of the risk of system and catastrophic failure. It specifically provides for protection of the IT server area, including server redundancy to prevent any long term failure beyond four hours. This plan is supported by the following plans:
- File Backup and Restore Policy - Defines what computers, equipment, and software will be used to perform file backups and what files and systems will be backed up by which devices.
- Network Documentation Policy - Defines the level of network documentation required such as documentation of which switch ports connect to what rooms and computers. Defines who will have access to read it and who will have access to change it. Defines where documentation will be stored.
- Server Documentation Policy - Defines the level of server documentation required such as documentation of server services and configuration. Defines who will have access to read it and who will have access to change it. Defines where documentation will be stored.
- Incident Response Plan - Defines the response to a security incident such as a virus, network intrusion, abuse of a computer system or other situations.
- IT Equipment Purchase and Failure Prevention Policy - Defines technologies to be used in specific areas of functionality to reduce the chance of any serious disruption of service.
- Emergency Contact Plan - Provides an emergency contact plan defining where emergency contact information is stored, the people to contact based on the emergency type, and how employees will learn about closings.
This plan covers incidents with recovery times in excess of four hours. The incident response plan covers incidents that have a recovery time of four hours or less. The IT Equipment Purchase and Failure Prevention Policy covers purchasing of equipment to reduce the chance of service interruptions due to a single point of failure. For the sake of brevity, this plan deals with the IT server area and IT servers only and does not provide for contingencies for the sales floor, or other office areas for the organization. For brevity, this plan covers the loss of facilities only and covers loss of communications through the Emergency Contact Plan.
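The four hour boundary in this example scope can be written down as a simple rule. The Python sketch below only illustrates that boundary. The function name and the idea of working from an estimated recovery time are my assumptions, and a real plan would likely weigh more than one number.

```python
# From the example scope: the Incident Response Plan covers recovery times
# of four hours or less; this plan covers anything longer.
INCIDENT_RESPONSE_LIMIT_HOURS = 4.0


def governing_plan(estimated_recovery_hours):
    """Return which plan covers an incident, based on its estimated recovery time."""
    if estimated_recovery_hours > INCIDENT_RESPONSE_LIMIT_HOURS:
        return "Disaster Recovery Plan"
    return "Incident Response Plan"


print(governing_plan(2))
print(governing_plan(4))
print(governing_plan(36))
```

Stating the boundary this precisely in your scope helps avoid arguments during an incident about which plan, and which team, is in charge.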
6.2 Notification Phase
The discoverer of the disaster shall follow the procedures set in the Incident Response Plan. This section of your plan should be expanded beyond the incident response plan and should consider possible communications outages.
6.3 Recovery Phase
Defines how services are switched to the temporary interim facilities. Provides for repair or replacement of the original facilities. Tasks are assigned to the teams defined in step 6.1.
6.4 Reconstitution Phase
Defines when services and/or personnel will be moved back to the original facility or replacement facility.
7.0 Test the Plan
An in-depth test should be done at least every five years or whenever the plan changes significantly. I would recommend running some drills which assume that specific things have happened, such as:
- Some specific facility has been destroyed (part of a disaster drill).
- Some staff members are dead or not available.
- Specific communications are not working.
One or more of these things may be in effect.
Once these assumptions are established, make the disaster recovery teams show they can do their jobs without access to destroyed facilities. Make them produce:
- Their procedures for doing their jobs.
- Documentation about the network, system configuration, security policies, points of contact including vendor information and other business contacts.
- Software and license information.
- Purchasing information required to purchase replacement equipment.
This is not a comprehensive list of items that should be tested; it is only a basic starting list.
8.0 Plan Maintenance
Review your plan at least once per year to be sure it is current and effective.
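If you track the dates of your last review and last in-depth test, a small check like the Python sketch below can flag when either is overdue. The intervals follow this document (a yearly review and an in-depth test every five years), counted here in plain days as a simplification. The dates and the `is_overdue` helper are hypothetical.

```python
from datetime import date

REVIEW_INTERVAL_DAYS = 365          # review at least once per year
FULL_TEST_INTERVAL_DAYS = 5 * 365   # in-depth test at least every five years


def is_overdue(last_done, today, interval_days):
    """True if more than interval_days have passed since the activity was last done."""
    return (today - last_done).days > interval_days


# Assumed dates for illustration only.
today = date(2024, 6, 1)
last_review = date(2023, 1, 15)
last_full_test = date(2021, 3, 1)

print("Review overdue:", is_overdue(last_review, today, REVIEW_INTERVAL_DAYS))
print("In-depth test overdue:", is_overdue(last_full_test, today, FULL_TEST_INTERVAL_DAYS))
```

A significant change to the plan should also trigger a new in-depth test regardless of the date, which this check does not capture.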