The Situation
So, you work for a small retail company that pulls in about $15 million in annual sales. You’re the person in charge of the 10 servers running at HQ, which manage the 15 stores around the tri-state area. The stores are open 8am to 8pm on weekdays, 9am to 7pm on Saturday, and closed on Sunday. That means you have 70 hours of system uptime each week that is critical to your stores, but you also run a 24-hour online marketplace that contributes to the overall revenue.
Like most businesses, you have deployed some level of recovery plan for individual applications, entire servers and the complete datacenter. You probably know your hourly downtime cost, including indirect costs such as the additional labor to get back online, lost data and reputational damage.
Your recovery plan covers ‘error events’ caused by people, servers, providers or even Mother Nature, and it includes a defined Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Maybe you are paying a service company to provide new kit within a specified service level agreement (SLA). You have defined your Work Recovery Time (WRT) and got sign-off from the business on your Maximum Tolerable Downtime (MTD) for any failure event. With your daily backup routine, which includes a full backup and regular incremental saves, you expect to start the restore process as soon as the new equipment is available. Having tested this process within the past year, you’re confident it works as designed.
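To make the relationship between these objectives concrete, here is a minimal Python sketch. It assumes the common planning rule that the time to restore systems (RTO) plus the time to verify and catch up on work (WRT) must fit inside the MTD. The function name and all of the durations are hypothetical and are not taken from the scenario.

```python
from datetime import timedelta


def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """Return True when system restore time plus work recovery time fits in the MTD."""
    return rto + wrt <= mtd


# Hypothetical objectives, chosen only for illustration.
rto = timedelta(hours=8)
wrt = timedelta(hours=4)
mtd = timedelta(hours=24)

plan_ok = fits_within_mtd(rto, wrt, mtd)

# Same plan, but the work recovery stretches to days of manual data re-entry.
slow_recovery_ok = fits_within_mtd(rto, timedelta(days=5), mtd)

print(plan_ok, slow_recovery_ok)
```

The point of the check is that the WRT term is the one most likely to grow when data has to be rebuilt by hand.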
The Issue
In this instance, Mother Nature decides to take out your data center 35 minutes before the next scheduled incremental backup. You declare a disaster and your new equipment is available within the defined SLA. The restores start for all 10 servers and eventually you are back to the last known good point, which is 3 hours and 25 minutes before the failure. How do you recover the lost data?
This is where the recovery begins; everything before this point is just replacement of kit and resetting of data.
The Resolution
Now it is time to piece together the transactions through emails, logs, journals and other methods. If the logs were stored someplace else that is safe, you may be able to pull them down and replay them against the database, but will that keep the information on the other servers at the same point in time? Do you need to replay the entire workflow for each transaction? There are a lot of questions that need to be asked to get through this. It might take several weeks or longer to get back what you think is everything.
The Solution
Looking back on that situation, you can clearly identify what worked well and what did not. If the disaster had occurred just after the save, things would have been very different. How can you provide predictable recovery times for almost any incident without spending enterprise-level money?
RPO is the biggest variable, so let’s focus on that. Most backup solutions provide a full backup image plus incremental backups, which shorten the save time but adversely impact the restore time. Once the restore is done, you’re only at the point in time of the last save.
Since incremental backups are faster, you can take more of them, but most solutions limit incremental backups to one every few hours. This gives you multiple points in time to recover to, but it requires more time and overhead to restore. It does provide smaller RPO windows from which to start your recovery process, but the recovery is still a variable duration that can exceed your WRT and MTD requirements.
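The trade-off can be sketched in a few lines of Python. The schedule below is an assumption: a full backup at midnight with incrementals every four hours, which is consistent with the 35 minutes and the 3 hours 25 minutes quoted in the scenario but is not stated in it. The `Backup` class and `restore_chain` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups, failure_time):
    """Backups to restore, oldest first, to reach the last point before the failure."""
    usable = sorted(
        (b for b in backups if b.taken_at <= failure_time),
        key=lambda b: b.taken_at,
    )
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup exists before the failure")
    # Start at the most recent full backup, then apply every later incremental.
    return usable[full_positions[-1]:]


# Assumed schedule: full backup at midnight, incrementals every 4 hours.
day = datetime(2024, 1, 8)
backups = [Backup(day, "full")] + [
    Backup(day + timedelta(hours=h), "incremental") for h in (4, 8, 12, 16, 20)
]
failure = day + timedelta(hours=15, minutes=25)  # 35 minutes before the 16:00 save

chain = restore_chain(backups, failure)
data_loss = failure - chain[-1].taken_at

print(len(chain))  # 4 restore steps: one full plus three incrementals
print(data_loss)   # 3:25:00
```

More frequent incrementals shrink `data_loss` but lengthen `chain`, which is why the restore time stays variable.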
What if you could shrink that RPO to the point of failure and achieve a predictable WRT? A byte-level replication tool, like OpenText Availability, maintains a ‘crash-consistent’ copy of the entire system, not just the data. OpenText Availability is a byte-level replication engine that keeps a target environment in sync with the source using file system I/O. It has a low impact on the servers and on the network and, in certain environments, can operate through a single appliance for a smaller footprint.
With OpenText Availability, you can fail over to your DR system in 15 minutes or less and start your recovery right then. The byte-level replication engine is designed for availability: data is sent to the target immediately, leaving no lengthy gaps in your RPO. Odds are the database rolls back the last transaction and is ready to go, virtually eliminating the long data recovery process.
OpenText’s replication engine is a proven technology that has been around for more than 30 years. OT Replication is agent-based, point-to-point, asynchronous byte-level replication that sends changed bytes, not files or blocks, to the destination for processing. It crosses infrastructure, so you can replicate and recover into any physical, hyperscale, hypervisor or hybrid environment. You can optimize the cost of your DR systems by sizing them to fit the need, then resizing during a DR event. Reverting to your replacement kit can be as simple as performing a cutover in 15 minutes or less, with minimal risk and limited impact on the business.
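As a rough illustration of what “sends changed bytes, not files or blocks” means, the Python sketch below captures a write as an offset plus the bytes written and replays it on a target copy. It is a conceptual model only: the `WriteOp` class and `apply_op` function are hypothetical and do not represent the product’s internal format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteOp:
    path: str
    offset: int
    data: bytes


def apply_op(target_files: dict[str, bytearray], op: WriteOp) -> None:
    """Replay one captured write on the target copy of the file."""
    buf = target_files.setdefault(op.path, bytearray())
    end = op.offset + len(op.data)
    if len(buf) < end:
        # Grow the target file if the write extends past its current end.
        buf.extend(b"\x00" * (end - len(buf)))
    buf[op.offset:end] = op.data


# The target starts as a copy of the source file (the initial mirror).
target = {"ledger.dat": bytearray(b"balance=0100;")}

# The source application overwrites four bytes; only those bytes are captured.
op = WriteOp(path="ledger.dat", offset=8, data=b"0250")
apply_op(target, op)

sent = len(op.data)
file_size = len(target["ledger.dat"])
print(bytes(target["ledger.dat"]), sent, file_size)
```

Here 4 bytes cross the network instead of the whole 13-byte file, and the same idea scales to large database files where a transaction touches only a small region.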
Add-On Benefits
When considering availability and DR tools, it’s best to consider the ability to do more than a backup and restore. Compared to point-in-time (PIT) tools, OpenText Replication can perform additional functions to simplify and optimize your entire recovery process.
Optimized DR Traffic
- You can optimize your recovery process by excluding any temporary files that get recreated after a restart. Temporary files generate significant disk writes but are deleted when the application restarts.
- If these files help with recovery or are required for audits, you can create a second job to replicate them to a different location, on a different network, to minimize the impact. At a failure event, these files are in a crash-consistent state as well.
Perform no-impact DR tests
- Perform individual, group or entire datacenter DR testing without impacting production. Establish a “bubble network”, assign a second NIC, cut over to that network and test your DR processes, all while the production environment is running.
Failover to a known Point-in-Time
- The OpenText Availability license provides the ability to take multiple target-side snapshots. At failover, choose ‘live data’ or a ‘snapshot’ to recover to. During a DR event the RPO is crash-consistent; however, there may be scenarios where a recent PIT can recover faster. A case in point is a complex, long-running batch process. If you snapshot the target prior to the batch, it may be faster to fail over to that snapshot and just restart the batch.
- Additionally, if you take target snapshots through the OpenText product, you may also benefit from simplified malware recovery. If you get hit with malware, fail over to a point in time prior to the attack, clean out the malware code, then restart your processes.
Revert with a Difference Mirror
- OpenText’s replication engine can identify data already on the destination disks, perform a byte-level CRC check and send only the differences, without stopping production use. This reduces network overhead and shortens the re-synchronization process.
Utilize optimized kit for DR
- With OpenText replication, your target hardware can be different from the source. Only disk changes are sent, so you can reduce the active hardware footprint at your DR site and expand as needed during cutover. Pre- and post-failover scripts can be executed to automate actions that address environmental or application requirements, like adding processors or memory.
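The difference mirror described under “Revert with a difference Mirror” can be approximated with a short Python sketch: checksum fixed-size regions on both sides and send only the regions whose checksums differ. This is a simplified model, not the product’s actual algorithm. The chunk size, the use of `zlib.crc32` and the function names are all assumptions, and a real tool would also have to guard against checksum collisions.

```python
import zlib


def chunk_crcs(data: bytes, chunk_size: int) -> list[int]:
    """CRC32 of each fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def difference_mirror(source: bytes, target: bytes, chunk_size: int = 4096):
    """Return (offset, bytes) pairs for source chunks that differ from the target."""
    src = chunk_crcs(source, chunk_size)
    tgt = chunk_crcs(target, chunk_size)
    diffs = []
    for idx, crc in enumerate(src):
        if idx >= len(tgt) or crc != tgt[idx]:
            offset = idx * chunk_size
            diffs.append((offset, source[offset:offset + chunk_size]))
    return diffs


def apply_diffs(target: bytes, diffs, final_length: int) -> bytes:
    """Patch the target with the received chunks and match the source length."""
    buf = bytearray(target[:final_length].ljust(final_length, b"\x00"))
    for offset, data in diffs:
        buf[offset:offset + len(data)] = data
    return bytes(buf)


# Tiny chunk size so the example is easy to follow.
source = b"AAAABBBBCCCCDD"
target = b"AAAAXXXXCCCC"  # stale copy already on the destination disk

diffs = difference_mirror(source, target, chunk_size=4)
resynced = apply_diffs(target, diffs, len(source))
bytes_sent = sum(len(data) for _, data in diffs)

print(diffs, bytes_sent, resynced == source)
```

In this example only 6 of the 14 bytes are sent, because the first and third chunks already match on the destination.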
Conclusion
Shorten your WRT and achieve your MTD objectives without breaking the bank on complex and expensive hardware, tools and processes. Look at integrating an agent-based, real-time replication process into your DR planning to gain better availability, flexibility and predictability when recovering individual servers, groups of servers or entire datacenters.
A tool like OpenText Availability also provides on-boarding, consolidation or distribution of data, upgrades to your Microsoft SQL Servers and more.
Learn more about using this tool for Managed Service Providers. Increase Windows and Linux server availability using the same processes and procedures across any infrastructure, all while reducing risk and keeping costs down.
https://www.carbonite.com/business/products/availability/
Join the discussion at OpenText’s Cybersecurity user forum to find insights and ask questions related to any of the OpenText Cybersecurity or data protection products!