Design for Failure and Nothing Will Fail
Be a pessimist when designing architectures: assume things will fail. In other words, always design, implement, and deploy for automated recovery from failure. Thinking like a pessimist forces you to consider recovery strategies at design time, which leads to a better overall system design.
- Assume that your hardware will fail.
- Assume that outages will occur.
- Assume that some disaster will strike your application.
- Assume that you will someday be hit with more than the expected number of requests per second.
- Assume that your application software will eventually fail too.
If you accept that things fail over time, incorporate that thinking into your architecture, and build mechanisms to handle failure before disaster strikes, you will end up with a fault-tolerant, scalable architecture that is optimized for the cloud.
We should ask ourselves: What happens when a node fails? How do we recognize that failure? How do we replace that node or bring it back up? Which scenarios do we have to plan for? What are the single points of failure, and what is the mitigation plan for each? What happens when the load balancer fails? In a master-slave environment, how does failover occur, and how is a new slave instantiated and brought into sync with the master?
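One common way to answer "How do we recognize that failure?" is a heartbeat: each node reports in periodically, and a node that stays silent for too long is treated as failed. The following is a minimal sketch of that idea, not a production detector. The class name `HeartbeatMonitor`, its methods, and the timeout value are hypothetical and chosen only for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (hypothetical helper)."""

    def __init__(self, timeout_seconds):
        # A node is considered failed once it has been silent longer than this.
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record(self, node, now):
        """Store the time (in seconds) at which `node` last reported in."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the sorted names of nodes whose heartbeat is overdue."""
        return sorted(
            node
            for node, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.record("node-a", now=0)
monitor.record("node-b", now=5)
overdue = monitor.failed_nodes(now=12)  # only node-a has been silent for more than 10 s
```

Passing the current time in explicitly keeps the sketch easy to test. A real deployment would read a clock and would usually trigger a replacement or failover step for each overdue node.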
Just as we design for hardware failure, we also need to handle software failure. We should ask: What happens to my application if the interface of an application it depends on changes? What if a service I connect to times out, returns an exception, or responds too slowly? What if an instance exceeds its memory limit? What happens when I run out of connections? What happens when wait times grow too long?
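For the case of a dependent service that times out or returns an exception, one widely used mitigation is to retry the call with exponential backoff and to give up cleanly after a fixed number of attempts. The sketch below assumes the dependency is wrapped in a plain Python callable. The names `call_with_retry` and `DependencyError`, and the default attempt count and delay, are illustrative assumptions rather than part of any particular library.

```python
import time


class DependencyError(Exception):
    """Raised when the dependency still fails after all attempts."""


def call_with_retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(); on any exception, wait and retry with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # a real system would catch narrower types
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # All attempts failed: surface one clear error, keeping the original cause.
    raise DependencyError(f"gave up after {attempts} attempts") from last_error
```

Retries only help with transient failures. For a dependency that stays down, the caller still needs a fallback, such as a cached response or a degraded mode.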
In short, we need to build mechanisms that can handle failures. For example, the following strategies can help in the event of a failure:
- Automate the backup and restore strategy for data
- Build processes so that they resume on reboot
- Closely monitor all key performance indicators (KPIs); a number of open source and paid monitoring tools are available
- Automate manual processes wherever possible, for example with robotics, automation tooling, or AI
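As an illustration of a process that resumes after a reboot, the sketch below records its progress in a small checkpoint file after every completed item, so a restarted run continues where the previous one stopped. The file layout, the function names (`load_checkpoint`, `save_checkpoint`, `process_all`), and the choice of JSON are assumptions made for this example only.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint to a temp file, then swap it into place.

    os.replace is atomic, so a crash mid-write cannot leave a half-written
    checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_all(items, path, handle):
    """Process items in order, skipping those finished before a restart."""
    start = load_checkpoint(path)
    for index in range(start, len(items)):
        handle(items[index])
        save_checkpoint(path, index + 1)
```

Because the checkpoint is saved only after `handle` returns, an item that was in flight during a crash is processed again on restart, so `handle` should be safe to repeat.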