1. The riskiness of a mitigation should scale with the severity of the outage
2. Recovery mechanisms should be fully tested before an emergency
3. Canary all changes
4. Have a “Big Red Button”
5. Unit tests alone are not enough - integration testing is also needed
6. COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
7. Intentionally degrade performance modes
8. Test for Disaster resilience
9. Automate your mitigations
10. Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong
11. A single global hardware version is a single point of failure

Source: Lessons Learned from Twenty Years of Site Reliability Engineering

This is a good list. More details are in the article.
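
To make lessons 3 ("Canary all changes") and 4 (the "Big Red Button") a bit more concrete, here is a minimal sketch of a staged rollout that stops and rolls back if a health check fails or someone presses the button. It is not taken from the article: the `Rollout` class, the stage fractions, and the `is_healthy` callback are all hypothetical names chosen for illustration, and a real system would act on live traffic rather than a list of numbers.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rollout:
    """Hypothetical staged rollout with a canary gate and a kill switch."""
    stages: List[float]                  # traffic fractions, smallest (canary) first
    is_healthy: Callable[[float], bool]  # assumed health check, run after each stage
    halted: bool = False                 # state of the "Big Red Button"
    history: List[float] = field(default_factory=list)

    def press_big_red_button(self) -> None:
        # Blocks any further expansion; the next step of run() rolls back.
        self.halted = True

    def _rollback(self) -> float:
        # Record that the change now serves no traffic.
        self.history.append(0.0)
        return 0.0

    def run(self) -> float:
        """Return the traffic fraction still serving the change (0.0 = rolled back)."""
        current = 0.0
        for fraction in self.stages:
            if self.halted:
                return self._rollback()
            current = fraction
            self.history.append(current)
            # Canary gate: only widen the rollout while the current stage looks fine.
            if not self.is_healthy(current):
                return self._rollback()
        return current


# A change that only misbehaves once it serves 5% or more of traffic:
# the 1% canary passes, the 10% stage fails, and the rollout is reverted.
bad = Rollout(stages=[0.01, 0.1, 1.0], is_healthy=lambda f: f < 0.05)
print(bad.run(), bad.history)
```

The point of the sketch is only the shape of the control flow: exposure grows in small steps, every step is gated on a check, and there is one simple, always-available action that stops everything.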