OpsDev

From an internal email at a very, very big corporation:

Incident Background:
BIGPROJECT has been unavailable since APAC SOD due to a data refresh activity being wrongly triggered in from UAT to Production environments.

Business Impact

  1. BIGPROJECT is unavailable for all users in the bank
  2. BIGPROJECT2 platform which sits on BIGPROJECT is unavailable this includes the Click-to-chat service

Current Status

  • Initial attempt of flashback Database to restore from the last good restore point failed due to errors due to absence of flashback logs– this was a quicker option, but now ruled out.
  • Currently going ahead with full restoration in the Primary database – this activity is tentatively supposed to take 8-9 hours (in place of 6 hours earlier mentioned)

  • […manual recovery instruction follows… ]

Five hours later, in another email, they dare to say:

  1. Currently 32% of database back up is completed and will take approximately 8-14 hours.

Let’s explain

BIGPROJECT is the internal trouble-ticketing and change-management software, so the entire bank cannot deliver software today…
So what happened? We can try to translate the email in a more “ops-dev” way…

  1. Someone pressed a key and destroyed the production DB schema
  2. We were unable to restore the database using a trick called Oracle Flashback
  3. Our recovery strategy will take more than 14 hours to complete
  4. Stay tuned for some thrilling news from us
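
Where does "more than 14 hours" come from? Here is a back-of-the-envelope sketch in Python. It assumes (my assumption, the emails do not say so) that the restore had been running for about 5 hours when the 32% figure was reported, and that progress is roughly linear. The function name is made up for illustration.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: total time = elapsed time / fraction completed."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# Figures from the second email: 32% done, roughly 5 hours in (assumed).
total = estimate_total_hours(5, 0.32)
remaining = total - 5
print(f"total: {total:.1f} h, remaining: {remaining:.1f} h")
# prints "total: 15.6 h, remaining: 10.6 h"
```

Under those assumptions the whole restore lands well past the 8-9 hours promised in the first email.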

By the way, Oracle Flashback is not meant to replace your backups.
DevOps is a state of mind.
You must have a reasonably fast recovery procedure for mission-critical applications, and it must be completely automatic. Not a trial-and-error approach based on slow tape backups.
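
To make "fast and completely automatic" a bit more concrete, here is a minimal sketch of a recovery drill in Python. It runs every step without human input, times the whole thing, and compares it with a recovery time objective. Everything here is hypothetical: the step names are stubs, and the 4-hour objective is an assumption, not a figure from the emails.

```python
import time
from typing import Callable, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_recovery_drill(
    steps: Sequence[Step],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, bool]:
    """Run each step in order with no manual intervention.

    Returns (elapsed_seconds, within_objective). Any exception raised by a
    step aborts the drill, because a failed step means no recovery.
    """
    started = clock()
    for name, step in steps:
        print(f"running step: {name}")
        step()
    elapsed = clock() - started
    return elapsed, elapsed <= rto_seconds


# Hypothetical stub steps; real ones would call your backup tooling.
def restore_latest_backup() -> None:
    pass


def verify_schema() -> None:
    pass


elapsed, ok = run_recovery_drill(
    [("restore", restore_latest_backup), ("verify", verify_schema)],
    rto_seconds=4 * 3600,  # assumed objective of 4 hours
)
print("within objective" if ok else "too slow, fix the procedure")
```

The point of running something like this on a schedule is that you find out how long a restore takes before production is down, not while everybody is waiting.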
