OpsDev

Mar 23, 2018 · 2 min read

From an internal email of a very very big corporate company

Incident Background: BIGPROJECT has been unavailable since APAC SOD due to a data refresh activity being wrongly triggered in from UAT to Production environments.
Business Impact

BIGPROJECT is unavailable for all users in the bank

BIGPROJECT2 platform which sits on BIGPROJECT is unavailable this includes the Click-to-chat service

Current Status

Initial attempt of flashback Database to restore from the last good restore point failed due to errors due to absence of flashback logs– this was a quicker option, but now ruled out.

Currently going ahead with full restoration in the Primary database – this activity is tentatively supposed to take 8-9 hours (in place of 6 hours earlier mentioned)

[...manual recovery instruction follows... ]

Five hours later, in another email, they dared to say:

Currently 32% of database back up is completed and will take approximately 8-14 hours.
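
A quick back-of-the-envelope check of that estimate. This is only a sketch: we assume the restore started roughly 5 hours before this second email and that it proceeds at a constant rate. The email states neither, and the function name `remaining_hours` is ours.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: hours still needed if progress keeps the same pace."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total duration at the observed rate.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


if __name__ == "__main__":
    # Assumed inputs: 5 hours elapsed, 32% complete (figures taken from the emails).
    left = remaining_hours(5, 0.32)
    print(f"about {left:.1f} hours left, about {left + 5:.1f} hours in total")
```

Under those assumptions the remaining time lands inside the quoted 8-14 hour range, and the total is well beyond the 6 hours first promised.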

Let's explain

BIGPROJECT is THE internal trouble ticketing + change management software, so the entire bank cannot deliver software today... So what happened? We can try to translate the email in a more "ops-dev" way...

Someone clicked a button, made a wrong "promote" in production and altered the production database schema.
They were unable to restore the database using a trick called Oracle Flashback.
Their recovery strategy will take more than 14 hours to complete. In the meantime the entire bank cannot deploy anything. Hope you did not have some urgent need.
Stay tuned for some thrilling news (are you still with us? have you fainted?)

By the way, Oracle Flashback is not meant to replace your backup. DevOps is a state of mind. You must have a reasonably fast recovery procedure for mission-critical applications, and it must be completely automatic. Not a trial-and-error approach based on slow tape backups.
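
To make "completely automatic" a bit more concrete, here is a minimal sketch of a scheduled restore drill: it runs a restore step, times it, and reports whether it finished within a recovery time objective (RTO). Everything here is hypothetical: `run_restore_drill`, `fake_restore` and the one-hour RTO are illustrative names and values, and a real `restore` callable would invoke your actual restore tooling against a scratch environment.

```python
import time
from typing import Callable


def run_restore_drill(restore: Callable[[], None], rto_seconds: float) -> dict:
    """Run one restore attempt and check it against the RTO."""
    start = time.monotonic()
    try:
        restore()
        succeeded = True
    except Exception:
        # A restore that raises counts as a failed drill.
        succeeded = False
    elapsed = time.monotonic() - start
    return {
        "succeeded": succeeded,
        "elapsed_seconds": elapsed,
        "within_rto": succeeded and elapsed <= rto_seconds,
    }


def fake_restore() -> None:
    """Hypothetical stand-in for a real restore into a scratch environment."""
    time.sleep(0.01)


if __name__ == "__main__":
    # Assumed RTO of one hour, purely for illustration.
    print(run_restore_drill(fake_restore, rto_seconds=3600))
```

Run regularly, a drill like this tells you how long recovery really takes before production forces you to find out.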