19c Observe-Only Data Guard FSFO: no split-brain risk in manual failover
Fast-Start Failover (FSFO) is an amazing feature of Oracle Data Guard Broker which adds High Availability (HA) to the Disaster Recovery (DR) protection that Data Guard already provides.
Data Guard as an HA solution
By default, a physical standby database is a Disaster Recovery solution (for when your data center is on fire, underwater, or hit by a power cut,…). But the failover requires a manual action. So even if the failover itself is quick (seconds to minutes) and loses no data (if in SYNC), it cannot be considered HA, because the manual decision can take hours. The point of the manual decision is to understand the cause first: with a transient failure it may be better to just wait, especially if the standby site is less powerful and application performance would be degraded.
With FSFO, the failure of the primary database is automatically detected (with an observer process constantly testing the connection from another site) and then the failover to a designated standby is initiated. If Transparent Application Failover (TAF) is correctly configured, the application will directly reconnect to the new primary without your intervention.
How does the observer decide that the primary is down? By default, the failover is triggered when the primary can be reached by neither the observer nor the standby. In 12c, thanks to the ObserverOverride property, it is also possible to have the observer initiate the failover as soon as it cannot connect to the primary, even when the standby can still see the primary. This can be used when the observer is on the application side.
12cR2 brought additional possibilities for complex Data Guard configurations, such as defining multiple observers and multiple targets.
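The rule above can be sketched as a small shell function. This is only a model of the behavior described in this paragraph, not the broker's actual implementation: the function name and its yes/no arguments are hypothetical, and the real observer also waits for the threshold and checks that the standby is ready.

```shell
#!/bin/sh
# Hypothetical model of the failover trigger rule described above.
# Arguments (each "yes" or "no"):
#   $1 = the observer can reach the primary
#   $2 = the standby can reach the primary
#   $3 = ObserverOverride is set to TRUE
would_trigger_failover() {
  observer_sees=$1
  standby_sees=$2
  override=$3
  # As long as the observer reaches the primary, nothing is triggered.
  if [ "$observer_sees" = "yes" ]; then
    echo no
    return
  fi
  # The observer lost the primary: fail over if the standby lost it too,
  # or if ObserverOverride lets the observer decide alone.
  if [ "$standby_sees" = "no" ] || [ "$override" = "yes" ]; then
    echo yes
  else
    echo no
  fi
}

would_trigger_failover no yes yes   # prints "yes"
```

With ObserverOverride left at its default, the same call with `no yes no` prints "no", because the standby still sees the primary.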
No human decision?
However, while FSFO is faster than a manual decision, it may initiate a failover when we do not want one. If the observer is on the same site as the standby (bad idea), then a network failure between the two sites will trigger the failover. When running FSFO, you must be sure that all your infrastructure is OK so that no undesired failover is initiated, and that no manual tasks are needed to get the applications running again.
Enable FSFO in observe-only mode
If that is not the case, you will not want to automate the failover. This is where the 19c ‘observe only’ mode is interesting: the observer reports the failure but does not initiate a failover. I enable FSFO in this mode:
[oracle@cloud ~]$ dgmgrl / "enable fast_start failover observe only"
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed Mar 6 10:02:01 2019
Version 19.2.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
Welcome to DGMGRL, type "help" for information.
Connected to "CDB1A"
Connected as SYSDG.
Enabled in Observe-Only Mode.
And start the observer (you should use the full syntax to run it in background with a log directory, but this is just for a small test which logs on my screen):
[oracle@cloud ~]$ dgmgrl sys/oracle "start observer" &
[1] 26589
[oracle@cloud ~]$ DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed Mar 6 10:04:50 2019
Version 19.2.0.0.0
Copyright (c) 1982, 2019, Oracle and/or its affiliates. All rights reserved.
Welcome to DGMGRL, type "help" for information.
Connected to "CDB1A"
Connected as SYSDBA.
Observer 'cloud' started
[W000 2019-03-06T10:04:50.342+00:00] Observer trace level is set to USER
[W000 2019-03-06T10:04:50.342+00:00] Try to connect to the primary.
Now I simulate a crash of the primary database by killing the PMON process:
[oracle@cloud ~]$ kill -9 $(pgrep -f ora_pmon_CDB1A)
The observer immediately logs the following:
[oracle@cloud ~]$ [W000 2019-03-06T10:37:50.198+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:50.198+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 30 seconds
The default 30-second threshold is there to avoid a failover decision in case of a transient network failure.
After this threshold, a fast-start failover would have been initiated in normal FSFO mode, but here it is only reported in the log:
[W000 2019-03-06T10:37:51.198+00:00] Try to connect to the primary.
ORA-12537: TNS:connection closed
Unable to connect to database using //cloud:1521/CDB1A_DGMGRL
[W000 2019-03-06T10:37:51.235+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:52.235+00:00] Try to connect to the primary.
[W000 2019-03-06T10:37:53.287+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:37:54.288+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:17.902+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:17.902+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 3 seconds
[W000 2019-03-06T10:38:18.902+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:19.952+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:19.952+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 1 second
[W000 2019-03-06T10:38:20.952+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:22.003+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:22.003+00:00] Fast-Start Failover threshold has expired.
[W000 2019-03-06T10:38:22.003+00:00] Try to connect to the standby.
[W000 2019-03-06T10:38:22.003+00:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2019-03-06T10:38:22.003+00:00] Check if the standby is ready for failover.
[W000 2019-03-06T10:38:22.006+00:00] A fast-start failover would have been initiated...
[W000 2019-03-06T10:38:22.006+00:00] Unable to failover since this observer is in observe-only mode
[W000 2019-03-06T10:38:22.006+00:00] Fast-Start Failover is not possible because observe-only mode.
[W000 2019-03-06T10:38:23.007+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:24.058+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:25.058+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:26.110+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:27.110+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:50.727+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:50.727+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 3 seconds
[W000 2019-03-06T10:38:51.727+00:00] Try to connect to the primary.
[W000 2019-03-06T10:38:52.777+00:00] Primary database cannot be reached.
[W000 2019-03-06T10:38:52.777+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 1 second
...
Of course, this is something you should monitor, so that you can make the manual decision. It is useful to run FSFO in this ‘dry-run’ mode when you want to be sure that your infrastructure is OK, without false alerts, before going fully automated. But even when you don’t want to go to fully automated FSFO, this mode is very helpful for the post-failover tasks.
Let’s say I decide to fail over. I do not stop the observer. I do not disable FSFO. I just initiate the failover manually:
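As a starting point for that monitoring, here is a minimal sketch that counts how many times the observer reported that it would have failed over. It assumes the observer output is written to a log file (in my test above it only went to the screen); the function name and the log path in the usage comment are hypothetical, and the message text is the one shown in the observer log above.

```shell
#!/bin/sh
# Count the 'would have failed over' reports in an observer log.
# $1 = path to the observer log file (location is an assumption).
count_would_be_failovers() {
  # grep -c prints the number of matching lines (0 when none match);
  # "|| true" keeps a zero-match result from being treated as an error.
  grep -c 'A fast-start failover would have been initiated' "$1" || true
}

# Usage (hypothetical path):
#   count_would_be_failovers /path/to/observer.log
```

Any non-zero count is a moment where fully automated FSFO would have switched to the standby, which is exactly what you want to review during the dry-run period.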
DGMGRL> failover to cdb1b;
Failover succeeded, new primary is "cdb1b"
The observer is still running and detects that the primary has changed:
[W000 2019-03-06T13:13:25.182+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:25.182+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 2 seconds
[W000 2019-03-06T13:13:26.183+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:27.233+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:27.233+00:00] Fast-Start Failover threshold has expired.
[W000 2019-03-06T13:13:27.233+00:00] Try to connect to the standby.
[W000 2019-03-06T13:13:27.233+00:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2019-03-06T13:13:27.280+00:00] Check if the standby is ready for failover.
[W000 2019-03-06T13:13:27.285+00:00] A fast-start failover would have been initiated...
[W000 2019-03-06T13:13:27.285+00:00] Unable to failover since this observer is in observe-only mode
[W000 2019-03-06T13:13:27.285+00:00] Fast-Start Failover is not possible because observe-only mode.
[W000 2019-03-06T13:13:28.284+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:28.284+00:00] Primary database cannot be reached.
[W000 2019-03-06T13:13:28.284+00:00] Fast-Start Failover observe-only mode enabled.
[W000 2019-03-06T13:13:28.284+00:00] Will not attempt a Fast-Start Failover.
[W000 2019-03-06T13:13:28.284+00:00] Retry connecting to primary.
[W000 2019-03-06T13:13:29.284+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:30.285+00:00] Primary database has changed to cdb1b.
[W000 2019-03-06T13:13:30.337+00:00] Try to connect to the primary.
[W000 2019-03-06T13:13:30.337+00:00] Try to connect to the primary //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:13:30.380+00:00] The standby cdb1a needs to be reinstated
[W000 2019-03-06T13:13:30.380+00:00] Try to connect to the new standby cdb1a.
[W000 2019-03-06T13:13:30.380+00:00] Connection to the primary restored!
[W000 2019-03-06T13:13:32.380+00:00] Connection to the new standby restored!
[W000 2019-03-06T13:13:32.380+00:00] Disconnecting from database //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:13:33.384+00:00] Failed to ping the new standby.
[W000 2019-03-06T13:13:34.385+00:00] Try to connect to the new standby cdb1a.
[W000 2019-03-06T13:13:36.385+00:00] Connection to the new standby restored!
[W000 2019-03-06T13:13:36.388+00:00] Failed to ping the new standby.
[W000 2019-03-06T13:13:37.389+00:00] Try to connect to the new standby cdb1a.
Reinstate old primary as new standby
This is the other automation of FSFO: as soon as the old site comes up again, the situation is detected. This is mandatory in fully automated failover because you don’t want the old, desynchronized primary to come up and have applications connecting to it again, which would be a case of split-brain. FSFO detects this and can even automatically reinstate the old primary as a new standby (this is why FSFO requires the database to be in FLASHBACK ON).
Let’s see it: I start up CDB1A, the one I killed before:
ORACLE instance started.
Total System Global Area 4.2950E+10 bytes
Fixed Size 30386848 bytes
Variable Size 8187281408 bytes
Database Buffers 3.4628E+10 bytes
Redo Buffers 103829504 bytes
Database mounted.
ORA-16649: possible failover to another database prevents this database from being opened
The database cannot be opened, which protects against split-brain.
And then I do nothing, just let the observer do its magic:
SQL> [W000 2019-03-06T13:17:32.728+00:00] Try to connect to the primary //instance-20190305-2110:1522/CDB1B_DGMGRL.
[W000 2019-03-06T13:17:33.728+00:00] Connection to the primary restored!
[W000 2019-03-06T13:17:33.728+00:00] Wait for new primary to be ready to reinstate.
[W000 2019-03-06T13:17:33.731+00:00] New primary is now ready to reinstate.
[W000 2019-03-06T13:17:34.732+00:00] Issuing REINSTATE command.
2019-03-06T13:17:34.732+00:00
Initiating reinstatement for database "cdb1a"...
Reinstating database "cdb1a", please wait...
[W000 2019-03-06T13:17:48.752+00:00] The standby cdb1a is ready to be a FSFO target
Reinstatement of database "cdb1a" succeeded
2019-03-06T13:18:17.381+00:00
[W000 2019-03-06T13:18:17.789+00:00] Successfully reinstated database cdb1a.
Now the old primary is up in a standby role, fully synchronized. This is awesome. Just imagine the following situation. You have a power cut lasting a few hours. You decide to fail over. This is a single command with Data Guard Broker, but you probably have a lot of work to do with all the other systems.
So you postpone reinstating the old primary database. And maybe you forget about it after that exhausting day. And then you run unprotected… Now Murphy’s law breaks the DR site… and you have lost everything. If you run FSFO, the old site has been synchronized again without your intervention and is ready for a new failover. And in 19c this is possible even if you want to keep full control, with a manual decision to fail over.