Configure automatic failover with repmgrd in PostgreSQL

In this blog, we will configure repmgr to perform Automatic failover when primary failure is detected.

We will use the same PostgreSQL Cluster setup we did in blog How to Setup Streaming Replication with repmgr and barman.

Automatic failover is achieved with repmgr daemon (repmgrd) is a management and monitoring daemon which runs on each node in a replication cluster. It can automate actions such as failover and updating standbys to follow the new primary, as well as providing monitoring information about the state of each standby.

Below are the high-level steps we will follow.
1. Configuring repmgrd Parameters
2. Starting the repmgr Daemon
3. Simulating a Failed Primary

Below are setup details and the same will be used in this demonstration.

Sr No. Hostname IP Role
1 test-machine02 192.168.114.176 Master / Primary Server
2 test-machine01 192.168.114.177 Standby / Secondary Server

Step 1. Configuring repmgrd ParametersTo enable the repmgr daemon and automatic failover, a number of other parameters need to be enabled/added. We need to set the below parameters on bot Primary Standby node.

Master Node

[root@test-machine02 ~]#
[root@test-machine02 ~]# vi /etc/repmgr/13/repmgr.conf
failover ='automatic'
promote_command ='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command ='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
:wq!
[root@test-machine02 ~]#

Standby Node

[root@test-machine01 ~]#
[root@test-machine01 ~]# vi /etc/repmgr/13/repmgr.conf
failover ='automatic'
promote_command ='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command ='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
:wq!
[root@test-machine01 ~]#

Step 2. Starting the repmgr DaemonWith the parameters now set in the Primary and Standby nodes, we execute a dry run of the command to start the repmgr daemon. The command has to be executed as the postgres user.

Master Node

[root@test-machine02 ~]#
[root@test-machine02 ~]# su - postgres
Last login: Sat Aug  7 14:01:32 +03 2021 on pts/1
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf daemon start --dry-run
INFO: prerequisites for starting repmgrd met
DETAIL: following command would be executed:
  sudo /usr/bin/systemctl start repmgr13.service
[postgres@test-machine02]
 
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf --pid-file /tmp/repmgrd.pid
[2021-08-07 15:13:06] [WARNING] "repmgrd_pid_file" will be overridden by --pid-file
[2021-08-07 15:13:06] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
 
[postgres@test-machine02]cat /var/log/repmgr.log
[2021-08-07 15:13:06] [NOTICE] repmgrd (repmgrd 5.2.1) starting up
[2021-08-07 15:13:06] [INFO] connecting to database "host=test-machine02 user=repmgr dbname=repmgr port=5432"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2021-08-07 15:13:06] [NOTICE] starting monitoring of node "test-machine02" (ID: 1)
[2021-08-07 15:13:06] [INFO] "connection_check_type" set to "ping"
[2021-08-07 15:13:06] [NOTICE] monitoring cluster primary "test-machine02" (ID: 1)
[2021-08-07 15:13:06] [INFO] child node "test-machine01" (ID: 2) is attached
 
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event --event=repmgrd_start
 Node ID | Name           | Event         | OK | Timestamp           | Details
---------+----------------+---------------+----+---------------------+-----------------------------------------------------------------
 2       | test-machine01 | repmgrd_start | t  | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
 1       | test-machine02 | repmgrd_start | t  | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
 
[postgres@test-machine02]

Standby Node

[root@test-machine01 ~]#
[root@test-machine01 ~]# su - postgres
Last login: Sat Aug  7 14:01:41 +03 2021 on pts/0
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf daemon start --dry-run
INFO: prerequisites for starting repmgrd met
DETAIL: following command would be executed:
  sudo /usr/bin/systemctl start repmgr13.service
[postgres@test-machine01]
 
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf --pid-file /tmp/repmgrd.pid
[2021-08-07 15:21:39] [WARNING] "repmgrd_pid_file" will be overridden by --pid-file
[2021-08-07 15:21:39] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
 
[postgres@test-machine01]
[postgres@test-machine01]cat /var/log/repmgr.log
[2021-08-07 15:21:39] [NOTICE] repmgrd (repmgrd 5.2.1) starting up
[2021-08-07 15:21:39] [INFO] connecting to database "host=test-machine01 user=repmgr dbname=repmgr port=5432"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2021-08-07 15:21:39] [NOTICE] starting monitoring of node "test-machine01" (ID: 2)
[2021-08-07 15:21:39] [INFO] "connection_check_type" set to "ping"
[2021-08-07 15:21:39] [INFO] monitoring connection to upstream node "test-machine02" (ID: 1)
[postgres@test-machine01]
 
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event --event=repmgrd_start
 Node ID | Name           | Event         | OK | Timestamp           | Details
---------+----------------+---------------+----+---------------------+-----------------------------------------------------------------
 2       | test-machine01 | repmgrd_start | t  | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
 1       | test-machine02 | repmgrd_start | t  | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
 
[postgres@test-machine01]

Step 3. Simulating a Failed Primary: Now we will simulate a failed primary by stopping the primary node (test-machine02). After Primary postgresql server is down repmgr Daemon will try to connect Primary server as per repmgr config parameter reconnect_attempts by default it 6. When all attempts failed Failover process will trigger, We can check repmgr logfile and with command repmgr cluster event like below to confirm actual failover.

Master Node

[root@test-machine02 ~]#
[root@test-machine02 ~]# systemctl stop postgresql-13
[root@test-machine02 ~]#

Standby Node

[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
 ID | Name           | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
 1  | test-machine02 | primary | - failed  | ?        | default  | 100      |          | host=test-machine02 user=repmgr dbname=repmgr port=5432
 2  | test-machine01 | primary | * running |          | default  | 100      | 4        | host=test-machine01 user=repmgr dbname=repmgr port=5432
 
WARNING: following issues were detected
  - unable to connect to node "test-machine02" (ID: 1)
 
HINT: execute with --verbose option to see connection error messages
[postgres@test-machine01]
 
 
[postgres@test-machine01]
[postgres@test-machine01] /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event
 Node ID | Name           | Event                    | OK | Timestamp           | Details
---------+----------------+--------------------------+----+---------------------+----------------------------------------------------------------------------------------------------------
 2       | test-machine01 | repmgrd_reload           | t  | 2021-08-07 15:35:06 | monitoring cluster primary "test-machine01" (ID: 2)
 2       | test-machine01 | repmgrd_failover_promote | t  | 2021-08-07 15:35:06 | node "test-machine01" (ID: 2) promoted to primary; old primary "test-machine02" (ID: 1) marked as failed
 2       | test-machine01 | standby_promote          | t  | 2021-08-07 15:35:06 | server "test-machine01" (ID: 2) was successfully promoted to primary
 2       | test-machine01 | repmgrd_start            | t  | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
 1       | test-machine02 | repmgrd_start            | t  | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
 1       | test-machine02 | standby_switchover       | t  | 2021-08-05 12:04:39 | node  "test-machine02" (ID: 1) promoted to primary, node "test-machine01" (ID: 2) demoted to standby
 1       | test-machine02 | standby_promote          | t  | 2021-08-05 12:04:35 | server "test-machine02" (ID: 1) was successfully promoted to primary
 1       | test-machine02 | node_rejoin              | t  | 2021-08-05 11:27:39 | node 1 is now attached to node 2
 2       | test-machine01 | standby_promote          | t  | 2021-08-05 10:59:54 | server "test-machine01" (ID: 2) was successfully promoted to primary
 2       | test-machine01 | standby_register         | t  | 2021-08-05 10:33:16 | standby registration succeeded; upstream node ID is 1
 2       | test-machine01 | standby_clone            | t  | 2021-08-05 10:32:56 | cloned from host "test-machine02", port 5432; backup method: barman; --force: N
 1       | test-machine02 | primary_register         | t  | 2021-08-05 10:32:01 |
 1       | test-machine02 | cluster_created          | t  | 2021-08-05 10:32:01 |
 
[postgres@test-machine01]
 
[root@test-machine01 ~]# cat /var/log/repmgr.log
[2021-08-07 15:34:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:34:44] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2021-08-07 15:34:54] [INFO] checking state of node "test-machine02" (ID: 1), 5 of 6 attempts
[2021-08-07 15:34:54] [WARNING] unable to ping "user=repmgr dbname=repmgr host=test-machine02 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2021-08-07 15:34:54] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:34:54] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2021-08-07 15:35:04] [INFO] checking state of node "test-machine02" (ID: 1), 6 of 6 attempts
[2021-08-07 15:35:04] [WARNING] unable to ping "user=repmgr dbname=repmgr host=test-machine02 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2021-08-07 15:35:04] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:35:04] [WARNING] unable to reconnect to node "test-machine02" (ID: 1) after 6 attempts
[2021-08-07 15:35:04] [INFO] 0 active sibling nodes registered
[2021-08-07 15:35:04] [INFO] 2 total nodes registered
[2021-08-07 15:35:04] [INFO] primary node  "test-machine02" (ID: 1) and this node have the same location ("default")
[2021-08-07 15:35:04] [INFO] no other sibling nodes - we win by default
[2021-08-07 15:35:04] [NOTICE] this node is the only available candidate and will now promote itself
[2021-08-07 15:35:04] [INFO] promote_command is:
  "/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file"
[2021-08-07 15:35:04] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
 
[2021-08-07 15:35:04] [INFO] user "repmgr" does not have permission to execute "pg_promote()", falling back to "pg_ctl promote"
[2021-08-07 15:35:04] [NOTICE] promoting standby to primary
[2021-08-07 15:35:04] [DETAIL] promoting server "test-machine01" (ID: 2) using "/usr/pgsql-13/bin/pg_ctl  -w -D '/var/lib/pgsql/13/data' promote"
[2021-08-07 15:35:06] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2021-08-07 15:35:06] [NOTICE] STANDBY PROMOTE successful
[2021-08-07 15:35:06] [DETAIL] server "test-machine01" (ID: 2) was successfully promoted to primary
[2021-08-07 15:35:06] [INFO] checking state of node 2, 1 of 6 attempts
[2021-08-07 15:35:06] [NOTICE] node 2 has recovered, reconnecting
[2021-08-07 15:35:06] [INFO] connection to node 2 succeeded
[2021-08-07 15:35:06] [INFO] original connection is still available
[2021-08-07 15:35:06] [INFO] 0 followers to notify
[2021-08-07 15:35:06] [INFO] switching to primary monitoring mode
[2021-08-07 15:35:06] [NOTICE] monitoring cluster primary "test-machine01" (ID: 2)

Related posts

Perform Manual Failover & Switchover with repmgr in PostgreSQL

Monitor Streaming Replication trong PostgreSQL

Thủ tục Failover/Switchover Streaming Replication trên PostgreSQL