In this blog, we will configure repmgr to perform Automatic failover when primary failure is detected.
We will use the same PostgreSQL Cluster setup we did in blog How to Setup Streaming Replication with repmgr and barman.
Automatic failover is achieved with repmgr daemon (repmgrd) is a management and monitoring daemon which runs on each node in a replication cluster. It can automate actions such as failover and updating standbys to follow the new primary, as well as providing monitoring information about the state of each standby.
Below are the high-level steps we will follow.
1. Configuring repmgrd Parameters
2. Starting the repmgr Daemon
3. Simulating a Failed Primary
Below are setup details and the same will be used in this demonstration.
Sr No.
Hostname
IP
Role
1
test-machine02
192.168.114.176
Master / Primary Server
2
test-machine01
192.168.114.177
Standby / Secondary Server
Step 1. Configuring repmgrd Parameters: To enable the repmgr daemon and automatic failover, a number of other parameters need to be enabled/added. We need to set the below parameters on bot Primary & Standby node.
Master Node
[root@test-machine02 ~]#
[root@test-machine02 ~]# vi /etc/repmgr/13/repmgr.conf
failover ='automatic'
promote_command ='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command ='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
:wq!
[root@test-machine02 ~]#
Standby Node
[root@test-machine01 ~]#
[root@test-machine01 ~]# vi /etc/repmgr/13/repmgr.conf
failover ='automatic'
promote_command ='/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file'
follow_command ='/usr/pgsql-13/bin/repmgr standby follow -f /etc/repmgr/13/repmgr.conf --log-to-file --upstream-node-id=%n'
monitoring_history=yes
monitor_interval_secs=2
:wq!
[root@test-machine01 ~]#
Step 2. Starting the repmgr Daemon: With the parameters now set in the Primary and Standby nodes, we execute a dry run of the command to start the repmgr daemon. The command has to be executed as the postgres user.
Master Node
[root@test-machine02 ~]#
[root@test-machine02 ~]# su - postgres
Last login: Sat Aug 7 14:01:32 +03 2021 on pts/1
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf daemon start --dry-run
INFO: prerequisites for starting repmgrd met
DETAIL: following command would be executed:
sudo /usr/bin/systemctl start repmgr13.service
[postgres@test-machine02]
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf --pid-file /tmp/repmgrd.pid
[2021-08-07 15:13:06] [WARNING] "repmgrd_pid_file" will be overridden by --pid-file
[2021-08-07 15:13:06] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
[postgres@test-machine02]cat /var/log/repmgr.log
[2021-08-07 15:13:06] [NOTICE] repmgrd (repmgrd 5.2.1) starting up
[2021-08-07 15:13:06] [INFO] connecting to database "host=test-machine02 user=repmgr dbname=repmgr port=5432"
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2021-08-07 15:13:06] [NOTICE] starting monitoring of node "test-machine02" (ID: 1)
[2021-08-07 15:13:06] [INFO] "connection_check_type" set to "ping"
[2021-08-07 15:13:06] [NOTICE] monitoring cluster primary "test-machine02" (ID: 1)
[2021-08-07 15:13:06] [INFO] child node "test-machine01" (ID: 2) is attached
[postgres@test-machine02]
[postgres@test-machine02]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event --event=repmgrd_start
Node ID | Name | Event | OK | Timestamp | Details
---------+----------------+---------------+----+---------------------+-----------------------------------------------------------------
2 | test-machine01 | repmgrd_start | t | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
1 | test-machine02 | repmgrd_start | t | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
[postgres@test-machine02]
Standby Node
[root@test-machine01 ~]#
[root@test-machine01 ~]# su - postgres
Last login: Sat Aug 7 14:01:41 +03 2021 on pts/0
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf daemon start --dry-run
INFO: prerequisites for starting repmgrd met
DETAIL: following command would be executed:
sudo /usr/bin/systemctl start repmgr13.service
[postgres@test-machine01]
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgrd -f /etc/repmgr/13/repmgr.conf --pid-file /tmp/repmgrd.pid
[2021-08-07 15:21:39] [WARNING] "repmgrd_pid_file" will be overridden by --pid-file
[2021-08-07 15:21:39] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
[postgres@test-machine01]
[postgres@test-machine01]cat /var/log/repmgr.log
[2021-08-07 15:21:39] [NOTICE] repmgrd (repmgrd 5.2.1) starting up
[2021-08-07 15:21:39] [INFO] connecting to database "host=test-machine01 user=repmgr dbname=repmgr port=5432"
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2021-08-07 15:21:39] [NOTICE] starting monitoring of node "test-machine01" (ID: 2)
[2021-08-07 15:21:39] [INFO] "connection_check_type" set to "ping"
[2021-08-07 15:21:39] [INFO] monitoring connection to upstream node "test-machine02" (ID: 1)
[postgres@test-machine01]
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event --event=repmgrd_start
Node ID | Name | Event | OK | Timestamp | Details
---------+----------------+---------------+----+---------------------+-----------------------------------------------------------------
2 | test-machine01 | repmgrd_start | t | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
1 | test-machine02 | repmgrd_start | t | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
[postgres@test-machine01]
Step 3. Simulating a Failed Primary: Now we will simulate a failed primary by stopping the primary node (test-machine02). After Primary postgresql server is down repmgr Daemon will try to connect Primary server as per repmgr config parameter reconnect_attempts by default it 6. When all attempts failed Failover process will trigger, We can check repmgr logfile and with command repmgr cluster event like below to confirm actual failover.
Master Node
[root@test-machine02 ~]#
[root@test-machine02 ~]# systemctl stop postgresql-13
[root@test-machine02 ~]#
Standby Node
[postgres@test-machine01]
[postgres@test-machine01]/usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+----------------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------
1 | test-machine02 | primary | - failed | ? | default | 100 | | host=test-machine02 user=repmgr dbname=repmgr port=5432
2 | test-machine01 | primary | * running | | default | 100 | 4 | host=test-machine01 user=repmgr dbname=repmgr port=5432
WARNING: following issues were detected
- unable to connect to node "test-machine02" (ID: 1)
HINT: execute with --verbose option to see connection error messages
[postgres@test-machine01]
[postgres@test-machine01]
[postgres@test-machine01] /usr/pgsql-13/bin/repmgr -f /etc/repmgr/13/repmgr.conf cluster event
Node ID | Name | Event | OK | Timestamp | Details
---------+----------------+--------------------------+----+---------------------+----------------------------------------------------------------------------------------------------------
2 | test-machine01 | repmgrd_reload | t | 2021-08-07 15:35:06 | monitoring cluster primary "test-machine01" (ID: 2)
2 | test-machine01 | repmgrd_failover_promote | t | 2021-08-07 15:35:06 | node "test-machine01" (ID: 2) promoted to primary; old primary "test-machine02" (ID: 1) marked as failed
2 | test-machine01 | standby_promote | t | 2021-08-07 15:35:06 | server "test-machine01" (ID: 2) was successfully promoted to primary
2 | test-machine01 | repmgrd_start | t | 2021-08-07 15:21:39 | monitoring connection to upstream node "test-machine02" (ID: 1)
1 | test-machine02 | repmgrd_start | t | 2021-08-07 15:13:06 | monitoring cluster primary "test-machine02" (ID: 1)
1 | test-machine02 | standby_switchover | t | 2021-08-05 12:04:39 | node "test-machine02" (ID: 1) promoted to primary, node "test-machine01" (ID: 2) demoted to standby
1 | test-machine02 | standby_promote | t | 2021-08-05 12:04:35 | server "test-machine02" (ID: 1) was successfully promoted to primary
1 | test-machine02 | node_rejoin | t | 2021-08-05 11:27:39 | node 1 is now attached to node 2
2 | test-machine01 | standby_promote | t | 2021-08-05 10:59:54 | server "test-machine01" (ID: 2) was successfully promoted to primary
2 | test-machine01 | standby_register | t | 2021-08-05 10:33:16 | standby registration succeeded; upstream node ID is 1
2 | test-machine01 | standby_clone | t | 2021-08-05 10:32:56 | cloned from host "test-machine02", port 5432; backup method: barman; --force: N
1 | test-machine02 | primary_register | t | 2021-08-05 10:32:01 |
1 | test-machine02 | cluster_created | t | 2021-08-05 10:32:01 |
[postgres@test-machine01]
[root@test-machine01 ~]# cat /var/log/repmgr.log
[2021-08-07 15:34:44] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:34:44] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2021-08-07 15:34:54] [INFO] checking state of node "test-machine02" (ID: 1), 5 of 6 attempts
[2021-08-07 15:34:54] [WARNING] unable to ping "user=repmgr dbname=repmgr host=test-machine02 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2021-08-07 15:34:54] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:34:54] [INFO] sleeping up to 10 seconds until next reconnection attempt
[2021-08-07 15:35:04] [INFO] checking state of node "test-machine02" (ID: 1), 6 of 6 attempts
[2021-08-07 15:35:04] [WARNING] unable to ping "user=repmgr dbname=repmgr host=test-machine02 port=5432 connect_timeout=2 fallback_application_name=repmgr"
[2021-08-07 15:35:04] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2021-08-07 15:35:04] [WARNING] unable to reconnect to node "test-machine02" (ID: 1) after 6 attempts
[2021-08-07 15:35:04] [INFO] 0 active sibling nodes registered
[2021-08-07 15:35:04] [INFO] 2 total nodes registered
[2021-08-07 15:35:04] [INFO] primary node "test-machine02" (ID: 1) and this node have the same location ("default")
[2021-08-07 15:35:04] [INFO] no other sibling nodes - we win by default
[2021-08-07 15:35:04] [NOTICE] this node is the only available candidate and will now promote itself
[2021-08-07 15:35:04] [INFO] promote_command is:
"/usr/pgsql-13/bin/repmgr standby promote -f /etc/repmgr/13/repmgr.conf --log-to-file"
[2021-08-07 15:35:04] [NOTICE] redirecting logging output to "/var/log/repmgr.log"
[2021-08-07 15:35:04] [INFO] user "repmgr" does not have permission to execute "pg_promote()", falling back to "pg_ctl promote"
[2021-08-07 15:35:04] [NOTICE] promoting standby to primary
[2021-08-07 15:35:04] [DETAIL] promoting server "test-machine01" (ID: 2) using "/usr/pgsql-13/bin/pg_ctl -w -D '/var/lib/pgsql/13/data' promote"
[2021-08-07 15:35:06] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2021-08-07 15:35:06] [NOTICE] STANDBY PROMOTE successful
[2021-08-07 15:35:06] [DETAIL] server "test-machine01" (ID: 2) was successfully promoted to primary
[2021-08-07 15:35:06] [INFO] checking state of node 2, 1 of 6 attempts
[2021-08-07 15:35:06] [NOTICE] node 2 has recovered, reconnecting
[2021-08-07 15:35:06] [INFO] connection to node 2 succeeded
[2021-08-07 15:35:06] [INFO] original connection is still available
[2021-08-07 15:35:06] [INFO] 0 followers to notify
[2021-08-07 15:35:06] [INFO] switching to primary monitoring mode
[2021-08-07 15:35:06] [NOTICE] monitoring cluster primary "test-machine01" (ID: 2)