##
PostgreSQL Barman的两则案例
1、案例汇总
案例1:Barman与TIMELINE的案例
案例2:Barman PITR与SSH的案例
2、Barman与TIMELINE的案例
PG的备份工具太多了,当然大部分都是第三方的,Barman是一个值得学习的工具,和EDB有很大关系,大部分barman都用于remote场景,local并不多,这也许和streaming backup调用pg_basebackup有关。
2.1、案例复现
这是一个barman local的案例,恢复之后发现barman的进程起不来,复现之后barman的日志如下:
2024-04-04 18:52:59,418 [23130] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
2024-04-04 18:52:59,486 [23130] barman.command_wrappers INFO: pgdb01: pg_receivewal: starting log streaming at 0/63000000 (timeline 1)
2024-04-04 18:52:59,486 [23130] barman.command_wrappers INFO: pgdb01: pg_receivewal: error: unexpected termination of replication stream: ERROR: requested WAL segment 000000020000000000000063 has already been removed
2024-04-04 18:52:59,487 [23130] barman.command_wrappers INFO: pgdb01: pg_receivewal: error: disconnected
2024-04-04 18:52:59,490 [23130] barman.server ERROR: ArchiverFailure:pg_receivewal terminated with error code: 1
2.2、相关日志和信息检查
pg数据库的日志如下:
2024-04-04 18:47:22.831 CST [22933] ERROR: requested WAL segment 000000020000000000000063 has already been removed
2024-04-04 18:47:22.831 CST [22933] STATEMENT: START_REPLICATION SLOT "barman" 0/63000000 TIMELINE 1
2024-04-04 18:48:29.540 CST [22980] ERROR: requested WAL segment 000000020000000000000063 has already been removed
2024-04-04 18:48:29.540 CST [22980] STATEMENT: START_REPLICATION SLOT "barman" 0/63000000 TIMELINE 1
2024-04-04 18:49:43.189 CST [23007] ERROR: replication slot "barman" already exists
2024-04-04 18:49:43.189 CST [23007] STATEMENT: CREATE_REPLICATION_SLOT barman PHYSICAL
2024-04-04 18:50:07.559 CST [23030] ERROR: requested WAL segment 000000020000000000000063 has already been removed
2024-04-04 18:50:07.559 CST [23030] STATEMENT: START_REPLICATION SLOT "barman" 0/63000000 TIMELINE 1
恢复之后归档的日志信息如下:
[postgres@pgdb01:/postgresql/archive]$ll
total 98308
-rw------- 1 postgres postgres 16777216 Apr 4 18:08 000000020000000000000063
-rw------- 1 postgres postgres 16777216 Apr 4 18:43 000000020000000000000064
-rw------- 1 postgres postgres 16777216 Apr 4 18:47 000000020000000000000065
-rw------- 1 postgres postgres 16777216 Apr 4 18:51 000000020000000000000066
-rw------- 1 postgres postgres 16777216 Apr 4 18:52 000000020000000000000067
-rw------- 1 postgres postgres 16777216 Apr 4 18:52 000000020000000000000068
-rw------- 1 postgres postgres 42 Apr 4 18:08 00000002.history
2.3、解决方法
reset receive-wal 状态文件
[postgres@pgdb01:/postgresql/archive]$barman receive-wal --reset pgdb01
Resetting receive-wal directory status
Removing status file /postgresql/backup/barman/pgdb01/streaming/000000010000000000000063.partial
Creating status file /postgresql/backup/barman/pgdb01/streaming/000000020000000000000069.partial
查看reset的参数意思
reset the status of receive-wal removing any status files
barman日志
2024-04-04 18:55:21,299 [23544] barman.wal_archiver INFO: Resetting receive-wal directory status
2024-04-04 18:55:21,300 [23544] barman.wal_archiver INFO: Removing status file /postgresql/backup/barman/pgdb01/streaming/000000010000000000000063.partial
2024-04-04 18:55:21,306 [23544] barman.wal_archiver INFO: Creating status file /postgresql/backup/barman/pgdb01/streaming/000000020000000000000069.partial
2024-04-04 18:55:55,645 [23647] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-04 18:55:55,781 [23647] barman.utils INFO: Cleaning up lockfiles directory.
2024-04-04 18:55:55,933 [23650] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-04 18:55:55,987 [23651] barman.config WARNING: Discarding configuration file: .barman.auto.conf (not a file)
2024-04-04 18:55:56,027 [23650] barman.wal_archiver INFO: No xlog segments found from streaming for pgdb01.
2024-04-04 18:55:56,260 [23651] barman.server INFO: Starting receive-wal for server pgdb01
2024-04-04 18:55:56,262 [23651] barman.wal_archiver INFO: Activating WAL archiving through streaming protocol
2024-04-04 18:55:56,328 [23651] barman.command_wrappers INFO: pgdb01: pg_receivewal: starting log streaming at 0/69000000 (timeline 2)
2.4、检查barman状态
[postgres@pgdb01:/postgresql/archive]$barman cron
Starting WAL archiving for server pgdb01
Starting streaming archiver for server pgdb01
[postgres@pgdb01:/postgresql/archive]$ps -ef|grep barman
postgres 23651 1 12 18:55 ? 00:00:00 /usr/bin/python /usr/local/bin/barman -c /home/postgres/.barman.conf -q receive-wal pgdb01
postgres 23660 23651 3 18:55 ? 00:00:00 /postgresql/app/pgdb/bin/pg_receivewal --dbname=dbname=replication host=192.168.5.171 options=-cdatestyle=iso replication=true user=streaming_barman application_name=barman_receive_wal --verbose --no-loop --no-password --directory=/postgresql/backup/barman/pgdb01/streaming --slot=barman
postgres 23661 23080 0 18:55 ? 00:00:00 postgres: walsender streaming_barman 192.168.5.171(44464) streaming 0/69000060
postgres 23663 16165 0 18:55 pts/0 00:00:00 grep --color=auto barman
2.5、再次备份并检查备份集
[postgres@pgdb01:/postgresql/archive]$barman backup pgdb01 -w
Starting backup using postgres method for server pgdb01 in /postgresql/backup/barman/pgdb01/base/20240404T191231
Backup start at LSN: 0/69000148 (000000020000000000000069, 00000148)
Starting backup copy via pg_basebackup for 20240404T191231
Copy done (time: 1 second)
Finalising the backup.
Backup size: 37.6 MiB
Backup end at LSN: 0/6B000000 (00000002000000000000006A, 00000000)
Backup completed (start time: 2024-04-04 19:12:31.442820, elapsed time: 1 second)
Waiting for the WAL file 00000002000000000000006A from server 'pgdb01'
Processing xlog segments from streaming for pgdb01
00000002.history
000000020000000000000069
Processing xlog segments from streaming for pgdb01
00000002000000000000006A
[postgres@pgdb01:/postgresql/archive]$barman list-backup pgdb01
pgdb01 20240404T191231 - Thu Apr 4 19:12:33 2024 - Size: 37.6 MiB - WAL Size: 0 B
pgdb01 20240404T180053 - Thu Apr 4 18:00:54 2024 - Size: 37.5 MiB - WAL Size: 20.5 KiB
3、Barman PITR与SSH的案例
相同的客户那里遇到不同的问题,今天是一个关于SSH和get-wal小问题。
Barman PITR的细节很多,相同的客户那里遇到不同的问题,今天是以个关于SSH和get-wal小问题。
3.1、案例复现
这是一个barman remote PITR的案例,恢复之后发现数据库起不来,复现之后pg的日志如下:
[postgres@pgdb01:/postgresql/data]$log
2024-04-06 17:42:43.907 CST [22156] LOG: starting PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4), 64-bit
2024-04-06 17:42:43.908 CST [22156] LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-04-06 17:42:43.908 CST [22156] LOG: listening on IPv6 address "::", port 5432
2024-04-06 17:42:43.909 CST [22156] LOG: listening on Unix socket "./.s.PGSQL.5432"
2024-04-06 17:42:43.911 CST [22156] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-04-06 17:42:43.915 CST [22160] LOG: database system was interrupted; last known up at 2024-04-06 15:55:55 CST
2024-04-06 17:42:43.915 CST [22160] LOG: creating missing WAL directory "pg_wal/archive_status"
ERROR: Connection problem with ssh
2024-04-06 17:42:44.166 CST [22160] LOG: starting point-in-time recovery to XID 915
ERROR: Connection problem with ssh
ERROR: Connection problem with ssh
2024-04-06 17:42:44.595 CST [22160] LOG: invalid checkpoint record
2024-04-06 17:42:44.595 CST [22160] FATAL: could not locate required checkpoint record
2024-04-06 17:42:44.595 CST [22160] HINT: If you are restoring from a backup, touch "/postgresql/data/recovery.signal" and add required recovery options.
If you are not restoring from a backup, try removing the file "/postgresql/data/backup_label".
Be careful: removing "/postgresql/data/backup_label" will result in a corrupt cluster if restoring from a backup.
2024-04-06 17:42:44.601 CST [22156] LOG: startup process (PID 22160) exited with exit code 1
2024-04-06 17:42:44.601 CST [22156] LOG: aborting startup due to startup process failure
2024-04-06 17:42:44.613 CST [22156] LOG: database system is shut down
从日志可以看出问题是很明确的。
3.2、相关日志和信息检查
1.参数文件内容检查
[postgres@pgdb01:/postgresql/data]$cat postgresql.auto.conf
# Do not edit this file manually!
# It will be overwritten by the ALTER SYSTEM command.
# The 'barman get-wal' command must run as 'postgres' user
# The 'barman-wal-restore' command is provided in the 'barman-cli' package
restore_command = '/usr/local/bin/barman-wal-restore -P -U postgres barman pgdb01 %f %p'
recovery_target_xid = '915'
指定了一些PITR相关的参数
2.ssh检查
[postgres@pgdb01:/postgresql/data]$ssh postgres@192.168.5.172
Last login: Sat Apr 6 17:43:42 2024 from 192.168.5.171
[postgres@barman:/home/postgres]$
3.hosts文件检查
[postgres@pgdb01:/postgresql/data]$cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.5.171 pgdb01
192.168.5.172 barman
3.3、解决方法
这里配置SSH互信的时候,是copy给了IP地址,但是在restore_command 中却指定的是主机名,所以解决方法有两种:
1.ssh copy给主机名
2.把restore_command 中的信息改成IP地址
这里选择第一种。
[postgres@pgdb01:/postgresql/data]$ssh-copy-id barman
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/postgres/.ssh/id_rsa.pub"
The authenticity of host 'barman (192.168.5.172)' can't be established.
ED25519 key fingerprint is SHA256:tnU+CTtmrgv15H+opUQr+XlqndpaDAI6jWHRdPOgQUE.
This host key is known by the following other names/addresses:
~/.ssh/known_hosts:1: 192.168.5.172
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system.
(if you think this is a mistake, you may want to use -f option)
[postgres@pgdb01:/postgresql/data]$ssh postgres@barman
Last login: Sat Apr 6 17:35:07 2024
再次启动数据库
[postgres@pgdb01:/postgresql/data]$pg_ctl start
waiting for server to start....2024-04-06 17:43:49.044 CST [22212] LOG: redirecting log output to logging collector process
2024-04-06 17:43:49.044 CST [22212] HINT: Future log output will appear in directory "/postgresql/log".
.... done
server started
pg日志
2024-04-06 17:43:50.022 CST [22216] LOG: starting point-in-time recovery to XID 915
2024-04-06 17:43:50.796 CST [22224] FATAL: the database system is starting up
2024-04-06 17:43:50.873 CST [22216] LOG: restored log file "00000002.history" from archive
2024-04-06 17:43:51.582 CST [22227] FATAL: the database system is starting up
2024-04-06 17:43:51.980 CST [22216] LOG: restored log file "00000002000000000000001F" from archive
2024-04-06 17:43:52.014 CST [22216] LOG: redo starts at 0/1F000028
2024-04-06 17:43:52.736 CST [22230] FATAL: the database system is not yet accepting connections
2024-04-06 17:43:52.736 CST [22230] DETAIL: Consistent recovery state has not been yet reached.
2024-04-06 17:43:53.129 CST [22216] LOG: restored log file "000000020000000000000020" from archive
2024-04-06 17:43:53.161 CST [22216] LOG: consistent recovery state reached at 0/1F000100
2024-04-06 17:43:53.161 CST [22212] LOG: database system is ready to accept read-only connections
2024-04-06 17:43:53.163 CST [22216] LOG: recovery stopping after commit of transaction 915, time 2024-04-06 16:02:57.92549+08
2024-04-06 17:43:53.163 CST [22216] LOG: pausing at the end of recovery
2024-04-06 17:43:53.163 CST [22216] HINT: Execute pg_wal_replay_resume() to promote.
2024-04-06 17:48:49.153 CST [22214] LOG: restartpoint starting: time
2024-04-06 17:48:51.502 CST [22214] LOG: restartpoint complete: wrote 25 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=2.333 s, sync=0.002 s, total=2.350 s; sync files=19, longest=0.001 s, average=0.001 s; distance=16384 kB, estimate=16384 kB
2024-04-06 17:48:51.502 CST [22214] LOG: recovery restart point at 0/200000C8
2024-04-06 17:48:51.502 CST [22214] DETAIL: Last completed transaction was at log time 2024-04-06 16:02:57.92549+08.
一切正常
3.4、barman-cli的安装
PG的主机是要求安装barman-cli,如果有S3,那么需要安装barman-cli-cloud
# The 'barman-wal-restore' command is provided in the 'barman-cli' package
我这里的操作系统是Rocky Linux release 9.2 (Blue Onyx),可以通过EPEL安装,建议还是通过./setup.py去安装完整的barman。
[root@pgdb01 ~]# dnf install --downloadonly barman-cli
Rocky Linux 9 - BaseOS 1.8 kB/s | 4.1 kB 00:02
Rocky Linux 9 - AppStream 5.0 kB/s | 4.5 kB 00:00
Rocky Linux 9 - Extras 3.4 kB/s | 2.9 kB 00:00
Dependencies resolved.
=====================================================================================================================================================================
Package Architecture Version Repository Size
=====================================================================================================================================================================
Installing:
barman-cli noarch 3.10.0-1.el9 epel 40 k
Installing dependencies:
python3-argcomplete noarch 1.12.0-5.el9 appstream 61 k
python3-barman noarch 3.10.0-1.el9 epel 489 k
Transaction Summary
=====================================================================================================================================================================
Install 3 Packages
Total download size: 589 k
Installed size: 3.1 M
DNF will only download packages for the transaction.
Is this ok [y/N]: y
Downloading Packages:
(1/3): python3-argcomplete-1.12.0-5.el9.noarch.rpm 217 kB/s | 61 kB 00:00
(2/3): python3-barman-3.10.0-1.el9.noarch.rpm 578 kB/s | 489 kB 00:00
(3/3): barman-cli-3.10.0-1.el9.noarch.rpm 20 kB/s | 40 kB 00:01
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 185 kB/s | 589 kB 00:03
Complete!
The downloaded packages were saved in cache until the next successful transaction.
You can remove cached packages by executing 'dnf clean packages'.
4、近日闲事
半个月之后,流感终于好了…当时还好奇为啥这么多人戴口罩,结果下一个就是我了。
最近又听说了一个新词,共享员工,上一个与共享工作相关的公司还是WeWork(破产了…),现在都共享员工了….。下一步共享什么呢?
原文始发于微信公众号(库海无涯):PostgreSQL Barman的两则案例
版权声明:本文内容由互联网用户自发贡献,该文观点仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 举报,一经查实,本站将立刻删除。
文章由极客之音整理,本文链接:https://www.bmabk.com/index.php/post/271026.html