Postgresql
Postgres 卡在恢復模式
我有一個處於恢復模式的獨立 PostgreSQL 實例。一直在說
2014-03-24 18:45:57 MDT FATAL: the database system is starting up
幾個小時,並
ps
顯示:postgres 2637 0.1 0.1 116916 4420 ? Ds 15:43 0:18 postgres: startup process recovering 00000001000000040000007E
有什麼方法可以觀察恢復的進度,最好是使用 ETA?我怎樣才能讓這個過程“不卡住”?
我可以使用標準的啟動/停止腳本停止 postgres,但是當我再次啟動它時,它仍然停留在恢復模式。
Debian 7.4、Linux 核心 3.2.0、PostgreSQL 9.1.12
輸出
strace -p 2637
:Process 2637 attached - interrupt to quit close(154) = 0 getppid() = 2600 open("pg_clog/0003", O_RDWR|O_CREAT, 0600) = 154 lseek(154, 221184, SEEK_SET) = 221184 write(154, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192 fsync(154) = 0
上述輸出看似無限重複。
一個 gdb 回溯(三個後續 bts 幾乎都是相同的):
(gdb) bt #0 0x00007f0c8f917d70 in fsync () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f0c9169f255 in pg_fsync_no_writethrough (fd=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/storage/file/fd.c:286 #2 0x00007f0c9169f265 in pg_fsync (fd=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/storage/file/fd.c:274 #3 0x00007f0c9151917b in SlruPhysicalWritePage (ctl=ctl@entry=0x7f0c91b856c0, pageno=pageno@entry=123, slotno=slotno@entry=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1- 9.1.12/build/../src/backend/access/transam/slru.c:801 #4 0x00007f0c91519925 in SlruInternalWritePage (ctl=ctl@entry=0x7f0c91b856c0, slotno=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:528 #5 0x00007f0c9151a1c7 in SimpleLruWritePage (ctl=ctl@entry=0x7f0c91b856c0, slotno=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:565 #6 0x00007f0c915018e3 in clog_redo (lsn=..., record=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/clog.c:720 #7 0x00007f0c91515f08 in StartupXLOG () at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/xlog.c:6664 #8 0x00007f0c91518537 in StartupProcessMain () at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/xlog.c:10179 #9 0x00007f0c9152f1ba in AuxiliaryProcessMain (argc=<optimized out>, argc@entry=2, argv=<optimized out>, argv@entry=0x7fff651b0610) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/bootstrap/bootstrap.c:409 #10 0x00007f0c9167ddeb in StartChildProcess (type=StartupProcess) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/postmaster/postmaster.c:4512 #11 0x00007f0c916806ac in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x7f0c91d8e180) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/postmaster/postmaster.c:1118 #12 0x00007f0c914d392f in main (argc=5, argv=0x7f0c91d8e180) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/main/main.c:209
如果我設置:
fsync = off
並再次執行 bt:
(gdb) bt #0 0x00007f3d99ab7160 in write () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007f3d9b6be0c6 in SlruPhysicalWritePage (ctl=ctl@entry=0x7f3d9bd2a6c0, pageno=pageno@entry=123, slotno=slotno@entry=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:783 #2 0x00007f3d9b6be925 in SlruInternalWritePage (ctl=ctl@entry=0x7f3d9bd2a6c0, slotno=3, fdata=fdata@entry=0x0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:528 #3 0x00007f3d9b6bf1c7 in SimpleLruWritePage (ctl=ctl@entry=0x7f3d9bd2a6c0, slotno=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/slru.c:565 #4 0x00007f3d9b6a68e3 in clog_redo (lsn=..., record=<optimized out>) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/clog.c:720 #5 0x00007f3d9b6baf08 in StartupXLOG () at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/xlog.c:6664 #6 0x00007f3d9b6bd537 in StartupProcessMain () at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/access/transam/xlog.c:10179 #7 0x00007f3d9b6d41ba in AuxiliaryProcessMain (argc=<optimized out>, argc@entry=2, argv=<optimized out>, argv@entry=0x7fff99d477a0) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/bootstrap/bootstrap.c:409 #8 0x00007f3d9b822deb in StartChildProcess (type=StartupProcess) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/postmaster/postmaster.c:4512 #9 0x00007f3d9b8256ac in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x7f3d9d614180) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/postmaster/postmaster.c:1118 #10 0x00007f3d9b67892f in main (argc=5, argv=0x7f3d9d614180) at /home/cbe/projects/postgresql/9.1/postgresql-9.1-9.1.12/build/../src/backend/main/main.c:209
在Google上搜尋了幾個小時後,我偶然發現了一個與我的問題無關的執行緒,我能說的最好,但它似乎無害,可以嘗試,瞧!
我看了看
/var/lib/postgresql/9.1/main/pg_clog
,發現:drwx------ 2 postgres postgres 4096 Mar 15 15:20 . drwx------ 13 postgres postgres 4096 Mar 25 12:15 .. -rw------- 1 postgres postgres 262144 Feb 4 19:39 0000 -rw------- 1 postgres postgres 262144 Feb 13 11:10 0001 -rw------- 1 postgres postgres 262144 Mar 15 15:20 0002 -rw------- 1 postgres postgres 229376 Mar 25 14:51 0003
注意到這
/var/lib/postgresql/9.1/main/pg_clog/0003
是在輸出中打開/查找的文件strace
,並且這是論壇文章中有問題的文件,我在論壇中嘗試了建議的操作:dd if=/dev/zero bs=8k count=1 >> /var/lib/postgresql/9.1/main/pg_clog/0003
現在 postgres 又開始了。