Postgresql
我們如何繞過 PostgreSQL autovacuum 在副本上使用表級鎖 ACCESS EXCLUSIVE?
我們正在執行:
user@primary/client-n:~$ psql -d database -c "SELECT version();" version --------------------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 10.7 (Ubuntu 10.7-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609, 64-bit (1 row)
在:
user@primary/client-n:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 16.04.6 LTS Release: 16.04 Codename: xenial
並設置一個主客戶端和兩個流複製客戶端,配置如下:
user@client-n:~$ psql -d postgres -c "SELECT name, setting FROM pg_settings WHERE name IN ( 'hot_standby', 'hot_standby_feedback', 'max_standby_streaming_delay' );" name | setting ----------------------------+--------- hot_standby | on hot_standby_feedback | on max_standby_streaming_delay | 150000 (3 rows)
我們只有一個數據庫(標準數據庫除外)和數據庫中的一個表。每天大約 3 到 4 次,我們會遇到一種特殊情況下的 autovacuum 情況,文件中將其描述為:
$$ … $$它(autovacuum)不會將空間返回給作業系統,除非在特殊情況下,表末尾的一個或多個頁面完全空閒並且可以輕鬆獲得獨占表鎖
我們正在監視
pg_locks
並且已經能夠觀察到 autovacuum 守護程序獲取了表級鎖定ACCESS EXCLUSIVE
,這反過來又導致客戶端上的一大堆阻塞程序,如下面的日誌條目所示:基本的:
... 2019-06-04 05:59:29.154 BST [8998-1] LOG: automatic vacuum of table "database.schema.table": index scans: 1 ...
客戶 1:
... 2019-06-04 05:59:03.660 BST [21167-858] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 21167 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.222 ms 2019-06-04 05:59:03.660 BST [21167-859] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415. 2019-06-04 05:59:03.660 BST [21167-860] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... 2019-06-04 05:59:03.730 BST [1215-51] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 1215 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.188 ms at character 15 2019-06-04 05:59:03.730 BST [1215-52] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415. 2019-06-04 05:59:03.730 BST [1215-53] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... ... 2019-06-04 05:59:19.975 BST [22242-4569] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 22242 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.281 ms at character 15 2019-06-04 05:59:19.975 BST [22242-4570] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 2741. Wait queue: 21167, 1215, 26415, 2423, 1289, 24009, 22441, 2640, 1843, 1056, 23336, 28060, 1860, 1134, 19419, 14649, 2721, 29540, 20138, 22242. 2019-06-04 05:59:19.975 BST [22242-4571] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT... ...
以及持有鎖的過程:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 2741; pid | backend_type | wait_event_type | wait_event ------+--------------+-----------------+---------------- 2741 | startup | Activity | RecoveryWalAll (1 row)
客戶 2:
... 2019-06-04 06:00:08.964 BST [16153-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | FATAL: terminating connection due to conflict with recovery 2019-06-04 06:00:08.964 BST [16153-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: User was holding a relation lock for too long. 2019-06-04 06:00:08.964 BST [16153-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | HINT: In a moment you should be able to reconnect to the database and repeat your command. 2019-06-04 06:00:09.964 BST [5747-537] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 5747 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.248 ms 2019-06-04 06:00:09.964 BST [5747-538] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228. 2019-06-04 06:00:09.964 BST [5747-539] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... 2019-06-04 06:00:09.975 BST [19765-6847] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 19765 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.180 ms 2019-06-04 06:00:09.975 BST [19765-6848] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228. 2019-06-04 06:00:09.975 BST [19765-6849] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... ... 2019-06-04 06:01:25.487 BST [15873-1] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 15873 still waiting for AccessShareLock on relation 16390 of database 16388 after 1000.218 ms at character 15 2019-06-04 06:01:25.487 BST [15873-2] [PostgreSQL JDBC Driver@ip_address(port):role@database] | DETAIL: Process holding the lock: 12709. Wait queue: 5747, 19765, 16036, 14617, 12280, 14513, 14728, 15398, 27611, 14542, 15948, 23398, 5853, 5098, 4324, 10760, 23480, 30192, 15300, 16228, 16127, 16285, 15873. 2019-06-04 06:01:25.487 BST [15873-3] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... ... 2019-06-04 06:01:29.160 BST [16127-6] [PostgreSQL JDBC Driver@ip_address(port):role@database] | LOG: process 16127 acquired AccessShareLock on relation 16390 of database 16388 after 8560.748 ms at character 15 2019-06-04 06:01:29.160 BST [16127-7] [PostgreSQL JDBC Driver@ip_address(port):role@database] | STATEMENT: SELECT ... ...
再次持有鎖的程序:
postgres=# SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE pid = 2741; pid | backend_type | wait_event_type | wait_event ------+--------------+-----------------+---------------- 12709 | startup | Activity | RecoveryWalAll (1 row)
客戶端上被阻止的查詢導致 API 延遲在 10 到 20 秒之間,有時還會出現大量 5xx 響應。Out SRE 團隊的任務是降低這些事件期間的 API 延遲,我們正在尋找解決這個問題的方法,我們認為這是一個非常小眾的情況。我們目前正在
recovery_min_apply_delay = 120s
客戶端 1 上進行試驗(因此後面的日誌條目),以便兩個客戶端不會同時鎖定。這在一定程度上減少了錯誤響應的數量並稍微降低了延遲峰值。我們不確定如何完全解決這個問題,事實上,是否有可能這樣做。我們將不勝感激您的建議。我們找到了這個相關的文章,但遺憾的是,它也沒有得到解決。
VACUUM
避免截斷及其所需的獨占鎖的一種未記錄的方法是設置old_snapshot_threshold
為 -1 以外的值。這反過來將擺脫鎖和由此產生的恢復衝突。從 PostgreSQL v12 開始,您可以使用更好的技術來禁用
VACUUM
單個表的截斷:ALTER TABLE mytab SET (vacuum_truncate = on, toast.vacuum_truncate = on);