timescaledb：批量插入耗盡所有記憶體

August 23, 2021

概括
我正在嘗試將數據批量插入 timescaledb 超表中。無論我嘗試什麼，記憶體使用量都會逐漸增長，直到伺服器程序由於記憶體不足而被終止。我在伺服器上的數據集小至 160 萬行時觀察到這一點，其中 128 GB 的 RAM 可用於 postgres/timescaledb，所以我一定做錯了什麼。對不是超表的表執行完全相同的插入操作就可以了，因此問題必須與 timescaledb 有關。
我想要做什麼
我希望插入的表定義如下：
CREATE TABLE test (
   gid BIGINT NOT NULL,
   location_ GEOGRAPHY(POINT),
   -- about 15 other VARCHAR and BIGINT columns omitted for brevity
   far_end_gid BIGINT NOT NULL,
   day_partition_key DATE NOT NULL,
   PRIMARY KEY (gid, far_end_gid, day_partition_key)
);
SELECT create_hypertable(
   'test', 'day_partition_key', chunk_time_interval =&gt; INTERVAL '1 day');
要插入的數據在數據庫中的另一個表中（通過\COPY操作填充的臨時表）；除了某些欄位需要一些解析（將字元串轉換為日期等）之外，此另一個表具有與目標超表相同的欄位。失敗的INSERT查詢是
INSERT INTO topology_test
   SELECT
       tobigintornull(gid) AS gid_as_int,
       -- about 15 other VARCHAR and BIGINT columns omitted
       CAST(CAST(REPLACE(far_end_gid, ',', '.') AS DOUBLE PRECISION) AS BIGINT),
       CAST(tobigintornull(far_end_cell_id) AS BIGINT),
       TO_DATE(day_partition_key, 'YYYYMMDD')
   FROM test_temp_table
   WHERE
       tobigintornull(gid) is not null
       and tobigintornull(REPLACE(far_end_gid, ',', '.')) is not null
       and day_partition_key is not null
   ON CONFLICT DO NOTHING;
該ON CONFLICT部分旨在靜默刪除重複的主鍵，並且該函式tobigintornull執行其名稱所暗示的操作：如果可能，它將輸入轉換為 bigint，如果不是，則返回 null（這有助於刪除無法解析的行）。它的定義是
CREATE OR REPLACE FUNCTION tobigintornull(text) RETURNS BIGINT AS $$
DECLARE x BIGINT;
BEGIN
   x = $1::BIGINT;
   RETURN x;
EXCEPTION WHEN others THEN
   RETURN NULL;
END;
$$
STRICT
LANGUAGE plpgsql IMMUTABLE;
輸入的定義DOUBLE PRECISION類似。WHERE請注意，該子句僅刪除了一小部分行（絕對小於 1%）。
完整的數據集是 5.93 億行，並且沒有以任何方式排序。懷疑輸入缺乏排序是問題的一部分，我創建了一個包含 160 萬行的數據子集，其中所有行的值都相同day_partition_key（從 timescaledb 的角度來看，該子集應該是完美排序的）。
問題
問題表現為 postgres 的記憶體使用量逐漸增加，直到數據庫可用的全部 128 GB 被使用（大約需要 5 分鐘）。在 100% 記憶體使用一分鐘左右後，插入查詢崩潰。日誌顯示以下內容：
timescaledb_1  | 2021-08-23 11:56:50.727 UTC [1] LOG:  server process (PID 1231) was terminated by signal 9: Killed
timescaledb_1  | 2021-08-23 11:56:50.727 UTC [1] DETAIL:  Failed process was running: INSERT INTO test
timescaledb_1  |            SELECT

(...)

timescaledb_1  | 2021-08-23 11:56:50.727 UTC [1] LOG:  terminating any other active server processes
timescaledb_1  | 2021-08-23 11:56:50.741 UTC [1215] WARNING:  terminating connection because of crash of another server process
timescaledb_1  | 2021-08-23 11:56:50.741 UTC [1215] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
timescaledb_1  | 2021-08-23 11:56:50.741 UTC [1215] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
timescaledb_1  | 2021-08-23 11:56:50.744 UTC [1221] WARNING:  terminating connection because of crash of another server process
timescaledb_1  | 2021-08-23 11:56:50.744 UTC [1221] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
timescaledb_1  | 2021-08-23 11:56:50.744 UTC [1221] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
timescaledb_1  | 2021-08-23 11:56:50.771 UTC [1] LOG:  all server processes terminated; reinitializing
timescaledb_1  | 2021-08-23 11:56:51.371 UTC [1259] LOG:  database system was interrupted; last known up at 2021-08-23 11:50:40 UTC
timescaledb_1  | 2021-08-23 11:56:51.711 UTC [1259] LOG:  database system was not properly shut down; automatic recovery in progress
timescaledb_1  | 2021-08-23 11:56:51.720 UTC [1259] LOG:  redo starts at 25A/16567EF8
timescaledb_1  | 2021-08-23 11:56:51.788 UTC [1259] LOG:  invalid record length at 25A/165C1A30: wanted 24, got 0
timescaledb_1  | 2021-08-23 11:56:51.788 UTC [1259] LOG:  redo done at 25A/165C19D0
timescaledb_1  | 2021-08-23 11:56:51.893 UTC [1] LOG:  database system is ready to accept connections
timescaledb_1  | 2021-08-23 11:56:52.039 UTC [1265] LOG:  TimescaleDB background worker launcher connected to shared catalogs
顯然，數據庫恢復後目標表中不存在任何行。
我試圖解決的問題
我已將數據集的大小從 5.93 億行減少到 160 萬行，確保子集對於用於分塊的日期列只有一個值 ( day_partition_key)。結果是完全一樣的。
在https://github.com/timescale/timescaledb/issues/643的討論之後，我將 timescaledb 配置更改為SET timescaledb.max_open_chunks_per_insert=1;. 問題仍然完全相同。
我嘗試創建另一個目標表而不將其設為超表。然後插入 1.6 M 行子集就可以了。我希望整套都能奏效，但我還沒有花時間去做。
版本、硬體和配置
docker 鏡像 timescale/timescaledb-postgis:latest-pg13 (bf76e5594c98) 用於執行 timescaledb。它包含：
x86_64-pc-linux-musl 上的 PostgreSQL 13.3，由 gcc (Alpine 10.2.1_pre1) 10.2.1 20201203 編譯，64 位
時標數據庫 2.3.0
POSTGIS=“2.5.5”$$ EXTENSION $$PGSQL=“130” GEOS=“3.8.1-CAPI-1.13.3” PROJ=“Rel. 7.1.1，2020 年 9 月 1 日” GDAL=“GDAL 3.1.4，2020/10/20 發布” LIBXML=" 2.9.10" LIBJSON=“0.15” LIBPROTOBUF=“1.3.3” 光柵
docker 容器限制為 16 個核心和 128 GB 記憶體。對於上面提到的測試，我使用了https://pgtune.leopard.in.ua/#/建議的配置參數用於帶有我的參數的數據倉庫，除了我將提供給 pgtune 的可用記憶體降低到 64 GB，希望這會導致數據庫使用更少記憶體的配置（我也嘗試了推薦用於 128 GB 記憶體的設置，結果是相同的）。設置如下：
max_connections = 30
shared_buffers = 16GB
effective_cache_size = 48GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 34952kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 4

偉大的工作，很好的解釋！
我認為您需要分析您的 Postgres 以更好地了解正在發生的事情。
只是想法：
刪除 plpgsql 函式並驗證您的字元串或使用coalesce來管理空值。
我還考慮使用公用表表達式在一個地方解決您的問題，並避免多次呼叫該函式。
如果只有規範化的數據重要，而去規範化的結果不重要，可以考慮在插入前增加一個預處理步驟來清理，避免使用 where 子句等。

引用自：https://dba.stackexchange.com/questions/298427

timescaledb：批量插入耗盡所有記憶體

概括

我想要做什麼

問題

我試圖解決的問題

版本、硬體和配置

相關問答

動態生成批量導入數據

在執行並行暫存表載入時減少記憶體壓力

使用 MariaDB 數據庫進行數據輪換

優化查詢以在分區表中跨多天獲取數據

是否可以創建一個 MySQL 分區，然後通過另一列創建一個子分區？

觸發分區表和性能