Postgres 12 partition performance by column type
I am performance-testing Postgres 12 with partitioning and a large amount of data. Each partition holds exactly one site with 400k rows, and I generated roughly 1k partition tables.
For the first test suite I used UUID as the id type, but I figured that switching the id type to bigint would use less space and therefore perform better. After populating the tables, I executed the following select one hundred times with different data:
SELECT site, SUM(amount)
FROM test_table
WHERE date >= '2021-02-06' AND date <= '2021-02-07'
  AND site IN ('c3b3771c-4b48-41a9-88eb-4c47d1630644',
               'cbb11cdc-cd31-4da2-b14e-9ef878ce03c5',
               '2609ac86-995b-4320-a3b7-46ba175aa5e2') -- randomly picked from site pool
GROUP BY site
ORDER BY site;
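For reference, one way to script "100 serial selects with different data" is a pgbench custom script. This is only a sketch, not the method actually used: it assumes the bigint variant of the schema below with site ids running 1 to 1000, since pgbench's random() produces integers (a UUID run would instead substitute ids from a pre-built list).

-- bench.sql: hypothetical pgbench custom script (bigint variant).
-- Run 100 transactions serially with, e.g.:
--   pgbench -n -c 1 -t 100 -f bench.sql testdb
-- (add -r to report per-statement latencies)
-- Each transaction draws fresh random site ids, mimicking
-- "randomly picked from site pool".
\set s1 random(1, 1000)
\set s2 random(1, 1000)
\set s3 random(1, 1000)
SELECT site, SUM(amount)
FROM test_table
WHERE date >= '2021-02-06' AND date <= '2021-02-07'
  AND site IN (:s1, :s2, :s3)
GROUP BY site
ORDER BY site;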
UUID test suite, no index on date:

CREATE TABLE public.test_table (
    id UUID NOT NULL,
    site UUID,
    archive UUID,
    location UUID,
    col_1 UUID,
    col_2 UUID,
    col_3 UUID,
    amount numeric(8,2),
    date timestamp with time zone,
    ....
) PARTITION BY LIST (site);

CREATE TABLE test_table_${site} PARTITION OF test_table FOR VALUES IN ('${site}');

One table size: "265 MB"
BIGINT test suite, no index on date:

CREATE TABLE public.test_table (
    id bigint NOT NULL,
    site bigint,
    archive bigint,
    location bigint,
    col_1 bigint,
    col_2 bigint,
    col_3 bigint,
    amount numeric(8,2),
    date timestamp with time zone,
    ...
) PARTITION BY LIST (site);

CREATE TABLE test_table_${site} PARTITION OF test_table FOR VALUES IN ('${site}');

One table size: "118 MB"
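Creating ~1k list partitions by hand is impractical, so they are presumably scripted. A minimal sketch for the bigint variant, assuming site ids simply run 1 to 1000 (an assumption, not stated above); the UUID variant would loop over a pre-generated list of uuids instead:

DO $$
BEGIN
  -- One list partition per site id; ids 1..1000 are assumed.
  FOR s IN 1..1000 LOOP
    EXECUTE format(
      'CREATE TABLE test_table_%s PARTITION OF test_table FOR VALUES IN (%s)',
      s, s);
  END LOOP;
END
$$;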
Test results
UUID test results (ms) for 100 serial selects:   median 1,425.00   95th percentile 1,930.05
BIGINT test results (ms) for 100 serial selects: median 4,456.00   95th percentile 9,037.50
The corresponding EXPLAIN plans:
UUID
"GroupAggregate (cost=61944.56..61947.03 rows=90 width=88)" " Group Key: test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644.site" " -> Sort (cost=61944.56..61944.78 rows=90 width=48)" " Sort Key: test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644.site" " -> Gather (cost=1000.00..61941.63 rows=90 width=48)" " Workers Planned: 3" " -> Parallel Append (cost=0.00..60932.63 rows=30 width=48)" " -> Parallel Seq Scan on test_table_c3b3771c_4b48_41a9_88eb_4c47d1630644 (cost=0.00..20311.16 rows=10 width=48)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))" " -> Parallel Seq Scan on test_table_cbb11cdc_cd31_4da2_b14e_9ef878ce03c5 (cost=0.00..20311.16 rows=10 width=48)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))" " -> Parallel Seq Scan on test_table_2609ac86_995b_4320_a3b7_46ba175aa5e2 (cost=0.00..20310.16 rows=10 width=48)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{c3b3771c-4b48-41a9-88eb-4c47d1630644,cbb11cdc-cd31-4da2-b14e-9ef878ce03c5,2609ac86-995b-4320-a3b7-46ba175aa5e2}'::uuid[])))"
BIGINT
"Finalize GroupAggregate (cost=47951.35..47954.22 rows=21 width=80)" " Group Key: test_table_121.site" " -> Gather Merge (cost=47951.35..47953.63 rows=18 width=80)" " Workers Planned: 3" " -> Partial GroupAggregate (cost=46951.31..46951.48 rows=6 width=80)" " Group Key: test_table_121.site" " -> Sort (cost=46951.31..46951.33 rows=6 width=40)" " Sort Key: test_table_121.site" " -> Parallel Append (cost=0.00..46951.24 rows=6 width=40)" " -> Parallel Seq Scan on test_table_121 (cost=0.00..15651.09 rows=2 width=40)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))" " -> Parallel Seq Scan on test_table_242 (cost=0.00..15651.09 rows=2 width=40)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))" " -> Parallel Seq Scan on test_table_122 (cost=0.00..15649.02 rows=2 width=40)" " Filter: ((date_fiscal >= '2021-02-06 00:00:00+00'::timestamp with time zone) AND (date_fiscal <= '2021-02-07 00:00:00+00'::timestamp with time zone) AND (site = ANY ('{121,122,242}'::bigint[])))"
How can there be such a large difference in select time when the data is smaller? Or did I make a mistake somewhere in my testing?
Thanks in advance!
I ran the same test again on the same data:
SELECT site, SUM(amount)
FROM test_table
WHERE date BETWEEN ... -- 1 day | 1 week | 1 month interval
  AND site IN ('site id 1', 'site id 2', 'site id 3') -- randomly picked from site pool
GROUP BY site
ORDER BY site;
Test: 1 day interval, 3 site ids:
UUID test results (ms) for 100 serial selects:   median 1,425.00   95th percentile 1,930.05
BIGINT test results (ms) for 100 serial selects: median 1,116.50   95th percentile 1,641.55
Test: 1 week interval, 3 site ids:
UUID test results (ms) for 100 serial selects:   median 1,406.50   95th percentile 1,849.10
BIGINT test results (ms) for 100 serial selects: median 1,147.00   95th percentile 1,563.75
Test: 1 month interval, 3 site ids:
UUID test results (ms) for 100 serial selects:   median 1,446.00   95th percentile 1,876.05
BIGINT test results (ms) for 100 serial selects: median 1,146.50   95th percentile 1,430.15
When I put 10 site ids in the select, the difference became much more pronounced:
Test: 1 day interval, 10 site ids:
UUID test results (ms) for 100 serial selects:   median 4,431.00   95th percentile 5,237.55
BIGINT test results (ms) for 100 serial selects: median 3,607.50   95th percentile 4,220.05
Test: 1 week interval, 10 site ids:
UUID test results (ms) for 100 serial selects:   median 4,458.50   95th percentile 5,308.10
BIGINT test results (ms) for 100 serial selects: median 3,405.50   95th percentile 4,193.55
Test: 1 month interval, 10 site ids:
UUID test results (ms) for 100 serial selects:   median 4,533.50   95th percentile 5,540.70
BIGINT test results (ms) for 100 serial selects: median 3,549.00   95th percentile 4,162.90
I think the problem was that the tests were executed before the data generation had fully settled, and the Postgres server was probably still performing some data reorganization tasks in the background.
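If that is the cause, one way to rule it out is to force that background work to finish before timing anything. These are standard Postgres maintenance commands (the table name is the one from the schemas above); a minimal sketch:

VACUUM (ANALYZE) test_table;  -- on a partitioned table in PG 12 this
                              -- processes every leaf partition
CHECKPOINT;                   -- superuser only: flush dirty buffers so the
                              -- first timed run isn't still paying for the load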
My guess is that the discrepancy comes from how you are executing the tests. Having looked at your sample test query, I believe one data set may be hitting much more favorable test parameters than the other. In particular, your WHERE clause:

WHERE date >= '2021-02-06' AND date <= '2021-02-07'
  AND site IN ('c3b3771c-4b48-41a9-88eb-4c47d1630644',
               'cbb11cdc-cd31-4da2-b14e-9ef878ce03c5',
               '2609ac86-995b-4320-a3b7-46ba175aa5e2') -- randomly picked from site pool
Without seeing the equivalent test query you executed against the BIGINT data set it is hard to compare, but I think an unbalanced test could arise for the following possible reasons:
- The date ranges you used may happen to strongly favor the data partitioned on the UUID site column rather than the BIGINT one, especially since the bigint site values are presumably ordered and distributed across the partitions quite differently than the UUID values.
- The way you choose the site values for the predicate in your WHERE clause may likewise favor your UUID test partitions over the BIGINT ones. You say you pick them randomly from the site pool, but it really depends on how truly random that is, combined with the fact that the ordering of the partitions keyed on UUID will be very different from the ordering of the partitions keyed on BIGINT.
Again, without seeing your equivalent sample query for the BIGINT test, and how you randomly select that predicate in both cases, it is hard to say how much of an impact this has. All in all, I don't see anything else that would account for such a large difference in the results, which is what leads me to theorize the above. Unfortunately, if the problem lies in how you test the data, as I suspect, then there is no reputable source that can hand you an answer. Instead, you should start by simplifying your tests to eliminate the potential variables that are weighting the results, and go from there.
For example, you could start with boundary tests by manually picking the first partition value, the last partition value, and one close to the middle, then execute the tests against all partitions with no date predicate in any of those cases, to eliminate the potential source of error I mentioned above. Then introduce a date range predicate that you know spans an equal number of partitions, each with an equal number of rows, for the specific site values you are testing. Essentially, controlled tests will give you more meaningful information here than randomized ones.
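To make that concrete, a controlled run could look like the sketch below. The site ids (1, 500, 1000) and the month-long range are placeholders standing in for the actual first/middle/last partition values and a range verified to cover equal row counts; they are not values from the question.

-- Step 1: boundary test, no date predicate; pick the first, middle,
-- and last partition values by hand (placeholders shown).
SELECT site, SUM(amount)
FROM test_table
WHERE site IN (1, 500, 1000)
GROUP BY site
ORDER BY site;

-- Step 2: reintroduce a date range known to cover the same number of
-- rows in each partition under test, so neither data set is favored.
SELECT site, SUM(amount)
FROM test_table
WHERE date >= '2021-02-01' AND date < '2021-03-01'
  AND site IN (1, 500, 1000)
GROUP BY site
ORDER BY site;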