Postgresql

PosgtreSQL GIN + BTree 排序

  • November 25, 2017

我有一個結構簡單的表(id、metadata_json、stamp),stamp 是一個時間戳,上面有 Btree 索引。MetadataJson 是一個帶有 GIN 索引的 jsonb。

該表有 25M 行。我正在使用 PostgreSQL 10。

                Table "public.metadata"
   Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
id            | uuid                        | not null
metadata_json | jsonb                       |
stamp         | timestamp without time zone |
Indexes:
   "metadata_pkey" PRIMARY KEY, btree (id)
   "metadata_idx" gin (metadata_json)
   "stamp_idx" btree (stamp)

我正在執行的查詢非常簡單:

select * from metadata where metadata_json @> '{"someBool": true}'
        AND stamp >= '01-01-2016' ORDER BY stamp DESC LIMIT 100;

我的想法應該如何工作:我有一個 btree 在郵票上,因此它應該以相反的順序通過索引,然後它應該測試 json 限制上的行(它具有 40% 的選擇性)。我希望它會在幾毫秒內返回。

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=80598.46..80598.71 rows=100 width=381) (actual time=445064.728..445064.791 rows=100 loops=1)
  ->  Sort  (cost=80598.46..80607.46 rows=3600 width=381) (actual time=445064.724..445064.754 rows=100 loops=1)
        Sort Key: stamp DESC
        Sort Method: top-N heapsort  Memory: 109kB
        ->  Bitmap Heap Scan on metadata  (cost=66591.00..80460.87 rows=3600 width=381) (actual time=2881.164..444283.520 rows=1437024 loops=1)
              Recheck Cond: ((metadata_json @> '{"someBool": true}'::jsonb) AND (stamp >= '2016-01-01 00:00:00'::timestamp without time zone))
              Heap Blocks: exact=882439
              ->  BitmapAnd  (cost=66591.00..66591.00 rows=3600 width=0) (actual time=2599.415..2599.415 rows=0 loops=1)
                    ->  Bitmap Index Scan on metadata_idx  (cost=0.00..260.25 rows=25100 width=0) (actual time=1762.166..1762.166 rows=10041746 loops=1)
                          Index Cond: (metadata_json @> '{"someBool": true}'::jsonb)
                    ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=760.136..760.136 rows=3591329 loops=1)
                          Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
Planning time: 5.008 ms
Execution time: 445072.043 ms
(14 rows)

從計劃上看,planner 的統計數據確實不准確,但是分析了表,採樣設置為 1000。


編輯:經過一番搜尋,我發現 postgres 沒有 jsonb 數據類型的統計資訊……你能告訴我如何優化這種類型的查詢嗎?

編輯 2:如果我禁用點陣圖掃描,查詢速度非常快(2 毫秒)。但我不認為這是一個好的解決方案……

編輯 3:圍欄(使用 CTE 語句)

Limit  (cost=1489968.48..1489968.73 rows=100 width=56) (actual time=447543.199..447543.262 rows=100 loops=1)
 CTE t
   ->  Bitmap Heap Scan on metadata  (cost=67228.70..1408830.13 rows=3600034 width=381) (actual time=1045.566..441897.315 rows=3591329 loops=1)
         Recheck Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
         Heap Blocks: exact=1229457
         ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=663.960..663.960 rows=3591329 loops=1)
               Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
 ->  Sort  (cost=81138.35..81147.35 rows=3600 width=56) (actual time=447543.197..447543.227 rows=100 loops=1)
       Sort Key: t.stamp DESC
       Sort Method: top-N heapsort  Memory: 109kB
       ->  CTE Scan on t  (cost=0.00..81000.76 rows=3600 width=56) (actual time=1045.577..446935.261 rows=1437024 loops=1)
             Filter: (metadata_json @> '{"someBool": true}'::jsonb)
             Rows Removed by Filter: 2154305
Planning time: 0.169 ms
Execution time: 447692.843 ms

編輯 4:圍欄(FROM 中的子選擇)

Limit  (cost=1798933.42..1811851.02 rows=100 width=381) (actual time=198282.400..198282.400 rows=0 loops=1)
 ->  Subquery Scan on foo  (cost=1798933.42..2186461.48 rows=3000 width=381) (actual time=198282.397..198282.397 rows=0 loops=1)
       Filter: (foo.metadata_json @> '{"someBool": true}'::jsonb)
       Rows Removed by Filter: 3591329
       ->  Gather Merge  (cost=1798933.42..2148961.13 rows=3000028 width=381) (actual time=184803.964..195869.763 rows=3591329 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Sort  (cost=1797933.40..1801683.43 rows=1500014 width=381) (actual time=184599.426..188532.836 rows=1197110 loops=3)
                   Sort Key: metadata.stamp DESC
                   Sort Method: external merge  Disk: 461368kB
                   ->  Parallel Bitmap Heap Scan on metadata  (cost=67228.70..1382579.88 rows=1500014 width=381) (actual time=1171.006..178501.269 rows=1197110 loops=3)
                         Recheck Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
                         Heap Blocks: exact=408005
                         ->  Bitmap Index Scan on stamp_idx  (cost=0.00..66328.69 rows=3600034 width=0) (actual time=728.401..728.401 rows=3591329 loops=1)
                               Index Cond: (stamp >= '2016-01-01 00:00:00'::timestamp without time zone)
Planning time: 6.704 ms
Execution time: 198509.456 ms

強制禁用點陣圖掃描

set enable_bitmapscan = off;
explain analyze select * from metadata where metadata_json @> '{"someBool": true}' AND stamp >= '01-01-2015' ORDER BY stamp DESC LIMIT 100;

Limit  (cost=0.44..256064.27 rows=100 width=381) (actual time=0.065..1.814 rows=100 loops=1)
 ->  Index Scan Backward using stamp_idx on metadata  (cost=0.44..18423793.42 rows=7195 width=381) (actual time=0.064..1.777 rows=100 loops=1)
       Index Cond: (stamp >= '2015-01-01 00:00:00'::timestamp without time zone)
       Filter: (metadata_json @> '{"someBool": true}'::jsonb)
       Rows Removed by Filter: 126
Planning time: 0.180 ms
Execution time: 1.856 ms

編輯 5 僅存在復合 gin(stamp, metadata_json) 索引:

explain analyze select * from metadata where metadata_json @> '{"someBool": true}
        AND stamp >= '01-01-2016' ORDER BY stamp DESC LIMIT 100;
Limit  (cost=14132.36..14132.61 rows=100 width=381) (actual time=308836.991..308837.052 rows=100 loops=1)
 ->  Sort  (cost=14132.36..14141.36 rows=3600 width=381) (actual time=308836.988..308837.018 rows=100 loops=1)
       Sort Key: stamp DESC
       Sort Method: top-N heapsort  Memory: 109kB
       ->  Bitmap Heap Scan on metadata  (cost=124.90..13994.77 rows=3600 width=381) (actual time=3160.418..308183.328 rows=1437024 loops=1)
             Recheck Cond: ((stamp >= '2016-01-01 00:00:00'::timestamp without time zone) AND (metadata_json @> '{"someBool": true}'::jsonb))
             Heap Blocks: exact=882439
             ->  Bitmap Index Scan on metadata_stamp_metadata_json_idx  (cost=0.00..124.00 rows=3600 width=0) (actual time=2883.484..2883.484 rows=1437024 loops=1)
                   Index Cond: ((stamp >= '2016-01-01 00:00:00'::timestamp without time zone) AND (metadata_json @> '{"someBool": true}'::jsonb))
Planning time: 0.233 ms
Execution time: 308857.051 ms

最終解決方案:

我已將 json 分解為鍵值並將其儲存為表“recordId, key, value, stamp”。我在這些上創建了一個 btree 索引 - 結果以幾毫秒的形式普遍返回。我認為沒有統計數據的 json 沒有任何好的通用解決方案。

正確的答案是 Evan,因為這可能是 jsonb 結構中可以做的最好的事情。

您在這裡遇到的真正問題是 jsonb 上的統計數據很糟糕。這是一個已知問題。它也永遠不會得到修復。

->  Bitmap Index Scan on metadata_idx  (cost=0.00..260.25 rows=25100 width=0) (actual time=1762.166..1762.166 rows=10041746 loops=1)
   Index Cond: (metadata_json @> '{"someBool": true}'::jsonb)

這裡 PostgreSQL 預計 25100,但低選擇性返回 10041746。時間戳估計非常準確,返回 3.6 M。PostgreSQL 預計必須探勘 3.6 M 才能找到微不足道的 25,100。這是很多探勘。因此,它對 jsonb 進行索引掃描。

你有幾個選擇。

  • 按照 ypercube 的建議創建複合 GIN 索引。
CREATE EXTENSION btree_gin;
CREATE INDEX ON metadata USING gin(stamp, metadata_json);
  • 在 metadata_json->someBool 上添加索引
  • 使用優化圍欄。
SELECT *
FROM (
 SELECT *
 FROM metadata
 WHERE stamp >= '01-01-2016'
 ORDER BY stamp DESC
 OFFSET 0
)
WHERE metadata_json @> '{"someBool": true}'
ORDER BY stamp DESC
LIMIT 100;

您可能還想查看jsonb_path_ops. 它也可以用來創建複合 GIN 索引,

CREATE INDEX ON metadata USING gin(stamp, metadata_json jsonb_path_ops);

引用自:https://dba.stackexchange.com/questions/191513