Postgresql
使用“或”運算符的 PostgreSQL 性能下降
我公司的開發團隊建構了以下查詢:
select distinct c.customer_id from table1 c join table2 l on c.customer_id = l.customer_id join table3 cal on c.customer_id = cal.customer_id WHERE (l.customer_group_id = 'loyalty' and c.loyalty_number = '123456789') or (cal.account_id = '123456789' and cal.account_type = 'loyalty') ;
該查詢在非生產環境中執行,需要 2.5 分鐘才能完成。
Explain analyze
告訴我它正在對所有表執行 SEQ SCAN 以完成 SQL。如果我評論任何 where 子句(或執行一個新查詢,只留下一個 where 子句),範例 1:
(l.customer_group_id = 'loyalty' and c.loyalty_number = '123456789') -- or -- (cal.account_id = '123456789' and cal.account_type = 'loyalty')
或範例 2:
-- (l.customer_group_id = 'loyalty' and c.loyalty_number = '123456789') -- or (cal.account_id = '123456789' and cal.account_type = 'loyalty')
查詢在幾毫秒內完成。在這種情況下,
explain analyze
向我展示了查詢計劃器使用可用的索引來執行操作。我很難理解為什麼運算符“或”使查詢規劃器根本不使用索引。在任何情況下,檢索到的行數都是 3。在此過程中丟棄了很多行。
如果我將運算符“或”替換為“和”,也會使查詢在毫秒內完成。
關於“或”運算符背後的邏輯,我在概念上是否遺漏了什麼?我無法理解或想出發生這種情況的原因。不確定這是否是預期的行為,或者我是否可以做些什麼。
既然一個 where 子句以毫秒為單位完成,另一個(單獨執行)也以毫秒為單位完成,為什麼當它們放在一起時有一個荒謬的時間超過 2 分鐘才能完成?
我將在下面提供有關我的環境的更多詳細資訊。任何評論將不勝感激。先感謝您。
-- Environment details PostgreSQL running on AWS Amazon Aurora Serverless Engine PostgreSQL 10.14 -- # of rows for each table 1 - 67,904,804 2 - 67,955,984 3 - 67,902,549 -- Table sizes (without index or toast) table | size | value -------------------+------------+--------- 1 | TABLE SIZE | 8 GB 2 | TABLE SIZE | 10 GB 3 | TABLE SIZE | 7 GB -- Explain plan for complete query, taking 2.5 min to complete QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=12268978.36..12268978.39 rows=5 width=19) (actual time=150567.032..150567.104 rows=3 loops=1) -> Sort (cost=12268978.36..12268978.37 rows=5 width=19) (actual time=150567.032..150567.101 rows=3 loops=1) Sort Key: c.customer_id Sort Method: quicksort Memory: 25kB -> Gather (cost=6171007.40..12268978.30 rows=5 width=19) (actual time=141341.630..150567.067 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Hash Join (cost=6170007.40..12267977.80 rows=2 width=19) (actual time=142581.256..149478.215 rows=1 loops=3) Hash Cond: ((l.customer_id)::text = (c.customer_id)::text) Join Filter: ((((l.customer_group_id)::text = 'loyalty'::text) AND ((c.loyalty_number)::text = '654654654'::text)) OR (((cal.account_id)::text = '654654654'::text) AND ((cal.account_type)::text = 'loyalty'::text))) Rows Removed by Join Filter: 22611489 -> Hash Join (cost=2863005.22..7836194.25 rows=27605097 width=64) (actual time=48335.237..79764.899 rows=22611495 loops=3) Hash Cond: ((cal.customer_id)::text = (l.customer_id)::text) -> Parallel Seq Scan on cal (cost=0.00..1397382.48 rows=28291748 width=37) (actual time=0.014..4722.632 rows=22632059 loops=3) -> Hash (cost=1581955.32..1581955.32 rows=66252232 width=27) (actual time=48310.717..48310.718 rows=67955982 loops=3) Buckets: 65536 Batches: 2048 Memory Usage: 2428kB -> Seq Scan on l (cost=0.00..1581955.32 rows=66252232 width=27) (actual time=0.028..13237.966 rows=67955982 loops=3) -> Hash (cost=2008016.30..2008016.30 rows=67179830 width=29) (actual time=49635.414..49635.415 rows=67898462 loops=3) Buckets: 65536 Batches: 2048 Memory Usage: 2492kB -> Seq Scan on c (cost=0.00..2008016.30 rows=67179830 width=29) (actual time=0.040..15735.018 rows=67898462 loops=3) Planning time: 0.620 ms Execution time: 150568.156 ms (22 rows) Time: 150786.479 ms (02:30.786) -- Explain plan when I comment one of the where clauses QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=32.11..32.13 rows=3 width=19) (actual time=0.101..0.103 rows=3 loops=1) -> Sort (cost=32.11..32.12 rows=3 width=19) (actual time=0.101..0.102 rows=3 loops=1) Sort Key: c.customer_id Sort Method: quicksort Memory: 25kB -> Nested Loop (cost=1.70..32.09 rows=3 width=19) (actual time=0.049..0.096 rows=3 loops=1) Join Filter: ((c.customer_id)::text = (cal.customer_id)::text) -> Nested Loop (cost=1.14..30.24 rows=3 width=38) (actual time=0.035..0.060 rows=3 loops=1) -> Index Scan using idx on c (cost=0.57..16.47 rows=3 width=19) (actual time=0.015..0.018 rows=3 loops=1) Index Cond: ((loyalty_number)::text = '654654654'::text) -> Index Only Scan using pkey on l (cost=0.57..4.59 rows=1 width=19) (actual time=0.013..0.013 rows=1 loops=3) Index Cond: ((customer_group_id = 'loyalty'::text) AND (customer_id = (c.customer_id)::text)) Heap Fetches: 0 -> Index Only Scan using idx on cal (cost=0.57..0.60 rows=1 width=19) (actual time=0.011..0.011 rows=1 loops=3) Index Cond: (customer_id = (l.customer_id)::text) Heap Fetches: 0 Planning time: 0.555 ms Execution time: 0.128 ms
一個基本的、100% 等效的“ugly-OR”查詢重寫
UNION
:SELECT customer_id FROM table1 c JOIN table2 l USING (customer_id) JOIN table3 cal USING (customer_id) WHERE l.customer_group_id = 'loyalty' AND c.loyalty_number = '123456789' UNION SELECT customer_id FROM table1 c JOIN table2 l USING (customer_id) JOIN table3 cal USING (customer_id) WHERE cal.account_id = '123456789' AND cal.account_type = 'loyalty'
應該已經快得多了,因為它可以使用像您的部分查詢這樣的索引。看:
通常,我們可以進一步優化。但是我們需要實際的表定義。最重要的是 PK 和 FK 約束,以及它們是否
customer_id
可以出現在一個表中而在另一個表中缺失。(或者有多個實例——但這與您的DISTINCT
查詢無關。)就像,如果
customer_id
保證三個表之一中存在的任何限定存在於所有三個(一行或多行)中,我們可以簡化:SELECT customer_id FROM table1 c JOIN table2 l USING (customer_id) WHERE l.customer_group_id = 'loyalty' AND c.loyalty_number = '123456789' UNION SELECT customer_id FROM table3 cal WHERE cal.account_id = '123456789' AND cal.account_type = 'loyalty';