PostgreSQL

Why does a list of 10,000 IDs perform better than using the equivalent SQL to select them?

  • December 22, 2020

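I have a Rails application with a legacy query that I would like to revamp. The current implementation performs two SQL queries: one to get a large number of IDs, and a second that uses those IDs and applies some additional joins and filters to get the desired result.

I am trying to replace this with a single query that avoids the round trip, but doing so has caused a large performance degradation in my local testing environment (which is a copy of the full production dataset). It appears that an index is not being used in the new query, leading to a full table scan. I had hoped the single query would keep the same performance as the original code, ideally improving on it, since it no longer needs to send all the IDs.

This is a fairly minimized version of my actual problem. Why does a list of 10,000 IDs perform better, in a complex query with multiple CTEs, than the equivalent SQL that selects them?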

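Current query

There is one query that takes about 6.5 seconds to compute the list of 10000+ IDs. You can see it as the visible_projects CTE in the "Proposed query" section below. Those IDs are then fed into this query: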

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
   SELECT
       id
   FROM
       "projects"
   WHERE
       "projects"."id" IN (
           -- 10000+ IDs removed
)),
visible_tasks AS MATERIALIZED (
   SELECT
       tasks.id
   FROM
       tasks
   WHERE
       tasks.project_id IN (
           SELECT
               id
           FROM
               visible_projects))
SELECT
   COUNT(1)
FROM
   visible_tasks;

Query plan:

Aggregate  (cost=1309912.31..1309912.32 rows=1 width=8) (actual time=148.661..153.739 rows=1 loops=1)
  Buffers: shared hit=73107 read=22301
  CTE visible_tasks
    ->  Gather  (cost=43024.54..1308639.80 rows=56556 width=4) (actual time=46.337..137.260 rows=48557 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          Buffers: shared hit=73107 read=22301
          ->  Nested Loop  (cost=42024.54..1301984.20 rows=23565 width=4) (actual time=28.871..120.682 rows=16186 loops=3)
                Buffers: shared hit=73107 read=22301
                ->  Parallel Bitmap Heap Scan on projects  (cost=42023.97..138877.16 rows=4378 width=4) (actual time=28.621..52.627 rows=3502 loops=3)
                      Recheck Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                      Heap Blocks: exact=3536
                      Buffers: shared hit=30410 read=9833
                      ->  Bitmap Index Scan on projects_pkey  (cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)
                            Index Cond: (id = ANY ('{ REMOVED_IDS }'::integer[]))
                            Buffers: shared hit=30410 read=1111
                ->  Index Scan using test_tasks_on_project on tasks  (cost=0.57..263.85 rows=182 width=8) (actual time=0.012..0.018 rows=5 loops=10507)
                      Index Cond: (project_id = projects.id)
                      Buffers: shared hit=42697 read=12468
  ->  CTE Scan on visible_tasks  (cost=0.00..1131.12 rows=56556 width=0) (actual time=46.339..144.641 rows=48557 loops=1)
        Buffers: shared hit=73107 read=22301
Planning:
  Buffers: shared hit=10 read=10
Planning Time: 8.857 ms
Execution Time: 156.102 ms

Proposed query

This is the same query structure, but instead of inserting the 10000+ IDs directly into the visible_projects CTE, I have embedded the SQL that finds those IDs.

EXPLAIN (ANALYZE, BUFFERS)
WITH visible_projects AS NOT MATERIALIZED (
   SELECT
       id
   FROM
       "projects"
   WHERE
       "projects"."company_id" = 11171
       AND "projects"."state" < 6
       AND "projects"."is_template" = FALSE),
visible_tasks AS MATERIALIZED (
   SELECT
       tasks.id
   FROM
       tasks
   WHERE
       tasks.project_id IN (
           SELECT
               id
           FROM
               visible_projects))
SELECT
   COUNT(1)
FROM
   visible_tasks;

Query plan:

Aggregate  (cost=2212223.53..2212223.54 rows=1 width=8) (actual time=40675.984..40686.708 rows=1 loops=1)
  Buffers: shared hit=118145 read=1567727
  CTE visible_tasks
    ->  Gather  (cost=279353.08..2208430.12 rows=168596 width=4) (actual time=7050.894..40666.025 rows=48557 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          Buffers: shared hit=118145 read=1567727
          ->  Hash Join  (cost=278353.08..2190570.52 rows=70248 width=4) (actual time=7038.932..40650.430 rows=16186 loops=3)
                Hash Cond: (tasks.project_id = projects.id)
                Buffers: shared hit=118145 read=1567727
                ->  Parallel Seq Scan on tasks  (cost=0.00..1828314.43 rows=31963043 width=8) (actual time=0.397..29372.029 rows=25572144 loops=3)
                      Buffers: shared read=1508684
                ->  Hash  (cost=277961.56..277961.56 rows=31322 width=4) (actual time=6977.480..6977.481 rows=10507 loops=3)
                      Buckets: 32768  Batches: 1  Memory Usage: 626kB
                      Buffers: shared hit=118061 read=59031
                      ->  Index Scan using index_projects_on_company_id on projects  (cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)
                            Index Cond: (company_id = 11171)
                            Filter: ((NOT is_template) AND (state < 6))
                            Rows Removed by Filter: 63512
                            Buffers: shared hit=118061 read=59031
  ->  CTE Scan on visible_tasks  (cost=0.00..3371.92 rows=168596 width=0) (actual time=7050.896..40671.054 rows=48557 loops=1)
        Buffers: shared hit=118145 read=1567727
Planning:
  Buffers: shared hit=2 read=18
Planning Time: 9.528 ms
Execution Time: 40687.524 ms

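Even allowing for the combined time of the two queries in the current implementation (about 6.5 s plus 0.16 s), this takes roughly 6 times as long.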

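I can see that the planner has chosen a Parallel Seq Scan on tasks, which is the main time factor. I don't understand why it was chosen, or what I should do to get back to using the index.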

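From my research I understand that Postgres does not offer query hints to force index usage, so I assume a good solution will involve showing the query planner that using the index is beneficial.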
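A common diagnostic, not a fix, for checking whether the planner can use the index at all is to discourage sequential scans for a single transaction and compare the resulting plan and timing. The sketch below assumes the simplified proposed query from this question; `enable_seqscan` is a standard planner setting, and `SET LOCAL` limits it to the enclosing transaction.

```sql
BEGIN;

-- Discourage (not forbid) sequential scans, for this transaction only.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   tasks
WHERE  project_id IN (
   SELECT id
   FROM   projects
   WHERE  company_id = 11171
   AND    state < 6
   AND    is_template = FALSE);

ROLLBACK;  -- the setting reverts when the transaction ends
```

If the index-based plan turns out to be much faster, that points at cost estimation rather than a missing index. This setting is meant for investigation and should not be left on in application code.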

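In this question I am using COUNT(1) together with the AS MATERIALIZED / AS NOT MATERIALIZED controls to produce a smaller example.

The larger query in the application does not use these, but it also performs some filtering on the tasks table before producing a number of other CTEs and some aggregate metrics as the final result.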

Schema

                                                Table "public.projects"
          Column           |             Type              | Collation | Nullable |               Default
----------------------------+-------------------------------+-----------+----------+--------------------------------------
id                         | integer                       |           | not null | nextval('projects_id_seq'::regclass)
name                       | character varying(255)        |           |          |
description                | text                          |           |          |
due                        | timestamp without time zone   |           |          |
created_at                 | timestamp without time zone   |           | not null |
updated_at                 | timestamp without time zone   |           | not null |
client_id                  | integer                       |           |          |
company_id                 | integer                       |           |          |
repeat                     | boolean                       |           | not null | true
end_date                   | timestamp without time zone   |           |          |
prev_id                    | integer                       |           |          |
next_id                    | integer                       |           |          |
completed_tasks_count      | integer                       |           | not null | 0
tasks_count                | integer                       |           | not null | 0
done_at                    | timestamp without time zone   |           |          |
state                      | integer                       |           |          |
schedule                   | text                          |           |          |
start_date                 | timestamp without time zone   |           |          |
manager_id                 | integer                       |           |          |
partner_id                 | integer                       |           |          |
exschedule                 | text                          |           |          |
extdue                     | timestamp without time zone   |           |          |
is_template                | boolean                       |           | not null | false
predicted_duration         | integer                       |           |          | 0
budget                     | integer                       |           |          | 0
cached_effective_due_date  | timestamp without time zone   |           |          |
cached_manager_fullname    | character varying(255)        |           |          | ''::character varying
cached_partner_fullname    | character varying(255)        |           |          | ''::character varying
cached_staffs_fullnames    | text                          |           |          | ''::text
cached_staffs_ids          | text                          |           |          | ''::text
cached_label_ids           | character varying(255)        |           |          | ''::character varying
date_in                    | timestamp without time zone   |           |          |
cached_label_sum           | integer                       |           |          | 0
date_out                   | timestamp without time zone   |           |          |
turn_around_time           | integer                       |           |          | 0
dues_calculated_at         | timestamp without time zone   |           |          |
dues                       | timestamp without time zone[] |           |          |
dues_rewind                | integer[]                     |           |          |
quickbooks_item_id         | integer                       |           |          |
perform_final_review       | boolean                       |           | not null | false
quickbooks_desktop_item_id | integer                       |           |          |
billing_model_type         | character varying             |           | not null | 'staff'::character varying
series_id                  | integer                       |           |          |
shared                     | boolean                       |           |          | false
Indexes:
   "projects_pkey" PRIMARY KEY, btree (id)
   "index_projects_on_cached_effective_due_date" btree (cached_effective_due_date)
   "index_projects_on_client_id" btree (client_id)
   "index_projects_on_company_id" btree (company_id)
   "index_projects_on_manager_id" btree (manager_id)
   "index_projects_on_next_id" btree (next_id)
   "index_projects_on_partner_id" btree (partner_id)
   "index_projects_on_series_id" btree (series_id)
   "index_projects_on_shared_and_is_template" btree (shared, is_template) WHERE shared = true AND is_template = true
Foreign-key constraints:
   "fk_rails_243d23cb48" FOREIGN KEY (quickbooks_desktop_item_id) REFERENCES quickbooks_desktop_items(id)
   "fk_rails_33ba8711de" FOREIGN KEY (quickbooks_item_id) REFERENCES quickbooks_items(id)
   "fk_rails_fcf0ca7614" FOREIGN KEY (series_id) REFERENCES series(id) NOT VALID
Referenced by:
   TABLE "tasks" CONSTRAINT "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The projects table has 14,273,833 rows.

  • 124,005 of them have is_template = true

                                              Table "public.tasks"
        Column          |            Type             | Collation | Nullable |              Default
-------------------------+-----------------------------+-----------+----------+-----------------------------------
id                      | integer                     |           | not null | nextval('tasks_id_seq'::regclass)
name                    | character varying(255)      |           |          |
description             | text                        |           |          |
duedate                 | timestamp without time zone |           |          |
created_at              | timestamp without time zone |           | not null |
updated_at              | timestamp without time zone |           | not null |
project_id              | integer                     |           | not null |
done                    | boolean                     |           | not null | false
position                | integer                     |           |          |
done_at                 | timestamp without time zone |           |          |
dueafter                | integer                     |           |          |
done_by_user_id         | integer                     |           |          |
predicted_duration      | integer                     |           |          |
auto_predicted_duration | integer                     |           |          | 0
assignable_id           | integer                     |           |          |
assignable_type         | character varying           |           |          |
will_assign_to_client   | boolean                     |           | not null | false
Indexes:
   "tasks_pkey" PRIMARY KEY, btree (id)
   "index_tasks_on_assignable_type_and_assignable_id" btree (assignable_type, assignable_id)
   "index_tasks_on_done_by_user_id" btree (done_by_user_id)
   "index_tasks_on_duedate" btree (duedate)
   "test_tasks_on_project" btree (project_id)
Foreign-key constraints:
   "tasks_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(id)

The tasks table has 76,716,433 rows.

System specs

  • PostgreSQL 13.1
  • 2.9 GHz 6-core Intel Core i9
  • 32 GB memory
  • macOS 10.15.7

The main reason for the different query plan is probably the higher number of rows Postgres estimates it will get back from projects:

(cost=0.00..42021.35 rows=10507 width=0) (actual time=35.642..35.642 rows=10507 loops=1)

vs.

(cost=0.43..277961.56 rows=31322 width=4) (actual time=0.591..6970.696 rows=10507 loops=3)

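That is an overestimate by a factor of 3 (31322 estimated rows against 10507 actual), which is not dramatic, but apparently enough to favor a different (inferior) query plan.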
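To see what the planner is working from, the stored column statistics can be inspected and the estimate for the projects filter checked on its own. This is only a sketch: `pg_stats` is the standard statistics view, and the `public` schema name is taken from the table listings in the question.

```sql
-- Statistics the planner holds for the filtered columns.
SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'
AND    tablename  = 'projects'
AND    attname IN ('company_id', 'state', 'is_template');

-- Estimated vs. actual row count for the filter alone.
EXPLAIN (ANALYZE)
SELECT id
FROM   projects
WHERE  company_id = 11171
AND    state < 6
AND    is_template = FALSE;
```

Comparing `rows=` in the estimate with the actual count shows whether the misestimate already exists at this level or only appears in the larger query.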

Assuming projects.is_template is mostly false, I suggest these multicolumn indexes:

CREATE INDEX ON projects(company_id, state);

Equality first, range later: company_id is tested with =, state with <.

You might also try increasing the statistics target for company_id and state, then ANALYZE the table, to get better estimates.
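Raising the statistics target could look like the sketch below. The value 1000 is an assumption chosen for illustration (the default is 100 and the maximum is 10000); the right value depends on the data distribution.

```sql
-- Collect more detailed statistics for the two filtered columns.
ALTER TABLE projects ALTER COLUMN company_id SET STATISTICS 1000;
ALTER TABLE projects ALTER COLUMN state      SET STATISTICS 1000;

-- The new targets only take effect once the table is analyzed again.
ANALYZE projects;
```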

And:

CREATE INDEX ON tasks (project_id, id);

Plus increase the statistics target for tasks.project_id and ANALYZE.
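For tasks.project_id this might be sketched as follows; again, 1000 is an assumed value, not a recommendation derived from this data set.

```sql
-- More detailed statistics for the join column.
ALTER TABLE tasks ALTER COLUMN project_id SET STATISTICS 1000;

-- Re-gather statistics; on a table of this size this can take a while.
ANALYZE tasks;
```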

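In both cases, the multicolumn index can replace the existing index on just projects.company_id or tasks.project_id. Since all columns are integer, the size of the index will be the same, except for the effects of index deduplication (added with Postgres 13), which shows in your original index on the highly duplicative tasks.project_id.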
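Whether the sizes really match on a given data set can be checked directly. A minimal sketch using catalog functions; it lists every index on tasks, so the old and the new index appear side by side if both exist.

```sql
-- Size of each index on tasks, largest first.
SELECT i.indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(i.indexrelid::regclass)) AS index_size
FROM   pg_index i
WHERE  i.indrelid = 'public.tasks'::regclass
ORDER  BY pg_relation_size(i.indexrelid::regclass) DESC;
```

The same query with 'public.projects' covers the other table. Dropping the single-column index is best left until the plans confirm the multicolumn one is being used.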

As for the query itself:

SELECT t.id
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;

A plain join like this should be faster.
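To compare it against the CTE versions in the question, the plain join can be run with the same count and the same EXPLAIN options. This is a sketch of the comparison, not a measured result.

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   projects p
JOIN   tasks t ON t.project_id = p.id
WHERE  p.company_id = 11171
AND    p.state < 6
AND    p.is_template = FALSE;
```

The thing to look for in the output is whether tasks is read through an index on project_id (a Nested Loop with an Index Scan or Index Only Scan) instead of a Parallel Seq Scan.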

Source: https://dba.stackexchange.com/questions/281150