Optimizing a Postgres query with a large IN

March 28, 2022

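This query fetches a list of posts created by the people you follow. You can follow an unlimited number of people, but most users follow fewer than 1000.

With this style of query, the obvious optimization would be to cache the "Post" ids, but unfortunately I don't have time for that right now.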

EXPLAIN ANALYZE SELECT
   "Post"."id",
   "Post"."actionId",
   "Post"."commentCount",
   ...
FROM
   "Posts" AS "Post"
INNER JOIN "Users" AS "user" ON "Post"."userId" = "user"."id"
LEFT OUTER JOIN "ActivityLogs" AS "activityLog" ON "Post"."activityLogId" = "activityLog"."id"
LEFT OUTER JOIN "WeightLogs" AS "weightLog" ON "Post"."weightLogId" = "weightLog"."id"
LEFT OUTER JOIN "Workouts" AS "workout" ON "Post"."workoutId" = "workout"."id"
LEFT OUTER JOIN "WorkoutLogs" AS "workoutLog" ON "Post"."workoutLogId" = "workoutLog"."id"
LEFT OUTER JOIN "Workouts" AS "workoutLog.workout" ON "workoutLog"."workoutId" = "workoutLog.workout"."id"
WHERE
"Post"."userId" IN (
   201486,
   1825186,
   998608,
   340844,
   271909,
   308218,
   341986,
   216893,
   1917226,
   ...  -- many more
)
AND "Post"."private" IS NULL
ORDER BY
   "Post"."createdAt" DESC
LIMIT 10;

Output:

Limit  (cost=3.01..4555.20 rows=10 width=2601) (actual time=7923.011..7973.138 rows=10 loops=1)
 ->  Nested Loop Left Join  (cost=3.01..9019264.02 rows=19813 width=2601) (actual time=7923.010..7973.133 rows=10 loops=1)
       ->  Nested Loop Left Join  (cost=2.58..8935617.96 rows=19813 width=2376) (actual time=7922.995..7973.063 rows=10 loops=1)
             ->  Nested Loop Left Join  (cost=2.15..8821537.89 rows=19813 width=2315) (actual time=7922.984..7961.868 rows=10 loops=1)
                   ->  Nested Loop Left Join  (cost=1.71..8700662.11 rows=19813 width=2090) (actual time=7922.981..7961.846 rows=10 loops=1)
                         ->  Nested Loop Left Join  (cost=1.29..8610743.68 rows=19813 width=2021) (actual time=7922.977..7961.816 rows=10 loops=1)
                               ->  Nested Loop  (cost=0.86..8498351.81 rows=19813 width=1964) (actual time=7922.972..7960.723 rows=10 loops=1)
                                     ->  Index Scan using posts_createdat_public_index on "Posts" "Post"  (cost=0.43..8366309.39 rows=20327 width=261) (actual time=7922.869..7960.509 rows=10 loops=1)
                                           Filter: ("userId" = ANY ('{201486,1825186,998608,340844,271909,308218,341986,216893,1917226, ... many more ...}'::integer[]))
                                           Rows Removed by Filter: 218360
                                     ->  Index Scan using "Users_pkey" on "Users" "user"  (cost=0.43..6.49 rows=1 width=1703) (actual time=0.005..0.006 rows=1 loops=10)
                                           Index Cond: (id = "Post"."userId")
                               ->  Index Scan using "ActivityLogs_pkey" on "ActivityLogs" "activityLog"  (cost=0.43..5.66 rows=1 width=57) (actual time=0.107..0.107 rows=0 loops=10)
                                     Index Cond: ("Post"."activityLogId" = id)
                         ->  Index Scan using "WeightLogs_pkey" on "WeightLogs" "weightLog"  (cost=0.42..4.53 rows=1 width=69) (actual time=0.001..0.001 rows=0 loops=10)
                               Index Cond: ("Post"."weightLogId" = id)
                   ->  Index Scan using "Workouts_pkey" on "Workouts" workout  (cost=0.43..6.09 rows=1 width=225) (actual time=0.001..0.001 rows=0 loops=10)
                         Index Cond: ("Post"."workoutId" = id)
             ->  Index Scan using "WorkoutLogs_pkey" on "WorkoutLogs" "workoutLog"  (cost=0.43..5.75 rows=1 width=61) (actual time=1.118..1.118 rows=0 loops=10)
                   Index Cond: ("Post"."workoutLogId" = id)
       ->  Index Scan using "Workouts_pkey" on "Workouts" "workoutLog.workout"  (cost=0.43..4.21 rows=1 width=225) (actual time=0.004..0.004 rows=0 loops=10)
             Index Cond: ("workoutLog"."workoutId" = id)
Total runtime: 7974.524 ms

How can this be optimized, for the time being?

I have the following relevant indexes:

-- Gets used
CREATE INDEX  "posts_createdat_public_index" ON "public"."Posts" USING btree("createdAt" DESC) WHERE "private" IS null;
-- Don't get used
CREATE INDEX  "posts_userid_fk_index" ON "public"."Posts" USING btree("userId");
CREATE INDEX  "posts_following_index" ON "public"."Posts" USING btree("userId", "createdAt" DESC) WHERE "private" IS null;

Perhaps this requires a large partial composite index on "createdAt" and "userId" where "private" IS NULL?

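Instead of using a huge IN list, join on a VALUES expression, or, if the list is large enough, use a temp table, index it, then join on it.
It would be nice if PostgreSQL could do this internally and automatically, but at this point the planner doesn't know how.
Similar topics: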
https://stackoverflow.com/q/24647503/398670
https://stackoverflow.com/q/17813492/398670
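
The two suggestions above might be sketched as follows. This is a minimal illustration, not the asker's full query: the temp table name followed_ids and the alias followed are hypothetical, only three of the ids are shown, and the select list is cut down to two columns that appear in the question.

```sql
-- Variant 1: join on a VALUES expression instead of a long IN list.
SELECT "Post"."id", "Post"."createdAt"
FROM   (VALUES (201486), (1825186), (998608)) AS followed ("userId")
JOIN   "Posts" AS "Post" USING ("userId")
WHERE  "Post"."private" IS NULL
ORDER  BY "Post"."createdAt" DESC
LIMIT  10;

-- Variant 2: for larger lists, load the ids into an indexed temp table.
BEGIN;

-- Hypothetical temp table; the primary key provides the index
-- and also rules out duplicate ids.
CREATE TEMP TABLE followed_ids (
    "userId" integer PRIMARY KEY
) ON COMMIT DROP;

INSERT INTO followed_ids ("userId")
VALUES (201486), (1825186), (998608);  -- remaining ids go here

-- Autovacuum does not process temp tables, so gather statistics manually.
ANALYZE followed_ids;

SELECT "Post"."id", "Post"."createdAt"
FROM   "Posts" AS "Post"
JOIN   followed_ids USING ("userId")
WHERE  "Post"."private" IS NULL
ORDER  BY "Post"."createdAt" DESC
LIMIT  10;

COMMIT;
```

Whether either variant actually beats the original plan depends on the data distribution, so it is worth comparing them with EXPLAIN ANALYZE on the real tables.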

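There are actually two different variants of the IN construct in Postgres. One works with a subquery expression (returning a set); the other works with a list of values, which is just shorthand for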
expression = value1
OR
expression = value2
OR
...
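You are using the second form, which is fine for a short list but much slower for long lists. Provide your list of values as a subquery expression instead. I was recently made aware of this variant: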
WHERE "Post"."userId" IN (VALUES (201486), (1825186), (998608), ... )
I like to pass an array, unnest it, and join to it. Performance is similar, but the syntax is shorter:
...
FROM   unnest('{201486,1825186,998608, ...}'::int[]) "userId"
JOIN   "Posts" "Post" USING ("userId")
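The two forms are equivalent as long as there are no duplicates in the provided set or array. Otherwise, the second form with a JOIN returns duplicate rows, while the first form with IN returns only a single instance. This subtle difference also leads to different query plans.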
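
A small self-contained sketch of that difference, using a hypothetical scratch table demo_posts rather than the real "Posts" table:

```sql
-- Hypothetical scratch table, for illustration only.
CREATE TEMP TABLE demo_posts (id int, "userId" int);
INSERT INTO demo_posts VALUES (1, 10), (2, 20);

-- IN with a subquery: the matching post is returned once,
-- even though user 10 is listed twice.
SELECT id
FROM   demo_posts
WHERE  "userId" IN (VALUES (10), (10));

-- JOIN against the unnested array: the same post is returned
-- once per matching array element, so it appears twice here.
SELECT p.id
FROM   unnest('{10,10}'::int[]) AS u ("userId")
JOIN   demo_posts p USING ("userId");

-- De-duplicating the array first makes the JOIN form
-- return the post once again.
SELECT p.id
FROM   (SELECT DISTINCT "userId"
        FROM   unnest('{10,10}'::int[]) AS t ("userId")) u
JOIN   demo_posts p USING ("userId");
```

If the id list comes from application code, de-duplicating it before it reaches the query avoids the extra DISTINCT step.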
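Obviously, you need an index on "Posts"."userId".
For very long lists (thousands of values), go with an indexed temp table as @Craig suggests. This allows combined bitmap index scans over both tables, which is typically faster as soon as there are multiple tuples per data page to fetch from disk.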
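Related:
How to use ANY instead of IN in a WHERE clause with Rails?
Aside: your naming convention is not very helpful; it makes your code verbose and hard to read. Use legal, lower-case, unquoted identifiers instead.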

Source: https://dba.stackexchange.com/questions/91247