PostgreSQL

Independent aggregation of related entities, while also ordering by a related entity (in a single statement)

  • October 18, 2020

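I have a model in which books and authors have a many-to-many relationship (a book can have several authors, and an author can write several books) through a table I have named authorships.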

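My goal is to query some subset of books and get back the collection of related authorships as well as the collection of related authors (i.e. without duplicates), each in a specific order. Essentially, I want to preserve the normalized/separated structure of the records; I do not want to denormalize in any way (only to order).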

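Normally, I think you would do this with multiple statements, using a function or external code to supply IDs to IN expressions or similar. However, I have been able to do it in a single statement in PostgreSQL using the following pattern: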

WITH matched_books AS (
 SELECT id, title FROM books
 -- Could be any criteria:
 WHERE title LIKE 'The %'
),
related_authorships AS (
 SELECT authorships.id, book_id, author_id
 FROM authorships
 JOIN matched_books ON book_id = matched_books.id
),
related_authors AS (
 SELECT id, name
 FROM authors
 -- Could also use DISTINCT and do a join here, but I understand EXISTS is typically better for performance:
 WHERE EXISTS (SELECT 1 FROM related_authorships WHERE author_id = authors.id)
)
SELECT
 -- Scalar subqueries that each return a single JSON array of objects:
 -- JSON is completely fine for my purposes, but could also use array_agg.
 (SELECT json_agg(matched_books.* ORDER BY title) FROM matched_books) books,
 (SELECT json_agg(related_authorships.* ORDER BY id) FROM related_authorships) authorships,
 (SELECT json_agg(related_authors.* ORDER BY name) FROM related_authors) authors;

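(Side note: in earlier attempts I used LEFT JOINs at the top level together with json_agg(DISTINCT ...), but that kept me from using ORDER BY in any meaningful way, and the performance seemed more erratic/worse.)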

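While this approach works almost perfectly, I now want to order books by values stored in their related authors and/or authorships. As an obvious example, say I want them ordered by the name of the book's author or, if there are multiple authors, use a column in authorships (in this case it could simply be the smallest integer id) to determine which author should be used.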

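I can't figure out a way to allow this while still returning the collections independently, at least not without some duplicated operations. How would you solve this?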

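I'm not sure I have understood you correctly, but you probably need to group the authors' other fields so that there is only a single value to ORDER BY. It could be the count of authors, or the min/max of name, or something similar: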

WITH books AS (
 SELECT 1 AS id, 'The 1' AS title
  UNION ALL
 SELECT 2, 'The 2'
  UNION ALL
 SELECT 3, '3'
), authorships AS (
 SELECT 1 AS id, 1 AS book_id, 1 AS author_id
  UNION ALL
 SELECT 2, 1, 2
  UNION ALL
 SELECT 3, 1, 3
  UNION ALL
 SELECT 4, 2, 1
  UNION ALL
 SELECT 5, 3, 1
), authors AS (
 SELECT 1 AS id, 'name1' AS name
  UNION ALL
 SELECT 2, 'name2'
  UNION ALL
 SELECT 3, 'name3'
  UNION ALL
 SELECT 4, 'name4'
), filtered AS (
 SELECT book_id, title, ba.id, author_id, name
   FROM books AS b
   JOIN authorships AS ba ON ba.book_id = b.id
   JOIN authors AS a ON a.id = ba.author_id WHERE title LIKE 'The %'
) 
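SELECT (
        -- Aggregate the matched books, ordered by their number of authors
        SELECT json_agg(b.* ORDER BY count)
          FROM (
                 SELECT book_id AS id, title, count(name)
                   FROM filtered AS f GROUP BY 1,2
               ) AS f
          -- The LATERAL subquery keeps only id and title in each JSON object
          JOIN LATERAL (SELECT id, title) AS b ON true
      ) AS book;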

          json_agg           
-----------------------------
[{"id":2,"title":"The 2"}, +
 {"id":1,"title":"The 1"}]
(1 row)
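
The min/max of name mentioned above can be sketched the same way. This is only a minimal illustration, assuming the same sample data; per_book and sort_name are names introduced here, and json_build_object is used in place of the LATERAL trick purely to keep the example short.

```sql
WITH books (id, title) AS (
  VALUES (1, 'The 1'), (2, 'The 2'), (3, '3')
), authorships (id, book_id, author_id) AS (
  VALUES (1, 1, 1), (2, 1, 2), (3, 1, 3), (4, 2, 1), (5, 3, 1)
), authors (id, name) AS (
  VALUES (1, 'name1'), (2, 'name2'), (3, 'name3'), (4, 'name4')
), per_book AS (
  -- One row per matched book, with a single sortable value per book
  SELECT b.id, b.title, min(a.name) AS sort_name
    FROM books AS b
    JOIN authorships AS ba ON ba.book_id = b.id
    JOIN authors AS a ON a.id = ba.author_id
   WHERE b.title LIKE 'The %'
   GROUP BY b.id, b.title
)
-- Order by the aggregated name; id breaks ties deterministically
SELECT json_agg(json_build_object('id', id, 'title', title)
                ORDER BY sort_name, id) AS books
  FROM per_book;
```

With this data both matched books have 'name1' as their smallest author name, so the id tie-breaker decides and book 1 comes before book 2.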

So, after a lot of trial and error, I was able to come up with two approaches that run at an acceptable speed.

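Using the approach from the example in the question, I determined that there is unfortunately no way to avoid repeating the joins: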

WITH matched_books AS (
 SELECT id, title FROM books
 WHERE title LIKE 'The %'
),
related_authorships AS (
 SELECT authorships.id, book_id, author_id
 FROM authorships
 JOIN matched_books ON book_id = matched_books.id
),
related_authors AS (
 SELECT id, name
 FROM authors
 WHERE EXISTS (SELECT 1 FROM related_authorships WHERE author_id = authors.id)
)
SELECT
 (SELECT json_agg(matched_books.* ORDER BY first_author_name)
   FROM matched_books
   LEFT JOIN (
     SELECT DISTINCT ON (book_id) book_id, name AS first_author_name
     FROM related_authorships
     LEFT JOIN related_authors ON author_id = related_authors.id
     ORDER BY book_id, related_authorships.id
   ) sub ON book_id = matched_books.id
 ) books,
 (SELECT json_agg(related_authorships.* ORDER BY id) FROM related_authorships) authorships,
 (SELECT json_agg(related_authors.* ORDER BY name) FROM related_authors) authors;

The EXPLAIN plan looks like this:

Result  (cost=1344.93..1344.94 rows=1 width=96)
 CTE matched_books
   ->  Seq Scan on books  (cost=0.00..157.94 rows=1169 width=23)
         Filter: (title ~~ 'The %'::text)
 CTE related_authorships
   ->  Hash Join  (cost=128.05..173.70 rows=1204 width=12)
         Hash Cond: (matched_books.id = authorships.book_id)
         ->  CTE Scan on matched_books  (cost=0.00..23.38 rows=1169 width=4)
         ->  Hash  (cost=74.69..74.69 rows=4269 width=12)
               ->  Seq Scan on authorships  (cost=0.00..74.69 rows=4269 width=12)
 CTE related_authors
   ->  Hash Join  (cost=31.59..138.06 rows=1204 width=18)
         Hash Cond: (authors.id = related_authorships.author_id)
         ->  Seq Scan on authors  (cost=0.00..85.99 rows=2699 width=18)
         ->  Hash  (cost=29.09..29.09 rows=200 width=4)
               ->  HashAggregate  (cost=27.09..29.09 rows=200 width=4)
                     Group Key: related_authorships.author_id
                     ->  CTE Scan on related_authorships  (cost=0.00..24.08 rows=1204 width=4)
 InitPlan 4 (returns $3)
   ->  Aggregate  (cost=821.01..821.02 rows=1 width=32)
         ->  Hash Left Join  (cost=791.57..818.09 rows=1169 width=60)
               Hash Cond: (matched_books_1.id = sub.book_id)
               ->  CTE Scan on matched_books matched_books_1  (cost=0.00..23.38 rows=1169 width=32)
               ->  Hash  (cost=789.07..789.07 rows=200 width=36)
                     ->  Subquery Scan on sub  (cost=750.83..789.07 rows=200 width=36)
                           ->  Unique  (cost=750.83..787.07 rows=200 width=40)
                                 ->  Sort  (cost=750.83..768.95 rows=7248 width=40)
                                       Sort Key: related_authorships_1.book_id, related_authorships_1.id
                                       ->  Merge Left Join  (cost=171.37..286.11 rows=7248 width=40)
                                             Merge Cond: (related_authorships_1.author_id = related_authors.id)
                                             ->  Sort  (cost=85.69..88.70 rows=1204 width=12)
                                                   Sort Key: related_authorships_1.author_id
                                                   ->  CTE Scan on related_authorships related_authorships_1  (cost=0.00..24.08 rows=1204 width=12)
                                             ->  Sort  (cost=85.69..88.70 rows=1204 width=36)
                                                   Sort Key: related_authors.id
                                                   ->  CTE Scan on related_authors  (cost=0.00..24.08 rows=1204 width=36)
 InitPlan 5 (returns $4)
   ->  Aggregate  (cost=27.09..27.10 rows=1 width=32)
         ->  CTE Scan on related_authorships related_authorships_2  (cost=0.00..24.08 rows=1204 width=32)
 InitPlan 6 (returns $5)
   ->  Aggregate  (cost=27.09..27.10 rows=1 width=32)
         ->  CTE Scan on related_authors related_authors_1  (cost=0.00..24.08 rows=1204 width=88)
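
To show in isolation how the DISTINCT ON subquery in the first approach picks one author per book, here is a small self-contained sketch. The sample rows and the first_author CTE name are made up for illustration; only the DISTINCT ON ... ORDER BY book_id, id pattern is taken from the query above.

```sql
WITH books (id, title) AS (
  VALUES (1, 'The Zebra'), (2, 'The Apple')
), authorships (id, book_id, author_id) AS (
  VALUES (10, 1, 2), (11, 1, 1), (12, 2, 3)
), authors (id, name) AS (
  VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol')
), first_author AS (
  -- DISTINCT ON keeps the first row per book_id in ORDER BY order,
  -- i.e. the authorship with the smallest id for each book
  SELECT DISTINCT ON (s.book_id) s.book_id, a.name AS first_author_name
    FROM authorships AS s
    JOIN authors AS a ON a.id = s.author_id
   ORDER BY s.book_id, s.id
)
SELECT b.id, b.title, fa.first_author_name
  FROM books AS b
  LEFT JOIN first_author AS fa ON fa.book_id = b.id
 ORDER BY fa.first_author_name;
```

Here book 1 has authorships 10 and 11, so authorship 10 (Bob) wins, and book 2 gets Carol. The result is therefore 'The Zebra' (Bob) followed by 'The Apple' (Carol), even though the titles alone would sort the other way.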

Question: apart from the initial LIKE condition, am I missing any obvious indexes or other optimizations?

The second approach is to join everything first and then extract each entity type, which is admittedly a bit awkward:

WITH joined AS (
 -- Use row/composite values to keep things separate
 SELECT books, authorships, authors
 FROM (SELECT id, title FROM books) books
 LEFT JOIN (SELECT id, book_id, author_id FROM authorships) authorships ON books.id = authorships.book_id
 LEFT JOIN (SELECT id, name FROM authors) authors ON authors.id = authorships.author_id
 WHERE title LIKE 'The %'
),
related_authorships AS (
 SELECT DISTINCT ON ((authorships).id) (authorships).*
 FROM joined
 WHERE (authorships).id IS NOT NULL
),
related_authors AS (
 SELECT DISTINCT ON ((authors).id) (authors).*
 FROM joined
 WHERE (authors).id IS NOT NULL
)
SELECT
 (SELECT json_agg(books ORDER BY first_author_name)
   FROM (
     SELECT DISTINCT ON ((books).id) books, (authors).name AS first_author_name
     FROM joined
     ORDER BY (books).id, (authorships).id
   ) sub
 ) books,
 (SELECT json_agg(related_authorships.* ORDER BY id) FROM related_authorships) authorships,
 (SELECT json_agg(related_authors.* ORDER BY name) FROM related_authors) authors;

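I won't paste the query plan; its estimated cost is lower than the first query's, but in practice it takes slightly longer on average (I know I could apply some micro-optimizations to this particular version, but I left it like this for clarity). Combined with the awkwardness, I prefer the first approach.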
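
For readers unfamiliar with the row/composite values used in the second approach, the following reduced sketch shows the mechanism on its own. The sample rows and the sample_books / sample_authorships names are assumptions made for this illustration; selecting a subquery alias as a whole row and then reading a field with (alias).column follows the pattern in the query above.

```sql
WITH sample_books (id, title) AS (
  VALUES (1, 'The Zebra'), (2, 'The Apple')
), sample_authorships (id, book_id, author_id) AS (
  VALUES (10, 1, 2), (11, 1, 1), (12, 2, 3)
), joined AS (
  -- b and s are whole-row (composite) values, one pair per joined row
  SELECT b, s
    FROM (SELECT id, title FROM sample_books) AS b
    LEFT JOIN (SELECT id, book_id, author_id FROM sample_authorships) AS s
      ON s.book_id = (b).id
)
-- Book 1 appears twice in joined (two authorships);
-- DISTINCT ON collapses it back to one row per book
SELECT DISTINCT ON ((b).id) (b).id, (b).title
  FROM joined
 ORDER BY (b).id;
```

This returns one row per book, (1, 'The Zebra') and (2, 'The Apple'), which is the de-duplication step each of the extraction CTEs performs.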

Quoted from: https://dba.stackexchange.com/questions/275125