Evaluating different EXPLAIN plans on Redshift

November 7, 2018

I am trying to understand EXPLAIN output on Redshift. If I have data like this:

id | user_id | created_at
---|---------------------------------
1  | 1       | 2017-02-08 14:32:10.96
2  | 1       | 2017-02-07 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96
4  | 2       | 2017-02-05 14:32:10.96

I want:

id | user_id | created_at
---|---------------------------------
1  | 1       | 2017-02-08 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96

I have these two queries:

SELECT user_id,
      created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM my_table) x
WHERE rownum = 1;

with this EXPLAIN output:

XN Subquery Scan x  (cost=1000001263779.68..1000001513986.60 rows=50042 width=16)
 Filter: (rownum = 1)
 ->  XN Window  (cost=1000001263779.68..1000001388883.14 rows=10008277 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Sort  (cost=1000001263779.68..1000001288800.37 rows=10008277 width=16)
             Sort Key: user_id, created_at
             ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

And then the other query:

SELECT ac1.user_id, ac1.created_at FROM my_table ac1
JOIN 
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM my_table
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

and its EXPLAIN output:

XN Hash Join DS_DIST_NONE  (cost=150798.74..771939079.62 rows=7257 width=16)
 Hash Cond: (("outer".created_at = "inner".maxdate) AND ("outer".user_id = "inner".user_id))
 ->  XN Seq Scan on my_table ac1  (cost=0.00..100082.77 rows=10008277 width=16)
 ->  XN Hash  (cost=150606.01..150606.01 rows=38548 width=16)
       ->  XN Subquery Scan ac2  (cost=150124.15..150606.01 rows=38548 width=16)
             ->  XN HashAggregate  (cost=150124.15..150220.52 rows=38548 width=16)
                   ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

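The first query turns out to be a bit slower, but I get lost when I try to read the EXPLAIN output. The cost appears to be higher for the query that uses ROW_NUMBER(), but so is the rows estimate.

What can I actually extract from these EXPLAIN outputs (unfortunately I cannot use EXPLAIN ANALYZE on Redshift)?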

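The step in the first query plan that is expensive, and that explains the difference, is the sort over a large number of rows. You are sorting the entire dataset (an O(n log n) operation, where n is the size of your partition) just so that you can pick the first entry. The other rows (#2 - #10,000,000) still have to be sorted even though you never look at them. MAX, on the other hand, is an O(n) operation, because you only need to keep track of a single value as you pass over the data.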
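
To make the difference concrete, here is a minimal Python sketch of the two strategies, run on the sample rows from the question. It is only an illustration of the reasoning, not how Redshift executes the queries; the function names are made up for this example, and the sort is done newest-first so that both strategies return the latest row per user.

```python
from datetime import datetime

# Sample rows from the question: (id, user_id, created_at).
ROWS = [
    (1, 1, datetime(2017, 2, 8, 14, 32, 10, 960000)),
    (2, 1, datetime(2017, 2, 7, 14, 32, 10, 960000)),
    (3, 2, datetime(2017, 2, 6, 14, 32, 10, 960000)),
    (4, 2, datetime(2017, 2, 5, 14, 32, 10, 960000)),
]


def latest_per_user_by_sort(rows):
    """Sort every row, then keep the first row of each user: O(n log n)."""
    # Newest created_at first; every single row takes part in the sort.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    # Stable sort by user_id keeps the newest-first order within each user.
    ordered.sort(key=lambda r: r[1])
    result = {}
    for _id, user_id, created_at in ordered:
        # Only "row number 1" of each partition is kept.
        result.setdefault(user_id, created_at)
    return result


def latest_per_user_single_pass(rows):
    """Track one running maximum per user while scanning once: O(n)."""
    result = {}
    for _id, user_id, created_at in rows:
        if user_id not in result or created_at > result[user_id]:
            result[user_id] = created_at
    return result


if __name__ == "__main__":
    print(latest_per_user_by_sort(ROWS))
    print(latest_per_user_single_pass(ROWS))
```

Both functions return the same mapping of user_id to latest created_at; the second one simply never orders the rows it is going to discard.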

Fabio gave a good answer.
For Redshift, though, it is worth adding that the way your data is physically laid out has a significant effect on the EXPLAIN plan cost.
Create some dummy data (inspired by https://stackoverflow.com/questions/38667215/redshift-how-can-i-generate-a-series-of-numbers-without-creating-a-table-call):

drop table if exists #my_table;
create table #my_table as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;

Reproduce an EXPLAIN plan similar to the one in the question:

explain
SELECT user_id,
       created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM #my_table) x
WHERE rownum = 1;

EXPLAIN output:

XN Subquery Scan x (cost=1000000000000.79..1000000000001.39 rows=1 width=16)
 Filter: (rownum = 1)
 ->  XN Window (cost=1000000000000.79..1000000000001.09 rows=24 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Sort (cost=1000000000000.79..1000000000000.85 rows=24 width=16)
             Sort Key: user_id, created_at
             ->  XN Network (cost=0.00..0.24 rows=24 width=16)
                   Distribute
                   ->  XN Seq Scan on "#my_table" (cost=0.00..0.24 rows=24 width=16)

Next:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM #my_table
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

EXPLAIN output:

XN Hash Join DS_DIST_INNER (cost=0.72..822858.77 rows=13 width=16)
 Inner Dist Key: ac1.user_id
 Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
 ->  XN Subquery Scan ac2 (cost=0.36..0.66 rows=24 width=16)
       ->  XN HashAggregate (cost=0.36..0.42 rows=24 width=16)
             ->  XN Seq Scan on "#my_table" (cost=0.00..0.24 rows=24 width=16)
 ->  XN Hash (cost=0.24..0.24 rows=24 width=16)
       ->  XN Seq Scan on "#my_table" ac1 (cost=0.00..0.24 rows=24 width=16)

According to the plan this is indeed much faster, but the data has to be redistributed between nodes (DS_DIST_INNER).

Now try:

drop table if exists #my_table_dist;
create table #my_table_dist
distkey(user_id)
sortkey(user_id, created_at)
as
select (row_number() over (order by 1)) - 1 as user_id
      ,(current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24;

Now run EXPLAIN:

explain
SELECT user_id,
       created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM #my_table_dist) x
WHERE rownum = 1;

EXPLAIN output:

XN Subquery Scan x (cost=0.00..0.78 rows=1 width=16)
 Filter: (rownum = 1)
 ->  XN Window (cost=0.00..0.48 rows=24 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Seq Scan on "#my_table_dist" (cost=0.00..0.24 rows=24 width=16)

The data is already sorted and distributed, so Redshift only has to read off the answer: the plan no longer contains an XN Sort or XN Network step.
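
As a rough illustration of why pre-sorted data removes the sort step, the following Python sketch picks the first row of each user_id partition in a single pass. It assumes the input is already stored in (user_id, created_at) order, as the sort key would arrange it. The rows and the function name are hypothetical and are not taken from the tables above.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows as (user_id, created_at), already in sort-key order.
# ISO date strings are used so that string order equals chronological order.
PRESORTED = [
    (0, "2018-11-05"),
    (0, "2018-11-07"),
    (1, "2018-11-06"),
    (2, "2018-11-01"),
    (2, "2018-11-04"),
]


def first_row_per_user(sorted_rows):
    """Return the row that would get rownum = 1 in each user_id partition.

    Assumes sorted_rows is ordered by (user_id, created_at), so consecutive
    rows with the same user_id form one partition and no sort is needed.
    """
    return [
        next(group)
        for _user_id, group in groupby(sorted_rows, key=itemgetter(0))
    ]


if __name__ == "__main__":
    print(first_row_per_user(PRESORTED))
```

If the input were not already in that order, the grouping would split a user's rows into several runs and the result would be wrong, which is why a plan over unsorted data needs the explicit sort.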

Meanwhile:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM #my_table_dist
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

EXPLAIN output:

XN Hash Join DS_DIST_INNER (cost=0.36..822858.77 rows=13 width=16)
 Inner Dist Key: ac1.user_id
 Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
 ->  XN Subquery Scan ac2 (cost=0.00..0.66 rows=24 width=16)
       ->  XN GroupAggregate (cost=0.00..0.42 rows=24 width=16)
             ->  XN Seq Scan on "#my_table_dist" (cost=0.00..0.24 rows=24 width=16)
 ->  XN Hash (cost=0.24..0.24 rows=24 width=16)
       ->  XN Seq Scan on "#my_table" ac1 (cost=0.00..0.24 rows=24 width=16)

Note that the total cost is unchanged, because the data still has to be redistributed between nodes (DS_DIST_INNER).
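
When comparing plans like these by hand, it can help to pull the two numbers out of each cost field. In Redshift's EXPLAIN output the first value is the relative cost of returning the first row and the second is the relative cost of completing the operation. A small Python sketch, with a hypothetical helper name, applied to the top lines of the two Hash Join plans:

```python
import re

# Top plan lines copied from the two Hash Join plans in this answer.
PLAN_MY_TABLE = (
    "XN Hash Join DS_DIST_INNER (cost=0.72..822858.77 rows=13 width=16)"
)
PLAN_MY_TABLE_DIST = (
    "XN Hash Join DS_DIST_INNER (cost=0.36..822858.77 rows=13 width=16)"
)

# Matches "cost=<first>..<second>" as printed in the plan text.
COST_RE = re.compile(r"cost=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def parse_cost(plan_line):
    """Return (first_row_cost, total_cost) from one EXPLAIN plan line."""
    match = COST_RE.search(plan_line)
    if match is None:
        raise ValueError("no cost=... field in: " + plan_line)
    return float(match.group(1)), float(match.group(2))


if __name__ == "__main__":
    for line in (PLAN_MY_TABLE, PLAN_MY_TABLE_DIST):
        print(parse_cost(line))
```

Only the first-row cost differs between the two plans; the total cost, which is dominated by the join with redistribution, is the same in both.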

Source: https://dba.stackexchange.com/questions/163588
