Evaluating different EXPLAIN plans on Redshift

  • November 7, 2018

I'm trying to understand EXPLAIN output on Redshift. Suppose I have data like this:

id | user_id | created_at
---|---------|-----------------------
1  | 1       | 2017-02-08 14:32:10.96
2  | 1       | 2017-02-07 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96
4  | 2       | 2017-02-05 14:32:10.96

I want:

id | user_id | created_at
---|---------|-----------------------
1  | 1       | 2017-02-08 14:32:10.96
3  | 2       | 2017-02-06 14:32:10.96

I have these two queries:

SELECT user_id,
       created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM my_table) x
WHERE rownum = 1;

With this EXPLAIN:

XN Subquery Scan x  (cost=1000001263779.68..1000001513986.60 rows=50042 width=16)
 Filter: (rownum = 1)
 ->  XN Window  (cost=1000001263779.68..1000001388883.14 rows=10008277 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Sort  (cost=1000001263779.68..1000001288800.37 rows=10008277 width=16)
             Sort Key: user_id, created_at
             ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

And then the other query:

SELECT ac1.user_id, ac1.created_at FROM my_table ac1
JOIN 
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM my_table
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE;

And its EXPLAIN:

XN Hash Join DS_DIST_NONE  (cost=150798.74..771939079.62 rows=7257 width=16)
 Hash Cond: (("outer".created_at = "inner".maxdate) AND ("outer".user_id = "inner".user_id))
 ->  XN Seq Scan on my_table ac1  (cost=0.00..100082.77 rows=10008277 width=16)
 ->  XN Hash  (cost=150606.01..150606.01 rows=38548 width=16)
       ->  XN Subquery Scan ac2  (cost=150124.15..150606.01 rows=38548 width=16)
             ->  XN HashAggregate  (cost=150124.15..150220.52 rows=38548 width=16)
                   ->  XN Seq Scan on my_table  (cost=0.00..100082.77 rows=10008277 width=16)

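The first query is a bit slower, but I get lost when I try to read the EXPLAIN output. The cost seems higher for the query that uses ROW_NUMBER(), and the same appears to be true of rows.

What can I actually extract from these EXPLAIN plans (sadly, I can't use ANALYZE on Redshift)?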

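The expensive step in the first query plan, and the one that explains the difference, is the sort over a large number of rows. You are sorting the entire data set (an O(n log n) operation, where n is the size of your partition) just so that you can pick the first entry. The other rows (#2 - #10,000,000) still have to be sorted even though you never look at them. MAX, on the other hand, is an O(n) operation, because you only need to keep track of a single value as you pass over the data.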
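To make the complexity argument concrete, here is a minimal Python sketch of the two strategies. It is only an illustration, not how Redshift executes either plan: the sample rows and both function names are hypothetical, and the sort is newest-first so that both approaches return the rows the question asks for.

```python
from datetime import datetime

# Hypothetical sample rows mirroring the question's table: (id, user_id, created_at).
ROWS = [
    (1, 1, datetime(2017, 2, 8, 14, 32, 10)),
    (2, 1, datetime(2017, 2, 7, 14, 32, 10)),
    (3, 2, datetime(2017, 2, 6, 14, 32, 10)),
    (4, 2, datetime(2017, 2, 5, 14, 32, 10)),
]


def latest_by_sorting(rows):
    """ROW_NUMBER-style: sort every row, then keep the first one per user."""
    # Stable two-pass sort: newest first, then grouped by user_id. O(n log n).
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    ordered = sorted(ordered, key=lambda r: r[1])
    result = []
    previous_user = object()  # sentinel that never equals a real user_id
    for row in ordered:
        if row[1] != previous_user:
            result.append(row)
            previous_user = row[1]
    return result


def latest_by_single_pass(rows):
    """MAX-style: one pass, remembering only the best row seen per user. O(n)."""
    best = {}
    for row in rows:
        current = best.get(row[1])
        if current is None or row[2] > current[2]:
            best[row[1]] = row
    # Sorting the (much smaller) set of users only makes the output order stable.
    return [best[user] for user in sorted(best)]


print(latest_by_sorting(ROWS) == latest_by_single_pass(ROWS))
```

Both functions return the same rows; the difference is that the first one orders all n rows before discarding most of them, while the second never stores more than one row per user.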

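Fabio gave a great answer.

For Redshift, though, it is worth adding that the way your data is physically laid out has a significant effect on the costs shown in the EXPLAIN plan.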

Create some dummy data (inspired by https://stackoverflow.com/questions/38667215/redshift-how-can-i-generate-a-series-of-numbers-without-creating-a-table-call):

drop table if exists #my_table;
create table #my_table as
select (row_number() over (order by 1)) - 1 as user_id,
       (current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24

Reproduce an EXPLAIN plan similar to the one in the question:

explain
SELECT user_id,
       created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM #my_table) x
WHERE rownum = 1

EXPLAIN:

XN Subquery Scan x  (cost=1000000000000.79..1000000000001.39 rows=1 width=16)
 Filter: (rownum = 1)
 ->  XN Window  (cost=1000000000000.79..1000000000001.09 rows=24 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Sort  (cost=1000000000000.79..1000000000000.85 rows=24 width=16)
             Sort Key: user_id, created_at
             ->  XN Network  (cost=0.00..0.24 rows=24 width=16)
                   Distribute
                   ->  XN Seq Scan on "#my_table"  (cost=0.00..0.24 rows=24 width=16)

Next:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM #my_table
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE

EXPLAIN:

XN Hash Join DS_DIST_INNER  (cost=0.72..822858.77 rows=13 width=16)
 Inner Dist Key: ac1.user_id
 Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
 ->  XN Subquery Scan ac2  (cost=0.36..0.66 rows=24 width=16)
       ->  XN HashAggregate  (cost=0.36..0.42 rows=24 width=16)
             ->  XN Seq Scan on "#my_table"  (cost=0.00..0.24 rows=24 width=16)
 ->  XN Hash  (cost=0.24..0.24 rows=24 width=16)
       ->  XN Seq Scan on "#my_table" ac1  (cost=0.00..0.24 rows=24 width=16)

The estimated cost is indeed much lower, but data is redistributed between nodes (DS_DIST_INNER).

Now try:

drop table if exists #my_table_dist;
create table #my_table_dist
distkey(user_id)
sortkey(user_id, created_at)
as
select (row_number() over (order by 1)) - 1 as user_id,
       (current_date - user_id::int)::timestamp created_at
from stl_load_errors
limit 24

Now run EXPLAIN:

explain
SELECT user_id,
       created_at
FROM
 ( SELECT user_id,
          created_at,
          row_number() OVER (PARTITION BY user_id
                             ORDER BY created_at) AS rownum
  FROM #my_table_dist) x
WHERE rownum = 1

EXPLAIN:

XN Subquery Scan x  (cost=0.00..0.78 rows=1 width=16)
 Filter: (rownum = 1)
 ->  XN Window  (cost=0.00..0.48 rows=24 width=16)
       Partition: user_id
       Order: created_at
       ->  XN Seq Scan on "#my_table_dist"  (cost=0.00..0.24 rows=24 width=16)

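The data is already sorted and distributed, so Redshift can simply read off the answer: the Sort and Network steps are gone from the plan.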
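The effect of a matching sort key can be sketched in Python as follows. This is only an analogy under stated assumptions: the list `PRESORTED` and the function `first_row_per_user` are hypothetical, and the rows are assumed to be stored in (user_id, created_at) order already, as the sort key above would keep them.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows already stored in (user_id, created_at) order.
PRESORTED = [
    (1, "2017-02-07 14:32:10"),
    (1, "2017-02-08 14:32:10"),
    (2, "2017-02-05 14:32:10"),
    (2, "2017-02-06 14:32:10"),
]


def first_row_per_user(sorted_rows):
    """One sequential pass with no sort; only valid if the input is pre-ordered."""
    # groupby only merges adjacent rows, which is enough when the data is sorted.
    return [next(group) for _, group in groupby(sorted_rows, key=itemgetter(0))]


print(first_row_per_user(PRESORTED))
```

Like the query as written (ORDER BY created_at ascending, rownum = 1), this picks the earliest row for each user. The point is that no sort is needed when the storage order already matches the window's partition and order columns.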

Meanwhile:

explain
SELECT ac1.user_id, ac1.created_at FROM #my_table ac1
JOIN
(
  SELECT user_id, MAX(created_at) AS MAXDATE
  FROM #my_table_dist
  GROUP BY user_id
) ac2
ON ac1.user_id = ac2.user_id
AND ac1.created_at = ac2.MAXDATE

EXPLAIN:

XN Hash Join DS_DIST_INNER  (cost=0.36..822858.77 rows=13 width=16)
 Inner Dist Key: ac1.user_id
 Hash Cond: (("outer".maxdate = "inner".created_at) AND ("outer".user_id = "inner".user_id))
 ->  XN Subquery Scan ac2  (cost=0.00..0.66 rows=24 width=16)
       ->  XN GroupAggregate  (cost=0.00..0.42 rows=24 width=16)
             ->  XN Seq Scan on "#my_table_dist"  (cost=0.00..0.24 rows=24 width=16)
 ->  XN Hash  (cost=0.24..0.24 rows=24 width=16)
       ->  XN Seq Scan on "#my_table" ac1  (cost=0.00..0.24 rows=24 width=16)

Note that the total cost is unchanged, because data still has to be redistributed between nodes (DS_DIST_INNER).
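Why redistribution dominates can be illustrated with a toy placement model in Python. Everything here is an assumption made for illustration: the slice count, the modulo placement rule (Redshift uses its own hash function), and the round-robin layout assumed for a table created without a distribution key.

```python
NUM_SLICES = 4  # hypothetical cluster size


def slice_for_key(user_id):
    """Toy key distribution: equal user_ids always land on the same slice."""
    return user_id % NUM_SLICES


def rows_to_move(placement):
    """Count rows that are not yet on the slice their join key maps to."""
    return sum(1 for user_id, current in placement if current != slice_for_key(user_id))


user_ids = list(range(24))  # the same 24 users as the dummy table

# Assumed round-robin placement, with rows arriving in an arbitrary (here reversed) order.
round_robin = [(u, i % NUM_SLICES) for i, u in enumerate(reversed(user_ids))]
# Placement when the table is distributed on user_id.
by_key = [(u, slice_for_key(u)) for u in user_ids]

print(rows_to_move(round_robin), rows_to_move(by_key))
```

In this model a join on user_id needs no data movement when both inputs are placed by key, while the round-robin table has to ship its rows to the right slices first. That network step is what a DS_DIST_INNER label in the plan stands for.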

Source: https://dba.stackexchange.com/questions/163588