按日期和小時確定組的正確索引和順序（2 列）

March 22, 2019

PostgreSQL 10.4

我有一個已經在date列上有索引的表。目前查詢計劃正在使用現有日期索引進行點陣圖堆掃描。我想為這個查詢添加一個新的索引，沒有參數被注入到查詢中，我從status列的部分索引開始，但我不知道如何處理分組和按部分排序。

select date, hour, sum(installs) as installs, sum(clicks) as clicks
from ho_aggregated_stats
where date &gt; (current_date - interval '2 day')
and (status='approved' or status is null)
group by date, hour
order by date, hour;

解釋https://explain.depesz.com/s/rnCW

QUERY PLAN                                                                                                                                                                            
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize GroupAggregate  (cost=992433.95..992442.13 rows=1488 width=24) (actual time=3903.296..3903.337 rows=43 loops=1)                                                              
 Group Key: date, hour                                                                                                                                                               
 Buffers: shared hit=85314 read=11496                                                                                                                                                
 I/O Timings: read=2896.216                                                                                                                                                          
 -&gt;  Sort  (cost=992433.95..992434.69 rows=1488 width=24) (actual time=3903.290..3903.298 rows=86 loops=1)                                                                           
       Sort Key: date, hour                                                                                                                                                          
       Sort Method: quicksort  Memory: 31kB                                                                                                                                          
       Buffers: shared hit=85314 read=11496                                                                                                                                          
       I/O Timings: read=2896.216                                                                                                                                                    
       -&gt;  Gather  (cost=992265.00..992418.27 rows=1488 width=24) (actual time=3903.167..3903.233 rows=86 loops=1)                                                                   
             Workers Planned: 1                                                                                                                                                      
             Workers Launched: 1                                                                                                                                                     
             Buffers: shared hit=85314 read=11496                                                                                                                                    
             I/O Timings: read=2896.216                                                                                                                                              
             -&gt;  Partial HashAggregate  (cost=991265.00..991269.47 rows=1488 width=24) (actual time=3899.779..3899.808 rows=43 loops=2)                                              
                   Group Key: date, hour                                                                                                                                             
                   Buffers: shared hit=149987 read=16557                                                                                                                             
                   I/O Timings: read=4694.060                                                                                                                                        
                   -&gt;  Parallel Bitmap Heap Scan on ho_aggregated_stats  (cost=21995.80..990158.35 rows=553327 width=16) (actual time=1232.325..3623.710 rows=592709 loops=2)        
                         Recheck Cond: (date &gt; (CURRENT_DATE - '2 days'::interval))                                                                                                  
                         Filter: (((status)::text = 'approved'::text) OR (status IS NULL))                                                                                           
                         Rows Removed by Filter: 3946                                                                                                                                
                         Heap Blocks: exact=91807                                                                                                                                    
                         Buffers: shared hit=149987 read=16557                                                                                                                       
                         I/O Timings: read=4694.060                                                                                                                                  
                         -&gt;  Bitmap Index Scan on index_ho_aggregated_stats_on_date  (cost=0.00..21948.76 rows=1160433 width=0) (actual time=1194.685..1194.685 rows=1339010 loops=1)
                               Index Cond: (date &gt; (CURRENT_DATE - '2 days'::interval))                                                                                              
                               Buffers: shared read=5003                                                                                                                             
                               I/O Timings: read=1082.452                                                                                                                            
Planning time: 0.611 ms                                                                                                                                                               
Execution time: 3948.178 ms

表模式

CREATE TABLE public.stats (
 id bigserial NOT NULL,
 "date" date NOT NULL,
 "hour" int4 NOT NULL,
 status varchar NULL,
 installs int4 NULL DEFAULT 0,
 clicks int4 NULL DEFAULT 0,
 CONSTRAINT stats_pkey PRIMARY KEY (id)
)
CREATE INDEX index_stats_on_date ON public.stats USING btree (date);

估計行數：~40M

更新：我檢查了狀態列上的分佈，75% 為空，20% 批准，5% 拒絕，認為狀態索引沒有必要。

您在稍後的評論中提供了重要資訊：
這是一個非常簡單的過程，當一天中的一小時沒有數據或點擊或安裝為 0 時會發出警報
這允許完全不同的，可能更快的查詢：
SELECT ts AS no_installs
FROM   generate_series(date_trunc('day', localtimestamp - interval '2 day')
                    , localtimestamp
                    , interval '1 hour') ts
WHERE  NOT EXISTS (
  SELECT 
  FROM   stats
  WHERE  date   =  ts::date
  AND    hours  =  extract(hour FROM ts)::int  -- 0 to 23
  AND   (status = 'approved' or status is null)
  AND    installs &gt; 0
  );
和：
SELECT ts AS no_clicks
FROM   ...  -- like above
  ...
  AND    clicks &gt; 0
  );
與您的原始版本相反，這還將檢測到根本沒有行的時間，這應該是最令人擔憂的情況（我猜）。
關於generate_series()：
在 PostgreSQL 中生成兩個日期之間的時間序列
另外：您的查詢中潛伏著一個極端情況錯誤：目前日期取決於您目前會話的時區設置。因此，查詢可能會根據您執行它的*位置而給出不同的（可能是誤導性的）結果。*通常最好使用timestamptz以避免任何此類並發症。您可以(date, hour)用單列替換timestamptz。大小相同。從中得出日期/小時等很便宜。
指數
即使在上使用普通索引，這也應該會大大加快(date, hour)，從而使索引本身的成本保持在較低水平。產生的半反連接NOT EXISTS可以掃描索引，並在找到第一個匹配行時立即丟棄，不再查找。無需像原來那樣聚合所有符合條件的行。
我建議將您擁有的索引：index_ho_aggregated_stats_on_dateon替換為(date)一個 on (date, hour)，或者(date DESC, hour DESC)，對於這種情況幾乎不重要。它與舊索引的大小完全相同，因為date+integer一起佔用 8 個字節。它實際上完成了舊索引所做的一切，甚至更多。看：
複合索引是否也適用於第一個欄位的查詢？
當且僅當大多數行未通過附加條件WHERE (status = 'approved' or status is null)時，使用該條件添加部分索引可能是有意義的。否則不創建另一個索引並讓 PostgresFILTER在索引掃描中添加一個步驟會更便宜。
如果您的表很大（每天大約 050 萬行？），從索引中刪除大量舊數據可能是有意義的。看：
優化大表最近行的查詢性能
或者，如果表行主要按物理排序，則考慮使用BRIN 索引。(date, hour)有關的：
PostgreSQL 規劃器為少數結果行選擇 btree 或 gist 索引
（如果你有索引，可能不值得(date, hour)。）
最後，由於*“每 5 分鐘通過多個 UPSERT 查詢將數據寫入表”*，請考慮在此之後立即進行手動VACUUM ANALYZE（或僅VACUUM）以允許僅索引掃描與覆蓋所有相關列的部分多列索引相結合。您必須權衡成本和收益。

引用自：https://dba.stackexchange.com/questions/232749

按日期和小時確定組的正確索引和順序（2 列）

指數

相關問答

如何在 JSON_AGG 中使用 ORDER BY 和 LIMIT

Postgres中索引列的查詢非常慢

為什麼這個帶有 WHERE、ORDER BY 和 LIMIT 的查詢這麼慢？

過濾數組文本並按時間戳排序

group by、order by 並選擇每組中列的最後一個值

Postgres Order By，但僅是特定類型的最新