PostgreSQL

A terribly slow query, and how to fix it

  • October 10, 2019
select
   user_id,
   count(id) as unread_count
from
   notifications
where
   is_read = false
   and user_id in(select(unnest('{200 user IDs}' :: bigint[])))
group by
   user_id;

The problem is that this query runs for about 1 minute, sometimes a bit longer. The table is 32 GB, and there is already an index on the user_id column.

Here is the execution plan:

HashAggregate  (cost=123354.81..123629.64 rows=27483 width=16) (actual time=90823.880..90823.972 rows=188 loops=1)
 Group Key: user_id
 ->  Nested Loop  (cost=2.32..123217.40 rows=27483 width=16) (actual time=0.184..90752.136 rows=48571 loops=1)
       ->  HashAggregate  (cost=1.76..2.76 rows=100 width=8) (actual time=0.146..0.577 rows=200 loops=1)
             Group Key: unnest(200 user IDs)
             ->  Result  (cost=0.00..0.51 rows=100 width=8) (actual time=0.021..0.073 rows=200 loops=1)
       ->  Index Scan using ix_notification_user_id on notification  (cost=0.56..1229.40 rows=275 width=16) (actual time=119.659..453.533 rows=243 loops=200)
             Index Cond: (200 user IDs)
             Filter: (NOT is_read)
             Rows Removed by Filter: 368
Planning time: 0.189 ms
Execution time: 90824.196 ms

I tried a solution using a temporary table: inserting the unnested values into a temp table and then joining against it. But performance did not improve at all.
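For reference, the temp-table variant looked roughly like this (a sketch; the temp table name is illustrative, and `'{200 user IDs}'` stands in for the real array literal):

```sql
-- Hypothetical sketch of the temp-table attempt; names are illustrative.
create temporary table tmp_user_ids (user_id bigint primary key);

insert into tmp_user_ids
select unnest('{200 user IDs}'::bigint[]);

select n.user_id, count(n.id) as unread_count
from notifications n
join tmp_user_ids t on t.user_id = n.user_id
where n.is_read = false
group by n.user_id;
```

The join shape changes, but the same index lookups and heap fetches still dominate, which is presumably why it did not help.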

I ran this query to look at the index bloat statistics:

select
   schemaname,
   tablename,
   reltuples::bigint,
   relpages::bigint,
   otta,
   round(case when otta = 0 then 0.0 else sml.relpages / otta::numeric end, 1) as tbloat,
   relpages::bigint - otta as wastedpages,
   bs*(sml.relpages-otta)::bigint as wastedbytes,
   pg_size_pretty((bs*(relpages-otta))::bigint) as wastedsize,
   iname,
   ituples::bigint,
   ipages::bigint,
   iotta,
   round(case when iotta = 0 or ipages = 0 then 0.0 else ipages / iotta::numeric end, 1) as ibloat,
   case
       when ipages < iotta then 0
       else ipages::bigint - iotta
   end as wastedipages,
   case
       when ipages < iotta then 0
       else bs*(ipages-iotta)
   end as wastedibytes
   --CASE WHEN ipages < iotta THEN pg_size_pretty(0) ELSE pg_size_pretty((bs*(ipages-iotta))::bigint) END AS wastedisize

   from (
   select
       schemaname,
       tablename,
       cc.reltuples,
       cc.relpages,
       bs,
       ceil((cc.reltuples*((datahdr + ma- (case when datahdr % ma = 0 then ma else datahdr % ma end))+ nullhdr2 + 4))/(bs-20::float)) as otta,
       coalesce(c2.relname, '?') as iname,
       coalesce(c2.reltuples, 0) as ituples,
       coalesce(c2.relpages, 0) as ipages,
       coalesce(ceil((c2.reltuples*(datahdr-12))/(bs-20::float)), 0) as iotta
       -- very rough approximation, assumes all cols

       from (
       select
           ma,
           bs,
           schemaname,
           tablename,
           (datawidth +(hdr + ma-
           (
               case
               when hdr % ma = 0 then ma
               else hdr % ma
           end)))::numeric as datahdr,
           (maxfracsum*(nullhdr + ma-
           (
               case
               when nullhdr % ma = 0 then ma
               else nullhdr % ma
           end))) as nullhdr2
       from
           (
           select
               schemaname,
               tablename,
               hdr,
               ma,
               bs,
               sum((1-null_frac)* avg_width) as datawidth,
               max(null_frac) as maxfracsum,
               hdr +(
               select
                   1 + count(*)/ 8
               from
                   pg_stats s2
               where
                   null_frac <> 0
                   and s2.schemaname = s.schemaname
                   and s2.tablename = s.tablename ) as nullhdr
           from
               pg_stats s,
               (
               select
                   (
                   select
                       current_setting('block_size')::numeric) as bs,
                   case
                       when substring(v, 12, 3) in ('8.0',
                       '8.1',
                       '8.2') then 27
                       else 23
                   end as hdr,
                   case
                       when v ~ 'mingw32' then 8
                       else 4
                   end as ma
               from
                   (
                   select
                       version() as v) as foo ) as constants
           group by
               1,
               2,
               3,
               4,
               5 ) as foo ) as rs
   join pg_class cc on
       cc.relname = rs.tablename
   join pg_namespace nn on
       cc.relnamespace = nn.oid
       and nn.nspname = rs.schemaname
   left join pg_index i on
       indrelid = cc.oid
   left join pg_class c2 on
       c2.oid = i.indexrelid ) as sml
where
   sml.relpages - otta > 0
   or ipages - iotta > 10
order by
   wastedbytes desc,
   wastedibytes desc;

Both the PK index and the user_id index report more than 5 GB of wastedsize and more than 500k wastedpages.

My question is: what is the solution? Is it purely an index problem that needs a REINDEX, or is there something else I am missing?
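If it were purely a bloat problem, the rebuild itself would just be (index name taken from the plan above; note that plain REINDEX locks out writes while it runs):

```sql
-- Rebuilds the bloated index from scratch; blocks writes for the duration.
reindex index ix_notification_user_id;
```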

I am not allowed to change the table's structure; I just need to optimize the query, somehow getting it from over 1 minute down to under 1 second.

After adding a partial index on user_id where is_read = false, the query time dropped by roughly 10-15 seconds. But that is obviously still far too long.
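The partial index described above would look roughly like this (the index name here is illustrative):

```sql
-- Only rows with is_read = false are indexed, keeping the index small
-- and matching the query's filter exactly.
create index ix_notification_user_id_unread
    on notification (user_id)
    where is_read = false;
```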

Edit: this table has 32.5 million rows in total. Running this query:

SELECT t.user_id, COALESCE(unread_count, 0) AS unread_count
FROM   unnest('{200 user_ids}'::bigint[]) t(user_id)
LEFT   JOIN LATERAL (
  SELECT count(*) AS unread_count
  FROM   notification n
  WHERE  n.user_id = t.user_id
  AND    n.is_read = false
  ) sub ON true
;     

results in this execution plan (interestingly, it ran for over a minute yesterday, but today it takes about 30 seconds or less):

Nested Loop Left Join  (cost=1209.05..120908.50 rows=100 width=16) (actual time=333.088..27260.557 rows=200 loops=1)
 Buffers: shared hit=1981 read=20396 dirtied=7
 I/O Timings: read=27023.896
 ->  Function Scan on unnest t  (cost=0.00..1.00 rows=100 width=8) (actual time=0.022..0.360 rows=200 loops=1)
 ->  Aggregate  (cost=1209.04..1209.05 rows=1 width=8) (actual time=136.292..136.293 rows=1 loops=200)
       Buffers: shared hit=1981 read=20396 dirtied=7
       I/O Timings: read=27023.896
       ->  Index Only Scan using ix_test on notification n  (cost=0.44..1208.29 rows=300 width=0) (actual time=2.153..136.170 rows=105 loops=200)
             Index Cond: (user_id = t.user_id)
             Heap Fetches: 21088
             Buffers: shared hit=1981 read=20396 dirtied=7
             I/O Timings: read=27023.896
Planning time: 0.135 ms
Execution time: 27260.745 ms

Your explain plan is a bit confusing, because it looks as though the index scan fetches the data for all 200 user_ids at once, yet is executed 200 times. But experimentation shows that is not what it is doing: each iteration of the nested loop fetches the data for one user_id from the list, not the whole list. So this is just a presentation quirk of the EXPLAIN output.

If you set track_io_timing = on and run EXPLAIN (ANALYZE, BUFFERS), I believe you will find that most of the time is spent reading data from disk. Reading 48,571 rows scattered randomly across 32 GB is not fast, unless all of the data is already cached in memory, or it sits on a blazingly fast PCIe SSD.
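Concretely, in a session that would look like this (re-using the original query, with `'{200 user IDs}'` standing in for the real array literal):

```sql
-- Enable per-node I/O timing for this session, then capture buffer
-- counts and read timings alongside the plan.
set track_io_timing = on;

explain (analyze, buffers)
select user_id, count(id) as unread_count
from notifications
where is_read = false
  and user_id in (select unnest('{200 user IDs}'::bigint[]))
group by user_id;
```

The "I/O Timings: read=..." lines in the second plan above are exactly what this setting produces.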

Short of throwing some serious hardware at it, your best bet is to get the query to use an index-only scan. For the query you show, that would require an index like this:

create index on notifications (user_id , is_read, id);

VACUUM the table before trying it. If it works, you will need to think about how to keep the table well vacuumed, because the default autovacuum settings may not be enough.
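For example, a one-off manual VACUUM plus more aggressive per-table autovacuum settings (the values below are illustrative, not a recommendation; tune them to your write rate):

```sql
-- One-off cleanup: updates the visibility map so the index-only scan
-- can avoid heap fetches.
vacuum analyze notifications;

-- Illustrative per-table autovacuum settings: vacuum after ~1% of rows
-- change instead of the default 20%, with a shorter cost delay.
alter table notifications set (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_cost_delay = 2
);
```

The "Heap Fetches: 21088" in your second plan is the symptom this addresses: every heap fetch is a page the visibility map could not vouch for.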

I would not worry about the reported bloat. That query (where did you get it?) reports a large number of wasted bytes in indexes even on a freshly reindexed table. Also, the wasted pages it reports are not completely empty pages, but wasted bytes divided by the page size. That seems silly to me.

Quoted from: https://dba.stackexchange.com/questions/250508