Postgresql-Performance

使用索引的 Postgres 查詢永遠不會完成

  • July 17, 2021

所以我有以下3張表:

create table imdb_dev.dim_title
(
   id                 uuid        not null,
   tconst             varchar(10) not null,
   title_type_id      uuid        not null,
   primary_title      text,
   original_title     text,
   is_adult           boolean,
   start_year         integer,
   end_year           integer,
   runtime_in_minutes integer,
   constraint dim_title_pkey
       primary key (id),
   constraint dim_title_title_type_id_fkey
       foreign key (title_type_id) references imdb_dev.dim_title_type
);

create table imdb_dev.dim_title_type
(
   id         uuid not null,
   title_type text not null,
   constraint dim_title_type_pkey
       primary key (id)
);

create table staging.title_basics
(
   tconst         varchar(10) not null,
   titletype      varchar(20) not null,
   primarytitle   text,
   originaltitle  text,
   isadult        boolean,
   startyear      integer,
   endyear        integer,
   runtimeminutes integer,
   genres         text,
   constraint title_basics_pkey
       primary key (tconst, titletype)
)
   partition by LIST (titletype);

我想選擇存在的行,title_basics然後插入dim_title. 兩者dim_title都有title_basics大約 800 萬行,而dim_title_type只是一個只有 12 行的映射表。

這是我用來選擇我需要插入的行的查詢。由於某種原因,它只是繼續執行並且永遠不會完成:

SELECT
   md5(tconst)::UUID AS id,
   tconst,
   tt.id AS title_type_id,
   tb.primarytitle AS primary_title,
   tb.originaltitle AS original_title,
   tb.isadult AS is_adult,
   tb.startyear AS start_year,
   tb.endyear AS end_year,
   tb.runtimeminutes AS runtime_in_minutes
FROM
   "funbro"."staging"."title_basics" tb
LEFT JOIN
   "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType

WHERE md5(tconst)::UUID NOT IN (select id from "funbro"."imdb_dev"."dim_title")
;

我肯定做錯了什麼,但我看不到那是什麼。這是查詢計劃:

Gather  (cost=1001.27..533568642515.39 rows=4040707 width=97)
 Workers Planned: 2
 ->  Hash Left Join  (cost=1.27..533568237444.69 rows=1683628 width=97)
       Hash Cond: ((tb.titletype)::text = tt.title_type)
       ->  Parallel Append  (cost=0.00..533568197457.27 rows=1683627 width=73)
             ->  Parallel Seq Scan on title_basics_tvepisode tb_4  (cost=0.00..381512705366.11 rows=1231620 width=75)
                   Filter: (NOT (SubPlan 1))
                   SubPlan 1
                     ->  Materialize  (cost=0.00..289561.20 rows=8081413 width=16)
                           ->  Seq Scan on dim_title  (cost=0.00..209693.13 rows=8081413 width=16)
             ->  Parallel Seq Scan on title_basics_short tb_3  (cost=0.00..52946232635.92 rows=170924 width=63)
                   Filter: (NOT (SubPlan 1))
             ->  Parallel Seq Scan on title_basics_movie tb_1  (cost=0.00..37574087716.28 rows=121299 width=65)
                   Filter: (NOT (SubPlan 1))
             ->  Parallel Seq Scan on title_basics_video tb_7  (cost=0.00..20256814175.33 rows=65394 width=81)
                   Filter: (NOT (SubPlan 1))
             ->  Parallel Seq Scan on title_basics_tvseries tb_6  (cost=0.00..19194394831.39 rows=61965 width=68)
                   Filter: (NOT (SubPlan 1))
             ->  Parallel Seq Scan on title_basics_tvmovie tb_5  (cost=0.00..12033907879.19 rows=38848 width=77)
                   Filter: (NOT (SubPlan 1))
             ->  Parallel Seq Scan on title_basics_others tb_2  (cost=0.00..10050046434.90 rows=32444 width=79)
                   Filter: (NOT (SubPlan 1))
       ->  Hash  (cost=1.12..1.12 rows=12 width=25)
             ->  Seq Scan on dim_title_type tt  (cost=0.00..1.12 rows=12 width=25)

這些是執行 Postgres 13 的機器的規格。磁碟是 SSD。Postgres 配置是預設配置:

在此處輸入圖像描述

好的,找出問題所在。

首先,我認為按照建議添加索引作為表達式staging.title_basics會使查詢規劃器使用索引,但我這樣做了,但仍然面臨同樣的問題:Postgres 將繼續進行 Seq Scan:

create index title_basics_id_uuid on staging.title_basics
(( md5(tconst)::UUID ));

真正的問題是NOT INwhere謂詞中使用效率非常低。為 id 為 null 的 LEFT JOIN 更改它會使查詢在不到 1 分鐘的時間內完成。參考:wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_NOT_IN

最終查詢:

SELECT
   md5(tb.tconst)::UUID as id,
   tt.id AS title_type_id,
   tb.primarytitle AS primary_title,
   tb.originaltitle AS original_title,
   tb.isadult AS is_adult,
   tb.startyear AS start_year,
   tb.endyear AS end_year,
   tb.runtimeminutes AS runtime_in_minutes
FROM
   "funbro"."staging"."title_basics" tb
LEFT JOIN
   "funbro"."imdb_dev"."dim_title_type" tt ON tt.title_type = tb.titleType
LEFT JOIN
   imdb_dev.dim_title t ON md5(tb.tconst)::UUID = t.id

WHERE t.id is null
;

引用自:https://dba.stackexchange.com/questions/295765