Postgresql

優化postgres功能

  • June 28, 2021

我有以下功能,我想用它來更新狀態和距離列。

CREATE OR REPLACE FUNCTION public.check_deviation_table(
   )
   RETURNS void
   LANGUAGE 'plpgsql'

   COST 100
   VOLATILE 
   
AS $BODY$
declare 
status integer;
geoms integer;
distance float;
selected_rotue_name text;
Begin
   update ping_data set distance=sub.distance from
   (select min(st_distance(st_setsrid(st_transform(rl.geom,3857),3857),st_setsrid(st_transform(ST_SetSRID(ST_MakePoint(pd.longitude,pd.latitude),4326),3857),3857))) as distance
   ,pd.id from ping_data pd inner join order_data od on od."orderNumber"=pd.ordernumber
   inner join route_line rl on od."pickupLocName"||'-'||od."deliveryLocName"=rl."route name"
   where pd.distance is null group by pd.id)sub
   where ping_data.id=sub.id;
   update ping_data set status='No deviation' where ping_data.status is null and ping_data.distance<='3000'::int::float;
   update ping_data set status='Deviated' where ping_data.status is null and ping_data.distance>'3000'::int::float;
End
$BODY$;

它在幾千條記錄上執行良好,但是當我在ping_data有 642810 條記錄的整個表上執行它時,它只會繼續執行。我還在所涉及的表上創建了以下索引,但無濟於事。

create index btree_ping on ping_data using btree(id);
create index gist_route on route_line using gist(geom);

請參考以下詳情

Postgres 版本:PostgreSQL 12.4,Visual C++ build 1914,64-bit 編譯

PostGIS 版本:3.0 USE_GEOS=1 USE_PROJ=1 USE_STATS=1

機器詳情:Intel i5 16GB RAM Windows 10

下面共享的是所涉及的表的模式

CREATE TABLE public.route_line
(
   "route name" character varying COLLATE pg_catalog."default",
   geom geometry,
   routeid integer NOT NULL DEFAULT nextval('route_line_routeid_seq'::regclass),
   "pickup location" text COLLATE pg_catalog."default",
   "destination location" text COLLATE pg_catalog."default",
   CONSTRAINT route_line_pkey PRIMARY KEY (routeid)
)
CREATE TABLE public.ping_data
(
   latitude double precision,
   longitude double precision,
   pingdt character varying COLLATE pg_catalog."default",
   shipmentid integer,
   addedat timestamp with time zone DEFAULT now(),
   ordernumber bigint,
   id integer NOT NULL DEFAULT nextval('ping_data_id_seq'::regclass),
   ping_dt timestamp with time zone,
   status text COLLATE pg_catalog."default",
   distance double precision
)
CREATE TABLE public.order_data
(
   "assetNumber" text COLLATE pg_catalog."default",
   "createDt" timestamp with time zone,
   "deliveryLocName" text COLLATE pg_catalog."default",
   "orderNumber" bigint,
   "pickupLocName" text COLLATE pg_catalog."default",
   "releaseNumber" bigint,
   assigned character varying(1) COLLATE pg_catalog."default" DEFAULT '0'::character varying,
   city text COLLATE pg_catalog."default",
   transportername text COLLATE pg_catalog."default"
)

@ErwinBrandstetter 請參考以下對您共享的更新查詢的解釋分析結果。

Update on ping_data p  (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=6014063.492..6014063.861 rows=0 loops=1)
 ->  Hash Join  (cost=2199987.43..6027475.29 rows=97076 width=162) (actual time=168649.223..5998253.555 rows=545483 loops=1)
       Hash Cond: (sub.id = p.id)
       ->  Subquery Scan on sub  (cost=2144323.35..5958823.04 rows=97076 width=48) (actual time=167552.663..5987759.840 rows=545483 loops=1)
             ->  GroupAggregate  (cost=2144323.35..5957852.28 rows=97076 width=12) (actual time=167552.647..5984534.784 rows=545483 loops=1)
                   Group Key: pd.id
                   ->  Sort  (cost=2144323.35..2144630.74 rows=122956 width=13212) (actual time=165047.817..177245.569 rows=891149 loops=1)
                         Sort Key: pd.id
                         Sort Method: external merge  Disk: 2274960kB
                         ->  Hash Join  (cost=3815.85..47337.28 rows=122956 width=13212) (actual time=1134.751..2381.401 rows=891149 loops=1)
                               Hash Cond: (pd.ordernumber = od.orderNumber)
                               ->  Seq Scan on ping_data pd  (cost=0.00..38132.37 rows=97076 width=28) (actual time=0.122..381.017 rows=642810 loops=1)
                                     Filter: (distance IS NULL)
                               ->  Hash  (cost=1451.69..1451.69 rows=1453 width=13200) (actual time=1134.316..1134.399 rows=1632 loops=1)
                                     Buckets: 512  Batches: 8  Memory Usage: 576kB
                                     ->  Hash Join  (cost=46.93..1451.69 rows=1453 width=13200) (actual time=159.161..1119.843 rows=1632 loops=1)
                                           Hash Cond: ((rl.route name)::text = ((od.pickupLocName || '-'::text) || od.deliveryLocName))
                                           ->  Seq Scan on route_line rl  (cost=0.00..958.54 rows=954 width=13210) (actual time=25.129..121.613 rows=954 loops=1)
                                           ->  Hash  (cost=31.97..31.97 rows=1197 width=51) (actual time=11.256..11.259 rows=1197 loops=1)
                                                 Buckets: 2048  Batches: 1  Memory Usage: 114kB
                                                 ->  Seq Scan on order_data od  (cost=0.00..31.97 rows=1197 width=51) (actual time=9.379..10.686 rows=1197 loops=1)
       ->  Hash  (cost=38132.37..38132.37 rows=645737 width=89) (actual time=1042.800..1042.801 rows=642810 loops=1)
             Buckets: 32768  Batches: 32  Memory Usage: 2704kB
             ->  Seq Scan on ping_data p  (cost=0.00..38132.37 rows=645737 width=89) (actual time=0.135..674.748 rows=642810 loops=1)
Planning Time: 155.156 ms
Execution Time: 6016315.903 ms

有人可以幫我優化這個功能嗎?

這與您的原始版本完全相同:

CREATE OR REPLACE FUNCTION public.check_deviation_table()
 RETURNS void
 LANGUAGE sql AS
$func$
UPDATE ping_data p
SET    distance = sub.distance
    , status = COALESCE(p.status
                      , CASE WHEN sub.distance <= 3000 THEN 'No deviation'
                             WHEN sub.distance >  3000 THEN 'Deviated' END)
FROM  (
  SELECT min(st_distance(st_setsrid(st_transform(rl.geom,3857),3857)
                        ,st_setsrid(st_transform(st_setsrid(st_makepoint(pd.longitude, pd.latitude),4326),3857),3857))) AS distance
       , pd.id
  FROM   ping_data  pd
  JOIN   order_data od ON od."orderNumber" = pd.ordernumber
  JOIN   route_line rl ON od."pickupLocName" || '-' || od."deliveryLocName" = rl."route name"
  WHERE  pd.distance IS NULL
  GROUP  BY pd.id
  ) sub
WHERE  ping_data.id = sub.id;
$func$;

應該已經快了很多。

刪除所有無用的聲明變數。那麼就沒有什麼需要PL/pgSQL了。使用更簡單的 SQL 函式,或者只是簡單的查詢。

最重要的是,使用一個 UPDATE而不是三個。您的原件將編寫兩個新的行版本,而不僅僅是一個。有幾個順序掃描而不是一個。(索引可能對此毫無用處或使其更加昂貴,但這取決於…)

可能還有更多的潛力,但我會在沒有詳細資訊的情況下停在這裡。

引用自:https://dba.stackexchange.com/questions/294758