Postgresql

子查詢中的慢左連接橫向

  • February 27, 2019

我有三張桌子:(units公寓單位)leases、 和line_items(更好的名字應該是經常性租金)。一個單元有許多租約,而租約有許多行項目(與每次租金變化相關)。我的目標是,給定幾個月的時間,列出每個單元及其最近的租金。


                                         Table public.units
    Column     |            Type             | Collation | Nullable |              Default
----------------+-----------------------------+-----------+----------+-----------------------------------
id             | integer                     |           | not null | nextval('units_id_seq'::regclass)
number         | character varying           |           |          |
bedrooms       | integer                     |           |          |
bathrooms      | integer                     |           |          |
square_footage | integer                     |           |          |
community_id   | integer                     |           |          |
building_id    | integer                     |           |          |
created_at     | timestamp without time zone |           | not null |
updated_at     | timestamp without time zone |           | not null |
slug           | character varying           |           |          |
status         | integer                     |           | not null | 0
note           | text                        |           |          |
half_bathrooms | integer                     |           |          |
display_number | character varying           |           |          |
Indexes:
   units_pkey PRIMARY KEY, btree (id)
   index_units_on_building_id btree (building_id)
   index_units_on_community_id btree (community_id)
   index_units_on_slug btree (slug)
Foreign-key constraints:
   fk_rails_5850136a38 FOREIGN KEY (building_id) REFERENCES buildings(id)
   fk_rails_b860cf198b FOREIGN KEY (community_id) REFERENCES communities(id)
Referenced by:
   TABLE leases CONSTRAINT fk_rails_29210439a5 FOREIGN KEY (unit_id) REFERENCES units(id)
   TABLE market_rents CONSTRAINT fk_rails_7270410e47 FOREIGN KEY (unit_id) REFERENCES units(id)
   TABLE incomes_unit_budgets CONSTRAINT fk_rails_791330e8fe FOREIGN KEY (unit_id) REFERENCES units(id)

                                                 Table public.leases
      Column        |            Type             | Collation | Nullable |              Default               | Storage  | Stats target | Description
---------------------+-----------------------------+-----------+----------+------------------------------------+----------+--------------+-------------
id                  | integer                     |           | not null | nextval('leases_id_seq'::regclass) | plain    |              |
start               | date                        |           |          |                                    | plain    |              |
unit_id             | integer                     |           |          |                                    | plain    |              |
created_at          | timestamp without time zone |           | not null |                                    | plain    |              |
updated_at          | timestamp without time zone |           | not null |                                    | plain    |              |
vacate              | date                        |           |          |                                    | plain    |              |
community_id        | bigint                      |           | not null |                                    | plain    |              |
Indexes:
   leases_pkey PRIMARY KEY, btree (id)
   index_leases_on_community_id btree (community_id)
   index_leases_on_unit_id btree (unit_id)
Foreign-key constraints:
   fk_rails_29210439a5 FOREIGN KEY (unit_id) REFERENCES units(id)
   fk_rails_46ad3e9d34 FOREIGN KEY (community_id) REFERENCES communities(id)
Referenced by:
   TABLE line_items CONSTRAINT fk_rails_02720bca5b FOREIGN KEY (lease_id) REFERENCES leases(id)
   TABLE terms CONSTRAINT fk_rails_49f65f38eb FOREIGN KEY (lease_id) REFERENCES leases(id)

                                          Table public.line_items
    Column      |            Type             | Collation | Nullable |                Default
-----------------+-----------------------------+-----------+----------+----------------------------------------
id              | integer                     |           | not null | nextval('line_items_id_seq'::regclass)
name            | character varying           |           |          |
lease_id        | integer                     |           |          |
amount_cents    | integer                     |           | not null | 0
amount_currency | character varying           |           | not null | 'USD'::character varying
created_at      | timestamp without time zone |           | not null |
updated_at      | timestamp without time zone |           | not null |
start           | date                        |           |          |
expiration      | date                        |           |          |
Indexes:
   line_items_pkey PRIMARY KEY, btree (id)
   index_line_items_on_lease_id btree (lease_id)
   index_line_items_on_start btree (start)
Foreign-key constraints:
   fk_rails_02720bca5b FOREIGN KEY (lease_id) REFERENCES leases(id)

以下查詢完全符合我的期望,但在相對較小的數據集(~1500 個單位,~5000 個租約,~15000 個行項目)上非常慢(8 秒):

with last_rent as (
 SELECT line_items.* FROM (SELECT line_items.*, leases.unit_id, row_number() over (
 partition by unit_id, date_trunc('month', line_items.start)
 order by line_items.start desc
)
FROM line_items INNER JOIN leases ON leases.id = line_items.lease_id WHERE line_items.name = 'RNT' AND leases.community_id IN (X)) as line_items WHERE line_items.row_number = 1
), month_series as (
 SELECT id as unit_id, community_id, bedrooms, bathrooms, generate_series( '2019-03-01', '2019-03-01'::date, interval '1 month' ) - interval '1 day' dt
FROM units WHERE units.community_id IN (X) ORDER BY units.id ASC
), unit_rent_month as (
 select unit_id, community_id, bedrooms, bathrooms, amount_cents, date_trunc('month', dt)::date as period
 from month_series
 left join lateral (
   select amount_cents
   from last_rent
   where
     last_rent.unit_id = month_series.unit_id
     and last_rent.start <= month_series.dt
   order by unit_id, last_rent.start desc
   limit 1
 ) last_rent on true
 order by dt, unit_id
)
select *
from unit_rent_month;

我的策略是通過使用視窗函式來隔離每個月的最後租金(以防提前搬出或取消)。我創建了另一個表 (month_series),其中列出了所有 unit_ids 和月份。我嘗試在最後一次租金和月份系列之間進行橫向連接。

analyze看起來像這樣:

CTE Scan on unit_rent_month  (cost=2563655.29..2593175.29 rows=1476000 width=24) (actual time=8019.924..8020.607 rows=1476 loops=1)
 Output: unit_rent_month.unit_id, unit_rent_month.community_id, unit_rent_month.bedrooms, unit_rent_month.bathrooms, unit_rent_month.amount_cents, unit_rent_month.period
 Buffers: shared hit=990
 CTE last_rent
   ->  Subquery Scan on line_items  (cost=1682.58..2163.78 rows=60 width=56) (actual time=44.584..63.921 rows=12010 loops=1)
         Output: line_items.id, line_items.name, line_items.lease_id, line_items.amount_cents, line_items.amount_currency, line_items.created_at, line_items.updated_at, line_items.start, line_items.expiration, line_items.unit_id, line_items.row_number
         Filter: (line_items.row_number = 1)
         Rows Removed by Filter: 63
         Buffers: shared hit=269
         ->  WindowAgg  (cost=1682.58..2013.40 rows=12030 width=68) (actual time=44.582..60.093 rows=12073 loops=1)
               Output: line_items_1.id, line_items_1.name, line_items_1.lease_id, line_items_1.amount_cents, line_items_1.amount_currency, line_items_1.created_at, line_items_1.updated_at, line_items_1.start, line_items_1.expiration, leases.unit_id, row_number() OVER (?), leases.unit_id, (date_trunc('month'::text, (line_items_1.start)::timestamp with time zone))
               Buffers: shared hit=269
               ->  Sort  (cost=1682.58..1712.65 rows=12030 width=56) (actual time=44.524..45.478 rows=12073 loops=1)
                     Output: line_items_1.start, leases.unit_id, (date_trunc('month'::text, (line_items_1.start)::timestamp with time zone)), line_items_1.id, line_items_1.name, line_items_1.lease_id, line_items_1.amount_cents, line_items_1.amount_currency, line_items_1.created_at, line_items_1.updated_at, line_items_1.expiration
                     Sort Key: leases.unit_id, (date_trunc('month'::text, (line_items_1.start)::timestamp with time zone)), line_items_1.start DESC
                     Sort Method: quicksort  Memory: 2082kB
                     Buffers: shared hit=269
                     ->  Hash Join  (cost=378.43..867.28 rows=12030 width=56) (actual time=7.626..33.901 rows=12073 loops=1)
                           Output: line_items_1.start, leases.unit_id, date_trunc('month'::text, (line_items_1.start)::timestamp with time zone), line_items_1.id, line_items_1.name, line_items_1.lease_id, line_items_1.amount_cents, line_items_1.amount_currency, line_items_1.created_at, line_items_1.updated_at, line_items_1.expiration
                           Inner Unique: true
                           Hash Cond: (line_items_1.lease_id = leases.id)
                           Buffers: shared hit=269
                           ->  Seq Scan on public.line_items line_items_1  (cost=0.00..395.25 rows=12736 width=44) (actual time=0.012..5.813 rows=12736 loops=1)
                                 Output: line_items_1.id, line_items_1.name, line_items_1.lease_id, line_items_1.amount_cents, line_items_1.amount_currency, line_items_1.created_at, line_items_1.updated_at, line_items_1.start, line_items_1.expiration
                                 Filter: ((line_items_1.name)::text = 'RNT'::text)
                                 Rows Removed by Filter: 4884
                                 Buffers: shared hit=175
                           ->  Hash  (cost=292.99..292.99 rows=6835 width=8) (actual time=7.537..7.537 rows=6837 loops=1)
                                 Output: leases.unit_id, leases.id
                                 Buckets: 8192  Batches: 1  Memory Usage: 332kB
                                 Buffers: shared hit=94
                                 ->  Seq Scan on public.leases  (cost=0.00..292.99 rows=6835 width=8) (actual time=0.010..4.954 rows=6837 loops=1)
                                       Output: leases.unit_id, leases.id
                                       Filter: (leases.community_id = ANY ('{X}'::bigint[]))
                                       Rows Removed by Filter: 401
                                       Buffers: shared hit=94
 CTE month_series
   ->  Result  (cost=0.28..33371.24 rows=1476000 width=24) (actual time=0.065..11.553 rows=1476 loops=1)
         Output: units.id, units.community_id, units.bedrooms, units.bathrooms, ((generate_series('2019-03-01 00:00:00-08'::timestamp with time zone, ('2019-03-01'::date)::timestamp with time zone, '1 mon'::interval)) - '1 day'::interval)
         Buffers: shared hit=712
         ->  ProjectSet  (cost=0.28..7541.24 rows=1476000 width=24) (actual time=0.062..9.869 rows=1476 loops=1)
               Output: generate_series('2019-03-01 00:00:00-08'::timestamp with time zone, ('2019-03-01'::date)::timestamp with time zone, '1 mon'::interval), units.id, units.community_id, units.bedrooms, units.bathrooms
               Buffers: shared hit=712
               ->  Index Scan using units_pkey on public.units  (cost=0.28..146.48 rows=1476 width=16) (actual time=0.038..4.365 rows=1476 loops=1)
                     Output: units.id, units.number, units.bedrooms, units.bathrooms, units.square_footage, units.community_id, units.building_id, units.created_at, units.updated_at, units.slug, units.status, units.note, units.half_bathrooms, units.display_number
                     Filter: (units.community_id = ANY ('{X}'::integer[]))
                     Rows Removed by Filter: 201
                     Buffers: shared hit=712
 CTE unit_rent_month
   ->  Sort  (cost=2524430.27..2528120.27 rows=1476000 width=32) (actual time=8019.918..8019.998 rows=1476 loops=1)
         Output: month_series.unit_id, month_series.community_id, month_series.bedrooms, month_series.bathrooms, last_rent.amount_cents, ((date_trunc('month'::text, month_series.dt))::date), month_series.dt
         Sort Key: month_series.dt, month_series.unit_id
         Sort Method: quicksort  Memory: 164kB
         Buffers: shared hit=990
         ->  Nested Loop Left Join  (cost=1.51..2302560.00 rows=1476000 width=32) (actual time=78.693..8018.611 rows=1476 loops=1)
               Output: month_series.unit_id, month_series.community_id, month_series.bedrooms, month_series.bathrooms, last_rent.amount_cents, (date_trunc('month'::text, month_series.dt))::date, month_series.dt
               Buffers: shared hit=984
               ->  CTE Scan on month_series  (cost=0.00..29520.00 rows=1476000 width=24) (actual time=0.067..13.332 rows=1476 loops=1)
                     Output: month_series.unit_id, month_series.community_id, month_series.bedrooms, month_series.bathrooms, month_series.dt
                     Buffers: shared hit=712
               ->  Limit  (cost=1.51..1.51 rows=1 width=12) (actual time=5.419..5.420 rows=1 loops=1476)
                     Output: last_rent.amount_cents, last_rent.unit_id, last_rent.start
                     Buffers: shared hit=272
                     ->  Sort  (cost=1.51..1.51 rows=1 width=12) (actual time=5.418..5.418 rows=1 loops=1476)
                           Output: last_rent.amount_cents, last_rent.unit_id, last_rent.start
                           Sort Key: last_rent.start DESC
                           Sort Method: top-N heapsort  Memory: 25kB
                           Buffers: shared hit=272
                           ->  CTE Scan on last_rent  (cost=0.00..1.50 rows=1 width=12) (actual time=3.395..5.411 rows=8 loops=1476)
                                 Output: last_rent.amount_cents, last_rent.unit_id, last_rent.start
                                 Filter: ((last_rent.start <= month_series.dt) AND (last_rent.unit_id = month_series.unit_id))
                                 Rows Removed by Filter: 12002
                                 Buffers: shared hit=269
Planning time: 3.219 ms
Execution time: 8021.388 ms

根據analyze所說的,我認為問題出在子查詢和橫向連接上。有人能幫我弄清楚如何優化嗎?如果需要,很高興分享更多資訊。

編輯:這是我的最終查詢:

SELECT id as unit_id, community_id, bedrooms, bathrooms, date_trunc('month', mon.dt)::date AS period, last_rent.amount_cents
FROM "units" 
CROSS  JOIN generate_series(timestamp '2018-01-01'
                          , timestamp '2018-12-01'
                          , interval '1 month') mon(dt) 

  left join lateral (
    SELECT li.amount_cents
    FROM   leases l
    JOIN   line_items li ON l.id = li.lease_id
    WHERE  l."unit_id" = "units"."id"
      AND li.start < mon.dt + interval '1 month'
      AND    li.name  = 'RNT'
    order by li.start desc
    limit 1
  ) last_rent on true
WHERE "units"."community_id" IN (X)
order by period, unit_id

編輯2:

作為參考,這是一個包含行和所需輸出的測試案例

Line Item 1:
 id: 1
 name: "RNT"
 lease_id: 1
 amount_cents: 100,000
 start: January 1, 2018

Line Item 2:
 id: 2
 name: "RNT"
 lease_id: 2
 amount_cents: 110,000
 start: March 5, 2018

Lease 1:
  id: 1
  start: January 1, 2018
  vacate: March 4, 2018
  unit_id: 1

Lease 2:
  id: 2
  start: March 5, 2018
  vacate: null
  unit_id: 1

Unit 1:
 id: 1
 community_id: 1
 number: 101
 bedrooms: 1
 bathrooms: 1

輸出:

| unit_id | community_id | bedrooms | bathrooms | period      | amount_cents
---------------------------------------------------------------------------
|    1    |       1      |    1     |     1     | Jan 1, 2018 | 100,0000
|    1    |       1      |    1     |     1     | Feb 1, 2018 | 100,0000
|    1    |       1      |    1     |     1     | Mar 1, 2018 | 110,0000
|    1    |       1      |    1     |     1     | Apr 1, 2018 | 110,0000
... Repeating 110,000 for the rest of 2018.

您的查詢中有很多多餘的工作。第一個 CTElast_rent每月檢索整個表中每個unit_rent_month單元的“第一”行並將結果具體化,而在後面的 CTE 中僅重用其中的一小部分。另外,你在那裡重複工作。

您基本上可以last_rent完全刪除第一個 CTE。同時,刪除所有CTE。您真正需要的只是一個LATERAL子查詢。這個徹底重寫的查詢應該做同樣的事情,更快(未經測試):

SELECT u.id AS unit_id, u.community_id, u.bedrooms, u.bathrooms
    , last_rent.amount_cents
    , date_trunc('month', mon.dt)::date AS period
FROM   units u
CROSS  JOIN generate_series(timestamp '2019-02-01'  -- switched to lower bound
                         , timestamp '2019-02-01'  -- and see below
                         , interval '1 month') mon(dt)
LEFT   JOIN LATERAL (
  SELECT li.amount_cents
  FROM   leases     l
  JOIN   line_items li ON l.id = li.lease_id
  WHERE  l.unit_id = u.id
  AND    l.community_id = u.community_id  -- redundant?
  AND    li.name  = 'RNT'
  AND    li.start <  mon.dt + interval '1 month'
  AND    li.start >= mon.dt  -- also required, see below
  ORDER  BY li.start DESC    -- NULLS LAST ?
  LIMIT  1
  ) last_rent ON true
WHERE  u.community_id IN (X)  -- ?
ORDER  BY mon.dt, u.id;

您的範例僅產生一個月。其意圖顯然是有選擇地一次生產連續幾個月。我巧妙地改變了generate_series()表達方式——原因如下:

我也切換到提供下限,就像您在更新的查詢中所做的那樣。這樣方便一點。

如果 IN 子句具有多個表達式,則重複條件的方式community_id IN (X)可能會導致不同 community_id的匹配。我收緊了它以與 AND 完全匹配 。如果不符合要求,請進行調整。似乎該列一開始是多餘的,那麼您可以完全刪除此謂詞。units``leases``l.community_id = u.community_id``leases.community_id

這個謂詞有一個邏輯上的區別:

  AND    li.start >= mon.dt

您在更新的查詢中刪除了它,但也許您應該保留它以始終獲取amount_cents結果行月份的最新資訊 -NULL如果沒有的話。這就是您的原始查詢所做的。

如果您放棄它,您將獲得amount_cents給定日期範圍內的最新資訊,可能來自上個月。即,如果您更改日期範圍的下限,您可以獲得一個月的不同結果。

有關的:

索引

根據未公開的細節,您可能會獲得更快的結果,但是,使用不同的查詢樣式,或者通過添加一個或多個多列部分索引。在黑暗中拍攝,這可能會讓您非常快速地進行僅索引掃描:

CREATE INDEX foo ON leases (unit_id, id);

CREATE INDEX bar ON line_items (lease_id, start DESC, amount_cents)
WHERE name = 'RNT';

或者,在 Postgres 11中稍微好一點:

CREATE INDEX bar ON line_items (lease_id, start DESC) INCLUDING (amount_cents)
WHERE name = 'RNT';

WHERE name = 'RNT'通常只有在“RNT”不太常見的情況下才會付費。

相關(考慮那裡提到的先決條件):

引用自:https://dba.stackexchange.com/questions/230516