Query

在 Postgresql 查詢中有效地選擇多個連續範圍的開始和結束

  • June 1, 2019

我在一個表中有大約 10 億行數據,其名稱和整數範圍在 1-288 之間。對於給定的name,每個int都是唯一的,並且範圍內並非所有可能的整數都存在——因此存在間隙。

此查詢生成一個範例案例:

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
             ('foo', 3),
             ('foo', 4),
             ('foo', 10),
             ('foo', 11),
             ('foo', 13),
             ('bar', 1),
             ('bar', 2),
             ('bar', 3)
    ) AS baz ("name", "int")

我想為每個名稱和連續整數序列生成一個查找表。每個這樣的行將包含:

namename列的值

start – 連續序列中的第一個整數

end – 連續序列中的最終值

spanend - start + 1

此查詢為上述範例生成範例輸出:

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
             ('foo', 10, 11, 2),
             ('foo', 13, 13, 1),
             ('bar', 1, 3, 3)
    ) AS contiguous_ranges ("name", "start", "end", span)

因為我有這麼多行,所以效率越高越好。也就是說,我只需要執行一次此查詢,因此這不是絕對要求。

提前致謝!

編輯:

我應該補充一點,歡迎 PL/pgSQL 解決方案(請解釋任何花哨的技巧——我還是 PL/pgSQL 的新手)。

怎麼用with recursive

測試視圖:

create view v as 
select *
from ( values ('foo', 2),
             ('foo', 3),
             ('foo', 4),
             ('foo', 10),
             ('foo', 11),
             ('foo', 13),
             ('bar', 1),
             ('bar', 2),
             ('bar', 3)
    ) as baz ("name", "int");

詢問:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                    union all
                                    select "name", v."int", t.span+1 as span
                                    from v join t using ("name")
                                    where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
     from ( select "name", "int", max(span) as span 
            from t
            group by "name", "int" ) z
     group by "name", ("int"-span+1) ) z;

結果:

name | start | end | span
------+-------+-----+------
foo  |     2 |   4 |    3
foo  |    13 |  13 |    1
bar  |     1 |   3 |    3
foo  |    10 |  11 |    2
(4 rows)

我很想知道它在你的十億行表上的表現。

您可以使用視窗函式來做到這一點。基本思想是使用leadlag視窗函式將行拉到目前行的前面和後面。然後我們可以計算我們是否有序列的開始或結束:

create temp view temp_view as
   select
       n,
       val,
       (lead <> val + 1 or lead is null) as islast,
       (lag <> val - 1 or lag is null) as isfirst,
       (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
   from
   (
       select
           n,
           lead(val, 1) over( partition by n order by n, val),
           lag(val, 1) over(partition by n order by n, val ),
           val
       from test
       order by n, val
   ) as t
;  
select * from temp_view;
n  | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
bar |   1 | f      | t       | f
bar |   2 | f      | f       | f
bar |   3 | t      | f       | f
bar |  24 | t      | t       | t
bar |  42 | t      | t       | t
foo |   2 | f      | t       | f
foo |   3 | f      | f       | f
foo |   4 | t      | f       | f
foo |  10 | f      | t       | f
foo |  11 | t      | f       | f
foo |  13 | t      | t       | t
(11 rows)

(我使用了一個視圖,所以下面的邏輯會更容易理解。)所以現在我們知道該行是開始還是結束。我們必須將其折疊成一行:

select
   n as "name",
   first,
   coalesce (last, first) as last,
   coalesce (last - first + 1, 1) as span
from
(
   select
   n,
   val as first,
   -- this will not be excellent perf. since were calling the view
   -- for each row sequence found. Changing view into temp table 
   -- will probably help with lots of values.
   (
       select min(val)
       from temp_view as last
       where islast = true
       -- need this since isfirst=true, islast=true on an orphan sequence
       and last.orphan = false
       and first.val < last.val
       and first.n = last.n
   ) as last
   from
       (select * from temp_view where isfirst = true) as first
) as t
;

name | first | last | span 
------+-------+------+------
bar  |     1 |    3 |    3
bar  |    24 |   24 |    1
bar  |    42 |   42 |    1
foo  |     2 |    4 |    3
foo  |    10 |   11 |    2
foo  |    13 |   13 |    1
(6 rows)

對我來說看起來是正確的:)

引用自:https://dba.stackexchange.com/questions/17045