MySQL

Get incremental counts of an aggregated value across joined tables

  • May 16, 2018

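I have two tables in a MySQL 5.7.22 database: posts and reasons. Each post row has and belongs to many reason rows. Each reason has a weight associated with it, so each post has an aggregate total weight.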

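For each increment of 10 weight points (i.e. 0, 10, 20, 30, and so on), I want the count of posts whose total weight is less than or equal to that increment. I'd expect the result to look something like this: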

weight | post_count
--------+------------
     0 | 0
    10 | 5
    20 | 12
    30 | 18
   ... | ...
   280 | 20918
   290 | 21102
   ... | ...
  1250 | 118005
  1260 | 118039
  1270 | 118040

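The total weights are roughly normally distributed, with a few very low values and a few very high values (the current maximum is 1277), but most fall in the middle. There are just under 120,000 rows in posts and about 120 rows in reasons. Each post has 5 or 6 reasons on average.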

The relevant parts of the tables look like this:

CREATE TABLE `posts` (
 id BIGINT PRIMARY KEY
);

CREATE TABLE `reasons` (
 id BIGINT PRIMARY KEY,
 weight INT(11) NOT NULL
);

CREATE TABLE `posts_reasons` (
 post_id BIGINT NOT NULL,
 reason_id BIGINT NOT NULL,
 CONSTRAINT fk_posts_reasons_posts FOREIGN KEY (post_id) REFERENCES posts(id),
 CONSTRAINT fk_posts_reasons_reasons FOREIGN KEY (reason_id) REFERENCES reasons(id)
);

So far, I've tried putting the post ID and total weight into a view, then joining that view to itself to get an aggregated count:

CREATE VIEW `post_weights` AS (
   SELECT 
       posts.id,
       SUM(reasons.weight) AS reason_weight
   FROM posts
   INNER JOIN posts_reasons ON posts.id = posts_reasons.post_id
   INNER JOIN reasons ON posts_reasons.reason_id = reasons.id
   GROUP BY posts.id
);

SELECT
   FLOOR(p1.reason_weight / 10) AS weight,
   COUNT(DISTINCT p2.id) AS cumulative
FROM post_weights AS p1
INNER JOIN post_weights AS p2 ON FLOOR(p2.reason_weight / 10) <= FLOOR(p1.reason_weight / 10)
GROUP BY FLOOR(p1.reason_weight / 10)
ORDER BY FLOOR(p1.reason_weight / 10) ASC;

However, this is unusably slow: I let it run for 15 minutes without it finishing, which I can't do in production.

Is there a more efficient way to do this?

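If you're interested in testing against the entire dataset, it's available for download here. The file is around 60MB and expands to around 250MB. Alternatively, there are 12,000 rows in a GitHub gist here.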

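Using functions or expressions in JOIN conditions is usually a bad idea. I say usually because some optimizers handle it fairly well and make use of indexes anyway. I would suggest creating a table for the weights. Something like: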

CREATE TABLE weights
( weight int not null primary key 
);

INSERT INTO weights (weight) VALUES (0),(10),(20),...(1270);
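If typing out every increment by hand is unappealing, the table could be filled from a cross join of small digit lists instead. This is only a sketch of an alternative to the INSERT above, not something from the original answer: it assumes the weights table is still empty, and it assumes the upper bound of 1270 taken from the question. MySQL 5.7 has no recursive CTEs, so the digits are spelled out inline.

```sql
-- Sketch: generate 0, 10, 20, ..., 1270 without listing each value.
-- Assumes `weights` is empty; the bound 127 (i.e. 1270 / 10) is an assumption
-- based on the maximum increment shown in the question.
INSERT INTO weights (weight)
SELECT (tens.d * 10 + ones.d) * 10 AS weight
FROM (
    SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
    UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
) AS ones
CROSS JOIN (
    SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
    UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
    UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12
) AS tens
-- tens.d * 10 + ones.d enumerates 0..129; keep only 0..127.
WHERE tens.d * 10 + ones.d <= 127;
```

Raising the bound later only means widening the second digit list and the WHERE condition.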

Make sure you have an index on posts_reasons:

CREATE UNIQUE INDEX ... ON posts_reasons (reason_id, post_id);

Then a query like this:

SELECT w.weight
    , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
      FROM reasons r
      JOIN posts_reasons pr
            ON r.id = pr.reason_id
      GROUP BY pr.post_id
    ) as x
   ON w.weight > x.sum_weight
GROUP BY w.weight;
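Note that the query above uses a strict > and an inner join, while the question asks for totals less than or equal to each increment and shows a row with a count of 0. Under that reading, a variant might look like the sketch below. It is an assumption about the intended semantics, not the query that was timed in this answer, so the counts and timings it produces may differ.

```sql
-- Sketch: inclusive comparison, and increments with no matching posts still appear.
SELECT w.weight
     , COUNT(x.post_id) AS post_count   -- counts matched posts only, so an unmatched increment yields 0
FROM weights w
LEFT JOIN ( SELECT pr.post_id, SUM(r.weight) AS sum_weight
            FROM reasons r
            JOIN posts_reasons pr
                ON r.id = pr.reason_id
            GROUP BY pr.post_id
          ) AS x
    ON x.sum_weight <= w.weight
GROUP BY w.weight
ORDER BY w.weight;
```

COUNT(x.post_id) is used instead of COUNT(1) because, with a LEFT JOIN, COUNT(1) would report 1 for an increment that matched nothing.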

My home machine is probably 5-6 years old. It has an Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz and 8 GB of memory.

uname -a
Linux dustbite 4.16.6-302.fc28.x86_64 #1 SMP Wed May 2 00:07:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I tested against:

https://drive.google.com/open?id=1q3HZXW_qIZ01gU-Krms7qMJW3GCsOUP5

MariaDB [test3]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.2.14-MariaDB |
+-----------------+
1 row in set (0.00 sec)


SELECT w.weight
    , COUNT(1) as post_count
FROM weights w
JOIN ( SELECT pr.post_id, SUM(r.weight) as sum_weight     
      FROM reasons r
      JOIN posts_reasons pr
            ON r.id = pr.reason_id
      GROUP BY pr.post_id
    ) as x
   ON w.weight > x.sum_weight
GROUP BY w.weight;

+--------+------------+
| weight | post_count |
+--------+------------+
|      0 |          1 |
|     10 |       2591 |
|     20 |       4264 |
|     30 |       4386 |
|     40 |       5415 |
|     50 |       7499 |
[...]   
|   1270 |     119283 |
|   1320 |     119286 |
|   1330 |     119286 |
[...]
|   2590 |     119286 |
+--------+------------+
256 rows in set (9.89 sec)

If performance is critical and nothing else helps, you could create a summary table for:

SELECT pr.post_id, SUM(r.weight) as sum_weight     
FROM reasons r
JOIN posts_reasons pr
   ON r.id = pr.reason_id
GROUP BY pr.post_id

You can maintain this table via triggers.
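As a rough illustration of what that could look like, here is a minimal sketch. The table name post_weight_summary and the trigger names are hypothetical, and it only covers rows being added to or removed from posts_reasons. Changes to reasons.weight, or updates to existing posts_reasons rows, would need their own triggers.

```sql
-- Hypothetical summary table: one row per post with its total weight.
CREATE TABLE post_weight_summary
( post_id    BIGINT NOT NULL PRIMARY KEY
, sum_weight INT    NOT NULL
);

-- Initial load from the existing data.
INSERT INTO post_weight_summary (post_id, sum_weight)
SELECT pr.post_id, SUM(r.weight)
FROM reasons r
JOIN posts_reasons pr
    ON r.id = pr.reason_id
GROUP BY pr.post_id;

-- When a reason is attached to a post, add its weight to the post's total
-- (or create the row if this is the post's first reason).
CREATE TRIGGER posts_reasons_ai AFTER INSERT ON posts_reasons
FOR EACH ROW
    INSERT INTO post_weight_summary (post_id, sum_weight)
    SELECT NEW.post_id, r.weight
    FROM reasons r
    WHERE r.id = NEW.reason_id
    ON DUPLICATE KEY UPDATE
        post_weight_summary.sum_weight =
            post_weight_summary.sum_weight + VALUES(sum_weight);

-- When a reason is detached from a post, subtract its weight again.
-- A post that loses its last reason keeps a row with a total of 0.
CREATE TRIGGER posts_reasons_ad AFTER DELETE ON posts_reasons
FOR EACH ROW
    UPDATE post_weight_summary s
    JOIN reasons r
        ON r.id = OLD.reason_id
    SET s.sum_weight = s.sum_weight - r.weight
    WHERE s.post_id = OLD.post_id;
```

Each trigger body is a single statement, so no DELIMITER change is needed in the mysql client. The derived table x in the main query could then be replaced by a direct join against post_weight_summary.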

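Since a certain amount of work has to be done for each weight in weights, it may be beneficial to limit the rows of that table that take part in the join: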

   ON w.weight > x.sum_weight 
WHERE w.weight <= (select MAX(sum_weights) 
                  from (SELECT SUM(weight) as sum_weights 
                  FROM reasons r        
                  JOIN posts_reasons pr
                      ON r.id = pr.reason_id 
                  GROUP BY pr.post_id) a
                 ) 
GROUP BY w.weight

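Since my weights table contained a lot of unnecessary rows (it goes up to 2590), the restriction above cut the execution time from 9 seconds to 4 seconds.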

Source: https://dba.stackexchange.com/questions/206934