Postgresql

編寫此查詢的更好方法?

  • September 14, 2019

我的數據庫(Postgresql 10)有以下架構:

CREATE TABLE "PulledTexts" (
"Id" serial PRIMARY KEY,
"BaseText" TEXT,
"CleanText" TEXT
);

CREATE TABLE "UniqueWords" (
"Id" serial PRIMARY KEY,
"WordText" TEXT
);

CREATE TABLE "WordTexts" (
"Id" serial PRIMARY KEY,
"TextIdId" INTEGER REFERENCES "PulledTexts",
"WordIdId" INTEGER REFERENCES "UniqueWords"
);
CREATE INDEX "IX_WordTexts_TextIdId" ON "WordTexts" ("TextIdId");
CREATE INDEX "IX_WordTexts_WordIdId" ON "WordTexts" ("WordIdId");

一些樣本數據:

INSERT INTO public."PulledTexts" ("Id", "BaseText", "CleanText") VALUES
(1, 'automate business audit', null),
(2, 'audit trial', null),
(3, 'trial', null),
(4, 'audit', null),
(5, 'fresh report', null),
(6, 'fresh audit', null),
(7, 'automate this script', null),
(8, 'im trying here', null),
(9, 'automate this business', null),
(10, 'lateral', null);

INSERT INTO public."UniqueWords" ("Id", "WordText") VALUES
(1, 'trial'),
(2, 'audit'),
(3, 'creation'),
(4, 'business'),
(5, 'automate');

INSERT INTO public."WordTexts" ("Id", "TextIdId", "WordIdId") VALUES
(1, 1, 2),
(2, 1, 4),
(3, 1, 5),
(4, 2, 1),
(5, 3, 1),
(6, 4, 2),
(7, 6, 2),
(8, 7, 5),
(9, 9, 4),
(10, 9, 5),
(11, 2, 2);

數據庫本身是通過實體框架遷移創建的。

我想知道是否有更好、更高效的方式來編寫此查詢,因為該WordTexts表將包含數十萬條記錄,最終包含數百萬條記錄。如果對這類查詢更有效,我也願意採用 NoSql 路線。

SELECT *
FROM "PulledTexts"
WHERE "Id" IN (
SELECT "TextIdId"
FROM "WordTexts" AS "wordTexts"
LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
WHERE "wordTexts.WordId"."WordText" = 'automate'

OR "TextIdId" IN (
 SELECT "TextIdId" and1
 from "WordTexts" AS "wordTexts"
 LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
 where "wordTexts.WordId"."WordText" = 'audit' INTERSECT

 SELECT "TextIdId" and2
 from "WordTexts" AS "wordTexts"
 LEFT JOIN "UniqueWords" AS "wordTexts.WordId" ON "wordTexts"."WordIdId" = "wordTexts.WordId"."Id"
 WHERE "wordTexts.WordId"."WordText" = 'trial'
)
);

在較高級別上,此查詢應該返回其 ID 與子查詢返回的 ID 匹配的 PulledTexts。子查詢只是為了返回包含('audit' AND 'trial') OR 'automate'來自 UniqueWords 表的 PulledTexts 的 ID 列表。這正是我給出的範例查詢所做的。WordTexts 表是 UniqueWord 到 PulledText 的簡單映射。

雖然您的查詢是有效的,但我有很多不同的地方。

  1. 如果可以避免的話,不要在 Postgres 中使用 CaMeL-case 名稱。您的未命名實體框架可能會強迫您使用這種廢話,但我不想處理雙引號混亂,因此我在刪除所有雙引號後使用您的架構進行了測試 - 有效地使所有標識符小寫。
  1. 不要使用難以辨認或非法的列和表別名(如"wordTexts.WordId")。這是品味和風格(以及理智)的問題,但您也在AS應該的地方省略了關鍵字,並將其保留在可以省略的地方。
  1. 我還格式化了更多格式,以便我更容易理解查詢。最後一部分是完全可選的。但是使用一些一致的格式樣式。

到達這裡

SELECT *
FROM   PulledTexts
WHERE  Id IN (
  SELECT w.TextIdId
  FROM   WordTexts w  -- AS can be omitted for table alias
  LEFT   JOIN UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN might be necessary here
  WHERE  u.WordText = 'automate'
  OR     w.TextIdId IN (
     SELECT w.TextIdId  -- AS and1 -- column alias only documentation here, not visible
     FROM   WordTexts w
     JOIN   UniqueWords u ON w.WordIdId = u.Id  -- LEFT JOIN misleading here
     WHERE  u.WordText = 'audit'

     INTERSECT
     SELECT w.TextIdId  -- AS and2  -- but don't omit AS for column alias
     FROM   WordTexts w
     JOIN   UniqueWords u ON w.WordIdId = u.Id
     WHERE  u.WordText = 'trial'
     )
  );

可以簡化為:

SELECT *
FROM  (
  SELECT w.TextIdId AS Id
  FROM   WordTexts   w
  JOIN   UniqueWords u ON w.WordIdId = u.Id  -- now we don't need LEFT any more
  WHERE  u.WordText = 'automate'

  UNION
  SELECT w.TextIdId
  FROM   WordTexts w
  JOIN   UniqueWords u ON w.WordIdId = u.Id
  WHERE  u.WordText = 'audit'

  INTERSECT
  SELECT w.TextIdId
  FROM   WordTexts w
  JOIN   UniqueWords u ON w.WordIdId = u.Id
  WHERE  u.WordText = 'trial'
  ) w
JOIN   PulledTexts p USING (Id)

我們不需要額外的括號,因為根據手冊

INTERSECT比 結合得更緊UNION。也就是說,A UNION B INTERSECT C將被讀作A UNION (B INTERSECT C).

但這在替換多個相交的子查詢時會更快:

SELECT *
FROM  (
  SELECT w.TextIdId AS Id
  FROM   WordTexts   w
  JOIN   UniqueWords u ON w.WordIdId = u.Id
  WHERE  u.WordText = 'automate'

  UNION
  SELECT TextIdId
  FROM   WordTexts w1
  JOIN   WordTexts w2 USING (TextIdId)
  WHERE  w1.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'audit')
  AND    w2.WordIdId = (SELECT Id FROM UniqueWords WHERE WordText = 'trial')
  ) w
JOIN   PulledTexts p USING (Id)

INTERSECT部分可以轉換為關係劃分問題。就在昨天的相關答案中的解釋:

db<>在這裡擺弄

對性能最重要的是擁有正確的索引。您可能應該對in table有一個**UNIQUE約束,它**按此順序在這兩個列上實現目前缺少的索引。(WordIdId, TextIdId)``WordTexts

引用自:https://dba.stackexchange.com/questions/190146