How to speed up a string cleanup function?

August 2, 2016

I need to clean up a string so that certain ASCII code characters are left out of it and others are replaced.

I am new to Postgres. My function ufn_cie_easy() runs far too slowly:

DECLARE
 letter char = '';
 str_result TEXT = '';
 x integer;
 y integer;
 asc_code int;
BEGIN
 y:=1;
 x:=char_length(arg);
 LOOP
   letter=substring(arg from y for 1);
   asc_code=ascii(letter);
   IF (asc_code BETWEEN 47 and 58) or (asc_code BETWEEN 65 and 90) or (
       asc_code BETWEEN 97 and 122) THEN
     str_result := str_result || letter;
     ELSIF (asc_code BETWEEN 192 and 197) THEN
     str_result := str_result || 'A';
     ELSIF (asc_code BETWEEN 200 and 203) THEN
     str_result := str_result || 'E';
     ELSIF (asc_code BETWEEN 204 and 207) THEN
     str_result := str_result || 'I';
     ELSIF (asc_code BETWEEN 210 and 214) OR (asc_code=216) THEN
     str_result := str_result || 'O';
     ELSIF (asc_code BETWEEN 217 and 220) THEN
     str_result := str_result || 'U';
     ELSIF (asc_code BETWEEN 224 and 229) THEN
     str_result := str_result || 'a';
     ELSIF (asc_code BETWEEN 232 and 235) THEN
     str_result := str_result || 'e';
     ELSIF (asc_code BETWEEN 236 and 239) THEN
     str_result := str_result || 'i';
     ELSIF (asc_code BETWEEN 242 and 246) OR (asc_code=248) THEN
     str_result := str_result || 'o';
     ELSIF (asc_code BETWEEN 249 and 252) THEN
     str_result := str_result || 'u';
     ELSE
     CASE asc_code
       WHEN 352 THEN str_result := str_result || 'S';
       WHEN 338 THEN str_result := str_result || 'OE';
       WHEN 381 THEN str_result := str_result || 'Z';
       WHEN 353 THEN str_result := str_result || 's';
       WHEN 339 THEN str_result := str_result || 'oe';
       WHEN 382 THEN str_result := str_result || 'z';
       WHEN 162 THEN str_result := str_result || 'c';
       WHEN 198 THEN str_result := str_result || 'AE';
       WHEN 199 THEN str_result := str_result || 'C';
       WHEN 208 THEN str_result := str_result || 'D';
       WHEN 209 THEN str_result := str_result || 'N';
       WHEN 223 THEN str_result := str_result || 'ss';
       WHEN 230 THEN str_result := str_result || 'ae';
       WHEN 231 THEN str_result := str_result || 'c';
       WHEN 241 THEN str_result := str_result || 'n';
       WHEN 376 THEN str_result := str_result || 'Y';
       WHEN 221 THEN str_result := str_result || 'Y';
       WHEN 253 THEN str_result := str_result || 'y';
       WHEN 255 THEN str_result := str_result || 'y';
       ELSE str_result := str_result;
     END CASE;
     END IF;    
   y:=y+1;
   exit when  y=x+1;
 END LOOP;
 return str_result;
END;

This function should be substantially faster, with fewer assignments (none at all, actually, in this updated version):
CREATE FUNCTION ufn_cie_easy(text)
 RETURNS text AS
$func$
BEGIN
RETURN replace(replace(replace(replace(replace(replace(
        translate($1,'ŠŽšžŸÝÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜàáâãäåçèéêëìíîïñòóôõöøùúûüýÿ!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„…†‡ˆ‰‹‘’“”•–—˜™›¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×Þð÷þÐ'
                    ,'SZszYYAAAAAACEEEEIIIINOOOOOOUUUUaaaaaaceeeeiiiinoooooouuuuyy')
        ,'Œ','OE')
        ,'Æ','AE')
        ,'œ','oe')
        ,'æ','ae')
        ,'ß','ss')
        ,'''','');
END
$func$  LANGUAGE plpgsql;
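Assignments are comparatively expensive in plpgsql.
But I suspect you really want to remove all accents (diacritic signs). Postgres provides that with unaccent(), from the additional module unaccent: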
CREATE OR REPLACE FUNCTION ufn_cie_easy(text)
 RETURNS text AS
$func$
SELECT translate(unaccent(
       replace(replace(replace(replace(replace(
         $1
        ,'Œ','OE')
        ,'Æ','AE')
        ,'œ','oe')
        ,'æ','ae')
        ,'ß','ss')
      ), '!"#$%&()*+,-./:;<=>?@[\]^_`{|}~€‚ƒ„…†‡ˆ‰‹‘’“”•–—˜™›¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×Þð÷þÐ''', '');
$func$  LANGUAGE sql IMMUTABLE;
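Substantially faster still. And superior in every respect if you actually want to remove all diacritic signs.
Besides unaccent():
Include the escaped ' ('') directly in translate().
Make it a simple SQL function.
Make the function IMMUTABLE.
Consider the detailed instructions and further related issues if you need an IMMUTABLE function with unaccent():
Does PostgreSQL support "accent insensitive" collations?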
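One practical reason to care about IMMUTABLE: Postgres only accepts immutable functions in index expressions. The following is a minimal sketch under stated assumptions, not part of the original answer. The function clean_name(), the table person, and the index name are all hypothetical, and clean_name() is a simplified stand-in for ufn_cie_easy() that needs no extension:

```sql
-- Hypothetical stand-in for ufn_cie_easy():
-- removes everything except ASCII letters and digits.
CREATE FUNCTION clean_name(text)
  RETURNS text AS
$func$
SELECT regexp_replace($1, '[^a-zA-Z0-9]', '', 'g');
$func$ LANGUAGE sql IMMUTABLE;

-- Hypothetical table, for illustration only.
CREATE TABLE person (
  person_id serial PRIMARY KEY,
  name      text
);

-- The expression index is accepted only because clean_name() is declared IMMUTABLE.
CREATE INDEX person_clean_name_idx ON person (clean_name(name));

-- A query filtering on the same expression can use that index.
SELECT * FROM person WHERE clean_name(name) = 'OBrien';
```

Declaring a function IMMUTABLE is a promise to the planner, so it should only be done when the result really depends on nothing but the arguments.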
In Postgres 9.5 or older, we need to expand ligatures like 'Œ' or 'ß' manually, because unaccent() always substitutes a single letter:
   SELECT unaccent('Œ Æ œ æ ß');

   unaccent
   ----------
   E A e a S
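What I would do in Postgres 9.6
After this update to unaccent:
Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)
The part that matters here is that ligatures are now expanded correctly. Using unaccent() as instructed in the answer linked above: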
CREATE OR REPLACE FUNCTION ufn_cie_easy(text)
 RETURNS text AS
$func$
SELECT trim(regexp_replace(regexp_replace(public.unaccent('public.unaccent', $1)
                                       , '[^a-zA-Z\d\s]', '', 'g')
                        , '\s+', ' ', 'g'));
$func$  LANGUAGE sql IMMUTABLE;
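Almost, but not quite 100 % the same as what you had. Better, IMHO:
Replaces all accents and expands all ligatures with unaccent().
Removes all noise characters (everything except ASCII letters, digits, and whitespace).
Folds runs of whitespace into a single space (' ').
Trims leading and trailing whitespace.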
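The last three steps can be tried on their own, without the unaccent extension. A minimal sketch, assuming the input has already been through unaccent(); the sample string is made up for illustration:

```sql
-- Step 1 (inner regexp_replace): drop everything except ASCII letters, digits, whitespace.
-- Step 2 (outer regexp_replace): fold each run of whitespace into one space.
-- Step 3 (trim): remove leading and trailing spaces.
SELECT trim(regexp_replace(regexp_replace('  Hello,   World!  42 '
                                        , '[^a-zA-Z\d\s]', '', 'g')
                         , '\s+', ' ', 'g')) AS cleaned;
-- returns 'Hello World 42'
```

The 'g' flag matters in both calls: without it, regexp_replace() only replaces the first match.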

As I understand it, you can use translate() to achieve what you want.
A small demonstration:
SELECT translate('Some ţext with ひ inţereşt平ing chăracters', 'ţşăひ平', 'tsa');
              translate                
────────────────────────────────────────
Some text with  interesting characters
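So, first put the characters to be replaced into the second argument of the function and their replacements into the third, each one matching its counterpart (in the given order!). Then simply append all characters that must be removed to the second argument; they have no match in the replacement list.
For characters that need a multi-character replacement, you can still use replace() in a similar way, but with one call per character: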
SELECT replace(replace('Æ small blæblæ', 'Æ', 'AE'), 'æ', 'ae');
     replace      
───────────────────
AE small blaeblae
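The two techniques can be combined in one expression. A minimal sketch with a made-up sample string: replace() expands the ligatures first, then translate() maps 'ţ' to 't' and deletes 'ひ', which has no counterpart in the replacement list:

```sql
SELECT translate(
         replace(replace('Æ ţext with ひ blæ', 'Æ', 'AE'), 'æ', 'ae')
       , 'ţひ'   -- characters to replace or remove
       , 't'     -- replacements; 'ひ' has none, so it is deleted
       ) AS cleaned;
-- returns 'AE text with  blae' (two spaces remain where 'ひ' was removed)
```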

Source: https://dba.stackexchange.com/questions/145187
