Trimming Unicode whitespace in PHP 5.2

https://stackoverflow.com/questions/4166896

09-10-2019

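Question

How can I trim string(6) " page", where the first whitespace character is a 0xc2a0 non-breaking space?

I have tried trim() and preg_match('/^\s*(.*)\s*$/u', $key, $m);.

Another question: how can I reliably copy these characters? They seem to get converted to "normal" spaces, which makes this hard to debug.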

Solution

preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);
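A minimal sketch of how that one-liner behaves on the string from the question, assuming the input is valid UTF-8. Writing the no-break space as the escape "\xC2\xA0" and inspecting the bytes with bin2hex() also helps with the copy-and-paste problem, since the character never has to survive an editor or a browser. The variable names are only illustrative.

```php
<?php
// U+00A0 NO-BREAK SPACE is the two bytes 0xC2 0xA0 in UTF-8,
// so this string is 6 bytes long, as in the question.
$key = "\xC2\xA0page";

// Show the raw bytes, so an invisible character cannot hide.
$hex = bin2hex($key);
var_dump($hex); // string(12) "c2a070616765"

// Strip separators (\pZ) and "other" characters (\pC) from both ends.
$trimmed = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $key);
var_dump($trimmed); // string(4) "page"
```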

Other tips

PCRE Unicode properties can be used to achieve this.

Here is the code I played with; it seems to do what you want:

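<?php
// Strip leading and trailing separators (\pZ) and "other" characters (\pC).
// The anchored alternation also handles whitespace on one side only,
// and runs of several whitespace characters at either end.
function unicode_trim ($str) {
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
}

// chr(0xc2) . chr(0xa0) is U+00A0 NO-BREAK SPACE encoded as UTF-8.
$key = chr(0xc2) . chr(0xa0) . '#page#' . chr(0xc2) . chr(0xa0);

var_dump(unicode_trim($key));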

Result

[~]> php e.php
string(6) "#page#"

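Explanation:

\p{xx} matches a character that has the xx property; \P{xx} matches a character that does not have the xx property.

If xx is a single letter, the {} can be dropped, e.g. \p{Z} is the same as \pZ.

Z stands for all separators, and C stands for all "other" characters (for example, control characters).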
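The notation above can be checked directly with preg_match(). This is only a small sketch, assuming a PCRE build with Unicode property support (needed for \p under the /u modifier); the variable names are made up for illustration.

```php
<?php
$nbsp = "\xC2\xA0"; // U+00A0 NO-BREAK SPACE, general category Zs

$braced   = preg_match('/^\p{Z}\z/u', $nbsp);  // 1: has the Z property
$short    = preg_match('/^\pZ\z/u', $nbsp);    // 1: same test, braces dropped
$negated  = preg_match('/^\PZ\z/u', $nbsp);    // 0: \P means "lacks the property"
$subclass = preg_match('/^\p{Zs}\z/u', $nbsp); // 1: Zs is a subcategory of Z
$tab      = preg_match('/^\pC\z/u', "\t");     // 1: tab is a control character (Cc)

var_dump($braced, $short, $negated, $subclass, $tab);
```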

Existing solutions mention only \pZ characters. However, there are six Unicode whitespace characters that fall outside the purview of that property:

% unichars '\p{WhiteSpace}' '\PZ'
 --    9 0009 CHARACTER TABULATION
 --   10 000A LINE FEED (LF)
 --   11 000B LINE TABULATION
 --   12 000C FORM FEED (FF)
 --   13 000D CARRIAGE RETURN (CR)
 --  133 0085 NEXT LINE (NEL)

Those six are all of type \pC, specifically of type \p{Cc}. However, there are also fifty-nine non-whitespace characters that are \p{Cc}:

% unichars '\P{WhiteSpace}' '\p{Cc}' | wc -l
      59
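The same point can be reproduced from PHP without the unichars tool. A rough sketch, assuming the six characters listed above are written as UTF-8 byte escapes; the array keys are just labels chosen for this example.

```php
<?php
// The six White_Space characters that are not in \pZ, as UTF-8 bytes.
$outside = array(
    'TAB' => "\x09", 'LF' => "\x0A", 'VT' => "\x0B",
    'FF'  => "\x0C", 'CR' => "\x0D", 'NEL' => "\xC2\x85",
);

$in_z  = 0; // how many of them match \pZ
$in_cc = 0; // how many of them match \p{Cc}
foreach ($outside as $name => $ch) {
    $in_z  += preg_match('/^\pZ\z/u', $ch);
    $in_cc += preg_match('/^\p{Cc}\z/u', $ch);
}
var_dump($in_z);  // int(0)
var_dump($in_cc); // int(6)

// A trim based on \pZ alone therefore leaves a leading tab in place.
$z_only = preg_replace('/^\pZ+|\pZ+$/u', '', "\tpage");
var_dump($z_only === "\tpage"); // bool(true)
```

This is why the patterns above pair \pZ with \pC, at the cost of also stripping the non-whitespace control characters.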

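The simple version of my own test for whether something is a printable character is just whether it falls outside [\pZ\pC]; that is what unichars uses, for example.

A more careful test would consider whether the character should occupy 0, 1, or 2 print positions. That requires considering whether it is a combining mark, which is the property \pM, and whether it has the halfwidth or fullwidth property. For example: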

% uniprops ff5e ffeb
U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
       CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
       Math_Symbol Print Symbol
U+FFEB ‹￫› \N{ HALFWIDTH RIGHTWARDS ARROW }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
       CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
       Math_Symbol Print Symbol

For these, you need to use the non-binary East Asian Width property. These apply:

% uniprops -l | grep -i width
Block:Halfwidth_And_Fullwidth_Forms
InHalfwidthAndFullwidthForms
East_Asian_Width:A
East_Asian_Width=Ambiguous
East_Asian_Width:Ambiguous
East_Asian_Width:F
East_Asian_Width=Fullwidth
East_Asian_Width:Fullwidth
East_Asian_Width:H
East_Asian_Width=Halfwidth
East_Asian_Width:Halfwidth
East_Asian_Width=Neutral
East_Asian_Width:Na
East_Asian_Width=Narrow
East_Asian_Width:Narrow
East_Asian_Width:Neutral
East_Asian_Width:W
East_Asian_Width=Wide
East_Asian_Width:Wide

These abbreviate to forms such as \p{Ea=F} and \p{Ea=H}. There are a lot of them:

% uninames '(FULL|HALF)WIDTH' | wc -l
     454

Of course, you must never go by names for these things, but by properties:

% unichars '[\p{Ea=F}\p{Ea=H}]' | wc -l
     227
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}]' | wc -l
     338
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}\pM]' | wc -l
    1488
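To make the "0, 1, or 2 print positions" idea concrete, here is a rough PHP sketch. As far as I know, the PCRE behind PHP's preg_* functions does not expose East_Asian_Width, so this hard-codes the ranges U+FF01..U+FF60 and U+FFE0..U+FFE6 as a stand-in for a small subset of \p{Ea=F}. The function approx_width() is hypothetical, not part of PHP, and it ignores Ea=W and many other cases.

```php
<?php
// Rough display-width estimate for a valid UTF-8 string (illustrative only).
function approx_width($str) {
    $width = 0;
    // Split the string into individual code points.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $ch) {
        if (preg_match('/^[\pM\pC]\z/u', $ch)) {
            continue; // combining marks and "other" characters: 0 columns
        }
        if (preg_match('/^[\x{FF01}-\x{FF60}\x{FFE0}-\x{FFE6}]\z/u', $ch)) {
            $width += 2; // assumed subset of the fullwidth characters
        } else {
            $width += 1; // everything else is treated as one column
        }
    }
    return $width;
}

var_dump(approx_width("n\xCC\x83"));    // "n" + U+0303 COMBINING TILDE: int(1)
var_dump(approx_width("\xEF\xBD\x9E")); // U+FF5E FULLWIDTH TILDE: int(2)
```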

To show you just how many properties these things really have, here is a complete property dump of three different characters, as of Unicode 5.2:

% uniprops -ga NEL "COMBINING TILDE" ff5e 
U+0085 ‹U+0085› \N{ NEXT LINE (NEL) }:
    \s \v \R \pC \p{Cc}
    All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace
       White_Space WSpace
    Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1
       Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 General_Category=Other Canonical_Combining_Class:0
       Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR
       General_Category=Control Script=Common Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral
       General_Category:C General_Category:Cc General_Category:Cntrl General_Category:Control Gc=Cc General_Category:Other Gc=C
       Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
       Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL
       Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
       Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
       Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL
       Word_Break:NL Word_Break=Newline
U+0303 ‹̃› \N{ COMBINING TILDE }:
    \w \pM \p{Mn}
    All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt
       ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC
    Age:1.1 Bidi_Class:Nonspacing_Mark Bc=NSM Bidi_Class:NSM Bidi_Class=Nonspacing_Mark Block:Combining_Diacritical_Marks
       Canonical_Combining_Class:230 Canonical_Combining_Class=Above Canonical_Combining_Class:A
       Canonical_Combining_Class:Above Ccc=A Decomposition_Type:None Dt=None East_Asian_Width:A East_Asian_Width=Ambiguous
       East_Asian_Width:Ambiguous Ea=A General_Category:M General_Category=Mark General_Category:Mark Gc=M General_Category:Mn
       General_Category=Nonspacing_Mark General_Category:Nonspacing_Mark Gc=Mn Grapheme_Cluster_Break:EX
       Grapheme_Cluster_Break=Extend Grapheme_Cluster_Break:Extend GCB=EX Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Script=Inherited
       Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:T Joining_Type=Transparent Joining_Type:Transparent Jt=T
       Line_Break:CM Line_Break=Combining_Mark Line_Break:Combining_Mark Lb=CM NFC_Quick_Check:M NFC_Quick_Check=Maybe
       NFC_Quick_Check:Maybe NFCQC=M NFKC_Quick_Check:M NFKC_Quick_Check=Maybe NFKC_Quick_Check:Maybe NFKCQC=M
       Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1
       In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1
       Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Inherited Sc=Zinh Script:Qaai Script:Zinh
       Sentence_Break:EX Sentence_Break=Extend Sentence_Break:Extend SB=EX Word_Break:Extend WB=Extend
U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base
       Graph GrBase Math Math_Symbol Print Symbol
    Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral Bidi_Class:Other_Neutral Bc=ON Block:Halfwidth_And_Fullwidth_Forms
       Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR
       Canonical_Combining_Class:NR Script=Common Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
       Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Wide Dt=Wide East_Asian_Width:F
       East_Asian_Width=Fullwidth East_Asian_Width:Fullwidth Ea=F General_Category:Math_Symbol Gc=Sm General_Category:S
       General_Category=Symbol General_Category:Sm General_Category=Math_Symbol General_Category:Symbol Gc=S
       Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
       Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:ID
       Line_Break=Ideographic Line_Break:Ideographic Lb=ID Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
       Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2
       In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
       Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX Sentence_Break:XX Sentence_Break=Other Word_Break:Other
       WB=XX Word_Break:XX Word_Break=Other

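Pretty cool, huh?

If you have read this far and are wondering where to get the three Unicode utilities shown above, uniprops, unichars, and uninames, please send me mail, because the current link is not working right now.

Maybe something from the multibyte string function set? http://php.net/manual/en/function.mb-ereg.php I don't see an mb_trim, but there is a set of multibyte-safe regex functions.

This page might help:

http://nadeausoftware.com/articles/2007/9/php_tip_how_how_strip_punctuation_characters_web_page

But here is the only solution that worked for me, because sometimes there are UTF-8 spaces: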

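// Strip leading and trailing separators and "other" characters.
$stringg = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$stringg);
// Note: this removes every remaining whitespace run, including whitespace inside the string.
$stringg = preg_replace('/\s+/u', '', $stringg);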

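None of the answers above actually worked for me to remove trailing whitespace from a UTF-8 string.

The solution found here works perfectly and is the shortest: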

trim($str, "\t\n\r\0\x0B\xC2\xA0");
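One caveat is worth checking before relying on this: trim() treats its second argument as a list of single bytes, not UTF-8 characters, so \xC2 and \xA0 are stripped independently of each other. The sketch below, with made-up input strings, shows a case where that cuts a different character in half. A custom list also replaces the default one, so a plain space is not stripped unless it is added to the list.

```php
<?php
$mask = "\t\n\r\0\x0B\xC2\xA0";

// Works for the no-break space itself.
$ok = trim("\xC2\xA0page\xC2\xA0", $mask);
var_dump($ok); // string(4) "page"

// "à" is U+00E0, encoded as 0xC3 0xA0. Its second byte is in the mask,
// so trim() removes it and leaves a dangling 0xC3 (invalid UTF-8).
$broken = trim("voil\xC3\xA0", $mask);
var_dump(bin2hex($broken)); // string(10) "766f696cc3"

// The regex version works on whole characters and leaves "à" intact.
$safe = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', "voil\xC3\xA0");
var_dump($safe === "voil\xC3\xA0"); // bool(true)
```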

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow