Enumerating the Unicode properties of a character in Ruby?
27-10-2019
Question
Is there any way to list all of a character's Unicode properties in Ruby? I can use Ruby 1.9's Regexp class to test whether a given character has a particular property (for example, some_char =~ /\p{P}/ to test whether some_char is punctuation, etc.), but since characters can have multiple properties (the character "(", for example, is both punctuation and ASCII), it would be nice to be able to get a list of all of a character's properties.
I could probably do this by hand using unicode_data.txt, or whatever it's called, but this seems like the kind of thing that has probably already been done somewhere. UnicodeUtils doesn't seem to have anything along these lines, and Googling didn't turn up anything obvious. Thanks!
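For reference, the regexp test described in the question can be turned into a crude property lister by probing a character against a list of candidate property names. The sketch below is only an illustration: the constant CANDIDATE_PROPERTIES and the method matching_properties are hypothetical names, and the candidate list is a small hand-picked sample, not the full set of properties Ruby's regexp engine knows about.

```ruby
# Hypothetical, hand-picked sample of property names; Ruby accepts many more.
CANDIDATE_PROPERTIES = %w[
  Alpha Digit Punct Upper Lower Space ASCII
  L Lu Ll N Nd P Po Ps Latin Greek Common
].freeze

# Returns the candidate names whose \p{...} class matches the single
# character +char+. Names the regexp engine rejects are skipped.
def matching_properties(char, names = CANDIDATE_PROPERTIES)
  names.select do |name|
    begin
      Regexp.new("\\A\\p{#{name}}\\z").match?(char)
    rescue RegexpError
      false # unknown property name in this Ruby version
    end
  end
end

p matching_properties("(")
p matching_properties("δ")
```

This only answers "which of these names match?", so its usefulness depends entirely on how complete the candidate list is.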
Solution
You can call out to my uniprops script.
$ uniprops -p delta greek:delta Greek:Delta
U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
$ uniprops \# ç π
U+0023 ‹#› \N{ NUMBER SIGN }:
\pP \p{Po}
All Any ASCII Assigned Common Zyyy Po P Gr_Base
Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
Print Punctuation
U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
Lowercase Print Word XID_Continue XIDC XID_Start XIDS
U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS
$ uniprops -a 'MICRO SIGN'
U+00B5 ‹µ› \N{MICRO SIGN}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM
Changes_When_NFKC_Casefolded CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased CWU Common Zyyy Ll L Gr_Base Grapheme_Base
Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Latin_1 Latin_1_Supplement Lowercase_Letter Lower Lowercase Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Lower X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Latin_1 Block=Latin_1_Supplement BLK=Latin1 Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Com
Decomposition_Type=Compat DT=Com Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic
LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1
Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0
Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=LO Sentence_Break=Lower SB=LO
Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
$ uniprops -a 2011
U+2011 ‹‑› \N{NON-BREAKING HYPHEN}
\pP \p{Pd}
All Any Assigned InGeneralPunctuation Changes_When_NFKC_Casefolded CWKCF Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation
Gr_Base Grapheme_Base Graph GrBase Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation X_POSIX_Graph X_POSIX_Print X_POSIX_Punct
Age=1.1 Bidi_Class=ON Bidi_Class=Other_Neutral BC=ON Block=General_Punctuation Canonical_Combining_Class=0
Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=Nb
Decomposition_Type=Nobreak DT=Nb Decomposition_Type=Non_Canon Decomposition_Type=Non_Canonical DT=NonCanon East_Asian_Width=Neutral
Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA
Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=GL Line_Break=Glue LB=GL
Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0
IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1
IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other
WB=XX Word_Break=XX _X_Begin
$ uniprops -l | grep Greek | sort -dfu
Blk=Greek
Block:Ancient_Greek_Musical_Notation
Block:Ancient_Greek_Numbers
Block:Greek
Block=Greek_And_Coptic
Block:Greek_Extended
Greek
Greek_And_Coptic
InAncientGreekMusicalNotation
InAncientGreekNumbers
InGreek
InGreekExtended
Is_Greek
Script=Greek
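Since the question is about Ruby, here is a minimal sketch of calling out to the script from Ruby and collecting the property names per code point. It assumes that uniprops is on the PATH and that its output looks like the transcripts above (a header line starting with U+XXXX followed by lines of whitespace-separated names); the method names parse_uniprops_output and uniprops_for are hypothetical.

```ruby
require "open3"

# Parses text shaped like the transcripts above into
# { code_point_integer => [property tokens...] }.
def parse_uniprops_output(text)
  result = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if (m = line.match(/\AU\+(\h{4,6})\b/))
      current = m[1].hex          # header line: start a new entry
      result[current] = []
    elsif current
      result[current].concat(line.split) # property lines: collect tokens
    end
  end
  result
end

# Runs the external command (assumed to be installed) with the given
# arguments and parses whatever it prints on standard output.
def uniprops_for(*args, command: "uniprops")
  output, status = Open3.capture2(command, *args)
  raise "#{command} exited with status #{status.exitstatus}" unless status.success?
  parse_uniprops_output(output.force_encoding("UTF-8"))
end
```

Something like uniprops_for("-a", "2011") would then return a hash keyed by 0x2011, provided the script is installed and prints in the format assumed here.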
You will probably also want to get unichars, so that you can go the other way. These are just examples of calling it:
$ unichars -gns '\p{Cased}' '\p{Number}'
$ unichars '\R'
$ unichars '\S' '[\v\h]'
$ unichars '\S' '\p{space}'
$ unichars '\pL' '\p{Greek}'
$ unichars '\pL' '\p{Greek}' | um
$ unichars '\p{Age=6.0}' | um
$ unichars '\p{Lowercase}' '\P{Lowercase_Letter}'
$ unichars '\p{Lower}' '\P{Ll}' # same but easier to type
$ unichars -a '\p{alphabetic}' '\P{Letter}' | wc -l # 1006 code points
$ unichars -gas '\PL' '\p{Cased}'
$ unichars -gas '\P{MARK}' '\p{diacritic}' # 209 code points
$ unichars -gas '\pM' '\P{BC=NSM}'
$ unichars -gas '\p{Cased}' '[^\p{CWL}\p{CWT}\p{CWU}]'
$ unichars -gas '\p{Dash}'
$ unichars -gas '\p{mark}' '\P{DIACRITIC}' # 1068 code points
$ unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
$ unichars -gas 'uc ne ucfirst'
$ unichars -gasn NUM
Here is an example of the output:
$ unichars -gsn NUM 'int NUM ne NUM'
0 U+0030 GC=Nd 0=NV SC=Common DIGIT ZERO
¼ U+00BC GC=No 1/4=NV SC=Common VULGAR FRACTION ONE QUARTER
½ U+00BD GC=No 1/2=NV SC=Common VULGAR FRACTION ONE HALF
¾ U+00BE GC=No 3/4=NV SC=Common VULGAR FRACTION THREE QUARTERS
٠ U+0660 GC=Nd 0=NV SC=Common ARABIC-INDIC DIGIT ZERO
۰ U+06F0 GC=Nd 0=NV SC=Arabic EXTENDED ARABIC-INDIC DIGIT ZERO
߀ U+07C0 GC=Nd 0=NV SC=Nko NKO DIGIT ZERO
० U+0966 GC=Nd 0=NV SC=Devanagari DEVANAGARI DIGIT ZERO
০ U+09E6 GC=Nd 0=NV SC=Bengali BENGALI DIGIT ZERO
৴ U+09F4 GC=No 1/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE
৵ U+09F5 GC=No 1/8=NV SC=Bengali BENGALI CURRENCY NUMERATOR TWO
৶ U+09F6 GC=No 3/16=NV SC=Bengali BENGALI CURRENCY NUMERATOR THREE
৷ U+09F7 GC=No 1/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR FOUR
৸ U+09F8 GC=No 3/4=NV SC=Bengali BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
੦ U+0A66 GC=Nd 0=NV SC=Gurmukhi GURMUKHI DIGIT ZERO
૦ U+0AE6 GC=Nd 0=NV SC=Gujarati GUJARATI DIGIT ZERO
୦ U+0B66 GC=Nd 0=NV SC=Oriya ORIYA DIGIT ZERO
୲ U+0B72 GC=No 1/4=NV SC=Oriya ORIYA FRACTION ONE QUARTER
୳ U+0B73 GC=No 1/2=NV SC=Oriya ORIYA FRACTION ONE HALF
୴ U+0B74 GC=No 3/4=NV SC=Oriya ORIYA FRACTION THREE QUARTERS
୵ U+0B75 GC=No 1/16=NV SC=Oriya ORIYA FRACTION ONE SIXTEENTH
୶ U+0B76 GC=No 1/8=NV SC=Oriya ORIYA FRACTION ONE EIGHTH
୷ U+0B77 GC=No 3/16=NV SC=Oriya ORIYA FRACTION THREE SIXTEENTHS
etc.
I describe these in the first of my OSCON Unicode talks. They are just two of the tools in a suite of a couple dozen.
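The reverse lookup, from properties to characters, can be roughly approximated in plain Ruby by scanning code points and keeping those that match every pattern. This is only a sketch and not how unichars itself works; the method name chars_matching and its range: parameter are hypothetical, and only properties known to Ruby's regexp engine can be used.

```ruby
# Returns every character in +range+ that matches all of the given
# regexps, similar in spirit to: unichars '\pL' '\p{Greek}'
def chars_matching(*patterns, range: 0..0x10FFFF)
  range.each_with_object([]) do |cp, found|
    next if (0xD800..0xDFFF).cover?(cp) # surrogates are not valid in UTF-8
    char = [cp].pack("U")               # code point -> one-character UTF-8 string
    found << char if patterns.all? { |re| re.match?(char) }
  end
end

# Letters in the Greek script, restricted to a small range to keep it quick.
greek_letters = chars_matching(/\p{L}/, /\p{Greek}/, range: 0x0370..0x03FF)
puts greek_letters.join(" ")
```

Scanning the full default range tests more than a million code points per call, so narrowing the range is worthwhile when the block of interest is known.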
Other tips
There is a unicode_data.txt interface by runPaint, which works well, but describes itself as a "very early draft".
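For anyone who prefers the by-hand route mentioned in the question, each line of the Unicode Character Database file UnicodeData.txt is a semicolon-separated record with 15 fields. The sketch below is unrelated to the runPaint interface; it is a hypothetical minimal parser (the names UnicodeDataRecord and parse_unicode_data_line are made up here) that keeps only a few of those fields and assumes a copy of the file has been downloaded separately.

```ruby
# A few of the 15 fields of a UnicodeData.txt record.
UnicodeDataRecord = Struct.new(:code_point, :name, :general_category,
                               :combining_class, :bidi_class, :numeric_value)

# Parses one UnicodeData.txt line. Field positions: 0 = code point (hex),
# 1 = name, 2 = general category, 3 = canonical combining class,
# 4 = bidi class, 8 = numeric value (may be empty or a fraction like 1/4).
def parse_unicode_data_line(line)
  fields = line.chomp.split(";", -1) # -1 keeps trailing empty fields
  UnicodeDataRecord.new(
    fields[0].hex,
    fields[1],
    fields[2],
    fields[3].to_i,
    fields[4],
    fields[8].empty? ? nil : Rational(fields[8])
  )
end

# Example with a line in the UnicodeData.txt format.
line = "00BC;VULGAR FRACTION ONE QUARTER;No;0;ON;<fraction> 0031 2044 0034;;;1/4;N;FRACTION ONE QUARTER;;;;"
record = parse_unicode_data_line(line)
puts format("U+%04X GC=%s NV=%s %s",
            record.code_point, record.general_category,
            record.numeric_value, record.name)
```

UnicodeData.txt only covers the core per-character fields; derived properties such as scripts and blocks live in other files of the database, so this does not by itself reproduce the long lists shown in the accepted answer.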