1 | \cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham} |
2 | \cfg{man-mindepth}{1} |
3 | |
4 | \C{cvt-utf8-manpage} Man page for \cw{cvt-utf8} |
5 | |
6 | \H{cvt-utf8-manpage-name} NAME |
7 | |
8 | \cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode |
9 | |
10 | \H{cvt-utf8-manpage-synopsis} SYNOPSIS |
11 | |
12 | \c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints] |
13 | \e bbbbbbbb iiiii iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii |
14 | |
15 | \H{cvt-utf8-manpage-description} DESCRIPTION |
16 | |
17 | \cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and |
18 | Unicode data. Its functions include: |
19 | |
20 | \b Given a sequence of Unicode code points, convert them to the |
21 | corresponding sequence of bytes in the UTF-8 encoding. |
22 | |
23 | \b Given a sequence of UTF-8 bytes, convert them back into Unicode |
24 | code points. |
25 | |
26 | \b Given any combination of the above inputs, look up each Unicode |
27 | code point in the Unicode character database and identify it. |
28 | |
29 | \b Look up Unified Han characters in the \q{Unihan} database and |
30 | provide their translation text. |
31 | |
32 | By default, \cw{cvt-utf8} expects to receive hex numbers (either |
33 | UTF-8 bytes or Unicode code points) on the command line, and it will |
34 | print out a verbose analysis of the input data. If you need it to |
35 | read UTF-8 from standard input or to write pure UTF-8 to standard |
36 | output, use the \cw{-i} or \cw{-o} option respectively (see below). |
37 | |
38 | \H{cvt-utf8-manpage-options} OPTIONS |
39 | |
40 | \dt \cw{-i} |
41 | |
42 | \dd Read UTF-8 data from standard input and analyse that, instead of |
43 | expecting hex numbers on the command line. |
44 | |
45 | \dt \cw{-o} |
46 | |
47 | \dd Write well-formed UTF-8 to standard output, instead of writing a |
48 | long analysis of the input data. |
49 | |
50 | \dt \cw{-h} |
51 | |
52 | \dd Look up each code point in the Unihan database as well as the |
53 | main Unicode character database. |
54 | |
55 | \H{cvt-utf8-manpage-examples} EXAMPLES |
56 | |
57 | In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or |
58 | UTF-8 data. For example, you can give a list of Unicode code |
59 | points... |
60 | |
61 | \c $ cvt-utf8 U+20ac U+31 U+30 |
62 | \e bbbbbbbbbbbbbbbbbbbbbbbbb |
63 | \c U-000020AC E2 82 AC EURO SIGN |
64 | \c U-00000031 31 DIGIT ONE |
65 | \c U-00000030 30 DIGIT ZERO |
66 | |
67 | ... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the |
68 | character definitions. |
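
The byte values in the middle column follow directly from the UTF-8 bit layout. The following is a minimal Python sketch of that conversion, not taken from \cw{cvt-utf8}'s source: the names \cw{utf8_encode} and \cw{hex_bytes} are hypothetical, and it assumes the modern definition of UTF-8 (code points up to U+10FFFF, no surrogates), which may be narrower than what \cw{cvt-utf8} itself accepts.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: ASCII is encoded as itself
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def hex_bytes(cp: int) -> str:
    """Format the UTF-8 encoding of cp as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in utf8_encode(cp))


if __name__ == "__main__":
    # Same arguments as the example above, parsed from the U+xxxx form.
    for arg in ["U+20ac", "U+31", "U+30"]:
        cp = int(arg[2:], 16)
        print(f"U-{cp:08X}  {hex_bytes(cp)}")
```

Python's built-in \cw{chr(cp).encode("utf-8")} gives the same bytes; the explicit version is shown only to make the bit layout visible.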
69 | |
70 | Alternatively, you can supply a list of UTF-8 bytes... |
71 | |
72 | \c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9 |
73 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
74 | \c U-00000420 D0 A0 CYRILLIC CAPITAL LETTER ER |
75 | \c U-00000443 D1 83 CYRILLIC SMALL LETTER U |
76 | \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES |
77 | \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES |
78 | \c U-0000043A D0 BA CYRILLIC SMALL LETTER KA |
79 | \c U-00000438 D0 B8 CYRILLIC SMALL LETTER I |
80 | \c U-00000439 D0 B9 CYRILLIC SMALL LETTER SHORT I |
81 | |
82 | ... and you get back the same output format, including the Unicode |
83 | code points. |
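
For comparison, a similar table can be approximated in a few lines of Python using only the standard library. This is a sketch under stated assumptions, not a description of how \cw{cvt-utf8} works: the function name \cw{analyse} is hypothetical, the character names come from whatever Unicode version Python's \cw{unicodedata} module ships with, and the input is assumed to be well-formed UTF-8.

```python
import unicodedata


def analyse(hex_args):
    """Return (code point, UTF-8 hex bytes, name) for each decoded character.

    hex_args is a list of hex byte strings such as ["D0", "A0"].
    Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    """
    data = bytes(int(h, 16) for h in hex_args)
    rows = []
    for ch in data.decode("utf-8"):
        encoded = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        # Control characters have no name in unicodedata, so supply a fallback.
        name = unicodedata.name(ch, "<no name>")
        rows.append((f"U-{ord(ch):08X}", encoded, name))
    return rows


if __name__ == "__main__":
    for row in analyse("D0 A0 D1 83 D1 81".split()):
        print(*row)
```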
84 | |
85 | If you supply malformed data, \cw{cvt-utf8} will break it down for |
86 | you and identify the malformed pieces and any correctly formed |
87 | characters: |
88 | |
89 | \c $ cvt-utf8 A9 FE 45 C2 80 90 0A |
90 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
91 | \c A9 (unexpected continuation byte) |
92 | \c FE (invalid UTF-8 byte) |
93 | \c U-00000045 45 LATIN CAPITAL LETTER E |
94 | \c U-00000080 C2 80 <control> |
95 | \c 90 (unexpected continuation byte) |
96 | \c U-0000000A 0A <control> |
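
One way to produce this kind of breakdown is to classify each byte by its high bits and resynchronise after every error. The Python sketch below illustrates the idea. It is not \cw{cvt-utf8}'s actual algorithm: the names \cw{lead_length} and \cw{scan} and the \q{malformed sequence} label are hypothetical, and it assumes the modern UTF-8 rules (lead bytes C2 to F4 only), so it may classify some bytes differently from \cw{cvt-utf8}.

```python
def lead_length(b: int) -> int:
    """Sequence length implied by a lead byte, or 0 if b cannot start one."""
    if 0xC2 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF4:
        return 4
    return 0


def scan(data: bytes):
    """Split data into (code point or None, raw bytes, error note or None)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte: a complete character on its own.
            out.append((b, data[i:i + 1], None))
            i += 1
            continue
        if b <= 0xBF:
            # 10xxxxxx with no lead byte in front of it.
            out.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        n = lead_length(b)
        if n == 0:
            out.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        chunk = data[i:i + n]
        try:
            # Strict decoding rejects truncated, overlong and surrogate forms.
            ch = chunk.decode("utf-8")
        except UnicodeDecodeError:
            # Report only the lead byte, then resynchronise on the next byte.
            out.append((None, data[i:i + 1], "malformed sequence"))
            i += 1
            continue
        out.append((ord(ch), chunk, None))
        i += n
    return out


if __name__ == "__main__":
    for cp, raw, note in scan(bytes.fromhex("A9FE45C280900A")):
        left = f"U-{cp:08X}" if cp is not None else " " * 10
        hexed = " ".join(f"{x:02X}" for x in raw)
        print(left, hexed, f"({note})" if note else "")
```

For the input used in the example above, this sketch splits the bytes into the same six pieces: two errors, U+0045, U+0080, one more stray continuation byte, and U+000A.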
97 | |
98 | If you need the UTF-8 encoding of a particular character, you can |
99 | use the \cw{-o} option to cause the UTF-8 to be written to standard |
100 | output: |
101 | |
102 | \c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt |
103 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
104 | |
105 | If you have UTF-8 data in a file or output from another program, you |
106 | can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This |
107 | works particularly well if you also have my \cw{xcopy} program, |
108 | which can be told to extract UTF-8 data from the X selection and |
109 | write it to its standard output. With these two programs working |
110 | together, if you ever have trouble identifying some text in a |
111 | UTF-8-supporting web browser such as Mozilla, you can simply select |
112 | the text in question, switch to a terminal window, and type |
113 | |
114 | \c $ xcopy -u -r | cvt-utf8 -i |
115 | \e bbbbbbbbbbbbbbbbbbbbbbbbb |
116 | |
117 | If the text is in Chinese, you can get at least a general idea of |
118 | its meaning by using the \cw{-h} option to print the meaning of each |
119 | ideograph from the Unihan database. For example, if you pass in the |
120 | Chinese text meaning \q{Traditional Chinese}: |
121 | |
122 | \c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587 |
123 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
124 | \c U-00007E41 E7 B9 81 <han> complicated, complex, difficult |
125 | \c U-00009AD4 E9 AB 94 <han> body; group, class, body, unit |
126 | \c U-00004E2D E4 B8 AD <han> central; center, middle; in the |
127 | \c midst of; hit (target); attain |
128 | \c U-00006587 E6 96 87 <han> literature, culture, writing |
129 | |
130 | \H{cvt-utf8-manpage-bugs} BUGS |
131 | |
132 | Command-line option processing is very basic. In particular, \cw{-h} |
133 | must come before \cw{-i} or it will not be recognised. |