1 | \cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham} |
2 | \cfg{html-chapter-numeric}{yes} |
3 | |
4 | \title Man page for \cw{cvt-utf8} |
5 | |
6 | \U NAME |
7 | |
8 | \cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode |
9 | |
10 | \U SYNOPSIS |
11 | |
12 | \c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints] |
13 | \e bbbbbbbb iiiii iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii |
14 | |
15 | \U DESCRIPTION |
16 | |
17 | \cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and |
18 | Unicode data. Its functions include: |
19 | |
20 | \b Given a sequence of Unicode code points, convert them to the |
21 | corresponding sequence of bytes in the UTF-8 encoding. |
22 | |
23 | \b Given a sequence of UTF-8 bytes, convert them back into Unicode |
24 | code points. |
25 | |
26 | \b Given any combination of the above inputs, look up each Unicode |
27 | code point in the Unicode character database and identify it. |
28 | |
29 | \b Look up Unified Han characters in the \q{Unihan} database and |
30 | provide their translation text. |
31 | |
32 | By default, \cw{cvt-utf8} expects to receive hex numbers (either |
33 | UTF-8 bytes or Unicode code points) on the command line, and it will |
34 | print out a verbose analysis of the input data. If you need it to |
35 | read UTF-8 from standard input or to write pure UTF-8 to standard |
36 | output, use the \cw{-i} or \cw{-o} option respectively. |
37 | |
38 | \U OPTIONS |
39 | |
40 | \dt \cw{-i} |
41 | |
42 | \dd Read UTF-8 data from standard input and analyse that, instead of |
43 | expecting hex numbers on the command line. |
44 | |
45 | \dt \cw{-o} |
46 | |
47 | \dd Write well-formed UTF-8 to standard output, instead of writing a |
48 | long analysis of the input data. |
49 | |
50 | \dt \cw{-h} |
51 | |
52 | \dd Look up each code point in the Unihan database as well as the |
53 | main Unicode character database. |
54 | |
55 | \U EXAMPLES |
56 | |
57 | In its default mode, \cw{cvt-utf8} simply analyses input Unicode or |
58 | UTF-8 data. For example, you can give a list of Unicode code |
59 | points... |
60 | |
61 | \c $ cvt-utf8 U+20ac U+31 U+30 |
62 | \e bbbbbbbbbbbbbbbbbbbbbbbbb |
63 | \c U-000020AC E2 82 AC EURO SIGN |
64 | \c U-00000031 31 DIGIT ONE |
65 | \c U-00000030 30 DIGIT ZERO |
66 | |
67 | ... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the |
68 | character definitions. |
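
The same mapping can be reproduced independently as a cross-check. The
following is a minimal Python sketch, not \cw{cvt-utf8}'s own code: the
function name \cw{describe} and the column widths are assumptions made
for illustration, and the names come from Python's bundled
\cw{unicodedata} module rather than from \cw{cvt-utf8}'s DBM file.

```python
import unicodedata


def describe(code_point):
    """Return one analysis line: code point, UTF-8 bytes in hex, and name.

    Hypothetical helper for illustration; the layout only approximates
    the output shown in the example above.
    """
    ch = chr(code_point)
    # Encode the single character and show each UTF-8 byte as two hex digits.
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    # unicodedata.name() has no entry for some characters (e.g. controls),
    # so pass a fallback string instead of letting it raise ValueError.
    name = unicodedata.name(ch, "<unnamed>")
    return f"U-{code_point:08X}  {utf8_hex:<18}{name}"


for cp in (0x20AC, 0x31, 0x30):
    print(describe(cp))
```

Run on the three code points from the example, this should report the
same UTF-8 bytes and character names, although the exact spacing may
differ.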
69 | |
70 | Alternatively, you can supply a list of UTF-8 bytes... |
71 | |
72 | \c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9 |
73 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
74 | \c U-00000420 D0 A0 CYRILLIC CAPITAL LETTER ER |
75 | \c U-00000443 D1 83 CYRILLIC SMALL LETTER U |
76 | \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES |
77 | \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES |
78 | \c U-0000043A D0 BA CYRILLIC SMALL LETTER KA |
79 | \c U-00000438 D0 B8 CYRILLIC SMALL LETTER I |
80 | \c U-00000439 D0 B9 CYRILLIC SMALL LETTER SHORT I |
81 | |
82 | ... and you get back the same output format, including the Unicode |
83 | code points. |
84 | |
85 | If you supply malformed data, \cw{cvt-utf8} will break it down for |
86 | you and identify the malformed pieces and any correctly formed |
87 | characters: |
88 | |
89 | \c $ cvt-utf8 A9 FE 45 C2 80 90 0A |
90 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
91 | \c A9 (unexpected continuation byte) |
92 | \c FE (invalid UTF-8 byte) |
93 | \c U-00000045 45 LATIN CAPITAL LETTER E |
94 | \c U-00000080 C2 80 <control> |
95 | \c 90 (unexpected continuation byte) |
96 | \c U-0000000A 0A <control> |
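
The classification above can be sketched as a byte-by-byte walk over
the input. The Python below is an illustration under stated
assumptions, not \cw{cvt-utf8}'s actual decoder: the function name
\cw{analyse} is hypothetical, it only recognises sequences of up to
four bytes (so it treats every byte from F8 upwards as invalid), it
does not reject overlong forms or surrogates, and the \q{incomplete
sequence} note is invented for this sketch.

```python
def analyse(data):
    """Yield (code_point_or_None, raw_bytes, note) for each unit in data.

    Illustrative sketch only: handles 1- to 4-byte sequences and does
    not check for overlong encodings or surrogate code points.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) character.
            yield b, data[i:i + 1], ""
            i += 1
        elif b < 0xC0:
            # 10xxxxxx with no lead byte in front of it.
            yield None, data[i:i + 1], "unexpected continuation byte"
            i += 1
        elif b >= 0xF8:
            # Assumption: anything from F8 upwards is treated as invalid.
            yield None, data[i:i + 1], "invalid UTF-8 byte"
            i += 1
        else:
            # Lead byte: 110xxxxx, 1110xxxx or 11110xxx.
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            tail = data[i + 1:i + length]
            if len(tail) == length - 1 and all(0x80 <= t < 0xC0 for t in tail):
                # Keep the payload bits of the lead byte, then append
                # six bits from each continuation byte.
                cp = b & (0x7F >> length)
                for t in tail:
                    cp = (cp << 6) | (t & 0x3F)
                yield cp, data[i:i + length], ""
                i += length
            else:
                # Hypothetical wording for a truncated sequence.
                yield None, data[i:i + 1], "incomplete sequence"
                i += 1


for cp, raw, note in analyse(bytes.fromhex("A9 FE 45 C2 80 90 0A")):
    label = f"U-{cp:08X}" if cp is not None else " " * 10
    print(label, raw.hex(" ").upper(), note)
```

Reporting one byte at a time after an error, rather than abandoning
the input, is what allows the well-formed characters between the
malformed pieces to be identified.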
97 | |
98 | If you need the UTF-8 encoding of a particular character, you can |
99 | use the \cw{-o} option to cause the UTF-8 to be written to standard |
100 | output: |
101 | |
102 | \c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt |
103 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
104 | |
105 | If you have UTF-8 data in a file or output from another program, you |
106 | can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This |
107 | works particularly well if you also have my \cw{xcopy} program, |
108 | which can be told to extract UTF-8 data from the X selection and |
109 | write it to its standard output. With these two programs working |
110 | together, if you ever have trouble identifying some text in a |
111 | UTF-8-supporting web browser such as Mozilla, you can simply select |
112 | the text in question, switch to a terminal window, and type: |
113 | |
114 | \c $ xcopy -u -r | cvt-utf8 -i |
115 | \e bbbbbbbbbbbbbbbbbbbbbbbbb |
116 | |
117 | If the text is in Chinese, you can get at least a general idea of |
118 | its meaning by using the \cw{-h} option to print the meaning of each |
119 | ideograph from the Unihan database. For example, passing in the |
120 | Chinese text meaning \q{Traditional Chinese} gives: |
121 | |
122 | \c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587 |
123 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
124 | \c U-00007E41 E7 B9 81 <han> complicated, complex, difficult |
125 | \c U-00009AD4 E9 AB 94 <han> body; group, class, body, unit |
126 | \c U-00004E2D E4 B8 AD <han> central; center, middle; in the |
127 | \c midst of; hit (target); attain |
128 | \c U-00006587 E6 96 87 <han> literature, culture, writing |
129 | |
130 | \U ADMINISTRATION |
131 | |
132 | In order to print the \cw{unicode.org} official name of each |
133 | character, \cw{cvt-utf8} requires a file mapping code points to names. |
134 | This file is in DBM database format, for rapid lookup. |
135 | |
136 | This database file is accessed using the Python \cw{anydbm} module, |
137 | so its precise file name will vary depending on what flavours of DBM |
138 | you have installed. The name Python knows it by is \cq{unicode}; it |
139 | may actually be called \cq{unicode.db} or something similar. |
140 | |
141 | \cw{cvt-utf8} generates this DBM file itself starting from the |
142 | Unicode Character Database, in the form of the file |
143 | \cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two |
144 | administrative options for this purpose: |
145 | |
146 | \c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode |
147 | |
148 | Given a copy of \cw{UnicodeData.txt} on disk, this mode will create |
149 | the DBM file and store it in a place of your choice. |
150 | |
151 | \c cvt-utf8 --fetch-build /path/to/unicode |
152 | |
153 | If you have a direct Internet connection, this will automatically |
154 | download the text file from \cw{unicode.org} and process it straight |
155 | into the DBM file. |
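
To show roughly what such a build step involves, here is a minimal
Python sketch. It is not \cw{cvt-utf8}'s implementation: it uses the
Python 3 \cw{dbm} module (the successor to \cw{anydbm}), the function
names \cw{build_name_db} and \cw{lookup_name} are hypothetical, and
the choice of the hex code point as the key is an assumption, so a
file built this way is not guaranteed to be readable by
\cw{cvt-utf8}. It relies only on \cw{UnicodeData.txt} being a
semicolon-separated file whose first two fields are the code point in
hex and the character name.

```python
import dbm


def build_name_db(unicode_data_path, db_path):
    """Create a DBM file mapping hex code points to character names.

    Assumed key format: the code point exactly as written in the first
    field of UnicodeData.txt (upper-case hex, at least four digits).
    """
    with open(unicode_data_path, encoding="utf-8") as src, \
            dbm.open(db_path, "n") as db:
        for line in src:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            code, name = fields[0], fields[1]
            # DBM keys and values are stored as bytes.
            db[code.encode("ascii")] = name.encode("ascii")


def lookup_name(db_path, code_point):
    """Return the stored name for code_point, or None if it is absent."""
    key = f"{code_point:04X}".encode("ascii")
    with dbm.open(db_path, "r") as db:
        if key in db:
            return db[key].decode("ascii")
        return None
```

As with \cw{anydbm}, the file or files that appear on disk depend on
which DBM back end Python selects, so \cw{db_path} is the name Python
knows the database by rather than necessarily the exact file name.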
156 | |
157 | There is a second DBM file, known to Python as \cw{unihan}, which is |
158 | required to support the \cw{-h} option. This one is built from the |
159 | Unihan Database, distributed by \cw{unicode.org} as a zip file |
160 | containing a text file \cw{Unihan.txt}. |
161 | |
162 | If you already have \cw{Unihan.txt} on your system, you can build |
163 | \cw{cvt-utf8}'s \cw{unihan} DBM file like this: |
164 | |
165 | \c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan |
166 | |
167 | Or, again, \cw{cvt-utf8} can automatically download it from |
168 | \cw{unicode.org}, unpack the zip file on the fly, and write the DBM |
169 | straight out: |
170 | |
171 | \c cvt-utf8 --fetch-build-unihan /path/to/unihan |
172 | |
173 | \cw{cvt-utf8} expects to find these database files in one of the |
174 | following locations: |
175 | |
176 | \c /usr/share/unicode |
177 | \c /usr/lib/unicode |
178 | \c /usr/local/share/unicode |
179 | \c /usr/local/lib/unicode |
180 | \c $HOME/share/unicode |
181 | \e iiiii |
182 | \c $HOME/lib/unicode |
183 | \e iiiii |
184 | |
185 | If either of these files is not found, \cw{cvt-utf8} will still |
186 | perform the rest of its functions. |
187 | |
188 | \U LICENCE |
189 | |
190 | \cw{cvt-utf8} is free software, distributed under the MIT licence. |
191 | Type \cw{cvt-utf8 --licence} to see the full licence text. |