\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

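The first two conversions can be sketched in a few lines of Python.
This is an illustration only, not \cw{cvt-utf8}'s own code: the helper
names \cw{encode_utf8} and \cw{hex_bytes} are made up for this example,
and the encoder assumes the modern range U+0000 to U+10FFFF and does
not reject surrogates.

```python
def encode_utf8(cp):
    """Return the UTF-8 bytes for one code point (hypothetical helper).

    Assumes the modern range U+0000..U+10FFFF; surrogates are not rejected.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx followed by 3 x 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def hex_bytes(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join("%02X" % b for b in data)


# Code points to bytes.
for cp in (0x20AC, 0x31, 0x30):
    print("U-%08X  %s" % (cp, hex_bytes(encode_utf8(cp))))

# Bytes back to code points, using Python's built-in decoder.
text = bytes.fromhex("D0 A0 D1 83").decode("utf-8")
print(["U-%08X" % ord(ch) for ch in text])
```

The built-in \cw{bytes.decode} raises an exception on malformed input,
so on its own it cannot produce the piece-by-piece breakdown of bad
data that \cw{cvt-utf8} prints.
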
By default, \cw{cvt-utf8} expects to receive hex numbers (either
UTF-8 bytes or Unicode code points) on the command line, and it
prints a verbose analysis of the input data. To read UTF-8 from
standard input or to write pure UTF-8 to standard output instead,
use the \cw{-i} or \cw{-o} option respectively.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC           EURO SIGN
\c U-00000031  31                 DIGIT ONE
\c U-00000030  30                 DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0              CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83              CYRILLIC SMALL LETTER U
\c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA              CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8              CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9              CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                 (unexpected continuation byte)
\c             FE                 (invalid UTF-8 byte)
\c U-00000045  45                 LATIN CAPITAL LETTER E
\c U-00000080  C2 80              <control>
\c             90                 (unexpected continuation byte)
\c U-0000000A  0A                 <control>

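A breakdown of this kind can be produced by a small hand-written
decoder that classifies each byte by its top bits. The Python sketch
below is an assumption about how such a decoder might look, not a copy
of \cw{cvt-utf8}'s logic: the function name \cw{analyse} and the
\q{incomplete sequence} note are invented here, sequences are limited
to four bytes, and overlong encodings are not detected.

```python
def analyse(data):
    """Yield (code_point_or_None, raw_bytes, note) for each piece of input.

    Hypothetical sketch: four-byte limit, no overlong or range checks.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # 0xxxxxxx: ASCII, complete on its own
            need, cp = 0, b
        elif b < 0xC0:                  # 10xxxxxx with no lead byte before it
            yield None, data[i:i + 1], "unexpected continuation byte"
            i += 1
            continue
        elif b < 0xE0:                  # 110xxxxx: one continuation byte follows
            need, cp = 1, b & 0x1F
        elif b < 0xF0:                  # 1110xxxx: two follow
            need, cp = 2, b & 0x0F
        elif b < 0xF8:                  # 11110xxx: three follow
            need, cp = 3, b & 0x07
        else:                           # F8..FF are treated as invalid here
            yield None, data[i:i + 1], "invalid UTF-8 byte"
            i += 1
            continue
        j = i + 1
        # Consume up to `need` continuation bytes, accumulating 6 bits each.
        while j < len(data) and j - i <= need and 0x80 <= data[j] < 0xC0:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i - 1 < need:
            yield None, data[i:j], "incomplete sequence"
        else:
            yield cp, data[i:j], ""
        i = j


for cp, raw, note in analyse(bytes.fromhex("A9 FE 45 C2 80 90 0A")):
    label = "U-%08X" % cp if cp is not None else " " * 10
    print(label, " ".join("%02X" % b for b in raw), note)
```

Because a bad byte is reported and skipped one at a time, decoding
resynchronises at the next lead or ASCII byte, so well-formed
characters after a malformed piece are still recognised.
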
If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type:

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81           <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94           <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD           <han> central; center, middle; in the
\c                                      midst of; hit (target); attain
\c U-00006587  E6 96 87           <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

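To illustrate what such a build step involves, here is a minimal
Python sketch. It assumes only the published layout of
\cw{UnicodeData.txt} (semicolon-separated fields, with the hex code
point first and the character name second). The function names, the
key format and the use of Python 3's \cw{dbm} module (the successor to
\cw{anydbm}) are choices made for this example and may not match what
\cw{cvt-utf8} actually stores.

```python
import dbm


def parse_unicode_data(lines):
    """Map hex code point strings to names from UnicodeData.txt lines."""
    names = {}
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2:
            continue                    # skip blank or malformed lines
        names[fields[0]] = fields[1]    # field 0: code point, field 1: name
    return names


def build_db(lines, db_path):
    """Write the parsed mapping into a new DBM file (hypothetical key format)."""
    db = dbm.open(db_path, "n")         # "n": always create a fresh database
    try:
        for key, name in parse_unicode_data(lines).items():
            db[key.encode("ascii")] = name.encode("ascii")
    finally:
        db.close()


sample = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;\n",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;\n",
]
print(parse_unicode_data(sample))
```

As with \cw{anydbm}, the file that \cw{dbm.open} creates on disk may
carry an extra suffix depending on which DBM backend is installed.
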
There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

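For orientation, Unihan data files consist of tab-separated lines of
the form code point, field name, value, with comment lines starting
with \cq{#}. The Python sketch below extracts one field per character.
It assumes that the translation text comes from the \cw{kDefinition}
field, which this page does not state, and the function name is
invented for the example.

```python
def parse_unihan_definitions(lines):
    """Map integer code points to their kDefinition text.

    Assumption: kDefinition is the field of interest; other fields are skipped.
    """
    defs = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        fields = line.split("\t", 2)    # "U+XXXX", field name, value
        if len(fields) == 3 and fields[1] == "kDefinition":
            defs[int(fields[0][2:], 16)] = fields[2]
    return defs


sample = [
    "# Unihan sample\n",
    "U+6587\tkDefinition\tliterature, culture, writing\n",
    "U+6587\tkMandarin\tWEN2\n",
    "U+7E41\tkDefinition\tcomplicated, complex, difficult\n",
]
for cp, text in sorted(parse_unihan_definitions(sample).items()):
    print("U-%08X  <han> %s" % (cp, text))
```
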
\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions; it simply omits the character
names or Unihan text that the missing file would have supplied.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$