1 | \cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham} |
2 | |
3 | \define{dash} \u2013{-} |
4 | |
5 | \title Man page for \cw{cvt-utf8} |
6 | |
7 | \U NAME |
8 | |
9 | \cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode |
10 | |
11 | \U SYNOPSIS |
12 | |
13 | \c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities] |
14 | \e bbbbbbbb iiiii iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii |
15 | |
16 | \U DESCRIPTION |
17 | |
18 | \cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and |
19 | Unicode data. Its functions include: |
20 | |
21 | \b Given a sequence of Unicode code points, convert them to the |
22 | corresponding sequence of bytes in the UTF-8 encoding. |
23 | |
24 | \b Given a sequence of UTF-8 bytes, convert them back into Unicode |
25 | code points. |
26 | |
27 | \b Given any combination of the above inputs, look up each Unicode |
28 | code point in the Unicode character database and identify it. |
29 | |
30 | \b Look up Unified Han characters in the \q{Unihan} database and |
31 | provide their translation text. |
32 | |
33 | By default, \cw{cvt-utf8} expects to receive character data on the |
34 | command line (as a mixture of UTF-8 bytes, Unicode code points and |
35 | SGML numeric character entities), and it will print out a verbose |
36 | analysis of the input data. If you need it to read UTF-8 from |
37 | standard input or to write pure UTF-8 to standard output, you can do |
38 | so using command-line options. |
39 | |
40 | \U OPTIONS |
41 | |
42 | \dt \cw{-i} |
43 | |
44 | \dd Read UTF-8 data from standard input and analyse that, instead of |
45 | expecting hex numbers on the command line. |
46 | |
47 | \dt \cw{-o} |
48 | |
49 | \dd Write well-formed UTF-8 to standard output, instead of writing a |
50 | long analysis of the input data. |
51 | |
52 | \dt \cw{-h} |
53 | |
54 | \dd Look up each code point in the Unihan database as well as the |
55 | main Unicode character database. |
56 | |
57 | \U EXAMPLES |
58 | |
59 | In its default mode, \cw{cvt-utf8} simply analyses input Unicode or
60 | UTF-8 data. For example, you can give a list of Unicode code |
61 | points... |
62 | |
63 | \c $ cvt-utf8 U+20ac U+31 U+30
64 | \e   bbbbbbbbbbbbbbbbbbbbbbbbb
65 | \c U-000020AC  E2 82 AC           EURO SIGN
66 | \c U-00000031  31                 DIGIT ONE
67 | \c U-00000030  30                 DIGIT ZERO
68 | |
69 | ... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the |
70 | character definitions. |
71 | |
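The byte column above follows directly from the UTF-8 bit layout. As a
minimal sketch in Python (not taken from \cw{cvt-utf8} itself; the names
\cw{utf8_encode} and \cw{format_row} are hypothetical, and only code
points up to U+10FFFF are handled, with no special treatment of
surrogates), the conversion could look like this:

```python
def utf8_encode(cp):
    """Encode one Unicode code point as a list of UTF-8 byte values.

    Illustrative only: covers the modern range U+0000..U+10FFFF and does
    not reject surrogate code points.
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range: %r" % (cp,))
    if cp < 0x80:
        # One byte: 0xxxxxxx
        return [cp]
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]


def format_row(cp):
    """Format a code point and its UTF-8 bytes (without the name column)."""
    return "U-%08X %s" % (cp, " ".join("%02X" % b for b in utf8_encode(cp)))


for cp in (0x20AC, 0x31, 0x30):
    print(format_row(cp))
```

For U+20AC this yields the bytes E2 82 AC, matching the first row of the
example.
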
72 | If it's more convenient, you can specify those characters as SGML |
73 | numeric entity references (for example if you're cutting and pasting |
74 | out of a web page): |
75 | |
76 | \c $ cvt-utf8 '&#8364;' '&#x2013;'
77 | \e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
78 | \c U-000020AC  E2 82 AC           EURO SIGN
79 | \c U-00002013  E2 80 93           EN DASH
80 | |
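A numeric character reference is written in either decimal or
hexadecimal, so turning one into a code point is a small parsing step.
The sketch below is an assumption about how such arguments might be
handled, not \cw{cvt-utf8}'s actual parser; \cw{entity_to_code_point}
is a hypothetical name, and it insists on the trailing semicolon.

```python
import re

# Matches "&#1234;" (decimal) or "&#x12AB;" (hexadecimal).
# Assumption for this sketch: the trailing semicolon is required.
_NUMERIC_REFERENCE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));\Z")


def entity_to_code_point(text):
    """Return the code point named by an SGML numeric character reference."""
    match = _NUMERIC_REFERENCE.match(text)
    if match is None:
        raise ValueError("not a numeric character reference: %r" % (text,))
    hex_digits, decimal_digits = match.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)
    return int(decimal_digits, 10)
```

Named references such as the one for the euro sign are deliberately
rejected here, since resolving them would need an entity table.
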
81 | Alternatively, you can supply a list of UTF-8 bytes... |
82 | |
83 | \c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
84 | \e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
85 | \c U-00000420  D0 A0              CYRILLIC CAPITAL LETTER ER
86 | \c U-00000443  D1 83              CYRILLIC SMALL LETTER U
87 | \c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
88 | \c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
89 | \c U-0000043A  D0 BA              CYRILLIC SMALL LETTER KA
90 | \c U-00000438  D0 B8              CYRILLIC SMALL LETTER I
91 | \c U-00000439  D0 B9              CYRILLIC SMALL LETTER SHORT I
92 | |
93 | ... and you get back the same output format, including the Unicode
94 | code points.
95 | |
96 | If you supply malformed data, \cw{cvt-utf8} will break it down for |
97 | you and identify the malformed pieces and any correctly formed |
98 | characters: |
99 | |
100 | \c $ cvt-utf8 A9 FE 45 C2 80 90 0A
101 | \e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
102 | \c             A9                 (unexpected continuation byte)
103 | \c             FE                 (invalid UTF-8 byte)
104 | \c U-00000045  45                 LATIN CAPITAL LETTER E
105 | \c U-00000080  C2 80              <control>
106 | \c             90                 (unexpected continuation byte)
107 | \c U-0000000A  0A                 <control>
108 | |
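The breakdown above can be approximated by walking the input one lead
byte at a time. This is a simplified sketch under stated assumptions
rather than the tool's real logic: \cw{analyse} is a hypothetical name,
lead bytes F8 and above are all rejected, and the \q{incomplete
sequence} and \q{overlong sequence} messages are invented for
illustration (only the two messages shown in the example come from
\cw{cvt-utf8}).

```python
# Smallest code point that legitimately needs a sequence of each length.
_MIN_FOR_LENGTH = {2: 0x80, 3: 0x800, 4: 0x10000}


def analyse(data):
    """Yield (code_point_or_None, byte_list, problem_or_None) per item.

    Simplified: handles sequences of up to four bytes and does not check
    for surrogates or for values above U+10FFFF.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            yield b, [b], None
            i += 1
        elif b < 0xC0:
            # 10xxxxxx with no lead byte in front of it.
            yield None, [b], "unexpected continuation byte"
            i += 1
        elif b >= 0xF8:
            # Assumption: nothing from F8 upwards starts a sequence.
            yield None, [b], "invalid UTF-8 byte"
            i += 1
        else:
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            cp = b & (0x7F >> length)  # payload bits of the lead byte
            seq = [b]
            j = i + 1
            while len(seq) < length and j < len(data) and 0x80 <= data[j] < 0xC0:
                cp = (cp << 6) | (data[j] & 0x3F)
                seq.append(data[j])
                j += 1
            if len(seq) < length:
                yield None, seq, "incomplete sequence"    # hypothetical wording
            elif cp < _MIN_FOR_LENGTH[length]:
                yield None, seq, "overlong sequence"      # hypothetical wording
            else:
                yield cp, seq, None
            i = j
```

Feeding it \cw{bytes.fromhex("A9FE45C280900A")} produces the same six
items, in the same order, as the example output.
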
109 | If you need the UTF-8 encoding of a particular character, you can |
110 | use the \cw{-o} option to cause the UTF-8 to be written to standard |
111 | output: |
112 | |
113 | \c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt |
114 | \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb |
115 | |
116 | If you have UTF-8 data in a file or output from another program, you |
117 | can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This |
118 | works particularly well if you also have my \cw{xcopy} program, |
119 | which can be told to extract UTF-8 data from the X selection and |
120 | write it to its standard output. With these two programs working |
121 | together, if you ever have trouble identifying some text in a |
122 | UTF-8-supporting web browser such as Mozilla, you can simply select |
123 | the text in question, switch to a terminal window, and type |
124 | |
125 | \c $ xcopy -u -r | cvt-utf8 -i |
126 | \e   bbbbbbbbbbbbbbbbbbbbbbbbb
127 | |
128 | If the text is in Chinese, you can get at least a general idea of |
129 | its meaning by using the \cw{-h} option to print the meaning of each |
130 | ideograph from the Unihan database. For example, if you pass in the |
131 | Chinese text meaning \q{Traditional Chinese}: |
132 | |
133 | \c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
134 | \e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
135 | \c U-00007E41  E7 B9 81           <han> complicated, complex, difficult
136 | \c U-00009AD4  E9 AB 94           <han> body; group, class, body, unit
137 | \c U-00004E2D  E4 B8 AD           <han> central; center, middle; in the
138 | \c                                      midst of; hit (target); attain
139 | \c U-00006587  E6 96 87           <han> literature, culture, writing
140 | |
141 | \U ADMINISTRATION |
142 | |
143 | In order to print the \cw{unicode.org} official name of each |
144 | character, \cw{cvt-utf8} requires a file mapping code points to |
145 | names. This file is in DBM database format, for rapid lookup. |
146 | |
147 | This database file is accessed using the Python \cw{anydbm} module, |
148 | so its precise file name will vary depending on what flavours of DBM |
149 | you have installed. The name Python knows it by is \cq{unicode}; it |
150 | may actually be called \cq{unicode.db} or something similar. |
151 | |
152 | \cw{cvt-utf8} generates this DBM file itself starting from the |
153 | Unicode Character Database, in the form of the file |
154 | \cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two |
155 | administrative options for this purpose: |
156 | |
157 | \c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode |
158 | |
159 | Given a copy of \cw{UnicodeData.txt} on disk, this mode will create |
160 | the DBM file and store it in a place of your choice. |
161 | |
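As an illustration of what such a build step involves, here is a hedged
sketch using Python 3's \cw{dbm} module (the successor to
\cw{anydbm}). It relies on the fact that each line of
\cw{UnicodeData.txt} is a semicolon-separated record whose first two
fields are the code point in hex and the character name. The function
names and the key and value layout are assumptions; the layout
\cw{cvt-utf8} really uses may differ.

```python
import dbm


def build_name_db(unicode_data_path, db_path):
    """Create a DBM file mapping hex code points to character names.

    Assumed layout (not confirmed against cvt-utf8): key is the hex code
    point exactly as written in UnicodeData.txt, value is the name field.
    """
    with open(unicode_data_path, encoding="utf-8") as src, \
            dbm.open(db_path, "n") as db:
        for line in src:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            db[fields[0].encode("ascii")] = fields[1].encode("ascii")


def lookup_name(db_path, cp):
    """Return the stored name for a code point, or None if it is absent."""
    key = ("%04X" % cp).encode("ascii")
    with dbm.open(db_path, "r") as db:
        return db[key].decode("ascii") if key in db else None
```

Because \cw{dbm.open} picks whichever backend is available, the file
that appears on disk may carry an extra extension, which is the
behaviour described above for the \cq{unicode} database name.
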
162 | \c cvt-utf8 --fetch-build /path/to/unicode |
163 | |
164 | If you have a direct Internet connection, this will automatically |
165 | download the text file from \cw{unicode.org} and process it straight |
166 | into the DBM file. |
167 | |
168 | There is a second DBM file, known to Python as \cw{unihan}, which is |
169 | required to support the \cw{-h} option. This one is built from the |
170 | Unihan Database, distributed by \cw{unicode.org} as a zip file |
171 | containing a text file \cw{Unihan.txt}. |
172 | |
173 | If you already have \cw{Unihan.txt} on your system, you can build |
174 | \cw{cvt-utf8}'s \cw{unihan} DBM file like this: |
175 | |
176 | \c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan |
177 | |
178 | Or, again, \cw{cvt-utf8} can automatically download it from |
179 | \cw{unicode.org}, unpack the zip file on the fly, and write the DBM |
180 | straight out: |
181 | |
182 | \c cvt-utf8 --fetch-build-unihan /path/to/unihan |
183 | |
184 | \cw{cvt-utf8} expects to find these database files in one of the |
185 | following locations: |
186 | |
187 | \c /usr/share/unicode |
188 | \c /usr/lib/unicode |
189 | \c /usr/local/share/unicode |
190 | \c /usr/local/lib/unicode |
191 | \c $HOME/share/unicode |
192 | \e iiiii |
193 | \c $HOME/lib/unicode |
194 | \e iiiii |
195 | |
196 | If either of these files is not found, \cw{cvt-utf8} will still |
197 | perform the rest of its functions. |
198 | |
199 | \U LICENCE |
200 | |
201 | \cw{cvt-utf8} is free software, distributed under the MIT licence. |
202 | Type \cw{cvt-utf8 --licence} to see the full licence text. |
203 | |
204 | \versionid $Id$ |