\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}
\cfg{html-chapter-numeric}{yes}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

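The first two conversions listed above can be sketched in a few
lines of Python. This is only an illustration built on Python's own
UTF-8 codec, not \cw{cvt-utf8}'s code: the function names
\cw{codepoints_to_utf8} and \cw{utf8_to_codepoints} are hypothetical,
and unlike \cw{cvt-utf8} this decoder raises an error on malformed
input instead of analysing it.

```python
def codepoints_to_utf8(codepoints):
    """Encode an iterable of integer code points as UTF-8 bytes.

    Hypothetical helper; raises UnicodeEncodeError for surrogates.
    """
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


def utf8_to_codepoints(data):
    """Decode well-formed UTF-8 bytes into a list of integer code points.

    Hypothetical helper; raises UnicodeDecodeError on malformed input.
    """
    return [ord(ch) for ch in data.decode("utf-8")]


encoded = codepoints_to_utf8([0x20AC, 0x31, 0x30])
print(" ".join(f"{b:02X}" for b in encoded))
print(" ".join(f"U+{cp:04X}" for cp in utf8_to_codepoints(encoded)))
```
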
By default, \cw{cvt-utf8} expects to receive hex numbers (either
UTF-8 bytes or Unicode code points) on the command line, and it will
print out a verbose analysis of the input data. If you need it to
read UTF-8 from standard input or to write pure UTF-8 to standard
output, you can do so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00000031  31                DIGIT ONE
\c U-00000030  30                DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                (unexpected continuation byte)
\c             FE                (invalid UTF-8 byte)
\c U-00000045  45                LATIN CAPITAL LETTER E
\c U-00000080  C2 80             <control>
\c             90                (unexpected continuation byte)
\c U-0000000A  0A                <control>

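A breakdown of this kind can be sketched as a small byte-by-byte
scanner. The code below is a rough illustration, not
\cw{cvt-utf8}'s actual algorithm: it assumes the modern RFC 3629
rules (lead bytes \cw{C2} to \cw{F4} only), the function name
\cw{analyse_utf8} is hypothetical, and the \q{incomplete} and
\q{overlong} labels are inventions of this sketch, not messages the
real tool is known to print.

```python
def analyse_utf8(data):
    """Split bytes into (code_point_or_None, raw_bytes, note) records."""
    records = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Single-byte (ASCII) character.
            records.append((b, data[i:i + 1], ""))
            i += 1
            continue
        if b < 0xC0:
            # 10xxxxxx with no lead byte in front of it.
            records.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        # Lead byte: work out how many continuation bytes should follow.
        if 0xC2 <= b <= 0xDF:
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            # C0, C1 and F5..FF never appear in RFC 3629 UTF-8.
            records.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        # Consume up to `need` continuation bytes, accumulating 6 bits each.
        j = i + 1
        while j < len(data) and j - i <= need and 0x80 <= data[j] < 0xC0:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        raw = data[i:j]
        if j - i - 1 < need:
            records.append((None, raw, "incomplete sequence"))
        elif cp < lowest or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            records.append((None, raw, "overlong or out-of-range sequence"))
        else:
            records.append((cp, raw, ""))
        i = j
    return records


for cp, raw, note in analyse_utf8(bytes.fromhex("A9 FE 45 C2 80 90 0A")):
    label = f"U-{cp:08X}" if cp is not None else " " * 10
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"{label}  {hex_bytes:<18}{'(' + note + ')' if note else ''}")
```

The sketch prints only the code point and byte columns; character
names would come from the database described under ADMINISTRATION.
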
If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
\c                                     midst of; hit (target); attain
\c U-00006587  E6 96 87          <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

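As a rough picture of what such a build step involves, the sketch
below reads the semicolon-separated \cw{UnicodeData.txt} format
(code point in the first field, name in the second) and stores it
with Python 3's \cw{dbm} module, the successor to \cw{anydbm}. The
function names and the key format (upper-case hex, at least four
digits) are assumptions made for illustration; the layout of
\cw{cvt-utf8}'s real DBM file may well differ.

```python
import dbm
import os
import tempfile


def parse_unicode_data(lines):
    """Yield (code point, name) pairs from UnicodeData.txt-style lines."""
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        yield int(fields[0], 16), fields[1]


def build_name_db(lines, db_path):
    """Write a code point -> name DBM file (hypothetical key format)."""
    with dbm.open(db_path, "n") as db:
        for cp, name in parse_unicode_data(lines):
            db[f"{cp:04X}".encode("ascii")] = name.encode("ascii")


# Three lines in the real UnicodeData.txt layout, used as sample input.
sample = [
    "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;\n",
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;\n",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;\n",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unicode")
    build_name_db(sample, path)
    with dbm.open(path, "r") as db:
        print(db[b"20AC"].decode("ascii"))
```

The sketch takes the name field as it stands, so range markers in
the real file (entries whose names end in \cw{First>} or
\cw{Last>}) would need extra handling.
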
There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

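For orientation, Unihan data files consist of tab-separated lines
of the form code point, field name, value, with comment lines
starting with \cq{#}. The sketch below pulls out the
\cw{kDefinition} field, on the assumption that this is where the
translation text printed by \cw{-h} comes from; the man page does
not say which field \cw{cvt-utf8} uses, and the function name
\cw{parse_unihan_definitions} is hypothetical.

```python
def parse_unihan_definitions(lines):
    """Return {code point: definition} from Unihan-style lines.

    Only the kDefinition field is kept (an assumption of this sketch).
    """
    definitions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[1] == "kDefinition":
            # parts[0] looks like "U+7E41"; drop the "U+" prefix.
            definitions[int(parts[0][2:], 16)] = parts[2]
    return definitions


sample = [
    "# comment line\n",
    "U+6587\tkDefinition\tliterature, culture, writing\n",
    "U+6587\tkMandarin\tw\u00e9n\n",
    "U+7E41\tkDefinition\tcomplicated, complex, difficult\n",
]
for cp, text in sorted(parse_unihan_definitions(sample).items()):
    print(f"U-{cp:08X}  <han> {text}")
```
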
\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

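A search over these locations might look like the following sketch.
It assumes that each listed location is a directory holding DBM
files named \cw{unicode} and \cw{unihan}, which the list above does
not actually spell out, and the names \cw{candidate_paths} and
\cw{open_first} are hypothetical. It uses Python 3's \cw{dbm}
module where the original uses \cw{anydbm}.

```python
import dbm
import posixpath

# Locations as listed above, with $HOME left symbolic.
LOCATIONS = [
    "/usr/share/unicode",
    "/usr/lib/unicode",
    "/usr/local/share/unicode",
    "/usr/local/lib/unicode",
    "$HOME/share/unicode",
    "$HOME/lib/unicode",
]


def candidate_paths(db_name, home):
    """Return the paths to try for db_name ("unicode" or "unihan").

    Assumes each location is a directory containing the DBM file.
    """
    return [posixpath.join(loc.replace("$HOME", home), db_name)
            for loc in LOCATIONS]


def open_first(paths):
    """Open the first database that exists, or return None if none does."""
    for path in paths:
        try:
            return dbm.open(path, "r")
        except dbm.error:
            continue  # missing or unreadable: try the next location
    return None


for path in candidate_paths("unicode", "/home/user"):
    print(path)
```

Returning \cw{None} when nothing is found lets the caller carry on
without name lookups instead of stopping.
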
If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.