Remove spurious (and sometimes harmful) Halibut config directives.
\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive hex numbers (either
UTF-8 bytes or Unicode code points) on the command line, and it will
print out a verbose analysis of the input data. If you need it to
read UTF-8 from standard input or to write pure UTF-8 to standard
output, you can do so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In its native mode, \cw{cvt-utf8} simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC  EURO SIGN
\c U-00000031  31        DIGIT ONE
\c U-00000030  30        DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.

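The analysis shown above can be approximated in a few lines of
Python. This is only a sketch, not \cw{cvt-utf8}'s own code: it
assumes Python 3, it takes character names from the standard
\cw{unicodedata} module rather than from \cw{cvt-utf8}'s DBM file,
and the function name \cw{describe} is invented for illustration.

```python
import unicodedata


def describe(codepoint):
    """Return (label, hex UTF-8 bytes, character name) for one code point.

    Assumption: names come from Python's bundled unicodedata module,
    not from the DBM file that cvt-utf8 itself uses.
    """
    ch = chr(codepoint)
    # Encode the single character and render each byte as two hex digits.
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    # Fall back to a placeholder for characters without a name.
    name = unicodedata.name(ch, "<unknown>")
    return f"U-{codepoint:08X}", utf8_hex, name


for cp in (0x20AC, 0x31, 0x30):
    print(*describe(cp))
```

Run on the three code points from the example, this should report
the same code point labels, UTF-8 bytes and names, though without
the column alignment.
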
Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0     CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83     CYRILLIC SMALL LETTER U
\c U-00000441  D1 81     CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81     CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA     CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8     CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9     CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9        (unexpected continuation byte)
\c             FE        (invalid UTF-8 byte)
\c U-00000045  45        LATIN CAPITAL LETTER E
\c U-00000080  C2 80     <control>
\c             90        (unexpected continuation byte)
\c U-0000000A  0A        <control>

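The kind of breakdown shown above can be sketched in Python as
follows. This is a simplified illustration under stated assumptions,
not \cw{cvt-utf8}'s actual algorithm: it handles sequences of up to
four bytes only, it does not reject overlong encodings or
surrogates, and the function name \cw{analyse} and the
\q{incomplete sequence} message are invented here.

```python
def analyse(data):
    """Split a byte string into (codepoint or None, raw bytes, note) items.

    Simplified sketch: no overlong or surrogate checks, and lead bytes
    F8 and above are all reported as invalid.
    """
    results = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # single-byte (ASCII) character
            length, cp = 1, b
        elif b < 0xC0:                    # 10xxxxxx with no lead byte before it
            results.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        elif b < 0xE0:                    # 110xxxxx starts a 2-byte sequence
            length, cp = 2, b & 0x1F
        elif b < 0xF0:                    # 1110xxxx starts a 3-byte sequence
            length, cp = 3, b & 0x0F
        elif b < 0xF8:                    # 11110xxx starts a 4-byte sequence
            length, cp = 4, b & 0x07
        else:                             # F8..FF: treated as invalid here
            results.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        tail = data[i + 1:i + length]
        # Every trailing byte must exist and look like 10xxxxxx.
        if len(tail) != length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            results.append((None, data[i:i + 1], "incomplete sequence"))
            i += 1
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)   # append six payload bits per byte
        results.append((cp, data[i:i + length], ""))
        i += length
    return results


for cp, raw, note in analyse(bytes.fromhex("A9FE45C280900A")):
    label = f"U-{cp:08X}" if cp is not None else " " * 10
    print(label, raw.hex(" ").upper(), note)
```
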
If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81  <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94  <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD  <han> central; center, middle; in the
\c                             midst of; hit (target); attain
\c U-00006587  E6 96 87  <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

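To give a rough idea of what building such a file involves, here is
a minimal Python 3 sketch. It is an illustration under assumptions,
not \cw{cvt-utf8}'s own code: it uses the \cw{dbm} module (the
Python 3 name for \cw{anydbm}), and the key format (the hex code
point exactly as written in \cw{UnicodeData.txt}) and the function
names \cw{build_names_db} and \cw{lookup} are hypothetical. The one
thing taken from the file format itself is that each line of
\cw{UnicodeData.txt} is a semicolon-separated record whose first two
fields are the code point and the character name.

```python
import dbm
import os
import tempfile


def build_names_db(unicode_data_lines, db_path):
    """Create a DBM file mapping hex code points to character names.

    Assumption: keys are the hex strings from field 0 of UnicodeData.txt;
    the real cvt-utf8 database layout may differ.
    """
    with dbm.open(db_path, "n") as db:        # "n": always create a new file
        for line in unicode_data_lines:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue                      # skip blank or malformed lines
            db[fields[0].encode("ascii")] = fields[1].encode("ascii")


def lookup(db_path, codepoint):
    """Return the stored name for a code point, or None if it is absent."""
    key = f"{codepoint:04X}".encode("ascii")
    with dbm.open(db_path, "r") as db:
        return db[key].decode("ascii") if key in db else None


# Two sample records in UnicodeData.txt format.
sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;\n",
]
path = os.path.join(tempfile.mkdtemp(), "unicode")
build_names_db(sample, path)
print(lookup(path, 0x20AC))
```

As the text above notes, the file that actually appears on disk may
carry an extension chosen by whichever DBM flavour Python picks.
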
There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

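For orientation, the Unihan text data consists of tab-separated
lines giving a code point, a property name and a value, with
comment lines starting with \cq{#}. The Python 3 sketch below pulls
out the \cw{kDefinition} property, which holds the English gloss. It
is an assumption that this is the property \cw{cvt-utf8} uses for
\cw{-h}, and the function name \cw{parse_definitions} is invented
for illustration.

```python
def parse_definitions(lines):
    """Map code points to their kDefinition text from Unihan-style lines.

    Assumption: only kDefinition is of interest; other properties are ignored.
    """
    definitions = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                          # skip comments and blank lines
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == "kDefinition":
            # The first field looks like "U+7E41"; drop the "U+" prefix.
            definitions[int(parts[0][2:], 16)] = parts[2]
    return definitions


sample = [
    "# comment line\n",
    "U+7E41\tkDefinition\tcomplicated, complex, difficult\n",
    "U+7E41\tkMandarin\tf\u00e1n\n",
]
print(parse_definitions(sample))
```
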
\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$