\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive character data on the
command line (as a mixture of UTF-8 bytes, Unicode code points and
SGML numeric character entities), and it will print out a verbose
analysis of the input data. If you need it to read UTF-8 from
standard input or to write pure UTF-8 to standard output, you can do
so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC  EURO SIGN
\c U-00000031  31        DIGIT ONE
\c U-00000030  30        DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.

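The analysis above can be approximated in a few lines of Python. This
is only a sketch: it assumes Python 3 and uses the standard
\cw{unicodedata} module in place of \cw{cvt-utf8}'s own database, and
the function name \cw{describe}, the column widths and the
\cw{<unknown>} fallback are illustrative choices rather than anything
taken from \cw{cvt-utf8} itself.

```python
import unicodedata

def describe(cp):
    # Encode one code point as UTF-8 and format it like the listing above.
    # The "<unknown>" fallback name is an assumption of this sketch.
    utf8 = chr(cp).encode("utf-8")
    hexbytes = " ".join("%02X" % b for b in utf8)
    name = unicodedata.name(chr(cp), "<unknown>")
    return "U-%08X  %-8s  %s" % (cp, hexbytes, name)

for cp in (0x20AC, 0x31, 0x30):
    print(describe(cp))
```
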
If it's more convenient, you can specify those characters as SGML
numeric entity references (for example if you're cutting and pasting
out of a web page):

\c $ cvt-utf8 '&#8364;' '&#x2013;'
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC  EURO SIGN
\c U-00002013  E2 80 93  EN DASH

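As a rough illustration of how the three kinds of command-line word
might be told apart, here is a Python sketch. The function name
\cw{parse_arg} and the exact patterns are assumptions made for this
example; the syntaxes \cw{cvt-utf8} really accepts may be broader or
stricter.

```python
import re

def parse_arg(arg):
    # Classify one command-line word as a code point or a UTF-8 byte.
    # The accepted spellings are assumptions of this sketch.
    m = re.fullmatch("[Uu][+]([0-9A-Fa-f]+)", arg)
    if m:
        return ("codepoint", int(m.group(1), 16))
    m = re.fullmatch("&#[xX]([0-9A-Fa-f]+);", arg)
    if m:
        return ("codepoint", int(m.group(1), 16))
    m = re.fullmatch("&#([0-9]+);", arg)
    if m:
        return ("codepoint", int(m.group(1), 10))
    m = re.fullmatch("[0-9A-Fa-f][0-9A-Fa-f]?", arg)
    if m:
        return ("byte", int(arg, 16))
    raise ValueError("unrecognised argument: " + arg)

print(parse_arg("U+20ac"), parse_arg("&#8364;"), parse_arg("D0"))
```
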
Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0     CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83     CYRILLIC SMALL LETTER U
\c U-00000441  D1 81     CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81     CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA     CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8     CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9     CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9        (unexpected continuation byte)
\c             FE        (invalid UTF-8 byte)
\c U-00000045  45        LATIN CAPITAL LETTER E
\c U-00000080  C2 80     <control>
\c             90        (unexpected continuation byte)
\c U-0000000A  0A        <control>

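The breakdown of malformed input can be sketched as a byte-by-byte
scan. The Python below is an illustration under stated assumptions,
not \cw{cvt-utf8}'s actual algorithm: it follows the modern four-byte
limit of UTF-8, performs no overlong-form or range checks, and the
function name \cw{analyse} and the \q{incomplete sequence} message are
hypothetical.

```python
def analyse(data):
    # Return a list of (code point or None, raw bytes, note) tuples.
    # Assumes the modern four-byte limit; no overlong or range checks.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append((b, data[i:i + 1], ""))
            i += 1
            continue
        if b < 0xC0:
            out.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        if b < 0xE0:
            need, cp = 1, b & 0x1F
        elif b < 0xF0:
            need, cp = 2, b & 0x0F
        elif b < 0xF8:
            need, cp = 3, b & 0x07
        else:
            out.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        # Consume up to 'need' continuation bytes after the lead byte.
        j = i + 1
        while j < len(data) and j <= i + need and 0x80 <= data[j] < 0xC0:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i - 1 < need:
            # Hypothetical message text, not cvt-utf8's own wording.
            out.append((None, data[i:j], "incomplete sequence"))
        else:
            out.append((cp, data[i:j], ""))
        i = j
    return out

for cp, raw, note in analyse(bytes.fromhex("A9FE45C280900A")):
    print(cp, raw.hex(), note)
```
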
If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81  <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94  <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD  <han> central; center, middle; in the
\c                       midst of; hit (target); attain
\c U-00006587  E6 96 87  <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

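To give a feel for what such a build step involves, here is a Python 3
sketch using the standard \cw{dbm} module (the successor to
\cw{anydbm}). \cw{UnicodeData.txt} is a semicolon-separated file whose
first two fields are the code point in hex and the character name. The
function name \cw{build_name_db} and the choice of hex strings as keys
are assumptions of this sketch; the layout \cw{cvt-utf8} really writes
may differ.

```python
import dbm

def build_name_db(text_path, db_path):
    # Keys are code points in hex, values are character names.
    # This key/value layout is an assumption, not cvt-utf8's format.
    db = dbm.open(db_path, "n")
    try:
        with open(text_path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip().split(";")
                if len(fields) < 2:
                    continue
                db[fields[0].encode("ascii")] = fields[1].encode("ascii")
    finally:
        db.close()

def lookup_name(db_path, cp):
    # Return the stored name for a code point, or None if it is absent.
    db = dbm.open(db_path, "r")
    try:
        key = ("%04X" % cp).encode("ascii")
        return db[key].decode("ascii") if key in db else None
    finally:
        db.close()
```
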
\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

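The search-and-fall-back behaviour described above might look roughly
like the following Python 3 sketch. It reads the listed locations as a
directory plus the database name, which is an assumption, as are the
names \cw{SEARCH_DIRS} and \cw{open_database}. Because the on-disk
file name depends on the DBM flavour, the sketch simply tries to open
each candidate and moves on if that fails.

```python
import dbm
import os

# Directories derived from the locations listed above; splitting them
# into "directory plus database name" is an assumption of this sketch.
HOME = os.path.expanduser("~")
SEARCH_DIRS = [
    "/usr/share", "/usr/lib", "/usr/local/share", "/usr/local/lib",
    os.path.join(HOME, "share"), os.path.join(HOME, "lib"),
]

def open_database(name, dirs=SEARCH_DIRS):
    # Try each location in turn. Return None if nothing opens, so the
    # caller can carry on without that database.
    for d in dirs:
        try:
            return dbm.open(os.path.join(d, name), "r")
        except dbm.error:
            continue
    return None

names = open_database("unicode")
print("name lookup available" if names is not None else "no name database")
```
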
\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$