\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\define{dash} \u2013{-}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.
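
The first three functions can be sketched in a few lines of Python.
This is only an illustration, not \cw{cvt-utf8}'s own code: the
function names are hypothetical, and the standard \cw{unicodedata}
module stands in for the DBM lookup that \cw{cvt-utf8} actually uses.

```python
import unicodedata


def describe(code_points):
    """Return (code point, UTF-8 bytes in hex, character name) per input."""
    rows = []
    for cp in code_points:
        encoded = chr(cp).encode("utf-8")  # code point -> UTF-8 bytes
        hex_bytes = " ".join(f"{b:02X}" for b in encoded)
        # Assumption: unicodedata replaces cvt-utf8's DBM name lookup here.
        name = unicodedata.name(chr(cp), "<unknown>")
        rows.append((f"U+{cp:04X}", hex_bytes, name))
    return rows


def decode(hex_bytes):
    """Turn well-formed UTF-8 bytes, given as hex strings, into code points."""
    text = bytes(int(h, 16) for h in hex_bytes).decode("utf-8")
    return [ord(ch) for ch in text]


for row in describe([0x20AC, 0x31, 0x30]):
    print(*row)
```

The \cw{decode} helper only accepts well-formed input; it raises
\cw{UnicodeDecodeError} on anything else.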

By default, \cw{cvt-utf8} expects to receive character data on the
command line (as a mixture of UTF-8 bytes, Unicode code points and
SGML numeric character entities), and it will print out a verbose
analysis of the input data. If you need it to read UTF-8 from
standard input or to write pure UTF-8 to standard output, you can do
so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC E2 82 AC EURO SIGN
\c U-00000031 31 DIGIT ONE
\c U-00000030 30 DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.

If it's more convenient, you can specify those characters as SGML
numeric entity references (for example if you're cutting and pasting
out of a web page):

\c $ cvt-utf8 '&#8364;' '&#x2013;'
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC E2 82 AC EURO SIGN
\c U-00002013 E2 80 93 EN DASH
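
As a rough illustration of what recognising such an argument
involves, here is a minimal Python sketch that turns a decimal
(\cw{&#8364;}) or hexadecimal (\cw{&#x2013;}) numeric entity into a
code point. The name \cw{parse_entity} is hypothetical, and requiring
the trailing semicolon is an assumption; this page does not say
exactly which spellings \cw{cvt-utf8} accepts.

```python
import re

# Hex form "&#x2013;" or decimal form "&#8364;"; the trailing
# semicolon is required here, which is an assumption.
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def parse_entity(text):
    """Return the code point named by an SGML numeric character entity."""
    match = _ENTITY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numeric character entity: {text!r}")
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


for arg in ["&#8364;", "&#x2013;"]:
    print(f"{arg} -> U+{parse_entity(arg):04X}")
```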

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420 D0 A0 CYRILLIC CAPITAL LETTER ER
\c U-00000443 D1 83 CYRILLIC SMALL LETTER U
\c U-00000441 D1 81 CYRILLIC SMALL LETTER ES
\c U-00000441 D1 81 CYRILLIC SMALL LETTER ES
\c U-0000043A D0 BA CYRILLIC SMALL LETTER KA
\c U-00000438 D0 B8 CYRILLIC SMALL LETTER I
\c U-00000439 D0 B9 CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c A9 (unexpected continuation byte)
\c FE (invalid UTF-8 byte)
\c U-00000045 45 LATIN CAPITAL LETTER E
\c U-00000080 C2 80 <control>
\c 90 (unexpected continuation byte)
\c U-0000000A 0A <control>
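
The kind of byte-by-byte classification shown above can be sketched
in Python as follows. This is an assumption-laden illustration rather
than \cw{cvt-utf8}'s actual logic: the function \cw{analyse} is
hypothetical, it follows the modern four-byte definition of UTF-8 as
enforced by Python's own decoder, and the \q{start of malformed
sequence} label is invented for this sketch.

```python
def analyse(data):
    """Break a byte string into (hex bytes, description) pieces."""
    pieces = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x80 <= b <= 0xBF:
            # A continuation byte with no lead byte in front of it.
            pieces.append((f"{b:02X}", "unexpected continuation byte"))
            i += 1
            continue
        if b < 0x80:
            length = 1
        elif 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            # C0, C1 and F5..FF never occur in modern UTF-8.
            pieces.append((f"{b:02X}", "invalid UTF-8 byte"))
            i += 1
            continue
        chunk = data[i:i + length]
        try:
            char = chunk.decode("utf-8")  # strict: rejects bad sequences
        except UnicodeDecodeError:
            # Hypothetical label; skip the lead byte and resynchronise.
            pieces.append((f"{b:02X}", "start of malformed sequence"))
            i += 1
            continue
        hex_bytes = " ".join(f"{x:02X}" for x in chunk)
        pieces.append((hex_bytes, f"U+{ord(char):04X}"))
        i += length
    return pieces


for hex_bytes, what in analyse(bytes.fromhex("A9FE45C280900A")):
    print(hex_bytes, what)
```

After any malformed byte the sketch advances by exactly one byte, so
a well-formed character that follows immediately is still found.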

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41 E7 B9 81 <han> complicated, complex, difficult
\c U-00009AD4 E9 AB 94 <han> body; group, class, body, unit
\c U-00004E2D E4 B8 AD <han> central; center, middle; in the
\c midst of; hit (target); attain
\c U-00006587 E6 96 87 <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.
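
To give a feel for what such a build step involves, here is a minimal
Python sketch. It uses the \cw{dbm} module, the Python 3 successor to
\cw{anydbm}, and relies on the fact that each line of
\cw{UnicodeData.txt} is a semicolon-separated record whose first two
fields are the hex code point and the character name. The function
name and the key and value layout are assumptions made for
illustration; the layout \cw{cvt-utf8} really stores is not described
in this page.

```python
import dbm
import os
import tempfile


def build_name_db(lines, db_path):
    """Store code point -> name pairs from UnicodeData.txt-style lines."""
    # "n" always creates a new, empty database at db_path.
    with dbm.open(db_path, "n") as db:
        for line in lines:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            code_point, name = fields[0], fields[1]
            # Hypothetical layout: hex code point as key, name as value.
            db[code_point.encode("ascii")] = name.encode("ascii")


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;\n",
]
db_path = os.path.join(tempfile.mkdtemp(), "unicode")
build_name_db(sample, db_path)
with dbm.open(db_path, "r") as db:
    print(db[b"20AC"].decode("ascii"))
```

The files that actually appear on disk depend on which DBM flavour
Python selects, which is the file-naming variation described above.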

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$