\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham} \define{dash} \u2013{-} \title Man page for \cw{cvt-utf8} \U NAME \cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode \U SYNOPSIS \c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities] \e bbbbbbbb iiiii iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii \U DESCRIPTION \cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and Unicode data. Its functions include: \b Given a sequence of Unicode code points, convert them to the corresponding sequence of bytes in the UTF-8 encoding. \b Given a sequence of UTF-8 bytes, convert them back into Unicode code points. \b Given any combination of the above inputs, look up each Unicode code point in the Unicode character database and identify it. \b Look up Unified Han characters in the \q{Unihan} database and provide their translation text. By default, \cw{cvt-utf8} expects to receive character data on the command line (as a mixture of UTF-8 bytes, Unicode code points and SGML numeric character entities), and it will print out a verbose analysis of the input data. If you need it to read UTF-8 from standard input or to write pure UTF-8 to standard output, you can do so using command-line options. \U OPTIONS \dt \cw{-i} \dd Read UTF-8 data from standard input and analyse that, instead of expecting hex numbers on the command line. \dt \cw{-o} \dd Write well-formed UTF-8 to standard output, instead of writing a long analysis of the input data. \dt \cw{-h} \dd Look up each code point in the Unihan database as well as the main Unicode character database. \U EXAMPLES In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or UTF-8 data. For example, you can give a list of Unicode code points... \c $ cvt-utf8 U+20ac U+31 U+30 \e bbbbbbbbbbbbbbbbbbbbbbbbb \c U-000020AC E2 82 AC EURO SIGN \c U-00000031 31 DIGIT ONE \c U-00000030 30 DIGIT ZERO ... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the character definitions. If it's more convenient, you can specify those characters as SGML numeric entity references (for example if you're cutting and pasting out of a web page): \c $ cvt-utf8 '€' '–' \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbb \c U-000020AC E2 82 AC EURO SIGN \c U-00002013 E2 80 93 EN DASH Alternatively, you can supply a list of UTF-8 bytes... \c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9 \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb \c U-00000420 D0 A0 CYRILLIC CAPITAL LETTER ER \c U-00000443 D1 83 CYRILLIC SMALL LETTER U \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES \c U-00000441 D1 81 CYRILLIC SMALL LETTER ES \c U-0000043A D0 BA CYRILLIC SMALL LETTER KA \c U-00000438 D0 B8 CYRILLIC SMALL LETTER I \c U-00000439 D0 B9 CYRILLIC SMALL LETTER SHORT I ... and you get back the same output format, including the UTF-8 code points. If you supply malformed data, \cw{cvt-utf8} will break it down for you and identify the malformed pieces and any correctly formed characters: \c $ cvt-utf8 A9 FE 45 C2 80 90 0A \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbb \c A9 (unexpected continuation byte) \c FE (invalid UTF-8 byte) \c U-00000045 45 LATIN CAPITAL LETTER E \c U-00000080 C2 80 \c 90 (unexpected continuation byte) \c U-0000000A 0A If you need the UTF-8 encoding of a particular character, you can use the \cw{-o} option to cause the UTF-8 to be written to standard output: \c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb If you have UTF-8 data in a file or output from another program, you can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This works particularly well if you also have my \cw{xcopy} program, which can be told to extract UTF-8 data from the X selection and write it to its standard output. With these two programs working together, if you ever have trouble identifying some text in a UTF-8-supporting web browser such as Mozilla, you can simply select the text in question, switch to a terminal window, and type \c $ xcopy -u -r | cvt-utf8 -i \e bbbbbbbbbbbbbbbbbbbbbbbbb If the text is in Chinese, you can get at least a general idea of its meaning by using the \cw{-h} option to print the meaning of each ideograph from the Unihan database. For example, if you pass in the Chinese text meaning \q{Traditional Chinese}: \c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587 \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb \c U-00007E41 E7 B9 81 complicated, complex, difficult \c U-00009AD4 E9 AB 94 body; group, class, body, unit \c U-00004E2D E4 B8 AD central; center, middle; in the \c midst of; hit (target); attain \c U-00006587 E6 96 87 literature, culture, writing \U ADMINISTRATION In order to print the \cw{unicode.org} official name of each character, \cw{cvt-utf8} requires a file mapping code points to names. This file is in DBM database format, for rapid lookup. This database file is accessed using the Python \cw{anydbm} module, so its precise file name will vary depending on what flavours of DBM you have installed. The name Python knows it by is \cq{unicode}; it may actually be called \cq{unicode.db} or something similar. \cw{cvt-utf8} generates this DBM file itself starting from the Unicode Character Database, in the form of the file \cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two administrative options for this purpose: \c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode Given a copy of \cw{UnicodeData.txt} on disk, this mode will create the DBM file and store it in a place of your choice. \c cvt-utf8 --fetch-build /path/to/unicode If you have a direct Internet connection, this will automatically download the text file from \cw{unicode.org} and process it straight into the DBM file. There is a second DBM file, known to Python as \cw{unihan}, which is required to support the \cw{-h} option. This one is built from the Unihan Database, distributed by \cw{unicode.org} as a zip file containing a text file \cw{Unihan.txt}. If you already have \cw{Unihan.txt} on your system, you can build \cw{cvt-utf8}'s \cw{unihan} DBM file like this: \c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan Or, again, \cw{cvt-utf8} can automatically download it from \cw{unicode.org}, unpack the zip file on the fly, and write the DBM straight out: \c cvt-utf8 --fetch-build-unihan /path/to/unihan \cw{cvt-utf8} expects to find these database files in one of the following locations: \c /usr/share/unicode \c /usr/lib/unicode \c /usr/local/share/unicode \c /usr/local/lib/unicode \c $HOME/share/unicode \e iiiii \c $HOME/lib/unicode \e iiiii If either of these files is not found, \cw{cvt-utf8} will still perform the rest of its functions. \U LICENCE \cw{cvt-utf8} is free software, distributed under the MIT licence. Type \cw{cvt-utf8 --licence} to see the full licence text. \versionid $Id$