\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\define{dash} \u2013{-}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive character data on the
command line (as a mixture of UTF-8 bytes, Unicode code points and
SGML numeric character entities), and it will print out a verbose
analysis of the input data. If you need it to read UTF-8 from
standard input or to write pure UTF-8 to standard output, you can do
so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC           EURO SIGN
\c U-00000031  31                 DIGIT ONE
\c U-00000030  30                 DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.
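
The analysis shown above can be approximated in a few lines of
Python. This is a minimal sketch, not \cw{cvt-utf8}'s own code: it
assumes Python 3, takes character names from the standard
\cw{unicodedata} module rather than from the DBM file described
under ADMINISTRATION, and uses a function name (\cw{analyse}), a
fallback label and column widths chosen purely for illustration.

```python
import unicodedata

def analyse(codepoint):
    """Return one line: code point, UTF-8 bytes in hex, character name."""
    ch = chr(codepoint)
    utf8_hex = " ".join("%02X" % b for b in ch.encode("utf-8"))
    # unicodedata has no name for control characters and unassigned
    # code points, so a placeholder (this sketch's own) is used instead.
    name = unicodedata.name(ch, "<unnamed>")
    return "U-%08X  %-17s  %s" % (codepoint, utf8_hex, name)

for cp in (0x20AC, 0x31, 0x30):
    print(analyse(cp))
```

The sketch does not handle surrogate code points, which Python
refuses to encode as UTF-8.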

If it's more convenient, you can specify those characters as SGML
numeric entity references (for example if you're cutting and pasting
out of a web page):

\c $ cvt-utf8 '&#8364;' '&#x2013;'
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC           EURO SIGN
\c U-00002013  E2 80 93           EN DASH

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0              CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83              CYRILLIC SMALL LETTER U
\c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81              CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA              CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8              CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9              CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                 (unexpected continuation byte)
\c             FE                 (invalid UTF-8 byte)
\c U-00000045  45                 LATIN CAPITAL LETTER E
\c U-00000080  C2 80              <control>
\c             90                 (unexpected continuation byte)
\c U-0000000A  0A                 <control>

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81           <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94           <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD           <han> central; center, middle; in the
\c                                      midst of; hit (target); attain
\c U-00006587  E6 96 87           <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.
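
To make the build step concrete, here is a minimal sketch of turning
\cw{UnicodeData.txt} into a DBM file. It is written for Python 3,
where the \cw{anydbm} module is called \cw{dbm}, and it is not
\cw{cvt-utf8}'s actual code: the key and value layout (hex code
point as key, character name as value) and the function name
\cw{build_name_db} are assumptions made for illustration.

```python
import dbm

def build_name_db(unicode_data_path, db_path):
    """Create a DBM file mapping hex code points to character names."""
    with open(unicode_data_path, encoding="utf-8") as src, \
            dbm.open(db_path, "n") as db:   # "n": always create a new file
        for line in src:
            # UnicodeData.txt is semicolon-separated; field 0 is the
            # code point in hex and field 1 is the character name.
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue
            codepoint, name = fields[0], fields[1]
            db[codepoint.encode("ascii")] = name.encode("ascii")
```

Which files actually appear on disk for a given \cw{db_path}
depends on the DBM flavour Python selects.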

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan
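
Unihan data files are tab-separated, with one code point, field name
and value per line, and comment lines starting with \cw{#}. The
Python 3 sketch below extracts the English definitions from such a
file. It assumes that the translation text comes from the
\cw{kDefinition} field, which this page does not state, and the
function name \cw{unihan_definitions} is hypothetical.

```python
def unihan_definitions(lines):
    """Yield (code point, definition) pairs from Unihan-format lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue            # skip blank lines and comments
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == "kDefinition":
            # parts[0] has the form "U+7E41"; drop the "U+" prefix.
            yield int(parts[0][2:], 16), parts[2]
```

The resulting pairs could then be written to a DBM file in whatever
key format the lookup code expects.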

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.
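
A lookup over these locations might be sketched like this in
Python 3. The sketch makes two assumptions that this page does not
confirm: that each location is a directory holding files named
\cw{unicode} and \cw{unihan}, and that the locations are tried in
the order listed. The function names are hypothetical.

```python
import dbm
import os

def candidate_dirs(home):
    """Return the locations listed above, with $HOME expanded."""
    return [
        "/usr/share/unicode",
        "/usr/lib/unicode",
        "/usr/local/share/unicode",
        "/usr/local/lib/unicode",
        os.path.join(home, "share", "unicode"),
        os.path.join(home, "lib", "unicode"),
    ]

def open_first(name, dirs):
    """Open the first DBM file called name found in dirs, else None."""
    for d in dirs:
        try:
            return dbm.open(os.path.join(d, name), "r")
        except dbm.error:
            continue    # missing or unreadable here; try the next place
    return None
```

Returning \cw{None} rather than raising an error lets the caller
carry on without the missing database.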

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$