\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\define{dash} \u2013{-}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.
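
The first two conversions in the list above can be sketched in a few
lines of Python. This is a minimal illustration using Python 3's
built-in codecs, not \cw{cvt-utf8}'s own code; the function names
\cw{codepoints_to_utf8} and \cw{utf8_to_codepoints} are hypothetical.

```python
def codepoints_to_utf8(codepoints):
    """Encode a sequence of integer code points as UTF-8 bytes."""
    return "".join(chr(cp) for cp in codepoints).encode("utf-8")


def utf8_to_codepoints(data):
    """Decode well-formed UTF-8 bytes back into integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


encoded = codepoints_to_utf8([0x20AC, 0x31, 0x30])
# Show the bytes as space-separated hex, as in the examples later on.
print(" ".join(f"{b:02X}" for b in encoded))   # E2 82 AC 31 30
print([f"U+{cp:04X}" for cp in utf8_to_codepoints(encoded)])
```

Both helpers assume well-formed input; the decoder raises
\cw{UnicodeDecodeError} on malformed bytes rather than reporting them
piece by piece.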

By default, \cw{cvt-utf8} expects to receive character data on the
command line (as a mixture of UTF-8 bytes, Unicode code points and
SGML numeric character entities), and it will print out a verbose
analysis of the input data. If you need it to read UTF-8 from
standard input or to write pure UTF-8 to standard output, you can do
so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In its default mode, \cw{cvt-utf8} simply analyses the Unicode or
UTF-8 data given on its command line. For example, you can give a
list of Unicode code points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00000031  31                DIGIT ONE
\c U-00000030  30                DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.
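
A rough sketch of how one such output line could be assembled,
assuming Python 3 and using the standard \cw{unicodedata} module as a
stand-in for \cw{cvt-utf8}'s own name database. The helper name
\cw{analyse} is hypothetical, and the column widths are simply read
off the example above.

```python
import unicodedata


def analyse(cp):
    """Format one code point as: U-XXXXXXXX, UTF-8 bytes in hex, name."""
    ch = chr(cp)
    hex_bytes = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    # unicodedata has no name for some code points (e.g. controls),
    # so fall back to a placeholder instead of raising ValueError.
    name = unicodedata.name(ch, "<unknown>")
    return f"U-{cp:08X}  {hex_bytes:<18}{name}"


for cp in (0x20AC, 0x31, 0x30):
    print(analyse(cp))
```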

If it's more convenient, you can specify those characters as SGML
numeric entity references (for example if you're cutting and pasting
out of a web page):

\c $ cvt-utf8 '&#8364;' '&#x2013;'
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00002013  E2 80 93          EN DASH
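
The mapping from a numeric entity reference to a code point is
simple enough to sketch. The following is an illustration in
Python 3, not \cw{cvt-utf8}'s actual parser; the name
\cw{entity_to_codepoint} is hypothetical, and it assumes only the
decimal \cw{&#NNNN;} and hexadecimal \cw{&#xHHHH;} forms shown above.

```python
import re

# Decimal form "&#8364;" or hexadecimal form "&#x2013;".
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def entity_to_codepoint(text):
    """Return the code point named by a numeric character reference."""
    match = _ENTITY.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numeric character reference: {text!r}")
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


print(hex(entity_to_codepoint("&#8364;")))    # 0x20ac
print(hex(entity_to_codepoint("&#x2013;")))   # 0x2013
```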

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                (unexpected continuation byte)
\c             FE                (invalid UTF-8 byte)
\c U-00000045  45                LATIN CAPITAL LETTER E
\c U-00000080  C2 80             <control>
\c             90                (unexpected continuation byte)
\c U-0000000A  0A                <control>
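
One way to produce this kind of breakdown is to walk the bytes and
classify each lead byte before trying to decode the sequence it
starts. The sketch below is an assumption about the general
technique, not \cw{cvt-utf8}'s real decoder: it follows the modern
four-byte UTF-8 limits, the function name \cw{scan_utf8} is
hypothetical, and the \q{malformed sequence} message is invented for
cases the example above does not show.

```python
def scan_utf8(data):
    """Yield (code point or None, raw bytes, note) for each piece."""
    i = 0
    while i < len(data):
        lead = data[i]
        if 0x80 <= lead <= 0xBF:
            # A continuation byte with no lead byte in front of it.
            yield None, data[i:i + 1], "(unexpected continuation byte)"
            i += 1
            continue
        if lead < 0x80:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            # C0, C1 and F5..FF never start a sequence under this limit.
            yield None, data[i:i + 1], "(invalid UTF-8 byte)"
            i += 1
            continue
        chunk = data[i:i + length]
        try:
            char = chunk.decode("utf-8")
        except UnicodeDecodeError:
            # Truncated, overlong or otherwise bad: skip the lead byte only.
            yield None, data[i:i + 1], "(malformed sequence)"
            i += 1
            continue
        yield ord(char), chunk, ""
        i += length


for cp, raw, note in scan_utf8(bytes.fromhex("A9 FE 45 C2 80 90 0A")):
    label = "" if cp is None else f"U-{cp:08X}"
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"{label:<12}{hex_bytes:<18}{note}")
```

Resynchronising one byte at a time after an error keeps any
well-formed characters that follow, which is the behaviour the
example above illustrates.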

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
\c                               midst of; hit (target); attain
\c U-00006587  E6 96 87          <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.
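
To give an idea of what such a build step involves, here is a
minimal sketch in Python 3, where the \cw{anydbm} module is called
\cw{dbm}. It relies on \cw{UnicodeData.txt} being a
semicolon-separated file whose first two fields are the hexadecimal
code point and the character name. The function name
\cw{build_name_db} and the choice of the hex string as the key are
assumptions; the key layout \cw{cvt-utf8} really uses may differ.

```python
import dbm
import os
import tempfile


def build_name_db(unicode_data_lines, db_path):
    """Store 'hex code point' -> 'name' for each UnicodeData.txt line."""
    with dbm.open(db_path, "n") as db:      # "n": always create afresh
        for line in unicode_data_lines:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue                    # skip blank or short lines
            db[fields[0]] = fields[1]


# Two lines in the UnicodeData.txt layout, used here as sample input.
sample = [
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;\n",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;\n",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "unicode")
    build_name_db(sample, path)
    with dbm.open(path, "r") as db:
        print(db["20AC"].decode("ascii"))   # EURO SIGN
```

As the text above notes, the file that actually appears on disk may
carry a suffix chosen by whichever DBM flavour Python selects.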

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan
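
As a rough illustration of the parsing involved, the sketch below
extracts definitions from lines in the \cw{Unihan.txt} layout, which
is tab-separated: code point, field name, value, with \cq{#} comment
lines. It assumes the translation text comes from the
\cw{kDefinition} field, which is a guess based on the \cw{-h} output
shown earlier. The function name \cw{parse_unihan_definitions} is
hypothetical, and the sample lines are made up from that output
rather than copied from the real file.

```python
def parse_unihan_definitions(lines):
    """Map code point -> definition text for each kDefinition line."""
    definitions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                        # skip blanks and comments
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[1] == "kDefinition":
            # parts[0] looks like "U+7E41"; drop the "U+" prefix.
            definitions[int(parts[0][2:], 16)] = parts[2]
    return definitions


sample = [
    "# illustrative excerpt only\n",
    "U+7E41\tkDefinition\tcomplicated, complex, difficult\n",
    "U+7E41\tkOtherField\tplaceholder value that is ignored\n",
    "U+6587\tkDefinition\tliterature, culture, writing\n",
]
for cp, text in sorted(parse_unihan_definitions(sample).items()):
    print(f"U+{cp:04X}  {text}")
```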

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$