\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive hex numbers (either
UTF-8 bytes or Unicode code points) on the command line, and it will
print out a verbose analysis of the input data. If you need it to
read UTF-8 from standard input or to write pure UTF-8 to standard
output, you can do so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00000031  31                DIGIT ONE
\c U-00000030  30                DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.
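
The conversion shown above can be approximated in a few lines of
Python. This is only a sketch, not \cw{cvt-utf8}'s own code: the
function name \cw{describe} is hypothetical, and character names come
from Python's built-in \cw{unicodedata} module rather than from the
DBM file described under ADMINISTRATION.

```python
import unicodedata

def describe(codepoints):
    """Return one line per code point: U-XXXXXXXX, UTF-8 bytes in hex, name."""
    lines = []
    for cp in codepoints:
        ch = chr(cp)
        # Encode the single character and show each byte as two hex digits.
        utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        # unicodedata.name() knows no name for control characters,
        # so supply a placeholder instead of raising ValueError.
        name = unicodedata.name(ch, "<no name>")
        lines.append(f"U-{cp:08X}  {utf8_hex:<16}  {name}")
    return lines

for line in describe([0x20AC, 0x31, 0x30]):
    print(line)
```

Surrogate code points (U+D800 to U+DFFF) cannot be encoded this way;
\cw{encode} raises an error for them in this sketch.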

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                (unexpected continuation byte)
\c             FE                (invalid UTF-8 byte)
\c U-00000045  45                LATIN CAPITAL LETTER E
\c U-00000080  C2 80             <control>
\c             90                (unexpected continuation byte)
\c U-0000000A  0A                <control>
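
One way to produce a breakdown like the one above is to walk the
bytes and classify each piece. The following is a minimal sketch
under stated assumptions, not \cw{cvt-utf8}'s actual algorithm: the
function name \cw{analyse} and the labels \q{incomplete sequence} and
\q{invalid sequence} are invented for illustration, and the sketch
follows the modern RFC 3629 limits, so \cw{cvt-utf8} itself may draw
some boundaries differently.

```python
def analyse(data: bytes):
    """Split data into (code point or None, raw bytes, note) pieces."""
    pieces = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1                       # plain ASCII
        elif b < 0xC0:
            # 10xxxxxx with no lead byte in front of it
            pieces.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        elif b < 0xE0:
            n = 2
        elif b < 0xF0:
            n = 3
        elif b < 0xF8:
            n = 4
        else:
            # Assumption: F8..FF never start a sequence (RFC 3629).
            pieces.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        # Count how many continuation bytes actually follow the lead byte.
        j = i + 1
        while j < len(data) and j < i + n and 0x80 <= data[j] < 0xC0:
            j += 1
        chunk = data[i:j]
        if j - i < n:
            pieces.append((None, chunk, "incomplete sequence"))
        else:
            try:
                # Python's strict decoder rejects overlong forms,
                # surrogates and values above U+10FFFF.
                pieces.append((ord(chunk.decode("utf-8")), chunk, ""))
            except UnicodeDecodeError:
                pieces.append((None, chunk, "invalid sequence"))
        i = j
    return pieces

for cp, raw, note in analyse(bytes.fromhex("A9 FE 45 C2 80 90 0A")):
    label = f"U-{cp:08X}" if cp is not None else ""
    print(f"{label:<10}  {raw.hex(' ').upper():<16}  {note}")
```

Run on the bytes from the example, the sketch reports the same six
pieces in the same order: two malformed bytes, U+0045, U+0080, one
stray continuation byte, and U+000A.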

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
\c                               midst of; hit (target); attain
\c U-00006587  E6 96 87          <han> literature, culture, writing

\U ADMINISTRATION

In order to print the \cw{unicode.org} official name of each
character, \cw{cvt-utf8} requires a file mapping code points to
names. This file is in DBM database format, for rapid lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.
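
To illustrate what such a build step involves, here is a rough
sketch. It is not the code behind \cw{--build}: the function name
\cw{build_name_db} and the key format (the hex code point exactly as
it appears in the file, in upper case) are assumptions, and it uses
the Python 3 \cw{dbm} module, the successor to \cw{anydbm}. Each line
of \cw{UnicodeData.txt} is a list of semicolon-separated fields, of
which the first is the code point in hex and the second is the
character name.

```python
import dbm

def build_name_db(unicode_data_path, db_path):
    """Write a code point -> name DBM file from a UnicodeData.txt file."""
    with open(unicode_data_path, encoding="utf-8") as src, \
            dbm.open(db_path, "n") as db:   # "n": always create a new, empty file
        for line in src:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue                    # skip blank or malformed lines
            code, name = fields[0], fields[1]
            # Assumed key format: upper-case hex digits as given in the file.
            db[code.upper().encode("ascii")] = name.encode("utf-8")
```

Calling \cw{build_name_db("UnicodeData.txt", "unicode")} would then
produce a DBM file whose on-disk name depends on the DBM flavour in
use. Range entries in the file, such as the \q{First} and \q{Last}
markers for CJK ideographs, are stored verbatim by this sketch and
would need separate handling in a real tool.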

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.
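
The fallback behaviour described above can be sketched as follows.
This is an illustration only: the function name \cw{open_first_db} is
hypothetical, and it assumes that the listed locations are
directories containing the DBM files, which this page does not state
outright. It uses the Python 3 \cw{dbm} module in place of
\cw{anydbm}.

```python
import dbm
import os

def open_first_db(directories, name):
    """Open the first readable DBM called `name`, or return None."""
    for directory in directories:
        # Expand $HOME and similar variables before building the path.
        path = os.path.join(os.path.expandvars(directory), name)
        try:
            return dbm.open(path, "r")
        except dbm.error:
            continue        # missing or unreadable here; try the next place
    return None             # caller carries on without name lookups
```

A caller would treat a \cw{None} result as \q{no names available} and
still print code points and UTF-8 bytes.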

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$