\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}
\cfg{man-mindepth}{1}

\C{cvt-utf8-manpage} Man page for \cw{cvt-utf8}

\H{cvt-utf8-manpage-name} NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\H{cvt-utf8-manpage-synopsis} SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes and/or U+codepoints]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\H{cvt-utf8-manpage-description} DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive hex numbers (either
UTF-8 bytes or Unicode code points) on the command line, and it will
print out a verbose analysis of the input data. If you need it to
read UTF-8 from standard input or to write pure UTF-8 to standard
output, you can do so using command-line options.

\H{cvt-utf8-manpage-options} OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting hex numbers on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\H{cvt-utf8-manpage-examples} EXAMPLES

In \cw{cvt-utf8}'s native mode, it simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00000031  31                DIGIT ONE
\c U-00000030  30                DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.
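The encoding step in this example can be sketched in Python. This is
a minimal illustration of the standard UTF-8 bit layout, not a copy
of \cw{cvt-utf8}'s own code: the names \cw{encode_utf8} and
\cw{describe} are invented here, character names come from Python's
\cw{unicodedata} module rather than from the database the tool
reads, and the column widths are only matched by eye to the listing
above.

```python
import unicodedata

def encode_utf8(cp):
    """Encode one code point as a list of UTF-8 byte values.

    Sketch only: surrogate code points are not rejected here.
    """
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return [cp]
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    if cp <= 0x10FFFF:                  # 4 bytes: 11110xxx then three 10xxxxxx
        return [0xF0 | (cp >> 18),
                0x80 | ((cp >> 12) & 0x3F),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    raise ValueError("code point out of range")

def describe(arg):
    """Format one 'U+xxxx' argument roughly like the listing above."""
    cp = int(arg[2:], 16)               # drop the leading "U+"
    hexbytes = " ".join("%02X" % b for b in encode_utf8(cp))
    name = unicodedata.name(chr(cp), "<unknown>")
    return "U-%08X  %-16s  %s" % (cp, hexbytes, name)

for arg in ["U+20ac", "U+31", "U+30"]:
    print(describe(arg))
```

The three ranges tested in \cw{encode_utf8} are what make U+20AC a
three-byte sequence while U+31 and U+30 stay as single bytes.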

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.
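Going in this direction amounts to running a UTF-8 decoder over the
bytes. The sketch below leans on Python's built-in decoder instead
of reimplementing one; \cw{analyse_hex_bytes} is a hypothetical
name, and unlike \cw{cvt-utf8} this version simply raises an
exception on malformed input.

```python
import unicodedata

def analyse_hex_bytes(args):
    """Decode hex byte arguments as UTF-8.

    Returns one (code point, raw bytes, name) row per character.
    Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    """
    data = bytes(int(a, 16) for a in args)
    rows = []
    for ch in data.decode("utf-8"):
        raw = ch.encode("utf-8")        # the bytes this character came from
        rows.append((ord(ch), raw, unicodedata.name(ch, "<unknown>")))
    return rows

rows = analyse_hex_bytes("D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9".split())
for cp, raw, name in rows:
    hexbytes = " ".join("%02X" % b for b in raw)
    print("U-%08X  %-16s  %s" % (cp, hexbytes, name))
```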

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                (unexpected continuation byte)
\c             FE                (invalid UTF-8 byte)
\c U-00000045  45                LATIN CAPITAL LETTER E
\c U-00000080  C2 80             <control>
\c             90                (unexpected continuation byte)
\c U-0000000A  0A                <control>
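A breakdown like the one above needs a decoder that recovers after
each bad byte instead of giving up. Here is one way that could look
in Python. It is a sketch under assumptions, not the tool's actual
logic: \cw{classify} is an invented name, the validity limits follow
current UTF-8 (lead bytes C2 to F4, nothing above U+10FFFF), and
only the first two messages are taken from the output above. The
other two messages are my own wording, and \cw{cvt-utf8} may well
draw those lines differently.

```python
# Smallest code point that legitimately needs 1, 2 or 3 continuation bytes.
MIN_CP = (0, 0x80, 0x800, 0x10000)

def classify(data):
    """Split bytes into (code point or None, raw bytes, note) items."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                    # plain ASCII
            out.append((b, data[i:i + 1], ""))
            i += 1
            continue
        if b < 0xC0:                    # 10xxxxxx with no lead byte before it
            out.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        if b < 0xC2 or b > 0xF4:        # can never start a valid sequence
            out.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        need = 1 if b < 0xE0 else 2 if b < 0xF0 else 3
        cp = b & (0x3F >> need)         # payload bits of the lead byte
        j = i + 1
        while j < len(data) and j - i <= need and 0x80 <= data[j] < 0xC0:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j - i - 1 < need:
            out.append((None, data[i:j], "incomplete sequence"))
        elif cp < MIN_CP[need] or 0xD800 <= cp < 0xE000 or cp > 0x10FFFF:
            out.append((None, data[i:j], "overlong or out-of-range sequence"))
        else:
            out.append((cp, data[i:j], ""))
        i = j
    return out

items = classify(bytes.fromhex("A9 FE 45 C2 80 90 0A"))
for cp, raw, note in items:
    left = "U-%08X" % cp if cp is not None else ""
    hexbytes = " ".join("%02X" % x for x in raw)
    print("%-10s  %-16s  %s" % (left, hexbytes, "(%s)" % note if note else ""))
```

The key design point is that a failed sequence consumes only the
bytes that were actually examined and accepted, so the byte after
it is classified afresh.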

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, if you pass in the
Chinese text meaning \q{Traditional Chinese}:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
\c                               midst of; hit (target); attain
\c U-00006587  E6 96 87          <han> literature, culture, writing
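For the curious, a lookup of this kind can be sketched as follows.
The Unihan data files published by the Unicode Consortium are plain
text, with one tab-separated \q{code point, field name, value}
triple per line and \cw{#} starting a comment, and the English gloss
lives in the \cw{kDefinition} field. The function name
\cw{load_definitions} and the \cw{SAMPLE} lines are made up for
illustration (the definitions are copied from the output above), and
nothing here says how \cw{cvt-utf8} itself stores or finds the
database.

```python
def load_definitions(lines):
    """Build a code point -> kDefinition text map from Unihan-style lines."""
    defs = dict()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                    # skip blanks and comments
        fields = line.split("\t", 2)
        if len(fields) == 3 and fields[1] == "kDefinition":
            defs[int(fields[0][2:], 16)] = fields[2]   # "U+7E41" -> 0x7E41
    return defs

# Illustrative sample in the Unihan layout, not a real excerpt of the file.
SAMPLE = [
    "# code point <tab> field <tab> value",
    "U+4E2D\tkDefinition\tcentral; center, middle; in the midst of; hit (target); attain",
    "U+4E2D\tkMandarin\tzh\u014dng",
    "U+6587\tkDefinition\tliterature, culture, writing",
    "U+7E41\tkDefinition\tcomplicated, complex, difficult",
    "U+9AD4\tkDefinition\tbody; group, class, body, unit",
]

defs = load_definitions(SAMPLE)
for cp in [0x7E41, 0x9AD4, 0x4E2D, 0x6587]:
    hexbytes = " ".join("%02X" % b for b in chr(cp).encode("utf-8"))
    print("U-%08X  %-16s  <han> %s" % (cp, hexbytes, defs.get(cp, "")))
```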

\H{cvt-utf8-manpage-bugs} BUGS

Command-line option processing is very basic. In particular, \cw{-h}
must come before \cw{-i} or it will not be recognised.