\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}

\title Man page for \cw{cvt-utf8}

\U NAME

\cw{cvt-utf8} - convert between UTF-8 and Unicode, and analyse Unicode

\U SYNOPSIS

\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii

\U DESCRIPTION

\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
Unicode data. Its functions include:

\b Given a sequence of Unicode code points, convert them to the
corresponding sequence of bytes in the UTF-8 encoding.

\b Given a sequence of UTF-8 bytes, convert them back into Unicode
code points.

\b Given any combination of the above inputs, look up each Unicode
code point in the Unicode character database and identify it.

\b Look up Unified Han characters in the \q{Unihan} database and
provide their translation text.

By default, \cw{cvt-utf8} expects to receive character data on the
command line (as a mixture of UTF-8 bytes, Unicode code points and
SGML numeric character entities), and it will print out a verbose
analysis of the input data. If you need it to read UTF-8 from
standard input or to write pure UTF-8 to standard output, you can do
so using command-line options.

\U OPTIONS

\dt \cw{-i}

\dd Read UTF-8 data from standard input and analyse that, instead of
expecting character data on the command line.

\dt \cw{-o}

\dd Write well-formed UTF-8 to standard output, instead of writing a
long analysis of the input data.

\dt \cw{-h}

\dd Look up each code point in the Unihan database as well as the
main Unicode character database.

\U EXAMPLES

In its native mode, \cw{cvt-utf8} simply analyses input Unicode or
UTF-8 data. For example, you can give a list of Unicode code
points...

\c $ cvt-utf8 U+20ac U+31 U+30
\e   bbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00000031  31                DIGIT ONE
\c U-00000030  30                DIGIT ZERO

... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
character definitions.
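
The byte columns in this example can be checked by hand with a few
lines of Python. This is only a minimal sketch, not \cw{cvt-utf8}'s
own code: the function name \cw{analyse_codepoint} is made up for
illustration, and no character names are printed because the sketch
does not consult the character database.

```python
def analyse_codepoint(cp):
    """Return (label, hex_bytes) for one Unicode code point."""
    # Python's built-in encoder produces the UTF-8 byte sequence.
    encoded = chr(cp).encode("utf-8")
    label = "U-%08X" % cp
    hex_bytes = " ".join("%02X" % b for b in encoded)
    return label, hex_bytes


# The same three code points as in the example above.
for cp in (0x20AC, 0x31, 0x30):
    label, hex_bytes = analyse_codepoint(cp)
    print(label, hex_bytes)
```

Python's encoder refuses surrogate code points (U+D800 to U+DFFF),
so this sketch raises an error for those.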

If it's more convenient, you can specify those characters as SGML
numeric entity references (for example if you're cutting and pasting
out of a web page):

\c $ cvt-utf8 '&#8364;' '&#x2013;'
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-000020AC  E2 82 AC          EURO SIGN
\c U-00002013  E2 80 93          EN DASH
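
Both the decimal and the hexadecimal form of a numeric character
reference reduce to a plain code point. The following sketch shows
one way to do that reduction in Python; the name
\cw{entity_to_codepoint} and the regular expression are illustrative
assumptions, not a description of how \cw{cvt-utf8} parses its
arguments.

```python
import re

# Decimal (&#8364;) or hexadecimal (&#x2013;) numeric character reference.
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def entity_to_codepoint(text):
    """Convert one numeric character reference to an integer code point."""
    m = _ENTITY.fullmatch(text)
    if m is None:
        raise ValueError("not a numeric character reference: %r" % text)
    hex_digits, dec_digits = m.groups()
    if hex_digits is not None:
        return int(hex_digits, 16)
    return int(dec_digits, 10)


for ref in ("&#8364;", "&#x2013;"):
    print(ref, "U-%08X" % entity_to_codepoint(ref))
```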

Alternatively, you can supply a list of UTF-8 bytes...

\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I

... and you get back the same output format, including the Unicode
code points.

If you supply malformed data, \cw{cvt-utf8} will break it down for
you and identify the malformed pieces and any correctly formed
characters:

\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c             A9                (unexpected continuation byte)
\c             FE                (invalid UTF-8 byte)
\c U-00000045  45                LATIN CAPITAL LETTER E
\c U-00000080  C2 80             <control>
\c             90                (unexpected continuation byte)
\c U-0000000A  0A                <control>
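
The kind of breakdown shown above can be approximated in Python. The
sketch below is an assumption-laden illustration rather than
\cw{cvt-utf8}'s actual algorithm: the function name \cw{classify} is
invented, it accepts only the lead bytes allowed by modern UTF-8 (C2
to F4), and the \q{malformed sequence} label is its own wording for
truncated or otherwise undecodable sequences.

```python
def classify(data):
    """Split bytes into (codepoint_or_None, raw_bytes, note) records."""
    records = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x80 <= b <= 0xBF:
            # A continuation byte with no lead byte in front of it.
            records.append((None, data[i:i + 1], "unexpected continuation byte"))
            i += 1
            continue
        if b < 0x80:
            length = 1
        elif 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            # Bytes that can never start a sequence in this sketch.
            records.append((None, data[i:i + 1], "invalid UTF-8 byte"))
            i += 1
            continue
        chunk = data[i:i + length]
        try:
            # Strict decoding rejects truncated and overlong sequences.
            ch = chunk.decode("utf-8")
        except UnicodeDecodeError:
            records.append((None, data[i:i + 1], "malformed sequence"))
            i += 1
            continue
        records.append((ord(ch), chunk, ""))
        i += length
    return records


for cp, raw, note in classify(bytes.fromhex("A9FE45C280900A")):
    label = "U-%08X" % cp if cp is not None else " " * 10
    print(label, " ".join("%02X" % b for b in raw), note)
```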

If you need the UTF-8 encoding of a particular character, you can
use the \cw{-o} option to cause the UTF-8 to be written to standard
output:

\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

If you have UTF-8 data in a file or output from another program, you
can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
works particularly well if you also have my \cw{xcopy} program,
which can be told to extract UTF-8 data from the X selection and
write it to its standard output. With these two programs working
together, if you ever have trouble identifying some text in a
UTF-8-supporting web browser such as Mozilla, you can simply select
the text in question, switch to a terminal window, and type:

\c $ xcopy -u -r | cvt-utf8 -i
\e   bbbbbbbbbbbbbbbbbbbbbbbbb

If the text is in Chinese, you can get at least a general idea of
its meaning by using the \cw{-h} option to print the meaning of each
ideograph from the Unihan database. For example, passing in the
Chinese text meaning \q{Traditional Chinese} gives:

\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
\c                               midst of; hit (target); attain
\c U-00006587  E6 96 87          <han> literature, culture, writing

\U ADMINISTRATION

In order to print the official name of each character, as published
by \cw{unicode.org}, \cw{cvt-utf8} requires a file mapping code
points to names. This file is in DBM database format, for rapid
lookup.

This database file is accessed using the Python \cw{anydbm} module,
so its precise file name will vary depending on what flavours of DBM
you have installed. The name Python knows it by is \cq{unicode}; it
may actually be called \cq{unicode.db} or something similar.

\cw{cvt-utf8} generates this DBM file itself starting from the
Unicode Character Database, in the form of the file
\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
administrative options for this purpose:

\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode

Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
the DBM file and store it in a place of your choice.

\c cvt-utf8 --fetch-build /path/to/unicode

If you have a direct Internet connection, this will automatically
download the text file from \cw{unicode.org} and process it straight
into the DBM file.
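
For orientation, \cw{UnicodeData.txt} is a plain text file with one
semicolon-separated record per line, in which the first field is the
code point in hex and the second is the character name. The sketch
below extracts that mapping into an in-memory dictionary. It is an
illustration only: the function name \cw{parse_unicode_data} is
invented, the way \cw{cvt-utf8} actually stores keys and values in
its DBM file is not described here, and range markers such as the
CJK ideograph blocks are not given any special treatment.

```python
def parse_unicode_data(lines):
    """Map integer code points to names from UnicodeData.txt records."""
    names = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0 is the code point in hex, field 1 is the name.
        names[int(fields[0], 16)] = fields[1]
    return names


# Two records in the UnicodeData.txt layout.
sample = [
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
    "20AC;EURO SIGN;Sc;0;ET;;;;;N;;;;;",
]
names = parse_unicode_data(sample)
print(names[0x20AC])
```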

There is a second DBM file, known to Python as \cw{unihan}, which is
required to support the \cw{-h} option. This one is built from the
Unihan Database, distributed by \cw{unicode.org} as a zip file
containing a text file \cw{Unihan.txt}.

If you already have \cw{Unihan.txt} on your system, you can build
\cw{cvt-utf8}'s \cw{unihan} DBM file like this:

\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan

Or, again, \cw{cvt-utf8} can automatically download it from
\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
straight out:

\c cvt-utf8 --fetch-build-unihan /path/to/unihan
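
The Unihan text data is tab-separated, with a \cw{U+} code point, a
field name and a value on each line, and comment lines starting with
\cw{#}. The sketch below collects the \cw{kDefinition} field, which
holds English glosses. That \cw{cvt-utf8} uses this particular field
is an assumption made for illustration, and the function name
\cw{parse_unihan_definitions} is invented.

```python
def parse_unihan_definitions(lines):
    """Map integer code points to their kDefinition text."""
    definitions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a code point / field / value record
        codepoint, field, value = parts
        if field == "kDefinition":
            # Drop the leading "U+" and read the rest as hex.
            definitions[int(codepoint[2:], 16)] = value
    return definitions


sample = [
    "# comment line",
    "U+6587\tkDefinition\tliterature, culture, writing",
    "U+6587\tkMandarin\twén",
]
print(parse_unihan_definitions(sample))
```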

\cw{cvt-utf8} expects to find these database files in one of the
following locations:

\c /usr/share/unicode
\c /usr/lib/unicode
\c /usr/local/share/unicode
\c /usr/local/lib/unicode
\c $HOME/share/unicode
\e iiiii
\c $HOME/lib/unicode
\e iiiii

If either of these files is not found, \cw{cvt-utf8} will still
perform the rest of its functions.

\U LICENCE

\cw{cvt-utf8} is free software, distributed under the MIT licence.
Type \cw{cvt-utf8 --licence} to see the full licence text.

\versionid $Id$