	1	\cfg{man-identity}{cvt-utf8}{1}{2004-03-24}{Simon Tatham}{Simon Tatham}
	2
	3	\define{dash} \u2013{-}
	4
	5	\title Man page for \cw{cvt-utf8}
	6
	7	\U NAME
	8
	9	\cw{cvt-utf8} \dash convert between UTF-8 and Unicode, and analyse Unicode
	10
	11	\U SYNOPSIS
	12
	13	\c cvt-utf8 [flags] [hex UTF-8 bytes, U+codepoints, SGML entities]
	14	\e bbbbbbbb  iiiii   iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii
	15
	16	\U DESCRIPTION
	17
	18	\cw{cvt-utf8} is a tool for manipulating and analysing UTF-8 and
	19	Unicode data. Its functions include:
	20
	21	\b Given a sequence of Unicode code points, convert them to the
	22	corresponding sequence of bytes in the UTF-8 encoding.
	23
	24	\b Given a sequence of UTF-8 bytes, convert them back into Unicode
	25	code points.
	26
	27	\b Given any combination of the above inputs, look up each Unicode
	28	code point in the Unicode character database and identify it.
	29
	30	\b Look up Unified Han characters in the \q{Unihan} database and
	31	provide their translation text.
	32
	33	By default, \cw{cvt-utf8} expects to receive character data on the
	34	command line (as a mixture of UTF-8 bytes, Unicode code points and
	35	SGML numeric character entities), and it will print out a verbose
	36	analysis of the input data. To read UTF-8 from standard input
	37	instead, use \cw{-i}; to write pure UTF-8 to standard output, use
	38	\cw{-o}.
	39
	40	\U OPTIONS
	41
	42	\dt \cw{-i}
	43
	44	\dd Read UTF-8 data from standard input and analyse that, instead of
	45	expecting hex numbers on the command line.
	46
	47	\dt \cw{-o}
	48
	49	\dd Write well-formed UTF-8 to standard output, instead of writing a
	50	long analysis of the input data.
	51
	52	\dt \cw{-h}
	53
	54	\dd Look up each code point in the Unihan database as well as the
	55	main Unicode character database.
	56
	57	\U EXAMPLES
	58
	59	In its native mode, \cw{cvt-utf8} simply analyses input Unicode or
	60	UTF-8 data. For example, you can give a list of Unicode code
	61	points...
	62
	63	\c $ cvt-utf8 U+20ac U+31 U+30
	64	\e   bbbbbbbbbbbbbbbbbbbbbbbbb
	65	\c U-000020AC  E2 82 AC          EURO SIGN
	66	\c U-00000031  31                DIGIT ONE
	67	\c U-00000030  30                DIGIT ZERO
	68
	69	... and \cw{cvt-utf8} gives you the UTF-8 encodings plus the
	70	character definitions.
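
The conversion in this example can be sketched in a few lines of Python. This is only an illustration, not \cw{cvt-utf8}'s own code: the function name \cw{analyse_code_point} is hypothetical, and the character names come from Python's built-in \cw{unicodedata} module rather than from the DBM file described under ADMINISTRATION.

```python
import unicodedata


def analyse_code_point(token):
    """Parse a token such as 'U+20ac' and describe the character.

    Hypothetical helper for illustration only. Surrogate code points
    (U+D800..U+DFFF) cannot be encoded and will raise UnicodeEncodeError.
    """
    if not token.upper().startswith("U+"):
        raise ValueError("expected U+<hex digits>, got %r" % token)
    cp = int(token[2:], 16)
    utf8 = chr(cp).encode("utf-8")
    hex_bytes = " ".join("%02X" % b for b in utf8)
    # unicodedata has no names for control characters, hence the default.
    name = unicodedata.name(chr(cp), "<unknown>")
    return "U-%08X  %s  %s" % (cp, hex_bytes, name)


for tok in ["U+20ac", "U+31", "U+30"]:
    print(analyse_code_point(tok))
```

The column spacing is simplified here; only the three pieces of information per line (code point, UTF-8 bytes, name) are meant to match.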
	71
	72	If it's more convenient, you can specify those characters as SGML
	73	numeric entity references (for example if you're cutting and pasting
	74	out of a web page):
	75
	76	\c $ cvt-utf8 '&#8364;' '&#8211;'
	77	\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbb
	78	\c U-000020AC  E2 82 AC          EURO SIGN
	79	\c U-00002013  E2 80 93          EN DASH
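
Recognising a numeric entity reference amounts to stripping the \cw{&#} and \cw{;} and reading the number. A minimal Python sketch follows; the name \cw{entity_to_code_point} is hypothetical, and accepting the hexadecimal \cw{&#x...;} form is an assumption of this sketch rather than documented \cw{cvt-utf8} behaviour.

```python
import re

# Decimal form (&#8364;) or, as an assumption of this sketch, hex (&#x20AC;).
_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def entity_to_code_point(text):
    """Return the code point named by a numeric character reference."""
    match = _ENTITY.fullmatch(text)
    if match is None:
        raise ValueError("not a numeric character reference: %r" % text)
    hex_digits, dec_digits = match.groups()
    return int(hex_digits, 16) if hex_digits else int(dec_digits)


for arg in ["&#8364;", "&#8211;"]:
    print("%s -> U+%04X" % (arg, entity_to_code_point(arg)))
```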
	80
	81	Alternatively, you can supply a list of UTF-8 bytes...
	82
	83	\c $ cvt-utf8 D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9
	84	\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
	85	\c U-00000420  D0 A0             CYRILLIC CAPITAL LETTER ER
	86	\c U-00000443  D1 83             CYRILLIC SMALL LETTER U
	87	\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
	88	\c U-00000441  D1 81             CYRILLIC SMALL LETTER ES
	89	\c U-0000043A  D0 BA             CYRILLIC SMALL LETTER KA
	90	\c U-00000438  D0 B8             CYRILLIC SMALL LETTER I
	91	\c U-00000439  D0 B9             CYRILLIC SMALL LETTER SHORT I
	92
	93	... and you get back the same output format, including the Unicode
	94	code points.
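
For well-formed input, the reverse direction can be reproduced with Python's built-in decoder. The snippet below is a sketch only: it turns the hex arguments from the example into bytes and lists the resulting code points, without looking up any names.

```python
# Hex byte arguments exactly as given on the command line above.
hex_args = "D0 A0 D1 83 D1 81 D1 81 D0 BA D0 B8 D0 B9".split()

data = bytes(int(h, 16) for h in hex_args)
text = data.decode("utf-8")  # raises UnicodeDecodeError on malformed input
code_points = ["U+%04X" % ord(ch) for ch in text]

print(code_points)
```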
	95
	96	If you supply malformed data, \cw{cvt-utf8} will break it down for
	97	you and identify the malformed pieces and any correctly formed
	98	characters:
	99
	100	\c $ cvt-utf8 A9 FE 45 C2 80 90 0A
	101	\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbb
	102	\c             A9                (unexpected continuation byte)
	103	\c             FE                (invalid UTF-8 byte)
	104	\c U-00000045  45                LATIN CAPITAL LETTER E
	105	\c U-00000080  C2 80             <control>
	106	\c             90                (unexpected continuation byte)
	107	\c U-0000000A  0A                <control>
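
A strict library decoder simply rejects input like this, so reporting each bad byte needs a hand-written scanner. The following Python sketch shows one way to do it. It is not \cw{cvt-utf8}'s implementation: the name \cw{scan_utf8} is hypothetical, it assumes modern UTF-8 (sequences of at most four bytes, so \cw{F8} and above are treated as invalid), and the notes for incomplete, overlong and out-of-range sequences are invented wording; only the two messages shown in the example above are taken from the tool.

```python
def scan_utf8(data):
    """Yield (code_point, raw_bytes, note); code_point is None for bad input."""
    min_value = {2: 0x80, 3: 0x800, 4: 0x10000}  # smallest non-overlong value
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # plain ASCII
            yield lead, data[i:i + 1], ""
            i += 1
            continue
        if lead < 0xC0:  # 10xxxxxx with no lead byte before it
            yield None, data[i:i + 1], "unexpected continuation byte"
            i += 1
            continue
        if lead >= 0xF8:  # never a valid lead byte in four-byte UTF-8
            yield None, data[i:i + 1], "invalid UTF-8 byte"
            i += 1
            continue
        length = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        cp = lead & (0x7F >> length)  # payload bits of the lead byte
        j = i + 1
        while j < i + length and j < len(data) and 0x80 <= data[j] < 0xC0:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        raw = data[i:j]
        if j - i < length:
            yield None, raw, "incomplete sequence"
        elif cp < min_value[length]:
            yield None, raw, "overlong encoding"
        elif cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            yield None, raw, "code point out of range"
        else:
            yield cp, raw, ""
        i = j


for cp, raw, note in scan_utf8(bytes.fromhex("A9FE45C280900A")):
    label = "U-%08X" % cp if cp is not None else " " * 10
    print(label, " ".join("%02X" % b for b in raw), note)
```

After a malformed piece the scanner resumes at the next unconsumed byte, which is why the valid characters around the bad bytes are still reported.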
	108
	109	If you need the UTF-8 encoding of a particular character, you can
	110	use the \cw{-o} option to cause the UTF-8 to be written to standard
	111	output:
	112
	113	\c $ cvt-utf8 -o U+20AC >> my-utf8-file.txt
	114	\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
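
The effect of \cw{-o} is to emit raw bytes rather than a textual analysis. In Python terms that means writing to the binary layer of standard output, as in this sketch (the function name \cw{write_utf8} is hypothetical):

```python
import sys


def write_utf8(code_points, out):
    """Write the UTF-8 encoding of each code point to a binary stream."""
    for cp in code_points:
        out.write(chr(cp).encode("utf-8"))


if __name__ == "__main__":
    # sys.stdout.buffer is the bytes-level stream underneath sys.stdout.
    write_utf8([0x20AC], sys.stdout.buffer)
```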
	115
	116	If you have UTF-8 data in a file or output from another program, you
	117	can use the \cw{-i} option to have \cw{cvt-utf8} analyse it. This
	118	works particularly well if you also have my \cw{xcopy} program,
	119	which can be told to extract UTF-8 data from the X selection and
	120	write it to its standard output. With these two programs working
	121	together, if you ever have trouble identifying some text in a
	122	UTF-8-supporting web browser such as Mozilla, you can simply select
	123	the text in question, switch to a terminal window, and type:
	124
	125	\c $ xcopy -u -r | cvt-utf8 -i
	126	\e   bbbbbbbbbbbbbbbbbbbbbbbbb
	127
	128	If the text is in Chinese, you can get at least a general idea of
	129	its meaning by using the \cw{-h} option to print the meaning of each
	130	ideograph from the Unihan database. For example, if you pass in the
	131	Chinese text meaning \q{Traditional Chinese}:
	132
	133	\c $ cvt-utf8 -h U+7E41 U+9AD4 U+4E2D U+6587
	134	\e   bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
	135	\c U-00007E41  E7 B9 81          <han> complicated, complex, difficult
	136	\c U-00009AD4  E9 AB 94          <han> body; group, class, body, unit
	137	\c U-00004E2D  E4 B8 AD          <han> central; center, middle; in the
	138	\c                                     midst of; hit (target); attain
	139	\c U-00006587  E6 96 87          <han> literature, culture, writing
	140
	141	\U ADMINISTRATION
	142
	143	In order to print the \cw{unicode.org} official name of each
	144	character, \cw{cvt-utf8} requires a file mapping code points to
	145	names. This file is in DBM database format, for rapid lookup.
	146
	147	This database file is accessed using the Python \cw{anydbm} module,
	148	so its precise file name will vary depending on what flavours of DBM
	149	you have installed. The name Python knows it by is \cq{unicode}; it
	150	may actually be called \cq{unicode.db} or something similar.
	151
	152	\cw{cvt-utf8} generates this DBM file itself starting from the
	153	Unicode Character Database, in the form of the file
	154	\cw{UnicodeData.txt} supplied by \cw{unicode.org}. It supports two
	155	administrative options for this purpose:
	156
	157	\c cvt-utf8 --build /path/to/UnicodeData.txt /path/to/unicode
	158
	159	Given a copy of \cw{UnicodeData.txt} on disk, this mode will create
	160	the DBM file and store it in a place of your choice.
	161
	162	\c cvt-utf8 --fetch-build /path/to/unicode
	163
	164	If you have a direct Internet connection, this will automatically
	165	download the text file from \cw{unicode.org} and process it straight
	166	into the DBM file.
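
To show what the build step involves: \cw{UnicodeData.txt} is a semicolon-separated text file whose first field is the code point in hex and whose second field is the character name. The Python 3 sketch below loads those two fields into a DBM file. It is an illustration under stated assumptions, not \cw{cvt-utf8}'s actual code: it uses the \cw{dbm} module (Python 3's successor to \cw{anydbm}), the function name \cw{build_name_db} is hypothetical, and the key format (the hex string exactly as it appears in the file) is a guess.

```python
import dbm


def build_name_db(unicode_data_path, db_path):
    """Create a DBM file mapping hex code point -> character name.

    The key format is an assumption of this sketch. Range markers such
    as '<CJK Ideograph, First>' are stored as-is and not expanded.
    """
    with open(unicode_data_path, encoding="utf-8") as src, \
            dbm.open(db_path, "n") as db:
        for line in src:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            code, name = fields[0], fields[1]
            db[code.encode("ascii")] = name.encode("utf-8")
```

As with \cw{anydbm}, the file that actually appears on disk may carry an extra suffix depending on which DBM backend Python selects.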
	167
	168	There is a second DBM file, known to Python as \cw{unihan}, which is
	169	required to support the \cw{-h} option. This one is built from the
	170	Unihan Database, distributed by \cw{unicode.org} as a zip file
	171	containing a text file \cw{Unihan.txt}.
	172
	173	If you already have \cw{Unihan.txt} on your system, you can build
	174	\cw{cvt-utf8}'s \cw{unihan} DBM file like this:
	175
	176	\c cvt-utf8 --build-unihan /path/to/Unihan.txt /path/to/unihan
	177
	178	Or, again, \cw{cvt-utf8} can automatically download it from
	179	\cw{unicode.org}, unpack the zip file on the fly, and write the DBM
	180	straight out:
	181
	182	\c cvt-utf8 --fetch-build-unihan /path/to/unihan
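
For orientation, \cw{Unihan.txt} is a tab-separated file with one property per line: a code point in \cw{U+XXXX} form, a field name, and a value, with comment lines starting with \cw{#}. The Python sketch below extracts one field into a dictionary. That the \cw{kDefinition} field is the source of the translation text printed by \cw{-h} is an assumption, and the function name \cw{read_definitions} is hypothetical.

```python
def read_definitions(lines, field="kDefinition"):
    """Collect {code_point: value} for one Unihan field.

    'lines' is any iterable of text lines in Unihan.txt format.
    """
    definitions = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[1] == field:
            definitions[int(parts[0][2:], 16)] = parts[2]
    return definitions


sample = [
    "# comment line\n",
    "U+4E2D\tkMandarin\tzh\u014dng\n",
    "U+6587\tkDefinition\tliterature, culture, writing\n",
]
print(read_definitions(sample))
```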
	183
	184	\cw{cvt-utf8} expects to find these database files in one of the
	185	following locations:
	186
	187	\c /usr/share/unicode
	188	\c /usr/lib/unicode
	189	\c /usr/local/share/unicode
	190	\c /usr/local/lib/unicode
	191	\c $HOME/share/unicode
	192	\e iiiii
	193	\c $HOME/lib/unicode
	194	\e iiiii
	195
	196	If either of these files is not found, \cw{cvt-utf8} will still
	197	perform the rest of its functions.
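
The search described above might be sketched as follows in Python 3. Two points are assumptions of the sketch and not taken from \cw{cvt-utf8} itself: that each listed location is a directory containing the DBM files, and the helper names \cw{candidate_paths} and \cw{open_first}. A missing database yields \cw{None}, so the caller can carry on without names.

```python
import dbm
import os

SEARCH_DIRS = [
    "/usr/share/unicode",
    "/usr/lib/unicode",
    "/usr/local/share/unicode",
    "/usr/local/lib/unicode",
    "$HOME/share/unicode",
    "$HOME/lib/unicode",
]


def candidate_paths(db_name, environ=None):
    """Return the paths to try for db_name, in search order."""
    env = os.environ if environ is None else environ
    home = env.get("HOME", "")
    return [os.path.join(d.replace("$HOME", home), db_name)
            for d in SEARCH_DIRS]


def open_first(db_name):
    """Open the first database that exists, or return None if none does."""
    for path in candidate_paths(db_name):
        try:
            return dbm.open(path, "r")
        except dbm.error:  # a tuple of exception classes, includes OSError
            continue
    return None
```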
	198
	199	\U LICENCE
	200
	201	\cw{cvt-utf8} is free software, distributed under the MIT licence.
	202	Type \cw{cvt-utf8 --licence} to see the full licence text.
	203
	204	\versionid $Id$