From: simon
Date: Mon, 25 Jul 2011 18:16:30 +0000 (+0000)
Subject: TODO update: include some notes on proper character set conversion
X-Git-Url: https://git.distorted.org.uk/~mdw/sgt/agedu/commitdiff_plain/268e65c204e5dbcb0f185bdf0f5bb01d8f3f7aee

TODO update: include some notes on proper character set conversion
support. I don't really know how this should work, and for precisely
that reason I thought I'd write down what it is that's hard about
it...

git-svn-id: svn://svn.tartarus.org/sgt/agedu@9247 cda61777-01e9-0310-a592-d414129be87e
---

diff --git a/TODO b/TODO
index 4220a4d..343e05e 100644
--- a/TODO
+++ b/TODO
@@ -8,6 +8,40 @@ TODO list for agedu
    enable other modes of use like the built-in --cgi mode, without
    me having to anticipate them in detail.)
 
+ - non-ASCII character set support
+    + could usefully apply to --title and also to file names
+    + how do we determine the input charset? Via locale, presumably.
+    + how do we do translation? Importing my charset library is one
+      heavyweight option; alternatively, does the native C locale
+      mechanism provide enough functionality to do the job by itself?
+    + in HTML, we would need to decide on an _output_ character set,
+      specify it in a <meta> tag, and translate to it from
+      the input locale
+       - one option is to make the output charset the same as the
+         input one, in which case all we need is to identify its name
+         for the <meta> tag
+       - the other option is to make the output charset UTF-8 always
+         and translate to that from everything else
+       - in the web server and CGI modes, it would probably be nicer
+         to move that <meta> tag into a proper HTTP header
+    + even in text mode we would want to parse the filenames in some
+      fashion, due to the unhelpful corner case of Shift-JIS Windows
+      (in which backslashes in the input string must be classified as
+      path separators or the second byte of a two-byte character)
+       - that's really painful, since it will impact string processing
+         of filenames throughout the code
+       - so perhaps a better approach would be to do locale processing
+         of filenames at _scan_ time, and normalise to UTF-8 in both
+         the index and dump files?
+          + involves incrementing the version of the dump-file format
+          + then paths given on the command line are translated
+            quickly to UTF-8 before comparing them against index paths
+          + and now the HTML output side becomes easy, though the text
+            output involves translating back again
+          + but what if the filenames aren't intended to be
+            interpreted in any particular character set (old-style
+            Unix semantics) or in a consistent one?
+
  - we could still be using more of the information coming from
    autoconf. Our config.h is defining a whole bunch of HAVE_FOOs for
    particular functions (e.g. HAVE_INET_NTOA, HAVE_MEMCHR,
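
Illustrative note on the Shift-JIS corner case raised in the TODO entry
above: a minimal sketch, not code from agedu, of how a backslash might be
classified as a path separator or as the second byte of a two-byte
character. The function names sjis_is_lead_byte and sjis_last_separator
are hypothetical. It assumes the input is well-formed Shift-JIS and relies
only on the standard lead-byte ranges (0x81-0x9F and 0xE0-0xFC). The
string has to be scanned from the start, because a 0x5C byte cannot be
classified without knowing whether the byte before it was a lead byte.

```c
#include <assert.h>
#include <stddef.h>

/* Shift-JIS lead bytes occupy 0x81-0x9F and 0xE0-0xFC. */
static int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/*
 * Hypothetical helper: return the byte offset of the last backslash in
 * `path` that is a genuine path separator, or -1 if there is none.
 * A 0x5C byte that follows a lead byte is the trail byte of a two-byte
 * character and is skipped.
 */
static long sjis_last_separator(const char *path)
{
    long last = -1;
    size_t i = 0;

    assert(path != NULL);

    while (path[i] != '\0') {
        unsigned char c = (unsigned char)path[i];

        if (sjis_is_lead_byte(c) && path[i + 1] != '\0') {
            i += 2;             /* skip lead and trail byte together */
            continue;
        }
        if (c == '\\')
            last = (long)i;     /* a real separator */
        i++;
    }
    return last;
}
```

For example, in the byte string 'a', 0x5C, 0x95, 0x5C the first backslash
is a separator and the second is the trail byte of a two-byte character,
so the helper reports offset 1. Having to repeat this kind of scan in
every piece of filename handling is the cost the TODO entry describes, and
it is what normalising to UTF-8 at scan time would avoid.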