X-Git-Url: https://git.distorted.org.uk/~mdw/sgt/agedu/blobdiff_plain/373a02e53778079f10f67bd33a948c4f6f2be6da..HEAD:/TODO

diff --git a/TODO b/TODO
index 1a8a7eb..cc9fb0b 100644
--- a/TODO
+++ b/TODO
@@ -1,71 +1,59 @@
 TODO list for agedu
 ===================
 
-Before it's non-embarrassingly releasable:
+ - flexibility in the HTML report output mode: expose the internal
+   mechanism for configuring the output filenames, and allow the
+   user to request individual files with hyperlinks as if the other
+   files existed. (In particular, functionality of this kind would
+   enable other modes of use like the built-in --cgi mode, without
+   me having to anticipate them in detail.)
 
- - more flexible running modes
-    + combined scan+dump mode which doesn't even generate an index
-      file (nearly indistinguishable from find(1))
-    + load mode which reads a dump from standard input and builds
-      the index (need to nail down a perfectly general dump format)
-    + at least some ability to chain actions within the same run:
-      "agedu -s dirname -w" would seem handy.
+ - non-ASCII character set support
+    + could usefully apply to --title and also to file names
+    + how do we determine the input charset? Via locale, presumably.
+    + how do we do translation? Importing my charset library is one
+      heavyweight option; alternatively, does the native C locale
+      mechanism provide enough functionality to do the job by itself?
+    + in HTML, we would need to decide on an _output_ character set,
+      specify it in a <meta> tag, and translate to it from
+      the input locale
+       - one option is to make the output charset the same as the
+         input one, in which case all we need is to identify its name
+         for the <meta> tag
+       - the other option is to make the output charset UTF-8 always
+         and translate to that from everything else
+       - in the web server and CGI modes, it would probably be nicer
+         to move that <meta> tag into a proper HTTP header
+    + even in text mode we would want to parse the filenames in some
+      fashion, due to the unhelpful corner case of Shift-JIS Windows
+      (in which backslashes in the input string must be classified as
+      path separators or the second byte of a two-byte character)
+       - that's really painful, since it will impact string processing
+         of filenames throughout the code
+       - so perhaps a better approach would be to do locale processing
+         of filenames at _scan_ time, and normalise to UTF-8 in both
+         the index and dump files?
+          + involves incrementing the version of the dump-file format
+          + then paths given on the command line are translated
+            quickly to UTF-8 before comparing them against index paths
+          + and now the HTML output side becomes easy, though the text
+            output involves translating back again
+          + but what if the filenames aren't intended to be
+            interpreted in any particular character set (old-style
+            Unix semantics) or in a consistent one?
 
- - work out what to do about atimes on directories in the absence of
-   the Linux syscall magic
-    * one option is to read them during the scan and reinstate them
-      after each recursion pop. Race-condition prone.
-    * marking them in a distinctive colour in the reports is another
-      option.
-    * a third option is simply to ignore space taken up by
-      directories in the first place; inaccurate but terribly simple.
-    * incidentally, sometimes open(...,O_NOATIME) will fail, and
-      then we have to fall back to ordinary open. Be prepared to do
-      this, which probably means getting rid of the icky macro
-      hackery in du.c and turning it into a more sensible run-time
-      abstraction layer.
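
A minimal sketch of the run-time fallback idea in the item above,
assuming only standard open(2) behaviour; the wrapper name
open_maybe_noatime is invented here for illustration and is not
part of du.c:

    /* Try O_NOATIME first; if the kernel refuses it (typically EPERM
     * when the file isn't ours), degrade to an ordinary open at run
     * time instead of deciding at compile time. */
    #define _GNU_SOURCE            /* for O_NOATIME on glibc */
    #include <errno.h>
    #include <fcntl.h>

    #ifndef O_NOATIME
    #define O_NOATIME 0            /* platforms without it: plain open */
    #endif

    int open_maybe_noatime(const char *path)
    {
        int fd = open(path, O_RDONLY | O_NOATIME);
        if (fd < 0 && errno == EPERM)
            fd = open(path, O_RDONLY);   /* lose only atime preservation */
        return fd;
    }
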
-
- - polish the plain-text output to make it look more like du
-    + configurable recursive output depth
-    + show the right bits last
-
- - figure out what to do about scans starting in the root directory
-    + Currently we end up with a double leading slash on the
-      pathnames, which is ugly, and we also get a zero-length href
-      in between those slashes which means the web interface doesn't
-      let you click back up to the top level at all.
-    + One big problem here is that a lot of the code assumes that
-      you can find the extent of a pathname by searching for "foo"
-      and "foo^A", trusting that anything inside the directory will
-      begin "foo/". So I'd need to consistently fix this everywhere
-      so that a trailing slash is disregarded while doing it, but
-      not actually removed.
-    + The text output gets it all wrong.
-    + The HTML output is fiddly even at the design stage: where
-      would I _ideally_ put the link to click on to get back to /?
-      It's unclear!
-
- - cross-Unix portability:
-    + use autoconf
-      * configure use of stat64
-      * configure use of /proc/net/tcp
-      * configure use of /dev/random
-      * configure use of Linux syscall magic replacing readdir
-        + later glibcs have fdopendir, hooray! So we can use that
-          too, if it's available and O_NOATIME is too.
-      * what do we do elsewhere about _GNU_SOURCE?
-
- - prepare a little in advance for a potential future Windows port:
-    + store the path separator character in the index file when
-      writing it, and load it back in when reading
-    + store literal byte sizes in all the size fields, instead of
-      Unixoid 512-byte sectors
-
- - man page, licence.
-
-Future directions:
-
- - IPv6 support in the HTTP server
+ - we could still be using more of the information coming from
+   autoconf. Our config.h is defining a whole bunch of HAVE_FOOs for
+   particular functions (e.g. HAVE_INET_NTOA, HAVE_MEMCHR,
+   HAVE_FNMATCH). We could usefully supply alternatives for some of
+   these functions (e.g. cannibalise the PuTTY wildcard matcher for
+   use in the absence of fnmatch, switch to vanilla truncate() in
+   the absence of ftruncate); where we don't have alternative code,
+   it would perhaps be polite to throw an error at configure time
+   rather than allowing the subsequent build to fail.
+    + however, I don't see anything here that looks very
+      controversial; IIRC it's all in POSIX, for one thing. So more
+      likely this should simply wait until somebody complains.
 
  - run-time configuration in the HTTP server
     * I think this probably works by having a configuration form, or
@@ -85,20 +73,6 @@ Future directions:
      straight to terminfo: generate lines of attribute-interleaved
      text and display them, so we only really need the sequences
      "go here and display stuff", "scroll up", "scroll down".
-    + I think the attribute-interleaved text might be possible to do
-      cunningly, as well: we autodetect a basically VT-style
-      terminal, and add 256-colour sequences on the end. So, for
-      instance, we might set ANSI-yellow foreground, set ANSI-red
-      background, _then_ set both foreground and background to the
-      appropriate xterm 256-colour, and then display some
-      appropriate character which would have given the right blend
-      of the ANSI-16 fore and background colours. Then the same
-      display code should gracefully degrade in the face of a
-      terminal which doesn't support xterm-256.
-    * current best plan is to simulate the xterm-256 shading from
-      0/5 to 5/5 by doing space, colon and hash in colour A on
-      colour B background, then hash, colon and space in B on A
-      background.
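
As a rough illustration of the degradation trick described just
above (none of this code exists in agedu; the colour numbers and
function names are invented), the display code might emit the
ANSI-16 approximation first and then the xterm-256 override, which
older terminals simply ignore:

    #include <stdio.h>

    /* Shading ramp 0/5..5/5: space, colon, hash in colour A on B,
     * then hash, colon, space in colour B on A. */
    static const struct { char ch; int swap; } ramp[6] = {
        {' ', 0}, {':', 0}, {'#', 0}, {'#', 1}, {':', 1}, {' ', 1}
    };

    static void put_shade(int level,                  /* 0..5 */
                          int ansi_a, int ansi_b,     /* 0..7 */
                          int xterm_a, int xterm_b)   /* 0..255 */
    {
        int fg  = ramp[level].swap ? ansi_b  : ansi_a;
        int bg  = ramp[level].swap ? ansi_a  : ansi_b;
        int xfg = ramp[level].swap ? xterm_b : xterm_a;
        int xbg = ramp[level].swap ? xterm_a : xterm_b;
        printf("\033[%d;%dm", 30 + fg, 40 + bg);          /* ANSI-16 first */
        printf("\033[38;5;%dm\033[48;5;%dm", xfg, xbg);   /* xterm-256 override */
        putchar(ramp[level].ch);
    }

    int main(void)
    {
        for (int level = 0; level < 6; level++)
            put_shade(level, 3, 1, 226, 196);   /* e.g. yellow on red */
        printf("\033[0m\n");                    /* reset attributes */
        return 0;
    }
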
     + Infrastructure work before doing any of this would be to split
       html.c into two: one part to prepare an abstract data
       structure describing an HTML-like report (in particular, all
@@ -107,18 +81,42 @@
       the dynamic length-of-coloured-bar and what-photo-goes-here
       stuff), and another to generate the literal
       HTML. Then the former can be reused to produce very similar
       reports in coloured plain text.
 
- - http://msdn.microsoft.com/en-us/library/ms724290.aspx suggests
-   modern Windowses support atime-equivalents, so a Windows port is
-   possible in principle.
-    + For a full Windows port, would need to modify the current
-      structure a lot, to abstract away (at least) memory-mapping of
-      files, details of disk scan procedure, networking for httpd.
-      Unclear what the right UI would be on Windows, too;
-      command-line exactly as now might be considered just a
-      _little_ unfriendly. Or perhaps not.
-    + Alternatively, a much easier approach would be to write a
-      Windows version of just the --scan-dump mode, which does a
-      filesystem scan via the Windows API and generates a valid
-      agedu dump file on standard output. Then one would simply feed
-      that over the network connection of one's choice to the rest
-      of agedu running on Unix as usual.
+ - abstracting away all the Unix calls so as to enable a full
+   Windows port. We can already do the difficult bit on Windows
+   (scanning the filesystem and retrieving atime-analogues).
+   Everything else is just coding - albeit quite a _lot_ of coding,
+   since the Unix assumptions are woven quite tightly into the
+   current code.
+    + If nothing else, it's unclear what the user interface properly
+      ought to be in a Windows port of agedu. A command-line job
+      exactly like the Unix version might be useful to some people,
+      but would certainly be strange and confusing to others.
+
+ - it might conceivably be useful to support a choice of indexing
+   strategies. The current "continuous index" mechanism's tradeoff of
+   taking O(N log N) space in order to be able to support any age
+   cutoff you like is not going to be ideal for everybody. A second
+   more conventional "discrete index" mechanism which allows the
+   user to specify a number of fixed cutoffs and just indexes each
+   directory on those alone would undoubtedly be a useful thing for
+   large-scale users. This will require considerable thought about
+   how to make the indexers pluggable at both index-generation time
+   and query time.
+    * however, now we have the cut-down version of the continuous
+      index, the space saving is less compelling.
+
+ - A user requested what's essentially a VFS layer: given multiple
+   index files and a map of how they fit into an overall namespace,
+   we should be able to construct the right answers for any query
+   about the resulting aggregated hierarchy by doing at most
+   O(number of indexes * normal number of queries) work.
+
+ - Support for filtering the scan by ownership and permissions. The
+   index data structure can't handle this, so we can't build a
+   single index file admitting multiple subset views; but a user
+   suggested that the scan phase could record information about
+   ownership and permissions in the dump file, and then the indexing
+   phase could filter down to a particular sub-view - which would at
+   least allow the construction of various subset indices from one
+   dump file, without having to redo the full disk scan which is the
+   most time-consuming part of all.
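
The VFS-layer item above is perhaps easiest to see as a hypothetical
sketch, assuming a single-index primitive query_index() answering
the usual "total size older than this cutoff under this path"
question; neither that function nor the types below exist in agedu
today, and the point is only that each aggregate query costs at most
one query per mounted index:

    #include <string.h>

    /* Assumed primitive: total size of data older than 'cutoff'
     * under 'path' in one index file.  Purely illustrative. */
    extern unsigned long long query_index(void *index, const char *path,
                                          unsigned long long cutoff);

    struct mounted_index {
        const char *mountpoint;   /* where this index sits in the namespace */
        void *index;              /* opaque handle on one index file */
    };

    unsigned long long query_aggregate(const struct mounted_index *m, int n,
                                       const char *path,
                                       unsigned long long cutoff)
    {
        unsigned long long total = 0;
        for (int i = 0; i < n; i++) {
            size_t mlen = strlen(m[i].mountpoint);
            if (strncmp(path, m[i].mountpoint, mlen) == 0) {
                /* query lies inside this mount: ask it about the
                 * remainder of the path (boundary checks omitted) */
                total += query_index(m[i].index, path + mlen, cutoff);
            } else if (strncmp(m[i].mountpoint, path, strlen(path)) == 0) {
                /* mount lies below the queried path: include all of it */
                total += query_index(m[i].index, "", cutoff);
            }
        }
        return total;   /* at most one index query per mounted index */
    }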