1 \cfg{man-identity}{agedu}{1}{2008-11-02}{Simon Tatham}{Simon Tatham}
2
3 \define{dash} \u2013{-}
4
5 \title Man page for \cw{agedu}
6
7 \U NAME
8
9 \cw{agedu} - correlate disk usage with last-access times to identify
10 large and disused data
11
12 \U SYNOPSIS
13
14 \c agedu [ options ] action [action...]
15 \e bbbbb iiiiiii iiiiii iiiiii
16
17 \U DESCRIPTION
18
19 \cw{agedu} scans a directory tree and produces reports about how
20 much disk space is used in each directory and subdirectory, and also
21 how that usage of disk space corresponds to files with last-access
22 times a long time ago.
23
24 In other words, \cw{agedu} is a tool you might use to help you free
25 up disk space. It lets you see which directories are taking up the
26 most space, as \cw{du} does; but unlike \cw{du}, it also
27 distinguishes between large collections of data which are still in
28 use and ones which have not been accessed in months or years \dash
29 for instance, large archives downloaded, unpacked, used once, and
30 never cleaned up. Where \cw{du} helps you find what's using your
31 disk space, \cw{agedu} helps you find what's \e{wasting} your disk
32 space.
33
34 \cw{agedu} has several operating modes. In one mode, it scans your
35 disk and builds an index file containing a data structure which
36 allows it to efficiently retrieve any information it might need.
37 Typically, you would use it in this mode first, and then run it in
38 one of a number of \q{query} modes to display a report of the disk
39 space usage of a particular directory and its subdirectories. Those
40 reports can be produced as plain text (much like \cw{du}) or as
41 HTML. \cw{agedu} can even run as a miniature web server, presenting
42 each directory's HTML report with hyperlinks to let you navigate
43 around the file system to similar reports for other directories.
44
45 So you would typically start using \cw{agedu} by telling it to do a
46 scan of a directory tree and build an index. This is done with a
47 command such as
48
49 \c $ agedu -s /home/fred
50 \e bbbbbbbbbbbbbbbbbbb
51
52 which will build a large data file called \c{agedu.dat} in your
53 current directory. (If that current directory is \e{inside}
54 \cw{/home/fred}, don't worry \dash \cw{agedu} is smart enough to
55 discount its own index file.)
56
57 Having built the index, you would now query it for reports of disk
58 space usage. If you have a graphical web browser, the simplest and
59 nicest way to query the index is by running \cw{agedu} in web server
60 mode:
61
62 \c $ agedu -w
63 \e bbbbbbbb
64
65 which will print (among other messages) a URL on its standard output
66 along the lines of
67
68 \c URL: http://127.164.152.163:48638/
69
70 (That URL will always begin with \cq{127.}, meaning that it's in the
71 \cw{localhost} address space.)
72
73 Now paste that URL into your web browser, and you will be shown a
74 graphical representation of the disk usage in \cw{/home/fred} and
75 its immediate subdirectories, with varying colours used to show the
76 difference between disused and recently-accessed data. Click on any
77 subdirectory to descend into it and see a report for its
78 subdirectories in turn; click on parts of the pathname at the top of
79 any page to return to higher-level directories. When you've finished
80 browsing, you can just press Ctrl-D to send an end-of-file
81 indication to \cw{agedu}, and it will shut down.
82
83 After that, you probably want to delete the data file
84 \cw{agedu.dat}, since it's pretty large. In fact, the command
85 \cw{agedu -R} will do this for you; and you can chain \cw{agedu}
86 commands on the same command line, so that instead of the above you
87 could have done
88
89 \c $ agedu -s /home/fred -w -R
90 \e bbbbbbbbbbbbbbbbbbbbbbbbb
91
92 for a single self-contained run of \cw{agedu} which builds its
93 index, serves web pages from it, and cleans it up when finished.
94
95 If you don't have a graphical web browser, you can do text-based
96 queries as well. Having scanned \cw{/home/fred} as above, you might
97 run
98
99 \c $ agedu -t /home/fred
100 \e bbbbbbbbbbbbbbbbbbb
101
102 which again gives a summary of the disk usage in \cw{/home/fred} and
103 its immediate subdirectories; but this time \cw{agedu} will print it
104 on standard output, in much the same format as \cw{du}. If you then
105 want to find out how much \e{old} data is there, you can add the
106 \cw{-a} option to show only files last accessed a certain length of
107 time ago. For example, to show only files which haven't been looked
108 at in six months or more:
109
110 \c $ agedu -t /home/fred -a 6m
111 \e bbbbbbbbbbbbbbbbbbbbbbbbb
112
113 That's the essence of what \cw{agedu} does. It has other modes of
114 operation for more complex situations, and the usual array of
115 configurable options. The following sections contain a complete
116 reference for all its functionality.
117
118 \U OPERATING MODES
119
120 This section describes the operating modes supported by \cw{agedu}.
121 Each of these is in the form of a command-line option, sometimes
122 with an argument. Multiple operating-mode options may appear on the
123 command line, in which case \cw{agedu} will perform the specified
124 actions one after another. For instance, as shown in the previous
125 section, you might want to perform a disk scan and immediately
126 launch a web server giving reports from that scan.
127
128 \dt \cw{-s} \e{directory} or \cw{--scan} \e{directory}
129
130 \dd In this mode, \cw{agedu} scans the file system starting at the
131 specified directory, and indexes the results of the scan into a
132 large data file which other operating modes can query.
133
134 \lcont{
135
136 By default, the scan is restricted to a single file system (since
137 you would most likely be running \cw{agedu} because a particular
138 disk partition was running low on space). You
139 can remove that restriction using the \cw{--cross-fs} option; other
140 configuration options allow you to include or exclude files or
141 entire subdirectories from the scan. See the next section for full
142 details of the configurable options.
143
144 The index file is created with restrictive permissions, in case the
145 file system you are scanning contains confidential information in
146 its structure.
147
148 Index files are dependent on the characteristics of the CPU
149 architecture you created them on. You should not expect to be able
150 to move an index file between different types of computer and have
151 it continue to work. If you need to transfer the results of a disk
152 scan to a different kind of computer, see the \cw{-D} and \cw{-L}
153 options below.
154
155 }
156
157 \dt \cw{-w} or \cw{--web}
158
159 \dd In this mode, \cw{agedu} expects to find an index file already
160 written. It allocates a network port, and starts up a web server on
161 that port which serves reports generated from the index file. By
162 default it invents its own URL and prints it out.
163
164 \lcont{
165
166 The web server runs until \cw{agedu} receives an end-of-file event
167 on its standard input. (The expected usage is that you run it from
168 the command line, immediately browse web pages until you're
169 satisfied, and then press Ctrl-D.)
170
171 In case the index file contains any confidential information about
172 your file system, the web server protects the pages it serves from
173 access by other people. On Linux, this is done transparently by
174 using \cw{/proc/net/tcp} to check the owner of each
175 incoming connection; failing that, the web server will require a
176 password to view the reports, and \cw{agedu} will print the password
177 it invented on standard output along with the URL.
178
179 Configurable options for this mode let you specify your own address
180 and port number to listen on, and also specify your own choice of
181 authentication method (including turning authentication off
182 completely) and a username and password of your choice.
183
184 }
185
186 \dt \cw{-t} \e{directory} or \cw{--text} \e{directory}
187
188 \dd In this mode, \cw{agedu} generates a textual report on standard
189 output, listing the disk usage in the specified directory and all
190 its subdirectories down to a fixed depth. By default that depth is
191 1, so that you see a report for \e{directory} itself and all of its
192 immediate subdirectories. You can configure a different depth using
193 \cw{-d}, described in the next section.
194
195 \lcont{
196
197 Used on its own, \cw{-t} merely lists the \e{total} disk usage in
198 each subdirectory; \cw{agedu}'s additional ability to distinguish
199 unused from recently-used data is not activated. To activate it, use
200 the \cw{-a} option to specify a minimum age.
201
202 The directory structure stored in \cw{agedu}'s index file is treated
203 as a set of literal strings. This means that you cannot refer to
204 directories by synonyms. So if you ran \cw{agedu -s .}, then all the
205 path names you later pass to the \cw{-t} option must be either
206 \cq{.} or begin with \cq{./}. Similarly, symbolic links within the
207 directory you scanned will not be followed; you must refer to each
208 directory by its canonical, symlink-free pathname.
209
210 }
211
212 \dt \cw{-R} or \cw{--remove}
213
214 \dd In this mode, \cw{agedu} deletes its index file. Running just
215 \cw{agedu -R} on its own is therefore equivalent to typing \cw{rm
216 agedu.dat}. However, you can also put \cw{-R} on the end of a
217 command line to indicate that \cw{agedu} should delete its index
218 file after it finishes performing other operations.
219
220 \dt \cw{-D} or \cw{--dump}
221
222 \dd In this mode, \cw{agedu} reads an existing index file and
223 produces a dump of its contents on standard output. This dump can
224 later be loaded into a new index file, perhaps on another computer.
225
226 \dt \cw{-L} or \cw{--load}
227
228 \dd In this mode, \cw{agedu} expects to read a dump produced by the
229 \cw{-D} option from its standard input. It constructs an index file
230 from that dump, exactly as it would have if it had read the same
231 data from a disk scan in \cw{-s} mode.
232
233 \dt \cw{-S} \e{directory} or \cw{--scan-dump} \e{directory}
234
235 \dd In this mode, \cw{agedu} will scan a directory tree and convert
236 the results straight into a dump on standard output, without
237 generating an index file at all. So running \cw{agedu -S /path}
238 should produce equivalent output to that of \cw{agedu -s /path -D},
239 except that the latter will produce an index file as a side effect
240 whereas \cw{-S} will not.
241
242 \lcont{
243
244 (Actually, the output will not be exactly identical, due to a
245 difference in treatment of last-access times on directories. See the
246 documentation of the \cw{--dir-atime} option in the next section.)
247
248 }
249
250 \dt \cw{-H} \e{directory} or \cw{--html} \e{directory}
251
252 \dd In this mode, \cw{agedu} will generate an HTML report of the
253 disk usage in the specified directory and its immediate
254 subdirectories, in the same form that it serves from its web server
255 in \cw{-w} mode. However, this time, a single HTML report will be
256 generated and simply written to standard output, with no hyperlinks
257 pointing to other similar pages.
258
259 \U OPTIONS
260
261 This section describes the various configuration options that affect
262 \cw{agedu}'s operation in one mode or another.
263
264 The following option affects nearly all modes (except \cw{-S}):
265
266 \dt \cw{-f} \e{filename} or \cw{--file} \e{filename}
267
268 \dd Specifies the location of the index file which \cw{agedu}
269 creates, reads or removes depending on its operating mode. By
270 default, this is simply \cq{agedu.dat}, in whatever is the current
271 working directory when you run \cw{agedu}.
272
273 The following options affect the disk-scanning modes, \cw{-s} and
274 \cw{-S}:
275
276 \dt \cw{--cross-fs} and \cw{--no-cross-fs}
277
278 \dd These configure whether or not the disk scan is permitted to
279 cross between different file systems. The default is not to:
280 \cw{agedu} will normally skip over subdirectories on which a
281 different file system is mounted. This makes it convenient when you
282 want to free up space on a particular file system which is running
283 low. However, in other circumstances you might wish to see general
284 information about the use of space no matter which file system it's
285 on (for instance, if your real concern is your backup media running
286 out of space, and if your backups do not treat different file
287 systems specially); in that situation, use \cw{--cross-fs}.
288
289 \lcont{
290
291 (Note that this default is the opposite way round from the
292 corresponding option in \cw{du}.)
293
294 }
295
296 \dt \cw{--prune} \e{wildcard} and \cw{--prune-path} \e{wildcard}
297
298 \dd These cause particular files or directories to be omitted
299 entirely from the scan. If \cw{agedu}'s scan encounters a file or
300 directory whose name matches the wildcard provided to the
301 \cw{--prune} option, it will not include that file in its index;
302 if it is a directory, \cw{agedu} will also skip over it and not
303 scan its contents.
304
305 \lcont{
306
307 Note that in most Unix shells, wildcards will probably need to be
308 escaped on the command line, to prevent the shell from expanding the
309 wildcard before \cw{agedu} sees it.
310
311 \cw{--prune-path} is similar to \cw{--prune}, except that the
312 wildcard is matched against the entire pathname instead of just the
313 filename at the end of it. So whereas \cw{--prune *a*b*} will match
314 any file whose actual name contains an \cw{a} somewhere before a
315 \cw{b}, \cw{--prune-path *a*b*} will also match a file whose name
316 contains \cw{b} and which is inside a directory containing an
317 \cw{a}, or any file inside a directory of that form, and so on.
318
319 }
320
321 \dt \cw{--exclude} \e{wildcard} and \cw{--exclude-path} \e{wildcard}
322
323 \dd These cause particular files or directories to be omitted from
324 the index, but not from the scan. If \cw{agedu}'s scan encounters a
325 file or directory whose name matches the wildcard provided to the
326 \cw{--exclude} option, it will not include that file in its index
327 \dash but unlike \cw{--prune}, if the file in question is a
328 directory it will still scan its contents and index them if they are
329 not ruled out themselves by \cw{--exclude} options.
330
331 \lcont{
332
333 As above, \cw{--exclude-path} is similar to \cw{--exclude}, except
334 that the wildcard is matched against the entire pathname.
335
336 }
337
338 \dt \cw{--include} \e{wildcard} and \cw{--include-path} \e{wildcard}
339
340 \dd These cause particular files or directories to be re-included in
341 the index and the scan, if they had previously been ruled out by one
342 of the above exclude or prune options. You can interleave include,
343 exclude and prune options as you wish on the command line, and if
344 more than one of them applies to a file then the last one takes
345 priority.
346
347 \lcont{
348
349 For example, if you wanted to see only the disk space taken up by
350 MP3 files, you might run
351
352 \c $ agedu -s . --exclude '*' --include '*.mp3'
353 \e bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
354
355 which will cause everything to be omitted from the index, but then
356 the MP3 files to be put back in. If you then wanted only a subset of
357 those MP3s, you could then exclude some of them again by adding,
358 say, \cq{--exclude-path './queen/*'} (or, more efficiently,
359 \cq{--prune ./queen}) on the end of that command.
360
361 As with the previous two options, \cw{--include-path} is similar to
362 \cw{--include} except that the wildcard is matched against the
363 entire pathname.
364
365 }
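
The \q{last one takes priority} rule described above can be sketched in
Python. This is an illustration only, not \cw{agedu}'s own code: the
\cw{classify} function and the tuple layout of the rules are hypothetical,
and Python's \cw{fnmatch} module stands in for whatever wildcard matcher
\cw{agedu} really uses.

```python
from fnmatch import fnmatchcase
import posixpath

# Each rule is (action, wildcard, match_whole_path), in command-line order.
# action is "include", "exclude" or "prune"; match_whole_path is True for
# the --*-path variants. This layout is an assumption for illustration.
def classify(pathname, rules):
    """Return the action of the last rule matching pathname."""
    verdict = "include"  # nothing is ruled out until some rule says so
    name = posixpath.basename(pathname)
    for action, wildcard, match_whole_path in rules:
        subject = pathname if match_whole_path else name
        if fnmatchcase(subject, wildcard):
            verdict = action  # a later match overrides an earlier one
    return verdict

rules = [
    ("exclude", "*", False),         # --exclude '*'
    ("include", "*.mp3", False),     # --include '*.mp3'
    ("exclude", "./queen/*", True),  # --exclude-path './queen/*'
]
```

With these rules, \cw{classify("./abba/waterloo.mp3", rules)} yields
\cw{"include"}, while a text file, or any MP3 under \cw{./queen}, yields
\cw{"exclude"}.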
366
367 \dt \cw{--progress}, \cw{--no-progress} and \cw{--tty-progress}
368
369 \dd When \cw{agedu} is scanning a directory tree, it will typically
370 print a one-line progress report every second showing where it has
371 reached in the scan, so you can have some idea of how much longer it
372 will take. (Of course, it can't predict \e{exactly} how long it will
373 take, since it doesn't know which of the directories it hasn't
374 scanned yet will turn out to be huge.)
375
376 \lcont{
377
378 By default, those progress reports are displayed on \cw{agedu}'s
379 standard error channel, if that channel points to a terminal device.
380 If you need to manually enable or disable them, you can use the
381 above three options to do so: \cw{--progress} unconditionally
382 enables the progress reports, \cw{--no-progress} unconditionally
383 disables them, and \cw{--tty-progress} reverts to the default
384 behaviour which is conditional on standard error being a terminal.
385
386 }
387
388 \dt \cw{--dir-atime} and \cw{--no-dir-atime}
389
390 \dd In normal operation, \cw{agedu} ignores the atimes (last access
391 times) on the \e{directories} it scans: it only pays attention to
392 the atimes of the \e{files} inside those directories. This is
393 because directory atimes tend to be reset by a lot of system
394 administrative tasks, such as \cw{cron} jobs which scan the file
395 system for one reason or another \dash or even other invocations of
396 \cw{agedu} itself, though it tries to avoid modifying any atimes if
397 possible. So the literal atimes on directories are typically not
398 representative of how long ago the data in question was last
399 accessed with real intent to use that data in particular.
400
401 \lcont{
402
403 Instead, \cw{agedu} makes up a fake atime for every directory it
404 scans, which is equal to the newest atime of any file in or below
405 that directory (or the directory's last \e{modification} time,
406 whichever is newer). This is based on the assumption that all
407 \e{important} accesses to directories are actually accesses to the
408 files inside those directories, so that when any file is accessed
409 all the directories on the path leading to it should be considered
410 to have been accessed as well.
411
412 In unusual cases it is possible that a directory itself might embody
413 important data which is accessed by reading the directory. In that
414 situation, \cw{agedu}'s atime-faking policy will misreport the
415 directory as disused. In the unlikely event that such directories
416 form a significant part of your disk space usage, you might want to
417 turn off the faking. The \cw{--dir-atime} option does this: it
418 causes the disk scan to read the original atimes of the directories
419 it scans.
420
421 The faking of atimes on directories also requires a processing pass
422 over the index file after the main disk scan is complete.
423 \cw{--dir-atime} also turns this pass off. Hence, this option
424 affects the \cw{-L} option as well as \cw{-s} and \cw{-S}.
425
426 (The previous section mentioned that there might be subtle
427 differences between the output of \cw{agedu -s /path -D} and
428 \cw{agedu -S /path}. This is why. Doing a scan with \cw{-s} and then
429 dumping it with \cw{-D} will dump the fully faked atimes on the
430 directories, whereas doing a scan-to-dump with \cw{-S} will dump
431 only \e{partially} faked atimes \dash specifically, each directory's
432 last modification time \dash since the subsequent processing pass
433 will not have had a chance to take place. However, loading either of
434 the resulting dump files with \cw{-L} will perform the atime-faking
435 processing pass, leading to the same data in the index file in each
436 case. In normal usage it should be safe to ignore all of this
437 complexity.)
438
439 }
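
The atime-faking rule described above can be sketched in Python as a
single recursive pass. The dictionary layout (\cw{mtime}, \cw{files},
\cw{subdirs}) and the function name are hypothetical, chosen only for
this illustration; \cw{agedu}'s real index structure is different.

```python
def assign_fake_atimes(directory):
    """Set directory["fake_atime"] throughout the tree.

    A directory is a dict with "mtime" (its modification time), "files"
    (a list of file atimes) and "subdirs" (a list of directory dicts).
    Returns the newest file atime in or below the directory, or None if
    it contains no files at all.
    """
    newest = max(directory["files"], default=None)
    for sub in directory["subdirs"]:
        below = assign_fake_atimes(sub)
        if below is not None and (newest is None or below > newest):
            newest = below
    # Faked atime: the newest file atime in or below the directory, or
    # the directory's own mtime, whichever is newer.
    if newest is None:
        directory["fake_atime"] = directory["mtime"]
    else:
        directory["fake_atime"] = max(newest, directory["mtime"])
    return newest

tree = {
    "mtime": 100,
    "files": [50],
    "subdirs": [
        {"mtime": 10, "files": [300, 20], "subdirs": []},
        {"mtime": 400, "files": [], "subdirs": []},
    ],
}
assign_fake_atimes(tree)
```

In this sketch a subdirectory's modification time does not propagate
to its parent; only file atimes do. That is one reading of the rule
above, not a statement about what \cw{agedu} itself does.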
440
441 The following options affect the web server mode \cw{-w}, and in one
442 case also the stand-alone HTML generation mode \cw{-H}:
443
444 \dt \cw{-r} \e{age range} or \cw{--age-range} \e{age range}
445
446 \dd The HTML reports produced by \cw{agedu} use a range of colours
447 to indicate how long ago data was last accessed, running from red
448 (representing the most disused data) to green (representing the
449 newest). By default, the lengths of time represented by the two ends
450 of that spectrum are chosen by examining the data file to see what
451 range of ages appears in it. However, you might want to set your own
452 limits, and you can do this using \cw{-r}.
453
454 \lcont{
455
456 The argument to \cw{-r} consists of a single age, or two ages
457 separated by a minus sign. An age is a number, followed by one of
458 \cq{y} (years), \cq{m} (months), \cq{w} (weeks) or \cq{d} (days).
459 The first age in the range represents the oldest data, and will be
460 coloured red in the HTML; the second age represents the newest,
461 coloured green. If the second age is not specified, it will default
462 to zero (so that green means data which has been accessed \e{just
463 now}).
464
465 For example, \cw{-r 2y} will mark data in red if it has been unused
466 for two years or more, and green if it has been accessed just now.
467 \cw{-r 2y-3m} will similarly mark data red if it has been unused for
468 two years or more, but will mark it green if it was last accessed
469 three months ago or more recently.
470
471 }
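
The age-range syntax described above can be sketched as a small
Python parser. The function names are hypothetical, the day counts
for \cq{y} and \cq{m} are assumptions (this page does not say how
long \cw{agedu} considers a year or month to be), and only whole
numbers are accepted here.

```python
import re

# Approximate unit lengths in days. The year and month lengths are
# assumptions made for this sketch.
DAYS_PER_UNIT = {"y": 365.25, "m": 365.25 / 12, "w": 7.0, "d": 1.0}

_AGE = re.compile(r"(\d+)([ymwd])")

def parse_age(text):
    """Convert an age such as '6m' into a number of days."""
    match = _AGE.fullmatch(text)
    if match is None:
        raise ValueError(f"bad age: {text!r}")
    number, unit = match.groups()
    return int(number) * DAYS_PER_UNIT[unit]

def parse_age_range(text):
    """Return (oldest_days, newest_days) for '2y' or '2y-3m'."""
    oldest, separator, newest = text.partition("-")
    # A missing second age defaults to zero, i.e. "just now".
    return parse_age(oldest), (parse_age(newest) if separator else 0.0)
```

For example, \cw{parse_age_range("2w-3d")} returns \cw{(14.0, 3.0)}.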
472
473 \dt \cw{--address} \e{addr}[\cw{:}\e{port}]
474
475 \dd Specifies the network address and port number on which
476 \cw{agedu} should listen when running its web server. If you want
477 \cw{agedu} to listen for connections coming in from any source, you
478 should probably specify the special IP address \cw{0.0.0.0}. If the
479 port number is omitted, it will be assumed to be 80 (for which
480 \cw{agedu} will probably need to be running as a privileged user).
481
482 \lcont{
483
484 If you specify this option, \cw{agedu} will not print its URL on
485 standard output (since you are expected to know what address you
486 told it to listen on).
487
488 }
489
490 \dt \cw{--auth} \e{auth-type}
491
492 \dd Specifies how \cw{agedu} should control access to the web pages
493 it serves. The options are as follows:
494
495 \lcont{
496
497 \dt \cw{magic}
498
499 \dd This option only works on Linux, and only when the incoming
500 connection is from the same machine that \cw{agedu} is running on.
501 On Linux, the special file \cw{/proc/net/tcp} contains a list of
502 network connections currently known to the operating system kernel,
503 including which user id created them. So \cw{agedu} will look up
504 each incoming connection in that file, and allow access if it comes
505 from the same user id under which \cw{agedu} itself is running.
506 Therefore, in \cw{agedu}'s normal web server mode, you can safely
507 run it on a multi-user machine and no other user will be able to
508 read data out of your index file.
509
510 \dt \cw{basic}
511
512 \dd In this mode, \cw{agedu} will use HTTP Basic authentication: the
513 user will have to provide a username and password via their browser.
514 \cw{agedu} will normally make up a username and password for the
515 purpose, but you can specify your own; see below.
516
517 \dt \cw{none}
518
519 \dd In this mode, the web server is unauthenticated: anyone
520 connecting to it has full access to the reports generated by
521 \cw{agedu}. Do not do this unless there is nothing confidential at
522 all in your index file, or unless you are certain that nobody but
523 you can run processes on your computer.
524
525 \dt \cw{default}
526
527 \dd This is the default mode if you do not specify one of the above.
528 In this mode, \cw{agedu} will attempt to use Linux magic
529 authentication, but if it detects at startup time that
530 \cw{/proc/net/tcp} is absent or non-functional then it will fall
531 back to using HTTP Basic authentication and invent a user name and
532 password.
533
534 }
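
To illustrate the \cw{magic} method above, here is a rough Python
sketch of looking up the owner of a connection in the text of
\cw{/proc/net/tcp}. It is not \cw{agedu}'s implementation. The
function names and the sample data are made up, and the address
decoding assumes IPv4 on a little-endian machine.

```python
def parse_endpoint(field):
    """Decode 'A398A47F:BDFE' into ('127.164.152.163', 48638).

    Assumes IPv4 on a little-endian host, where the kernel prints the
    address bytes in reverse order.
    """
    addr_hex, port_hex = field.split(":")
    octets = reversed(bytes.fromhex(addr_hex))
    return ".".join(str(octet) for octet in octets), int(port_hex, 16)

def connection_owner(proc_net_tcp, client, server):
    """Return the uid owning the socket from client to server, or None."""
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 8:
            continue
        local, remote = parse_endpoint(fields[1]), parse_endpoint(fields[2])
        if local == client and remote == server:
            return int(fields[7])  # the uid column
    return None

# Made-up sample: a listening server socket and one client connection.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: A398A47F:BDFE 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 11111
   1: 0100007F:BE0E A398A47F:BDFE 01 00000000:00000000 00:00000000 00000000  1000        0 22222
"""
```

A server using this approach would compare the returned uid with its
own (for instance \cw{os.getuid()} in Python) and refuse the request
if they differ.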
535
536 \dt \cw{--auth-file} \e{filename} or \cw{--auth-fd} \e{fd}
537
538 \dd When \cw{agedu} is using HTTP Basic authentication, these
539 options allow you to specify your own user name and password. If you
540 specify \cw{--auth-file}, these will be read from the specified
541 file; if you specify \cw{--auth-fd} they will instead be read from a
542 given file descriptor which you should have arranged to pass to
543 \cw{agedu}. In either case, the authentication details should
544 consist of the username, followed by a colon, followed by the
545 password, followed \e{immediately} by end of file (no trailing
546 newline, or else it will be considered part of the password).
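
As an illustration of that format, the following Python sketch writes
and reads a file suitable for \cw{--auth-file}. The helper names are
hypothetical, and creating the file with mode 0600 is a precaution
chosen for this sketch rather than something \cw{agedu} requires.

```python
import os

def write_auth_file(path, username, password):
    """Write 'username:password' with no trailing newline."""
    if ":" in username:
        raise ValueError("username must not contain a colon")
    # Create the file readable and writable by its owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
        handle.write(f"{username}:{password}")  # deliberately no "\n"

def read_auth_file(path):
    """Split the file at the first colon into (username, password)."""
    with open(path, encoding="utf-8", newline="") as handle:
        username, _, password = handle.read().partition(":")
    return username, password
```

Note that a text editor, or \cw{echo} without \cw{-n}, will usually
append a newline, which would then become part of the password.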
547
548 \U LIMITATIONS
549
550 The data file is pretty large. The core of \cw{agedu} is the
551 tree-based data structure it uses in its index in order to
552 efficiently perform the queries it needs; this data structure
553 requires \cw{O(N log N)} storage. This is larger than you might
554 expect; a scan of my own home directory, containing half a million
555 files and directories and about 20GB of data, produced an index file
556 nearly a third of a GB in size. Furthermore, since the data file
557 must be memory-mapped during most processing, it can never grow
558 larger than available address space, which means that any use of
559 \cw{agedu} on a seriously large file system is probably going to
560 have to be done on a 64-bit computer.
561
562 The data structure also does not usefully permit access control
563 within the data file, so it would be difficult \dash even given the
564 willingness to do additional coding \dash to run a system-wide
565 \cw{agedu} scan on a \cw{cron} job and serve the right subset of
566 reports to each user.
567
568 \U LICENCE
569
570 \cw{agedu} is free software, distributed under the MIT licence. Type
571 \cw{agedu --licence} to see the full licence text.
572
573 \versionid $Id$