From c1065ef6db11830ffb544b5ffabe6af3e59688f6 Mon Sep 17 00:00:00 2001 From: simon Date: Sun, 6 Dec 2009 23:07:33 +0000 Subject: [PATCH] About time I wrote up the data structure used in agedu, since it's quite fun. git-svn-id: svn://svn.tartarus.org/sgt/agedu@8767 cda61777-01e9-0310-a592-d414129be87e --- Buildscr | 2 + tree.but | 260 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 262 insertions(+) create mode 100644 tree.but diff --git a/Buildscr b/Buildscr index 8bf909c..76e5b2c 100644 --- a/Buildscr +++ b/Buildscr @@ -20,9 +20,11 @@ in . do rm agedu-r$(revision)/GNUmakefile in . do tar chzvf agedu-r$(revision).tar.gz agedu-r$(revision) in agedu do halibut --html=manpage.html agedu.but +in agedu do halibut --html=tree.html tree.but deliver agedu-r$(revision).tar.gz $@ deliver agedu/manpage.html $@ +deliver agedu/tree.html $@ delegate windows # FIXME: Cygwin alternative? diff --git a/tree.but b/tree.but new file mode 100644 index 0000000..2569075 --- /dev/null +++ b/tree.but @@ -0,0 +1,260 @@ +\cfg{html-chapter-numeric}{true} +\cfg{html-chapter-suffix}{. } +\cfg{chapter}{Section} + +\define{dash} \u2013{-} + +\title A tree structure for log-time quadrant counting + +This article describes a general form of data structure which +permits the storing of two-dimensional data in such a way that a +log-time lookup can reliably return the total \e{amount} of data in +an arbitrary quadrant (i.e. a region both upward and leftward of a +user-specified point). + +The program +\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu}\cw{agedu}, +which uses a data structure of this type for its on-disk index +files, is used as a case study. + +\C{problem} The problem + +The basic problem can be stated as follows: you have a collection of +\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and +you want to be able to efficiently answer questions of the general +form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?}, +for arbitrary \cw{(x0,y0)}. + +A slightly enhanced form of the problem, which is no more difficult +to solve, allows each point to have a \e{weight} as well as +coordinates; now the question to be answered is \q{What is the +\e{total weight} of the points with both \cw{x\_<\_x0} and \cw{y\_<\_y0}?}. The +previous version of the question, of course, is a special case of +this in which all weights are set to 1. + +The data structure presented in this article answers any question of +this type in time \cw{O(log N)}. + +\C{onedim} The one-dimensional case + +To begin with, we address the one-dimensional analogue of the +problem. If we have a set of points which each have an +\cw{x}-coordinate and a weight, how do we represent them so as to be +able to find the total weight of all points with \cw{x\_<\_x0}? + +An obvious solution is to sort the points into order of their +\cw{x}-coordinate, and alongside each point to list the total weight of +everything up to and including that point. Then, given any \cw{x0}, a +log-time binary search will find the last point in the list with +\cw{x\_<\_x0}, and we can immediately read off the cumulative weight +figure. + +Now let's be more ambitious. What if the set of points were +constantly changing \dash new points arriving, old points going away +\dash and we wanted a data structure which would cope efficiently +with dynamic updates while maintaining the ability to answer these +count queries? + +An efficient solution begins by storing the points in a balanced +tree, sorted by their \cw{x}-coordinates. Any of the usual types \dash +AVL tree, red-black tree, the various forms of B-tree \dash will +suffice. We then annotate every node of the tree with an extra field +giving the total weight of the points in and below that node. + +These annotations can be maintained through updates to the tree, +without sacrificing the \cw{O(log N)} time bound on insertions or +deletions. So we can start with an empty tree and insert a complete +set of points, or we can insert and delete points in arbitrary +order, and the tree annotations will remain valid at all times. + +A balanced tree sorted by \cw{x} can easily be searched to find the +last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has total-weight +annotations as described above, we can arrange that this search +calculates the total weight of all points with \cw{x\_<\_x0}: every time +we examine a tree node and decide which subtree of it to descend to, +we look at the top node of each subtree to the left of that one, and +add their weight annotations to a running total. When we reach a +leaf node and find our chosen subtree is \cw{NULL}, the running +total is the required answer. + +So this data structure allows us to maintain a changing set of +points in such a way that we can do an efficient one-dimensional +count query at any time. + +\C{incremental} An incremental approach + +Now we'll use the above one-dimensional structure to answer a +restricted form of the original 2-D problem. Suppose we have some +large set of \cw{(x0,y0)} pairs for which we want to answer +counted-quadrant queries, and suppose (for the moment) that we have +\e{all of the queries presented in advance}. Is there any way we can +do that efficiently? + +There is. We sort our points by their \cw{y}-coordinate, and then go +through them in that order adding them one by one to a balanced tree +sorted by \cw{x} as described above. + +So for any query coordinates \cw{(x0,y0)}, there must be some moment +during that process at which we have added to our tree precisely +those points with \cw{y\_<\_y0}. At that moment, we could search the tree +to find the total weight of everything it contained with \cw{x\_<\_x0}, +and that would be the answer to our two-dimensional query. + +Hence, if we also sorted our queries into order by \cw{y0}, we could +progress through the list of queries in parallel with the list of +points (much like merging two sorted lists), answering each query at +the moment when the tree contained just the right set of points to +make it easy. + +\C{cow} Copy-on-write + +In real life, of course, we typically don't receive all our queries +in a big batch up front. We want to construct a data structure +capable of answering \e{any} query efficiently, and then be able to +deal with queries as they arise. + +A data structure capable of this, it turns out, is only a small step +away from the one described in the previous section. The small step +is \e{copy-on-write}. + +As described in the previous section, we go through our list of +points in order of their \cw{y}-coordinate, and we add each one in turn +to a balanced tree sorted by \cw{x} with total-weight annotations. The +catch is that, this time, we never \e{modify} any node in the tree: +whenever the process of inserting an element into a tree wants to +modify a node, we instead make a \e{copy} of that node containing +the modified data. The parent of that node must now be modified +(even if it would previously not have needed modification) so that +it points at the copied node instead of the original \dash so we do +the same thing there, and copy that one too, and so on up to the +root. + +So after we have done a single insertion by this method, we end up +with a new tree root which describes the new tree \dash but the old +tree root, describing the tree before the insertion, is still valid, +because every node reachable from the old root is unchanged. + +Therefore, if we start with an empty tree and repeatedly do this, we +will end up with a distinct tree root for each point in the process, +and \e{all of them will be valid at once}. It's as if we had done +the incremental tree construction in the previous section, but could +rewind time to any point in the process. + +So now all we have to do is make a list of those tree roots, with +their associated \cw{y}-coordinates. Any way of doing this will do +\dash another balanced tree, or a simple sorted list to be +binary-searched, or something more exotic, whatever is convenient. + +Then we can answer an arbitrary quadrant query using only a pair of +log-time searches: given a query coordinate pair \cw{(x0,y0)}, we +look through our list of tree roots to find the one describing +precisely the set of points with \cw{y\_<\_y0}, and then do a +one-dimensional count query into that tree for the total weight of +points with \cw{x\_<\_x0}. Done! + +The nice thing about all of this is that \e{nearly all} the nodes in +each tree are shared with the next one. Consider: since the +operation of adding an element to a balanced tree takes \cw{O(log N)} +time, it must modify at most \cw{O(log N)} nodes. Each of these +node-copying insertion processes must copy all of those nodes, but +need not copy any others \dash so it creates at most \cw{O(log N)} new +nodes. Hence, the total storage used by the combined set of trees is +\cw{O(N log N)}, much smaller than the \cw{O(N^2 log N)} you'd expect if +the trees were all separate or even mostly separate. + +\C{limitations} Limitations + +The one-dimensional data structure described at the start of this +article is dynamically updatable: if the set of points to be +searched changes, the structure can be modified efficiently without +losing its searching properties. The two-dimensional structure we +ended up with is not: if a single point changes its coordinates or +weight, or appears, or disappears, then the whole structure must be +rebuilt. + +Since the technique I used to add an extra dimension is critically +dependent on the dynamic updatability of the one-dimensional base +structure, but the resulting structure is not dynamically updatable +in turn, it follows that this technique cannot be applied twice: no +analogous transformation will construct a \e{three}-dimensional +struccture capable of counting the total weight of an octant +\cw{\{x\_ <\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way +to do that. + +The structure as described above uses \cw{O(N log N)} storage. Many +algorithms using \cw{O(N log N)} time are considered efficient (e.g. +sorting), but space is generally more expensive than time, and \cw{O(N +log N)} space is larger than you think! + +\C{application} An application: \cw{agedu} + +The application for which I developed this data structure is +\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu}\cw{agedu}, a +program which scans a file system and indexes pathnames against +last-access times (\q{atimes}, in Unix terminology) so as to be able +to point out directories which take up a lot of disk space but whose +contents have not been accessed in years (making them strong +candidates for tidying up or archiving to save space). + +So the fundamental searching primitive we want for \cw{agedu} is +\q{What is the total size of the files contained within some +directory path \cw{Ptop} which have atime at most \cw{T0}?} + +We begin by sorting the files into order by their full pathname. +This brings every subdirectory, at every level, together into a +contiguous sequence of files. So now our query primitive becomes +\q{What is the total size of files whose pathname falls between +\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?} + +Clearly, we can simplify this to the same type of quadrant query as +discussed above, by splitting this into two subqueries: the total +size of files with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is clearly the +total size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size +of files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of +precisely the type we have just derived a data structure to answer. + +So we want to sort our files by two \q{coordinates}: one is atime, +and the other is pathname (sorted in ASCII collation order). So +which should be \cw{x} and which \cw{y}, in the above notation? + +Well, either way round would work in principle, but two criteria +inform the decision. Firstly, \cw{agedu} typically wants to do many +queries on the same pathname for different atimes, so as to build up +a detailed profile of size against atime for a given subdirectory. +So it makes sense to have the first-level lookup (on \cw{y}, to find a +tree root) be done on pathname, and the secondary lookup (on \cw{x}, +within that tree) be done on atime; then we can cache the tree root +found in the first lookup, and use it many times without wasting +time repeating the pathname search. + +Another important point for \cw{agedu} is that not all tree roots +are actually used: the only pathnames ever submitted to a quadrant +search are those at the start or the end of a particular +subdirectory. This allows us to save a lot of disk space by limiting +the copy-on-write behaviour: instead of \e{never} modifying an +existing tree node, we now rule that we may modify a tree node \e{if +it has already been modified once since the last tree root we +saved}. In real-world cases, this cuts down the total space usage by +about a factor of five, so it's well worthwhile \dash and we +wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate +instead of pathname. + +Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the +top-level lookup to find a tree root is most efficiently done using +neither a balanced tree nor a sorted list, but a trie: tries allow +lookup of a string in time proportional to the length of the string, +whereas either of the other approaches would require \cw{O(log N)} +compare operations \e{each} of which could take time proportional to +the length of the string. + +Finally, the two-dimensions limitation on the above data structure +unfortunately imposes limitations on what \cw{agedu} can do. One +would like to run a single \cw{agedu} as a system-wide job on a file +server (perhaps nightly or weekly), and present the results to all +users in such a way that each user's view of the data was filtered +to only what their access permissions permitted them to see. Sadly, +to do this would require a third dimension in the data structure +(representing ownership or access control, in some fashion), and +hence cannot be done efficiently. \cw{agedu} is therefore instead +most sensibly used on demand by an individual user, so that it +generates a custom data set for that user every time. -- 2.11.0