\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both upward and leftward of a
user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight
of everything up to and including that point. Then, given any
\cw{x0}, a log-time binary search will find the last point in the
list with \cw{x\_<\_x0}, and we can immediately read off the
cumulative weight figure.
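
For concreteness, here is a minimal sketch in C of that static
structure. The names are invented for illustration; none of this is
\cw{agedu}'s actual code.

\c /* A point in an array sorted by x, with a running cumulative
\c  * weight alongside it (including the point's own weight). */
\c struct point {
\c     double x, weight, cumulative;
\c };
\c
\c /* Total weight of all points with x < x0: binary-search for the
\c  * last point satisfying the bound, and read off its cumulative
\c  * weight figure. */
\c double weight_below(const struct point *p, int n, double x0)
\c {
\c     int lo = -1, hi = n;  /* invariant: p[lo].x < x0 <= p[hi].x */
\c     while (hi - lo > 1) {
\c         int mid = (lo + hi) / 2;
\c         if (p[mid].x < x0)
\c             lo = mid;
\c         else
\c             hi = mid;
\c     }
\c     return lo >= 0 ? p[lo].cumulative : 0.0;
\c }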

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types
\dash AVL tree, red-black tree, the various forms of B-tree \dash
will suffice. We then annotate every node of the tree with an extra
field giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions: a rotation or node split rearranges only a constant
number of nodes, and each affected node's annotation can be
recomputed in constant time from those of its children. So we can
start with an empty tree and insert a complete set of points, or we
can insert and delete points in arbitrary order, and the tree
annotations will remain valid at all times.

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange that
this search calculates the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we add to a running total the weight
annotation of each subtree to the left of the chosen one, together
with the weights of any points stored in the node itself to the left
of it. When the chosen subtree turns out to be \cw{NULL}, the
running total is the required answer.
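
In C, the search just described might look like the following
sketch, for the binary-tree case (again with invented names): when
the search steps rightward past a node, that node and its entire
left subtree are known to satisfy the bound, so their weight joins
the running total.

\c /* A search tree node carrying a total-weight annotation for the
\c  * whole subtree rooted at it. */
\c struct node {
\c     struct node *left, *right;
\c     double x, weight;       /* this point's coordinate and weight */
\c     double subtree_weight;  /* total weight in this subtree */
\c };
\c
\c static double tree_weight(const struct node *n)
\c {
\c     return n ? n->subtree_weight : 0.0;
\c }
\c
\c /* Total weight of points with x < x0. */
\c double tree_weight_below(const struct node *n, double x0)
\c {
\c     double total = 0.0;
\c     while (n) {
\c         if (n->x < x0) {
\c             /* n and all of n->left satisfy the bound. */
\c             total += tree_weight(n->left) + n->weight;
\c             n = n->right;
\c         } else {
\c             n = n->left;
\c         }
\c     }
\c     return total;
\c }

The B-tree version is the same idea, except that each step adds the
annotations of every subtree to the left of the one descended into.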

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the
tree to find the total weight of everything it contained with
\cw{x\_<\_x0}, and that would be the answer to our two-dimensional
query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
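
In C, the merge might be sketched as follows, reusing the node type
above; \cw{tree_insert} stands for an assumed balanced-tree
insertion that maintains the weight annotations.

\c /* Offline sweep: points[] sorted by y, queries[] sorted by y0.
\c  * (Invented names, for illustration only.) */
\c struct node *root = NULL;
\c int i = 0;
\c for (int q = 0; q < nqueries; q++) {
\c     /* Bring the tree up to date: insert every point strictly
\c      * below this query's y0. */
\c     while (i < npoints && points[i].y < queries[q].y0) {
\c         root = tree_insert(root, points[i].x, points[i].weight);
\c         i++;
\c     }
\c     /* The tree now holds exactly the points with y < y0. */
\c     answers[q] = tree_weight_below(root, queries[q].x0);
\c }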

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in
turn to a balanced tree sorted by \cw{x} with total-weight
annotations. The catch is that, this time, we never \e{modify} any
node in the tree: whenever the process of inserting an element into
a tree wants to modify a node, we instead make a \e{copy} of that
node containing the modified data. The parent of that node must now
be modified (even if it would previously not have needed
modification) so that it points at the copied node instead of the
original \dash so we do the same thing there, and copy that one too,
and so on up to the root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.
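
Here is a sketch of such a copy-on-write insertion in C, once more
with invented names, and with rebalancing omitted for brevity (a
real implementation would rebalance in the same copying style):

\c #include <stdlib.h>
\c
\c /* Insert (x, w) without modifying the old tree: copy each node on
\c  * the path from the root downward, sharing every untouched
\c  * subtree with the old tree. */
\c struct node *cow_insert(const struct node *n, double x, double w)
\c {
\c     struct node *copy = malloc(sizeof(struct node));
\c     if (!n) {
\c         copy->left = copy->right = NULL;
\c         copy->x = x;
\c         copy->weight = w;
\c     } else {
\c         *copy = *n;             /* share both subtrees for now */
\c         if (x < n->x)
\c             copy->left = cow_insert(n->left, x, w);
\c         else
\c             copy->right = cow_insert(n->right, x, w);
\c     }
\c     copy->subtree_weight = copy->weight +
\c         tree_weight(copy->left) + tree_weight(copy->right);
\c     return copy;    /* new root; the old root is still intact */
\c }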

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each stage of the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any moment in it.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
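
Putting the two searches together in the same hypothetical C terms:
if each saved root is stored alongside the \cw{y}-coordinate of the
point whose insertion created it, a quadrant query is one binary
search followed by one tree search. (For simplicity this sketch
assumes distinct \cw{y}-coordinates.)

\c struct root_entry {
\c     double y;                /* y-coordinate of the point whose
\c                               * insertion produced this root */
\c     struct node *root;
\c };
\c
\c double quadrant_weight(const struct root_entry *roots, int nroots,
\c                        double x0, double y0)
\c {
\c     /* Find the last saved root with y < y0: its tree contains
\c      * exactly the points with y < y0. */
\c     int lo = -1, hi = nroots;
\c     while (hi - lo > 1) {
\c         int mid = (lo + hi) / 2;
\c         if (roots[mid].y < y0)
\c             lo = mid;
\c         else
\c             hi = mid;
\c     }
\c     if (lo < 0)
\c         return 0.0;          /* no points at all with y < y0 */
\c     return tree_weight_below(roots[lo].root, x0);
\c }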

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log N)}
time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)}
new nodes. Hence, the total storage used by the combined set of
trees is \cw{O(N log N)}, much smaller than the \cw{O(N^2)} you'd
expect if the \cw{N} trees were all separate or even mostly
separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and
\cw{O(N log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}

We can reduce this to the same type of quadrant query as discussed
above by splitting it into two subqueries: the total size of files
with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is simply the total
size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size of
files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of
precisely the type we have just derived a data structure to answer.
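
As a one-line illustration, with \cw{quadrant} standing for a
hypothetical helper that answers a single quadrant query of the kind
described above:

\c /* Total size of files with P0 <= pathname < P1 and atime < T0. */
\c total = quadrant(P1, T0) - quadrant(P0, T0);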

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). Which
should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find
a tree root) be done on pathname, and the secondary lookup (on
\cw{x}, within that tree) be done on atime; then we can cache the
tree root found in the first lookup, and use it many times without
wasting time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate
instead of pathname.
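
One way to implement that relaxed rule, sketched below in C with
invented names (not \cw{agedu}'s actual code), is to reuse the node
structure from the earlier sketches, extended with a \cw{generation}
field stamping each node with the number of tree roots saved so far
at its creation: a node bearing the current stamp is not yet
reachable from any saved root, so it is safe to modify in place.

\c /* Return a node we are allowed to modify. Nodes created since the
\c  * last saved root carry the current generation stamp and can be
\c  * modified in place; older nodes must still be copied. */
\c struct node *modifiable(struct node *n, int generation)
\c {
\c     if (n->generation == generation)
\c         return n;         /* private to this generation already */
\c     struct node *copy = malloc(sizeof(struct node));
\c     *copy = *n;           /* subtree pointers remain shared */
\c     copy->generation = generation;
\c     return copy;
\c }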

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations, \e{each} of which could take time proportional
to the length of the string.

Finally, the restriction of the above data structure to two
dimensions unfortunately limits what \cw{agedu} can do. One would
like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to show only what their access permissions allowed them to see.
Sadly, doing this would require a third dimension in the data
structure (representing ownership or access control, in some
fashion), and hence cannot be done efficiently. \cw{agedu} is
therefore most sensibly used on demand by an individual user, so
that it generates a custom data set for that user every time.