\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both downward and leftward of a
user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.
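
To make the target behaviour concrete, here is a naive linear-time
reference in Python (an illustrative sketch of ours, not code from
\cw{agedu}; the name \cw{quadrant_weight} is our own). The structure
developed below answers the same query in \cw{O(log N)}:

```python
def quadrant_weight(points, x0, y0):
    """Total weight of the points (x, y, w) with x < x0 and y < y0.

    O(N) per query: the baseline the log-time structure improves on."""
    return sum(w for x, y, w in points if x < x0 and y < y0)
```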

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight of
everything up to and including that point. Then, given any \cw{x0}, a
log-time binary search will find the last point in the list with
\cw{x\_<\_x0}, and we can immediately read off the cumulative weight
figure.
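
A minimal Python sketch of this scheme, using the standard
\cw{bisect} module (function names are our own invention):

```python
import bisect

def build_prefix(points):
    """points: list of (x, weight) pairs.
    Returns the sorted x-coordinates and the cumulative weights."""
    pts = sorted(points)
    xs = [x for x, _ in pts]
    cum = []
    total = 0
    for _, w in pts:
        total += w
        cum.append(total)
    return xs, cum

def weight_below(xs, cum, x0):
    """Total weight of points with x < x0, by log-time binary search."""
    i = bisect.bisect_left(xs, x0)  # number of points with x < x0
    return cum[i - 1] if i > 0 else 0
```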

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types \dash
AVL tree, red-black tree, the various forms of B-tree \dash will
suffice. We then annotate every node of the tree with an extra field
giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions. So we can start with an empty tree and insert a complete
set of points, or we can insert and delete points in arbitrary
order, and the tree annotations will remain valid at all times.
A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange for this
search to calculate the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we add to a running total the weight
annotation on the root of every subtree to the left of our chosen
one, plus the weights of any elements in the node itself that lie to
the left of it. When we reach a leaf node and find our chosen
subtree is \cw{NULL}, the running total is the required answer.
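
The descent can be sketched as follows for a plain binary search
tree (our own illustrative code; rebalancing is omitted for brevity,
so the worst-case time degrades, but the annotation and query logic
are unchanged in a balanced variant):

```python
class Node:
    def __init__(self, x, w):
        self.x, self.w = x, w
        self.left = self.right = None
        self.subtree_weight = w  # weight of this node and everything below

def insert(root, x, w):
    """Plain BST insert (no balancing, for brevity), updating the
    subtree-weight annotation on every node along the path."""
    if root is None:
        return Node(x, w)
    root.subtree_weight += w
    if x < root.x:
        root.left = insert(root.left, x, w)
    else:
        root.right = insert(root.right, x, w)
    return root

def weight_below(root, x0):
    """Total weight of points with x < x0: descend as in a search,
    accumulating whole subtrees that fall entirely below x0."""
    total = 0
    while root is not None:
        if root.x < x0:
            # this node, and its entire left subtree, lie below x0
            total += root.w
            if root.left is not None:
                total += root.left.subtree_weight
            root = root.right
        else:
            root = root.left
    return total
```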

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the tree
to find the total weight of everything it contained with \cw{x\_<\_x0},
and that would be the answer to our two-dimensional query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
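
The merge can be sketched like this (our own illustrative code). For
brevity the sketch keeps the inserted points in a plain sorted list
rather than the annotated balanced tree, so only the sweep logic,
not the per-query cost, is representative:

```python
import bisect

def offline_quadrant_queries(points, queries):
    """points: (x, y, w) triples; queries: (x0, y0) pairs, all known
    up front. Sweep both lists in order of y, inserting each point
    into a structure sorted by x just before answering the first
    query whose y0 exceeds its y. Returns answers in the order the
    queries were originally given."""
    pts = sorted(points, key=lambda p: p[1])                # by y
    order = sorted(range(len(queries)),
                   key=lambda i: queries[i][1])             # by y0
    answers = [0] * len(queries)
    xs, ws = [], []           # x-sorted points inserted so far
    j = 0
    for i in order:
        x0, y0 = queries[i]
        while j < len(pts) and pts[j][1] < y0:              # add y < y0
            x, _, w = pts[j]
            k = bisect.bisect_left(xs, x)
            xs.insert(k, x)
            ws.insert(k, w)
            j += 1
        k = bisect.bisect_left(xs, x0)                      # count x < x0
        answers[i] = sum(ws[:k])
    return answers
```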

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in turn
to a balanced tree sorted by \cw{x} with total-weight annotations. The
catch is that, this time, we never \e{modify} any node in the tree:
whenever the process of inserting an element into a tree wants to
modify a node, we instead make a \e{copy} of that node containing
the modified data. The parent of that node must now be modified
(even if it would previously not have needed modification) so that
it points at the copied node instead of the original \dash so we do
the same thing there, and copy that one too, and so on up to the
root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each point in the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
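
Here is an illustrative Python sketch of the whole scheme (our own
code, with rebalancing again omitted for brevity): a path-copying
insert that never mutates an existing node, a sorted list of saved
roots keyed by \cw{y}, and the two-search query:

```python
import bisect

class Node:
    __slots__ = ('x', 'w', 'left', 'right', 'subtree_weight')
    def __init__(self, x, w, left=None, right=None):
        self.x, self.w, self.left, self.right = x, w, left, right
        self.subtree_weight = w  # total weight in and below this node
        if left:
            self.subtree_weight += left.subtree_weight
        if right:
            self.subtree_weight += right.subtree_weight

def insert(root, x, w):
    """Path-copying insert: allocates fresh nodes along the search
    path and never mutates an old one, so every previous root stays
    valid. (A real version would also rebalance.)"""
    if root is None:
        return Node(x, w)
    if x < root.x:
        return Node(root.x, root.w, insert(root.left, x, w), root.right)
    else:
        return Node(root.x, root.w, root.left, insert(root.right, x, w))

def weight_below(root, x0):
    """One-dimensional count query: total weight with x < x0."""
    total = 0
    while root is not None:
        if root.x < x0:
            total += root.w
            if root.left:
                total += root.left.subtree_weight
            root = root.right
        else:
            root = root.left
    return total

def build(points):
    """Insert points in order of y, keeping every intermediate root.
    Returns the sorted y-coordinates and the parallel list of roots
    (roots[i] holds exactly the i points of smallest y)."""
    ys, roots = [], [None]
    for x, y, w in sorted(points, key=lambda p: p[1]):
        ys.append(y)
        roots.append(insert(roots[-1], x, w))
    return ys, roots

def quadrant_weight(ys, roots, x0, y0):
    """Total weight with x < x0 and y < y0: one log-time search for
    the right root, then one log-time count query within it."""
    i = bisect.bisect_left(ys, y0)   # number of points with y < y0
    return weight_below(roots[i], x0)
```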

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log N)}
time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)} new
nodes. Hence, the total storage used by the combined set of trees is
\cw{O(N log N)}, much smaller than the \cw{O(N^2)} you'd expect if
the trees were all separate or even mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and \cw{O(N
log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}
We can reduce this to the same type of quadrant query as discussed
above by splitting it into two subqueries: the total size of files
with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is clearly the total
size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size of
files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of
precisely the type we have just derived a data structure to answer.
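
The subtraction is straightforward; in this sketch the quadrant
primitive is a stand-in linear scan (our own illustrative names, not
\cw{agedu}'s actual code):

```python
def quadrant_weight(files, p1, t0):
    """Total size of files (path, atime, size) with path < p1 and
    atime < t0. Stand-in for the real log-time structure."""
    return sum(s for p, a, s in files if p < p1 and a < t0)

def range_weight(files, p0, p1, t0):
    """Total size with p0 <= path < p1 and atime < t0, expressed as
    the difference of two quadrant queries."""
    return quadrant_weight(files, p1, t0) - quadrant_weight(files, p0, t0)
```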

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). So
which should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find a
tree root) be done on pathname, and the secondary lookup (on \cw{x},
within that tree) be done on atime; then we can cache the tree root
found in the first lookup, and use it many times without wasting
time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate
instead of pathname.

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.
Finally, the two-dimensional limitation on the above data structure
unfortunately imposes limitations on what \cw{agedu} can do. One
would like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to only what their access permissions allowed them to see. Sadly,
to do this would require a third dimension in the data structure
(representing ownership or access control, in some fashion), and
hence cannot be done efficiently. \cw{agedu} is therefore instead
most sensibly used on demand by an individual user, so that it
generates a custom data set for that user every time.