\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both downward and leftward of
a user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and \cw{y\_<\_y0}?}. The
previous version of the question, of course, is a special case of
this in which all weights are set to 1.
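
For concreteness, here is a minimal C sketch of the problem, with
the obvious brute-force answer that the rest of this article
improves on. The type and function names here are invented for
illustration, not taken from any real implementation:

\c #include <stddef.h>
\c
\c struct point {
\c     double x, y;
\c     double weight;   /* 1 for the unweighted version */
\c };
\c
\c /* Naive quadrant query: scan every point. O(N) per query. */
\c double quadrant_weight(const struct point *pts, size_t n,
\c                        double x0, double y0)
\c {
\c     double total = 0;
\c     size_t i;
\c     for (i = 0; i < n; i++)
\c         if (pts[i].x < x0 && pts[i].y < y0)
\c             total += pts[i].weight;
\c     return total;
\c }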

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight of
everything up to and including that point. Then, given any \cw{x0}, a
log-time binary search will find the last point in the list with
\cw{x\_<\_x0}, and we can immediately read off the cumulative weight
figure.
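
As an illustrative sketch of that static scheme (reusing the
\cw{struct point} from the sketch above): \cw{pts} is sorted by
\cw{x}, and \cw{cum[i]} holds the total weight of \cw{pts[0]} to
\cw{pts[i]} inclusive.

\c /* Total weight of points with x < x0, in O(log n) time.
\c  * pts must be sorted by x; cum[i] is the cumulative weight
\c  * of pts[0..i]. */
\c double weight_below(const struct point *pts, const double *cum,
\c                     size_t n, double x0)
\c {
\c     size_t lo = 0, hi = n;
\c     while (lo < hi) {
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (pts[mid].x < x0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     /* lo is now the number of points with x < x0 */
\c     return lo ? cum[lo - 1] : 0;
\c }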

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types \dash
AVL tree, red-black tree, the various forms of B-tree \dash will
suffice. We then annotate every node of the tree with an extra field
giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions. So we can start with an empty tree and insert a complete
set of points, or we can insert and delete points in arbitrary
order, and the tree annotations will remain valid at all times.
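
The key observation is that a node's annotation can be recomputed in
constant time from its children, so each insertion, deletion or
rebalancing rotation only has to re-derive the annotations of the
\cw{O(log N)} nodes it touches. A sketch in C, with invented names
(a real balanced tree node would carry balance information too):

\c struct node {
\c     struct node *lc, *rc;   /* children, possibly NULL */
\c     double x, weight;       /* this node's own point */
\c     double subtree_weight;  /* this node plus everything below */
\c };
\c
\c /* Re-derive a node's annotation from its children, in O(1).
\c  * Called on every node an update or rotation touches. */
\c void fix_annotation(struct node *n)
\c {
\c     n->subtree_weight = n->weight
\c         + (n->lc ? n->lc->subtree_weight : 0)
\c         + (n->rc ? n->rc->subtree_weight : 0);
\c }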

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has total-weight
annotations as described above, we can arrange that this search
calculates the total weight of all points with \cw{x\_<\_x0}: every time
we examine a tree node and decide which subtree of it to descend to,
we look at the top node of each subtree to the left of that one, and
add their weight annotations to a running total, together with the
weights of any points in the node itself which lie to the left of
our descent point. When we reach a leaf node and find our chosen
subtree is \cw{NULL}, the running total is the required answer.
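
In terms of the sketch above, the search looks something like this
(again only an illustrative sketch, for a plain binary tree):

\c /* Total weight of points with x < x0, in O(log N) time.
\c  * Stepping right means the left subtree and the node's own
\c  * point all lie strictly left of x0, so count them. */
\c double tree_weight_below(const struct node *n, double x0)
\c {
\c     double total = 0;
\c     while (n) {
\c         if (n->x < x0) {
\c             if (n->lc)
\c                 total += n->lc->subtree_weight;
\c             total += n->weight;
\c             n = n->rc;
\c         } else {
\c             n = n->lc;
\c         }
\c     }
\c     return total;
\c }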

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the tree
to find the total weight of everything it contained with \cw{x\_<\_x0},
and that would be the answer to our two-dimensional query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
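
A sketch of that merge, assuming an insertion routine
\cw{tree_insert} (not shown) which keeps the annotated tree of the
previous section balanced:

\c struct query {
\c     double x0, y0;
\c     double answer;
\c };
\c
\c /* Assumed: O(log N) insertion into the annotated tree. */
\c struct node *tree_insert(struct node *root, double x, double w);
\c
\c /* pts sorted by y, qs sorted by y0: answer every query by
\c  * inserting points and answering queries in y order. */
\c void answer_all(const struct point *pts, size_t np,
\c                 struct query *qs, size_t nq)
\c {
\c     struct node *root = NULL;
\c     size_t i = 0, j;
\c     for (j = 0; j < nq; j++) {
\c         /* first add every point with y < y0 for query j */
\c         while (i < np && pts[i].y < qs[j].y0) {
\c             root = tree_insert(root, pts[i].x, pts[i].weight);
\c             i++;
\c         }
\c         qs[j].answer = tree_weight_below(root, qs[j].x0);
\c     }
\c }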

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in turn
to a balanced tree sorted by \cw{x} with total-weight annotations. The
catch is that, this time, we never \e{modify} any node in the tree:
whenever the process of inserting an element into a tree wants to
modify a node, we instead make a \e{copy} of that node containing
the modified data. The parent of that node must now be modified
(even if it would previously not have needed modification) so that
it points at the copied node instead of the original \dash so we do
the same thing there, and copy that one too, and so on up to the
root.
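
Here is a sketch of such a path-copying insertion, again for an
unbalanced binary tree to keep it short (a real implementation would
also rebalance, copying any node a rotation touches rather than
modifying it):

\c #include <stdlib.h>
\c
\c /* Insert (x, w) without modifying any existing node: clone
\c  * every node on the path from the root to the insertion point
\c  * and return the new root. The old root remains valid. */
\c struct node *cow_insert(const struct node *n, double x, double w)
\c {
\c     struct node *copy = malloc(sizeof(struct node));
\c     if (!n) {
\c         copy->lc = copy->rc = NULL;
\c         copy->x = x;
\c         copy->weight = copy->subtree_weight = w;
\c         return copy;
\c     }
\c     *copy = *n;            /* clone, sharing both subtrees */
\c     if (x < n->x)
\c         copy->lc = cow_insert(n->lc, x, w);
\c     else
\c         copy->rc = cow_insert(n->rc, x, w);
\c     fix_annotation(copy);  /* restore the weight annotation */
\c     return copy;
\c }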

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for every point we inserted,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
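
With a sorted array as the list of roots, the whole query is two
binary searches. In this sketch, \cw{ys[i]} is the \cw{y}-coordinate
of the point inserted at step \cw{i} (in \cw{y} order, counting from
zero), and \cw{roots[i]} is the tree root recorded just after that
insertion:

\c /* Answer a quadrant query in O(log N): find the root whose
\c  * tree contains exactly the points with y < y0, then do a
\c  * one-dimensional count query into that tree. */
\c double quadrant_query(struct node *const *roots, const double *ys,
\c                       size_t n, double x0, double y0)
\c {
\c     size_t lo = 0, hi = n;
\c     while (lo < hi) {      /* count points with y < y0 */
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (ys[mid] < y0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     return lo ? tree_weight_below(roots[lo - 1], x0) : 0;
\c }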

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log N)}
time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)} new
nodes. Hence, the total storage used by the combined set of trees is
\cw{O(N log N)}, much smaller than the \cw{O(N^2)} you'd expect if
the trees were all separate or even mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and \cw{O(N
log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}

Clearly, we can reduce this to the same type of quadrant query as
discussed above, by splitting it into two subqueries: the total
size of files with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is the
total size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size
of files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of
precisely the type we have just derived a data structure to answer.
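
In code, the subtraction is trivial. The interface below is a
hypothetical illustration, not \cw{agedu}'s actual API:

\c #include <time.h>
\c
\c /* Hypothetical: total size of files with pathname < p
\c  * (in ASCII collation order) and atime < t0, answered by
\c  * the quadrant data structure. */
\c double quadrant(const char *p, time_t t0);
\c
\c /* Total size of files with p0 <= pathname < p1 and atime < t0,
\c  * i.e. the contents of one subdirectory, as a difference of
\c  * two quadrant queries. */
\c double subdir_size(const char *p0, const char *p1, time_t t0)
\c {
\c     return quadrant(p1, t0) - quadrant(p0, t0);
\c }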

We therefore want to sort our files by two \q{coordinates}: one is
atime, and the other is pathname (sorted in ASCII collation order).
Which should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find a
tree root) be done on pathname, and the secondary lookup (on \cw{x},
within that tree) be done on atime; then we can cache the tree root
found in the first lookup, and use it many times without wasting
time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate
instead of pathname.
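
One way to implement that relaxed rule (an assumption on my part,
not necessarily how \cw{agedu} does it) is to stamp each node with a
generation number which is incremented whenever a tree root is
saved; a node may then be modified in place exactly when its stamp
matches the current generation:

\c #include <stdlib.h>
\c
\c struct stamped_node {
\c     struct stamped_node *lc, *rc;
\c     double x, weight, subtree_weight;
\c     unsigned gen;        /* generation that created this node */
\c };
\c
\c static unsigned current_gen;  /* bumped when a root is saved */
\c
\c /* Return a node that is safe to modify: the node itself if it
\c  * was already copied since the last saved root, or a fresh
\c  * copy of it otherwise. */
\c struct stamped_node *writable(struct stamped_node *n)
\c {
\c     struct stamped_node *copy;
\c     if (n->gen == current_gen)
\c         return n;        /* already private to this generation */
\c     copy = malloc(sizeof *copy);
\c     *copy = *n;
\c     copy->gen = current_gen;
\c     return copy;
\c }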

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.
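
To illustrate why the cost depends only on the key length, here is
a minimal trie lookup in C. (This naive one-node-per-byte layout is
for illustration only; a practical trie would use a more compact
representation.)

\c /* A trie over byte strings, mapping each stored pathname to a
\c  * tree root. Lookup visits one node per byte of the key. */
\c struct trie_node {
\c     struct node *tree_root;        /* payload, or NULL */
\c     struct trie_node *child[256];  /* one child per next byte */
\c };
\c
\c struct node *trie_lookup(const struct trie_node *t, const char *key)
\c {
\c     const unsigned char *p = (const unsigned char *)key;
\c     while (t && *p)
\c         t = t->child[*p++];
\c     return t ? t->tree_root : NULL;
\c }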

Finally, the limitation of the above data structure to two
dimensions unfortunately restricts what \cw{agedu} can do. One
would like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to show only what their access permissions allowed them to see.
Sadly, to do this would require a third dimension in the data
structure (representing ownership or access control, in some
fashion), and hence cannot be done efficiently. \cw{agedu} is
therefore instead most sensibly used on demand by an individual
user, so that it generates a custom data set for that user every
time.