\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both upward and leftward of a
user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight
of everything up to and including that point. Then, given any
\cw{x0}, a log-time binary search will find the last point in the
list with \cw{x\_<\_x0}, and we can immediately read off the
cumulative weight figure.
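
For concreteness, here is a minimal sketch in C of that static
structure. The names are invented for illustration; none of this is
\cw{agedu}'s actual code.

\c /* A point in an array sorted by x, with a running cumulative
\c  * weight alongside it (including the point's own weight). */
\c struct point {
\c     double x, weight, cumulative;
\c };
\c
\c /* Total weight of all points with x < x0: binary-search for the
\c  * last point satisfying the bound, and read off its cumulative
\c  * weight figure. */
\c double weight_below(const struct point *p, int n, double x0)
\c {
\c     int lo = -1, hi = n;  /* invariant: p[lo].x < x0 <= p[hi].x */
\c     while (hi - lo > 1) {
\c         int mid = (lo + hi) / 2;
\c         if (p[mid].x < x0)
\c             lo = mid;
\c         else
\c             hi = mid;
\c     }
\c     return lo >= 0 ? p[lo].cumulative : 0.0;
\c }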

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types
\dash AVL tree, red-black tree, the various forms of B-tree \dash
will suffice. We then annotate every node of the tree with an extra
field giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions: a rotation or node split rearranges only a constant
number of nodes, and each affected node's annotation can be
recomputed in constant time from those of its children. So we can
start with an empty tree and insert a complete set of points, or we
can insert and delete points in arbitrary order, and the tree
annotations will remain valid at all times.

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange that
this search calculates the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we add to a running total the weight
annotation of each subtree to the left of the chosen one, together
with the weights of any points stored in the node itself to the left
of it. When the chosen subtree turns out to be \cw{NULL}, the
running total is the required answer.
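
In C, the search just described might look like the following
sketch, for the binary-tree case (again with invented names): when
the search steps rightward past a node, that node and its entire
left subtree are known to satisfy the bound, so their weight joins
the running total.

\c /* A search tree node carrying a total-weight annotation for the
\c  * whole subtree rooted at it. */
\c struct node {
\c     struct node *left, *right;
\c     double x, weight;       /* this point's coordinate and weight */
\c     double subtree_weight;  /* total weight in this subtree */
\c };
\c
\c static double tree_weight(const struct node *n)
\c {
\c     return n ? n->subtree_weight : 0.0;
\c }
\c
\c /* Total weight of points with x < x0. */
\c double tree_weight_below(const struct node *n, double x0)
\c {
\c     double total = 0.0;
\c     while (n) {
\c         if (n->x < x0) {
\c             /* n and all of n->left satisfy the bound. */
\c             total += tree_weight(n->left) + n->weight;
\c             n = n->right;
\c         } else {
\c             n = n->left;
\c         }
\c     }
\c     return total;
\c }

The B-tree version is the same idea, except that each step adds the
annotations of every subtree to the left of the one descended into.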

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the
tree to find the total weight of everything it contained with
\cw{x\_<\_x0}, and that would be the answer to our two-dimensional
query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
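
In C, the merge might be sketched as follows, reusing the node type
above; \cw{tree_insert} stands for an assumed balanced-tree
insertion that maintains the weight annotations.

\c /* Offline sweep: points[] sorted by y, queries[] sorted by y0.
\c  * (Invented names, for illustration only.) */
\c struct node *root = NULL;
\c int i = 0;
\c for (int q = 0; q < nqueries; q++) {
\c     /* Bring the tree up to date: insert every point strictly
\c      * below this query's y0. */
\c     while (i < npoints && points[i].y < queries[q].y0) {
\c         root = tree_insert(root, points[i].x, points[i].weight);
\c         i++;
\c     }
\c     /* The tree now holds exactly the points with y < y0. */
\c     answers[q] = tree_weight_below(root, queries[q].x0);
\c }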

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in
turn to a balanced tree sorted by \cw{x} with total-weight
annotations. The catch is that, this time, we never \e{modify} any
node in the tree: whenever the process of inserting an element into
a tree wants to modify a node, we instead make a \e{copy} of that
node containing the modified data. The parent of that node must now
be modified (even if it would previously not have needed
modification) so that it points at the copied node instead of the
original \dash so we do the same thing there, and copy that one too,
and so on up to the root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.
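
Here is a sketch of such a copy-on-write insertion in C, once more
with invented names, and with rebalancing omitted for brevity (a
real implementation would rebalance in the same copying style):

\c #include <stdlib.h>
\c
\c /* Insert (x, w) without modifying the old tree: copy each node on
\c  * the path from the root downward, sharing every untouched
\c  * subtree with the old tree. */
\c struct node *cow_insert(const struct node *n, double x, double w)
\c {
\c     struct node *copy = malloc(sizeof(struct node));
\c     if (!n) {
\c         copy->left = copy->right = NULL;
\c         copy->x = x;
\c         copy->weight = w;
\c     } else {
\c         *copy = *n;             /* share both subtrees for now */
\c         if (x < n->x)
\c             copy->left = cow_insert(n->left, x, w);
\c         else
\c             copy->right = cow_insert(n->right, x, w);
\c     }
\c     copy->subtree_weight = copy->weight +
\c         tree_weight(copy->left) + tree_weight(copy->right);
\c     return copy;    /* new root; the old root is still intact */
\c }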

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each stage of the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any moment in it.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
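
Putting the two searches together in the same hypothetical C terms:
if each saved root is stored alongside the \cw{y}-coordinate of the
point whose insertion created it, a quadrant query is one binary
search followed by one tree search. (For simplicity this sketch
assumes distinct \cw{y}-coordinates.)

\c struct root_entry {
\c     double y;                /* y-coordinate of the point whose
\c                               * insertion produced this root */
\c     struct node *root;
\c };
\c
\c double quadrant_weight(const struct root_entry *roots, int nroots,
\c                        double x0, double y0)
\c {
\c     /* Find the last saved root with y < y0: its tree contains
\c      * exactly the points with y < y0. */
\c     int lo = -1, hi = nroots;
\c     while (hi - lo > 1) {
\c         int mid = (lo + hi) / 2;
\c         if (roots[mid].y < y0)
\c             lo = mid;
\c         else
\c             hi = mid;
\c     }
\c     if (lo < 0)
\c         return 0.0;          /* no points at all with y < y0 */
\c     return tree_weight_below(roots[lo].root, x0);
\c }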

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log N)}
time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)}
new nodes. Hence, the total storage used by the combined set of
trees is \cw{O(N log N)}, much smaller than the \cw{O(N^2)} you'd
expect if the \cw{N} trees were all separate or even mostly
separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and
\cw{O(N log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}

We can reduce this to the same type of quadrant query as discussed
above by splitting it into two subqueries: the total size of files
with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is simply the total
size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size of
files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of
precisely the type we have just derived a data structure to answer.
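
As a one-line illustration, with \cw{quadrant} standing for a
hypothetical helper that answers a single quadrant query of the kind
described above:

\c /* Total size of files with P0 <= pathname < P1 and atime < T0. */
\c total = quadrant(P1, T0) - quadrant(P0, T0);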

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). Which
should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find
a tree root) be done on pathname, and the secondary lookup (on
\cw{x}, within that tree) be done on atime; then we can cache the
tree root found in the first lookup, and use it many times without
wasting time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate
instead of pathname.
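
One way to implement that relaxed rule, sketched below in C with
invented names (not \cw{agedu}'s actual code), is to reuse the node
structure from the earlier sketches, extended with a \cw{generation}
field stamping each node with the number of tree roots saved so far
at its creation: a node bearing the current stamp is not yet
reachable from any saved root, so it is safe to modify in place.

\c /* Return a node we are allowed to modify. Nodes created since the
\c  * last saved root carry the current generation stamp and can be
\c  * modified in place; older nodes must still be copied. */
\c struct node *modifiable(struct node *n, int generation)
\c {
\c     if (n->generation == generation)
\c         return n;         /* private to this generation already */
\c     struct node *copy = malloc(sizeof(struct node));
\c     *copy = *n;           /* subtree pointers remain shared */
\c     copy->generation = generation;
\c     return copy;
\c }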

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations, \e{each} of which could take time proportional
to the length of the string.

Finally, the restriction of the above data structure to two
dimensions unfortunately limits what \cw{agedu} can do. One would
like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to show only what their access permissions allowed them to see.
Sadly, doing this would require a third dimension in the data
structure (representing ownership or access control, in some
fashion), and hence cannot be done efficiently. \cw{agedu} is
therefore most sensibly used on demand by an individual user, so
that it generates a custom data set for that user every time.