\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both downward and leftward of
a user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight
of everything up to and including that point. Then, given any
\cw{x0}, a log-time binary search will find the last point in the
list with \cw{x\_<\_x0}, and we can immediately read off the
cumulative weight figure.
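
For concreteness, here is a minimal C sketch of that static
structure, under the assumption that the points are held in a plain
sorted array; the names (\cw{build_cumulative}, \cw{weight_below})
are illustrative, not taken from any real implementation.

\c /* Points sorted by x, with a running total of weights alongside. */
\c #include <stddef.h>
\c
\c struct point { double x, weight; };
\c
\c /* cum[i] = total weight of pts[0..i] inclusive. */
\c static void build_cumulative(const struct point *pts,
\c                              double *cum, size_t n)
\c {
\c     double total = 0;
\c     for (size_t i = 0; i < n; i++) {
\c         total += pts[i].weight;
\c         cum[i] = total;
\c     }
\c }
\c
\c /* Total weight of points with x < x0: binary-search for the
\c  * number of such points, then read off the running total. */
\c static double weight_below(const struct point *pts,
\c                            const double *cum, size_t n, double x0)
\c {
\c     size_t lo = 0, hi = n;  /* invariant: pts[0..lo) have x < x0 */
\c     while (lo < hi) {
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (pts[mid].x < x0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     return lo ? cum[lo - 1] : 0;
\c }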

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types
\dash AVL tree, red-black tree, the various forms of B-tree \dash
will suffice. We then annotate every node of the tree with an extra
field giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions. So we can start with an empty tree and insert a complete
set of points, or we can insert and delete points in arbitrary
order, and the tree annotations will remain valid at all times.
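
In C, the annotation might look like this (a sketch only: the
balancing machinery of whichever tree type you choose is omitted,
since it is unchanged apart from calling the fix-up function below
on every node it restructures):

\c struct node {
\c     struct node *left, *right;
\c     double x, weight;        /* the point stored at this node */
\c     double subtree_weight;   /* total weight in and below here */
\c };
\c
\c /* Re-establish the annotation on a node whose children have
\c  * changed. Insertions, deletions and rotations already touch
\c  * only O(log N) nodes, and calling this on each of them from
\c  * the bottom upward preserves that bound. */
\c static void fix_annotation(struct node *n)
\c {
\c     n->subtree_weight = n->weight
\c         + (n->left  ? n->left->subtree_weight  : 0)
\c         + (n->right ? n->right->subtree_weight : 0);
\c }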

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange that
this search calculates the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we look at the top node of each subtree
to the left of that one, and add their weight annotations \dash plus
the weights of any elements stored in the node itself to the left of
our chosen subtree \dash to a running total. When we reach a leaf
node and find our chosen subtree is \cw{NULL}, the running total is
the required answer.
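
In code, the counting search over the annotated node type sketched
above might look like this (again illustrative, and assuming a
binary tree for simplicity):

\c /* Total weight of points with x < x0. */
\c static double count_below(const struct node *n, double x0)
\c {
\c     double total = 0;
\c     while (n) {
\c         if (n->x < x0) {
\c             /* The left subtree and this node's own point all
\c              * lie strictly left of x0: count them and descend
\c              * rightward. */
\c             if (n->left)
\c                 total += n->left->subtree_weight;
\c             total += n->weight;
\c             n = n->right;
\c         } else {
\c             n = n->left;   /* this node and its right subtree
\c                             * are all >= x0: skip them */
\c         }
\c     }
\c     return total;
\c }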

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the
tree to find the total weight of everything it contained with
\cw{x\_<\_x0}, and that would be the answer to our two-dimensional
query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
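
A sketch of that offline process, reusing \cw{count_below} from
above and assuming a function \cw{tree_insert} which performs the
annotated balanced-tree insertion (declared but not shown here):

\c struct pt2 { double x, y, weight; };
\c struct query { double x0, y0; double answer; };
\c
\c /* Assumed: annotated balanced-tree insertion, returning the
\c  * (possibly new) root. Not shown. */
\c struct node *tree_insert(struct node *root, double x, double w);
\c
\c /* pts[] sorted by y, qs[] sorted by y0. Before answering each
\c  * query, insert every point with y < that query's y0. */
\c static void answer_offline(const struct pt2 *pts, size_t npts,
\c                            struct query *qs, size_t nqs)
\c {
\c     struct node *root = NULL;
\c     size_t i = 0;
\c     for (size_t j = 0; j < nqs; j++) {
\c         while (i < npts && pts[i].y < qs[j].y0) {
\c             root = tree_insert(root, pts[i].x, pts[i].weight);
\c             i++;
\c         }
\c         qs[j].answer = count_below(root, qs[j].x0);
\c     }
\c }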

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in
turn to a balanced tree sorted by \cw{x} with total-weight
annotations. The catch is that, this time, we never \e{modify} any
node in the tree: whenever the process of inserting an element into
a tree wants to modify a node, we instead make a \e{copy} of that
node containing the modified data. The parent of that node must now
be modified (even if it would previously not have needed
modification) so that it points at the copied node instead of the
original \dash so we do the same thing there, and copy that one too,
and so on up to the root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.
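
Here is a sketch of such a path-copying insertion, reusing
\cw{fix_annotation} from above. For brevity it is written on an
unbalanced binary search tree; a real version would also perform the
rebalancing rotations, copying each node they touch in the same way.

\c #include <stdlib.h>
\c
\c static struct node *clone(const struct node *n)
\c {
\c     struct node *c = malloc(sizeof *c);
\c     *c = *n;
\c     return c;
\c }
\c
\c /* Returns the root of the new tree; the old root, and every
\c  * node reachable from it, is left untouched and remains valid. */
\c static struct node *insert_cow(const struct node *n,
\c                                double x, double w)
\c {
\c     struct node *c;
\c     if (!n) {
\c         c = malloc(sizeof *c);
\c         c->left = c->right = NULL;
\c         c->x = x;
\c         c->weight = c->subtree_weight = w;
\c         return c;
\c     }
\c     c = clone(n);              /* copy, never modify in place */
\c     if (x < n->x)
\c         c->left = insert_cow(n->left, x, w);
\c     else
\c         c->right = insert_cow(n->right, x, w);
\c     fix_annotation(c);
\c     return c;
\c }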

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each point in the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
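
Assuming a simple sorted-array representation for the roots, the
whole two-dimensional query might look like this: \cw{ys[]} holds
the points' \cw{y}-coordinates in sorted order, and \cw{roots[k]} is
the tree root obtained after inserting the first \cw{k} points (so
\cw{roots[0]} is the empty tree).

\c static double quadrant_query(const double *ys,
\c                              struct node *const *roots,
\c                              size_t n, double x0, double y0)
\c {
\c     /* First log-time search: k = number of points with y < y0. */
\c     size_t lo = 0, hi = n;
\c     while (lo < hi) {
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (ys[mid] < y0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     /* roots[lo] describes precisely the points with y < y0;
\c      * second log-time search: count weight with x < x0. */
\c     return count_below(roots[lo], x0);
\c }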

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes
\cw{O(log N)} time, it must modify at most \cw{O(log N)} nodes. Each
of these node-copying insertion processes must copy all of those
nodes, but need not copy any others \dash so it creates at most
\cw{O(log N)} new nodes. Hence, the total storage used by the
combined set of trees is \cw{O(N log N)}, much smaller than the
\cw{O(N^2)} you'd expect if the trees were all separate or even
mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and
\cw{O(N log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}

Clearly, we can simplify this to the same type of quadrant query as
discussed above, by splitting this into two subqueries: the total
size of files with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is
clearly the total size of files with \cw{P\_<\_P1, T\_<\_T0} minus
the total size of files with \cw{P\_<\_P0, T\_<\_T0}. Each of those
subqueries is of precisely the type we have just derived a data
structure to answer.
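
In code, the decomposition is a one-liner. Here
\cw{quadrant_query_str} is a hypothetical wrapper around the
structure described above, with pathnames as the \cw{y}-coordinate
and atimes as \cw{x}; neither name is taken from \cw{agedu} itself.

\c #include <time.h>
\c
\c struct index;   /* the on-disk index, details not shown */
\c
\c /* Assumed: total size of files with pathname < P and atime < T. */
\c double quadrant_query_str(const struct index *ix,
\c                           const char *P, time_t T);
\c
\c /* Total size of files with P0 <= pathname < P1 and atime < T0. */
\c double size_in_range(const struct index *ix, const char *P0,
\c                      const char *P1, time_t T0)
\c {
\c     return quadrant_query_str(ix, P1, T0)
\c          - quadrant_query_str(ix, P0, T0);
\c }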

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). So
which should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find
a tree root) be done on pathname, and the secondary lookup (on
\cw{x}, within that tree) be done on atime; then we can cache the
tree root found in the first lookup, and use it many times without
wasting time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the
\cw{y}-coordinate instead of pathname.
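
One natural way to implement that relaxed rule is to stamp each node
with the \q{generation} in which it was created, bumping a global
counter every time a tree root is saved; a node may then be modified
in place exactly when its stamp matches the current generation. This
is an illustrative sketch, not \cw{agedu}'s actual code:

\c #include <stdlib.h>
\c
\c static unsigned current_gen;   /* bumped when a root is saved */
\c
\c struct gnode {
\c     struct gnode *left, *right;
\c     double x, weight, subtree_weight;
\c     unsigned gen;          /* generation this node was created in */
\c };
\c
\c /* Return a node we are allowed to modify: the node itself if it
\c  * has already been copied since the last saved root, otherwise
\c  * a fresh copy stamped with the current generation. */
\c static struct gnode *writable(struct gnode *n)
\c {
\c     struct gnode *c;
\c     if (n->gen == current_gen)
\c         return n;
\c     c = malloc(sizeof *c);
\c     *c = *n;
\c     c->gen = current_gen;
\c     return c;
\c }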

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.
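
A minimal in-memory trie lookup, to illustrate the time bound
(\cw{agedu}'s real trie is an on-disk structure laid out quite
differently):

\c struct trie {
\c     struct trie *child[256];  /* indexed by next byte of the key */
\c     struct node *root;        /* tree root for this exact pathname */
\c };
\c
\c /* One step per byte of the key: time proportional to the length
\c  * of the string, independent of how many strings are stored. */
\c static struct node *trie_lookup(const struct trie *t,
\c                                 const char *key)
\c {
\c     for (; t && *key; key++)
\c         t = t->child[(unsigned char)*key];
\c     return t ? t->root : NULL;
\c }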

Finally, the restriction to two dimensions unfortunately imposes
limitations on what \cw{agedu} can do. One would like to run a
single \cw{agedu} as a system-wide job on a file server (perhaps
nightly or weekly), and present the results to all users in such a
way that each user's view of the data was filtered to show only
what their access permissions allowed them to see. Sadly, doing this
would require a third dimension in the data structure (representing
ownership or access control, in some fashion), and hence cannot be
done efficiently. \cw{agedu} is therefore instead most sensibly used
on demand by an individual user, so that it generates a custom data
set for that user every time.