\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both downward and leftward of
a user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight
of everything up to and including that point. Then, given any
\cw{x0}, a log-time binary search will find the last point in the
list with \cw{x\_<\_x0}, and we can immediately read off the
cumulative weight figure.
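
For concreteness, here is a minimal C sketch of that static
structure, under the assumption that the points are held in a plain
sorted array; the names (\cw{build_cumulative}, \cw{weight_below})
are illustrative, not taken from any real implementation.

\c /* Points sorted by x, with a running total of weights alongside. */
\c #include <stddef.h>
\c
\c struct point { double x, weight; };
\c
\c /* cum[i] = total weight of pts[0..i] inclusive. */
\c static void build_cumulative(const struct point *pts,
\c                              double *cum, size_t n)
\c {
\c     double total = 0;
\c     for (size_t i = 0; i < n; i++) {
\c         total += pts[i].weight;
\c         cum[i] = total;
\c     }
\c }
\c
\c /* Total weight of points with x < x0: binary-search for the
\c  * number of such points, then read off the running total. */
\c static double weight_below(const struct point *pts,
\c                            const double *cum, size_t n, double x0)
\c {
\c     size_t lo = 0, hi = n;  /* invariant: pts[0..lo) have x < x0 */
\c     while (lo < hi) {
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (pts[mid].x < x0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     return lo ? cum[lo - 1] : 0;
\c }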

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types
\dash AVL tree, red-black tree, the various forms of B-tree \dash
will suffice. We then annotate every node of the tree with an extra
field giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions. So we can start with an empty tree and insert a complete
set of points, or we can insert and delete points in arbitrary
order, and the tree annotations will remain valid at all times.
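
In C, the annotation might look like this (a sketch only: the
balancing machinery of whichever tree type you choose is omitted,
since it is unchanged apart from calling the fix-up function below
on every node it restructures):

\c struct node {
\c     struct node *left, *right;
\c     double x, weight;        /* the point stored at this node */
\c     double subtree_weight;   /* total weight in and below here */
\c };
\c
\c /* Re-establish the annotation on a node whose children have
\c  * changed. Insertions, deletions and rotations already touch
\c  * only O(log N) nodes, and calling this on each of them from
\c  * the bottom upward preserves that bound. */
\c static void fix_annotation(struct node *n)
\c {
\c     n->subtree_weight = n->weight
\c         + (n->left  ? n->left->subtree_weight  : 0)
\c         + (n->right ? n->right->subtree_weight : 0);
\c }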

A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange that
this search calculates the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we look at the top node of each subtree
to the left of that one, and add their weight annotations \dash plus
the weights of any elements stored in the node itself to the left of
our chosen subtree \dash to a running total. When we reach a leaf
node and find our chosen subtree is \cw{NULL}, the running total is
the required answer.
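
In code, the counting search over the annotated node type sketched
above might look like this (again illustrative, and assuming a
binary tree for simplicity):

\c /* Total weight of points with x < x0. */
\c static double count_below(const struct node *n, double x0)
\c {
\c     double total = 0;
\c     while (n) {
\c         if (n->x < x0) {
\c             /* The left subtree and this node's own point all
\c              * lie strictly left of x0: count them and descend
\c              * rightward. */
\c             if (n->left)
\c                 total += n->left->subtree_weight;
\c             total += n->weight;
\c             n = n->right;
\c         } else {
\c             n = n->left;   /* this node and its right subtree
\c                             * are all >= x0: skip them */
\c         }
\c     }
\c     return total;
\c }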

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the
tree to find the total weight of everything it contained with
\cw{x\_<\_x0}, and that would be the answer to our two-dimensional
query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
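
A sketch of that offline process, reusing \cw{count_below} from
above and assuming a function \cw{tree_insert} which performs the
annotated balanced-tree insertion (declared but not shown here):

\c struct pt2 { double x, y, weight; };
\c struct query { double x0, y0; double answer; };
\c
\c /* Assumed: annotated balanced-tree insertion, returning the
\c  * (possibly new) root. Not shown. */
\c struct node *tree_insert(struct node *root, double x, double w);
\c
\c /* pts[] sorted by y, qs[] sorted by y0. Before answering each
\c  * query, insert every point with y < that query's y0. */
\c static void answer_offline(const struct pt2 *pts, size_t npts,
\c                            struct query *qs, size_t nqs)
\c {
\c     struct node *root = NULL;
\c     size_t i = 0;
\c     for (size_t j = 0; j < nqs; j++) {
\c         while (i < npts && pts[i].y < qs[j].y0) {
\c             root = tree_insert(root, pts[i].x, pts[i].weight);
\c             i++;
\c         }
\c         qs[j].answer = count_below(root, qs[j].x0);
\c     }
\c }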

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in
turn to a balanced tree sorted by \cw{x} with total-weight
annotations. The catch is that, this time, we never \e{modify} any
node in the tree: whenever the process of inserting an element into
a tree wants to modify a node, we instead make a \e{copy} of that
node containing the modified data. The parent of that node must now
be modified (even if it would previously not have needed
modification) so that it points at the copied node instead of the
original \dash so we do the same thing there, and copy that one too,
and so on up to the root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.
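
Here is a sketch of such a path-copying insertion, reusing
\cw{fix_annotation} from above. For brevity it is written on an
unbalanced binary search tree; a real version would also perform the
rebalancing rotations, copying each node they touch in the same way.

\c #include <stdlib.h>
\c
\c static struct node *clone(const struct node *n)
\c {
\c     struct node *c = malloc(sizeof *c);
\c     *c = *n;
\c     return c;
\c }
\c
\c /* Returns the root of the new tree; the old root, and every
\c  * node reachable from it, is left untouched and remains valid. */
\c static struct node *insert_cow(const struct node *n,
\c                                double x, double w)
\c {
\c     struct node *c;
\c     if (!n) {
\c         c = malloc(sizeof *c);
\c         c->left = c->right = NULL;
\c         c->x = x;
\c         c->weight = c->subtree_weight = w;
\c         return c;
\c     }
\c     c = clone(n);              /* copy, never modify in place */
\c     if (x < n->x)
\c         c->left = insert_cow(n->left, x, w);
\c     else
\c         c->right = insert_cow(n->right, x, w);
\c     fix_annotation(c);
\c     return c;
\c }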

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each point in the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
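
Assuming a simple sorted-array representation for the roots, the
whole two-dimensional query might look like this: \cw{ys[]} holds
the points' \cw{y}-coordinates in sorted order, and \cw{roots[k]} is
the tree root obtained after inserting the first \cw{k} points (so
\cw{roots[0]} is the empty tree).

\c static double quadrant_query(const double *ys,
\c                              struct node *const *roots,
\c                              size_t n, double x0, double y0)
\c {
\c     /* First log-time search: k = number of points with y < y0. */
\c     size_t lo = 0, hi = n;
\c     while (lo < hi) {
\c         size_t mid = lo + (hi - lo) / 2;
\c         if (ys[mid] < y0)
\c             lo = mid + 1;
\c         else
\c             hi = mid;
\c     }
\c     /* roots[lo] describes precisely the points with y < y0;
\c      * second log-time search: count weight with x < x0. */
\c     return count_below(roots[lo], x0);
\c }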

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes
\cw{O(log N)} time, it must modify at most \cw{O(log N)} nodes. Each
of these node-copying insertion processes must copy all of those
nodes, but need not copy any others \dash so it creates at most
\cw{O(log N)} new nodes. Hence, the total storage used by the
combined set of trees is \cw{O(N log N)}, much smaller than the
\cw{O(N^2)} you'd expect if the trees were all separate or even
mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and
\cw{O(N log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}

Clearly, we can simplify this to the same type of quadrant query as
discussed above, by splitting this into two subqueries: the total
size of files with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is
clearly the total size of files with \cw{P\_<\_P1, T\_<\_T0} minus
the total size of files with \cw{P\_<\_P0, T\_<\_T0}. Each of those
subqueries is of precisely the type we have just derived a data
structure to answer.
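
In code, the decomposition is a one-liner. Here
\cw{quadrant_query_str} is a hypothetical wrapper around the
structure described above, with pathnames as the \cw{y}-coordinate
and atimes as \cw{x}; neither name is taken from \cw{agedu} itself.

\c #include <time.h>
\c
\c struct index;   /* the on-disk index, details not shown */
\c
\c /* Assumed: total size of files with pathname < P and atime < T. */
\c double quadrant_query_str(const struct index *ix,
\c                           const char *P, time_t T);
\c
\c /* Total size of files with P0 <= pathname < P1 and atime < T0. */
\c double size_in_range(const struct index *ix, const char *P0,
\c                      const char *P1, time_t T0)
\c {
\c     return quadrant_query_str(ix, P1, T0)
\c          - quadrant_query_str(ix, P0, T0);
\c }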

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). So
which should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find
a tree root) be done on pathname, and the secondary lookup (on
\cw{x}, within that tree) be done on atime; then we can cache the
tree root found in the first lookup, and use it many times without
wasting time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the
\cw{y}-coordinate instead of pathname.
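
One natural way to implement that relaxed rule is to stamp each node
with the \q{generation} in which it was created, bumping a global
counter every time a tree root is saved; a node may then be modified
in place exactly when its stamp matches the current generation. This
is an illustrative sketch, not \cw{agedu}'s actual code:

\c #include <stdlib.h>
\c
\c static unsigned current_gen;   /* bumped when a root is saved */
\c
\c struct gnode {
\c     struct gnode *left, *right;
\c     double x, weight, subtree_weight;
\c     unsigned gen;          /* generation this node was created in */
\c };
\c
\c /* Return a node we are allowed to modify: the node itself if it
\c  * has already been copied since the last saved root, otherwise
\c  * a fresh copy stamped with the current generation. */
\c static struct gnode *writable(struct gnode *n)
\c {
\c     struct gnode *c;
\c     if (n->gen == current_gen)
\c         return n;
\c     c = malloc(sizeof *c);
\c     *c = *n;
\c     c->gen = current_gen;
\c     return c;
\c }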

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.
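
A minimal in-memory trie lookup, to illustrate the time bound
(\cw{agedu}'s real trie is an on-disk structure laid out quite
differently):

\c struct trie {
\c     struct trie *child[256];  /* indexed by next byte of the key */
\c     struct node *root;        /* tree root for this exact pathname */
\c };
\c
\c /* One step per byte of the key: time proportional to the length
\c  * of the string, independent of how many strings are stored. */
\c static struct node *trie_lookup(const struct trie *t,
\c                                 const char *key)
\c {
\c     for (; t && *key; key++)
\c         t = t->child[(unsigned char)*key];
\c     return t ? t->root : NULL;
\c }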

Finally, the restriction to two dimensions unfortunately imposes
limitations on what \cw{agedu} can do. One would like to run a
single \cw{agedu} as a system-wide job on a file server (perhaps
nightly or weekly), and present the results to all users in such a
way that each user's view of the data was filtered to show only
what their access permissions allowed them to see. Sadly, doing this
would require a third dimension in the data structure (representing
ownership or access control, in some fashion), and hence cannot be
done efficiently. \cw{agedu} is therefore instead most sensibly used
on demand by an individual user, so that it generates a custom data
set for that user every time.