\cfg{html-chapter-numeric}{true}
\cfg{html-chapter-suffix}{. }
\cfg{chapter}{Section}

\define{dash} \u2013{-}

\title A tree structure for log-time quadrant counting

This article describes a general form of data structure which
permits the storing of two-dimensional data in such a way that a
log-time lookup can reliably return the total \e{amount} of data in
an arbitrary quadrant (i.e. a region both downward and leftward of a
user-specified point).

The program
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu},
which uses a data structure of this type for its on-disk index
files, is used as a case study.

\C{problem} The problem

The basic problem can be stated as follows: you have a collection of
\cw{N} points, each specified as an \cw{(x,y)} coordinate pair, and
you want to be able to efficiently answer questions of the general
form \q{How many points have both \cw{x\_<\_x0} and \cw{y\_<\_y0}?},
for arbitrary \cw{(x0,y0)}.

A slightly enhanced form of the problem, which is no more difficult
to solve, allows each point to have a \e{weight} as well as
coordinates; now the question to be answered is \q{What is the
\e{total weight} of the points with both \cw{x\_<\_x0} and
\cw{y\_<\_y0}?}. The previous version of the question, of course, is
a special case of this in which all weights are set to 1.
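
To make the target behaviour concrete, here is a naive linear-time
reference in Python (an illustrative sketch of ours, not code from
\cw{agedu}; the name \cw{quadrant_weight} is our own). The structure
developed below answers the same query in \cw{O(log N)}:

```python
def quadrant_weight(points, x0, y0):
    """Total weight of the points (x, y, w) with x < x0 and y < y0.

    O(N) per query: the baseline the log-time structure improves on."""
    return sum(w for x, y, w in points if x < x0 and y < y0)
```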

The data structure presented in this article answers any question of
this type in time \cw{O(log N)}.

\C{onedim} The one-dimensional case

To begin with, we address the one-dimensional analogue of the
problem. If we have a set of points which each have an
\cw{x}-coordinate and a weight, how do we represent them so as to be
able to find the total weight of all points with \cw{x\_<\_x0}?

An obvious solution is to sort the points into order of their
\cw{x}-coordinate, and alongside each point to list the total weight of
everything up to and including that point. Then, given any \cw{x0}, a
log-time binary search will find the last point in the list with
\cw{x\_<\_x0}, and we can immediately read off the cumulative weight
figure.
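
A minimal Python sketch of this scheme, using the standard
\cw{bisect} module (function names are our own invention):

```python
import bisect

def build_prefix(points):
    """points: list of (x, weight) pairs.
    Returns the sorted x-coordinates and the cumulative weights."""
    pts = sorted(points)
    xs = [x for x, _ in pts]
    cum = []
    total = 0
    for _, w in pts:
        total += w
        cum.append(total)
    return xs, cum

def weight_below(xs, cum, x0):
    """Total weight of points with x < x0, by log-time binary search."""
    i = bisect.bisect_left(xs, x0)  # number of points with x < x0
    return cum[i - 1] if i > 0 else 0
```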

Now let's be more ambitious. What if the set of points were
constantly changing \dash new points arriving, old points going away
\dash and we wanted a data structure which would cope efficiently
with dynamic updates while maintaining the ability to answer these
count queries?

An efficient solution begins by storing the points in a balanced
tree, sorted by their \cw{x}-coordinates. Any of the usual types \dash
AVL tree, red-black tree, the various forms of B-tree \dash will
suffice. We then annotate every node of the tree with an extra field
giving the total weight of the points in and below that node.

These annotations can be maintained through updates to the tree,
without sacrificing the \cw{O(log N)} time bound on insertions or
deletions. So we can start with an empty tree and insert a complete
set of points, or we can insert and delete points in arbitrary
order, and the tree annotations will remain valid at all times.
A balanced tree sorted by \cw{x} can easily be searched to find the
last point with \cw{x\_<\_x0}, for any \cw{x0}. If the tree has
total-weight annotations as described above, we can arrange for this
search to calculate the total weight of all points with
\cw{x\_<\_x0}: every time we examine a tree node and decide which
subtree of it to descend to, we add to a running total the weight
annotation on the root of every subtree to the left of our chosen
one, plus the weights of any elements in the node itself that lie to
the left of it. When we reach a leaf node and find our chosen
subtree is \cw{NULL}, the running total is the required answer.
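
The descent can be sketched as follows for a plain binary search
tree (our own illustrative code; rebalancing is omitted for brevity,
so the worst-case time degrades, but the annotation and query logic
are unchanged in a balanced variant):

```python
class Node:
    def __init__(self, x, w):
        self.x, self.w = x, w
        self.left = self.right = None
        self.subtree_weight = w  # weight of this node and everything below

def insert(root, x, w):
    """Plain BST insert (no balancing, for brevity), updating the
    subtree-weight annotation on every node along the path."""
    if root is None:
        return Node(x, w)
    root.subtree_weight += w
    if x < root.x:
        root.left = insert(root.left, x, w)
    else:
        root.right = insert(root.right, x, w)
    return root

def weight_below(root, x0):
    """Total weight of points with x < x0: descend as in a search,
    accumulating whole subtrees that fall entirely below x0."""
    total = 0
    while root is not None:
        if root.x < x0:
            # this node, and its entire left subtree, lie below x0
            total += root.w
            if root.left is not None:
                total += root.left.subtree_weight
            root = root.right
        else:
            root = root.left
    return total
```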

So this data structure allows us to maintain a changing set of
points in such a way that we can do an efficient one-dimensional
count query at any time.

\C{incremental} An incremental approach

Now we'll use the above one-dimensional structure to answer a
restricted form of the original 2-D problem. Suppose we have some
large set of \cw{(x0,y0)} pairs for which we want to answer
counted-quadrant queries, and suppose (for the moment) that we have
\e{all of the queries presented in advance}. Is there any way we can
do that efficiently?

There is. We sort our points by their \cw{y}-coordinate, and then go
through them in that order adding them one by one to a balanced tree
sorted by \cw{x} as described above.

So for any query coordinates \cw{(x0,y0)}, there must be some moment
during that process at which we have added to our tree precisely
those points with \cw{y\_<\_y0}. At that moment, we could search the tree
to find the total weight of everything it contained with \cw{x\_<\_x0},
and that would be the answer to our two-dimensional query.

Hence, if we also sorted our queries into order by \cw{y0}, we could
progress through the list of queries in parallel with the list of
points (much like merging two sorted lists), answering each query at
the moment when the tree contained just the right set of points to
make it easy.
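
The merge can be sketched like this (our own illustrative code). For
brevity the sketch keeps the inserted points in a plain sorted list
rather than the annotated balanced tree, so only the sweep logic,
not the per-query cost, is representative:

```python
import bisect

def offline_quadrant_queries(points, queries):
    """points: (x, y, w) triples; queries: (x0, y0) pairs, all known
    up front. Sweep both lists in order of y, inserting each point
    into a structure sorted by x just before answering the first
    query whose y0 exceeds its y. Returns answers in the order the
    queries were originally given."""
    pts = sorted(points, key=lambda p: p[1])                # by y
    order = sorted(range(len(queries)),
                   key=lambda i: queries[i][1])             # by y0
    answers = [0] * len(queries)
    xs, ws = [], []           # x-sorted points inserted so far
    j = 0
    for i in order:
        x0, y0 = queries[i]
        while j < len(pts) and pts[j][1] < y0:              # add y < y0
            x, _, w = pts[j]
            k = bisect.bisect_left(xs, x)
            xs.insert(k, x)
            ws.insert(k, w)
            j += 1
        k = bisect.bisect_left(xs, x0)                      # count x < x0
        answers[i] = sum(ws[:k])
    return answers
```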

\C{cow} Copy-on-write

In real life, of course, we typically don't receive all our queries
in a big batch up front. We want to construct a data structure
capable of answering \e{any} query efficiently, and then be able to
deal with queries as they arise.

A data structure capable of this, it turns out, is only a small step
away from the one described in the previous section. The small step
is \e{copy-on-write}.

As described in the previous section, we go through our list of
points in order of their \cw{y}-coordinate, and we add each one in turn
to a balanced tree sorted by \cw{x} with total-weight annotations. The
catch is that, this time, we never \e{modify} any node in the tree:
whenever the process of inserting an element into a tree wants to
modify a node, we instead make a \e{copy} of that node containing
the modified data. The parent of that node must now be modified
(even if it would previously not have needed modification) so that
it points at the copied node instead of the original \dash so we do
the same thing there, and copy that one too, and so on up to the
root.

So after we have done a single insertion by this method, we end up
with a new tree root which describes the new tree \dash but the old
tree root, describing the tree before the insertion, is still valid,
because every node reachable from the old root is unchanged.

Therefore, if we start with an empty tree and repeatedly do this, we
will end up with a distinct tree root for each point in the process,
and \e{all of them will be valid at once}. It's as if we had done
the incremental tree construction in the previous section, but could
rewind time to any point in the process.

So now all we have to do is make a list of those tree roots, with
their associated \cw{y}-coordinates. Any way of doing this will do
\dash another balanced tree, or a simple sorted list to be
binary-searched, or something more exotic, whatever is convenient.

Then we can answer an arbitrary quadrant query using only a pair of
log-time searches: given a query coordinate pair \cw{(x0,y0)}, we
look through our list of tree roots to find the one describing
precisely the set of points with \cw{y\_<\_y0}, and then do a
one-dimensional count query into that tree for the total weight of
points with \cw{x\_<\_x0}. Done!
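
Here is an illustrative Python sketch of the whole scheme (our own
code, with rebalancing again omitted for brevity): a path-copying
insert that never mutates an existing node, a sorted list of saved
roots keyed by \cw{y}, and the two-search query:

```python
import bisect

class Node:
    __slots__ = ('x', 'w', 'left', 'right', 'subtree_weight')
    def __init__(self, x, w, left=None, right=None):
        self.x, self.w, self.left, self.right = x, w, left, right
        self.subtree_weight = w  # total weight in and below this node
        if left:
            self.subtree_weight += left.subtree_weight
        if right:
            self.subtree_weight += right.subtree_weight

def insert(root, x, w):
    """Path-copying insert: allocates fresh nodes along the search
    path and never mutates an old one, so every previous root stays
    valid. (A real version would also rebalance.)"""
    if root is None:
        return Node(x, w)
    if x < root.x:
        return Node(root.x, root.w, insert(root.left, x, w), root.right)
    else:
        return Node(root.x, root.w, root.left, insert(root.right, x, w))

def weight_below(root, x0):
    """One-dimensional count query: total weight with x < x0."""
    total = 0
    while root is not None:
        if root.x < x0:
            total += root.w
            if root.left:
                total += root.left.subtree_weight
            root = root.right
        else:
            root = root.left
    return total

def build(points):
    """Insert points in order of y, keeping every intermediate root.
    Returns the sorted y-coordinates and the parallel list of roots
    (roots[i] holds exactly the i points of smallest y)."""
    ys, roots = [], [None]
    for x, y, w in sorted(points, key=lambda p: p[1]):
        ys.append(y)
        roots.append(insert(roots[-1], x, w))
    return ys, roots

def quadrant_weight(ys, roots, x0, y0):
    """Total weight with x < x0 and y < y0: one log-time search for
    the right root, then one log-time count query within it."""
    i = bisect.bisect_left(ys, y0)   # number of points with y < y0
    return weight_below(roots[i], x0)
```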

The nice thing about all of this is that \e{nearly all} the nodes in
each tree are shared with the next one. Consider: since the
operation of adding an element to a balanced tree takes \cw{O(log N)}
time, it must modify at most \cw{O(log N)} nodes. Each of these
node-copying insertion processes must copy all of those nodes, but
need not copy any others \dash so it creates at most \cw{O(log N)} new
nodes. Hence, the total storage used by the combined set of trees is
\cw{O(N log N)}, much smaller than the \cw{O(N^2)} you'd expect if
the trees were all separate or even mostly separate.

\C{limitations} Limitations

The one-dimensional data structure described at the start of this
article is dynamically updatable: if the set of points to be
searched changes, the structure can be modified efficiently without
losing its searching properties. The two-dimensional structure we
ended up with is not: if a single point changes its coordinates or
weight, or appears, or disappears, then the whole structure must be
rebuilt.

Since the technique I used to add an extra dimension is critically
dependent on the dynamic updatability of the one-dimensional base
structure, but the resulting structure is not dynamically updatable
in turn, it follows that this technique cannot be applied twice: no
analogous transformation will construct a \e{three}-dimensional
structure capable of counting the total weight of an octant
\cw{\{x\_<\_x0, y\_<\_y0, z\_<\_z0\}}. I know of no efficient way
to do that.

The structure as described above uses \cw{O(N log N)} storage. Many
algorithms using \cw{O(N log N)} time are considered efficient (e.g.
sorting), but space is generally more expensive than time, and \cw{O(N
log N)} space is larger than you think!

\C{application} An application: \cw{agedu}

The application for which I developed this data structure is
\W{http://www.chiark.greenend.org.uk/~sgtatham/agedu/}\cw{agedu}, a
program which scans a file system and indexes pathnames against
last-access times (\q{atimes}, in Unix terminology) so as to be able
to point out directories which take up a lot of disk space but whose
contents have not been accessed in years (making them strong
candidates for tidying up or archiving to save space).

So the fundamental searching primitive we want for \cw{agedu} is
\q{What is the total size of the files contained within some
directory path \cw{Ptop} which have atime at most \cw{T0}?}

We begin by sorting the files into order by their full pathname.
This brings every subdirectory, at every level, together into a
contiguous sequence of files. So now our query primitive becomes
\q{What is the total size of files whose pathname falls between
\cw{P0} and \cw{P1}, and whose atime is at most \cw{T0}?}
We can reduce this to the same type of quadrant query as discussed
above by splitting it into two subqueries: the total size of files
with \cw{P0\_<=\_P\_<\_P1} and \cw{T\_<\_T0} is clearly the total
size of files with \cw{P\_<\_P1, T\_<\_T0} minus the total size of
files with \cw{P\_<\_P0, T\_<\_T0}. Each of those subqueries is of
precisely the type we have just derived a data structure to answer.
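
The subtraction is straightforward; in this sketch the quadrant
primitive is a stand-in linear scan (our own illustrative names, not
\cw{agedu}'s actual code):

```python
def quadrant_weight(files, p1, t0):
    """Total size of files (path, atime, size) with path < p1 and
    atime < t0. Stand-in for the real log-time structure."""
    return sum(s for p, a, s in files if p < p1 and a < t0)

def range_weight(files, p0, p1, t0):
    """Total size with p0 <= path < p1 and atime < t0, expressed as
    the difference of two quadrant queries."""
    return quadrant_weight(files, p1, t0) - quadrant_weight(files, p0, t0)
```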

So we want to sort our files by two \q{coordinates}: one is atime,
and the other is pathname (sorted in ASCII collation order). So
which should be \cw{x} and which \cw{y}, in the above notation?

Well, either way round would work in principle, but two criteria
inform the decision. Firstly, \cw{agedu} typically wants to do many
queries on the same pathname for different atimes, so as to build up
a detailed profile of size against atime for a given subdirectory.
So it makes sense to have the first-level lookup (on \cw{y}, to find a
tree root) be done on pathname, and the secondary lookup (on \cw{x},
within that tree) be done on atime; then we can cache the tree root
found in the first lookup, and use it many times without wasting
time repeating the pathname search.

Another important point for \cw{agedu} is that not all tree roots
are actually used: the only pathnames ever submitted to a quadrant
search are those at the start or the end of a particular
subdirectory. This allows us to save a lot of disk space by limiting
the copy-on-write behaviour: instead of \e{never} modifying an
existing tree node, we now rule that we may modify a tree node \e{if
it has already been modified once since the last tree root we
saved}. In real-world cases, this cuts down the total space usage by
about a factor of five, so it's well worthwhile \dash and we
wouldn't be able to do it if we'd used atime as the \cw{y}-coordinate
instead of pathname.

Since the \q{\cw{y}-coordinates} in \cw{agedu} are strings, the
top-level lookup to find a tree root is most efficiently done using
neither a balanced tree nor a sorted list, but a trie: tries allow
lookup of a string in time proportional to the length of the string,
whereas either of the other approaches would require \cw{O(log N)}
compare operations \e{each} of which could take time proportional to
the length of the string.
Finally, the two-dimensional limitation on the above data structure
unfortunately imposes limitations on what \cw{agedu} can do. One
would like to run a single \cw{agedu} as a system-wide job on a file
server (perhaps nightly or weekly), and present the results to all
users in such a way that each user's view of the data was filtered
to only what their access permissions allowed them to see. Sadly,
to do this would require a third dimension in the data structure
(representing ownership or access control, in some fashion), and
hence cannot be done efficiently. \cw{agedu} is therefore instead
most sensibly used on demand by an individual user, so that it
generates a custom data set for that user every time.