\cfg{html-leaf-level}{0}
\cfg{chapter}{Section}
\cfg{text-title-align}{left}
\cfg{text-indent}{0}
\cfg{text-chapter-numeric}{yes}
\cfg{text-chapter-suffix}{. }
\cfg{text-chapter-underline}{-}
\cfg{text-section-numeric}{0}{yes}
\cfg{text-section-suffix}{0}{. }
\cfg{text-section-underline}{0}{-}
\cfg{html-chapter-numeric}{yes}
\cfg{html-chapter-suffix}{. }
\cfg{html-section-numeric}{0}{yes}
\cfg{html-section-suffix}{0}{. }
\cfg{html-section-numeric}{1}{yes}
\cfg{html-section-suffix}{1}{. }

\title An Efficient Data Structure For A Hex Editor

by \W{http://pobox.com/~anakin/}{Simon Tatham}

\C{intro} Introduction

Hex editors have been around for a long time, and at the very basic
level they are very simple to write. Since they are mostly used for
editing files such as executables, which contain a lot of
cross-references to particular byte positions in the file, a hex
editor need not have an insert mode in order to be useful. And a hex
editor without an insert mode is very easy to implement: you simply
allocate a large enough array for the input file, and use that as
your data structure. The only operation you really need to be able
to do efficiently is to jump to a particular byte position, and
that's precisely what an array makes easy.

On the other hand, an insert mode can be useful in other
circumstances. Not \e{all} types of file you might want to edit have
the same restrictions as an executable. And as soon as you want your
hex editor to have an insert mode, the data structure question
becomes much more interesting.

In this article I present an efficient and scalable data structure
which supports all the operations needed by a hex editor.

\C{simple} Simple options

One technique used to support insert mode in editors is to use an
array larger than the file size, with a gap in it. The file contents
up to the current cursor position are stored at the start of the
array; the file contents from the current cursor position to the end
are stored at the end of the array; and the gap in the middle moves
about as the cursor does.

This makes insertion easy. When the user inserts an extra character,
you just add it to one end or the other of the gap. On the other hand,
\e{moving} through the file now becomes a slow operation; it's not
noticeable when you're moving by a byte, by a line, or even by a
screenful at a time, but as soon as you try to jump to the start or
end of the file, or jump to a particular specified file offset,
suddenly the editor has to bodily shift enormous amounts of file
data from one end of the gap to the other.
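
As a concrete illustration, here is a minimal sketch in C of the
gap-buffer idea (my own illustration, not taken from any particular
editor): inserting at the cursor is constant-time, but moving the
cursor has to copy data across the gap.

\c #include <stdio.h>
\c #include <string.h>
\c
\c struct gapbuf {
\c     char *data;        /* array of 'total' bytes with a gap in it */
\c     size_t total;      /* allocated size */
\c     size_t gap_start;  /* == cursor position in the file */
\c     size_t gap_end;    /* data resumes here and runs to 'total' */
\c };
\c
\c /* Move the cursor: shifts data across the gap, O(distance moved). */
\c void gb_seek(struct gapbuf *b, size_t pos)
\c {
\c     if (pos < b->gap_start) {
\c         size_t n = b->gap_start - pos;
\c         memmove(b->data + b->gap_end - n, b->data + pos, n);
\c         b->gap_start -= n; b->gap_end -= n;
\c     } else {
\c         size_t n = pos - b->gap_start;
\c         memmove(b->data + b->gap_start, b->data + b->gap_end, n);
\c         b->gap_start += n; b->gap_end += n;
\c     }
\c }
\c
\c /* Insert one byte at the cursor: O(1) while the gap is non-empty.
\c  * (A real editor would reallocate when the gap fills up.) */
\c void gb_insert(struct gapbuf *b, char c)
\c {
\c     b->data[b->gap_start++] = c;
\c }
\c
\c int main(void)
\c {
\c     char mem[16] = "helloworld";           /* file data, gap at end */
\c     struct gapbuf b = { mem, 16, 10, 16 };
\c     gb_seek(&b, 5);                        /* jump to offset 5 */
\c     gb_insert(&b, '_');                    /* now "hello_world" */
\c     printf("%.*s%.*s\n", (int)b.gap_start, b.data,
\c            (int)(b.total - b.gap_end), b.data + b.gap_end);
\c     return 0;
\c }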

Another slightly better option is to use a linked list of small
arrays, and to let the arrays vary in size between K and 2K bytes,
for some fixed minimum block size K. Inserting a single byte in the
middle of a block doesn't cost too much; occasionally the block will
grow beyond size 2K and have to be split into two smaller ones, but
even that isn't too slow.

Jumping to a particular position, however, is still an O(N)
operation using this structure. In practice it isn't \e{too} bad,
since the length of the linked list is at worst 1/K times the size
of the file; but once the file size becomes seriously big, this
approach does not scale well.

The common problem in both these methods is that as soon as you make
insertion a constant-time operation, seeking to a given byte
position becomes linear-time. Whereas in the original array format,
of course, seeking was constant-time but \e{insertion} became
linear-time.

\C{trees} Using balanced trees

This is where trees come in. Balanced tree structures (any of AVL
trees, red-black trees and B-trees) all solve this sort of problem
for sorted lists. You can insert an element into a balanced tree in
\e{log} time, and you can search for a particular element in log
time as well. This sounds like the kind of compromise we want: if
making insertion constant-time forces seeking to be linear and vice
versa, we would prefer to arrange for \e{both} to be log-time.

The conventional use of a balanced tree to store a sorted list,
however, is not immediately helpful to us. The only criterion we
could reasonably sort on would be byte position in the file; and as
soon as we store our data as a set of (position, data) pairs, we're
back to insertion being linear again, because we would have to alter
the position field of every tree element after the insertion point.

Is there anything we can do to our balanced trees to make this work
better?

\C{counted-trees} Counted trees

Yes, there is.

Suppose you add an additional field to every node of a balanced
tree. In that field, you store a count of the number of elements \e{in
or below} that node.

Operations which alter the tree (insertion and deletion) now have to
make sure these counts remain accurate. This can be done without
sacrificing the log-time characteristics of the operations. For
example, when you add an element, you increment the count of the
node containing it, and then work back up the tree to the root
incrementing the counts in all the nodes you go past. Since the
height of the tree is O(log N), this only takes you O(log N) time.

So we can add counts to a tree and still maintain it efficiently.
What have the counts bought us?

Once we have counts in a tree, they introduce an entirely new way to
\e{search} the tree. Starting at the root, we can search down the
tree by examining the count fields rather than comparing elements as
usual; and this allows us to find the Nth item in the tree, for any
N, in a single log-time search. For example, suppose the root tree node
contains a child with count 54, then an actual element, then a child
with count 73. Then:

\b If you are trying to get to a position less than 54, then you
descend straight to the leftmost child.

\b If you are trying to get to \e{exactly} position 54, you return
the element out of the root node.

\b If you are trying to get to position 55 or greater, you descend
to the rightmost child, and subtract 55 from your desired position.
(If you want element 57 of the tree, then you know there are 55
elements in the tree before the right-hand subtree, so you know you
want element 2 of the right-hand subtree.)

So now we have a means of finding the Nth item in a tree in a
log-time search. This is starting to look promising.
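
To make the count-based search concrete, here is a minimal sketch in
C (my own illustration, using a plain binary tree rather than any
particular balanced tree) of finding the Nth element by looking only
at the counts:

\c #include <stdio.h>
\c
\c struct node {
\c     struct node *left, *right;
\c     int count;           /* number of elements in or below this node */
\c     char element;        /* the single element stored at this node */
\c };
\c
\c /* Return the element at 0-based position 'index', using only counts. */
\c char index_tree(struct node *n, int index)
\c {
\c     while (n) {
\c         int left = n->left ? n->left->count : 0;
\c         if (index < left)
\c             n = n->left;           /* wanted element is on the left */
\c         else if (index == left)
\c             return n->element;     /* it's this node's own element */
\c         else {
\c             index -= left + 1;     /* skip left subtree and this node */
\c             n = n->right;
\c         }
\c     }
\c     return '?';                    /* index out of range */
\c }
\c
\c int main(void)
\c {
\c     /* Hand-built tree holding 'a','b','c' at positions 0,1,2. */
\c     struct node a = {NULL, NULL, 1, 'a'}, c = {NULL, NULL, 1, 'c'};
\c     struct node b = {&a, &c, 3, 'b'};
\c     int i;
\c     for (i = 0; i < 3; i++)
\c         printf("position %d holds '%c'\n", i, index_tree(&b, i));
\c     return 0;
\c }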

The trouble is, we're still stuck with having some sort of sorting
order on the tree. Now we need to deal with that.

\C{unsorted-trees} Unsorted trees

The simple answer to the sorting problem is to do away with sorting
the tree altogether!

Conventional balanced trees have a sorting order because it's used
to find elements in the tree, and to know where to add an element.
But we don't need a sorting order to find things any more, because
we can use a count-based search to jump to the Nth position. Can we
also use counts during the tree add operation, to allow us to
specify \e{where} we want to add our new element?

We can. Tree add algorithms start by searching down the tree to find
the position where the new element will be inserted. If we do this
search using counts, in exactly the same way described in
\k{counted-trees}, then we can add any element we like at any
position in the tree. Once we do this, of course, we have to throw
out the sorting order completely, and never do another order-based
search or insertion again, because they won't work. But that's OK,
because we didn't need them anyway.

Now we have exactly what we were after in the first place. We
have a data structure which stores an unordered list of items, in
such a way that we can insert or delete an item in log time \e{and}
find the Nth element in log time.
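
Insertion by position works the same way: here is a sketch (again my
own illustration; rebalancing, which a real AVL, red-black or B-tree
implementation would do at this point, is omitted for brevity) of
descending by counts to the insertion point and updating the counts
on the way down.

\c #include <stdlib.h>
\c
\c struct node {
\c     struct node *left, *right;
\c     int count;           /* number of elements in or below this node */
\c     char element;
\c };
\c
\c /* Insert 'c' so that it ends up at 0-based position 'index'.
\c  * NB: rebalancing is omitted; a real tree must also stay balanced. */
\c struct node *insert_at(struct node *n, int index, char c)
\c {
\c     int left;
\c     if (!n) {
\c         n = malloc(sizeof(*n));
\c         n->left = n->right = NULL;
\c         n->count = 1;
\c         n->element = c;
\c         return n;
\c     }
\c     n->count++;          /* one more element will live below here */
\c     left = n->left ? n->left->count : 0;
\c     if (index <= left)
\c         n->left = insert_at(n->left, index, c);
\c     else
\c         n->right = insert_at(n->right, index - left - 1, c);
\c     return n;
\c }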

\C{splitjoin} Splitting and joining trees

Now we can begin to get more ambitious. One issue we have not
addressed yet is cut and paste.

So far I have discussed tree insertion on the assumption that you
only ever insert one character at a time into your tree. In fact hex
editors need cut and paste just as much as normal text editors do;
so we must think about how to insert or remove a larger block of
data at a time.

One obvious way is to process each byte individually. A ten-byte
cut operation is ten individual deletions, and a ten-byte paste is
ten individual insertions. This is fine if you only ever use cut and
paste to move tiny chunks of data around a large file, but if you
need to move \e{half the file} from one place to another, things get
more interesting.

The linked-list structure discussed in \k{simple} would have helped
a lot with this problem. Linked lists don't just make it easy to
insert or delete one item: they make it just as easy to unlink an
enormous chunk of a list once you've found both ends of the chunk,
and you can link that chunk in somewhere else easily as well.

It turns out that you \e{can} do the same thing with balanced trees.
At this point it starts to make a difference what kind of balanced
tree you use: all three of AVL, red-black and B-trees support these
operations, but the precise methods vary between them. I'm going to
use B-trees from here on, because the algorithms are slightly simpler.

What we need are two basic operations. Given a counted, unsorted
B-tree containing an unordered list of items, we need to be able to:

\b Split the tree down the middle, giving two valid B-trees as output.

\b Take two valid B-trees and join them together end-to-end, giving
one B-tree containing all the data from tree A followed by the data
from tree B.

This will provide all the operations we need. To unlink a large
section from the middle of a tree, we split it in two places and
then join the outer two parts back together; to link a large section
\e{into} the middle of a tree, we split it at the insertion point,
join the left half on to the left side of the inserted section, and
join the right half on to the right side of the inserted section.

\H{joining} Joining two B-trees together

When you add an element to a B-tree, sometimes it ends up increasing
the size of a leaf node beyond the size limit. When that happens,
you deal with it by splitting the node in two, and transforming the
parent node so that where it previously had a single child pointer,
it now has two child pointers with an element between them. If that
makes the parent node too big as well, you do the same thing again,
and so on until you reach the tree root.

Joining two B-trees is therefore reasonably simple, \e{if} you have
an additional separating element to place in between them. Position
the two trees so that their leaf nodes are at the same level; now
(usually) one tree will be shorter than the other. So you can add
the root of the shorter tree as a sibling of the node next to it in
the taller tree; their common parent gains one extra child pointer
(pointing at the root of the shorter tree), separated from its
neighbour by the additional separating element. If this causes the
node to increase beyond the maximum size, just split it in two and
propagate up to its parent, just as in the ordinary insertion
process.

If the trees were originally the same height, just combine their
root nodes into a single larger root node. You need an extra element
to go in between the rightmost child pointer of the left-hand root
node, and the leftmost child pointer of the right-hand root node;
and again, this is where your separating element comes in. Again, if
the new root is too big to be a single node, split it in two and
create a new root above it.

So it turns out that it's very easy to join two trees together, but
the algorithm requires a spare element to go in the middle. However,
we normally don't have such a spare element: we just have two trees.
This is easily solved, though: we simply start by removing the
leftmost element of the right-hand tree using the ordinary tree
deletion algorithm. Then we just do the join algorithm, as described
above, using the element we just removed as our separator.

\H{splitting} Splitting a B-tree in two

To split a B-tree in two: we are given a tree, and a means of
searching down the tree to find the split point. (In this
application, that will be a numeric position, which we check against
the node counts on the way down; in other situations, we might
perfectly well want to split an ordinary \e{sorted} B-tree in half,
so we might have an ordering-based search criterion. It makes no
difference.)

We start in the simplest possible way. Start at the root node;
decide which of its subtree pointers you are going to descend down;
and saw the node in half at that subtree pointer. The two half-nodes
thus created will \e{each} need a subtree pointer to go on the cut
end, but that's OK because we're about to saw the next node down in
half as well and they can have half each. So descend to the next
node, decide on a split point again, saw that node in half, and put
a pointer to each half in the two halves of the parent node.

Once we finish this searching-and-cutting pass, we will have
successfully separated our tree into two parts at the required
point. However, the result will almost certainly not be a pair of
\e{valid} B-trees; the chances are that many of the nodes on the cut
edges will be below the minimum allowed node size. In fact, if at
any point our search criterion made us descend through the
\e{endmost} subtree pointer in any node, some of those nodes will
have no elements in them whatsoever, just a single subtree pointer!

So now we must make a healing pass down the cut edge of each tree,
to turn it back into a valid B-tree. We can start by throwing away
the root node if it has nothing but a single subtree pointer (which
will happen quite often if we split near one end of the original
tree, since in that case the output trees will almost certainly need
to be of different heights). Keep doing that until we find a real
root node.

One child of that node is on the cut edge, so it may be below the
minimum size. If it is, we solve this using its (valid) neighbour
node. If the neighbour is large, we can move some subtrees over into
the undersized node to make two correctly sized nodes; if the
neighbour is too small and does not have that many subtrees to
spare, we can instead \e{combine} the undersized node with its
neighbour. (And it turns out you can always do at least one of
these: if the neighbour is too large to combine with the undersized
node, then it \e{must} have enough subtrees for redistribution to
give two viable nodes.)

The only interesting case is that combining an undersized node with
its neighbour reduces the number of subtrees of their common parent
by one. Therefore:

\b As we go down, we arrange for each node on the cut edge to be at
least \e{one more than} minimum size, in case its size must drop by
one when we process its child. (This still just about works in all
cases.)

\b If the first non-trivial root node had only two children (recall
that the root node in a B-tree is the only node exempt from the
minimum size limit), and those two children end up having to be
combined, then the root node must be thrown away again and the
combined node is the new root.

Once we have sorted out each node, we descend to its child on the
cut edge, and do the same thing again. Eventually we reach the
bottom of the tree and every node is of valid size. Then we do the
same thing to the cut edge of the other tree, and we're done.

\C{copy-on-write} Cloning trees

The splitting and joining algorithms look as if they make cut and
paste pretty much trivial. You can split a big chunk out of your
editing buffer into a separate cut buffer easily enough; and then
you can \q{paste} it somewhere else by joining it back into the
middle of the editing buffer at a different position.

However, in real life, cut and paste isn't that simple. People often
want to paste the same data more than once; so you can't just link
the cut buffer straight into the editing buffer, because then you
no longer have it to link in again next time. You need to \e{copy}
the cut buffer and link in the copy. Equally, users often want to
press Copy rather than Cut, in which case you have to split the
buffer tree in two places, \e{copy} the middle section, and join all
three back together.

Copying a tree, it would seem, is inherently an O(N) operation;
there's no way you can copy a tree containing megabytes of data
without actually copying all that data.

Or is there?

It turns out that we \e{can} do better than this, by adding another
annotation field to each tree node. This time, the annotation is a
\e{reference count}: it counts the number of pointers to the node,
either from other tree nodes or from the \q{root} field in a tree
header structure. To begin with, of course, all reference counts are
1.

Reference counts are normally used for garbage collection. In this
case, though, I'm going to use them to implement \e{copy-on-write}.
All of the tree-altering algorithms (insertion and deletion, plus
the split and join algorithms described above) will now check the
reference count of a node before attempting to modify it. If they
find that they need to modify a node with a reference count greater
than one, they will not modify it. Instead, they will make a copy of
that node, and use the copy in place of the original. The copy links
to all the same child nodes as the original, so the reference count
in each child must be incremented; and the copied node's parent (or
tree header structure) now links to the copy rather than to the
original, so the reference count in the original must be
decremented. Now we are looking at a node with a reference count of
1, which means nobody else is using it so we can modify it safely.

The effect of this is that it is now a trivial - not merely log-time
but \e{constant}-time - operation to \e{clone} an entire B-tree, no
matter how large. We simply create a new tree header structure; we
point its root field at the root node of the input tree; and we
increment the reference count on that root node.
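
A minimal sketch in C of those two operations (my own illustration,
with everything except the reference counts and child pointers left
out): cloning a whole tree in constant time, and copying a shared
node just before modifying it.

\c #include <stdlib.h>
\c #include <string.h>
\c
\c #define MAXCHILD 4            /* tiny degree, for illustration only */
\c
\c struct node {
\c     int refcount;                 /* number of pointers to this node */
\c     struct node *child[MAXCHILD]; /* NULL where absent */
\c     /* ... counts, elements, etc. omitted ... */
\c };
\c
\c struct tree { struct node *root; };
\c
\c /* Clone a whole tree in O(1): share the root and bump its refcount. */
\c struct tree *clone_tree(struct tree *t)
\c {
\c     struct tree *u = malloc(sizeof(*u));
\c     u->root = t->root;
\c     if (u->root)
\c         u->root->refcount++;
\c     return u;
\c }
\c
\c /* Before modifying the node reached via *np, make sure we own it.
\c  * If it is shared, replace *np with a private copy. */
\c void unshare(struct node **np)
\c {
\c     struct node *n = *np, *copy;
\c     int i;
\c     if (n->refcount == 1)
\c         return;                   /* nobody else is using it */
\c     copy = malloc(sizeof(*copy));
\c     memcpy(copy, n, sizeof(*copy));
\c     copy->refcount = 1;           /* only we point at the copy */
\c     for (i = 0; i < MAXCHILD; i++)
\c         if (copy->child[i])
\c             copy->child[i]->refcount++;  /* copy shares the children */
\c     n->refcount--;                /* original loses our reference */
\c     *np = copy;
\c }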

Once we have cloned a tree like this, we can treat the original and
the clone as if they were entirely independent. If you add an
element to one of them, for example, then a single string of nodes
from the root down to one leaf will be duplicated and modified, but
the rest of the trees will still be held in common. You can split
either tree into lots of little pieces, or join it into the middle
of a larger one, and never affect the data stored in what was once
its clone, because every time you touch a node that the other tree
is depending on, you make your own copy rather than disturbing it.

This allows us to support \e{really} efficient cut and paste in our
hex editor. You select a 200Mb chunk and press Copy; the buffer tree
is split in two places (in log time), the middle section is cloned
(instantly), and the tree is joined back together. You'd hardly know
anything was different - but the cut buffer now contains a clone of
\e{part} of the original buffer, most of which consists of nodes
that are still shared with it. And you can paste in as many copies
as you like of that chunk, still in no worse than O(log N) time. The
best bit is that by the time you've done this a few times and have a
file that's 1600Mb longer than it started out, the hex editor isn't
actually using up 1600Mb more memory, because most of it is in
shared nodes! This technique naturally provides a form of
compression as well as being fast.

\C{lazy-loading} Lazy file loading

In all of the above I have been tacitly assuming that the data
elements stored in my tree are individual bytes. This would be
hideously inefficient if I were using AVL or red-black trees, in
which each node contains precisely one element: for every \e{byte}
of the file being edited, there would be an overhead of two child
pointers, a byte count and a reference count. On a normal 32-bit
machine, that's 20 bytes per node, not counting overhead from the
memory allocator. A factor of twenty is just ridiculous.

B-trees are a bit more flexible, since they can be made to have a
large minimum degree. A B-tree with a minimum node size of (say) 512
can contain up to 1023 bytes of data plus 1024 subtree pointers, and
those 1023 bytes can be packed together in memory so the overhead is
now more like a factor of five. Also, since no node in a B-tree ever
changes its height above ground level, you can just not bother to
allocate space for the 512 NULL child pointers in your leaf nodes,
and since the vast majority of your nodes will \e{be} leaf nodes,
the structure is now closer to being space-efficient.

There are other improvements one could make. For example, there's no
reason why a B-tree really needs to have the \e{same} minimum node
degree at every level; so you could have low-degree nodes everywhere
above the leaf level, and enormous leaf nodes containing 4-8Kb of
file data. You could move to B+ trees in which no actual data
elements were stored anywhere except in the leaf nodes, thus saving
the tiny alignment overheads in the other nodes.

However, there's a better direction to head in. In \k{simple} I
mentioned the idea of using a linked list as the main data
structure, and I said that each element of the linked list would be
a smallish array of file bytes (between size K and 2K). There's no
reason we couldn't do that in our B-tree-based approach: each
element stored in the B-tree is no longer a single byte but a small
block of bytes. It would mean that our element counts no longer
allowed us to jump to the Nth byte, only to the Nth \e{block}; but
we can fix that by replacing the element count with a byte count,
summing the total \e{size} of all the blocks in or below a
particular tree node. Now, given any byte position, we can do a
single log-time search and return a data block plus an offset within
that block.
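
Here is a minimal sketch in C of that search (my own illustration;
the node layout is invented for the example): each node records the
total byte size of every subtree, and the descent subtracts sizes
until the wanted position falls inside one of the node's own blocks.

\c #include <stdio.h>
\c #include <stddef.h>
\c
\c #define MAXDEG 4              /* tiny node size, for illustration */
\c
\c struct block { size_t size; const char *tag; };
\c
\c struct btnode {
\c     int nelems;                       /* blocks stored in this node */
\c     struct btnode *child[MAXDEG + 1]; /* NULL in a leaf */
\c     size_t childbytes[MAXDEG + 1];    /* bytes in or below child[i] */
\c     struct block *elem[MAXDEG];
\c };
\c
\c /* Find the block containing byte position pos; set *off to pos's
\c  * offset within that block, or return NULL if pos is past the end. */
\c struct block *find_byte(struct btnode *n, size_t pos, size_t *off)
\c {
\c     while (n) {
\c         int i;
\c         for (i = 0;; i++) {
\c             if (pos < n->childbytes[i]) {
\c                 n = n->child[i];      /* descend into subtree i */
\c                 break;
\c             }
\c             pos -= n->childbytes[i];
\c             if (i == n->nelems)
\c                 return NULL;          /* beyond the last block */
\c             if (pos < n->elem[i]->size) {
\c                 *off = pos;           /* inside this node's block i */
\c                 return n->elem[i];
\c             }
\c             pos -= n->elem[i]->size;
\c         }
\c     }
\c     return NULL;
\c }
\c
\c int main(void)
\c {
\c     /* A single leaf root holding two blocks of 3 and 5 bytes. */
\c     struct block a = {3, "A"}, b = {5, "B"};
\c     struct btnode root = {2, {NULL}, {0}, {&a, &b}};
\c     size_t off;
\c     struct block *blk = find_byte(&root, 6, &off);
\c     if (blk)
\c         printf("byte 6 is offset %lu in block %s\n",
\c                (unsigned long)off, blk->tag);
\c     return 0;
\c }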

This technique adds work to all operations. Inserting a byte, for
example, is now done by finding the block it needs to go into,
inserting it in that block, and potentially splitting the block into
two and doing an extra tree operation. Splitting and joining buffers
involves splitting and joining blocks at each end, and checking to
make sure undersized blocks are not created. So what does this
technique buy us, that makes it worthwhile over just storing single
bytes in the B-tree?

The answer is: once we have a block data structure as our tree
element, we can start having different \e{types} of block. In
particular, we can have a type of block which is a placeholder,
containing nothing but a file offset and length. A block of this
type indicates \q{at this point in the tree we have N bytes from
position P in the original file}. Blocks of this type are exempt
from the maximum size imposed on normal literal-data blocks.

The effect of this is that we no longer need to read the entire file
into memory when we start up. Instead, we just initialise our tree
trivially, so that it contains nothing but a single placeholder
block, with offset zero and size equal to the initial file size.

Now whenever we need to read data from the tree, and it turns out
the data in question is somewhere in a placeholder block, we must
refer back to the original input file in order to find the data (and
the placeholder block will tell us where in the file to read it
from). So before we do any editing, our hex editor is suddenly a
low-cost hex \e{file viewer}, which just pages back and forth and
refers to the disk all the time.
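
A minimal sketch in C of the two block types and of reading a byte
through them (my own illustration; a real editor would batch and
cache these reads rather than seeking for every byte):

\c #include <stdio.h>
\c #include <stdlib.h>
\c
\c enum blocktype { LITERAL, PLACEHOLDER };
\c
\c struct block {
\c     enum blocktype type;
\c     size_t size;          /* bytes this block represents */
\c     unsigned char *data;  /* LITERAL: the bytes themselves */
\c     long fileoff;         /* PLACEHOLDER: where they live in the file */
\c };
\c
\c /* Read one byte at offset 'off' within a block, consulting the
\c  * original file for placeholder blocks. */
\c int block_byte(struct block *b, size_t off, FILE *origfile)
\c {
\c     if (b->type == LITERAL)
\c         return b->data[off];
\c     if (fseek(origfile, b->fileoff + (long)off, SEEK_SET) != 0)
\c         return EOF;
\c     return getc(origfile);        /* placeholder: fetch from the file */
\c }
\c
\c /* On startup the whole buffer is one placeholder covering the file. */
\c struct block *initial_block(FILE *origfile)
\c {
\c     struct block *b = malloc(sizeof(*b));
\c     long len;
\c     fseek(origfile, 0, SEEK_END);
\c     len = ftell(origfile);
\c     b->type = PLACEHOLDER;
\c     b->size = (size_t)len;
\c     b->data = NULL;
\c     b->fileoff = 0;
\c     return b;
\c }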

But as soon as we start altering parts of the file, the placeholder
block gets broken up into smaller blocks, and literal-data blocks
begin to be created in between them. If we cut and paste a section
including a placeholder block, then the tree can end up containing
placeholder blocks in a strange order; it might (for example)
indicate something like \q{the first 192K of the input file; then
the literal bytes 5B 49 A7; then 25K of the input file starting from
position 12345; then 512K of the input file starting from position
204325}.

Now the hex editor \e{looks} as if it's doing exactly the same thing
as it did to begin with. I can page around the original file; I can
insert, delete, overwrite, cut, copy and paste to my heart's
content, and (provided no other process modifies the original file
under our feet) the data I am manipulating will remain consistent at
all times with the editing operations I have performed. But there
wasn't a big delay at startup when the file was loaded in, because
most of it \e{wasn't} loaded in; and if I list the running processes
on my system, the hex editor will not be using memory proportional
to the size of the file. It will only be using memory proportional
to the \e{changes} I've made to the file.

When I save the file, if there are any placeholder blocks remaining
in the buffer tree, the hex editor must write out the new version by
referring to the original. This is the \e{only} remaining operation,
apart from searching, that takes time proportional to the size of
the file. And there are \e{no} remaining operations which take
\e{memory} proportional to the size of the file.

(There is one thing you need to be careful of. Literal data blocks
must be permitted to fall below the minimum size K if there is no
literal block next to them to merge with; in particular, this is
vital if you are writing a binary file from scratch, since otherwise
you would never be able to give it a size between zero and K. But
this raises the possibility that given a pathological sequence of
editing operations, your data structure might end up being an
interleaving of one-byte literal blocks and one-byte placeholder
blocks, giving a huge space overhead. The simplest solution to this
is to impose a minimum size of 2K on \e{placeholder} blocks, below
which you read the relevant piece of file data and convert them into
literal blocks; then they can be merged with adjacent blocks and the
worst case is no longer terrible.)

We now have a data structure which does pretty much everything you
could reasonably ask a hex editor to be able to do, and does it all
at a reasonable memory cost and (apart from the two genuinely
necessary operations of searching and saving) \e{all} in O(log N)
time.

\C{further} Further directions

The data structure as I have presented it is suitable for use in a
high-performance hex editor with an insert mode.

There are a couple more points worth noting.

\H{further-texted} Conventional text editing

This structure would need only minor modifications to be an
efficient basis for a conventional text editor. In order to do this,
you would need to be able to jump quickly to a particular \e{line}
of the file, which means you'd need a node annotation counting
newlines.

In fact, it's possible to do slightly better than that: we can
devise a more complex node annotation which tracks the effect of an
arbitrary byte sequence on the (line, column) position. Assuming
that a physical tab character always advances the cursor to the next
multiple of 8 spaces, there are three possibilities:

\b A sequence of bytes containing no newlines or tabs simply adds
some number A to the column number, and does not affect the line
number.

\b A sequence of bytes containing no newlines but at least one tab
has the overall effect of adding some number A to the column, and
rounding it up to the next number that is congruent to B mod 8.

\b A sequence of bytes containing at least one newline has the
effect of adding some number A to the \e{line} number, and setting
the column number to a fixed value B.

These three function schemas are closed under composition (i.e.
combining any two of them gives another one). Storing one in each
node of a buffer tree would provide the ability to search directly
to \e{a particular (line, column) position} in a single log-time
search. So the text editor could treat its buffer as a simple
sequence of bytes (or possibly of Unicode characters). This is
superior to treating the buffer as a sequence of lines, because it
removes the distinction between inserting \e{within} a line and
inserting data \e{between} lines. In particular, cut and paste in a
line-based model is fiddly because lines must be spliced together at
each end of the pasted region; but cut and paste in this model is as
trivial as it was in the hex editor - you just cut a sequence of
bytes, paste it somewhere else, and the line/column indexing
automatically keeps up no matter what you do.
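
To make the closure claim concrete, here is a minimal sketch in C of
the three schemas as a data type, together with a composition
function (my own illustration; the exact convention that a tab always
advances the cursor, even when it is already on a tab stop, is an
assumption). Composing the annotations of a node's children and
elements gives each node the combined effect of everything in or
below it, which is what the log-time (line, column) search needs.

\c #include <stdio.h>
\c
\c /* Effect of a byte sequence on a (line, column) position. */
\c enum kind { COL_ADD, COL_TAB, LINE_SET };
\c struct effect {
\c     enum kind kind;
\c     int a;   /* COL_ADD/COL_TAB: columns added; LINE_SET: lines added */
\c     int b;   /* COL_TAB: residue mod 8; LINE_SET: final column */
\c };
\c
\c /* Smallest value strictly greater than x that is congruent to b mod 8. */
\c static int roundup(int x, int b)
\c {
\c     return x + 1 + ((b - (x + 1)) % 8 + 8) % 8;
\c }
\c
\c /* Apply an effect to a position. */
\c static void apply(struct effect e, int *line, int *col)
\c {
\c     if (e.kind == COL_ADD)      *col += e.a;
\c     else if (e.kind == COL_TAB) *col = roundup(*col + e.a, e.b);
\c     else                        { *line += e.a; *col = e.b; }
\c }
\c
\c /* Compose two effects: the result describes "f, then g". */
\c static struct effect compose(struct effect f, struct effect g)
\c {
\c     struct effect r;
\c     if (g.kind == LINE_SET) {
\c         /* g's newline wipes out whatever f did to the column. */
\c         r.kind = LINE_SET;
\c         r.a = (f.kind == LINE_SET ? f.a : 0) + g.a;
\c         r.b = g.b;
\c     } else if (f.kind == LINE_SET) {
\c         /* f fixes the column, so g's effect on it is a constant. */
\c         int line = 0, col = f.b;
\c         apply(g, &line, &col);
\c         r.kind = LINE_SET;
\c         r.a = f.a;
\c         r.b = col;
\c     } else if (g.kind == COL_ADD) {
\c         r = f;
\c         r.a += g.a;
\c         if (r.kind == COL_TAB)
\c             r.b = (r.b + g.a) % 8;   /* residue shifts with the add */
\c     } else {                         /* g is COL_TAB: it dominates */
\c         r.kind = COL_TAB;
\c         r.a = f.a + g.a;
\c         r.b = g.b;
\c         if (f.kind == COL_TAB)       /* f pins the residue, so g's
\c                                         rounding adds a fixed amount */
\c             r.a += roundup(f.b + g.a, g.b) - (f.b + g.a);
\c     }
\c     return r;
\c }
\c
\c int main(void)
\c {
\c     /* "ab\tcd": add 2 cols, tab, add 2 cols -> COL_TAB with A=4, B=2.
\c      * "x\ny":   one newline then one char  -> LINE_SET with A=1, B=1. */
\c     struct effect f = { COL_TAB, 4, 2 };
\c     struct effect g = { LINE_SET, 1, 1 };
\c     struct effect fg = compose(f, g);
\c     int line = 3, col = 5;
\c     apply(fg, &line, &col);
\c     printf("composite: line %d, col %d\n", line, col);  /* 4, 1 */
\c     return 0;
\c }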

The only snag is that if you did this, you would probably no longer
be able to do the trick with placeholder blocks and lazy file
loading; a text editor tends to need to know in advance where all
the newlines are in its buffer, so there would probably be no
alternative to physically loading the file. But in that, at least,
this data structure is no worse than any other.

\H{undo} Supporting undo

An undo function in an editor \e{conceptually} stores a sequence of
previous buffer states, and allows you to return to one of them when
you need to.

Usually, this is not actually implemented by storing copies of the
entire buffer, since that would be ludicrously wasteful of space!
Instead, a journal of changes is kept which allows previous buffer
states to be \e{reconstructed} by reversing the precise changes
made.

One could do that using this data structure, if one wanted to.
However, there's another intriguing option. Since cloning an
arbitrarily large tree is a cheap operation, you could implement
undo by \e{actually} storing a sequence of clones of previous buffer
states! The cost of this would be nothing like as bad as it would
na\u00EF{i}vely appear.

It might still not be ideal, though. Every time you clone a tree and
the two clones diverge, several nodes must be copied, and if each
node contains several blocks of literal data then the cost of
maintaining too many buffer clones might still become prohibitive.
But it's an interesting possibility regardless.

\C{summary} Summary

I've presented a design for a data structure which implements
practically every operation required for a hex editor in O(log N)
time, apart from one or two which genuinely \e{need} to be O(N).

The structure is:

\b A B-tree, each of whose elements is either a small array of
literal data bytes, or a placeholder block denoting a section of the
unmodified input file.

\b Each B-tree node is annotated with the total byte count of all
the elements in or below that node, allowing a log-time search to
pinpoint any numeric byte position.

\b Those counts provide the only necessary means of navigating the
tree, so there is no need for a sorting criterion.

\b Split and join algorithms make it possible to link and unlink
large chunks from the middle of a buffer at a time.

\b Reference counts implementing copy-on-write allow cloning of
chunks in constant time.

As a result:

\b Inserting or deleting bytes in the file is a log-time operation.

\b Finding a particular byte position is a log-time operation.

\b Cut and paste is always log-time, no matter how large or complex
the chunk of data being moved around.

\b Memory usage grows proportionally to the \e{changes} made to the
file, not the overall file size. (However, memory usage is also
\e{bounded} by a value proportional to the file size, even if you
keep editing and re-editing for ever.)

Searching must still be linear (there's no alternative to actually
reading the data if you need to know anything about its contents),
and saving the modified output file is linear (because you actually
must physically write out that much data), but \e{everything} else
can be done in log time.

I've also sketched a means of converting this into a data structure
for an ordinary text editor, and suggested interesting implications
in the area of undo operations.

\C{ref} References

Donald Knuth's \q{The Art of Computer Programming}
(\W{http://en.wikipedia.org/w/wiki.phtml?title=Special:Booksources&isbn=0201485419}{Addison-Wesley,
ISBN 0201485419}) presents at least some of the same ideas as this
article. Counted and unsorted trees are mentioned in volume 3;
splitting and joining are also described (although Knuth does them
on AVL trees, which are significantly more fiddly to split than
B-trees; you have to cut the tree into lots of little pieces, and
then put them all back together by using the join algorithm
repeatedly).

\q{Tweak}, a hex editor implementing this data structure, can be
downloaded at
\W{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}\cw{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}.

\versionid $Id$