\cfg{html-leaf-level}{0}
\cfg{chapter}{Section}
\cfg{text-title-align}{left}
\cfg{text-indent}{0}
\cfg{text-chapter-numeric}{yes}
\cfg{text-chapter-suffix}{. }
\cfg{text-chapter-underline}{-}
\cfg{text-section-numeric}{0}{yes}
\cfg{text-section-suffix}{0}{. }
\cfg{text-section-underline}{0}{-}
\cfg{html-chapter-numeric}{yes}
\cfg{html-chapter-suffix}{. }
\cfg{html-section-numeric}{0}{yes}
\cfg{html-section-suffix}{0}{. }
\cfg{html-section-numeric}{1}{yes}
\cfg{html-section-suffix}{1}{. }

\title An Efficient Data Structure For A Hex Editor

by \W{http://pobox.com/~anakin/}{Simon Tatham}

\C{intro} Introduction

Hex editors have been around for a long time, and at the most basic
level they are very simple to write. Since they are mostly used for
editing files such as executables, which contain a lot of
cross-references to particular byte positions in the file, a hex
editor need not have an insert mode in order to be useful. And a hex
editor without an insert mode is very easy to implement: you simply
allocate a large enough array for the input file, and use that as
your data structure. The only operation you really need to be able
to do efficiently is to jump to a particular byte position, and
that's precisely what an array makes easy.

On the other hand, an insert mode can be useful in other
circumstances. Not \e{all} types of file you might want to edit have
the same restrictions as an executable. And as soon as you want your
hex editor to have an insert mode, the data structure question
becomes much more interesting.

In this article I present an efficient and scalable data structure
which supports all the operations needed by a hex editor.

\C{simple} Simple options

One technique used to support insert mode in editors is to use an
array larger than the file size, with a gap in it. The file contents
up to the current cursor position are stored at the start of the
array; the file contents from the current cursor position to the end
are stored at the end of the array; and the gap in the middle moves
about as the cursor does.

This makes insertion easy. When the user inserts an extra character,
you just add it to one end or the other of the gap. On the other
hand, \e{moving} through the file now becomes a slow operation; it's
not noticeable when you're moving by a byte, by a line, or even by a
screenful at a time, but as soon as you try to jump to the start or
end of the file, or jump to a particular specified file offset,
suddenly the editor has to bodily shift enormous amounts of file
data from one end of the gap to the other.
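
For concreteness, here is a minimal sketch of a gap buffer in C. The
structure and function names are hypothetical, and the sketch
assumes the gap never fills up (a real editor would reallocate a
larger array when it did):

\c #include <stddef.h>
\c #include <string.h>
\c
\c /* A gap buffer: text[0..gap_start) and text[gap_end..size) hold
\c  * the file data; the bytes in between are the gap, which sits at
\c  * the cursor position. */
\c struct gapbuf {
\c     unsigned char *text;
\c     size_t gap_start, gap_end, size;
\c };
\c
\c /* Constant-time insertion: write into the gap at the cursor. */
\c void gb_insert(struct gapbuf *gb, unsigned char c)
\c {
\c     gb->text[gb->gap_start++] = c;   /* assumes the gap is not full */
\c }
\c
\c /* Moving the cursor costs time proportional to the distance moved:
\c  * everything between the old and new positions crosses the gap. */
\c void gb_seek(struct gapbuf *gb, size_t pos)
\c {
\c     if (pos < gb->gap_start) {
\c         size_t n = gb->gap_start - pos;
\c         memmove(gb->text + gb->gap_end - n, gb->text + pos, n);
\c         gb->gap_start -= n;
\c         gb->gap_end -= n;
\c     } else {
\c         size_t n = pos - gb->gap_start;
\c         memmove(gb->text + gb->gap_start, gb->text + gb->gap_end, n);
\c         gb->gap_start += n;
\c         gb->gap_end += n;
\c     }
\c }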

Another slightly better option is to use a linked list of small
arrays, and to let the arrays vary in size between K and 2K bytes,
for some fixed minimum block size K. Inserting a single byte in the
middle of a block doesn't cost too much; occasionally the block will
grow beyond size 2K and have to be split into two smaller ones, but
even that isn't too slow.

Jumping to a particular position, however, is still an O(N)
operation using this structure. In practice it isn't \e{too} bad,
since the length of the linked list is at worst 1/K times the size
of the file; but once the file size becomes seriously big, this
approach does not scale well.
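
A sketch of that structure, with hypothetical names; each list
element owns a small array which is kept between K and 2K bytes
full:

\c #include <stddef.h>
\c
\c #define K 4096               /* minimum block size; 2*K is the maximum */
\c
\c struct block {
\c     struct block *next;      /* singly linked list of blocks */
\c     size_t len;              /* bytes in use: K <= len <= 2*K,
\c                               * except possibly in the final block */
\c     unsigned char data[2*K]; /* the file bytes themselves */
\c };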

The common problem in both these methods is that as soon as you make
insertion a constant-time operation, seeking to a given byte
position becomes linear-time, whereas in the original array format,
of course, seeking was constant-time but \e{insertion} became
linear-time.

\C{trees} Using balanced trees

This is where trees come in. Balanced tree structures (AVL trees,
red-black trees and B-trees alike) all solve this sort of problem
for sorted lists. You can insert an element into a balanced tree in
\e{log} time, and you can search for a particular element in log
time as well. This sounds like the kind of compromise we want: if
making insertion constant-time forces seeking to be linear and vice
versa, we would prefer to arrange for \e{both} to be log-time.

The conventional use of a balanced tree to store a sorted list,
however, is not immediately helpful to us. The only criterion we
could reasonably sort on would be byte position in the file; and as
soon as we store our data as a set of (position, data) pairs, we're
back to insertion being linear again, because we would have to alter
the position field of every tree element after the insertion point.

Is there anything we can do to our balanced trees to make this work
better?

\C{counted-trees} Counted trees

Yes, there is.

Suppose you add an additional field to every node of a balanced
tree. In that field, you store a count of the number of elements
\e{in or below} that node.

Operations which alter the tree (insertion and deletion) now have to
make sure these counts remain accurate. This can be done without
sacrificing the log-time characteristics of the operations. For
example, when you add an element, you increment the count of the
node containing it, and then work back up the tree to the root
incrementing the counts in all the nodes you go past. Since the
height of the tree is O(log N), this only takes O(log N) time.

So we can add counts to a tree and still maintain it efficiently.
What have the counts bought us?

Once we have counts in a tree, they introduce an entirely new way to
\e{search} the tree. Starting at the root, we can search down the
tree by examining the count fields rather than comparing elements as
usual; and this allows us to find the Nth item in the tree, for any
N, in a single log-time search. For example, suppose the root node
contains a child with count 54, then an actual element, then a child
with count 73. Then:

\b If you are trying to get to a position less than 54, you descend
straight to the leftmost child.

\b If you are trying to get to \e{exactly} position 54, you return
the element out of the root node.

\b If you are trying to get to position 55 or greater, you descend
to the rightmost child, and subtract 55 from your desired position.
(If you want element 57 of the tree, then you know there are 55
elements in the tree before the right-hand subtree, so you know you
want element 2 of the right-hand subtree.)

So now we have a means of finding the Nth item in a tree in a
log-time search. This is starting to look promising.
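
Here is a sketch of that count-based search in C. For brevity it
uses a plain binary tree node holding one element each (a real
implementation would use one of the balanced structures above, and
all the names here are purely illustrative):

\c #include <stddef.h>
\c
\c struct node {
\c     struct node *child[2];   /* left and right subtrees, or NULL */
\c     size_t count;            /* number of elements in or below here */
\c     unsigned char element;   /* the payload: one byte of file data */
\c };
\c
\c static size_t count_of(struct node *n) { return n ? n->count : 0; }
\c
\c /* Find the element at 0-based position 'index', steering by the
\c  * subtree counts instead of comparing keys. */
\c unsigned char index_tree(struct node *n, size_t index)
\c {
\c     while (1) {
\c         size_t left = count_of(n->child[0]);
\c         if (index < left) {
\c             n = n->child[0];        /* wanted position is on the left */
\c         } else if (index == left) {
\c             return n->element;      /* this node is position 'index' */
\c         } else {
\c             index -= left + 1;      /* skip left subtree and this node */
\c             n = n->child[1];
\c         }
\c     }
\c }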

The trouble is, we're still stuck with having some sort of sorting
order on the tree. Now we need to deal with that.

\C{unsorted-trees} Unsorted trees

The simple answer to the sorting problem is to do away with sorting
the tree altogether!

Conventional balanced trees have a sorting order because it's used
to find elements in the tree, and to know where to add an element.
But we don't need a sorting order to find things any more, because
we can use a count-based search to jump to the Nth position. Can we
also use counts during the tree add operation, to allow us to
specify \e{where} we want to add our new element?

We can. Tree add algorithms start by searching down the tree to find
the position where the new element will be inserted. If we do this
search using counts, in exactly the same way described in
\k{counted-trees}, then we can add any element we like at any
position in the tree. Once we do this, of course, we have to throw
out the sorting order completely, and never do another order-based
search or insertion again, because they won't work. But that's OK,
because we didn't need them anyway.
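
As a sketch, here is position-based insertion on the same
illustrative node structure as before (reusing \cw{count_of} from
the previous fragment). Rebalancing is deliberately omitted: a real
AVL, red-black or B-tree would rebalance on the way back up, keeping
the counts correct as it did so:

\c #include <stdlib.h>
\c
\c /* Insert a new element so that it ends up at position 'index',
\c  * searching down by counts exactly as in the lookup above. */
\c struct node *insert_at(struct node *n, size_t index, unsigned char e)
\c {
\c     if (!n) {
\c         n = malloc(sizeof(struct node));
\c         n->child[0] = n->child[1] = NULL;
\c         n->count = 1;
\c         n->element = e;
\c         return n;
\c     }
\c     n->count++;              /* one more element now lives below here */
\c     size_t left = count_of(n->child[0]);
\c     if (index <= left)
\c         n->child[0] = insert_at(n->child[0], index, e);
\c     else
\c         n->child[1] = insert_at(n->child[1], index - (left + 1), e);
\c     return n;
\c }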

Now we have exactly what we were after in the first place. We
have a data structure which stores an unordered list of items, in
such a way that we can insert or delete an item in log time \e{and}
find the Nth element in log time.

\C{splitjoin} Splitting and joining trees

Now we can begin to get more ambitious. One issue we have not
addressed yet is cut and paste.

So far I have discussed tree insertion on the assumption that you
only ever insert one character at a time into your tree. In fact hex
editors need cut and paste just as much as normal text editors do;
so we must think about how to insert or remove a larger block of
data at a time.

One obvious way is to process each byte individually. A ten-byte
cut operation is ten individual deletions, and a ten-byte paste is
ten individual insertions. This is fine if you only ever use cut and
paste to move tiny chunks of data around a large file, but if you
need to move \e{half the file} from one place to another, things get
more interesting.

The linked-list structure discussed in \k{simple} would have helped
a lot with this problem. Linked lists don't just make it easy to
insert or delete one item: they make it just as easy to unlink an
enormous chunk of a list once you've found both ends of the chunk,
and you can link that chunk in somewhere else just as easily.

It turns out that you \e{can} do the same thing with balanced trees.
At this point it starts to make a difference what kind of balanced
tree you use: all three of AVL, red-black and B-trees support these
operations, but the precise methods vary between them. I'm going to
use B-trees from here on, because the algorithms are slightly
simpler.

What we need are two basic operations. Given a counted, unsorted
B-tree containing an unordered list of items, we need to be able to:

\b Split the tree down the middle, giving two valid B-trees as
output.

\b Take two valid B-trees and join them together end-to-end, giving
one B-tree containing all the data from tree A followed by the data
from tree B.

These two operations provide everything we need. To unlink a large
section from the middle of a tree, we split it in two places and
then join the outer two parts back together; to link a large section
\e{into} the middle of a tree, we split it at the insertion point,
join the left half on to the left side of the inserted section, and
join the right half on to the right side of the inserted section.

\H{joining} Joining two B-trees together

When you add an element to a B-tree, sometimes it ends up increasing
the size of a leaf node beyond the size limit. When that happens,
you deal with it by splitting the node in two, and transforming the
parent node so that where it previously had a single child pointer,
it now has two child pointers with an element between them. If that
makes the parent node too big as well, you do the same thing again,
and so on until you reach the tree root.

Joining two B-trees is therefore reasonably simple, \e{if} you have
an additional separating element to place between them. Position
the two trees so that their leaf nodes are at the same level; now
(usually) one tree will be shorter than the other. So you can add
the root of the shorter tree as a sibling of the node next to it in
the taller tree; their common parent gains one extra child pointer
(pointing at the root of the shorter tree), separated from its
neighbour by the additional separating element. If this causes the
node to grow beyond the maximum size, just split it in two and
propagate up to its parent, exactly as in the ordinary insertion
process.

If the trees were originally the same height, just combine their
root nodes into a single larger root node. You need an extra element
to go between the rightmost child pointer of the left-hand root
node and the leftmost child pointer of the right-hand root node;
and again, this is where your separating element comes in. Again, if
the new root is too big to be a single node, split it in two and
create a new root above it.

So it turns out that it's very easy to join two trees together, but
the algorithm requires a spare element to go in the middle. However,
we normally don't have such a spare element: we just have two trees.
This is easily solved, though: we simply start by removing the
leftmost element of the right-hand tree using the ordinary tree
deletion algorithm. Then we do the join algorithm, as described
above, using the element we just removed as our separator.

\H{splitting} Splitting a B-tree in two

To split a B-tree in two, we are given a tree, plus a means of
searching down the tree to find the split point. (In this
application, that will be a numeric position, which we check against
the node counts on the way down; in other situations, we might
perfectly well want to split an ordinary \e{sorted} B-tree in half,
so we might have an ordering-based search criterion. It makes no
difference.)

We start in the simplest possible way. Start at the root node;
decide which of its subtree pointers you are going to descend down;
and saw the node in half at that subtree pointer. The two half-nodes
thus created will \e{each} need a subtree pointer to go on the cut
end, but that's OK because we're about to saw the next node down in
half as well and they can have half each. So descend to the next
node, decide on a split point again, saw that node in half, and put
a pointer to each half in the two halves of the parent node.

Once we finish this searching-and-cutting pass, we will have
successfully separated our tree into two parts at the required
point. However, the result will almost certainly not be a pair of
\e{valid} B-trees; the chances are that many of the nodes on the cut
edges will be below the minimum allowed node size. In fact, if at
any point our search criterion made us descend through the
\e{endmost} subtree pointer in any node, some of those nodes will
have no elements in them whatsoever, just a single subtree pointer!

So now we must make a healing pass down the cut edge of each tree,
to turn it back into a valid B-tree. We can start by throwing away
the root node if it has nothing but a single subtree pointer (which
will happen quite often if we split near one end of the original
tree, since in that case the output trees will almost certainly need
to be of different heights). Keep doing that until we find a real
root node.

One child of that node is on the cut edge, so it may be below the
minimum size. If it is, we solve this using its (valid) neighbour
node. If the neighbour is large, we can move some subtrees over into
the undersized node to make two correctly sized nodes; if the
neighbour is too small and does not have that many subtrees to
spare, we can instead \e{combine} the undersized node with its
neighbour. (And it turns out you can always do at least one of
these: if the neighbour is too large to combine with the undersized
node, then it \e{must} have enough subtrees for redistribution to
give two viable nodes.)

The one complication is that combining an undersized node with
its neighbour reduces the number of subtrees of their common parent
by one. Therefore:

\b As we go down, we arrange for each node on the cut edge to be at
least \e{one more than} minimum size, in case its size must drop by
one when we process its child. (This still just about works in all
cases.)

\b If the first non-trivial root node had only two children (recall
that the root node in a B-tree is the only node exempt from the
minimum size limit), and those two children end up having to be
combined, then the root node must be thrown away again and the
combined node becomes the new root.

Once we have sorted out each node, we descend to its child on the
cut edge, and do the same thing again. Eventually we reach the
bottom of the tree and every node is of valid size. Then we do the
same thing to the cut edge of the other tree, and we're done.

\C{copy-on-write} Cloning trees

The splitting and joining algorithms look as if they make cut and
paste pretty much trivial. You can split a big chunk out of your
editing buffer into a separate cut buffer easily enough; and then
you can \q{paste} it somewhere else by joining it back into the
middle of the editing buffer at a different position.

However, in real life, cut and paste isn't that simple. People often
want to paste the same data more than once; so you can't just link
the cut buffer straight into the editing buffer, because then you
don't still have it to link in again next time. You need to \e{copy}
the cut buffer and link in the copy. Equally, users often want to
press Copy rather than Cut, in which case you have to split the
buffer tree in two places, \e{copy} the middle section, and join all
three back together.

Copying a tree, it would seem, is inherently an O(N) operation;
there's no way you can copy a tree containing megabytes of data
without actually copying all that data.

Or is there?

It turns out that we \e{can} do better than this, by adding another
annotation field to each tree node. This time, the annotation is a
\e{reference count}: it counts the number of pointers to the node,
either from other tree nodes or from the \q{root} field in a tree
header structure. To begin with, of course, all reference counts are
1.

Reference counts are normally used for garbage collection. In this
case, though, I'm going to use them to implement \e{copy-on-write}.
All of the tree-altering algorithms (insertion and deletion, plus
the split and join algorithms described above) will now check the
reference count of a node before attempting to modify it. If they
find that they need to modify a node with a reference count greater
than one, they will not modify it. Instead, they will make a copy of
that node, and use the copy in place of the original. The copy links
to all the same child nodes as the original, so the reference count
in each child must be incremented; and the copied node's parent (or
tree header structure) now links to the copy rather than to the
original, so the reference count in the original must be
decremented. Now we are looking at a node with a reference count of
1, which means nobody else is using it, so we can modify it safely.

The effect of this is that it is now a trivial - not merely log-time
but \e{constant}-time - operation to \e{clone} an entire B-tree, no
matter how large. We simply create a new tree header structure; we
point its root field at the root node of the input tree; and we
increment the reference count on that root node.
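
A sketch of both operations, in the same illustrative C as before,
assuming the node structure has gained an \cw{int refcount} field
(all names hypothetical):

\c #include <stdlib.h>
\c
\c struct tree { struct node *root; };
\c
\c /* Clone a whole tree in constant time: share the root node and
\c  * count the extra reference to it. */
\c struct tree *tree_clone(struct tree *t)
\c {
\c     struct tree *clone = malloc(sizeof(struct tree));
\c     clone->root = t->root;
\c     if (clone->root)
\c         clone->root->refcount++;
\c     return clone;
\c }
\c
\c /* Call before modifying any node. If the node is shared, swap in
\c  * a private copy: each child gains a reference (from the copy)
\c  * and the original loses one (ours). */
\c struct node *unshare(struct node *n)
\c {
\c     if (n->refcount > 1) {
\c         struct node *copy = malloc(sizeof(struct node));
\c         *copy = *n;
\c         copy->refcount = 1;
\c         for (int i = 0; i < 2; i++)
\c             if (copy->child[i])
\c                 copy->child[i]->refcount++;
\c         n->refcount--;
\c         n = copy;
\c     }
\c     return n;
\c }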

Once we have cloned a tree like this, we can treat the original and
the clone as if they were entirely independent. If you add an
element to one of them, for example, then a single string of nodes
from the root down to one leaf will be duplicated and modified, but
the rest of the two trees will still be held in common. You can
split either tree into lots of little pieces, or join it into the
middle of a larger one, and never affect the data stored in what was
once its clone, because every time you touch a node that the other
tree is depending on, you make your own copy rather than disturbing
it.

This allows us to support \e{really} efficient cut and paste in our
hex editor. You select a 200Mb chunk and press Copy; the buffer tree
is split in two places (in log time), the middle section is cloned
(instantly), and the tree is joined back together. You'd hardly know
anything was different - but the cut buffer now contains a clone of
\e{part} of the original buffer, most of which consists of nodes
that are still shared with it. And you can paste in as many copies
as you like of that chunk, still in no worse than O(log N) time. The
best bit is that by the time you've done this a few times and have a
file that's 1600Mb longer than it started out, the hex editor isn't
actually using up 1600Mb more memory, because most of it is in
shared nodes! This technique naturally provides a form of
compression as well as speed.

\C{lazy-loading} Lazy file loading

In all of the above I have been tacitly assuming that the data
elements stored in my tree are individual bytes. This would be
hideously inefficient if I were using AVL or red-black trees, in
which each node contains precisely one element: for every \e{byte}
of the file being edited, there would be an overhead of two child
pointers, a byte count and a reference count. On a normal 32-bit
machine, that's 20 bytes per node, not counting overhead from the
memory allocator. A factor of twenty is just ridiculous.

B-trees are a bit more flexible, since they can be made to have a
large minimum degree. A B-tree with a minimum node size of (say) 512
can contain up to 1023 bytes of data plus 1024 subtree pointers, and
those 1023 bytes can be packed together in memory so the overhead is
now more like a factor of five. Also, since no node in a B-tree ever
changes its height above ground level, you need not even allocate
space for the 512 NULL child pointers in your leaf nodes; and since
the vast majority of your nodes will \e{be} leaf nodes, the
structure is now closer to being space-efficient.

There are other improvements one could make. For example, there's no
reason why a B-tree really needs to have the \e{same} minimum node
degree at every level; so you could have low-degree nodes everywhere
above the leaf level, and enormous leaf nodes containing 4-8Kb of
file data. You could move to B+ trees, in which no actual data
elements are stored anywhere except in the leaf nodes, thus saving
the tiny alignment overheads in the other nodes.

However, there's a better direction to head in. In \k{simple} I
mentioned the idea of using a linked list as the main data
structure, and I said that each element of the linked list would be
a smallish array of file bytes (between size K and 2K). There's no
reason we couldn't do that in our B-tree-based approach: each
element stored in the B-tree is no longer a single byte but a small
block of bytes. This would mean that our element counts no longer
allowed us to jump to the Nth byte, only to the Nth \e{block}; but
we can fix that by replacing the element count with a byte count,
summing the total \e{size} of all the blocks in or below a
particular tree node. Now, given any byte position, we can do a
single log-time search and return a data block plus an offset within
that block.
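
Sketched in the same illustrative style (and again on a plain
binary tree for brevity), the byte-count search looks like this:

\c #include <stddef.h>
\c
\c struct bnode {
\c     struct bnode *child[2];  /* left and right subtrees, or NULL */
\c     size_t bytes;            /* total block bytes in or below here */
\c     unsigned char *block;    /* this node's block of file data... */
\c     size_t blocklen;         /* ...and that block's length */
\c };
\c
\c static size_t bytes_of(struct bnode *n) { return n ? n->bytes : 0; }
\c
\c /* Map a byte position to (block, offset) in one log-time descent. */
\c struct bnode *find_byte(struct bnode *n, size_t pos, size_t *offset)
\c {
\c     while (1) {
\c         size_t left = bytes_of(n->child[0]);
\c         if (pos < left) {
\c             n = n->child[0];
\c         } else if (pos < left + n->blocklen) {
\c             *offset = pos - left;  /* falls inside this node's block */
\c             return n;
\c         } else {
\c             pos -= left + n->blocklen;
\c             n = n->child[1];
\c         }
\c     }
\c }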

This technique adds work to all operations. Inserting a byte, for
example, is now done by finding the block it needs to go into,
inserting it in that block, and potentially splitting the block in
two and doing an extra tree operation. Splitting and joining buffers
involves splitting and joining blocks at each end, and checking to
make sure undersized blocks are not created. So what does this
technique buy us, that makes it worthwhile over just storing single
bytes in the B-tree?

The answer is: once we have a block data structure as our tree
element, we can start having different \e{types} of block. In
particular, we can have a type of block which is a placeholder,
containing nothing but a file offset and length. A block of this
type indicates \q{at this point in the tree we have N bytes from
position P in the original file}. Blocks of this type are exempt
from the maximum size imposed on normal literal-data blocks.

The effect of this is that we no longer need to read the entire file
into memory when we start up. Instead, we just initialise our tree
trivially, so that it contains nothing but a single placeholder
block, with offset zero and size equal to the initial file size.
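
A sketch of the two block types and the read path, with hypothetical
names; it assumes the unmodified original file remains available
through an open \cw{FILE *}:

\c #include <stdio.h>
\c #include <string.h>
\c
\c enum blocktype { LITERAL, PLACEHOLDER };
\c
\c struct fileblock {
\c     enum blocktype type;
\c     size_t len;              /* number of bytes this block represents */
\c     union {
\c         unsigned char *data; /* LITERAL: the bytes themselves */
\c         long offset;         /* PLACEHOLDER: position in original file */
\c     } u;
\c };
\c
\c /* Read n bytes starting off bytes into a block. Literal data comes
\c  * from memory; placeholder data is fetched from the unmodified
\c  * original file. */
\c void block_read(struct fileblock *b, FILE *original,
\c                 size_t off, size_t n, unsigned char *out)
\c {
\c     if (b->type == LITERAL) {
\c         memcpy(out, b->u.data + off, n);
\c     } else {
\c         fseek(original, b->u.offset + (long)off, SEEK_SET);
\c         fread(out, 1, n, original);
\c     }
\c }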

Now whenever we need to read data from the tree, and it turns out
the data in question is somewhere in a placeholder block, we must
refer back to the original input file in order to find the data (and
the placeholder block will tell us where in the file to read it
from). So before we do any editing, our hex editor is suddenly a
low-cost hex \e{file viewer}, which just pages back and forth and
refers to the disk all the time.

But as soon as we start altering parts of the file, the placeholder
block gets broken up into smaller blocks, and literal-data blocks
begin to be created in between them. If we cut and paste a section
including a placeholder block, then the tree can end up containing
placeholder blocks in a strange order; it might (for example)
indicate something like \q{the first 192K of the input file; then
the literal bytes 5B 49 A7; then 25K of the input file starting from
position 12345; then 512K of the input file starting from position
204325}.

Now the hex editor \e{looks} as if it's doing exactly the same thing
as it did to begin with. I can page around the original file; I can
insert, delete, overwrite, cut, copy and paste to my heart's
content, and (provided no other process modifies the original file
under our feet) the data I am manipulating will remain consistent at
all times with the editing operations I have performed. But there
was no big delay at startup when the file was loaded in, because
most of it \e{wasn't} loaded in; and if I list the running processes
on my system, the hex editor will not be using memory proportional
to the size of the file. It will only be using memory proportional
to the \e{changes} I've made to the file.

When I save the file, if there are any placeholder blocks remaining
in the buffer tree, the hex editor must write out the new version by
referring to the original. This is the \e{only} remaining operation,
apart from searching, that takes time proportional to the size of
the file. And there are \e{no} remaining operations which take
\e{memory} proportional to the size of the file.

(There is one thing you need to be careful of. Literal data blocks
must be permitted to fall below the minimum size K if there is no
literal block next to them to merge with; in particular, this is
vital if you are writing a binary file from scratch, or you would
never be able to give it a size between zero and K. But this raises
the possibility that, given a pathological sequence of editing
operations, your data structure might end up being an interleaving
of one-byte literal blocks and one-byte placeholder blocks, giving a
huge space overhead. The simplest solution is to impose a minimum
size of 2K on \e{placeholder} blocks, below which you read the
relevant piece of file data and convert the placeholder into a
literal block; then it can be merged with adjacent blocks and the
worst case is no longer terrible.)

We now have a data structure which does pretty much everything you
could reasonably ask a hex editor to be able to do, and does it all
at a reasonable memory cost and (apart from the two genuinely
necessary operations of searching and saving) \e{all} in O(log N)
time.

\C{further} Further directions

The data structure as I have presented it is suitable for use in a
high-performance hex editor with an insert mode.

There are a couple more points worth noting.

\H{further-texted} Conventional text editing

This structure would need only minor modifications to be an
efficient basis for a conventional text editor. In order to do this,
you would need to be able to jump quickly to a particular \e{line}
of the file, which means you'd need a node annotation counting
newlines.

In fact, it's possible to do slightly better than that: we can
devise a more complex node annotation which tracks the effect of an
arbitrary byte sequence on the (line, column) position. Assuming
that a physical tab character always advances the cursor to the next
multiple of 8 columns, there are three possibilities:

\b A sequence of bytes containing no newlines or tabs simply adds
some number A to the column number, and does not affect the line
number.

\b A sequence of bytes containing no newlines but at least one tab
has the overall effect of adding some number A to the column, and
rounding it up to the next number that is congruent to B mod 8.

\b A sequence of bytes containing at least one newline has the
effect of adding some number A to the \e{line} number, and setting
the column number to a fixed value B.

These three function schemas are closed under composition (i.e.
combining any two of them gives another one; a sketch of the
composition rule follows below). Storing one in each node of a
buffer tree would provide the ability to search directly to \e{a
particular (line, column) position} in a single log-time search. So
the text editor could treat its buffer as a simple sequence of bytes
(or possibly of Unicode characters). This is superior to treating
the buffer as a sequence of lines, because it removes the
distinction between inserting \e{within} a line and inserting data
\e{between} lines. In particular, cut and paste in a line-based
model is fiddly because lines must be spliced together at each end
of the pasted region; but cut and paste in this model is as trivial
as it was in the hex editor - you just cut a sequence of bytes,
paste it somewhere else, and the line/column indexing automatically
keeps up no matter what you do.
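
Here is a sketch of one possible encoding of those schemas and
their composition, with hypothetical names; \cw{compose(f, s)}
returns the schema with the same effect as applying \cw{f} and then
\cw{s}:

\c /* ADD_COL:  col += a.
\c  * TAB_COL:  col += a, then round up to the next number that is
\c  *           congruent to b mod 8 (a tab always moves the cursor,
\c  *           so "next" means strictly greater).
\c  * NEWLINE:  line += a; col = b. */
\c enum anntype { ADD_COL, TAB_COL, NEWLINE };
\c struct annot { enum anntype type; long a, b; };
\c
\c /* Distance from x up to the next number congruent to b mod 8,
\c  * always at least 1. */
\c static long rudist(long x, long b)
\c {
\c     long d = ((b - x) % 8 + 8) % 8;
\c     return d ? d : 8;
\c }
\c
\c struct annot compose(struct annot f, struct annot s)
\c {
\c     struct annot r = f;
\c     switch (s.type) {
\c       case NEWLINE:          /* s's newline wipes out f's column effect */
\c         r = s;
\c         if (f.type == NEWLINE)
\c             r.a += f.a;      /* but line deltas accumulate */
\c         break;
\c       case ADD_COL:
\c         if (f.type == ADD_COL) {
\c             r.a += s.a;
\c         } else if (f.type == TAB_COL) {
\c             r.a += s.a;
\c             r.b = (r.b + s.a) % 8;
\c         } else {
\c             r.b += s.a;      /* NEWLINE: the fixed column shifts */
\c         }
\c         break;
\c       case TAB_COL:
\c         if (f.type == ADD_COL) {
\c             r = s;
\c             r.a += f.a;
\c         } else if (f.type == TAB_COL) {
\c             long d = rudist(f.b + s.a, s.b);
\c             r.a = f.a + s.a + d;
\c             r.b = s.b;
\c         } else {             /* NEWLINE, then a tab-containing run */
\c             r.b = f.b + s.a + rudist(f.b + s.a, s.b);
\c         }
\c         break;
\c     }
\c     return r;
\c }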

The only snag is that if you did this, you would probably no longer
be able to do the trick with placeholder blocks and lazy file
loading; a text editor tends to need to know in advance where all
the newlines are in its buffer, so there would probably be no
alternative to physically loading the file. But in that respect, at
least, this data structure is no worse than any other.

\H{undo} Supporting undo

An undo function in an editor \e{conceptually} stores a sequence of
previous buffer states, and allows you to return to one of them when
you need to.

Usually, this is not actually implemented by storing copies of the
entire buffer, since that would be ludicrously wasteful of space!
Instead, a journal of changes is kept, which allows previous buffer
states to be \e{reconstructed} by reversing the precise changes
made.

One could do that using this data structure, if one wanted to.
However, there's another intriguing option. Since cloning an
arbitrarily large tree is a cheap operation, you could implement
undo by \e{actually} storing a sequence of clones of previous buffer
states! The cost of this would be nothing like as bad as it would
na\u00EF{i}vely appear.

It might still not be ideal, though. Every time you clone a tree and
the two clones diverge, several nodes must be copied, and if each
node contains several blocks of literal data then the cost of
maintaining too many buffer clones might still become prohibitive.
But it's an interesting possibility regardless.

\C{summary} Summary

I've presented a design for a data structure which implements
practically every operation required for a hex editor in O(log N)
time, apart from one or two which genuinely \e{need} to be O(N).

The structure is:

\b A B-tree, each of whose elements is either a small array of
literal data bytes, or a placeholder block denoting a section of the
unmodified input file.

\b Each B-tree node is annotated with the total byte count of all
the elements in or below that node, allowing a log-time search to
pinpoint any numeric byte position.

\b Those counts provide the only necessary means of navigating the
tree, so there is no need for a sorting criterion.

\b Split and join algorithms make it possible to link and unlink
large chunks from the middle of a buffer at a time.

\b Reference counts implementing copy-on-write allow cloning of
chunks in constant time.

As a result:

\b Inserting or deleting bytes in the file is a log-time operation.

\b Finding a particular byte position is a log-time operation.

\b Cut and paste is always log-time, no matter how large or complex
the chunk of data being moved around.

\b Memory usage grows proportionally to the \e{changes} made to the
file, not the overall file size. (However, memory usage is also
\e{bounded} by a value proportional to the file size, even if you
keep editing and re-editing for ever.)

Searching must still be linear (there's no alternative to actually
reading the data if you need to know anything about its contents),
and saving the modified output file is linear (because you must
physically write out that much data), but \e{everything} else can be
done in log time.

I've also sketched a means of converting this into a data structure
for an ordinary text editor, and suggested interesting implications
in the area of undo operations.

\C{ref} References

Donald Knuth's \q{The Art of Computer Programming}
(\W{http://en.wikipedia.org/w/wiki.phtml?title=Special:Booksources&isbn=0201485419}{Addison-Wesley,
ISBN 0201485419}) presents at least some of the same ideas as this
article. Counted and unsorted trees are mentioned in volume 3;
splitting and joining are also described (although Knuth does them
on AVL trees, which are significantly more fiddly to split than
B-trees; you have to cut the tree into lots of little pieces, and
then put them all back together by using the join algorithm
repeatedly).

\q{Tweak}, a hex editor implementing this data structure, can be
downloaded at
\W{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}\cw{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}.

\versionid $Id$