\cfg{html-leaf-level}{0}
\cfg{chapter}{Section}
\cfg{text-title-align}{left}
\cfg{text-indent}{0}
\cfg{text-chapter-numeric}{yes}
\cfg{text-chapter-suffix}{. }
\cfg{text-chapter-underline}{-}
\cfg{text-section-numeric}{0}{yes}
\cfg{text-section-suffix}{0}{. }
\cfg{text-section-underline}{0}{-}
\cfg{html-chapter-numeric}{yes}
\cfg{html-chapter-suffix}{. }
\cfg{html-section-numeric}{0}{yes}
\cfg{html-section-suffix}{0}{. }
\cfg{html-section-numeric}{1}{yes}
\cfg{html-section-suffix}{1}{. }

\title An Efficient Data Structure For A Hex Editor

by \W{http://pobox.com/~anakin/}{Simon Tatham}

\C{intro} Introduction

Hex editors have been around for a long time, and at the most basic
level they are very simple to write. Since they are mostly used for
editing files such as executables, which contain a lot of
cross-references to particular byte positions in the file, a hex
editor need not have an insert mode in order to be useful. And a hex
editor without an insert mode is very easy to implement: you simply
allocate a large enough array for the input file, and use that as
your data structure. The only operation you really need to be able
to do efficiently is to jump to a particular byte position, and
that's precisely what an array makes easy.

On the other hand, an insert mode can be useful in other
circumstances. Not \e{all} types of file you might want to edit have
the same restrictions as an executable. And as soon as you want your
hex editor to have an insert mode, the data structure question
becomes much more interesting.

In this article I present an efficient and scalable data structure
which supports all the operations needed by a hex editor.

\C{simple} Simple options

One technique used to support insert mode in editors is to use an
array larger than the file size, with a gap in it. The file contents
up to the current cursor position are stored at the start of the
array; the file contents from the current cursor position to the end
are stored at the end of the array; and the gap in the middle moves
about as the cursor does.

This makes insertion easy. When the user inserts an extra character,
you just add it to one end or the other of the gap. On the other
hand, \e{moving} through the file now becomes a slow operation; it's
not noticeable when you're moving by a byte, by a line, or even by a
screenful at a time, but as soon as you try to jump to the start or
end of the file, or jump to a particular specified file offset,
suddenly the editor has to bodily shift enormous amounts of file
data from one end of the gap to the other.
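
For concreteness, here is a minimal sketch of a gap buffer in C. The
structure and function names are hypothetical, and the sketch
assumes the gap never fills up (a real editor would reallocate a
larger array when it did):

\c #include <stddef.h>
\c #include <string.h>
\c
\c /* A gap buffer: text[0..gap_start) and text[gap_end..size) hold
\c  * the file data; the bytes in between are the gap, which sits at
\c  * the cursor position. */
\c struct gapbuf {
\c     unsigned char *text;
\c     size_t gap_start, gap_end, size;
\c };
\c
\c /* Constant-time insertion: write into the gap at the cursor. */
\c void gb_insert(struct gapbuf *gb, unsigned char c)
\c {
\c     gb->text[gb->gap_start++] = c;   /* assumes the gap is not full */
\c }
\c
\c /* Moving the cursor costs time proportional to the distance moved:
\c  * everything between the old and new positions crosses the gap. */
\c void gb_seek(struct gapbuf *gb, size_t pos)
\c {
\c     if (pos < gb->gap_start) {
\c         size_t n = gb->gap_start - pos;
\c         memmove(gb->text + gb->gap_end - n, gb->text + pos, n);
\c         gb->gap_start -= n;
\c         gb->gap_end -= n;
\c     } else {
\c         size_t n = pos - gb->gap_start;
\c         memmove(gb->text + gb->gap_start, gb->text + gb->gap_end, n);
\c         gb->gap_start += n;
\c         gb->gap_end += n;
\c     }
\c }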

Another slightly better option is to use a linked list of small
arrays, and to let the arrays vary in size between K and 2K bytes,
for some fixed minimum block size K. Inserting a single byte in the
middle of a block doesn't cost too much; occasionally the block will
grow beyond size 2K and have to be split into two smaller ones, but
even that isn't too slow.

Jumping to a particular position, however, is still an O(N)
operation using this structure. In practice it isn't \e{too} bad,
since the length of the linked list is at worst 1/K times the size
of the file; but once the file size becomes seriously big, this
approach does not scale well.
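
A sketch of that structure, with hypothetical names; each list
element owns a small array which is kept between K and 2K bytes
full:

\c #include <stddef.h>
\c
\c #define K 4096               /* minimum block size; 2*K is the maximum */
\c
\c struct block {
\c     struct block *next;      /* singly linked list of blocks */
\c     size_t len;              /* bytes in use: K <= len <= 2*K,
\c                               * except possibly in the final block */
\c     unsigned char data[2*K]; /* the file bytes themselves */
\c };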

The common problem in both these methods is that as soon as you make
insertion a constant-time operation, seeking to a given byte
position becomes linear-time, whereas in the original array format,
of course, seeking was constant-time but \e{insertion} became
linear-time.

\C{trees} Using balanced trees

This is where trees come in. Balanced tree structures (AVL trees,
red-black trees and B-trees alike) all solve this sort of problem
for sorted lists. You can insert an element into a balanced tree in
\e{log} time, and you can search for a particular element in log
time as well. This sounds like the kind of compromise we want: if
making insertion constant-time forces seeking to be linear and vice
versa, we would prefer to arrange for \e{both} to be log-time.

The conventional use of a balanced tree to store a sorted list,
however, is not immediately helpful to us. The only criterion we
could reasonably sort on would be byte position in the file; and as
soon as we store our data as a set of (position, data) pairs, we're
back to insertion being linear again, because we would have to alter
the position field of every tree element after the insertion point.

Is there anything we can do to our balanced trees to make this work
better?

\C{counted-trees} Counted trees

Yes, there is.

Suppose you add an additional field to every node of a balanced
tree. In that field, you store a count of the number of elements
\e{in or below} that node.

Operations which alter the tree (insertion and deletion) now have to
make sure these counts remain accurate. This can be done without
sacrificing the log-time characteristics of the operations. For
example, when you add an element, you increment the count of the
node containing it, and then work back up the tree to the root
incrementing the counts in all the nodes you go past. Since the
height of the tree is O(log N), this only takes O(log N) time.

So we can add counts to a tree and still maintain it efficiently.
What have the counts bought us?

Once we have counts in a tree, they introduce an entirely new way to
\e{search} the tree. Starting at the root, we can search down the
tree by examining the count fields rather than comparing elements as
usual; and this allows us to find the Nth item in the tree, for any
N, in a single log-time search. For example, suppose the root node
contains a child with count 54, then an actual element, then a child
with count 73. Then:

\b If you are trying to get to a position less than 54, you descend
straight to the leftmost child.

\b If you are trying to get to \e{exactly} position 54, you return
the element out of the root node.

\b If you are trying to get to position 55 or greater, you descend
to the rightmost child, and subtract 55 from your desired position.
(If you want element 57 of the tree, then you know there are 55
elements in the tree before the right-hand subtree, so you know you
want element 2 of the right-hand subtree.)

So now we have a means of finding the Nth item in a tree in a
log-time search. This is starting to look promising.
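
Here is a sketch of that count-based search in C. For brevity it
uses a plain binary tree node holding one element each (a real
implementation would use one of the balanced structures above, and
all the names here are purely illustrative):

\c #include <stddef.h>
\c
\c struct node {
\c     struct node *child[2];   /* left and right subtrees, or NULL */
\c     size_t count;            /* number of elements in or below here */
\c     unsigned char element;   /* the payload: one byte of file data */
\c };
\c
\c static size_t count_of(struct node *n) { return n ? n->count : 0; }
\c
\c /* Find the element at 0-based position 'index', steering by the
\c  * subtree counts instead of comparing keys. */
\c unsigned char index_tree(struct node *n, size_t index)
\c {
\c     while (1) {
\c         size_t left = count_of(n->child[0]);
\c         if (index < left) {
\c             n = n->child[0];        /* wanted position is on the left */
\c         } else if (index == left) {
\c             return n->element;      /* this node is position 'index' */
\c         } else {
\c             index -= left + 1;      /* skip left subtree and this node */
\c             n = n->child[1];
\c         }
\c     }
\c }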

The trouble is, we're still stuck with having some sort of sorting
order on the tree. Now we need to deal with that.

\C{unsorted-trees} Unsorted trees

The simple answer to the sorting problem is to do away with sorting
the tree altogether!

Conventional balanced trees have a sorting order because it's used
to find elements in the tree, and to know where to add an element.
But we don't need a sorting order to find things any more, because
we can use a count-based search to jump to the Nth position. Can we
also use counts during the tree add operation, to allow us to
specify \e{where} we want to add our new element?

We can. Tree add algorithms start by searching down the tree to find
the position where the new element will be inserted. If we do this
search using counts, in exactly the same way described in
\k{counted-trees}, then we can add any element we like at any
position in the tree. Once we do this, of course, we have to throw
out the sorting order completely, and never do another order-based
search or insertion again, because they won't work. But that's OK,
because we didn't need them anyway.
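
As a sketch, here is position-based insertion on the same
illustrative node structure as before (reusing \cw{count_of} from
the previous fragment). Rebalancing is deliberately omitted: a real
AVL, red-black or B-tree would rebalance on the way back up, keeping
the counts correct as it did so:

\c #include <stdlib.h>
\c
\c /* Insert a new element so that it ends up at position 'index',
\c  * searching down by counts exactly as in the lookup above. */
\c struct node *insert_at(struct node *n, size_t index, unsigned char e)
\c {
\c     if (!n) {
\c         n = malloc(sizeof(struct node));
\c         n->child[0] = n->child[1] = NULL;
\c         n->count = 1;
\c         n->element = e;
\c         return n;
\c     }
\c     n->count++;              /* one more element now lives below here */
\c     size_t left = count_of(n->child[0]);
\c     if (index <= left)
\c         n->child[0] = insert_at(n->child[0], index, e);
\c     else
\c         n->child[1] = insert_at(n->child[1], index - (left + 1), e);
\c     return n;
\c }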

Now we have exactly what we were after in the first place. We
have a data structure which stores an unordered list of items, in
such a way that we can insert or delete an item in log time \e{and}
find the Nth element in log time.

\C{splitjoin} Splitting and joining trees

Now we can begin to get more ambitious. One issue we have not
addressed yet is cut and paste.

So far I have discussed tree insertion on the assumption that you
only ever insert one character at a time into your tree. In fact hex
editors need cut and paste just as much as normal text editors do;
so we must think about how to insert or remove a larger block of
data at a time.

One obvious way is to process each byte individually. A ten-byte
cut operation is ten individual deletions, and a ten-byte paste is
ten individual insertions. This is fine if you only ever use cut and
paste to move tiny chunks of data around a large file, but if you
need to move \e{half the file} from one place to another, things get
more interesting.

The linked-list structure discussed in \k{simple} would have helped
a lot with this problem. Linked lists don't just make it easy to
insert or delete one item: they make it just as easy to unlink an
enormous chunk of a list once you've found both ends of the chunk,
and you can link that chunk in somewhere else just as easily.

It turns out that you \e{can} do the same thing with balanced trees.
At this point it starts to make a difference what kind of balanced
tree you use: all three of AVL, red-black and B-trees support these
operations, but the precise methods vary between them. I'm going to
use B-trees from here on, because the algorithms are slightly
simpler.

What we need are two basic operations. Given a counted, unsorted
B-tree containing an unordered list of items, we need to be able to:

\b Split the tree down the middle, giving two valid B-trees as
output.

\b Take two valid B-trees and join them together end-to-end, giving
one B-tree containing all the data from tree A followed by the data
from tree B.

These two operations provide everything we need. To unlink a large
section from the middle of a tree, we split it in two places and
then join the outer two parts back together; to link a large section
\e{into} the middle of a tree, we split it at the insertion point,
join the left half on to the left side of the inserted section, and
join the right half on to the right side of the inserted section.

\H{joining} Joining two B-trees together

When you add an element to a B-tree, sometimes it ends up increasing
the size of a leaf node beyond the size limit. When that happens,
you deal with it by splitting the node in two, and transforming the
parent node so that where it previously had a single child pointer,
it now has two child pointers with an element between them. If that
makes the parent node too big as well, you do the same thing again,
and so on until you reach the tree root.

Joining two B-trees is therefore reasonably simple, \e{if} you have
an additional separating element to place between them. Position
the two trees so that their leaf nodes are at the same level; now
(usually) one tree will be shorter than the other. So you can add
the root of the shorter tree as a sibling of the node next to it in
the taller tree; their common parent gains one extra child pointer
(pointing at the root of the shorter tree), separated from its
neighbour by the additional separating element. If this causes the
node to grow beyond the maximum size, just split it in two and
propagate up to its parent, exactly as in the ordinary insertion
process.

If the trees were originally the same height, just combine their
root nodes into a single larger root node. You need an extra element
to go between the rightmost child pointer of the left-hand root
node and the leftmost child pointer of the right-hand root node;
and again, this is where your separating element comes in. Again, if
the new root is too big to be a single node, split it in two and
create a new root above it.

So it turns out that it's very easy to join two trees together, but
the algorithm requires a spare element to go in the middle. However,
we normally don't have such a spare element: we just have two trees.
This is easily solved, though: we simply start by removing the
leftmost element of the right-hand tree using the ordinary tree
deletion algorithm. Then we do the join algorithm, as described
above, using the element we just removed as our separator.

\H{splitting} Splitting a B-tree in two

To split a B-tree in two, we are given a tree, plus a means of
searching down the tree to find the split point. (In this
application, that will be a numeric position, which we check against
the node counts on the way down; in other situations, we might
perfectly well want to split an ordinary \e{sorted} B-tree in half,
so we might have an ordering-based search criterion. It makes no
difference.)

We start in the simplest possible way. Start at the root node;
decide which of its subtree pointers you are going to descend down;
and saw the node in half at that subtree pointer. The two half-nodes
thus created will \e{each} need a subtree pointer to go on the cut
end, but that's OK because we're about to saw the next node down in
half as well and they can have half each. So descend to the next
node, decide on a split point again, saw that node in half, and put
a pointer to each half in the two halves of the parent node.

Once we finish this searching-and-cutting pass, we will have
successfully separated our tree into two parts at the required
point. However, the result will almost certainly not be a pair of
\e{valid} B-trees; the chances are that many of the nodes on the cut
edges will be below the minimum allowed node size. In fact, if at
any point our search criterion made us descend through the
\e{endmost} subtree pointer in any node, some of those nodes will
have no elements in them whatsoever, just a single subtree pointer!

So now we must make a healing pass down the cut edge of each tree,
to turn it back into a valid B-tree. We can start by throwing away
the root node if it has nothing but a single subtree pointer (which
will happen quite often if we split near one end of the original
tree, since in that case the output trees will almost certainly need
to be of different heights). Keep doing that until we find a real
root node.

One child of that node is on the cut edge, so it may be below the
minimum size. If it is, we solve this using its (valid) neighbour
node. If the neighbour is large, we can move some subtrees over into
the undersized node to make two correctly sized nodes; if the
neighbour is too small and does not have that many subtrees to
spare, we can instead \e{combine} the undersized node with its
neighbour. (And it turns out you can always do at least one of
these: if the neighbour is too large to combine with the undersized
node, then it \e{must} have enough subtrees for redistribution to
give two viable nodes.)

The one complication is that combining an undersized node with
its neighbour reduces the number of subtrees of their common parent
by one. Therefore:

\b As we go down, we arrange for each node on the cut edge to be at
least \e{one more than} minimum size, in case its size must drop by
one when we process its child. (This still just about works in all
cases.)

\b If the first non-trivial root node had only two children (recall
that the root node in a B-tree is the only node exempt from the
minimum size limit), and those two children end up having to be
combined, then the root node must be thrown away again and the
combined node becomes the new root.

Once we have sorted out each node, we descend to its child on the
cut edge, and do the same thing again. Eventually we reach the
bottom of the tree and every node is of valid size. Then we do the
same thing to the cut edge of the other tree, and we're done.

\C{copy-on-write} Cloning trees

The splitting and joining algorithms look as if they make cut and
paste pretty much trivial. You can split a big chunk out of your
editing buffer into a separate cut buffer easily enough; and then
you can \q{paste} it somewhere else by joining it back into the
middle of the editing buffer at a different position.

However, in real life, cut and paste isn't that simple. People often
want to paste the same data more than once; so you can't just link
the cut buffer straight into the editing buffer, because then you
don't still have it to link in again next time. You need to \e{copy}
the cut buffer and link in the copy. Equally, users often want to
press Copy rather than Cut, in which case you have to split the
buffer tree in two places, \e{copy} the middle section, and join all
three back together.

Copying a tree, it would seem, is inherently an O(N) operation;
there's no way you can copy a tree containing megabytes of data
without actually copying all that data.

Or is there?

It turns out that we \e{can} do better than this, by adding another
annotation field to each tree node. This time, the annotation is a
\e{reference count}: it counts the number of pointers to the node,
either from other tree nodes or from the \q{root} field in a tree
header structure. To begin with, of course, all reference counts are
1.

Reference counts are normally used for garbage collection. In this
case, though, I'm going to use them to implement \e{copy-on-write}.
All of the tree-altering algorithms (insertion and deletion, plus
the split and join algorithms described above) will now check the
reference count of a node before attempting to modify it. If they
find that they need to modify a node with a reference count greater
than one, they will not modify it. Instead, they will make a copy of
that node, and use the copy in place of the original. The copy links
to all the same child nodes as the original, so the reference count
in each child must be incremented; and the copied node's parent (or
tree header structure) now links to the copy rather than to the
original, so the reference count in the original must be
decremented. Now we are looking at a node with a reference count of
1, which means nobody else is using it, so we can modify it safely.

The effect of this is that it is now a trivial - not merely log-time
but \e{constant}-time - operation to \e{clone} an entire B-tree, no
matter how large. We simply create a new tree header structure; we
point its root field at the root node of the input tree; and we
increment the reference count on that root node.
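
A sketch of both operations, in the same illustrative C as before,
assuming the node structure has gained an \cw{int refcount} field
(all names hypothetical):

\c #include <stdlib.h>
\c
\c struct tree { struct node *root; };
\c
\c /* Clone a whole tree in constant time: share the root node and
\c  * count the extra reference to it. */
\c struct tree *tree_clone(struct tree *t)
\c {
\c     struct tree *clone = malloc(sizeof(struct tree));
\c     clone->root = t->root;
\c     if (clone->root)
\c         clone->root->refcount++;
\c     return clone;
\c }
\c
\c /* Call before modifying any node. If the node is shared, swap in
\c  * a private copy: each child gains a reference (from the copy)
\c  * and the original loses one (ours). */
\c struct node *unshare(struct node *n)
\c {
\c     if (n->refcount > 1) {
\c         struct node *copy = malloc(sizeof(struct node));
\c         *copy = *n;
\c         copy->refcount = 1;
\c         for (int i = 0; i < 2; i++)
\c             if (copy->child[i])
\c                 copy->child[i]->refcount++;
\c         n->refcount--;
\c         n = copy;
\c     }
\c     return n;
\c }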

Once we have cloned a tree like this, we can treat the original and
the clone as if they were entirely independent. If you add an
element to one of them, for example, then a single string of nodes
from the root down to one leaf will be duplicated and modified, but
the rest of the two trees will still be held in common. You can
split either tree into lots of little pieces, or join it into the
middle of a larger one, and never affect the data stored in what was
once its clone, because every time you touch a node that the other
tree is depending on, you make your own copy rather than disturbing
it.

This allows us to support \e{really} efficient cut and paste in our
hex editor. You select a 200Mb chunk and press Copy; the buffer tree
is split in two places (in log time), the middle section is cloned
(instantly), and the tree is joined back together. You'd hardly know
anything was different - but the cut buffer now contains a clone of
\e{part} of the original buffer, most of which consists of nodes
that are still shared with it. And you can paste in as many copies
as you like of that chunk, still in no worse than O(log N) time. The
best bit is that by the time you've done this a few times and have a
file that's 1600Mb longer than it started out, the hex editor isn't
actually using up 1600Mb more memory, because most of it is in
shared nodes! This technique naturally provides a form of
compression as well as speed.

\C{lazy-loading} Lazy file loading

In all of the above I have been tacitly assuming that the data
elements stored in my tree are individual bytes. This would be
hideously inefficient if I were using AVL or red-black trees, in
which each node contains precisely one element: for every \e{byte}
of the file being edited, there would be an overhead of two child
pointers, a byte count and a reference count. On a normal 32-bit
machine, that's 20 bytes per node, not counting overhead from the
memory allocator. A factor of twenty is just ridiculous.

B-trees are a bit more flexible, since they can be made to have a
large minimum degree. A B-tree with a minimum node size of (say) 512
can contain up to 1023 bytes of data plus 1024 subtree pointers, and
those 1023 bytes can be packed together in memory so the overhead is
now more like a factor of five. Also, since no node in a B-tree ever
changes its height above ground level, you need not even allocate
space for the 512 NULL child pointers in your leaf nodes; and since
the vast majority of your nodes will \e{be} leaf nodes, the
structure is now closer to being space-efficient.

There are other improvements one could make. For example, there's no
reason why a B-tree really needs to have the \e{same} minimum node
degree at every level; so you could have low-degree nodes everywhere
above the leaf level, and enormous leaf nodes containing 4-8Kb of
file data. You could move to B+ trees, in which no actual data
elements are stored anywhere except in the leaf nodes, thus saving
the tiny alignment overheads in the other nodes.

However, there's a better direction to head in. In \k{simple} I
mentioned the idea of using a linked list as the main data
structure, and I said that each element of the linked list would be
a smallish array of file bytes (between size K and 2K). There's no
reason we couldn't do that in our B-tree-based approach: each
element stored in the B-tree is no longer a single byte but a small
block of bytes. This would mean that our element counts no longer
allowed us to jump to the Nth byte, only to the Nth \e{block}; but
we can fix that by replacing the element count with a byte count,
summing the total \e{size} of all the blocks in or below a
particular tree node. Now, given any byte position, we can do a
single log-time search and return a data block plus an offset within
that block.
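
Sketched in the same illustrative style (and again on a plain
binary tree for brevity), the byte-count search looks like this:

\c #include <stddef.h>
\c
\c struct bnode {
\c     struct bnode *child[2];  /* left and right subtrees, or NULL */
\c     size_t bytes;            /* total block bytes in or below here */
\c     unsigned char *block;    /* this node's block of file data... */
\c     size_t blocklen;         /* ...and that block's length */
\c };
\c
\c static size_t bytes_of(struct bnode *n) { return n ? n->bytes : 0; }
\c
\c /* Map a byte position to (block, offset) in one log-time descent. */
\c struct bnode *find_byte(struct bnode *n, size_t pos, size_t *offset)
\c {
\c     while (1) {
\c         size_t left = bytes_of(n->child[0]);
\c         if (pos < left) {
\c             n = n->child[0];
\c         } else if (pos < left + n->blocklen) {
\c             *offset = pos - left;  /* falls inside this node's block */
\c             return n;
\c         } else {
\c             pos -= left + n->blocklen;
\c             n = n->child[1];
\c         }
\c     }
\c }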

This technique adds work to all operations. Inserting a byte, for
example, is now done by finding the block it needs to go into,
inserting it in that block, and potentially splitting the block in
two and doing an extra tree operation. Splitting and joining buffers
involves splitting and joining blocks at each end, and checking to
make sure undersized blocks are not created. So what does this
technique buy us, that makes it worthwhile over just storing single
bytes in the B-tree?

The answer is: once we have a block data structure as our tree
element, we can start having different \e{types} of block. In
particular, we can have a type of block which is a placeholder,
containing nothing but a file offset and length. A block of this
type indicates \q{at this point in the tree we have N bytes from
position P in the original file}. Blocks of this type are exempt
from the maximum size imposed on normal literal-data blocks.

The effect of this is that we no longer need to read the entire file
into memory when we start up. Instead, we just initialise our tree
trivially, so that it contains nothing but a single placeholder
block, with offset zero and size equal to the initial file size.
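
A sketch of the two block types and the read path, with hypothetical
names; it assumes the unmodified original file remains available
through an open \cw{FILE *}:

\c #include <stdio.h>
\c #include <string.h>
\c
\c enum blocktype { LITERAL, PLACEHOLDER };
\c
\c struct fileblock {
\c     enum blocktype type;
\c     size_t len;              /* number of bytes this block represents */
\c     union {
\c         unsigned char *data; /* LITERAL: the bytes themselves */
\c         long offset;         /* PLACEHOLDER: position in original file */
\c     } u;
\c };
\c
\c /* Read n bytes starting off bytes into a block. Literal data comes
\c  * from memory; placeholder data is fetched from the unmodified
\c  * original file. */
\c void block_read(struct fileblock *b, FILE *original,
\c                 size_t off, size_t n, unsigned char *out)
\c {
\c     if (b->type == LITERAL) {
\c         memcpy(out, b->u.data + off, n);
\c     } else {
\c         fseek(original, b->u.offset + (long)off, SEEK_SET);
\c         fread(out, 1, n, original);
\c     }
\c }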

Now whenever we need to read data from the tree, and it turns out
the data in question is somewhere in a placeholder block, we must
refer back to the original input file in order to find the data (and
the placeholder block will tell us where in the file to read it
from). So before we do any editing, our hex editor is suddenly a
low-cost hex \e{file viewer}, which just pages back and forth and
refers to the disk all the time.

But as soon as we start altering parts of the file, the placeholder
block gets broken up into smaller blocks, and literal-data blocks
begin to be created in between them. If we cut and paste a section
including a placeholder block, then the tree can end up containing
placeholder blocks in a strange order; it might (for example)
indicate something like \q{the first 192K of the input file; then
the literal bytes 5B 49 A7; then 25K of the input file starting from
position 12345; then 512K of the input file starting from position
204325}.

Now the hex editor \e{looks} as if it's doing exactly the same thing
as it did to begin with. I can page around the original file; I can
insert, delete, overwrite, cut, copy and paste to my heart's
content, and (provided no other process modifies the original file
under our feet) the data I am manipulating will remain consistent at
all times with the editing operations I have performed. But there
was no big delay at startup when the file was loaded in, because
most of it \e{wasn't} loaded in; and if I list the running processes
on my system, the hex editor will not be using memory proportional
to the size of the file. It will only be using memory proportional
to the \e{changes} I've made to the file.

When I save the file, if there are any placeholder blocks remaining
in the buffer tree, the hex editor must write out the new version by
referring to the original. This is the \e{only} remaining operation,
apart from searching, that takes time proportional to the size of
the file. And there are \e{no} remaining operations which take
\e{memory} proportional to the size of the file.

(There is one thing you need to be careful of. Literal data blocks
must be permitted to fall below the minimum size K if there is no
literal block next to them to merge with; in particular, this is
vital if you are writing a binary file from scratch, or you would
never be able to give it a size between zero and K. But this raises
the possibility that, given a pathological sequence of editing
operations, your data structure might end up being an interleaving
of one-byte literal blocks and one-byte placeholder blocks, giving a
huge space overhead. The simplest solution is to impose a minimum
size of 2K on \e{placeholder} blocks, below which you read the
relevant piece of file data and convert the placeholder into a
literal block; then it can be merged with adjacent blocks and the
worst case is no longer terrible.)

We now have a data structure which does pretty much everything you
could reasonably ask a hex editor to be able to do, and does it all
at a reasonable memory cost and (apart from the two genuinely
necessary operations of searching and saving) \e{all} in O(log N)
time.

\C{further} Further directions

The data structure as I have presented it is suitable for use in a
high-performance hex editor with an insert mode.

There are a couple more points worth noting.

\H{further-texted} Conventional text editing

This structure would need only minor modifications to be an
efficient basis for a conventional text editor. In order to do this,
you would need to be able to jump quickly to a particular \e{line}
of the file, which means you'd need a node annotation counting
newlines.

In fact, it's possible to do slightly better than that: we can
devise a more complex node annotation which tracks the effect of an
arbitrary byte sequence on the (line, column) position. Assuming
that a physical tab character always advances the cursor to the next
multiple of 8 columns, there are three possibilities:

\b A sequence of bytes containing no newlines or tabs simply adds
some number A to the column number, and does not affect the line
number.

\b A sequence of bytes containing no newlines but at least one tab
has the overall effect of adding some number A to the column, and
rounding it up to the next number that is congruent to B mod 8.

\b A sequence of bytes containing at least one newline has the
effect of adding some number A to the \e{line} number, and setting
the column number to a fixed value B.

These three function schemas are closed under composition (i.e.
combining any two of them gives another one; a sketch of the
composition rule follows below). Storing one in each node of a
buffer tree would provide the ability to search directly to \e{a
particular (line, column) position} in a single log-time search. So
the text editor could treat its buffer as a simple sequence of bytes
(or possibly of Unicode characters). This is superior to treating
the buffer as a sequence of lines, because it removes the
distinction between inserting \e{within} a line and inserting data
\e{between} lines. In particular, cut and paste in a line-based
model is fiddly because lines must be spliced together at each end
of the pasted region; but cut and paste in this model is as trivial
as it was in the hex editor - you just cut a sequence of bytes,
paste it somewhere else, and the line/column indexing automatically
keeps up no matter what you do.
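
Here is a sketch of one possible encoding of those schemas and
their composition, with hypothetical names; \cw{compose(f, s)}
returns the schema with the same effect as applying \cw{f} and then
\cw{s}:

\c /* ADD_COL:  col += a.
\c  * TAB_COL:  col += a, then round up to the next number that is
\c  *           congruent to b mod 8 (a tab always moves the cursor,
\c  *           so "next" means strictly greater).
\c  * NEWLINE:  line += a; col = b. */
\c enum anntype { ADD_COL, TAB_COL, NEWLINE };
\c struct annot { enum anntype type; long a, b; };
\c
\c /* Distance from x up to the next number congruent to b mod 8,
\c  * always at least 1. */
\c static long rudist(long x, long b)
\c {
\c     long d = ((b - x) % 8 + 8) % 8;
\c     return d ? d : 8;
\c }
\c
\c struct annot compose(struct annot f, struct annot s)
\c {
\c     struct annot r = f;
\c     switch (s.type) {
\c       case NEWLINE:          /* s's newline wipes out f's column effect */
\c         r = s;
\c         if (f.type == NEWLINE)
\c             r.a += f.a;      /* but line deltas accumulate */
\c         break;
\c       case ADD_COL:
\c         if (f.type == ADD_COL) {
\c             r.a += s.a;
\c         } else if (f.type == TAB_COL) {
\c             r.a += s.a;
\c             r.b = (r.b + s.a) % 8;
\c         } else {
\c             r.b += s.a;      /* NEWLINE: the fixed column shifts */
\c         }
\c         break;
\c       case TAB_COL:
\c         if (f.type == ADD_COL) {
\c             r = s;
\c             r.a += f.a;
\c         } else if (f.type == TAB_COL) {
\c             long d = rudist(f.b + s.a, s.b);
\c             r.a = f.a + s.a + d;
\c             r.b = s.b;
\c         } else {             /* NEWLINE, then a tab-containing run */
\c             r.b = f.b + s.a + rudist(f.b + s.a, s.b);
\c         }
\c         break;
\c     }
\c     return r;
\c }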

The only snag is that if you did this, you would probably no longer
be able to do the trick with placeholder blocks and lazy file
loading; a text editor tends to need to know in advance where all
the newlines are in its buffer, so there would probably be no
alternative to physically loading the file. But in that respect, at
least, this data structure is no worse than any other.

\H{undo} Supporting undo

An undo function in an editor \e{conceptually} stores a sequence of
previous buffer states, and allows you to return to one of them when
you need to.

Usually, this is not actually implemented by storing copies of the
entire buffer, since that would be ludicrously wasteful of space!
Instead, a journal of changes is kept, which allows previous buffer
states to be \e{reconstructed} by reversing the precise changes
made.

One could do that using this data structure, if one wanted to.
However, there's another intriguing option. Since cloning an
arbitrarily large tree is a cheap operation, you could implement
undo by \e{actually} storing a sequence of clones of previous buffer
states! The cost of this would be nothing like as bad as it would
na\u00EF{i}vely appear.

It might still not be ideal, though. Every time you clone a tree and
the two clones diverge, several nodes must be copied, and if each
node contains several blocks of literal data then the cost of
maintaining too many buffer clones might still become prohibitive.
But it's an interesting possibility regardless.

\C{summary} Summary

I've presented a design for a data structure which implements
practically every operation required for a hex editor in O(log N)
time, apart from one or two which genuinely \e{need} to be O(N).

The structure is:

\b A B-tree, each of whose elements is either a small array of
literal data bytes, or a placeholder block denoting a section of the
unmodified input file.

\b Each B-tree node is annotated with the total byte count of all
the elements in or below that node, allowing a log-time search to
pinpoint any numeric byte position.

\b Those counts provide the only necessary means of navigating the
tree, so there is no need for a sorting criterion.

\b Split and join algorithms make it possible to link and unlink
large chunks from the middle of a buffer at a time.

\b Reference counts implementing copy-on-write allow cloning of
chunks in constant time.

As a result:

\b Inserting or deleting bytes in the file is a log-time operation.

\b Finding a particular byte position is a log-time operation.

\b Cut and paste is always log-time, no matter how large or complex
the chunk of data being moved around.

\b Memory usage grows proportionally to the \e{changes} made to the
file, not the overall file size. (However, memory usage is also
\e{bounded} by a value proportional to the file size, even if you
keep editing and re-editing for ever.)

Searching must still be linear (there's no alternative to actually
reading the data if you need to know anything about its contents),
and saving the modified output file is linear (because you must
physically write out that much data), but \e{everything} else can be
done in log time.

I've also sketched a means of converting this into a data structure
for an ordinary text editor, and suggested interesting implications
in the area of undo operations.

\C{ref} References

Donald Knuth's \q{The Art of Computer Programming}
(\W{http://en.wikipedia.org/w/wiki.phtml?title=Special:Booksources&isbn=0201485419}{Addison-Wesley,
ISBN 0201485419}) presents at least some of the same ideas as this
article. Counted and unsorted trees are mentioned in volume 3;
splitting and joining are also described (although Knuth does them
on AVL trees, which are significantly more fiddly to split than
B-trees; you have to cut the tree into lots of little pieces, and
then put them all back together by using the join algorithm
repeatedly).

\q{Tweak}, a hex editor implementing this data structure, can be
downloaded at
\W{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}\cw{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}.

\versionid $Id$