\cfg{html-leaf-level}{0}
\cfg{chapter}{Section}
\cfg{text-title-align}{left}
\cfg{text-indent}{0}
\cfg{text-chapter-numeric}{yes}
\cfg{text-chapter-suffix}{. }
\cfg{text-chapter-underline}{-}
\cfg{text-section-numeric}{0}{yes}
\cfg{text-section-suffix}{0}{. }
\cfg{text-section-underline}{0}{-}
\cfg{html-chapter-numeric}{yes}
\cfg{html-chapter-suffix}{. }
\cfg{html-section-numeric}{0}{yes}
\cfg{html-section-suffix}{0}{. }
\cfg{html-section-numeric}{1}{yes}
\cfg{html-section-suffix}{1}{. }

\title An Efficient Data Structure For A Hex Editor

by \W{http://pobox.com/~anakin/}{Simon Tatham}

\C{intro} Introduction

Hex editors have been around for a long time, and at the most basic
level they are simple to write. Since they are mostly used for
editing files such as executables, which contain a lot of
cross-references to particular byte positions in the file, a hex
editor need not have an insert mode in order to be useful. And a hex
editor without an insert mode is very easy to implement: you simply
allocate a large enough array for the input file, and use that as
your data structure. The only operation you really need to be able
to do efficiently is to jump to a particular byte position, and
that's precisely what an array makes easy.

On the other hand, an insert mode can be useful in other
circumstances. Not \e{all} types of file you might want to edit have
the same restrictions as an executable. And as soon as you want your
hex editor to have an insert mode, the data structure question
becomes much more interesting.

In this article I present an efficient and scalable data structure
which supports all the operations needed by a hex editor.

\C{simple} Simple options

One technique used to support insert mode in editors is to use an
array larger than the file size, with a gap in it. The file contents
up to the current cursor position are stored at the start of the
array; the file contents from the current cursor position to the end
are stored at the end of the array; and the gap in the middle moves
about as the cursor does.

This makes insertion easy. When the user inserts an extra character,
you just add it to one end or other of the gap. On the other hand,
\e{moving} through the file now becomes a slow operation; it's not
noticeable when you're moving by a byte, by a line, or even by a
screenful at a time, but as soon as you try to jump to the start or
end of the file, or jump to a particular specified file offset,
suddenly the editor has to bodily shift enormous amounts of file
data from one end of the gap to the other.

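To make that concrete, here is a minimal sketch of a gap buffer in
C. This is illustrative code of my own, not taken from any
particular editor; the \cw{GapBuffer} layout and helper names are
invented.

\c #include <string.h>
\c
\c typedef struct {
\c     unsigned char *buf;    /* prefix at the start, suffix at the end */
\c     size_t size;           /* total allocated size of buf */
\c     size_t gap_start;      /* first byte of the gap */
\c     size_t gap_end;        /* first byte after the gap */
\c } GapBuffer;
\c
\c /* Move the gap so that it starts at byte position pos. When pos is
\c  * far from the current gap, this shifts an arbitrarily large amount
\c  * of file data: the slow jump described above. */
\c void gap_move(GapBuffer *gb, size_t pos)
\c {
\c     if (pos < gb->gap_start) {
\c         size_t n = gb->gap_start - pos;
\c         memmove(gb->buf + gb->gap_end - n, gb->buf + pos, n);
\c         gb->gap_end -= n;
\c     } else {
\c         size_t n = pos - gb->gap_start;
\c         memmove(gb->buf + gb->gap_start, gb->buf + gb->gap_end, n);
\c         gb->gap_end += n;
\c     }
\c     gb->gap_start = pos;
\c }
\c
\c /* Insertion at the cursor is then trivial (assuming the gap is not
\c  * yet full): one byte into the gap. */
\c void gap_insert(GapBuffer *gb, unsigned char byte)
\c {
\c     gb->buf[gb->gap_start++] = byte;
\c }
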
Another slightly better option is to use a linked list of small
arrays, and to let the arrays vary in size between K and 2K bytes,
for some fixed minimum block size K. Inserting a single byte in the
middle of a block doesn't cost too much; occasionally the block will
grow beyond size 2K and have to be split into two smaller ones, but
even that isn't too slow.

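Again purely for concreteness, the node type this scheme implies
might look like the following sketch (field names invented):

\c #include <stddef.h>
\c
\c #define K 4096                    /* minimum block size, for example */
\c
\c struct chunk {
\c     struct chunk *next;           /* singly linked list of blocks */
\c     size_t len;                   /* K <= len <= 2*K, except when the
\c                                    * whole file is smaller than K */
\c     unsigned char data[2 * K];    /* the file bytes themselves */
\c };
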
Jumping to a particular position, however, is still an O(N)
operation using this structure. In practice it isn't \e{too} bad,
since the length of the linked list is at worst 1/K times the size
of the file; but once the file size becomes seriously big, this
approach does not scale well.

The common problem in both these methods is that as soon as you make
insertion a constant-time operation, seeking to a given byte
position becomes linear-time; whereas in the original array format,
of course, seeking was constant-time but \e{insertion} became
linear-time.

\C{trees} Using balanced trees

This is where trees come in. Balanced tree structures (any of AVL
trees, red-black trees and B-trees) all solve this sort of problem
for sorted lists. You can insert an element into a balanced tree in
\e{log} time, and you can search for a particular element in log
time as well. This sounds like the kind of compromise we want: if
making insertion constant-time forces seeking to be linear and vice
versa, we would prefer to arrange for \e{both} to be log-time.

The conventional use of a balanced tree to store a sorted list,
however, is not immediately helpful to us. The only criterion we
could reasonably sort on would be byte position in the file; and as
soon as we store our data as a set of (position, data) pairs, we're
back to insertion being linear again, because we would have to alter
the position field of every tree element after the insertion point.

Is there anything we can do to our balanced trees to make this work
better?

\C{counted-trees} Counted trees

Yes, there is.

Suppose you add an additional field to every node of a balanced
tree. In that field, you store a count of the number of elements \e{in
or below} that node.

Operations which alter the tree (insertion and deletion) now have to
make sure these counts remain accurate. This can be done without
sacrificing the log-time characteristics of the operations. For
example, when you add an element, you increment the count of the
node containing it, and then work back up the tree to the root
incrementing the counts in all the nodes you go past. Since the
height of the tree is O(log N), this only takes you O(log N) time.

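As a sketch of what the annotation looks like (a plain binary tree
node for simplicity; the field names are mine, and a real
implementation would use an AVL, red-black or B-tree node instead):

\c #include <stddef.h>
\c
\c typedef struct node {
\c     struct node *left, *right;
\c     size_t count;           /* number of elements in or below here */
\c     unsigned char element;
\c } node;
\c
\c static size_t count(node *n) { return n ? n->count : 0; }
\c
\c /* Recompute one node's count from its children. Calling this on
\c  * every node along the path from a modified leaf back up to the
\c  * root - including any nodes moved by rebalancing rotations - keeps
\c  * the annotation accurate at O(log N) cost per update. */
\c static void fix_count(node *n)
\c {
\c     n->count = 1 + count(n->left) + count(n->right);
\c }
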
So we can add counts to a tree and still maintain it efficiently.
What have the counts bought us?

Once we have counts in a tree, they introduce an entirely new way to
\e{search} the tree. Starting at the root, we can search down the
tree by examining the count fields rather than comparing elements as
usual; and this allows us to find the Nth item in the tree, for any
N, in a single log-time search. For example, suppose the root tree node
contains a child with count 54, then an actual element, then a child
with count 73. Then:

\b If you are trying to get to a position less than 54, then you
descend straight to the leftmost child.

\b If you are trying to get to \e{exactly} position 54, you return
the element out of the root node.

\b If you are trying to get to position 55 or greater, you descend
to the rightmost child, and subtract 55 from your desired position.
(If you want element 57 of the tree, then you know there are 55
elements in the tree before the right-hand subtree, so you know you
want element 2 of the right-hand subtree.)

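That descent translates almost directly into code. This sketch
reuses the invented \cw{node} type and \cw{count} helper from the
previous example, generalising the 54/73 example above to any node:

\c /* Return the element at 0-based position pos within the tree.
\c  * Descends purely by counts; pos must be less than the total count
\c  * of the whole tree. */
\c unsigned char index_tree(node *n, size_t pos)
\c {
\c     while (1) {
\c         size_t before = count(n->left); /* elements left of n's own */
\c         if (pos < before) {
\c             n = n->left;           /* wanted position is on the left */
\c         } else if (pos == before) {
\c             return n->element;     /* it is this node's own element */
\c         } else {
\c             pos -= before + 1;     /* skip left subtree and n itself */
\c             n = n->right;
\c         }
\c     }
\c }
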
So now we have a means of finding the Nth item in a tree in a
log-time search. This is starting to look promising.

The trouble is, we're still stuck with having some sort of sorting
order on the tree. Now we need to deal with that.

\C{unsorted-trees} Unsorted trees

The simple answer to the sorting problem is to do away with sorting
the tree altogether!

Conventional balanced trees have a sorting order because it's used
to find elements in the tree, and to know where to add an element.
But we don't need a sorting order to find things any more, because
we can use a count-based search to jump to the Nth position. Can we
also use counts during the tree add operation, to allow us to
specify \e{where} we want to add our new element?

We can. Tree add algorithms start by searching down the tree to find
the position where the new element will be inserted. If we do this
search using counts, in exactly the same way described in
\k{counted-trees}, then we can add any element we like at any
position in the tree. Once we do this, of course, we have to throw
out the sorting order completely, and never do another order-based
search or insertion again, because they won't work. But that's OK,
because we didn't need them anyway.

Now we have exactly what we were after in the first place. We
have a data structure which stores an unordered list of items, in
such a way that we can insert or delete an item in log time \e{and}
find the Nth element in log time.

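Continuing the earlier sketch (same invented \cw{node} type, and
still ignoring the rebalancing step, which works exactly as in an
ordinary balanced tree provided \cw{fix_count} is applied to every
node it moves), a count-guided insertion looks like this:

\c #include <stdlib.h>
\c
\c /* Insert elem so that it becomes element number pos (0-based) of
\c  * the subtree rooted at n. Returns the new subtree root. */
\c node *insert_at(node *n, size_t pos, unsigned char elem)
\c {
\c     if (!n) {
\c         n = malloc(sizeof(node));
\c         n->left = n->right = NULL;
\c         n->element = elem;
\c         n->count = 1;
\c         return n;
\c     }
\c     if (pos <= count(n->left))     /* new element goes before n's own */
\c         n->left = insert_at(n->left, pos, elem);
\c     else                           /* skip left subtree and n itself */
\c         n->right = insert_at(n->right, pos - count(n->left) - 1, elem);
\c     fix_count(n);                  /* one more element below us now */
\c     return n;
\c }
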
\C{splitjoin} Splitting and joining trees

Now we can begin to get more ambitious. One issue we have not
addressed yet is cut and paste.

So far I have discussed tree insertion on the assumption that you
only ever insert one character at a time into your tree. In fact hex
editors need cut and paste just as much as normal text editors do;
so we must think about how to insert or remove a larger block of
data at a time.

One obvious way is to process each byte individually. A ten-byte
cut operation is ten individual deletions, and a ten-byte paste is
ten individual insertions. This is fine if you only ever use cut and
paste to move tiny chunks of data around a large file, but if you
need to move \e{half the file} from one place to another, things get
more interesting.

The linked-list structure discussed in \k{simple} would have helped
a lot with this problem. Linked lists don't just make it easy to
insert or delete one item: they make it just as easy to unlink an
enormous chunk of a list once you've found both ends of the chunk,
and you can link that chunk in somewhere else easily as well.

It turns out that you \e{can} do the same thing with balanced trees.
At this point it starts to make a difference what kind of balanced
tree you use: all three of AVL, red-black and B-trees support these
operations, but the precise methods vary between them. I'm going to
use B-trees from here on, because the algorithms are slightly simpler.

What we need are two basic operations. Given a counted, unsorted
B-tree containing an unordered list of items, we need to be able to:

\b Split the tree down the middle, giving two valid B-trees as output.

\b Take two valid B-trees and join them together end-to-end, giving
one B-tree containing all the data from tree A followed by the data
from tree B.

This will provide all the operations we need. To unlink a large
section from the middle of a tree, we split it in two places and
then join the outer two parts back together; to link a large section
\e{into} the middle of a tree, we split it at the insertion point,
join the left half on to the left side of the inserted section, and
join the right half on to the right side of the inserted section.

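Given the two primitives, in other words, cut and paste are short
compositions of them. A sketch, with hypothetical names and
signatures standing in for the split and join operations described
above:

\c /* Assumed primitives:
\c  *   void   bt_split(btree *t, size_t pos, btree **lo, btree **hi);
\c  *          (consumes t; *lo gets elements [0,pos), *hi the rest)
\c  *   btree *bt_join(btree *a, btree *b);
\c  *          (consumes a and b; result is a's elements then b's)
\c  */
\c
\c /* Unlink elements [start,end) from the middle of *t and return
\c  * them as a tree of their own. */
\c btree *buffer_cut(btree **t, size_t start, size_t end)
\c {
\c     btree *left, *mid, *right;
\c     bt_split(*t, end, &mid, &right);    /* split off [end, N)... */
\c     bt_split(mid, start, &left, &mid);  /* ...and then [0, start) */
\c     *t = bt_join(left, right);          /* rejoin the outer parts */
\c     return mid;                         /* the unlinked section */
\c }
\c
\c /* Link a whole tree in at position pos of *t. */
\c void buffer_insert(btree **t, btree *section, size_t pos)
\c {
\c     btree *left, *right;
\c     bt_split(*t, pos, &left, &right);
\c     *t = bt_join(bt_join(left, section), right);
\c }
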
\H{joining} Joining two B-trees together

When you add an element to a B-tree, sometimes it ends up increasing
the size of a leaf node beyond the size limit. When that happens,
you deal with it by splitting the node in two, and transforming the
parent node so that where it previously had a single child pointer,
it now has two child pointers with an element between them. If that
makes the parent node too big as well, you do the same thing again,
and so on until you reach the tree root.

Joining two B-trees is therefore reasonably simple, \e{if} you have
an additional separating element to place in between them. Position
the two trees so that their leaf nodes are at the same level; now
(usually) one tree will be shorter than the other. So you can add
the root of the shorter tree as a sibling of the node next to it in
the taller tree; their common parent gains one extra child pointer
(pointing at the root of the shorter tree), separated from its
neighbour by the additional separating element. If this causes the
node to increase beyond the maximum size, just split it in two and
propagate up to its parent, just as in the ordinary insertion
process.

If the trees were originally the same height, just combine their
root nodes into a single larger root node. You need an extra element
to go in between the rightmost child pointer of the left-hand root
node, and the leftmost child pointer of the right-hand root node;
and again, this is where your separating element comes in. Again, if
the new root is too big to be a single node, split it in two and
create a new root above it.

So it turns out that it's very easy to join two trees together, but
the algorithm requires a spare element to go in the middle. However,
we normally don't have such a spare element: we just have two trees.
This is easily solved, though: we simply start by removing the
leftmost element of the right-hand tree using the ordinary tree
deletion algorithm. Then we just do the join algorithm, as described
above, using the element we just removed as our separator.

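So the full join, given nothing but the two trees, is only a few
lines on top of the separator-based join. Another sketch with
hypothetical helper names (\cw{bt_delete_at} being ordinary
counted-tree deletion, returning the element it removed, and
\cw{bt_join_sep} the height-matching join just described):

\c /* Join trees a and b, a's elements first, given only the trees. */
\c btree *bt_join(btree *a, btree *b)
\c {
\c     if (bt_count(b) == 0)
\c         return a;                      /* nothing to join on */
\c     element sep = bt_delete_at(b, 0);  /* b's leftmost element */
\c     return bt_join_sep(a, sep, b);     /* join with separator */
\c }
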
\H{splitting} Splitting a B-tree in two

To split a B-tree in two: we are given a tree, and a means of
searching down the tree to find the split point. (In this
application, that will be a numeric position, which we check against
the node counts on the way down; in other situations, we might
perfectly well want to split an ordinary \e{sorted} B-tree in half,
so we might have an ordering-based search criterion. It makes no
difference.)

We start in the simplest possible way. Start at the root node;
decide which of its subtree pointers you are going to descend
through; and saw the node in half at that subtree pointer. The two
half-nodes thus created will \e{each} need a subtree pointer to go
on the cut end, but that's OK because we're about to saw the next
node down in half as well and they can have half each. So descend to
the next node, decide on a split point again, saw that node in half,
and put a pointer to each half in the two halves of the parent node.

Once we finish this searching-and-cutting pass, we will have
successfully separated our tree into two parts at the required
point. However, the result will almost certainly not be a pair of
\e{valid} B-trees; the chances are that many of the nodes on the cut
edges will be below the minimum allowed node size. In fact, if at
any point our search criterion made us descend through the
\e{endmost} subtree pointer in any node, some of those nodes will
have no elements in them whatsoever, just a single subtree pointer!

So now we must make a healing pass down the cut edge of each tree,
to turn it back into a valid B-tree. We can start by throwing away
the root node if it has nothing but a single subtree pointer (which
will happen quite often if we split near one end of the original
tree, since in that case the output trees will almost certainly need
to be of different heights). Keep doing that until we find a real
root node.

One child of that node is on the cut edge, so it may be below the
minimum size. If it is, we solve this using its (valid) neighbour
node. If the neighbour is large, we can move some subtrees over into
the undersized node to make two correctly sized nodes; if the
neighbour is too small and does not have that many subtrees to
spare, we can instead \e{combine} the undersized node with its
neighbour. (And it turns out you can always do at least one of
these: if the neighbour is too large to combine with the undersized
node, then it \e{must} have enough subtrees for redistribution to
give two viable nodes.)

The only interesting case is that combining an undersized node with
its neighbour reduces the number of subtrees of their common parent
by one. Therefore:

\b As we go down, we arrange for each node on the cut edge to be at
least \e{one more than} minimum size, in case its size must drop by
one when we process its child. (This still just about works in all
cases.)

\b If the first non-trivial root node had only two children (recall
that the root node in a B-tree is the only node exempt from the
minimum size limit), and those two children end up having to be
combined, then the root node must be thrown away again and the
combined node is the new root.

Once we have sorted out each node, we descend to its child on the
cut edge, and do the same thing again. Eventually we reach the
bottom of the tree and every node is of valid size. Then we do the
same thing to the cut edge of the other tree, and we're done.

\C{copy-on-write} Cloning trees

The splitting and joining algorithms look as if they make cut and
paste pretty much trivial. You can split a big chunk out of your
editing buffer into a separate cut buffer easily enough; and then
you can \q{paste} it somewhere else by joining it back into the
middle of the editing buffer at a different position.

However, in real life, cut and paste isn't that simple. People often
want to paste the same data more than once; so you can't just link
the cut buffer straight into the editing buffer, because then you
don't still have it to link in again next time. You need to \e{copy}
the cut buffer and link in the copy. Equally, users often want to
press Copy rather than Cut, in which case you have to split the
buffer tree in two places, \e{copy} the middle section, and join all
three back together.

Copying a tree, it would seem, is inherently an O(N) operation;
there's no way you can copy a tree containing megabytes of data
without actually copying all that data.

Or is there?

It turns out that we \e{can} do better than this, by adding another
annotation field to each tree node. This time, the annotation is a
\e{reference count}: it counts the number of pointers to the node,
either from other tree nodes or from the \q{root} field in a tree
header structure. To begin with, of course, all reference counts are
1.

Reference counts are normally used for garbage collection. In this
case, though, I'm going to use them to implement \e{copy-on-write}.
All of the tree-altering algorithms (insertion and deletion, plus
the split and join algorithms described above) will now check the
reference count of a node before attempting to modify it. If they
find that they need to modify a node with a reference count greater
than one, they will not modify it. Instead, they will make a copy of
that node, and use the copy in place of the original. The copy links
to all the same child nodes as the original, so the reference count
in each child must be incremented; and the copied node's parent (or
tree header structure) now links to the copy rather than to the
original, so the reference count in the original must be
decremented. Now we are looking at a node with a reference count of
1, which means nobody else is using it so we can modify it safely.

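In code, the check-and-copy step might look like the sketch below.
The node layout is invented (a B-tree node with up to \cw{MAXELEMS}
elements and one more child pointer than elements, here storing
single bytes); the part that matters is the reference-count
discipline in \cw{unshare}:

\c #include <stdlib.h>
\c
\c #define MAXELEMS 7              /* 2t-1 for minimum degree t=4, say */
\c
\c typedef struct bt_node bt_node;
\c struct bt_node {
\c     int refcount;               /* how many pointers point at us */
\c     int nelems;                 /* elements held in this node */
\c     bt_node *child[MAXELEMS + 1];   /* all NULL in a leaf */
\c     unsigned char elem[MAXELEMS];
\c };
\c
\c /* Call before modifying n. Returns a node the caller exclusively
\c  * owns: either n itself, or a fresh copy of it. */
\c bt_node *unshare(bt_node *n)
\c {
\c     if (n->refcount > 1) {
\c         bt_node *copy = malloc(sizeof(bt_node));
\c         *copy = *n;                         /* shallow copy */
\c         copy->refcount = 1;                 /* only the caller links here */
\c         for (int i = 0; i <= copy->nelems; i++)
\c             if (copy->child[i])
\c                 copy->child[i]->refcount++; /* copy shares the children */
\c         n->refcount--;                      /* one link moves to the copy */
\c         n = copy;
\c     }
\c     return n;
\c }
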
The effect of this is that it is now a trivial - not merely log-time
but \e{constant}-time - operation to \e{clone} an entire B-tree, no
matter how large. We simply create a new tree header structure; we
point its root field at the root node of the input tree; and we
increment the reference count on that root node.

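That is, the entire clone operation is the following sketch (same
invented node type as above, plus the tree header structure just
described):

\c typedef struct {
\c     bt_node *root;              /* NULL for an empty tree */
\c } btree;
\c
\c /* Clone a whole tree in O(1): share the root and bump its count. */
\c btree *bt_clone(btree *t)
\c {
\c     btree *clone = malloc(sizeof(btree));
\c     clone->root = t->root;
\c     if (clone->root)
\c         clone->root->refcount++;
\c     return clone;
\c }
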
Once we have cloned a tree like this, we can treat the original and
the clone as if they were entirely independent. If you add an
element to one of them, for example, then a single string of nodes
from the root down to one leaf will be duplicated and modified, but
the rest of the trees will still be held in common. You can split
either tree into lots of little pieces, or join it into the middle
of a larger one, and never affect the data stored in what was once
its clone, because every time you touch a node that the other tree
is depending on, you make your own copy rather than disturbing it.

This allows us to support \e{really} efficient cut and paste in our
hex editor. You select a 200Mb chunk and press Copy; the buffer tree
is split in two places (in log time), the middle section is cloned
(instantly), and the tree is joined back together. You'd hardly know
anything was different - but the cut buffer now contains a clone of
\e{part} of the original buffer, most of which consists of nodes
that are still shared with it. And you can paste in as many copies
as you like of that chunk, still in no worse than O(log N) time. The
best bit is that by the time you've done this a few times and have a
file that's 1600Mb longer than it started out, the hex editor isn't
actually using up 1600Mb more memory, because most of it is in
shared nodes! This technique naturally provides a form of
compression as well as being fast.

\C{lazy-loading} Lazy file loading

In all of the above I have been tacitly assuming that the data
elements stored in my tree are individual bytes. This would be
hideously inefficient if I were using AVL or red-black trees, in
which each node contains precisely one element: for every \e{byte}
of the file being edited, there would be an overhead of two child
pointers, a byte count and a reference count. On a normal 32-bit
machine, that's 20 bytes per node (sixteen bytes of overhead, plus
the stored byte itself padded to word alignment), not counting
overhead from the memory allocator. A factor of twenty is just
ridiculous.

B-trees are a bit more flexible, since they can be made to have a
large minimum degree. A B-tree with a minimum node size of (say) 512
can contain up to 1023 bytes of data plus 1024 subtree pointers, and
those 1023 bytes can be packed together in memory so the overhead is
now more like a factor of five. Also, since no node in a B-tree ever
changes its height above ground level, you can just not bother to
allocate space for the 512 NULL child pointers in your leaf nodes,
and since the vast majority of your nodes will \e{be} leaf nodes,
the structure is now closer to being space-efficient.

There are other improvements one could make. For example, there's no
reason why a B-tree really needs to have the \e{same} minimum node
degree at every level; so you could have low-degree nodes everywhere
above the leaf level, and enormous leaf nodes containing 4-8Kb of
file data. You could move to B+ trees in which no actual data
elements were stored anywhere except in the leaf nodes, thus saving
the tiny alignment overheads in the other nodes.

However, there's a better direction to head in. In \k{simple} I
mentioned the idea of using a linked list as the main data
structure, and I said that each element of the linked list would be
a smallish array of file bytes (between size K and 2K). There's no
reason we couldn't do that in our B-tree-based approach: each
element stored in the B-tree is no longer a single byte but a small
block of bytes. It would mean that our element counts no longer
allowed us to jump to the Nth byte, only to the Nth \e{block}; but
we can fix that by replacing the element count with a byte count,
summing the total \e{size} of all the blocks in or below a
particular tree node. Now, given any byte position, we can do a
single log-time search and return a data block plus an offset within
that block.

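A sketch of that search, using another invented node layout (the
reference counts from the previous section are omitted here to keep
the focus on the byte-count annotation):

\c #include <stddef.h>
\c
\c typedef struct {
\c     size_t len;                 /* number of bytes in this block */
\c     unsigned char *data;        /* the bytes themselves */
\c } block;
\c
\c #define MAXELEMS 7
\c typedef struct bt_node bt_node;
\c struct bt_node {
\c     size_t total_bytes;         /* bytes in or below this node */
\c     int nelems;
\c     bt_node *child[MAXELEMS + 1];   /* all NULL in a leaf */
\c     block *elem[MAXELEMS];
\c };
\c
\c static size_t subtree_bytes(bt_node *n) { return n ? n->total_bytes : 0; }
\c
\c /* Find the block containing byte position pos, and store that
\c  * byte's offset within the block in *off. Returns NULL if pos is
\c  * beyond the end of the buffer. */
\c block *find_byte(bt_node *n, size_t pos, size_t *off)
\c {
\c     while (n) {
\c         int i, descended = 0;
\c         for (i = 0; i < n->nelems; i++) {
\c             if (pos < subtree_bytes(n->child[i])) {
\c                 n = n->child[i];       /* descend left of element i */
\c                 descended = 1;
\c                 break;
\c             }
\c             pos -= subtree_bytes(n->child[i]);
\c             if (pos < n->elem[i]->len) {   /* inside element i */
\c                 *off = pos;
\c                 return n->elem[i];
\c             }
\c             pos -= n->elem[i]->len;
\c         }
\c         if (!descended)
\c             n = n->child[n->nelems];   /* rightmost subtree */
\c     }
\c     return NULL;
\c }
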
This technique adds work to all operations. Inserting a byte, for
example, is now done by finding the block it needs to go into,
inserting it in that block, and potentially splitting the block into
two and doing an extra tree operation. Splitting and joining buffers
involves splitting and joining blocks at each end, and checking to
make sure undersized blocks are not created. So what does this
technique buy us, that makes it worthwhile over just storing single
bytes in the B-tree?

The answer is: once we have a block data structure as our tree
element, we can start having different \e{types} of block. In
particular, we can have a type of block which is a placeholder,
containing nothing but a file offset and length. A block of this
type indicates \q{at this point in the tree we have N bytes from
position P in the original file}. Blocks of this type are exempt
from the maximum size imposed on normal literal-data blocks.

The effect of this is that we no longer need to read the entire file
into memory when we start up. Instead, we just initialise our tree
trivially, so that it contains nothing but a single placeholder
block, with offset zero and size equal to the initial file size.

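In terms of the earlier sketch, the \cw{block} type grows a second
variant (replacing the literal-only definition above), and
\q{loading} a file becomes a single small allocation. All names are
invented, as before:

\c #include <stdlib.h>
\c
\c typedef struct {
\c     enum { LITERAL, PLACEHOLDER } type;
\c     size_t len;                  /* bytes this block represents */
\c     union {
\c         unsigned char *data;     /* LITERAL: the bytes themselves */
\c         size_t offset;           /* PLACEHOLDER: position in the
\c                                   * unmodified input file */
\c     } u;
\c } block;
\c
\c /* "Load" a file: one placeholder block covering the whole file. */
\c block *initial_block(size_t file_size)
\c {
\c     block *b = malloc(sizeof(block));
\c     b->type = PLACEHOLDER;
\c     b->len = file_size;
\c     b->u.offset = 0;
\c     return b;
\c }
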
Now whenever we need to read data from the tree, and it turns out
the data in question is somewhere in a placeholder block, we must
refer back to the original input file in order to find the data (and
the placeholder block will tell us where in the file to read it
from). So before we do any editing, our hex editor is suddenly a
low-cost hex \e{file viewer}, which just pages back and forth and
refers to the disk all the time.

But as soon as we start altering parts of the file, the placeholder
block gets broken up into smaller blocks, and literal-data blocks
begin to be created in between them. If we cut and paste a section
including a placeholder block, then the tree can end up containing
placeholder blocks in a strange order; it might (for example)
indicate something like \q{the first 192K of the input file; then
the literal bytes 5B 49 A7; then 25K of the input file starting from
position 12345; then 512K of the input file starting from position
204325}.

Now the hex editor \e{looks} as if it's doing exactly the same thing
as it did to begin with. I can page around the original file; I can
insert, delete, overwrite, cut, copy and paste to my heart's
content, and (provided no other process modifies the original file
under our feet) the data I am manipulating will remain consistent at
all times with the editing operations I have performed. But there
wasn't a big delay at startup when the file was loaded in, because
most of it \e{wasn't} loaded in; and if I list the running processes
on my system, the hex editor will not be using memory proportional
to the size of the file. It will only be using memory proportional
to the \e{changes} I've made to the file.

When I save the file, if there are any placeholder blocks remaining
in the buffer tree, the hex editor must write out the new version by
referring to the original. This is the \e{only} remaining operation,
apart from searching, that takes time proportional to the size of
the file. And there are \e{no} remaining operations which take
\e{memory} proportional to the size of the file.

(There is one thing you need to be careful of. Literal data blocks
must be permitted to fall below the minimum size K if there is no
literal block next to them to merge with; in particular, this is
vital if you are writing a binary file from scratch, since otherwise
you would never be able to give it a size between zero and K. But
this raises the possibility that given a pathological sequence of
editing operations, your data structure might end up being an
interleaving of one-byte literal blocks and one-byte placeholder
blocks, giving a huge space overhead. The simplest solution to this
is to impose a minimum size of 2K on \e{placeholder} blocks, below
which you read the relevant piece of file data and convert them into
literal blocks; then they can be merged with adjacent blocks and the
worst case is no longer terrible.)

We now have a data structure which does pretty much everything you
could reasonably ask a hex editor to be able to do, and does it all
at a reasonable memory cost and (apart from the two genuinely
necessary operations of searching and saving) \e{all} in O(log N)
time.

\C{further} Further directions

The data structure as I have presented it is suitable for use in a
high-performance hex editor with an insert mode.

There are a couple more points worth noting.

\H{further-texted} Conventional text editing

This structure would need only minor modifications to be an
efficient basis for a conventional text editor. In order to do this,
you would need to be able to jump quickly to a particular \e{line}
of the file, which means you'd need a node annotation counting
newlines.

In fact, it's possible to do slightly better than that: we can
devise a more complex node annotation which tracks the effect of an
arbitrary byte sequence on the (line, column) position. Assuming
that a physical tab character always advances the cursor to the next
multiple of 8 spaces, there are three possibilities:

\b A sequence of bytes containing no newlines or tabs simply adds
some number A to the column number, and does not affect the line
number.

\b A sequence of bytes containing no newlines but at least one tab
has the overall effect of adding some number A to the column, and
rounding it up to the next number that is congruent to B mod 8.

\b A sequence of bytes containing at least one newline has the
effect of adding some number A to the \e{line} number, and setting
the column number to a fixed value B.

These three function schemas are closed under composition (i.e.
combining any two of them gives another one). Storing one in each
node of a buffer tree would provide the ability to search directly
to \e{a particular (line, column) position} in a single log-time
search. So the text editor could treat its buffer as a simple
sequence of bytes (or possibly of Unicode characters). This is
superior to treating the buffer as a sequence of lines, because it
removes the distinction between inserting \e{within} a line and
inserting data \e{between} lines. In particular, cut and paste in a
line-based model is fiddly because lines must be spliced together at
each end of the pasted region; but cut and paste in this model is as
trivial as it was in the hex editor - you just cut a sequence of
bytes, paste it somewhere else, and the line/column indexing
automatically keeps up no matter what you do.

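To make the closure property concrete, here is one possible encoding
of the three schemas and their composition. The encoding and names
are mine, and the 8-column tab stops are hard-coded as in the
assumption above:

\c /* Effect of a byte sequence on (line, column): exactly one of the
\c  * three schemas described above.
\c  *   COL_ADD: column += a
\c  *   COL_TAB: column += a, then round up to the next column
\c  *            congruent to b mod 8
\c  *   NEWLINE: line += a, column = b
\c  */
\c typedef struct {
\c     enum { COL_ADD, COL_TAB, NEWLINE } type;
\c     long a, b;
\c } lc_effect;
\c
\c static long mod8(long x) { return (x % 8 + 8) % 8; }
\c
\c /* The effect of f's bytes followed by g's bytes. The result is
\c  * always one of the three schemas again, which is what lets us
\c  * store one of these in every tree node. */
\c lc_effect compose(lc_effect f, lc_effect g)
\c {
\c     lc_effect r;
\c     if (g.type == NEWLINE)
\c         return g;                /* g overwrites line and column */
\c     if (f.type == NEWLINE) {     /* column entering g is just f.b */
\c         r.type = NEWLINE;
\c         r.a = f.a;
\c         r.b = f.b + g.a;
\c         if (g.type == COL_TAB)   /* round the constant column now */
\c             r.b += mod8(g.b - r.b);
\c         return r;
\c     }
\c     if (f.type == COL_ADD) {     /* fold f's shift into g's */
\c         r = g;
\c         r.a += f.a;
\c         return r;
\c     }
\c     /* f is COL_TAB: its output column is always congruent to f.b
\c      * mod 8, so g's rounding, if any, adds a fixed extra amount e. */
\c     r.type = COL_TAB;
\c     if (g.type == COL_ADD) {
\c         r.a = f.a + g.a;
\c         r.b = mod8(f.b + g.a);
\c     } else {                     /* g is also COL_TAB */
\c         long e = mod8(g.b - f.b - g.a);
\c         r.a = f.a + g.a + e;
\c         r.b = g.b;
\c     }
\c     return r;
\c }
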
The only snag is that if you did this, you would probably no longer
be able to do the trick with placeholder blocks and lazy file
loading; a text editor tends to need to know in advance where all
the newlines are in its buffer, so there would probably be no
alternative to physically loading the file. But in that, at least,
this data structure is no worse than any other.

\H{undo} Supporting undo

An undo function in an editor \e{conceptually} stores a sequence of
previous buffer states, and allows you to return to one of them when
you need to.

Usually, this is not actually implemented by storing copies of the
entire buffer, since that would be ludicrously wasteful of space!
Instead, a journal of changes is kept which allows previous buffer
states to be \e{reconstructed} by reversing the precise changes
made.

One could do that using this data structure, if one wanted to.
However, there's another intriguing option. Since cloning an
arbitrarily large tree is a cheap operation, you could implement
undo by \e{actually} storing a sequence of clones of previous buffer
states! The cost of this would be nothing like as bad as it would
na\u00EF{i}vely appear.

It might still not be ideal, though. Every time you clone a tree and
the two clones diverge, several nodes must be copied, and if each
node contains several blocks of literal data then the cost of
maintaining too many buffer clones might still become prohibitive.
But it's an interesting possibility regardless.

\C{summary} Summary

I've presented a design for a data structure which implements
practically every operation required for a hex editor in O(log N)
time, apart from one or two which genuinely \e{need} to be O(N).

The structure is:

\b A B-tree, each of whose elements is either a small array of
literal data bytes, or a placeholder block denoting a section of the
unmodified input file.

\b Each B-tree node is annotated with the total byte count of all
the elements in or below that node, allowing a log-time search to
pinpoint any numeric byte position.

\b Those counts provide the only necessary means of navigating the
tree, so there is no need for a sorting criterion.

\b Split and join algorithms make it possible to link and unlink
large chunks into and out of the middle of a buffer in one
operation.

\b Reference counts implementing copy-on-write allow cloning of
chunks in constant time.

As a result:

\b Inserting or deleting bytes in the file is a log-time operation.

\b Finding a particular byte position is a log-time operation.

\b Cut and paste is always log-time, no matter how large or complex
the chunk of data being moved around.

\b Memory usage grows proportionally to the \e{changes} made to the
file, not the overall file size. (However, memory usage is also
\e{bounded} by a value proportional to the file size, even if you
keep editing and re-editing for ever.)

Searching must still be linear (there's no alternative to actually
reading the data if you need to know anything about its contents),
and saving the modified output file is linear (because you genuinely
do have to write out that much data), but \e{everything} else can be
done in log time.

I've also sketched a means of converting this into a data structure
for an ordinary text editor, and suggested interesting implications
in the area of undo operations.

\C{ref} References

Donald Knuth's \q{The Art of Computer Programming}
(\W{http://en.wikipedia.org/w/wiki.phtml?title=Special:Booksources&isbn=0201485419}{Addison-Wesley,
ISBN 0201485419}) presents at least some of the same ideas as this
article. Counted and unsorted trees are mentioned in volume 3;
splitting and joining are also described (although Knuth does them
on AVL trees, which are significantly more fiddly to split than
B-trees; you have to cut the tree into lots of little pieces, and
then put them all back together by using the join algorithm
repeatedly).

\q{Tweak}, a hex editor implementing this data structure, can be
downloaded at
\W{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}\cw{http://www.chiark.greenend.org.uk/~sgtatham/tweak/}.

\versionid $Id$