@@@ wip
[mLib] / struct / hash.3
1 .\" -*-nroff-*-
2 .de VS
3 .sp 1
4 .RS
5 .nf
6 .ft B
7 ..
8 .de VE
9 .ft R
10 .fi
11 .RE
12 .sp 1
13 ..
14 .de hP
15 .IP
16 .ft B
17 \h'-\w'\\$1\ 'u'\\$1\ \c
18 .ft P
19 ..
20 .ie t .ds o \(bu
21 .el .ds o o
22 .TH hash 3 "2 August 1999" "Straylight/Edgeware" "mLib utilities library"
23 .SH "NAME"
24 hash \- low-level hashtable implementation
25 .\" @hash_create
26 .\" @hash_destroy
27 .\" @hash_bin
28 .\" @hash_extend
29 .\" @hash_remove
30 .\" @hash_mkiter
31 .\" @hash_next
32 .\"
33 .\" @HASH_BIN
34 .\" @HASH_MKITER
35 .\" @HASH_NEXT
36 .\"
37 .SH "SYNOPSIS"
38 .nf
39 .B "#include <mLib/hash.h>"
40
41 .B "typedef struct {"
42 .B "\h'4n'uint32 mask;"
43 .B "\h'4n'hash_base **v;"
44 .B "\h'4n'arena *a;"
45 .B "} hash_table;"
46
47 .B "typedef struct {"
48 .B "\h'4n'hash_base *next;"
49 .B "\h'4n'uint32 hash;"
50 .B "} hash_base;"
51
52 .B "typedef struct { ...\& } hash_iter;"
53
54 .BI "void hash_create(hash_table *" t ", size_t " n );
55 .BI "void hash_destroy(hash_table *" t );
56 .BI "hash_base **hash_bin(hash_table *" t ", uint32 " hash );
57 .BI "int hash_extend(hash_table *" t );
58 .BI "void hash_remove(hash_table *" t ", hash_base *" b );
59 .BI "void hash_mkiter(hash_iter *" i ", hash_table *" t );
60 .BI "hash_base *hash_next(hash_iter *" i );
61
62 .BI "hash_base **HASH_BIN(hash_table *" t ", uint32 " hash );
63 .BI "void HASH_MKITER(hash_iter *" i ", hash_table *" t );
64 .BI "void HASH_NEXT(hash_iter *" i ", " b );
65 .fi
66 .SH "OVERVIEW"
67 The
68 .B hash
69 functions provide the basis for an extensible hashtable implementation.
70 The implementation is not complete. Many decisions have been left to
71 the user, including:
72 .hP \*o
73 how keys should be represented, hashed and compared;
74 .hP \*o
75 how objects contained within the table should be allocated; and
76 .hP \*o
77 when the hashtable should be extended.
78 .PP
79 A complete hashtable implementation will need to take the above
80 decisions. If you just want a prepackaged solution, see
81 .BR sym (3)
82 which provides one.
83 .SH "IMPLEMENTATION DETAILS"
84 Each item in the hashtable is assigned a 32-bit integer
85 .IR hash :
86 a number computed somehow from the item's data such that two items which
87 are considered equal will yield the same hash. Ideally, different items
88 will yield different hashes. It is important for this implementation
89 that all bits of the hash are similarly good.
90 .PP
91 In order to look up an item, the high bits of the hash are masked off
92 and the remainder used as an index into a vector of
93 .IR "bin lists" .
94 Each bin contains a list of items with identical low-order bits of their
95 hashes.
96 .PP
97 A table expansion involves doubling the size of the bin vector. Each
98 bin list is then split into two, items being placed into a new bin
99 depending on the setting of the next lowest hash bit. By expanding the
100 hashtable as needed, lookups remain constant-time.
101 .PP
102 A low-level hashtable is represented by a
103 .B hash_table
104 structure. It contains two members:
105 .TP
106 .B "uint32 mask"
107 The current bitmask to be applied to hashes. This is one less than the
108 current number of bins in the hashtable, and is applied to hash values
109 in order to decide which bin an item should be in.
110 .TP
111 .B "hash_base **v"
112 The bin vector. It is an array of pointers to hashtable items.
113 .PP
114 A hashtable item consists of a
115 .B hash_base
116 structure. A full hashtable implementation will need to extend this
117 structure by adding keying information and other data; the
118 .B hash_base
119 only contains the bare minimum of information needed to maintain the
120 hashtable at a low level. It contains the following members:
121 .TP
122 .B "hash_base *next"
123 Pointer to the next item in the bin list. The final item has a null
124 .B next
125 pointer. The entry in the bin vector is null if the bin list is empty.
126 It is up to the high-level implementation to insert items into the list.
127 .TP
128 .B "uint32 hash"
129 The hash for this item. This must be the full 32-bit hash for the
130 current item. It is used during hashtable expansion to determine which
131 bin an item should be moved to.
132 .SH "FUNCTIONALITY PROVIDED"
133 This section describes the functions and macros provided for building
134 hashtables. Code examples are given throughout. They assume the
135 following definitions:
136 .VS
137 /* --- A table of items --- */
138
139 typedef struct item_table {
140 hash_table t;
141 size_t load;
142 };
143
144 /* --- An item --- */
145
146 typedef struct item {
147 hash_base b;
148 const char *k;
149 /* ... */
150 } item;
151 .VE
152 The implementation presented here is simple but relatively bad. The
153 source file
154 .B sym.c
155 presents a more realistic example, but is rather more complex.
156 .SS "Initialization and finalization"
157 An empty hashtable is initialized by calling
158 .B hash_create
159 with the address of a
160 .B hash_table
161 structure to be filled in and the initial number of hash bins to create.
162 .PP
163 For example, an item table might be initialized like this:
164 .VS
165 void item_createtab(item_table *t)
166 {
167 hash_create(&t->t, ITEM_INITSZ);
168 t->load = ITEM_INITLOAD;
169 }
170 .VE
171 A hashtable can be destroyed by calling
172 .B hash_destroy
173 with the address of the
174 .B hash_table
175 structure. This does not attempt to deallocate the individual items;
176 that must be done beforehand.
177 .PP
178 The usual way to deallocate the individual hashtable items is using the
179 iteration constructs described below.
180 .VS
181 void item_destroytab(item_table *t)
182 {
183 hash_iter i;
184 hash_base *b;
185 for (hash_mkiter(&i, &t->t); (b = hash_next(&i)) != 0; ) {
186 item *ii = (item *)b;
187 free(ii->k);
188 /* ... */
189 DESTROY(ii);
190 }
191 hash_destroy(&t->t);
192 }
193 .VE
194 .sp -1
195 .SS "Searching, adding and removing"
196 Items must be searched for and added by hand.
197 .PP
198 The macro
199 .B HASH_BIN
200 returns the address of the bin list haed for a particular hashtable and
201 hash value. The function
202 .B hash_bin
203 works the same way and provides the same result, but since the macro is
204 very simple its use is preferred. However, it will evaluate its
205 arguments multiple times.
206 .PP
207 Once the bin list has been found, it's fairly easy to search for an
208 exact match. A simple search might look something like this:
209 .VS
210 item *lookup(item_table *t, const char *k)
211 {
212 uint32 h = hash(k); /* Hash @k@ somehow */
213 hash_base **bin = HASH_BIN(&t->t, h);
214 hash_base *b;
215 for (b = *bin; b; b = b->next) {
216 item *i = (item *)b;
217 if (h == i->b.hash && strcmp(k, i->k) == 0)
218 return (i);
219 }
220 return (0);
221 }
222 .VE
223 Insertion is also relatively trivial given the bin list head. Insertion
224 may make the hashtable too large, so it might be necessary to extend
225 it. Extension is performed by
226 .BR hash_extend ,
227 which is passed only the address of the hashtable. It returns nonzero
228 if extension was successful.
229 .VS
230 item *add(item_table *t, const char *k, /* ... */)
231 {
232 item *i;
233 uint32 h;
234 hash_base **bin;
235
236 /* --- See if the item is already there --- */
237
238 if ((i = = lookup(t, k)) != 0)
239 return (i);
240
241 /* --- Make a new hashtable item --- */
242
243 i = CREATE(item);
244 i->k = xstrdup(k);
245 /* ... */
246
247 /* --- Link it into the bin list --- */
248
249 h = i->b.hash = hash(k);
250 bin = HASH_BIN(&t->t, h);
251 i->b.next = *bin;
252 *bin = &i->b.next;
253
254 /* --- Maybe extend the hashtable --- */
255
256 if (t->load)
257 t->load--;
258 else if (hash_extend(&t->t))
259 t->load = recalc_load(t);
260
261 /* --- Done --- */
262
263 return (i);
264 }
265 .VE
266 The
267 .B sym
268 implementation is rather more sophisticated in its approach but the idea
269 is the same. In particular,
270 .B sym
271 provides a single interface for lookup and insertion which prevents the
272 unnecessary rehashing performed by the above code.
273 .PP
274 Removal of items is more straightforward. The function
275 .B hash_remove
276 will unlink a given item from its bin list, after which point it is safe
277 to remove.
278 .SS "Iteration"
279 Iteration allows code to be performed on all the items in a hashtable.
280 This is done using an
281 .I iterator
282 object, represented by a
283 .B hash_iter
284 structure. An iterator is initialized by calling
285 .BR hash_mkiter .
286 Each call to
287 .B hash_next
288 yields a different item from the hashtable until there are none left, a
289 condition signified by a null return value.
290 .PP
291 The macros
292 .B HASH_MKITER
293 and
294 .B HASH_NEXT
295 do the same jobs as the above functions. However,
296 .B HASH_NEXT
297 has a slightly bizarre argument passing convention: its second argument
298 is an
299 .I lvalue
300 which is updated to contain the address of the next item.
301 .PP
302 The finalization code above contained an example of iteration.
303 .SH "SEE ALSO"
304 .BR sym (3),
305 .BR mLib (3).
306 .SH "AUTHOR"
307 Mark Wooding, <mdw@distorted.org.uk>