Commit | Line | Data |
---|---|---|
1f7d590d MW |
1 | %%% -*-latex-*- |
2 | %%% | |
3 | %%% Description of the parsing machinery | |
4 | %%% | |
5 | %%% (c) 2015 Straylight/Edgeware | |
6 | %%% | |
7 | ||
8 | %%%----- Licensing notice --------------------------------------------------- | |
9 | %%% | |
10 | %%% This file is part of the Sensble Object Design, an object system for C. | |
11 | %%% | |
12 | %%% SOD is free software; you can redistribute it and/or modify | |
13 | %%% it under the terms of the GNU General Public License as published by | |
14 | %%% the Free Software Foundation; either version 2 of the License, or | |
15 | %%% (at your option) any later version. | |
16 | %%% | |
17 | %%% SOD is distributed in the hope that it will be useful, | |
18 | %%% but WITHOUT ANY WARRANTY; without even the implied warranty of | |
19 | %%% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
20 | %%% GNU General Public License for more details. | |
21 | %%% | |
22 | %%% You should have received a copy of the GNU General Public License | |
23 | %%% along with SOD; if not, write to the Free Software Foundation, | |
24 | %%% Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. | |
25 | ||
26 | \chapter{Parsing} \label{ch:parsing} | |
27 | ||
28 | %%%-------------------------------------------------------------------------- | |
29 | \section{The parser protocol} \label{sec:parsing.proto} | |
30 | ||
31 | For the purpose of Sod's parsing library, \emph{parsing} is the process of | |
32 | reading a sequence of input items, in order, and computing an output value. | |
33 | ||
34 | A \emph{parser} is an expression which consumes zero or more input items and | |
35 | returns three values: a \emph{result}, a \emph{success flag}, and a | |
36 | \emph{consumed flag}. The two flags are (generalized) booleans. If the | |
37 | success flag is non-nil, then the parser is said to have \emph{succeeded}, | |
38 | and the result is the parser's output. If the success flag is nil then the | |
39 | parser is said to have \emph{failed}, and the result is a list of | |
40 | \emph{indicators}. Finally, the consumed flag is non-nil if the parser | |
41 | consumed any input items. | |
42 | ||
43 | %%%-------------------------------------------------------------------------- | |
44 | \section{File locations} | |
45 | ||
46 | %%%-------------------------------------------------------------------------- | |
47 | \section{Scanners} \label{sec:parsing.scanner} | |
48 | ||
49 | A \emph{scanner} is an object which keeps track of a parser's progress as it | |
50 | works through its input. There's no common base class for scanners: a | |
51 | scanner is simply any object which implements the scanner protocol described | |
52 | here. | |
53 | ||
54 | A scanner maintains a sequence of items to read. It can step forwards | |
55 | through the items, one at a time, until it reaches the end (if, indeed, the | |
56 | sequence is finite, which it needn't be). Until that point, there is a | |
57 | current item, though there's no protocol for accessing it at this level | |
58 | because the nature of the items is left unspecified. | |
59 | ||
60 | Some scanners support an additional \emph{place-capture} protocol which | |
61 | allows rewinding the scanner to an earlier point in the input so that it can | |
62 | be scanned again. | |
63 | ||
64 | \subsection{Basic scanner protocol} \label{sec:parsing.scanner.basic} | |
65 | ||
66 | The basic protocol supports stepping the scanner forward through its input | |
67 | sequence, and detecting the end of the sequence. | |
68 | ||
69 | \begin{describe}{gf}{scanner-step @<scanner>} | |
70 | Advance the @<scanner> to the next item, which becomes current. | |
71 | ||
72 | It is an error to step the scanner if the scanner is at end-of-file. | |
73 | \end{describe} | |
74 | ||
75 | \begin{describe}{gf}{scanner-at-eof-p @<scanner> @> @<generalized-boolean>} | |
76 | Return non-nil if the scanner is at end-of-file, i.e., there are no more | |
77 | items to read. | |
78 | ||
79 | If nil is returned, there is a current item, and it is safe to step the | |
80 | scanner again; otherwise, it is an error to query the current item or to | |
81 | step the scanner. | |
82 | \end{describe} | |
83 | ||
84 | \subsection{Place-capture scanner protocol} \label{sec:parsing.scanner.place} | |
85 | ||
86 | The place-capture protocol allows rewinding to an earlier point in the | |
87 | sequence. Not all scanners support the place-capture protocol. | |
88 | ||
89 | To rewind a scanner to a particular point, that point must be \emph{captured} | |
90 | as a \emph{place} when it's current -- so you must know in advance that this | |
91 | is an interesting place that's worth capturing. The type of place returned | |
92 | depends on the type of scanner. Given a captured place, the scanner can be | |
93 | rewound to the position held in it. | |
94 | ||
95 | Depending on how the scanner works, holding onto a captured place might | |
96 | consume a lot of memory or case poor performance. For example, if the | |
97 | scanner is reading from an input stream, having a captured place means that | |
98 | data from that point on must be buffered in case the program needs to rewind | |
99 | the scanner and read that data again. Therefore it's possible to | |
100 | \emph{release} a place when it turns out not to be needed any more. | |
101 | ||
102 | \begin{describe}{gf}{scanner-capture-place @<scanner> @> @<place>} | |
103 | Capture the @<scanner>'s current position as a place, and return the place. | |
104 | \end{describe} | |
105 | ||
106 | \begin{describe}{gf}{scanner-restore-place @<scanner> @<place>} | |
107 | Rewind the @<scanner> to the state it was in when @<place> was captured. | |
108 | In particular, the item that was current when the @<place> was captured | |
109 | becomes current again. | |
110 | ||
111 | It is an error to restore a @<place> that has been released, or if the | |
112 | @<place> wasn't captured from the @<scanner>. | |
113 | \end{describe} | |
114 | ||
115 | \begin{describe}{gf}{scanner-release-place @<scanner> @<place>} | |
116 | Release the @<place>, to avoid having to maintaining the ability to restore | |
117 | it after it's not needed any more.. | |
118 | ||
119 | It is an error if the @<place> wasn't captured from the @<scanner>. | |
120 | \end{describe} | |
121 | ||
122 | \begin{describe}{mac} | |
123 | {with-scanner-place (@<place> @<scanner>) @<body-form>^* @> @<value>^*} | |
124 | Capture the @<scanner>'s current position as a place, evaluate the | |
125 | @<body-form>s as an implicit progn with the variable @<place> bound to the captured | |
126 | place. When control leaves the @<body-form>s, the place is released. The return | |
127 | values are the values of the final @<body-form>. | |
128 | \end{describe} | |
129 | ||
130 | \subsection{Scanner file-location protocol} \label{sec:parsing.scanner.floc} | |
131 | ||
132 | Some scanners participate in the file-location protocol (\xref{sec:floc}). | |
133 | They implement a method on @|file-location| which collects the necessary | |
134 | information using scanner-specific functions described here. | |
135 | ||
136 | \begin{describe}{fun}{scanner-file-location @<scanner> @> @<file-location>} | |
137 | Return a @|file-location| object describing the current position of the | |
138 | @<scanner>. | |
139 | ||
140 | This calls the @|scanner-filename|, @|scanner-line| and @|scanner-column| | |
141 | generic functions on the scanner, and uses these to fill in an appropriate | |
142 | @|file-location|. | |
143 | ||
144 | Since there are default methods on these generic functions, it is not an | |
145 | error to call @|scanner-file-location| on any kind of value, but it might | |
146 | not be very useful. This function exists to do the work of appropriately | |
147 | specialized methods on @|file-location|. | |
148 | \end{describe} | |
149 | ||
150 | \begin{describe}{gf}{scanner-filename @<scanner> @> @<string>} | |
151 | Return the name of the file the scanner is currently processing, as a | |
152 | string, or nil if the filename is not known. | |
153 | \end{describe} | |
154 | ||
155 | \begin{describe}{meth}{scanner-filename (@<scanner> t) @> @<string>} | |
156 | Returns nil. | |
157 | \end{describe} | |
158 | ||
159 | \begin{describe}{gf}{scanner-line @<scanner> @> @<integer>} | |
160 | Return the line number of the @<scanner>'s current position, as an integer, | |
161 | or nil if the line number is not known. | |
162 | \end{describe} | |
163 | ||
164 | \begin{describe}{meth}{scanner-line (@<scanner> t) @> @<integer>} | |
165 | Returns nil. | |
166 | \end{describe} | |
167 | ||
168 | \begin{describe}{gf}{scanner-column @<scanner> @> @<integer>} | |
169 | Return the column number of the @<scanner>'s current position, as an | |
170 | integer, or nil if the column number is not known. | |
171 | \end{describe} | |
172 | ||
173 | \begin{describe}{meth}{scanner-column (@<scanner> t) @> @<integer>} | |
174 | Returns nil. | |
175 | \end{describe} | |
176 | ||
177 | \subsection{Character scanners} \label{sec:parsing.scanner.char} | |
178 | ||
179 | Character scanners are scanners which read sequences of characters. | |
180 | ||
181 | \begin{describe}{cls}{character-scanner () \&key} | |
182 | Base class for character scanners. This provides some very basic | |
183 | functionality. | |
184 | ||
185 | Not all character scanners are subclasses of @|character-scanner|. | |
186 | \end{describe} | |
187 | ||
188 | \begin{describe}{gf}{scanner-current-char @<scanner> @> @<character>} | |
189 | Returns the current character. | |
190 | \end{describe} | |
191 | ||
192 | \begin{describe}{gf}{scanner-unread @<scanner> @<character>} | |
193 | Rewind the @<scanner> by one step. The @<chararacter> must be the previous | |
194 | current character, and becomes the current character again. It is an error | |
195 | if: the @<scanner> has reached end-of-file; the @<scanner> is never been | |
196 | stepped; or @<character> was not the previous current character. | |
197 | \end{describe} | |
198 | ||
199 | \begin{describe}{gf} | |
200 | {scanner-interval @<scanner> @<place-a> \&optional @<place-b> | |
201 | @> @<string>} | |
202 | Return the characters in the @<scanner>'s input from @<place-a> up to (but | |
203 | not including) @<place-b>. | |
204 | ||
205 | The characters are returned as a string. If @<place-b> is omitted, return | |
206 | the characters up to (but not including) the current position. It is an | |
207 | error if @<place-b> precedes @<place-a> or they are from different | |
208 | scanners. | |
209 | ||
210 | This function is a character-scanner-specific extension to the | |
211 | place-capture protocol; not all character scanners implement the | |
212 | place-capture protocol, and some that do may not implement this function. | |
213 | \end{describe} | |
214 | ||
215 | \subsubsection{Stream access to character scanners} | |
216 | Sometimes it can be useful to apply the standard Lisp character input | |
217 | operations to the sequence of characters held by a character scanner. | |
218 | ||
219 | \begin{describe}{gf}{make-scanner-stream @<scanner> @> @<stream>} | |
220 | Returns a fresh input @|stream| object which fetches input characters from | |
221 | the character scanner object @<scanner>. Reading characters from the | |
222 | stream steps the scanner. The stream will reach end-of-file when the | |
223 | scanner reports end-of-file. If the scanner implements the file-location | |
224 | protocol then reading from the stream will change the file location in an | |
225 | appropriate manner. | |
226 | ||
227 | This is mostly useful for applying standard Lisp stream functions, most | |
228 | particularly the @|read| function, in the middle of a parsing operation. | |
229 | \end{describe} | |
230 | ||
231 | \begin{describe}{cls}{character-scanner-stream (stream) \&key :scanner} | |
232 | A Common Lisp input @|stream| object which works using the character | |
233 | scanner protocol. Any @<scanner> which implements the base scanner and | |
234 | character scanner protocols is suitable. See @|make-scanner-stream|. | |
235 | \end{describe} | |
236 | ||
237 | \subsection{String scanners} \label{sec:parsing.scanner.string} | |
238 | ||
239 | A \emph{string scanner} is a simple kind of character scanner which reads | |
240 | input from a string object. String scanners implement the character scanner | |
241 | and place-capture protocols. | |
242 | ||
243 | \begin{describe}{cls}{string-scanner} | |
244 | The class of string scanners. The @|string-scanner| class is not a | |
245 | subclass of @|character-scanner|. | |
246 | \end{describe} | |
247 | ||
248 | \begin{describe}{fun}{string-scanner-p @<value> @> @<generalized-boolean>} | |
249 | Return non-nil if @<value> is a @|string-scanner| object; otherwise return | |
250 | nil. | |
251 | \end{describe} | |
252 | ||
253 | \begin{describe}{fun} | |
254 | {make-string-scanner @<string> \&key :start :end @> @<string-scanner>} | |
255 | Construct and return a fresh @|string-scanner| object. The new scanner | |
256 | will read characters from @<string>, starting at index @<start> (which | |
257 | defaults to zero), and continuing until it reaches index @<end> (defaults | |
258 | to the end of the @<string>). | |
259 | \end{describe} | |
260 | ||
261 | \subsection{Character buffer scanners} \label{sec:parsing.scanner.charbuf} | |
262 | ||
263 | A \emph{character buffer scanner}, or \emph{charbuf scanner} for short, is an | |
264 | efficient scanner for reading characters from an input stream. Charbuf | |
265 | scanners implements the basic scanner, character buffer, place-capture, and | |
266 | file-location protocols. | |
267 | ||
268 | \begin{describe}{cls} | |
269 | {charbuf-scanner (character-scanner) | |
270 | \&key :stream :filename :line :column} | |
271 | The class of charbuf scanners. The scanner will read characters from | |
272 | @<stream>. Charbuf scanners implement the file-location protocol: the | |
273 | initial location is set from the given @<filename>, @<line> and @<column>; | |
274 | the scanner will update the location as it reads its input. | |
275 | \end{describe} | |
276 | ||
277 | \begin{describe}{cls}{charbuf-scanner-place} | |
278 | The class of place objects captured by a charbuf scanner. | |
279 | \end{describe} | |
280 | ||
281 | \begin{describe}{fun} | |
282 | {charbuf-scanner-place-p @<value> @> @<generalized-boolean>} | |
283 | Type predicate for charbuf scanner places: returns non-nil if @<value> is a | |
284 | place captured by a charbuf scanner, and nil otherwise. | |
285 | \end{describe} | |
286 | ||
287 | \begin{describe}{gf} | |
288 | {charbuf-scanner-map @<scanner> @<func> \&optional @<fail> | |
289 | \nlret @<result> @<successp> @<consumedp>} | |
290 | Read characters from the @<scanner>'s buffers. | |
291 | ||
292 | This is intended to be an efficient and versatile interface for reading | |
293 | characters from a scanner in bulk. The function @<func> is invoked | |
294 | repeatedly, as if by | |
295 | \begin{prog} | |
296 | (multiple-value-bind (@<donep> @<used>) \\ \ind\ind | |
297 | (funcall @<func> @<buf> @<start> @<end>) \- \\ | |
298 | \textrm\ldots) | |
299 | \end{prog} | |
300 | The argument @<buf> is a simple string; @<start> and @<end> are two | |
301 | nonnegative fixnums, indicating that the subsequence of @<buf> between | |
302 | @<start> (inclusive) and @<end> (exclusive) should be processed. If | |
303 | @<func>'s return value @<donep> is nil then @<used> is ignored: the | |
304 | function has consumed the entire buffer and wishes to read more. If | |
305 | @<donep> is non-nil, then it must be a fixnum such that $@<start> \le | |
306 | @<used> \le @<end>$: the function has consumed the buffer as far as @<used> | |
307 | (exclusive) and has completed successfully. | |
308 | ||
309 | If end-of-file is encountered before @<func> completes successfully then it | |
310 | fails: the @<fail> function is called with no arguments, and is expected to | |
311 | return two values. If omitted, @<fail> defaults to | |
312 | \begin{prog} | |
313 | (lambda () \\ \ind | |
314 | (values nil nil))% | |
315 | \end{prog} | |
316 | ||
317 | The @|charbuf-scanner-map| function returns three values. The first value | |
318 | is the non-nil @<donep> value returned by @<func> if @|charbuf-scanner-map| | |
319 | succeeded, or the first value returned by @<fail>; the second value is @|t| | |
320 | on success, or the second value returned by @<fail>; the third value is | |
321 | non-nil if @<func> consumed any input, i.e., it returned with @<donep> nil | |
322 | at least once, or with $@<used> > @<start>$. | |
323 | \end{describe} | |
324 | ||
325 | \subsection{Token scanners} \label{sec:parsing.scanner.token} | |
326 | ||
327 | \begin{describe}{cls} | |
328 | {token-scanner () \&key :filename (:line 1) (:column 0)} | |
329 | \end{describe} | |
330 | ||
331 | \begin{describe}{gf}{token-type @<scanner> @> @<type>} | |
332 | \end{describe} | |
333 | ||
334 | \begin{describe}{gf}{token-value @<scanner> @> @<value>} | |
335 | \end{describe} | |
336 | ||
337 | \begin{describe}{gf}{scanner-token @<scanner> @> @<type> @<value>} | |
338 | \end{describe} | |
339 | ||
340 | \begin{describe}{ty}{token-scanner-place} | |
341 | \end{describe} | |
342 | ||
343 | \begin{describe}{fun} | |
344 | {token-scanner-place-p @<value> @> @<generalized-boolean>} | |
345 | \end{describe} | |
346 | ||
347 | \subsection{List scanners} | |
348 | ||
349 | \begin{describe}{ty}{list-scanner} | |
350 | \end{describe} | |
351 | ||
352 | \begin{describe}{fun}{list-scanner-p @<value> @> @<generalized-boolean>} | |
353 | \end{describe} | |
354 | ||
355 | \begin{describe}{fun}{make-list-scanner @<list> @> @<list-scanner>} | |
356 | \end{describe} | |
357 | ||
358 | %%%-------------------------------------------------------------------------- | |
359 | \section{Parsing macros} | |
360 | ||
361 | %%%-------------------------------------------------------------------------- | |
362 | \section{Lexical analyser} | |
363 | ||
364 | %%%----- That's all, folks -------------------------------------------------- | |
365 | ||
366 | %%% Local variables: | |
367 | %%% mode: LaTeX | |
368 | %%% TeX-master: "sod.tex" | |
369 | %%% TeX-PDF-mode: t | |
370 | %%% End: |