Sunday, December 03, 2006

Epilog to Active Comments

Jumping back for a moment, I'd like to finish, for the time being, the story on Active Comments. I had made great progress in raising all Active Comments to first-class comments, in that they could nest one within another, arbitrarily deep, just as you would want from an XML document with no schema to govern the order of entries or to say who can be a parent of whom. The following is a somewhat modified version of the syntax I used to test the final draft of nested Active Comments.

(define-syntax @
(lambda (stx)
(syntax-case stx (author attribution
                  comment created
                  description function
                  goal code header
                  modified name note
                  param predecessor
                  return section stx
                  TODO url key0)
  ((_ key0)
  ((_ key0 text ...(@ k a ...))
   #'(@ key0 text ...)(@ k a ...)))
  ((_ key0 text ...)
   #'(text ...))
  ((_ key)
  ((_ key ())
  ((_ key ((a b) ...))
   #'(key '(a ...) '(b ...)))
  ((_ key ((a b) ...) c ...)
   #'(begin (key '(a ...) '(b ...))
            (@ key0 c ...)))
  ((_ key (@ k a ...) ...)
   #'(begin (key)
            (@ k a ...) ...))
  ((_ key text ...(@ k a ...))
   #'(begin (key k )
            (@ key0 text ...)(@ k a ...)))
  ((_ key text ...) #'(key '(text ...)))
  ((_ a ...)
   (raise-syntax-error #f
    "Invalid Entity for Active Comment"
    stx #'(list 'a ...)))))

And when I say test, I mean test. I wrote a program to generate 1000 Active Comment phrases, each containing a random number of Active Comments from one (1) to fifteen (15), and each of those comments randomly had or did not have text and attributes. Even the number of attributes was randomly varied, and gensym was used to randomize the ids, value strings and text within the Active Comments. Not since my days as a graduate student at UCSC, when I also worked at Borland in their Database Engine QC department, had I written such an exhaustive randomized test. Each clause of the syntax, when activated, bumped a counter in a vector of rules, which was then dumped at the end of the test to generate a histogram of rule usage over the entire test set to ensure proper coverage. In short, I had done my homework on this one, as I wanted to get it right. And indeed it ate up the entire test set without a single syntax error.
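To give a feel for the kind of generator described above, here is a minimal sketch in present-day Racket. It is not the original test program: the names `keys`, `random-comment`, `random-phrase`, `rule-counts` and `bump!` are hypothetical, the limits of three attributes and four text items are arbitrary, and nesting is left out to keep it short.

```scheme
#lang racket/base

;; A few of the keys from the literals list of the syntax above.
(define keys '(author comment goal note param return))

(define (random-element lst)
  (list-ref lst (random (length lst))))

;; A list of 0..max-n items, each produced by calling `make`.
(define (random-list max-n make)
  (for/list ([i (in-range (random (add1 max-n)))])
    (make)))

;; One comment of the shape (@ key ((id value) ...) text ...), where the
;; attribute list and the text are each randomly present or absent.
(define (random-comment)
  (define attrs
    (random-list 3 (lambda ()
                     (list (gensym 'id) (symbol->string (gensym 'val))))))
  (define text (random-list 4 (lambda () (gensym 'txt))))
  `(@ ,(random-element keys)
      ,@(if (null? attrs) '() (list attrs))
      ,@text))

;; One phrase holds between 1 and 15 comments.
(define (random-phrase)
  (for/list ([i (in-range (add1 (random 15)))])
    (random-comment)))

;; Coverage bookkeeping: one counter per clause of the syntax,
;; to be dumped as a histogram once the test run is over.
(define rule-counts (make-vector 10 0))
(define (bump! rule)
  (vector-set! rule-counts rule (add1 (vector-ref rule-counts rule))))
```

Each generated phrase would then be handed to the expander (for example via `eval` in a namespace where `@` is defined), with every clause calling `bump!` on its own index as it fires.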

Unfortunately, from the very start I had made a fatal mistake. You can see it even in the stripped-down version (minus all the testing apparatus) of the syntax I show above. The flaw is that, although elegant in its recursive design, the recursion by its nature introduces expressions within the transformer environment. In most situations, for most applications of syntactic forms, this is not an issue, but for Active Comments it's death. From the beginning, as I mentioned in the previous post on this topic, a design imperative was that Active Comments would not break definition context, and that is exactly what an expression does. So, it took a little while to get over all that lost time, but then again, I thought, at least I've built an excellent parser for nested Active Comments, even if they can't actually be active.

The solution, I'm guessing, is to use this syntax within comments, and then, to regain the purpose for their existence, I'll need to build a DrScheme tool that can, in a generalized way, parse comments on command against any syntax a user might prefer. In other words, my solution would be to have two languages within a single source file: one that governs the Scheme, which is already provided, and another to rein in the structure of comments, if a user were so inclined, to make exporting the document for use as an API source file a pain-free experience. One could dream up all manner of pre- and post-processing programs to do the same thing, but they would be doing it long after the syntax errors in the comments had been made, and would supply no added benefit over simply exporting the source document to any XML editor of your choice and cleaning the entire document all in one tiresome session. That's why I'm committed to the tool solution.

One of the niceties of DrScheme is that it allows you to enter boxes of non-textual information within your source program. They can be as elaborate as a slide show or as simple as a comment box. However, as has been recently discussed with some frequency on the PLT discussion list, it is difficult to parse these files, since the format is not well documented and until recently no one had pushed for methods to parse them. As you can imagine, I have been following the discussions with a great deal of interest, and have even found rudimentary ways to parse .scm files that contain non-textual information for their text data, which is all that I need access to. Based on these routines I've been successful at parsing the text out of all sorts of .scm files from within DrScheme, where my tool would be based. Flatt has even indicated that the command-line-based mzscheme, the underlying Scheme upon which DrScheme is built, may soon have facilities for parsing text data from arbitrary .scm files.

All of this is good news when stacked up against my blunder with nested Active Comments. So, there still may be a day when I stop commenting my source files in raw XML within block comments, and start using Active Comment syntax within the comments of my source files. If and when that day comes, I'll at least have the syntax to parse it waiting in the wings.



Jens Axel Søgaard said...

Hi Kyle,

You wrote:

"Comments would not break definition context, and that is
exactly what an expression does."

Have you considered syntax-local-context? It allows a macro transformer to figure out whether the macro application currently being expanded is in an expression context or not.

Kyle Smith said...

Hi Jens,

I had not looked at syntax-local-context, but I'm not certain how it would solve my underlying problem. Even if I knew I was in definition context (which would be the most likely scenario) I don't know how to construct a non-recursive macro for parsing the possibly nested comment.

Now if you take nesting off the table, as I've done with the original Active Comments, then syntax-local-context could be very useful in deciding whether to generate a (begin (define (lambda () '(...)))) form or a (let () '(...)) form. At some point along my design/implement pathway I made a commitment that Active Comments would give me just as much expressive power as raw XML, and that meant that they needed to be able to nest one within another, so I set aside a perfectly functional alpha version of Active Comments and decided they had to do more to be worth the switch.
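For the non-nested case, the decision could be sketched roughly as follows. This is written against present-day Racket rather than the PLT Scheme of this post, and `note` is a hypothetical stand-in, not the actual Active Comments macro. The one thing it relies on is that `syntax-local-context` returns the symbol `'expression` in an expression position and something else (`'module`, `'top-level`, or a list for an internal-definition context) elsewhere.

```scheme
#lang racket/base
(require (for-syntax racket/base))

;; `note` is a hypothetical, non-nesting comment form: (note name text ...)
(define-syntax (note stx)
  (syntax-case stx ()
    [(_ name text ...)
     (if (eq? (syntax-local-context) 'expression)
         ;; Expression position: an expression is fine here.
         #'(let () '(text ...))
         ;; Module, top-level or internal-definition position:
         ;; expand to a definition so that no expression is introduced.
         #'(define (name) '(text ...)))]))

;; Module level: defines a procedure named author.
(note author Kyle Smith)

(define (f)
  ;; Internal-definition context: defines goal, then more definitions follow.
  (note goal parse comments)
  (define x 1)
  (list (goal) x))

;; Right-hand side of a define is an expression position.
(define inline (note ignored just text))
```

Here `(author)` yields the quoted text `(Kyle Smith)`, `(f)` yields `((parse comments) 1)`, and `inline` is bound to `(just text)`. None of this addresses nesting, which is the real sticking point.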

I've looked into SchemeDoc, which is a noteworthy body of work. However, for my taste it seems to lock the content and the format together too tightly. At the end of the day I would like a well-formed XML document that can be used for an API today, perhaps as part of a data search engine tomorrow, and who knows what the day after that.

I'm probably in the minority on this issue, but I actually think that documentation is part of the sport of programming. I've written more than one source file that was designed to answer a specific technical question and ended up being a self-running tutorial on the subject for future reference.

And then there's the historical value of documentation. When I first started to get acquainted with Scheme, after devouring Dybvig's TSPL, and because I was at the time very interested in XSLT 2.0 and XQuery 1.0, I set out as my first project to build a native XML database written entirely in Scheme, with both a SQL and an XQuery front-end. I haven't finished the project, but in my documentation, one of the very first goals I set for the project was to get a standardized method of documentation in place as soon as possible. As it turned out, the project and Scheme took me for a ride, and by the time I got around to numbering my goals with ids so they could be cross-referenced, I had already settled into using raw XML for documentation. And then, as I was numbering the goals, I ran across what must be goal id 1 or 2 about documentation, and I decided to break from the database project to really do something novel. By this time I was already aware of Sedna, whose design goals are nearly identical to those I had set out for my database, except that they do use some C/C++ code in their work.

I appreciate the heads up on syntax-local-context; I'm sure I'll have occasion to put it to good use. Now if I only knew how to syntactically check a recursive Scheme form like nested Active Comments without using recursion, I'd be set. Actually, there are parts of the syntax I presented that are essentially recursive but don't use an expression to make it happen. That only works, though, when you know the next bit of input will match on its own.

Which brings me to another point about pattern matching within the context of syntax-case. It would be very useful to have something like Perl's .*? operator. So we might have e1 e2 ...?, so that the greedy ... operator would be told to make the least greedy match possible that still lets the overall pattern succeed. This comes up time and time again when matching expressions involving free text, as with Active Comments.

Thanks for the comment,