SExpr: Indentation

An Eccentric Anomaly: Ed Davies's Blog

My little Lisp-like SExpr language has significant indentation, that is indentation is used to delimit syntactic constructs (lists in this case). A bit about why and how.

In general I'm quite in favour of the use of significant indentation in programming languages. The primary reason is that indentation is almost always used in practice when writing code and is a very powerful cue used by programmers when reading the code. If the compiler ignores indentation then there's a great risk of it and humans putting a different interpretation on some code which is not likely to result in good things happening.

For example, indentation could have been a contributing factor to Apple's goto Fail; SSL bug though, of course, it would have taken more than just indentation checking to pick up the problem.

It's also a matter of applying the DRY principle. If you've indented your code to match the program structure then it seems silly to then add extra syntax ('{', '}', etc) to re-express the same thing. (Still, I can't bring myself to leave out semicolons in JavaScript for some reason.)

Lisp-like languages seem like a good target for significant indentation. Firstly, the very regular and simple syntax lends itself to this modification in a simple way; you don't have to consider lots of different syntactic constructs as there are really only lists to worry about.

Also, of course, significant indentation can get rid of one of the minor gripes some people have with Lisp: it's often said the name stands for “Lots of Irritating Superfluous Parenthesis”.

Spaces and Tabs

The biggest concern with significant indentation is the possibility of muddles arising from different expansion of tabs. For example, it's common to set tab stops at 4 column intervals in editors yet Python (and quite a lot of older Unix code) expands tabs to multiples of 8 columns. Unless you have your editor set to clutter the screen up with little blibs of some sort for tab characters the effects of mixing them can be hard to spot.

Python has various command line settings to try to detect problems but you'd have to remember to set them. If you did, though, then they'd be a bit draconian. The only real way out is to be completely consistent and either always or never use tabs. Personally, I have my editor set to always expand tabs to spaces. Still, that's not ideal if I cut and paste a bit of code which contains tabs or whatever.

Therefore, I thought this would be a good opportunity to try out a little idea I'd been mulling for a while:

To put it another way: it's an error for adjacent lines to have one with more tabs than the other and the other to have more spaces than the first.

The idea is simply that it must be completely unambiguous whether one line is more indented than the other irrespective of the number of columns tabs expand to.

It's quite OK to mix code in the same file which uses spaces and code which uses tabs so long as the cut-over from one to the other is somewhere with no indentation. Most importantly, if the rules are broken it's picked up immediately by the parser rather than causing surprising effects further down the line.

For example, using “-->|” to represent tab expansions and “␣” spaces:

-->|␣␣if (> 5 s)
-->|-->|print s

results in an “TypeError: incomparable line indentations” exception just in case somebody actually has their tabs set to every two columns and doesn't actually want the print as the condition of the if.

Lines and Lists

At lower syntactic levels, parenthesis work in the way that they normally do in Lisp: lists started with '(' can extend over multiple lines up to the matching ')'. However, where such a list spans multiple lines:

Note that it's the indentation of the first line (i.e., of the first token on the first line) rather than the position of the open parenthesis that sets the left limit on the indentation of the continuation lines.

These rules aren't really essential (continuation lines could be allowed arbitrary indentation) but they do allow errors to be picked up quickly and remove possible confusion about which indent level applies to continued lines. It's allowed (though unusual) to have multiple lists causing a chain of continuations:

    a b (c d
        e f) g h (i j
        k l) m n

results in a single list containing two sublists: (a b (c d e f) g h (i j k l) m n).

Dotted pairs at the end of lists works in the same way as in Lisp except that U+00B7 · MIDDLE DOT is used to avoid any possible confusion with decimal points.

Apart from continuation lines as described above, the rules for line handling are:

Here's a fragment from the regression-test suite which illustrates these three rules:

(subcontext)
    let a 1
    assert
        fails? (cond)
        fails? (cond (= a 2))

is equivalent to:

((subcontext) (let a 1) (assert (fails? (cond)) (fails? (cond (= a 2)))))

(This is the beginning of some tests to show that various malformed cond expressions throw an exception properly. fails? returns True iff evaluation of its one and only parameter throws an exception, False otherwise. assert evaluates one or more parameters, throwing an exception if any of them return anything other than True. I'll get on to what subcontext is about in a later post.)

Here's the specification for some threaded rods through the purlins on the south side of my greenhouse design.

    fastener
        GreenhouseSouthPurlinJoin
        bcs 3
        M16
        roundWasher squareWasher purlin purlin squareWasher roundWasher
        'rod

This is equivalent to (fastener GreenhouseSouthPurlinJoin (bcs 3) M16 (roundWasher squareWasher purlin purlin squareWasher roundWasher) (quote rod)). Notice how “'rod” expands to (quote rod) in a fairly standard Lisp way then that is taken as a single expression on the line so not wrapped in a further list.

Mostly these rules seem to work well and allow common code to be expressed naturally, readably and reasonably compactly. It can get a bit vertically verbose though. For example, the syntax for definition of a function is def ( <functionName> <parameters> ) <functionBody>, such as this function, called threadedLength, which takes two parameters (iso and length) and returns the length of thread available on the bolts from a particular supplier (twice the diameter plus a fixed amount depending on the what range the bolt length falls in):

def
    threadedLength iso length
    # Length of threaded part
    +
        * 2 (iso diameter)
        cond
            (<= length 120)                         6
            (and (<= 130 length) (< length 210))    12
            (>= length 210)                         25

That's not terrible but it's a pity that using indentation to signify lists results in the top of the function definition being split over two lines (where the equivalent in Python would be def threadedLength(iso length), typically written on one line) and the + needs to sit all on its own.

For if expressions I felt it was too much to insist on:

if
    conditionExpr
    thenExpr
    elseExpr

over four lines when it's natural to put the condition on the same line as the if.

Accordingly I implemented a unary-if: if if is passed just one parameter (rather than the normal 2 or 3) then it evaluates that single parameter as the conditionExpr and returns one of two functions depending on the result. If the result is True then it returns a function which takes one or two parameters and evaluates and returns the first one. If the result is False then it returns a function which again takes one or two parameters and evaluates and returns the second if present or returns an empty list (null) if not.

Here's the Python implementation, built for comfort rather than speed:

def ifTrue(params, ctx):
    """ Unary if when the condition is True. """
    assert listLen(params) in (1, 2)
    return seval(params[0], ctx)

def ifFalse(params, ctx):
    """ Unary if when the condition is False. """
    pl = listLen(params)
    assert pl in (1, 2)
    if pl >= 2:
        return seval(params[1], ctx)
    else:
        return null

def operatorIf(params, ctx):
    pl = listLen(params) 
    assert pl >= 1
    test = ifTrue if boolOnly(seval(params[0], ctx)) else ifFalse
    if pl > 1:
        return test(tail(params), ctx)
    else:
        return test

Here's an example of use of the unary-if to see if a fastener is a threaded rod or a normal bolt:

    if (= selLenSpec (list 'rod))
        let type rod
        let type bolt

This parses as ((if (= selLenSpec (list 'rod))) (let type rod) (let type bolt)). The if expression (parenthesized in red) evaluates first to a function which is then called with the two let expressions as parameters, only one of which is evaluated.

Having the unary-if as an evaluable expression raises the theoretical (and horrid) possibility of assigning its value to a symbol:

    : let ifgh (if (in? 'greenhouse attrs))
    : let postSpacing (ifgh (+ 3600 150) 3600)

Don't worry about the leading ':' characters - they're to do with this code being in a parts-list, to be discussed later. The ifgh symbol is defined as a function which returns its first parameter if the frame we're describing is part of the greenhouse or its second parameter if not. The post spacing is then set to 3750 mm in the greenhouse frames and 3600 mm in the main house frames.

This is not quite entirely gratuitous: ifgh is used again a few lines later. Still, it's mostly there because it could be.

In principle it would be possible to have a “unary-def” as well: a function which returns a function which defines a named function with given parameters when it's passed a list of expressions to form the function body. That seems even less generally useful than unary-if, though, so I haven't bothered.

Conclusion

Mostly I'm quite pleased with the way that significant indentation has worked out in this mini-language and I'll likely explore it further in any other DSLs I play with.