
Takmela: an algorithm for parsing and Datalog query execution

by Mohamed Samy (samy2004@gmail.com)

A lot of modern (and old) parsing algorithms utilize a data structure called the Graph-structured Stack (GSS). You can find it in GLL, implicitly in Earley’s parser, and in many more. I believe this data structure is a very interesting piece of math/CS, and that it might have a larger role to play. Today I’ll show you two algorithms: the first is for general parsing (that is, parsing any CFG) and the second is for executing Datalog queries. We’ll realize that the second algorithm is exactly the same as the first, parameterized by a different set of data types and actions that apply to the GSS. If those two algorithms work as advertised, then we can probably generalize this same algorithm to many other operations, all expressed as the same basic routine.

This article is long and requires some knowledge of parsing (not too advanced, I promise), but if you like computer science I feel you would really, really enjoy reading it; so let’s go!

The Takmela parser

The parser version of Takmela depends on combining two CS concepts: the graph-structured stack and the continuation.

We can begin to understand how Takmela works by looking at a common problem in parsing context-free grammars, namely the left-recursion problem. Consider the classic arithmetic expression example:

expr → expr ‘+’ term
     | term

term → NUM

Let’s try to parse it with the input 1+2+3

A naive top-down parser will try to ‘invoke’ expr, attempt to parse the first alternative, and so immediately try to recursively invoke expr again without consuming any input. This leads to an infinite loop.
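To see the loop concretely, here is a hypothetical naive recursive-descent parser for this grammar, sketched in Java (the function names and failure convention are illustrative, not taken from Takmela’s code):

// Hypothetical naive recursive-descent sketch, NOT Takmela.
// Returns the input position after a successful parse, or -1 on failure.
class NaiveParser {
    // expr → expr '+' term | term
    static int expr(String in, int pos) {
        int p = expr(in, pos);   // left-recursive call at the same position:
                                 // recurses forever (in practice, a stack overflow)
        if (p >= 0 && p < in.length() && in.charAt(p) == '+')
            return term(in, p + 1);
        return term(in, pos);    // the second alternative is never even reached
    }

    // term → NUM
    static int term(String in, int pos) {
        return (pos < in.length() && Character.isDigit(in.charAt(pos))) ? pos + 1 : -1;
    }
}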

A less naive parsing algorithm can do the following: if it encounters left recursion in expr it will freeze parsing at that point and try the other alternatives first. If one of the others succeeds, the parser will go back in time into the first (recursive) alternative, and substitute any ‘successes’ of expr obtained so far into the left-recursive call. Awesome, right?

Let’s try to work through a hypothetical time-jumping parser, and parse 1+2 using our left recursive grammar.

We’ll use a notation like expr/0 to indicate a ‘call’ to expr, i.e. a parsing attempt, starting at input position 0.

We’ll use a notation like expr → expr • ‘+’ term to indicate a CodePos, for example the preceding CodePos says "we are in the middle of running the first alternative of expr, we have successfully parsed the recursive call, and the ‘code’ is now right before trying to match the plus sign"

Let’s run our parser!

Rules:

expr → expr ‘+’ term
     | term

term → NUM

Input: 1 + 2

Trace: (interactive; see the full Takmela trace of 1+2 later in this article)

If we try to use the same algorithm to parse 1+2+3, a rather interesting thing happens: we will resume the frozen expr/0 twice! Once as above, when 1 has been fully parsed as an expr, and a second time when 1+2 has been fully parsed (i.e. when the full trace is completed). When 1+2 is parsed it will resume expr/0 at expr → expr • ‘+’ term, all ready to match + and 3.

We are using a technique of (1) freezing, or setting aside, the remaining execution of some parse, and (2) invoking it with results from other parts of the program.

This concept of a ‘frozen program that we can re-execute’ is called a continuation, and it is the core of what we’re trying to do with Takmela.
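If continuations are new to you: in mainstream-language terms, a continuation is just "the rest of the work", reified as a value that can be stored and invoked later, possibly more than once. A rough Java analogue of the idea (an illustration only, not Takmela’s actual representation):

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

class ContinuationDemo {
    public static void main(String[] args) {
        List<IntConsumer> pending = new ArrayList<>();

        // "Whenever expr/0 succeeds, ending at some position endPos, resume here."
        IntConsumer afterExpr = endPos ->
            System.out.println("resume caller at position " + endPos);

        pending.add(afterExpr);    // freeze: file away the rest of the work

        pending.get(0).accept(1);  // expr/0 succeeded ending at 1 → resume
        pending.get(0).accept(3);  // ...and again ending at 3 → resume once more
    }
}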

Notice, however, that implementing our hypothetical time-jump parser feels rather complex, doesn’t it? We’d need to implement something resembling a call stack for the non-left-recursive parsing, while also representing continuations, resuming them, and so on.

To simplify things we’ll go further: everything will use continuations. There will be no ‘normal calls’, and our parser doesn’t even need to analyze the grammar to detect which calls are left-recursive.

It turns out this "suspend all the things" approach makes the parser rather powerful: all the nasty things (left recursion, ambiguity, wrong behavior due to nullable nonterminals) become normal processing steps when we don’t use direct calls. The algorithm will just keep eating away at the input, branching as many times as it needs.

We are now in a position to understand the full algorithm. Let’s define the little pieces from which we’ll form our parsing engine…

How Takmela parsing works

A Call is a pair (non_terminal, input_pos); for example a call (term, 5), or term/5, means we’re trying to parse a term starting from input position 5.

A CodePos is a place in a rule body, indicating what has been parsed so far; we’ll represent it here by a heavy dot. For example, in the first alternative of expr we could have a CodePos like this: expr → expr ‘+’ • term, which means we have parsed an expr and a plus sign but are yet to parse the final term.

A Continuation is a tuple (Call, CodePos)

The role of a continuation is to tell us what should happen after a callee has succeeded. For example, suppose a rule a → b c d was called at position 5, calling b moved us to position 7, and the sub-call to c at position 7 has succeeded; what should happen after c’s success? We should resume the call to a/5 right before d, thus one of the continuations for the callee c/7 is (a/5, a → b c • d).

Notice that the call a/5 means "attempting to parse an ‘a’ starting at position 5"; resuming that call will continue from wherever c has finished parsing, not from position 5.
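As a tiny concrete illustration, here is that continuation built out of plain Java records (a simplification: CodePos is shown as a string; these types are illustrative, not Takmela’s actual ones):

class ContinuationExample {
    record Call(String rule, int pos) {}
    record Continuation(Call call, String codePos) {}   // codePos simplified to a string

    public static void main(String[] args) {
        // the continuation stored for the callee c/7 in the example above:
        Continuation k = new Continuation(new Call("a", 5), "a → b c • d");
        System.out.println(k);
    }
}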

Our program state is stored in these variables:

K is the continuation table; it maps Call → Set<Continuation>, as in the example above. This is our graph-structured stack. (If you’re into writing interpreters: this graph looks like a family of related call stacks in one compact data structure.)

S is the success table, or memoisation table; it stores all calls that have succeeded so far. It maps Call → Set<final_input_pos>. For example, an item (expr, 5) → 10 means that a parse for expr starting from position 5 has succeeded, and the rest of the parsing (resumed via a continuation) should continue at position 10.

If we want to store parse trees, we can have an additional entry in the success table alongside the final position. More about parse trees in our example with ambiguity.
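Concretely, a success entry that also carries a tree might look like this (a sketch; Tree is a stand-in type):

import java.util.*;

class SuccessTableSketch {
    interface Tree {}                                 // stand-in for a parse-tree type
    record Call(String ruleName, int inputPos) {}
    record Success(int finalInputPos, Tree tree) {}   // final position, plus the tree alongside it

    Map<Call, Set<Success>> S = new HashMap<>();
}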

In addition to K and S, we will have newK and newS, where we collect any newly created continuations or successes at a given iteration of the algorithm. They have the same structure as the original K & S.

A final part of the program state is the Awakenings set, used to prevent redundant re-parsing of past successful parses. Awakenings is a set of (input_pos, continuation) pairs, so having an awakening like (5, (a/2, a → b c • d)) means "we have already processed the call to c from its parent (a, 2), and this call to c has succeeded at position 5".

In an imagined ML-like language, our program state would be fully expressed like this:

  type Call = (RuleName, InputPos)
  type Continuation = (Call, CodePos)

  K :: Map<Call, Set<Continuation>>
  S :: Map<Call, Set<InputPos>>
  newK :: sametype K
  newS :: sametype S
  Awakenings :: Set<(InputPos, Continuation)>
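
In Java (the language of the implementation linked later), the same state might be declared as follows. This is a sketch: the names are mine and need not match the actual implementation.

import java.util.*;

class TakmelaState {
    record Call(String ruleName, int inputPos) {}
    // which branch of which rule, plus the offset of the heavy dot within it
    record CodePos(String ruleName, int branch, int offset) {}
    record Continuation(Call call, CodePos codePos) {}
    record Awakening(int inputPos, Continuation cont) {}

    Map<Call, Set<Continuation>> K = new HashMap<>();     // continuation table: the GSS
    Map<Call, Set<Integer>>      S = new HashMap<>();     // success (memoisation) table
    Map<Call, Set<Continuation>> newK = new HashMap<>();  // continuations created this iteration
    Map<Call, Set<Integer>>      newS = new HashMap<>();  // successes created this iteration
    Set<Awakening> awakenings = new HashSet<>();          // guard against redundant resumptions
}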

The parser works in iterations. It starts with exactly one call (calling the root non-terminal at input position zero), and each iteration processes pending calls, matching tokens as far as it can until it either finds a nonterminal (in which case it calls/processes it) or reaches the end of the rule (in which case the rule succeeds). This processing might generate new continuations and successes.

(Reminder: continuations are entries in K/newK, created by calling non-terminals; successes are entries in S/newS, created by process(..) working through a rule until its end.)

After processing, new continuations and successes are joined; a new success might resume some existing continuation (going ‘back in time’). Similarly, a new call/continuation might join with an existing success, meaning "there are existing results of calling the same rule at the same position (outside of this continuation); let’s resume the new call’s continuations with them".

Any new items (continuations and successes) from the joining will be processed in the next iteration.

(If you’re familiar with Datalog implementation techniques all of this should sound familiar; Takmela is basically a special case of semi-naive evaluation.)

In pseudocode:
K = {}; newK = {}  // init continuations
S = {}; newS = {}  // init successes

callRootRule() // invokes 'process' on the rule’s branches, to match tokens etc
               // might generate some newK and newS records

while(true)
{
    kWorklist = diff(newK, K)    // we need only the continuations not seen before
    sWorklist = diff(newS, S)    // ditto
    
    if(kWorklist.size == 0 && sWorklist.size == 0)
    { 
       break // stop if we reach a fixed point
    }     

    newK = {}
    newS = {}
    K = merge(kWorklist, K)
    S = merge(sWorklist, S)
    
    joinNewContinuationsWithPastSuccesses(kWorklist, S) // may invoke `process` on some nonterminals
                                                        // and may generate (schedule) new conts.
                                                        // and successes
                                                        
    joinNewSuccessesWithPastContinuations(sWorklist, K) // same as above
}
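
The loop leans on two small operations, diff and merge, over our map-of-sets tables. In Java they might look like this (a sketch matching the state shapes above):

import java.util.*;

class Worklists {
    // Entries of 'fresh' that are not already present in 'seen'.
    static <K, V> Map<K, Set<V>> diff(Map<K, Set<V>> fresh, Map<K, Set<V>> seen) {
        Map<K, Set<V>> out = new HashMap<>();
        fresh.forEach((key, vals) -> {
            Set<V> novel = new HashSet<>(vals);
            novel.removeAll(seen.getOrDefault(key, Set.of()));
            if (!novel.isEmpty()) out.put(key, novel);
        });
        return out;
    }

    // Fold the worklist into the accumulated table (mutates and returns 'acc').
    static <K, V> Map<K, Set<V>> merge(Map<K, Set<V>> worklist, Map<K, Set<V>> acc) {
        worklist.forEach((key, vals) ->
            acc.computeIfAbsent(key, k -> new HashSet<>()).addAll(vals));
        return acc;
    }
}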

This is the essential outline of the algorithm, we’ll now explain it again in pseudocode but in detail, and show the role of the awakenings set too.

Takmela's parsing algorithm in detail
K = {}; newK = {}  // init continuations
S = {}; newS = {}  // init successes

awakenings = {}

// call the start rule; its continuation resumes a synthetic rule 'root → startRule'
// right after the start symbol (that's the Root → e • seen in the traces below)
call(startRule, ("root", 0), 0, <root → startRule •>)

while(true)
{
    kWorklist = diff(newK, K)
    sWorklist = diff(newS, S)
    
    if(kWorklist.size == 0 && sWorklist.size == 0)
    { 
        break  // stop if we reach a fixed point
    }     

    newK = {}
    newS = {}
    
    K = merge(kWorklist, K)
    S = merge(sWorklist, S)

    awakenings.clear()
    
    joinNewContinuationsWithPastSuccesses(kWorklist, S) // may invoke 'process' on some nonterminals
                                                        // and may generate (schedule) new conts. and
                                                        // successes
                                                        
    joinNewSuccessesWithPastContinuations(sWorklist, K) // same as above
}

func call(string callee, Call caller, int inputPosNow, CodePos codePos)
{
    // 'caller' is the full Call (rule name + the position it was originally called at)
    // of the calling rule, so the continuation resumes the caller at its original call;
    // e.g. when a/5 calls c at position 7, the continuation stored for c/7 is
    // (a/5, a → b c • d), matching the Continuation type defined earlier
    Cont cont = (caller, codePos)
    bool firstCall = ! ( K.containsKey((callee, inputPosNow)) || newK.containsKey((callee, inputPosNow)) )
    
    newK[callee, inputPosNow].add(cont) // schedule a 'join new continuation with successes'

    if (firstCall)
    {
        process(callee, inputPosNow)
    }
}

func process(string ruleName, int inputPos)
{
    Rule r = rules[ruleName]

    for (RuleBranch b : r.branches)
    {
        processBranch((ruleName, inputPos), b.codeStart(), inputPos)
    }
}

func processBranch(Call thisCall, CodePos codePos, int inputPosNow)
{
    (ruleName, callPos) = thisCall  // callPos: where this rule was originally called;
                                    // inputPosNow: where we are starting/resuming now
    
    if (ruleName == "root" && eof(inputPosNow))
    {
        fire event onSuccessfulParse()   // ← this is where the algorithm actually yields results
                                         // if we return parse trees, we’d pass them here
        return
    }
    
    bool stop = false
    int inputPos = inputPosNow
    
    while (codePos != endOfRuleBranch)
    {
        if (codePos is Terminal)
        {
            Terminal t = codePos
            if (!match(t, inputPos))
            {
                stop = true
                break
            }
            else
            {
                inputPos++
            }
        }
        else if (codePos is Nonterminal)
        {
            Nonterminal toCall = codePos
            call(toCall.name, thisCall, inputPos, next(codePos))
            stop = true
            break
        }
        
        codePos = next(codePos)
        
    } // end while
    
    if (!stop)
    {
        // Reached the end of this rule branch without stopping due to a match failure
        // or a nonterminal call.
        // The rule has succeeded: record the success under the rule's *original* call
        // position (not the resumption position), thereby scheduling a
        // 'join new success with continuations'

        newS[ruleName, callPos].add(inputPos)
    }
}

func joinNewContinuationsWithPastSuccesses(kWorklist, S)
{
    for (k : kWorklist.entries())
    {
        (callee, posAtCallee) = k.key
        Set cs = k.value
        for (Cont c : cs)
        {
            successes = S[callee, posAtCallee]
            for (int s : successes)
            {
                if (!awakenings.contains((s, c)))
                {
                    awakenings.add((s, c))
                    processBranch(c.call, c.codePos, s)   // resume the caller at the success position
                }
            }
        }
    }
}

func joinNewSuccessesWithPastContinuations(sWorklist, K)
{
    for (s : sWorklist)
    {
        (callee, posAtCallee) = s.key
        successPositions = s.value
        for (int p : successPositions)
        {
            contSet = K[callee, posAtCallee]
            
            for (cont : contSet)
            {
                if (!awakenings.contains((p, cont)))
                {
                    awakenings.add((p, cont))
                    processBranch(cont.call, cont.codePos, p)
                }
            }
        }
    }
}

We will now do a trace of parsing 1+2 using Takmela, then again with 1+2+3

Let's trace how Takmela parses 1+2

Hint: if the trace appears confusing this might help

You can also open this trace in a separate page: link here

Call start nonterminal
#0: Called e/0 at the top level (will process)
    Processing e/0 codePos e → • e `+` t inputPos 0
    Calling: e/0 with Cont e/0 ; e → e• `+` t [already processed]
    Processing e/0 codePos e → • t inputPos 0
    Calling: t/0 with Cont e/0 ; e → t• [will process]
    Processing t/0 codePos t → • NUM inputPos 0
    Match NUM to 1, matched; inputPos now 1
    Reached end of rule, success! t/0 → 1
    New continuations:
        e/0 ⇐ e/0 ; e → e• `+` t ; e
        t/0 ⇐ e/0 ; e → t• ; e
    Successes so far: t/0 → 1, t('1')

Iteration 0
    Worklist - successes
        jS t/0 → 1 t('1')
    Worklist - continuations
        nJ e/0 ⇐ e/0 ; e → e• `+` t
        S⤚ t/0 ⇐ e/0 ; e → t•
    #1: JoinS: t/0 → 1 will resume cont: e/0 ; e → t•
        Processing e/0 codePos e → t• inputPos 1
        Reached end of rule, success! e/0 → 1
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1'))

Iteration 1
    Worklist - successes
        jS e/0 → 1 e(t('1'))
    Worklist - continuations
    #2: JoinS: e/0 → 1 will resume cont: Root/0 ; Root → e•
        × Didn't reach end of input, not a successful parse
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1'))
    #3: JoinS: e/0 → 1 will resume cont: e/0 ; e → e• `+` t
        Processing e/0 codePos e → e• `+` t inputPos 1
        Match + to +, matched; inputPos now 2
        Calling: t/2 with Cont e/0 ; e → e `+` t• [will process]
        Processing t/2 codePos t → • NUM inputPos 2
        Match NUM to 2, matched; inputPos now 3
        Reached end of rule, success! t/2 → 3
    New continuations:
        t/2 ⇐ e/0 ; e → e `+` t• ; e(e(t('1')), '+')
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1')) | t/2 → 3, t('2')

Iteration 2
    Worklist - successes
        jS t/2 → 3 t('2')
    Worklist - continuations
        S⤚ t/2 ⇐ e/0 ; e → e `+` t•
    #4: JoinS: t/2 → 3 will resume cont: e/0 ; e → e `+` t•
        Processing e/0 codePos e → e `+` t• inputPos 3
        Reached end of rule, success! e/0 → 3
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1')) | t/2 → 3, t('2') | e/0 → 3, e(e(t('1')),'+',t('2'))

Iteration 3
    Worklist - successes
        jS e/0 → 3 e(e(t('1')),'+',t('2'))
    Worklist - continuations
    #5: JoinS: e/0 → 3 will resume cont: Root/0 ; Root → e•
        Reached end of input, successful parse!
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1')) | t/2 → 3, t('2') | e/0 → 3, e(e(t('1')),'+',t('2')) | Root/0 → 3, Root(e(e(t('1')),'+',t('2')))
    #6: JoinS: e/0 → 3 will resume cont: e/0 ; e → e• `+` t
        Processing e/0 codePos e → e• `+` t inputPos 3
        Match `+` to (eof), fail
    Successes so far: t/0 → 1, t('1') | e/0 → 1, e(t('1')) | t/2 → 3, t('2') | e/0 → 3, e(e(t('1')),'+',t('2')) | Root/0 → 3, Root(e(e(t('1')),'+',t('2')))

Iteration 4
    Worklist - successes
        nJ Root/0 → 3 Root(e(e(t('1')),'+',t('2')))
    Worklist - continuations
    (No new calls or successes that can be further processed)

Iteration 5 [Fixed point reached]

The next step is to trace 1+2+3, which you'll find here.

For many more examples, you can go here.

If you want actual code, you’ll find the Java implementation here. If you want examples more complex than 1+2, you’ll find that I’ve included some tricky cases in the Examples TOC, including ambiguity and nullable rules. And if you want to see some real-world usage of Takmela, the demo program itself uses Takmela to parse a subset of GraphViz dot files: this is the grammar and this is the code for parsing it.

The parsing algorithm is implemented in about 500 lines of Java code, and using more expressive languages would likely reduce that size even further; not bad for a general parser. Remember that Takmela should work with any context-free grammar, including ones with left recursion (both direct and indirect), ambiguity, nullable non-terminals, and even left-recursion-like behavior that results from nullable non-terminals (e.g. a → b a c when b is nullable). And it does all that without any analysis or transformation of its grammar; it just runs.

But let’s talk about Datalog…

Takmelogic: A Datalog execution engine based on Takmela

First, a very quick introduction to Datalog. Datalog is a language of facts, and of rules for deducing new facts; consider the classic example below, consisting of four facts followed by three rules:
// Notice that constants always start with small letters while
// variables will always start with capital letters
parent(a, b).
parent(b, c).
parent(a, d).
parent(d, e).

// If there’s a Z such that X is a parent of Z, and Z is a 
// parent of Y, we can infer the relation "X is a grandparent of Y"
grandparent(X, Y):- parent(X, Z), parent(Z, Y).

// All parents are ancestors
ancestor(X, Y):- parent(X, Y).

// An ancestor of an ancestor is also an ancestor
// Notice that Z is used in exactly the same way as in ‘grandparent’
ancestor(X, Y):- ancestor(X, Z), ancestor(Z, Y).

Datalog is an amazing language: you supply it with some straightforward rules and a database (even one as simple as the four facts above) and it will make you feel as if the computer is thinking. Here are some sample runs of Datalog queries against the above program:

?- parent(a, X)
X = [b, d]

?- grandparent(X, e)
X = [a]

?- ancestor(a, X)
X = [b, c, d, e]

?- ancestor(X, c)
X = [b, a]

There are many ways to implement Datalog; they can be roughly classified as top-down and bottom-up.

A top-down algorithm will start with a query like ancestor(X, c) and try to run any subgoals required to solve the parent query; e.g. ancestor(X, c) yields a call to parent(X, c) from the first ancestor rule, and calls to ancestor(X, Z) & ancestor(Z, c) from the second.

A bottom-up algorithm, like the famous semi-naive algorithm, will start from the facts and keep applying the rules in an attempt to infer all possible data, as if your queries were all variables; it would be as if you always had to run grandparent(V0, V1) or ancestor(V0, V1).
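
To make the bottom-up style concrete, here is a minimal semi-naive evaluation of the two ancestor rules over the four parent facts above, sketched in Java (this is plain semi-naive, not Takmelogic):

import java.util.HashSet;
import java.util.Set;

class SemiNaiveAncestor {
    record Pair(String x, String y) {}

    public static void main(String[] args) {
        Set<Pair> parent = Set.of(new Pair("a", "b"), new Pair("b", "c"),
                                  new Pair("a", "d"), new Pair("d", "e"));

        Set<Pair> ancestor = new HashSet<>(parent);  // ancestor(X,Y) :- parent(X,Y).
        Set<Pair> delta = new HashSet<>(parent);     // tuples that were new in the last round

        while (!delta.isEmpty()) {
            Set<Pair> next = new HashSet<>();
            // ancestor(X,Y) :- ancestor(X,Z), ancestor(Z,Y).
            // Semi-naive: only join combinations where at least one side is new.
            for (Pair d : delta)
                for (Pair a : ancestor) {
                    if (d.y().equals(a.x())) next.add(new Pair(d.x(), a.y()));
                    if (a.y().equals(d.x())) next.add(new Pair(a.x(), d.y()));
                }
            next.removeAll(ancestor);   // keep only genuinely new tuples
            ancestor.addAll(next);
            delta = next;
        }

        System.out.println(ancestor);   // all ancestor pairs, as if the query had no constants
    }
}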

Top-down Datalog engines obviously do less work when the query has some solution-constraining data, like ancestor(X, c) [note: not talking about constraint-solving]. There are clever algorithms, like the so-called Magic Sets, that fool bottom-up algorithms into working like top-down algorithms. They do this by generating a few new rules and a new query, such that running semi-naive on the modified program will simulate a top-down query engine, constraining data and all, running the original program. It’s really a sign of humanity’s ingenuity.

Let’s think about how a Datalog query might be executed. It feels as if any query engine (let’s focus on top-down) has to do the following: (1) work through a rule’s body subgoal by subgoal, keeping track of how far it has gotten; (2) match simple goals against the stored facts; (3) ‘call’ other rules as subgoals; and (4) carry variable bindings forward, resuming with whatever results those subgoals produce.

This sounds a lot like parsing, and especially like general parsers a la Takmela. To take the idea further I literally did a copy/paste on Takmela’s parser, changed various things, and tested it with some Datalog programs. It worked.

                          Parsing                                 Datalog
Input position            An integer                              n/a
Call                      Rule name + input pos                   Rule name + any constant arguments,
                                                                  e.g. ancestor(a, ?)
CodePos                   Where are we in a non-terminal rule?    Where are we in a Datalog rule
                                                                  (aka a Horn clause)?
Continuation              Call + CodePos + "parse tree so far"    Call + CodePos + any bound intermediate
                          if we want parse trees                  variables, e.g. {Z = b}
Success                   Call → Set<FinalInputPos>               Call → Set<ResultTuple>
Surface matching          Matching terminals                      Querying facts
More complex matching     Calling non-terminals                   Calling subgoals
When resuming a           Continue from last input position       Make sure variables match
continuation

Datalog as implemented in Takmelogic doesn’t have the concept of an input position, because facts can be matched multiple times while executing a query. Perhaps in some variant of the language (linear logic?) a concept of "facts used so far" would correspond to the input position.

A continuation in Datalog-via-Takmela also includes the values of any variables that have already been bound, reminding us of the relationship between a continuation and a programming language’s call stack.
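
Correspondingly, the Takmelogic state differs from the parser’s mainly in these types; a Java sketch mirroring the comparison table above (names illustrative, not the actual implementation’s):

import java.util.*;

class TakmelogicStateSketch {
    // A call is a predicate plus whichever arguments are bound to constants,
    // e.g. ancestor(a, ?) would be ("ancestor", {0 → "a"}).
    record Call(String predicate, Map<Integer, String> boundArgs) {}
    record CodePos(String predicate, int clause, int subgoal) {}  // position in a clause body

    // Unlike the parser's continuation, this one also carries the
    // intermediate variable bindings accumulated so far, e.g. {Z = "b"}.
    record Continuation(Call call, CodePos codePos, Map<String, String> bindings) {}

    Map<Call, Set<Continuation>> K = new HashMap<>();   // the GSS again
    Map<Call, Set<List<String>>> S = new HashMap<>();   // Call → set of result tuples
}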

Let's see a trace of ancestor(b, X), which can be found here. You'll find it closely resembles our traces of Takmela's parsing, with the GSS, successes, and so on. The source code for the core Takmelogic engine can be found here; I suggest opening it side by side with Takmela's parser code and comparing them!

A Java implementation of the first iteration (lol) of Takmelogic can be found within the same project as the Takmela parser in the takmelogic package.

More traces can be found in the Trace TOC page

Adding to Takmelogic

Sometimes while executing a Datalog query, one needs to wait until a certain subquery has been completely executed. Consider for example a special operation count that works like this:

num_ancestors(X, N):-
	N:= count ancestor(Y, X).

The predicate num_ancestors will attempt to find all tuples where Y is an ancestor of X and return their count. It is important to schedule the execution such that a call to num_ancestors(d, N) waits until all possible results of ancestor(Y, d) have been computed. This is also important for negation: if we want to test that a certain predicate is not true, we need to make sure no future inference of this predicate will invalidate our conclusion.

A plain Takmelogic implementation doesn’t have that guarantee but we can add it. There's a technique that's compatible with the GSS approach we're using, but perhaps that's a topic for another article.

Comparison with earlier work

Earley’s algorithm is the ancestor of many general parsers. It uses a GSS implicitly.

Earley's parsing technique has been applied to Prolog/Datalog queries, as presented in the 1983 paper "Parsing as Deduction" and improved upon by many other researchers, so I'm definitely not the first person to consider a link between general parsing and Datalog!

Takmela is closely related to the parsing technique presented in Mark Johnson's "Squibs and Discussions: Memoization in Top-Down Parsing" (link). The main difference is that Johnson's technique uses Scheme's built-in continuations, while Takmela represents continuations as an explicit data structure. This opens the door to implementing the parser in typical programming languages that don't support continuations, and access to the GSS enables features like visualizing a running parser, or optimizations that e.g. compress or prune the graph-structured stack (if I understood correctly -- and I may have not -- Marpa seems to make great use of such optimizations).

Takmela was directly inspired by GLL and started as an attempt to simplify it in order to understand it (because I needed a parser for my other projects and wasn’t satisfied with existing tools). In terms of generality and worst-case time complexity Takmela can’t beat GLL, which represents the state of the art in parsing right now. If Takmela adds something, it is simplicity of understanding and implementation, which could make it easier for more people to make use of general parsing or to add improvements.

A really important part is missing

The bad news: I haven’t written a correctness proof for Takmela, nor worked out the algorithm’s time complexity. I’ve tried, but it seems I lack the academic experience needed to do so. I do have some rough ideas on how to start, but that's it.

Before you dismiss me outright, please consider that it’s okay to ask for help, right? I want to release this work to the world, but I fully acknowledge my lack of the knowledge needed to finish it. I’m not writing this article to celebrate some accomplishment but for the sake of outreach. If you’re a person who finds this research direction interesting and who actually knows what they’re doing, you’re so welcome to contact me about mentoring, guidance, supervision, or other forms of collaboration. My email is samy2004@gmail.com

What would you gain out of this? I suggest the possibility of discovering something that makes the field a slightly more interesting place: this article is about a ~500-line general parser and a similarly sized Datalog interpreter, both of which can be explained to someone else in about an hour. The only things missing are the correctness and complexity proofs; with those, both general parsing and Datalog could become much more ubiquitously used. We also have the potential of the GSS as a new computer science workhorse, through introducing more variants of Takmela’s algorithm for many other uses. Thank you for giving someone the benefit of the doubt in the name of pursuing ideas!