This article is long and requires some knowledge of parsing (not too advanced, I promise), but if you like computer science I feel you would really, really enjoy reading it; so let’s go!
The parser version of Takmela depends on combining two CS concepts: the graph-structured stack (GSS) and the continuation.
We can begin to understand how Takmela works by looking at a common problem in parsing with context-free grammars: the left-recursion problem. Consider the classic arithmetic-expression example:
expr → expr ‘+’ term
| term
term → NUM
Let’s try to parse the input 1+2+3 using this grammar.
A naive top-down parser will try to ‘invoke’ expr, attempt to parse the first alternative, and so immediately try to recursively invoke expr again without consuming any input. This leads to an infinite loop.
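Here is a minimal Java sketch of that failure mode (my illustration, not Takmela code):

```java
// Naive recursive descent for: expr -> expr '+' term | term ; term -> NUM
// The first alternative re-invokes parseExpr at the same position before
// consuming anything, so parseExpr(0) never terminates (in practice it
// dies with a StackOverflowError).
class NaiveParser {
    final String input = "1+2+3";

    boolean parseExpr(int pos) {
        // alternative 1: expr '+' term: a left-recursive call at the same position
        if (parseExpr(pos)) {
            // ... we would match '+' and then a term here, but we never get this far
        }
        // alternative 2: term
        return parseTerm(pos);
    }

    boolean parseTerm(int pos) {
        return pos < input.length() && Character.isDigit(input.charAt(pos));
    }
}
```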
A less naive parsing algorithm can do the following: if it encounters left recursion in expr it will freeze parsing at that point and try other alternatives first. If one of the others succeeds, the parser will go back in time into the first (recursive) alternative and substitute any ‘successes’ of expr obtained so far into the left-recursive call. Awesome, right?
Let’s work through a hypothetical time-jumping parser and parse 1+2 using our left-recursive grammar.
We’ll use a notation like expr/0 to indicate a ‘call’ to expr, i.e. a parsing attempt starting at input position 0.
We’ll use a notation like expr → expr • ‘+’ term to indicate a CodePos. For example, the preceding CodePos says: "we are in the middle of running the first alternative of expr; we have successfully parsed the recursive call, and the ‘code’ is now right before trying to match the plus sign".
Let’s run our parser!
Rules
expr → expr ‘+’ term
| term
term → NUM
Input
1 + 2

Trace
If we try to use the same algorithm to parse 1+2+3, a rather interesting thing happens: we resume the frozen expr/0 twice! Once as above, when 1 has been fully parsed as an expr, and a second time when 1+2 has been fully parsed (i.e. when the full trace above is completed). When 1+2 is parsed it resumes expr/0 at expr → expr • ‘+’ term, all ready to match + and 3.
We are using a technique of (1) freezing, or setting aside, the remaining execution of some parse, and (2) invoking it later with results from other parts of the program.
This concept of ‘frozen program that we can re-execute’ is called a continuation, and is the core of what we’re trying to do with Takmela.
Implementing our hypothetical time-jump parser feels rather complex, doesn’t it? We’d need to implement something resembling a call stack for the non-left-recursive parsing, and also deal with representing continuations, resuming them, and so on.
To simplify things we’ll go further: everything will use continuations. There will be no ‘normal calls’, and our parser won’t even need to analyze the grammar to detect which calls are left-recursive.
It turns out this "suspend all the things" approach makes the parser rather powerful: all the nasty things (left recursion, ambiguity, wrong behavior due to nullable nonterminals) become normal processing steps when we don’t use direct calls. The algorithm just keeps eating away at the input, branching as many times as it needs.
We are now in a position to understand the full algorithm. Let’s define the little pieces from which we’ll form our parsing engine…
A Call is a pair (non_terminal, input_pos); for example the call (term, 5), also written term/5, means we’re trying to parse a term starting from input position 5.
A CodePos is a place in a rule body, indicating what has been parsed so far; we’ll represent it here by a heavy dot. For example in the first alternative of expr, we could have a CodePos like this: expr → expr ‘+’ • term, which means we have parsed an expr and a plus sign but have yet to parse the final term.
A Continuation is a tuple (Call, CodePos)
The role of a continuation is to tell us what should happen after a callee has succeeded. Example: suppose a rule a → b c d was called at position 5, calling b moved us to position 7, and the sub-call to c at position 7 has succeeded. What should happen after c’s success? We should resume the call to a/5 right before d; thus one of the continuations for the callee c/7 is (a/5, a → b c • d).
Notice that the call a/5 means "attempting to parse an ‘a’ starting at position 5"; resuming that call will continue from wherever c has finished parsing, not from position 5.
Our program state is stored in these variables:
K is the continuation table; it maps Call → Set<Continuation>, as in the example above. This is our graph-structured stack. (If you’re into writing interpreters: this graph looks like a family of related call stacks stored in one compact data structure.)
S is the success table, or memoisation table; it stores all calls that have succeeded so far. It maps Call → Set<final_input_pos>. For example an entry (expr, 5) → 10 means that a parse of expr starting at position 5 has succeeded, and the rest of the parsing (resumed via a continuation) should continue at position 10.
If we want to store parse trees, we can have an additional entry in the success table alongside the final position. More about parse trees in our example with ambiguity.
In addition to K and S, we will have newK and newS where we collect any newly created continuations or successes at a given iteration of the algorithm. They have the same structure as the original K & S.
A final part of the program state is the Awakenings set, used to prevent redundant re-parsing of past successful parses. Awakenings is a set of (input_pos, continuation) pairs, so having an awakening like (5, (a/2, a → b c • d)) means "we have already processed the call to c from its parent (a, 2), and this call to c has succeeded at position 5".
In an imagined ML-like language, our program state would be fully expressed like this:
type Call = (RuleName, InputPos)
type Continuation = (Call, CodePos)
K :: Map<Call, Set<Continuation>>
S :: Map<Call, Set<InputPos>>
newK :: sametype K
newS :: sametype S
Awakenings :: Set<(InputPos, Continuation)>
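If you prefer Java (the language of the reference implementation), the same state could be sketched like this; the type and field names here are my own, not necessarily those of the actual implementation:

```java
import java.util.*;

// A call: "try to parse ruleName starting at inputPos"
record Call(String ruleName, int inputPos) {}
// A code position: which branch of which rule, and where the heavy dot sits in it
record CodePos(String ruleName, int branch, int dotIndex) {}
// A continuation: resume `caller` at `codePos` once the callee succeeds
record Continuation(Call caller, CodePos codePos) {}
// An awakening: this continuation was already resumed at this input position
record Awakening(int inputPos, Continuation cont) {}

class ParserState {
    Map<Call, Set<Continuation>> K = new HashMap<>();    // the graph-structured stack
    Map<Call, Set<Integer>> S = new HashMap<>();         // success/memoisation table
    Map<Call, Set<Continuation>> newK = new HashMap<>(); // continuations created this iteration
    Map<Call, Set<Integer>> newS = new HashMap<>();      // successes created this iteration
    Set<Awakening> awakenings = new HashSet<>();
}
```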
The parser works in iterations. It starts with exactly one call (calling the root non-terminal at input position zero), and each iteration processes pending calls, matching tokens as long as it can until it either finds a nonterminal (in which case it calls/processes it) or reaches the end of the rule (in which case the rule succeeds). This processing may generate new continuations and successes.
(Reminder: continuations are entries in K/newK, created by calling non-terminals; successes are entries in S/newS, created by process(..) working through a rule to its end.)
After processing, new continuations and successes are joined: a new success might lead to resuming some existing continuation (going ‘back in time’). Similarly, a new call/continuation might join with an existing success, meaning "there are existing results of calling the same rule at the same position (outside of this continuation); let’s resume the new call’s continuations with them".
Any new items (continuations and successes) from the joining will be processed in the next iteration.
(If you’re familiar with Datalog implementations all of this should sound familiar; Takmela is basically a special case of semi-naive evaluation.)
In pseudocode:
K = {}; newK = {} // init continuations
S = {}; newS = {} // init successes
callRootRule() // invokes 'process' on the rule’s branches, to match tokens etc
// might generate some newK and newS records
while(true)
{
kWorklist = diff(newK, K) // we need only the continuations not seen before
sWorklist = diff(newS, S) // ditto
if(kWorklist.size == 0 && sWorklist.size == 0)
{
break // stop if we reach a fixed point
}
newK = {}
newS = {}
K = merge(kWorklist, K)
S = merge(sWorklist, S)
joinNewContinuationsWithPastSuccesses(kWorklist, S) // may invoke `process` on some nonterminals
// and may generate (schedule) new conts.
// and successes
joinNewSuccessesWithPastContinuations(sWorklist, K) // same as above
}
This is the essential outline of the algorithm, we’ll now explain it again in pseudocode but in detail, and show the role of the awakenings set too.
K = {}; newK = {} // init continuations
S = {}; newS = {} // init successes
awakenings = {}
process(startRule, 0) // 'call' the root rule; it has no caller, so there is no continuation to record
while(true)
{
kWorklist = diff(newK, K)
sWorklist = diff(newS, S)
if(kWorklist.size == 0 && sWorklist.size == 0)
{
break // stop if we reach a fixed point
}
newK = {}
newS = {}
K = merge(kWorklist, K)
S = merge(sWorklist, S)
awakenings.clear()
joinNewContinuationsWithPastSuccesses(kWorklist, S) // may invoke 'process' on some nonterminals
// and may generate (schedule) new conts. and
// successes
joinNewSuccessesWithPastContinuations(sWorklist, K) // same as above
}
func call(string callee, Call caller, int inputPosNow, CodePos codePos)
{
// the continuation remembers the caller's own Call (rule name plus the
// position where the *caller* started), and where to resume in its body
Cont cont = (caller, codePos)
bool firstCall = ! ( K.containsKey((callee, inputPosNow)) || newK.containsKey((callee, inputPosNow)) )
newK[callee, inputPosNow].add(cont) // schedule a 'join new continuation with successes'
if (firstCall)
{
process(callee, inputPosNow)
}
}
func process(string ruleName, int inputPos)
{
Rule r = rules[ruleName]
for (RuleBranch b : r.branches)
{
processBranch((ruleName, inputPos), b.codeStart(), inputPos)
}
}
func processBranch(Call currentCall, CodePos codePos, int inputPosNow)
{
// currentCall is (ruleName, ruleStartPos); inputPosNow is where matching resumes,
// which is further along than ruleStartPos when we're resuming a continuation
string ruleName = currentCall.ruleName
if (ruleName == "root" && codePos == endOfRuleBranch)
{
if (eof(inputPosNow))
{
fire event onSuccessfulParse() // ← this is where the algorithm actually yields results
// if we return parse trees, we’d pass them here
}
return // the root rule has no continuations to resume, so we're done either way
}
bool stop = false
int inputPos = inputPosNow
while (codePos != endOfRuleBranch)
{
if (codePos is Terminal)
{
Terminal t = codePos
if (!match(t, inputPos))
{
stop = true
break
}
else
{
inputPos++
}
}
else if (codePos is Nonterminal)
{
Nonterminal toCall = codePos
call(toCall.name, currentCall, inputPos, next(codePos));
stop = true
break
}
codePos = next(codePos)
} // end while
if (!stop)
{
// Reached the end of this rule branch without stopping due to a match failure
// or a nonterminal call.
// The rule has succeeded; schedule a 'join new success with continuations'
newS[currentCall].add(inputPos) // currentCall = (ruleName, ruleStartPos)
}
}
func joinNewContinuationsWithPastSuccesses(kWorklist, S)
{
for(k of kWorklist.entries())
{
(callee, posAtCallee) = k.key
Set cs = k.value
for(Cont c : cs)
{
successes = S[callee, posAtCallee]
for(int s : successes)
{
if(!awakenings.contains((s, c)))
{
awakenings.add((s,c))
processBranch(c.caller, c.codePos, s)
}
}
}
}
}
func joinNewSuccessesWithPastContinuations(sWorklist, K)
{
for (s of sWorklist.entries())
{
(callee, posAtCallee) = s.key
successPositions = s.value
for (int p : successPositions)
{
contSet = K[callee, posAtCallee]
for (cont : contSet)
{
if (!awakenings.contains((p, cont)))
{
awakenings.add((p, cont));
processBranch(cont.caller, cont.codePos, p)
}
}
}
}
}
We will now do a trace of parsing 1+2 using Takmela; a trace for 1+2+3 is linked afterwards.
Hint: if the trace appears confusing this might help
You can collapse this trace and open it in a separate page, link here
Call start nonterminal

[GSS snapshot image]

#0: Called e/0 at the top level (will process)
Processing e/0 codePos e → • e `+` t inputPos 0
Calling: e/0 with Cont e/0 ; e → e• `+` t [already processed]
Processing e/0 codePos e → • t inputPos 0
Calling: t/0 with Cont e/0 ; e → t• [will process]
Processing t/0 codePos t → • NUM inputPos 0
Match NUM to 1, matched; inputPos now 1
Reached end of rule, success! t/0 → 1

Continuations (K):
e/0 → e/0 ; e → e• `+` t ; e
t/0 → e/0 ; e → t• ; e

Successes (S):
t/0 → 1, t('1')

Iteration 0

[GSS snapshot image]

#1: JoinS: t/0 → 1 will resume cont: e/0 ; e → t•
Processing e/0 codePos e → t• inputPos 1
Reached end of rule, success! e/0 → 1

Successes (S):
t/0 → 1, t('1')
e/0 → 1, e(t('1'))

Iteration 1

[GSS snapshot image]

#2: JoinS: e/0 → 1 will resume cont: Root/0 ; Root → e•
× Didn't reach end of input, not a successful parse

Successes (S):
t/0 → 1, t('1')
e/0 → 1, e(t('1'))

[GSS snapshot image]

#3: JoinS: e/0 → 1 will resume cont: e/0 ; e → e• `+` t
Processing e/0 codePos e → e• `+` t inputPos 1
Match + to +, matched; inputPos now 2
Calling: t/2 with Cont e/0 ; e → e `+` t• [will process]
Processing t/2 codePos t → • NUM inputPos 2
Match NUM to 2, matched; inputPos now 3
Reached end of rule, success! t/2 → 3

New continuation (K):
t/2 → e/0 ; e → e `+` t• ; e(e(t('1')), '+')

Successes (S):
t/0 → 1, t('1')
e/0 → 1, e(t('1'))
t/2 → 3, t('2')

Iteration 2

[GSS snapshot image]

#4: JoinS: t/2 → 3 will resume cont: e/0 ; e → e `+` t•
Processing e/0 codePos e → e `+` t• inputPos 3
Reached end of rule, success! e/0 → 3

Successes (S):
t/0 → 1, t('1')
e/0 → 1, e(t('1'))
t/2 → 3, t('2')
e/0 → 3, e(e(t('1')),'+',t('2'))

Iteration 3

[GSS snapshot image]

#5: JoinS: e/0 → 3 will resume cont: Root/0 ; Root → e•
✓ Reached end of input, successful parse!

Successes (S):
t/0 → 1, t('1')
e/0 → 1, e(t('1'))
t/2 → 3, t('2')
e/0 → 3, e(e(t('1')),'+',t('2'))
Root/0 → 3, Root(e(e(t('1')),'+',t('2')))✓

[GSS snapshot image]

#6: JoinS: e/0 → 3 will resume cont: e/0 ; e → e• `+` t
Processing e/0 codePos e → e• `+` t inputPos 3
Match `+` to (eof), fail

Successes (S): (unchanged)

Iteration 4

(No new calls or successes that can be further processed)

Iteration 5 [Fixed point reached]
The next step is to trace 1+2+3, which you'll find here.
For many more examples, you can go here.
If you want actual code you’ll find the Java implementation here. If you want examples more complex than 1+2 you’ll find that I’ve included some tricky cases in the Examples TOC, including ambiguity and nullable rules. If you want to see some real-world usage of Takmela, the demo program itself uses Takmela to parse a subset of GraphViz dot files: this is the grammar and this is the code for parsing it.
The parsing algorithm is implemented in about 500 lines of Java code, and using more expressive languages would likely reduce that size even further; not bad for a general parser. Remember that Takmela should work with any context-free grammar, including ones with left recursion (both direct and indirect), ambiguity, nullable non-terminals, and even left-recursion-like behavior that results from nullable non-terminals (e.g. a → b a c when b is nullable). And it does that without any analysis or transformation of the grammar; it just runs.
But let’s talk about Datalog…
// Notice that constants always start with small letters while
// variables will always start with capital letters
parent(a, b).
parent(b, c).
parent(a, d).
parent(d, e).

// If there’s a Z such that X is a parent of Z, and Z is a
// parent of Y, we can infer the relation "X is a grandparent of Y"
grandparent(X, Y) :- parent(X, Z), parent(Z, Y).

// All parents are ancestors
ancestor(X, Y) :- parent(X, Y).

// An ancestor of an ancestor is also an ancestor
// Notice that Z is used in exactly the same way as in ‘grandparent’
ancestor(X, Y) :- ancestor(X, Z), ancestor(Z, Y).

Datalog is an amazing language, you supply it with some straightforward rules and a database (even one as simple as the four facts above) and it will make you feel as if the computer is thinking. Here are some sample runs of Datalog queries, which run against the above program:
?- parent(a, X)
X = [b, d]

?- grandparent(X, e)
X = [a]

?- ancestor(a, X)
X = [b, c, d, e]

?- ancestor(X, c)
X = [b, a]
There are many ways to implement Datalog; they can be roughly classified as top-down and bottom-up.
A top-down algorithm will start with a query like ancestor(X, c) and try to run whatever subgoals are required to solve it; e.g. ancestor(X, c) yields a call to parent(X, c) from the first ancestor rule, and calls to ancestor(X, Z) & ancestor(Z, c) from the second.
A bottom-up algorithm, like the famous semi-naive, will start from the facts and keep applying the rules in an attempt to infer all possible data, as if your queries were all variables. It will be as if you always had to run grandparent(V0, V1) or ancestor(V0, V1).
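To make "start from the facts and keep applying the rules" concrete, here is a small, self-contained Java sketch of semi-naive evaluation for the ancestor program above (my illustration; class and variable names are mine, not Takmelogic’s):

```java
import java.util.*;

// Semi-naive evaluation of:
//   ancestor(X, Y) :- parent(X, Y).
//   ancestor(X, Y) :- ancestor(X, Z), ancestor(Z, Y).
// Each round only joins the newly derived tuples (delta) against the full
// relation, instead of re-joining everything against everything.
class SemiNaiveAncestor {
    record Pair(String x, String y) {}

    public static void main(String[] args) {
        Set<Pair> parent = Set.of(
            new Pair("a", "b"), new Pair("b", "c"),
            new Pair("a", "d"), new Pair("d", "e"));
        Set<Pair> ancestor = new HashSet<>(parent); // rule 1 seeds the relation
        Set<Pair> delta = new HashSet<>(parent);    // tuples derived in the last round
        while (!delta.isEmpty()) {
            Set<Pair> derived = new HashSet<>();
            // rule 2, with at least one body atom taken from the last delta
            for (Pair d : delta)
                for (Pair a : ancestor) {
                    if (d.y().equals(a.x())) derived.add(new Pair(d.x(), a.y()));
                    if (a.y().equals(d.x())) derived.add(new Pair(a.x(), d.y()));
                }
            derived.removeAll(ancestor); // keep only genuinely new tuples
            ancestor.addAll(derived);
            delta = derived;
        }
        System.out.println(ancestor); // adds (a, c) and (a, e) to the four parent pairs
    }
}
```

The delta set is the whole trick: once a round derives nothing new, we have reached a fixed point, just like Takmela’s main loop.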
Top-down Datalog engines obviously do less work when the query has some solution-constraining data, like ancestor(X, c) [note: not talking about constraint-solving]. There are clever algorithms, like the so-called Magic Sets, that fool bottom-up algorithms into working like top-down algorithms. They do this by generating a few new rules and a new query, such that running semi-naive on the modified program simulates a top-down query engine, constraining data and all, running the original program. It’s really a sign of humanity’s ingenuity.
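To give a flavor of the rewrite, here is a hand-worked sketch (mine, not the output of any particular Magic Sets implementation) for the query ancestor(a, X). For brevity I use the linear variant of the recursive rule, ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y), and assume bindings are passed left to right; a real implementation would also annotate predicates with binding patterns:

```
// the query's bound argument seeds the magic set
magic_ancestor(a).

// the magic set collects every value the bound argument can take
magic_ancestor(Z) :- magic_ancestor(X), parent(X, Z).

// the original rules, guarded so they only fire for "relevant" bindings
ancestor(X, Y) :- magic_ancestor(X), parent(X, Y).
ancestor(X, Y) :- magic_ancestor(X), parent(X, Z), ancestor(Z, Y).

?- ancestor(a, X)
```

Running semi-naive on this program only derives ancestor tuples whose first argument is reachable from a, which is roughly the work a top-down engine would have done.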
Let’s think about how a Datalog query might be executed. It feels as if any query engine (let’s focus on top-down) has to keep track of where it is inside each rule body, match subgoals against stored facts, call other rules as subgoals (possibly recursively), carry variable bindings along as it goes, and remember the results of calls it has already made so they can be reused.
This sounds a lot like parsing, and especially like general parsers a la Takmela. To take the idea further I literally did a copy/paste on Takmela’s parser, changed various things, and tested it with some Datalog programs. It worked.
|  | Parsing | Datalog |
| --- | --- | --- |
| Input position | An integer | n/a |
| Call | Rule name + input pos | Rule name + any constant arguments, e.g. ancestor(a, ?) |
| CodePos | Where are we in a non-terminal rule? | Where are we in a Datalog rule (aka a Horn clause)? |
| Continuation | Call + CodePos + "parse tree so far" if we want parse trees | Call + CodePos + any bound intermediate variables, e.g. {Z = b} |
| Success | Call → Set<FinalInputPos> | Call → Set<ResultTuple> |
| Surface matching | Matching terminals | Querying facts |
| More complex matching | Calling non-terminals | Calling subgoals |
| When resuming a continuation | Continue from the last input position | Make sure variables match |
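In code, the right-hand column of this table could look roughly like the following Java sketch (the names are illustrative, not those of the actual Takmelogic source):

```java
import java.util.*;

// A Datalog "call": a predicate plus whichever arguments are already bound,
// e.g. ancestor(a, ?) is ("ancestor", [Optional.of("a"), Optional.empty()])
record DatalogCall(String predicate, List<Optional<String>> args) {}

// Where we are inside a rule body: which rule of the predicate, which subgoal
record DatalogCodePos(String predicate, int ruleIndex, int subgoalIndex) {}

// A continuation also carries the variables bound so far, e.g. {Z=b}
record DatalogContinuation(DatalogCall caller, DatalogCodePos codePos,
                           Map<String, String> bindings) {}

// The success table maps a call to full result tuples rather than input positions:
// S :: Map<DatalogCall, Set<List<String>>>
```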
Datalog as implemented in Takmela doesn’t have the concept of an input position, because facts can be matched multiple times while executing a query; perhaps in some variant of the language (linear logic?) a concept of "facts used so far" would correspond to the input position.
A continuation in Datalog-via-Takmela also includes the values of variables that have been already bound; reminding us of the relationship between a continuation and a programming language’s call stack.
Let's see a trace of ancestor(b, X), which can be found here. You'll find it closely resembles our traces of Takmela's parsing, with the GSS, successes, and so on. The source code for the core Takmelogic engine can be found here; I suggest opening it side by side with Takmela's parser code and comparing them!
A Java implementation of the first iteration (lol) of Takmelogic can be found within the same project as the Takmela parser in the takmelogic package.
More traces can be found in the Trace TOC page.

Sometimes while executing a Datalog query one needs to wait until a certain subquery has been completely executed. Consider for example a special operation count that works like this:
num_ancestors(X, N):- N:= count ancestor(Y, X).
The predicate num_ancestors will attempt to find all tuples where Y is an ancestor of X and return the number thereof. It is important to schedule the execution such that a call to num_ancestors(d, N) waits until all possible results of ancestor(Y, d) have been computed. This also matters for negation: if we want to test that a certain predicate is not true, we need to make sure no future inference of that predicate will invalidate our conclusion.
A plain Takmelogic implementation doesn’t have that guarantee but we can add it. There's a technique that's compatible with the GSS approach we're using, but perhaps that's a topic for another article.
Earley’s algorithm is the ancestor of many general parsers. It uses a GSS implicitly.
Earley's parsing technique has been applied to Prolog/Datalog queries, first presented in the 1983 paper "Parsing as Deduction" and improved upon by many other researchers, so I'm definitely not the first person to consider a link between general parsing and Datalog!
Takmela is closely related to the parsing technique presented in Mark Johnson's "Squibs and Discussions: Memoization in Top-Down Parsing" (link). The main difference is that Johnson’s technique uses Scheme’s built-in continuations, while Takmela represents continuations as an explicit data structure. This opens the door to implementing the parser in typical programming languages that don’t support continuations, and direct access to the GSS makes it possible to add features like visualizing a running parser, or optimizations that e.g. compress or prune the graph-structured stack (if I understood correctly, and I may have not, Marpa seems to make great use of such optimizations).
Takmela was directly inspired by GLL and started as an attempt to simplify it in order to understand it (I needed a parser for my other projects and wasn’t satisfied with existing tools). In terms of generality and worst-case time complexity Takmela can’t beat GLL, which represents the state of the art in parsing right now. What Takmela would add is simplicity of understanding and implementation, which could make it easier for more people to make use of general parsing or add improvements.
The bad news: I haven’t written a correctness proof for Takmela nor worked out the algorithm’s time complexity. I’ve tried, but it seems I lack the academic experience needed to do so. I do have some rough ideas on how to start but that's it.
Before you dismiss me outright, please consider that it’s okay to ask for help, right? I want to release this work to the world, but I fully acknowledge my lack of the knowledge needed to finish it. I’m not writing this article to celebrate some accomplishment but for the sake of outreach. If you’re a person who finds this research direction interesting and who actually knows what they’re doing, you’re so welcome to contact me about mentoring, guidance, supervision, or other forms of collaboration. My email is samy2004@gmail.com
What would you gain out of this? I suggest the possibility of discovering something that makes the field a slightly more interesting place: this article is about a ~500-line general parser and a similarly sized Datalog interpreter, both of which can be explained to someone else in about an hour. The only things missing are the correctness and complexity proofs; with those, both general parsing and Datalog could become much more ubiquitously used. We also have the potential of the GSS as a new computer science workhorse, through introducing more variants of Takmela’s algorithm for many other uses. Thank you for giving someone the benefit of the doubt in the name of pursuing ideas!