smuth4/zig-pegparse
A full-featured PEG parser written in Zig
zig-pegparse is an arbitrary-lookahead parser based on parsimonious. It is able to rapidly build parsers from a human-readable grammar definition, and then use those parsers to build a abstract syntax tree.
WARNING zig-pegparse is a work in progress, currently being rewritten to use a full PEG virtual machine.
zig-pegparse requires the lastest release of zig (currently 0.15.1).
zig fetch https://github.com/smuth4/zig-pegparse/archive/<commit>.tar.gz --save-exact=zig-pegparse
const std = @import("std");
const pegparse = @import("zig_pegparse");
pub fn main() !void {
// Feel free to use whatever allocator you want
const allocator = std.heap.c_allocator;
// The factory here is a just a specialized grammar that can create more grammars
var grammar_factory = pegparse.Grammar.initFactory(allocator);
defer grammar_factory.deinit();
// Read in the PEG definition
var grammar = grammar_factory.createGrammar(
\\bold_text = bold_open text bold_close
\\bold_open = "(("
\\bold_close = "))"
);
defer grammar.deinit();
// Parse an example string
const data = "((bold stuff))";
// Note that parsing the data does not create a copy of it.
// If the data dissappears, the nodes are still technically traverseable, but references invalid data offsets.
var tree = grammar.parse(data);
defer tree.deinit();
// A tree is always returned, even if it didn't parse everything.
if (tree.root().?.value.start != data.len) {
std.debug.print("Parsing did not complete", .{});
}
}
Once a tree is built, there are many options for traversing it. The simplest option is the built-in iterators:
var iter = tree.iterator(.{});
while (iter.next()) |e| {
if (e.exp.name == "bold_open") {
std.debug.print("<b>", .{});
} else if (e.exp.name == "bold_close") {
std.debug.print("</b>", .{});
} else if (e.exp.name == "text") {
const nodeData = data[e.start..e.end];
std.debug.print("{s}", .{nodeData});
} else {
std.debug.print("Unknown node type \"{s}\"!", .{node.value.expr.*.name});
}
}
If you need more control, you may want to consider the visitor
pattern. It will allow for more complicated decision making about when
and where to descend the tree. Since zig-pegparse dogfoods it's own
grammar logic, see ExpressionVisitor
for a more complete example.
const NodeVisitor = struct {
const NodeVisitorSignature = *const fn (self: *NodeVisitor, data: []const u8, node: *const Node) void;
const visitor_table = std.static_string_map.StaticStringMap(ExpressionVisitorSignature).initComptime(.{
.{ "bold_open", visit_bold_open },
.{ "bold_close", visit_bold_close },
.{ "bold_text", visit_bold_text },
});
fn visit_generic(self: *ExpressionVisitor, data: []const u8, node: *const Node) void {
if (visitor_table.get(node.value.expr.*.name)) |func| {
func(self, data, node);
} else {
std.debug.print("Unknown node type \"{s}\"!", .{node.value.expr.*.name});
}
}
}
fn visit_bold_open(_: *ExpressionVisitor, _: []const u8, _: *const Node) void {
std.debug.print("<b>", .{});
}
fn visit_bold_close(_: *ExpressionVisitor, _: []const u8, _: *const Node) void {
std.debug.print("</b>", .{});
}
fn visit_bold_close(_: *ExpressionVisitor, data: []const u8, node: *const Node) void {
const nodeData = data[e.start..e.end];
std.debug.print("{s}", .{nodeData});
}
}
nv = NodeVisitor{};
nv.visit_generic(tree.root().?);
Example | Notes |
---|---|
"string literal" |
A simple string literal, with support for escape sequences such as \n . |
'string literal' |
Another form of string literals. Unlike double quoted literals, the only supported escape sequences are \' and \\ |
a |
An example of a reference to another rule. Recursivity in references in support by PEG grammars, but must be carefully used to avoid infinite loops. |
a b c |
A sequence, where all the items must match in that order. |
a / b / c |
A choice, where only of one the items will be matched. Items will be tested in order until one succeeds or all of the fail |
~"\s+"i |
Regex, with an optional flag at the end (i for case-insensitivity in this case). Escape sequences are passed directly to PCRE2. |
( ) |
Used to group expression to ensure a certain priority, has no effect on actual parsing |
By default, parse()
may return several types of library-specific errors, e.g. if an invalid regex is supplied. This is often insufficient for troubleshooting, so zig-pegparse uses the diagnostic pattern to provide additional context:
var diagnostic = pegparse.ParseErrorDiagnostic.init(allocator); // Create a diagnostic object
grammar_factory.diagnostic = &p; // Assign it to the grammar
grammar_data = "a = ~\"[A-Z\""; // Oh no, this invalid regex won't be parsed correctly!
if (grammar_factor.createGrammar(grammar_data)) {
// Success! Do whatever processing is needed
} else |err| {
switch (err) {
pegparse.Error.GrammarParseError => {
// Grammar error, print the information out to stderr
var stderr_buffer: [4096]u8 = undefined;
var stderr_writer = std.fs.File.stderr().writer(&stderr_buffer);
const stderr = &stderr_writer.interface;
diagnostic.dump(stderr, grammar_data);
}
else => { // Some other error e.g. OutOfMemory, no diagnostic will be available }
}
Note that if you wish to re-use this pattern, the diagnostic object should be assigned to the visitor, not the created grammar.
While a PEG parser will never beat out a dedicated state machine or the like, it should still be pretty darn fast. Parsimonious' section on optimizing grammars is very relevant to zig-pegparser's grammars as well.
zig-pegparse makes use of a packrat cache to prevent excessive backtracking. However, there are grammars that don't do much backtracking in general, leading to both increased memory usage and adverse performance impact. In these situations, you can disable the cache:
grammar.disableCache();
There is also an experimental grammar.optimize()
functional, which currently only resolves references to be direct pointers, but may be updated with additional improvements like automatic cut operators in the future.
Bryan Ford's site is an incredible resource for anyone looking at PEG/packrat parsing: https://bford.info/packrat