ndrean/z-html
Building a Zig wrapper for lexbor
lexbor
in Zig
projectWARNING Work in progress
zhtml
is a wrapper of the C
library lexbor, an HTML parser/DOM emulator.
This is useful for web scraping, email sanitization, test engine for integrated tests, SSR post-processing of fragments.
The primitives exposed here stay as close as possible to JavaScript
semantics.
Features:
This project exposes a significant / essential subset of all available lexbor
functions:
<template>
elements.lexbor
DOM memory management: Document Ownership and zero-copy functionsIn lexbor
, nodes belong to documents, and the document acts as the memory manager.
When a node is attached to a document (either directly or through a fragment that gets appended), the document owns it.
Every time you create a document, you need to call destroyDocument()
: it automatically destroys ALL nodes that belong to it.
When a node is NOT attached to any document, you must manually destroy it.
Some functions borrow memory from lexbor
for zero-copy operations: their result is consumed immediately.
We opted for the following convention: add _zc
(for zero_copy) to the non allocated version of a function. For example, you can get the qualifiedName of an HTMLElement with the allocated version qualifiedName(allocator, node)
or by mapping to lexbor
memory with qualifiedName_zc(node)
. The non-allocated must be consumed immediately whilst the allocated result can outlive the calling function.
You have several methods available.
The parseString
creates a <head>
and a <body>
element and replaces BODY innerContent with the nodes created by the parsing of the given string.
const z = @import("zhtml");
const doc: *HTMLDocument = try z.createDocument();
defer z.destroyDocument(doc);
try z.parseString(doc, "<div></div>");
const body: *DomNode = z.bodyNode(doc).?;
const p: *HTMLElement = try z.createElement(doc, "p");
z.appendChild(body, z.elementToNode(p));
Your document now contains this HTML:
<head></head>
<body>
<div></div>
<p></p>
</body>
You have a shortcut to directly create and parse an HTML string with createDocFromString
.
const doc: *HTMLDocument = try z.createDocFromString("<div></div><p></p>");
defer z.destroyDocument(doc);
You receive chunks and build a document.
const z = @import("zhtml");
const print = std.debug.print;
fn demoStreamParser(allocator: std.mem.Allocator) !void {
var streamer = try z.Stream.init(allocator);
defer streamer.deinit();
try streamer.beginParsing();
const streams = [_][]const u8{
"<!DOCTYPE html><html><head><title>Large",
" Document</title></head><body>",
"<table id=\"producttable\">",
"<caption>Company data</caption><thead>",
"<tr><th scope=\"col\">",
"Code</th><th>Product_Name</th>",
"</tr></thead><tbody>",
};
for (streams) |chunk| {
print("chunk: {s}\n", .{chunk});
try streamer.processChunk(chunk);
}
for (0..2) |i| {
const li = try std.fmt.allocPrint(
allocator,
"<tr id={}><th >Code: {}</th><td>Name: {}</td></tr>",
.{ i, i, i },
);
defer allocator.free(li);
print("chunk: {s}\n", .{li});
try streamer.processChunk(li);
}
const end_chunk = "</tbody></table></body></html>";
print("chunk: {s}\n", .{end_chunk});
try streamer.processChunk(end_chunk);
try streamer.endParsing();
const html_doc = streamer.getDocument();
defer z.destroyDocument(html_doc);
const html_node = z.documentRoot(html_doc).?;
print("\n\n", .{});
try z.prettyPrint(html_node);
print("\n", .{});
try z.printDocStruct(html_doc);
}
You get the output:
chunk: <!DOCTYPE html><html><head><title>Large
chunk: Document</title></head><body>
chunk: <table id="producttable">
chunk: <caption>Company data</caption><thead>
chunk: <tr><th scope="col">Items</th><th>
chunk: Code</th><th>Product_Name</th>
chunk: </tr></thead><tbody>
chunk: <tr id=0><th >Code: 0</th><td>Name: 0</td></tr>
chunk: <tr id=1><th >Code: 1</th><td>Name: 1</td></tr>
chunk: </tbody></table></body></html>;
You can use the "parser engine" with insertFragment
.
fn demoParser(allocator: std.mem.Allocator) !void {
const doc = try z.createDocFromString("<div><ul></ul></div>");
defer z.destroyDocument(doc);
const body = z.bodyNode(doc).?;
const ul_elt = z.getElementByTag(body, .ul).?;
const ul = z.elementToNode(ul_elt);
var parser = try z.FragmentParser.init(allocator);
defer parser.deinit();
for (0..3) |i| {
const li = try std.fmt.allocPrint(
allocator,
"<li id='item-{}'>Item {}</li>",
.{ i, i },
);
defer allocator.free(li);
try parser.insertFragment(ul, li, .ul, false);
}
try z.prettyPrint(body);
}
You can use setInnerHTML
:
test "setInnerHTML" {
const doc = try z.createDocFromString("<div id=\"target\"></div>");
defer z.destroyDocument(doc);
const body = z.bodyNode(doc).?;
const div = z.getElementById(body, "target").?;
const new_div = try z.setInnerHTML(div, "<p class=\"new-content\">New Content</p>");
try z.prettyPrint(new_div);
}
The parseString
and setInnerHTML
provide some basic HTML escaping.
Suppose you have the following malicious HTML string:
<script>alert('XSS')</script>
<img src=\"data:text/html,<script>alert('XSS')</script>\" alt=\"escaped\">
<p id=\"1\" phx-click=\"increment\" onclick=\"alert('XSS')\">Click me</p>
<a href=\"http://example.org/results?search=<img src=x onerror=alert('hello')>\">URL Escaped</a>
When you use setInnerHTML
or parseString
, you get this first barrier with escaped code:
<script>alert('XSS')</script>
<img src=\"data:text/html,<script>alert('XSS')</script>\" alt=\"escaped\">
<p id=\"1\" phx-click=\"increment\" onclick="alert('XSS')">Click me</p>
<a href=\"http://example.org/results?search=<img src=x onerror=alert('hello')>\">URL Escaped</a>
You can go one step further with nuanced parsing that comes in different flavours: setInnerSafeHTML
, setInnerSafeHTMLStrict
, setInnerSafeHTMLPermissive
.
Let's take a bigger HTML with <svg>
, custom-elements with XSS injections coming mostly from scr
and href
attribute values.
<div style="background: url(javascript:alert('css'))">
<button disabled hidden onclick=\"alert('XSS')\" phx-click=\"increment\">Potentially dangerous</button>
<!-- a malicious comment -->
<div data-time=\"{@current}\"> The current value is: {@counter} </div>
<a href=\"http://example.org/results?search=<img src=x onerror=alert('hello')>\">URL Escaped</a>
<a href=\"javascript:alert('XSS')\">Dangerous, not escaped</a>
<img src=\"javascript:alert('XSS')\" alt=\"not escaped\">
<img src="https://example.com/image.jpg" alt="Safe image" onerror="alert('img')">
<iframe src=\"javascript:alert('XSS')\" alt=\"not escaped\"></iframe>
<a href=\"data:text/html,<script>alert('XSS')</script>\" alt=\"escaped\">Safe escaped</a>
<a href="https://example.com" class="link">Good link</a>
<img src=\"data:text/html,<script>alert('XSS')</script>\" alt=\"escaped\">
<iframe src=\"data:text/html,<script>alert('XSS')</script>\" >Escaped</iframe>
<iframe sandbox src="https://example.com" title"test iframe">Safe iframe</iframe>
<img src=\"data:image/svg+xml,<svg onload=alert('XSS')\" alt=\"escaped\"></svg>\">
<img src=\"data:image/svg+xml;base64,PHN2ZyBvbmxvYWQ9YWxlcnQoJ1hTUycpPjwvc3ZnPg==\" alt=\"potential dangerous b64\">
<a href=\"data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4=\">Potential dangerous b64</a>
<img src=\"data:text/html;base64,PHNjcmlwdD5hbGVydCgnWFNTJyk8L3NjcmlwdD4=\" alt=\"potential dangerous b64\">
<a href=\"file:///etc/passwd\">Dangerous Local file access</a>
<img src=\"file:///etc/passwd\" alt=\"dangerous local file access\">
<p>Hello<i>there</i>, all<strong>good?</strong></p>
<p>Visit this link: <a href=\"https://example.com\">example.com</a></p>
<svg viewBox=\"0 0 100 100\" onclick=\"alert('svg-xss')\">
<circle cx=\"50\" cy=\"50\" r=\"40\" fill=\"blue\"/>
<script>alert('svg-script')</script>
<foreignObject width=\"100\" height=\"100\">
<div xmlns=\"http://www.w3.org/1999/xhtml\">Evil content</div>
</foreignObject>
<animate attributeName=\"opacity\" values=\"0;1\" dur=\"2s\" onbegin=\"alert('animate')\"/>
<path d=\"M10 10 L90 90\" stroke=\"red\"/>
<text x=\"50\" y=\"50\" href=\"javascript:alert('text')\">SVG Text</text>
</svg>
<phoenix-component phx-click=\"increment\" :if=\"show_component\" onclick=\"alert('custom')\">Phoenix LiveView Component</phoenix-component>
<my-button @click=\"handleClick\" :disabled=\"isDisabled\" class=\"btn\">Custom Button</my-button>
<vue-component v-if=\"showProfile\" data-user-id=\"123\">Vue Component</vue-component>
<p> The <code>push()</code> method adds one or more elements to the end of an array<p/>
</div>
<link href=\"/shared-assets/misc/link-element-example.css\" rel=\"stylesheet\">
<script>console.log(\"hi\");</script>
<template><li id=\"{}\">Item-"\{}\"</li></li></template>
When you use setInnerSafeHTML
, you remove the main source of XSS injection, like comments, on
-listeners, unknown
<img alt="escaped">
<p id="1" phx-click="increment">Click me</p>
<a href="http://example.org/results?search=<img src=x onerror=alert('hello')>">URL Escaped</a>
You can also use insertAdjacentHTML
to insert HTML fragments.. To serialize the DOM, you can use innerHTML
or outerHTML
.
test "insertAdjacentHTML and serialize" {
const allocator = std.testing.allocator;
const doc = try z.createDocFromString(
\\<div id="container">
\\ <p id="target">Target</p>
\\</div>
;
);
defer z.destroyDocument(doc);
const body = z.bodyNode(doc).?;
const target = z.getElementById(body, "target").?;
try z.insertAdjacentHTML(
allocator,
target,
.beforeend,
"<span class=\"before end\"></span>",
false,
);
try z.insertAdjacentHTML(
allocator,
target,
.afterend,
"<span class=\"after end\">After End</span>",
false,
);
try z.insertAdjacentHTML(
allocator,
target,
.afterbegin,
"<span class=\"after begin\"></span>",
false,
);
try z.insertAdjacentHTML(
allocator,
target,
.beforebegin,
"<span class=\"before begin\"></span>",
false,
);
// serialize
const html = try z.outerHTML(allocator, z.nodeToElement(body).?);
defer allocator.free(html);
// Normalize whitespace for easy comparison
const clean_html = try z.normalizeText(allocator, html, .{});
defer allocator.free(clean_html);
const expected = "<body><div id=\"container\"><span class=\"before begin\"></span><p id=\"target\"><span class=\"after begin\"></span>Target<span class=\"before end\"></span></p><span class=\"after end\">After End</span></div></body>";
std.debug.assert(std.mem.eql(u8, expected, clean_html) == true);
try z.prettyPrint(z.elementToNode(body));
print("{s}\n", .{clean_html});
}
<body>
<div id="container">
<span class="before begin"></span>
<p id="target">
<span class="after begin"></span>Target<span class="before end"></span></p>
<span class="after end">After End</span>
</div>
</body>
From the HTML string:
<body>
<div>
<button phx-click="increment">Click me</button>
<p>Hello<i>there</i>, all <strong>good?</strong></p>
<p>Visit this link: <a href="https://example.com">example.com</a></p>
</div>
</body>
You can output a nice colourful log with prettyPrint
.
We introduced a sanitization tool
// continuation
var engine = try z.CssSelectorEngine.init(allocator);
defer engine.deinit();
// Find the second li element using nth-child
const second_li = try engine.querySelector(
body_node,
"ul > li:nth-child(2)",
);
if (second_li) |result| {
const attribute = try z.getAttribute(
allocator,
z.nodeToElement(result).?,
"data-id",
);
if (attribute) |attr| {
defer allocator.free(attr);
try testing.expectEqualStrings(attr, "2");
}
}
}
We have three types of search available, each with different behaviors and use cases:
// Search-on-demand collections (browser-like behavior)
var collection = try z.createCollectionByClassName(doc, "bold");
defer collection.deinit();
print("Found {} elements\n", .{collection.length()});
// Token-based, case-insensitive, most flexible
const css_results = try z.querySelectorAll(allocator, doc, ".bold");
defer allocator.free(css_results);
print("Found {} elements\n", .{css_results.len});
// Manual traversal with hasClass checking
var current_element = z.firstElementChild(body);
while (current_element) |element| {
if (z.hasClass(element, "bold")) {
// Found matching element
}
current_element = z.nextElementSibling(element);
}
We provide a comprehensive test that demonstrates all three search approaches:
Test: "Comprehensive search comparison: Collection vs CSS vs Attributes"
in src/modules/collection.zig
When you run zig build test
, you'll see output like:
=== Comprehensive Search Comparison ===
Testing search for class 'bold' using 3 different approaches:
1. Collection-based search (exact string matching):
Found 1 elements with exact class='bold'
2. CSS Selector search (token-based, case-insensitive):
Found 10 elements matching CSS selector '.bold'
3. Attribute-based search (manual traversal with hasClass):
Found 8 elements using hasClass('bold')
Key Differences:
class="bold"
exactlybold
, BOLD
, text bold
, etc. bold
as class token in any positionThis demonstrates why different search methods return different results for the same query.
HTML chunks are parsed as they stream into a full HTML document.
Fragment parsing handles templates and components (for template engines, component frameworks, server-side rendering).
Feature | Fragment Parsing | Chunk Parsing |
---|---|---|
Purpose | Parse incomplete HTML snippets | Parse streaming into complete HTML |
Input | Template fragments, components | Network streams, large files chunks |
Context | Respects HTML parsing rules by context | Sequential document building |
Output | Parsed fragment nodes | Complete document |
Use Cases | Templates e.g. email, web components, component-based frameworks... | HTTP responses, file processing |
Method | Context Awareness | Cross-Document | Memory Management | Use Case |
---|---|---|---|---|
printDocStruct |
Full document | Single doc | Document owns all | Complete pages |
setInnerHTML |
Parent element context | Same document | Element cleanup | Content replacement |
parseFragment |
Explicit context | Cross-document via parseFragmentInto |
Fragment result owns | Templates/components |
The fragment parsing gives you context-aware parsing - meaning <tr>
elements are parsed differently when you specify a .table
context vs. .body
context.
const result = try z.parseFragment(allocator, "<tr><td>Data</td></tr>", .table);
defer result.deinit();
try z.parseFragmentInto(allocator, target_doc, container, "<p>Fragment</p>", .body);
getElements()
and serialize()
TODO
lexbor
build automation)lexbor
source code & built static file)zhtml.zig
minimal.c
errors.zig
modules
main.zig (demo ? TODO:
examples (TODO)
lexbor
built with static linkingmake -f Makefile.lexbor
The build.zig file runs all the tests from zhtml.zig. It imports all the submodules and runs the tests.
zig build test --summary all
zig build run -Doptimize=Debug
#
zig build run -Doptimize=ReleaseFast
lexbor
exampleshttps://github.com/lexbor/lexbor/tree/master/examples/lexbor
lexbor
libraryIn the built static object liblexbor_static.a:
nm lexbor_src_2.4.0/build/liblexbor_static.a | grep -i "serialize"
In the source code:
find lexbor_src_2.4.0/source -name "*.c" | xargs grep -l "lxb_selectors_opt_set_noi"
or
grep -r -A 10 -B 5 "serialize" lexbor_src_2.4.0/source/lexbor/