# squeek502/named-character-references

Purpose-built DAFSA for named character reference matching during HTML tokenization.
This is an attempt at a purpose-built DAFSA of named character references for implementing the *Named character reference state* of HTML tokenization. That is, the goal is to encode the necessary data compactly while still allowing fast matching of named character references, taking full advantage of the HTML specification's note that:
> This list [of named character references] is static and will not be expanded or changed in the future.
I've written an in-depth article about the DAFSA approach being used, plus a comparison to the approaches used by Chrome/Firefox/Safari. The current implementation is an evolution of what's described in that article.
> **Note:** The implementation in this repository has also been ported to C++ and used in the Ladybird browser (initial PR, follow-up PR).
## Usage

Currently targets the latest Zig release (0.14.1).

Add the package to your `build.zig.zon`, e.g.:

```
zig fetch --save git+https://github.com/squeek502/named-character-references.git
```

In your `build.zig`:
```zig
const named_character_references = b.dependency("named_character_references", .{
    .target = target,
    .optimize = optimize,
});
const named_character_references_mod = named_character_references.module("named_character_references");

// Add the module as an import wherever it's needed
your_compile_step.root_module.addImport("named_character_references", named_character_references_mod);
```
An example of how to use the API to implement the named character reference state can be found here (it's C++, but it uses the same `Matcher` API that the Zig implementation exposes).

TODO: Add a proper reference implementation of the named character reference tokenization state to this repository.
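Until a reference implementation lands in this repository, the sketch below shows the general shape of driving the matcher. This is illustrative only: `Matcher.char` and `Matcher.matched` are assumed method names rather than the package's documented API (see `named_character_references.zig` for the real `Matcher` interface), and spec details such as overconsumption handling and the special rules inside attributes are elided.

```zig
const ncr = @import("named_character_references");

/// Sketch: returns the length of the longest named character reference at
/// the start of `input` (the text just after a consumed '&'), or null if
/// nothing matches. `Matcher.char`/`Matcher.matched` are assumed names for
/// "try to advance by one character" / "the characters consumed so far form
/// a complete reference"; check named_character_references.zig for the real API.
fn longestReferenceLength(input: []const u8) ?usize {
    var matcher = ncr.Matcher{};
    var matched_len: ?usize = null;
    for (input, 0..) |c, i| {
        if (!matcher.char(c)) break; // no reference continues with `c`
        if (matcher.matched()) matched_len = i + 1; // longest match so far
    }
    // A real tokenizer would emit the mapped codepoint(s) for the longest
    // match and re-process any characters consumed beyond it.
    return matched_len;
}
```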
## How it works

I'll skip over describing a typical DAFSA (see the article for a thorough explanation) and only talk about the modifications that were made. The TL;DR is that lookup tables were added to make the search for the first and second character O(1) instead of O(n).
The 'first layer' of nodes (those corresponding to the first character of a reference) has been extracted out of the DAFSA and converted into a lookup table, which allows an O(1) search for the first character instead of an O(n) search. To continue to allow minimal perfect hashing, the `number` field of each lookup table entry contains the total of what the normal `number` field would be across all of the sibling nodes before it. That is, if we have `a`, `b`, and `c` (in that order) where the count of all possible valid words from each node is 3, 2, and 1 (respectively), then in the lookup table the element corresponding to `a` would have the `number` 0, `b` would get the number 3, and `c` would get the number 5 (since the nodes before it have the numbers 3 and 2, and 3 + 2 = 5). This allows the O(1) lookup to emulate the "linear scan over children while incrementally adding up `number` fields" approach to minimal perfect hashing using a DAFSA, without the need for the linear scan. In other words, the linear scan is emulated at construction time and the result is stored in the lookup entry corresponding to the node that would get that result.
The 'second layer' of nodes has also been extracted out into a lookup table. Each first-layer entry stores a bit mask of its possible second characters, and the index within the second layer is derived from a `@popCount` of that mask. This allows for storing the minimum number of elements in the second layer lookup table, since gaps between set bits are disregarded by `@popCount`. For example, if a mask looks like `0b1010`, then we can store a lookup table with 2 elements, and our `@popCount` incantation will only ever be able to return either a 0 or a 1 for the index to use. The result of the `@popCount` is added to an offset stored in the first-layer entry to get the final index of the second layer node.
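A sketch of that index calculation, assuming a 4-bit mask (the real mask covers the full alphabet of possible second characters, and the real entries pack the mask and offset differently):

```zig
const std = @import("std");

// Sketch: a 4-bit mask standing in for a first-layer entry's bit mask of
// possible second characters. Bits 1 and 3 are set, so only two
// second-layer elements need to be stored for this node.
const mask: u4 = 0b1010;
const offset: usize = 0; // where this node's second-layer elements start

/// Returns the second-layer index for the character mapped to `bit`,
/// or null if that character can't follow this first character.
fn secondLayerIndex(bit: u2) ?usize {
    const bit_value = @as(u4, 1) << bit;
    if (mask & bit_value == 0) return null;
    // Count only the set bits below `bit`; gaps between set bits are
    // disregarded, so the result is always a dense index (0 or 1 here).
    return offset + @popCount(mask & (bit_value - 1));
}

test "popCount produces dense second-layer indexes" {
    try std.testing.expectEqual(@as(?usize, null), secondLayerIndex(0));
    try std.testing.expectEqual(@as(?usize, 0), secondLayerIndex(1));
    try std.testing.expectEqual(@as(?usize, null), secondLayerIndex(2));
    try std.testing.expectEqual(@as(?usize, 1), secondLayerIndex(3));
}
```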
Within the DAFSA itself, the remaining nodes' `number` field is also cumulative, in the same way that the first/second layer store a cumulative `number` field. This cuts down slightly on the amount of work done during the search of a list of children, and we can get away with it because the cumulative `number` fields of the remaining nodes in the DAFSA (after the first- and second-layer nodes were extracted out) happen to require few enough bits that we can store the cumulative version while staying under our 32-bit budget.

Overall, these changes give the DAFSA implementation about a 2x speedup in the 'raw matching speed' benchmark I'm using.
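To illustrate the cumulative-`number` point above, here is the child scan with per-node counts next to the same scan with cumulative `number` fields (made-up data, not the repository's actual node encoding):

```zig
const std = @import("std");

// With per-node word counts, the child scan must accumulate as it goes:
fn findNumberPerNode(char: u8, chars: []const u8, counts: []const u16) ?u16 {
    var number: u16 = 0;
    for (chars, counts) |c, count| {
        if (c == char) return number;
        number += count; // extra work on every iteration
    }
    return null;
}

// With a cumulative `number` stored in each node, the accumulation was
// already done when the DAFSA was constructed:
fn findNumberCumulative(char: u8, chars: []const u8, numbers: []const u16) ?u16 {
    for (chars, numbers) |c, number| {
        if (c == char) return number; // just read the precomputed value
    }
    return null;
}

test "both scans produce the same minimal perfect hash contribution" {
    const chars = [_]u8{ 'a', 'b', 'c' };
    const counts = [_]u16{ 3, 2, 1 }; // per-node: words reachable through each child
    const cumulative = [_]u16{ 0, 3, 5 }; // cumulative: sum of earlier siblings' counts
    for (chars) |c| {
        try std.testing.expectEqual(
            findNumberPerNode(c, &chars, &counts),
            findNumberCumulative(c, &chars, &cumulative),
        );
    }
}
```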
The codepoint transformation values are stored as a bit-packed array of `u21` integers, which allows the storage of 2,231 character reference -> codepoint(s) transformations in 5,857 bytes (2,231 × 21 bits, rounded up to whole bytes).

This means that the full named character reference data is stored in 15,170 + 5,857 = 21,027 bytes, or 20.53 KiB. This is actually 318 fewer bytes than the traditional DAFSA implementation would use: 3,872 4-byte nodes in a single array is 15,488 bytes, which, together with the same 5,857 bytes of value data, comes to 21,345 bytes.
> **Note:** It's also possible to remove the semicolon nodes from the DAFSA (inspired by a change that Safari made). This would save 796 bytes (199 4-byte nodes removed from the DAFSA), but it has a performance cost, so I didn't feel it was worth it overall. If you're curious, see this commit for how that change could be implemented.
## Generating the DAFSA data

The DAFSA data is generated ahead of time, combined with the rest of the relevant code, and the combination is checked into the repository (see `named_character_references.zig`). This is done for two reasons (but I'm not claiming they are good reasons):

- `comptime` is currently too slow to handle generating the DAFSA data at compile-time

Note: Regenerating the data is only necessary if the encoding of the DAFSA is changed.
Generating requires `entities.json`, which can be downloaded from the WHATWG HTML spec (https://html.spec.whatwg.org/entities.json).

```
zig build generate
./zig-out/bin/generate > generated.zig
```

The generated Zig code, containing all the generated arrays, is written to stdout.
## Running the tests

Note: This is not a full test of the Named character reference state tokenization step; instead, it's a somewhat crude approximation in order to run the `namedEntities.test` test cases.

The main test needs `namedEntities.test` from html5lib-tests. That test is skipped if the file is not found.

```
zig build test
```
## Further reading

- Encoding the DAFSA nodes:
- Minimal perfect hashing:
- Constructing and minimizing the trie when generating the DAFSA: