dbtreasure/zig-bpe
Byte Pair Encoding (BPE) in the Zig programming language (0.13.0)
This project implements a basic tokenizer in Zig, focusing on text processing and Byte Pair Encoding (BPE) concepts.
Byte Pair Encoding (BPE) is a data compression algorithm originally described by Philip Gage in 1994 [1]. It's also known as digram coding [2]. BPE works by iteratively replacing the most frequent pair of adjacent bytes in a sequence with a single byte that does not occur in the data. This process continues until no byte pair occurs more than once or no unused byte value remains.
In recent years, BPE has gained popularity in natural language processing, particularly for tokenization in large language models. The modified version used in NLP starts with a vocabulary of individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token, repeating until the vocabulary reaches a chosen size. The resulting vocabulary can represent rare text as single characters and common text as longer units, up to entire words, which makes it well suited to processing text data [3].
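The merge step described above can be sketched in a few lines of Zig. This is a minimal illustration, not this project's code: the names `Pair`, `mostFrequentPair`, and `merge` are hypothetical, token ids are assumed to be `u32` values, and the tie-breaking rule (prefer the smaller pair) is an arbitrary choice made to keep the result deterministic.

```zig
const std = @import("std");

/// Hypothetical type for this sketch: a pair of adjacent token ids.
const Pair = struct { a: u32, b: u32 };

/// Orders pairs by (a, b) so that ties in frequency are broken deterministically.
fn pairLess(x: Pair, y: Pair) bool {
    return x.a < y.a or (x.a == y.a and x.b < y.b);
}

/// Returns the most frequent adjacent pair in `ids`, or null if there is no pair.
fn mostFrequentPair(allocator: std.mem.Allocator, ids: []const u32) !?Pair {
    if (ids.len < 2) return null;

    var counts = std.AutoHashMap(Pair, usize).init(allocator);
    defer counts.deinit();

    // Walk every adjacent (left, right) pair and count it.
    for (ids[0 .. ids.len - 1], ids[1..]) |a, b| {
        const entry = try counts.getOrPut(.{ .a = a, .b = b });
        if (!entry.found_existing) entry.value_ptr.* = 0;
        entry.value_ptr.* += 1;
    }

    var best: ?Pair = null;
    var best_count: usize = 0;
    var it = counts.iterator();
    while (it.next()) |e| {
        const p = e.key_ptr.*;
        const c = e.value_ptr.*;
        if (best == null or c > best_count or (c == best_count and pairLess(p, best.?))) {
            best = p;
            best_count = c;
        }
    }
    return best;
}

/// Replaces each non-overlapping occurrence of `pair` (scanning left to right)
/// with `new_id`. The caller owns the returned slice.
fn merge(allocator: std.mem.Allocator, ids: []const u32, pair: Pair, new_id: u32) ![]u32 {
    var out = std.ArrayList(u32).init(allocator);
    errdefer out.deinit();

    var i: usize = 0;
    while (i < ids.len) {
        if (i + 1 < ids.len and ids[i] == pair.a and ids[i + 1] == pair.b) {
            try out.append(new_id);
            i += 2;
        } else {
            try out.append(ids[i]);
            i += 1;
        }
    }
    return out.toOwnedSlice();
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    // The bytes of "aaabdaaabac"; ids 0..255 are raw bytes, so 256 is the first new token.
    const ids = [_]u32{ 97, 97, 97, 98, 100, 97, 97, 97, 98, 97, 99 };

    const best = (try mostFrequentPair(allocator, &ids)) orelse return;
    const merged = try merge(allocator, &ids, best, 256);
    defer allocator.free(merged);

    std.debug.print("merged ({d}, {d}) into 256: {any}\n", .{ best.a, best.b, merged });
}
```

For the input "aaabdaaabac", the most frequent pair is (97, 97), that is "aa", and one merge shortens the sequence from 11 ids to 9. Training repeats this step, assigning a fresh id each time, until the target vocabulary size is reached.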
BPE is notable for its simplicity and effectiveness in compressing data. It's especially useful in scenarios where the input data contains repeated sequences of bytes. The algorithm's ability to adapt to the specific patterns in the input data makes it a versatile choice for various compression tasks.
This project is developed using Zig version 0.13.0. Make sure you have this version installed to build and run the project successfully.
src/main.zig: Main entry point and example usage
src/basic_tokenizer.zig: Core implementation of the BasicTokenizer
src/utils/read_file.zig: Utility function for reading files
src/utils/time_statistics.zig: Performance measurement utilities

To run the main program:
zig run src/main.zig
The current implementation works with UTF-8 encoded text but may not handle all edge cases of UTF-8 encoding. Future updates may focus on improving Unicode support and handling more complex scenarios.
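One edge case worth keeping in mind is that a byte-level tokenizer can cut a multi-byte character in half. The sketch below is not taken from this project; it only uses `std.unicode.utf8ValidateSlice` and `std.unicode.utf8CountCodepoints` from the Zig standard library to show the difference between byte length and code point count, and what a truncated sequence looks like.

```zig
const std = @import("std");

pub fn main() !void {
    // "héllo" written with explicit bytes: 'é' is the two-byte sequence C3 A9.
    const text = "h\xc3\xa9llo";

    // Well-formed input: 6 bytes but only 5 code points.
    std.debug.print("valid={} bytes={d} codepoints={d}\n", .{
        std.unicode.utf8ValidateSlice(text),
        text.len,
        try std.unicode.utf8CountCodepoints(text),
    });

    // Slicing between C3 and A9 leaves a truncated sequence, which is not valid UTF-8.
    const truncated = text[0..2];
    std.debug.print("truncated valid={}\n", .{std.unicode.utf8ValidateSlice(truncated)});
}
```

A tokenizer that works on raw bytes can still round-trip such text, as long as decoding concatenates the bytes of all tokens before interpreting the result as UTF-8.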
The basic_tokenizer.zig file provides the core implementation of the tokenizer. Key features include:
BasicTokenizer struct, encapsulating all tokenization-related functionality.
train method to learn from input text.

The tokenizer supports serializing and deserializing merge operations, allowing you to save and load the trained model.
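To illustrate what saving and loading merge operations can look like, here is a minimal round-trip sketch. It does not reflect the project's actual file format or API: the `Merge` struct, the `saveMerges` and `loadMerges` functions, and the one-pair-per-line text format are all assumptions made for this example.

```zig
const std = @import("std");

/// Hypothetical merge rule: the pair (a, b) is replaced by a new token id.
/// In this sketch the new id is implied by position: rule i produces id 256 + i.
const Merge = struct { a: u32, b: u32 };

/// Writes one "a b" line per merge rule, in training order.
fn saveMerges(writer: anytype, merges: []const Merge) !void {
    for (merges) |m| {
        try writer.print("{d} {d}\n", .{ m.a, m.b });
    }
}

/// Parses text produced by saveMerges. The caller owns the returned slice.
fn loadMerges(allocator: std.mem.Allocator, text: []const u8) ![]Merge {
    var list = std.ArrayList(Merge).init(allocator);
    errdefer list.deinit();

    var lines = std.mem.tokenizeScalar(u8, text, '\n');
    while (lines.next()) |line| {
        var fields = std.mem.tokenizeScalar(u8, line, ' ');
        const a_text = fields.next() orelse return error.InvalidMergeLine;
        const b_text = fields.next() orelse return error.InvalidMergeLine;
        try list.append(.{
            .a = try std.fmt.parseInt(u32, a_text, 10),
            .b = try std.fmt.parseInt(u32, b_text, 10),
        });
    }
    return list.toOwnedSlice();
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const merges = [_]Merge{ .{ .a = 97, .b = 97 }, .{ .a = 256, .b = 98 } };

    // Serialize into an in-memory buffer; a file writer would work the same way.
    var buffer = std.ArrayList(u8).init(allocator);
    defer buffer.deinit();
    try saveMerges(buffer.writer(), &merges);

    const loaded = try loadMerges(allocator, buffer.items);
    defer allocator.free(loaded);
    std.debug.print("round-tripped {d} merges\n", .{loaded.len});
}
```

Storing the rules in training order matters because encoding must apply merges in the same order in which they were learned.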
The project includes several unit tests to ensure the correct functionality of the tokenizer. These tests cover various aspects of the BasicTokenizer implementation.
To run all the tests for the project, use the following command in the root directory of the project:
zig test src/basic_tokenizer.zig
This command compiles and runs all the tests defined in the basic_tokenizer.zig file.
To run tests for another file, replace src/basic_tokenizer.zig in the command above with the path to that file.
When you run the tests, Zig will compile the code and execute each test function. The output will show which tests passed or failed, along with any debug information printed during the tests.
If all tests pass, you'll see a message indicating success. If any tests fail, Zig will provide detailed information about the failure, including the line number and the nature of the assertion that failed.
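For readers new to Zig, a test is a `test` block that lives next to the code it checks and reports failures through `std.testing`. The example below is a generic sketch rather than one of this project's tests; `countPair` is a hypothetical helper defined only for this illustration.

```zig
const std = @import("std");

/// Hypothetical helper for this illustration: counts how often the adjacent
/// pair (a, b) occurs in `ids`, including overlapping occurrences.
fn countPair(ids: []const u32, a: u32, b: u32) usize {
    if (ids.len < 2) return 0;
    var n: usize = 0;
    for (ids[0 .. ids.len - 1], ids[1..]) |x, y| {
        if (x == a and y == b) n += 1;
    }
    return n;
}

test "countPair counts overlapping occurrences" {
    const ids = [_]u32{ 1, 1, 1, 2 };
    // Adjacent pairs are (1,1), (1,1), (1,2).
    try std.testing.expectEqual(@as(usize, 2), countPair(&ids, 1, 1));
    try std.testing.expectEqual(@as(usize, 1), countPair(&ids, 1, 2));
}
```

Saved to a file, this runs with `zig test` followed by the file's path, in the same way as the project's own tests.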