Skip to content

Latest commit

 

History

History
1567 lines (1241 loc) · 76.5 KB

README.md

File metadata and controls

1567 lines (1241 loc) · 76.5 KB

StringZilla 🦖

StringZilla banner

The world wastes a minimum of $100M annually due to inefficient string operations. A typical codebase processes strings character by character, resulting in too many branches and data-dependencies, neglecting 90% of modern CPU's potential. LibC is different. It attempts to leverage SIMD instructions to boost some operations, and is often used by higher-level languages, runtimes, and databases. But it isn't perfect. 1️⃣ First, even on common hardware, including over a billion 64-bit ARM CPUs, common functions like strstr and memmem only achieve 1/3 of the CPU's throughput. 2️⃣ Second, SIMD coverage is inconsistent: acceleration in forward scans does not guarantee speed in the reverse-order search. 3️⃣ At last, most high-level languages can't always use LibC, as the strings are often not NULL-terminated or may contain the Unicode "Zero" character in the middle of the string. That's why StringZilla was created. To provide predictably high performance, portable to any modern platform, operating system, and programming language.

StringZilla Python installs StringZilla Rust installs GitHub Actions Workflow Status GitHub Actions Workflow Status GitHub Actions Workflow Status GitHub Actions Workflow Status StringZilla code size

StringZilla is the GodZilla of string libraries, using SIMD and SWAR to accelerate string operations on modern CPUs. It is up to 10x faster than the default and even other SIMD-accelerated string libraries in C, C++, Python, and other languages, while covering broad functionality. It accelerates exact and fuzzy string matching, edit distance computations, sorting, lazily-evaluated ranges to avoid memory allocations, and even random-string generators.

  • 🐂 C : Upgrade LibC's <string.h> to <stringzilla.h> in C 99
  • 🐉 C++: Upgrade STL's <string> to <stringzilla.hpp> in C++ 11
  • 🐍 Python: Upgrade your str to faster Str
  • 🍎 Swift: Use the String+StringZilla extension
  • 🦀 Rust: Use the StringZilla traits crate
  • 🐚 Shell: Accelerate common CLI tools with sz_ prefix
  • 📚 Researcher? Jump to Algorithms & Design Decisions
  • 💡 Thinking to contribute? Look for "good first issues"
  • 🤝 And check the guide to setup the environment
  • Want more bindings or features? Let me know!

Who is this for?

  • For data-engineers parsing large datasets, like the CommonCrawl, RedPajama, or LAION.
  • For software engineers optimizing strings in their apps and services.
  • For bioinformaticians and search engineers looking for edit-distances for USearch.
  • For DBMS devs, optimizing LIKE, ORDER BY, and GROUP BY operations.
  • For hardware designers, needing a SWAR baseline for strings-processing functionality.
  • For students studying SIMD/SWAR applications to non-data-parallel operations.

Performance

C C++ Python StringZilla
find the first occurrence of a random word from text, ≅ 5 bytes long
strstr 1
x86: 7.4 · arm: 2.0 GB/s
.find
x86: 2.9 · arm: 1.6 GB/s
.find
x86: 1.1 · arm: 0.6 GB/s
sz_find
x86: 10.6 · arm: 7.1 GB/s
find the last occurrence of a random word from text, ≅ 5 bytes long
.rfind
x86: 0.5 · arm: 0.4 GB/s
.rfind
x86: 0.9 · arm: 0.5 GB/s
sz_rfind
x86: 10.8 · arm: 6.7 GB/s
split lines separated by \n or \r 2
strcspn 1
x86: 5.42 · arm: 2.19 GB/s
.find_first_of
x86: 0.59 · arm: 0.46 GB/s
re.finditer
x86: 0.06 · arm: 0.02 GB/s
sz_find_charset
x86: 4.08 · arm: 3.22 GB/s
find the last occurrence of any of 6 whitespaces 2
.find_last_of
x86: 0.25 · arm: 0.25 GB/s
sz_rfind_charset
x86: 0.43 · arm: 0.23 GB/s
Random string from a given alphabet, 20 bytes long 5
rand() % n
x86: 18.0 · arm: 9.4 MB/s
std::uniform_int_distribution
x86: 47.2 · arm: 20.4 MB/s
join(random.choices(...))
x86: 13.3 · arm: 5.9 MB/s
sz_generate
x86: 56.2 · arm: 25.8 MB/s
Mapping Characters with Look-Up Table Transforms
std::transform
x86: 3.81 · arm: 2.65 GB/s
str.translate
x86: 260.0 · arm: 140.0 MB/s
sz_look_up_transform
x86: 21.2 · arm: 8.5 GB/s
Get sorted order, ≅ 8 million English words 6
qsort_r
x86: 3.55 · arm: 5.77 s
std::sort
x86: 2.79 · arm: 4.02 s
numpy.argsort
x86: 7.58 · arm: 13.00 s
sz_sort
x86: 1.91 · arm: 2.37 s
Levenshtein edit distance, ≅ 5 bytes long
via jellyfish 3
x86: 1,550 · arm: 2,220 ns
sz_edit_distance
x86: 99 · arm: 180 ns
Needleman-Wunsch alignment scores, ≅ 10 K aminoacids long
via biopython 4
x86: 257 · arm: 367 ms
sz_alignment_score
x86: 73 · arm: 177 ms

StringZilla has a lot of functionality, most of which is covered by benchmarks across C, C++, Python and other languages. You can find those in the ./scripts directory, with usage notes listed in the CONTRIBUTING.md file. Notably, if the CPU supports misaligned loads, even the 64-bit SWAR backends are faster than either standard library.

Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 6 characters. The code was compiled with GCC 12, using glibc v2.35. The benchmarks performed on Arm-based Graviton3 AWS c7g instances and r7iz Intel Sapphire Rapids. Most modern Arm-based 64-bit CPUs will have similar relative speedups. Variance withing x86 CPUs will be larger. 1 Unlike other libraries, LibC requires strings to be NULL-terminated. 2 Six whitespaces in the ASCII set are: \t\n\v\f\r. Python's and other standard libraries have specialized functions for those. 3 Most Python libraries for strings are also implemented in C. 4 Unlike the rest of BioPython, the alignment score computation is implemented in C. 5 All modulo operations were conducted with uint8_t to allow compilers more optimization opportunities. The C++ STL and StringZilla benchmarks used a 64-bit Mersenne Twister as the generator. For C, C++, and StringZilla, an in-place update of the string was used. In Python every string had to be allocated as a new object, which makes it less fair. 6 Contrary to the popular opinion, Python's default sorted function works faster than the C and C++ standard libraries. That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation. The latter is very common in database engines and is most similar to numpy.argsort. Current StringZilla solution can be at least 4x faster without loss of generality.

Functionality

StringZilla is compatible with most modern CPUs, and provides a broad range of functionality.

  • works on both Little-Endian and Big-Endian architectures.
  • works on 32-bit and 64-bit hardware architectures.
  • compatible with ASCII and UTF-8 encoding.

Not all features are available across all bindings. Consider contributing, if you need a feature that's not yet implemented.

Maturity C 99 C++ 11 Python Swift Rust
Substring Search 🌳
Character Set Search 🌳
Edit Distances 🧐
Small String Class 🧐
Sorting & Sequence Operations 🚧
Lazy Ranges, Compressed Arrays 🧐
Hashes & Fingerprints 🚧

🌳 parts are used in production. 🧐 parts are in beta. 🚧 parts are under active development, and are likely to break in subsequent releases. ✅ are implemented. ⚪ are considered. ❌ are not intended.

Quick Start: Python 🐍

Python bindings are available on PyPI, and can be installed with pip. You can immediately check the installed version and the used hardware capabilities with following commands:

pip install stringzilla
python -c "import stringzilla; print(stringzilla.__version__)"
python -c "import stringzilla; print(stringzilla.__capabilities__)"

Basic Usage

If you've ever used the Python str, bytes, bytearray, memoryview class, you'll know what to expect. StringZilla's Str class is a hybrid of those two, providing str-like interface to byte-arrays.

from stringzilla import Str, File

text_from_str = Str('some-string') # no copies, just a view
text_from_bytes = Str(b'some-array') # no copies, just a view
text_from_file = Str(File('some-file.txt')) # memory-mapped file

import numpy as np
alphabet_array = np.arange(ord("a"), ord("z"), dtype=np.uint8)
text_from_array = Str(memoryview(alphabet_array))

The File class memory-maps a file from persistent memory without loading its copy into RAM. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously. A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.

Basic Operations

  • Length: len(text) -> int
  • Indexing: text[42] -> str
  • Slicing: text[42:46] -> Str
  • Substring check: 'substring' in text -> bool
  • Hashing: hash(text) -> int
  • String conversion: str(text) -> str

Advanced Operations

import sys

x: bool = text.contains('substring', start=0, end=sys.maxsize)
x: int = text.find('substring', start=0, end=sys.maxsize)
x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
x: str = text.decode(encoding='utf-8', errors='strict')
x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)

It's important to note, that the last function behavior is slightly different from Python's str.splitlines. The native version matches \n, \r, \v or \x0b, \f or \x0c, \x1c, \x1d, \x1e, \x85, \r\n, \u2028, \u2029, including 3x two-bytes-long runes. The StringZilla version matches only \n, \v, \f, \r, \x1c, \x1d, \x1e, \x85, avoiding two-byte-long runes.

Character Set Operations

Python strings don't natively support character set operations. This forces people to use regular expressions, which are slow and hard to read. To avoid the need for re.finditer, StringZilla provides the following interfaces:

x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
x: Strs = text.split_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
x: Strs = text.rsplit_charset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)

You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set. This would result in a copy - str for str inputs and bytes for other types.

x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)

For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary. This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing. Here is an example:

import stringzilla as sz
look_up_table = bytes(range(256)) # Identity LUT
image = open("/image/path.jpeg", "rb").read()
sz.translate(image, look_up_table, inplace=True)

Collection-Level Operations

Once split into a Strs object, you can sort, shuffle, and reorganize the slices, with minimum memory footprint. If all the chunks are located in consecutive memory regions, the memory overhead can be as low as 4 bytes per chunk.

lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
lines.shuffle(seed=42) # or shuffle all lines in place and shard with slices
# WIP: lines.sort() # explodes to 16 bytes per line overhead for any length text
# WIP: sorted_order: tuple = lines.argsort() # similar to `numpy.argsort`

Working on RedPajama, addressing 20 Billion annotated english documents, one will need only 160 GB of RAM instead of Terabytes. Once loaded, the data will be memory-mapped, and can be reused between multiple Python processes without copies. And of course, you can use slices to navigate the dataset and shard it between multiple workers.

lines[::3] # every third line
lines[1::1] # every odd line
lines[:-100:-1] # last 100 lines in reverse order

Iterators and Memory Efficiency

Python's operations like split() and readlines() immediately materialize a list of copied parts. This can be very memory-inefficient for large datasets. StringZilla saves a lot of memory by viewing existing memory regions as substrings, but even more memory can be saved by using lazily evaluated iterators.

x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
x: SplitIterator[Str] = text.split_charset_iter(separator='chars', keepseparator=False)
x: SplitIterator[Str] = text.rsplit_charset_iter(separator='chars', keepseparator=False)

StringZilla can easily be 10x more memory efficient than native Python classes for tokenization. With lazy operations, it practically becomes free.

import stringzilla as sz
%load_ext memory_profiler

text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
%memit text.split() # increment: 8670.12 MiB (152 ms)
%memit sz.split(text) # increment: 530.75 MiB (25 ms)
%memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB

Low-Level Python API

Aside from calling the methods on the Str and Strs classes, you can also call the global functions directly on str and bytes instances. Assuming StringZilla CPython bindings are implemented without any intermediate tools like SWIG or PyBind, the call latency should be similar to native classes.

import stringzilla as sz

contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)

Edit Distances

assert sz.edit_distance("apple", "aple") == 1 # skip one ASCII character
assert sz.edit_distance("αβγδ", "αγδ") == 2 # skip two bytes forming one rune
assert sz.edit_distance_unicode("αβγδ", "αγδ") == 1 # one unicode rune

Several Python libraries provide edit distance computation. Most of them are implemented in C, but are not always as fast as StringZilla. Taking a 1'000 long proteins around 10'000 characters long, computing just a 100 distances:

Moreover, you can pass custom substitution matrices to compute the Needleman-Wunsch alignment scores. That task is very common in bioinformatics and computational biology. It's natively supported in BioPython, and its BLOSUM matrices can be converted to StringZilla's format. Alternatively, you can construct an arbitrary 256 by 256 cost matrix using NumPy. Depending on arguments, the result may be equal to the negative Levenshtein distance.

import numpy as np
import stringzilla as sz

costs = np.zeros((256, 256), dtype=np.int8)
costs.fill(-1)
np.fill_diagonal(costs, 0)

assert sz.alignment_score("first", "second", substitution_matrix=costs, gap_score=-1) == -sz.edit_distance(a, b)

Using the same proteins as for Levenshtein distance benchmarks:

§ Example converting from BioPython to StringZilla.
import numpy as np
from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = 1
aligner.extend_gap_score = 1

# Convert the matrix to NumPy
subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
subs_reconstructed = np.zeros((256, 256), dtype=np.int8)

# Initialize all banned characters to a the largest possible penalty
subs_reconstructed.fill(127)
for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
    for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
        reconstructed_row = ord(packed_row_aminoacid)
        reconstructed_column = ord(packed_column_aminoacid)
        subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]

# Let's pick two examples for of tri-peptides (made of 3 aminoacids)
glutathione = "ECG" # Need to rebuild human tissue?
thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?

assert sz.alignment_score(
    glutathione,
    thyrotropin_releasing_hormone, 
    substitution_matrix=subs_reconstructed, 
    gap_score=1) == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6

Serialization

Filesystem

Similar to how File can be used to read a large file, other interfaces can be used to dump strings to disk faster. The Str class has write_to to write the string to a file, and offset_within to obtain integer offsets of substring view in larger string for navigation.

web_archive = Str("<html>...</html><html>...</html>")
_, end_tag, next_doc = web_archive.partition("</html>") # or use `find`
next_doc_offset = next_doc.offset_within(web_archive)
web_archive.write_to("next_doc.html") # no GIL, no copies, just a view

PyArrow

A Str is easy to cast to PyArrow buffers.

from pyarrow import foreign_buffer
from stringzilla import Str

original = "hello"
view = Str(native)
arrow = foreign_buffer(view.address, view.nbytes, view)

That means you can convert Str to pyarrow.Buffer and Strs to pyarrow.Array without extra copies.

Quick Start: C/C++ 🛠️

The C library is header-only, so you can just copy the stringzilla.h header into your project. Same applies to C++, where you would copy the stringzilla.hpp header. Alternatively, add it as a submodule, and include it in your build system.

git submodule add https://github.com/ashvardanian/stringzilla.git

Or using a pure CMake approach:

FetchContent_Declare(stringzilla GIT_REPOSITORY https://github.com/ashvardanian/stringzilla.git)
FetchContent_MakeAvailable(stringzilla)

Last, but not the least, you can also install it as a library, and link against it. This approach is worse for inlining, but brings dynamic runtime dispatch for the most advanced CPU features.

Basic Usage with C 99 and Newer

There is a stable C 99 interface, where all function names are prefixed with sz_. Most interfaces are well documented, and come with self-explanatory names and examples. In some cases, hardware specific overloads are available, like sz_find_avx512 or sz_find_neon. Both are companions of the sz_find, first for x86 CPUs with AVX-512 support, and second for Arm NEON-capable CPUs.

#include <stringzilla/stringzilla.h>

// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};

// Perform string-level operations
sz_size_t substring_position = sz_find(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = sz_find_avx512(haystack.start, haystack.length, needle.start, needle.length);
sz_size_t substring_position = sz_find_neon(haystack.start, haystack.length, needle.start, needle.length);

// Hash strings
sz_u64_t hash = sz_hash(haystack.start, haystack.length);

// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
§ Mapping from LibC to StringZilla.

By design, StringZilla has a couple of notable differences from LibC:

  1. all strings are expected to have a length, and are not necessarily null-terminated.
  2. every operations has a reverse order counterpart.

That way sz_find and sz_rfind are similar to strstr and strrstr in LibC. Similarly, sz_find_byte and sz_rfind_byte replace memchr and memrchr. The sz_find_charset maps to strspn and strcspn, while sz_rfind_charset has no sibling in LibC.

LibC Functionality StringZilla Equivalents
memchr(haystack, needle, haystack_length), strchr sz_find_byte(haystack, haystack_length, needle)
memrchr(haystack, needle, haystack_length) sz_rfind_byte(haystack, haystack_length, needle)
memcmp, strcmp sz_order, sz_equal
strlen(haystack) sz_find_byte(haystack, haystack_length, needle)
strcspn(haystack, needles) sz_rfind_charset(haystack, haystack_length, needles_bitset)
strspn(haystack, needles) sz_find_charset(haystack, haystack_length, needles_bitset)
memmem(haystack, haystack_length, needle, needle_length), strstr sz_find(haystack, haystack_length, needle, needle_length)
memcpy(destination, source, destination_length) sz_copy(destination, source, destination_length)
memmove(destination, source, destination_length) sz_move(destination, source, destination_length)
memset(destination, value, destination_length) sz_fill(destination, destination_length, value)

Basic Usage with C++ 11 and Newer

There is a stable C++ 11 interface available in the ashvardanian::stringzilla namespace. It comes with two STL-like classes: string_view and string. The first is a non-owning view of a string, and the second is a mutable string with a Small String Optimization.

#include <stringzilla/stringzilla.hpp>

namespace sz = ashvardanian::stringzilla;

sz::string haystack = "some string";
sz::string_view needle = sz::string_view(haystack).substr(0, 4);

auto substring_position = haystack.find(needle); // Or `rfind`
auto hash = std::hash<sz::string_view>{}(haystack); // Compatible with STL's `std::hash`

haystack.end() - haystack.begin() == haystack.size(); // Or `rbegin`, `rend`
haystack.find_first_of(" \v\t") == 4; // Or `find_last_of`, `find_first_not_of`, `find_last_not_of`
haystack.starts_with(needle) == true; // Or `ends_with`
haystack.remove_prefix(needle.size()); // Why is this operation in-place?!
haystack.contains(needle) == true; // STL has this only from C++ 23 onwards
haystack.compare(needle) == 1; // Or `haystack <=> needle` in C++ 20 and beyond

StringZilla also provides string literals for automatic type resolution, similar to STL:

using sz::literals::operator""_sz;
using std::literals::operator""sv;

auto a = "some string"; // char const *
auto b = "some string"sv; // std::string_view
auto b = "some string"_sz; // sz::string_view

Memory Ownership and Small String Optimization

Most operations in StringZilla don't assume any memory ownership. But in addition to the read-only search-like operations StringZilla provides a minimalistic C and C++ implementations for a memory owning string "class". Like other efficient string implementations, it uses the Small String Optimization (SSO) to avoid heap allocations for short strings.

typedef union sz_string_t {
    struct internal {
        sz_ptr_t start;
        sz_u8_t length;
        char chars[SZ_STRING_INTERNAL_SPACE]; /// Ends with a null-terminator.
    } internal;

    struct external {
        sz_ptr_t start;
        sz_size_t length;        
        sz_size_t space; /// The length of the heap-allocated buffer.
        sz_size_t padding;
    } external;

} sz_string_t;

As one can see, a short string can be kept on the stack, if it fits within internal.chars array. Before 2015 GCC string implementation was just 8 bytes, and could only fit 7 characters. Different STL implementations today have different thresholds for the Small String Optimization. Similar to GCC, StringZilla is 32 bytes in size, and similar to Clang it can fit 22 characters on stack. Our layout might be preferential, if you want to avoid branches. If you use a different compiler, you may want to check it's SSO buffer size with a simple Gist.

libstdc++ in GCC 13 libc++ in Clang 17 StringZilla
sizeof(std::string) 32 24 32
Small String Capacity 15 22 22

This design has been since ported to many high-level programming languages. Swift, for example, can store 15 bytes in the String instance itself. StringZilla implements SSO at the C level, providing the sz_string_t union and a simple API for primary operations.

sz_memory_allocator_t allocator;
sz_string_t string;

// Init and make sure we are on stack
sz_string_init(&string);
sz_string_is_on_stack(&string); // == sz_true_k

// Optionally pre-allocate space on the heap for future insertions.
sz_string_grow(&string, 100, &allocator); // == sz_true_k

// Append, erase, insert into the string.
sz_string_expand(&string, 0, "_Hello_", 7, &allocator); // == sz_true_k
sz_string_expand(&string, SZ_SIZE_MAX, "world", 5, &allocator); // == sz_true_k
sz_string_erase(&string, 0, 1);

// Unpacking & introspection.
sz_ptr_t string_start;
sz_size_t string_length;
sz_size_t string_space;
sz_bool_t string_is_external;
sz_string_unpack(string, &string_start, &string_length, &string_space, &string_is_external);
sz_equal(string_start, "Hello_world", 11); // == sz_true_k

// Reclaim some memory.
sz_string_shrink_to_fit(&string, &allocator); // == sz_true_k
sz_string_free(&string, &allocator);

Unlike the conventional C strings, the sz_string_t is allowed to contain null characters. To safely print those, pass the string_length to printf as well.

printf("%.*s\n", (int)string_length, string_start);

What's Wrong with the C Standard Library?

StringZilla is not a drop-in replacement for the C Standard Library. It's designed to be a safer and more modern alternative. Conceptually:

  1. LibC strings are expected to be null-terminated, so to use the efficient LibC implementations on slices of larger strings, you'd have to copy them, which is more expensive than the original string operation.
  2. LibC functionality is asymmetric - you can find the first and the last occurrence of a character within a string, but you can't find the last occurrence of a substring.
  3. LibC function names are typically very short and cryptic.
  4. LibC lacks crucial functionality like hashing and doesn't provide primitives for less critical but relevant operations like fuzzy matching.

Something has to be said about its support for UTF8. Aside from a single-byte char type, LibC provides wchar_t:

  • The size of wchar_t is not consistent across platforms. On Windows, it's typically 16 bits (suitable for UTF-16), while on Unix-like systems, it's usually 32 bits (suitable for UTF-32). This inconsistency can lead to portability issues when writing cross-platform code.
  • wchar_t is designed to represent wide characters in a fixed-width format (UTF-16 or UTF-32). In contrast, UTF-8 is a variable-length encoding, where each character can take from 1 to 4 bytes. This fundamental difference means that wchar_t and UTF-8 are incompatible.

StringZilla partially addresses those issues.

What's Wrong with the C++ Standard Library?

C++ Code Evaluation Result Invoked Signature
"Loose"s.replace(2, 2, "vath"s, 1) "Loathe" 🤢 (pos1, count1, str2, pos2)
"Loose"s.replace(2, 2, "vath", 1) "Love" 🥰 (pos1, count1, str2, count2)

StringZilla is designed to be a drop-in replacement for the C++ Standard Templates Library. That said, some of the design decisions of STL strings are highly controversial, error-prone, and expensive. Most notably:

  1. Argument order for replace, insert, erase and similar functions is impossible to guess.
  2. Bounds-checking exceptions for substr-like functions are only thrown for one side of the range.
  3. Returning string copies in substr-like functions results in absurd volume of allocations.
  4. Incremental construction via push_back-like functions goes through too many branches.
  5. Inconsistency between string and string_view methods, like the lack of remove_prefix and remove_suffix.

Check the following set of asserts validating the std::string specification. It's not realistic to expect the average developer to remember the 14 overloads of std::string::replace.

using str = std::string;

assert(str("hello world").substr(6) == "world");
assert(str("hello world").substr(6, 100) == "world"); // 106 is beyond the length of the string, but its OK
assert_throws(str("hello world").substr(100), std::out_of_range);   // 100 is beyond the length of the string
assert_throws(str("hello world").substr(20, 5), std::out_of_range); // 20 is beyond the length of the string
assert_throws(str("hello world").substr(-1, 5), std::out_of_range); // -1 casts to unsigned without any warnings...
assert(str("hello world").substr(0, -1) == "hello world");          // -1 casts to unsigned without any warnings...

assert(str("hello").replace(1, 2, "123") == "h123lo");
assert(str("hello").replace(1, 2, str("123"), 1) == "h23lo");
assert(str("hello").replace(1, 2, "123", 1) == "h1lo");
assert(str("hello").replace(1, 2, "123", 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, str("123"), 1, 1) == "h2lo");
assert(str("hello").replace(1, 2, 3, 'a') == "haaalo");
assert(str("hello").replace(1, 2, {'a', 'b'}) == "hablo");

To avoid those issues, StringZilla provides an alternative consistent interface. It supports signed arguments, and doesn't have more than 3 arguments per function or The standard API and our alternative can be conditionally disabled with SZ_SAFETY_OVER_COMPATIBILITY=1. When it's enabled, the subjectively risky overloads from the Standard will be disabled.

using str = sz::string;

str("a:b").front(1) == "a"; // no checks, unlike `substr`
str("a:b").front(2) == "2"; // take first 2 characters
str("a:b").back(-1) == "b"; // accepting negative indices
str("a:b").back(-2) == ":b"; // similar to Python's `"a:b"[-2:]`
str("a:b").sub(1, -1) == ":"; // similar to Python's `"a:b"[1:-1]`
str("a:b").sub(-2, -1) == ":"; // similar to Python's `"a:b"[-2:-1]`
str("a:b").sub(-2, 1) == ""; // similar to Python's `"a:b"[-2:1]`
"a:b"_sz[{-2, -1}] == ":"; // works on views and overloads `operator[]`

Assuming StringZilla is a header-only library you can use the full API in some translation units and gradually transition to safer restricted API in others. Bonus - all the bound checking is branchless, so it has a constant cost and won't hurt your branch predictor.

Beyond the C++ Standard Library - Learning from Python

Python is arguably the most popular programming language for data science. In part, that's due to the simplicity of its standard interfaces. StringZilla brings some of that functionality to C++.

  • Content checks: isalnum, isalpha, isascii, isdigit, islower, isspace, isupper.
  • Trimming character sets: lstrip, rstrip, strip.
  • Trimming string matches: remove_prefix, remove_suffix.
  • Ranges of search results: splitlines, split, rsplit.
  • Number of non-overlapping substring matches: count.
  • Partitioning: partition, rpartition.

For example, when parsing documents, it is often useful to split it into substrings. Most often, after that, you would compute the length of the skipped part, the offset and the length of the remaining part. This results in a lot of pointer arithmetic and is error-prone. StringZilla provides a convenient partition function, which returns a tuple of three string views, making the code cleaner.

auto parts = haystack.partition(':'); // Matching a character
auto [before, match, after] = haystack.partition(':'); // Structure unpacking
auto [before, match, after] = haystack.partition(sz::char_set(":;")); // Character-set argument
auto [before, match, after] = haystack.partition(" : "); // String argument
auto [before, match, after] = haystack.rpartition(sz::whitespaces_set()); // Split around the last whitespace

Combining those with the split function, one can easily parse a CSV file or HTTP headers.

for (auto line : haystack.split("\r\n")) {
    auto [key, _, value] = line.partition(':');
    headers[key.strip()] = value.strip();
}

Some other extensions are not present in the Python standard library either. Let's go through the C++ functionality category by category.

Some of the StringZilla interfaces are not available even Python's native str class. Here is a sneak peek of the most useful ones.

text.hash(); // -> 64 bit unsigned integer 
text.ssize(); // -> 64 bit signed length to avoid `static_cast<std::ssize_t>(text.size())`
text.contains_only(" \w\t"); // == text.find_first_not_of(sz::char_set(" \w\t")) == npos;
text.contains(sz::whitespaces_set()); // == text.find(sz::char_set(sz::whitespaces_set())) != npos;

// Simpler slicing than `substr`
text.front(10); // -> sz::string_view
text.back(10); // -> sz::string_view

// Safe variants, which clamp the range into the string bounds
using sz::string::cap;
text.front(10, cap) == text.front(std::min(10, text.size()));
text.back(10, cap) == text.back(std::min(10, text.size()));

// Character set filtering
text.lstrip(sz::whitespaces_set()).rstrip(sz::newlines_set()); // like Python
text.front(sz::whitespaces_set()); // all leading whitespaces
text.back(sz::digits_set()); // all numerical symbols forming the suffix

// Incremental construction
using sz::string::unchecked;
text.push_back('x'); // no surprises here
text.push_back('x', unchecked); // no bounds checking, Rust style
text.try_push_back('x'); // returns `false` if the string is full and the allocation failed

sz::concatenate(text, "@", domain, ".", tld); // No allocations

Splits and Ranges

One of the most common use cases is to split a string into a collection of substrings. Which would often result in StackOverflow lookups and snippets like the one below.

std::vector<std::string> lines = split(haystack, "\r\n"); // string delimiter
std::vector<std::string> words = split(lines, ' '); // character delimiter

Those allocate memory for each string and the temporary vectors. Each allocation can be orders of magnitude more expensive, than even serial for-loop over characters. To avoid those, StringZilla provides lazily-evaluated ranges, compatible with the Range-v3 library.

for (auto line : haystack.split("\r\n"))
    for (auto word : line.split(sz::char_set(" \w\t.,;:!?")))
        std::cout << word << std::endl;

Each of those is available in reverse order as well. It also allows interleaving matches, if you want both inclusions of xx in xxx. Debugging pointer offsets is not a pleasant exercise, so keep the following functions in mind.

  • haystack.[r]find_all(needle, interleaving)
  • haystack.[r]find_all(sz::char_set(""))
  • haystack.[r]split(needle)
  • haystack.[r]split(sz::char_set(""))

For $N$ matches the split functions will report $N+1$ matches, potentially including empty strings. Ranges have a few convenience methods as well:

range.size(); // -> std::size_t
range.empty(); // -> bool
range.template to<std::set<std::sting>>(); 
range.template to<std::vector<std::sting_view>>(); 

Concatenating Strings without Allocations

Another common string operation is concatenation. The STL provides std::string::operator+ and std::string::append, but those are not very efficient, if multiple invocations are performed.

std::string name, domain, tld;
auto email = name + "@" + domain + "." + tld; // 4 allocations

The efficient approach would be to pre-allocate the memory and copy the strings into it.

std::string email;
email.reserve(name.size() + domain.size() + tld.size() + 2);
email.append(name), email.append("@"), email.append(domain), email.append("."), email.append(tld);

That's mouthful and error-prone. StringZilla provides a more convenient concatenate function, which takes a variadic number of arguments. It also overrides the operator| to concatenate strings lazily, without any allocations.

auto email = sz::concatenate(name, "@", domain, ".", tld);   // 0 allocations
auto email = name | "@" | domain | "." | tld;                // 0 allocations
sz::string email = name | "@" | domain | "." | tld;          // 1 allocations

Random Generation

Software developers often need to generate random strings for testing purposes. The STL provides std::generate and std::random_device, that can be used with StringZilla.

sz::string random_string(std::size_t length, char const *alphabet, std::size_t cardinality) {
    sz::string result(length, '\0');
    static std::random_device seed_source; // Expensive to construct - due to system calls
    static std::mt19937 generator(seed_source()); // Also expensive - due to the state size
    std::uniform_int_distribution<std::size_t> distribution(0, cardinality);
    std::generate(result.begin(), result.end(), [&]() { return alphabet[distribution(generator)]; });
    return result;
}

Mouthful and slow. StringZilla provides a C native method - sz_generate and a convenient C++ wrapper - sz::generate. Similar to Python it also defines the commonly used character sets.

auto protein = sz::string::random(300, "ARNDCQEGHILKMFPSTWYV"); // static method
auto dna = sz::basic_string<custom_allocator>::random(3_000_000_000, "ACGT");

dna.randomize("ACGT"); // `noexcept` pre-allocated version
dna.randomize(&std::rand, "ACGT"); // pass any generator, like `std::mt19937`

char uuid[36];
sz::randomize(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer

Bulk Replacements

In text processing, it's often necessary to replace all occurrences of a specific substring or set of characters within a string. Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.

  • haystack.replace_all(needle_string, replacement_string)
  • haystack.replace_all(sz::char_set(""), replacement_string)
  • haystack.try_replace_all(needle_string, replacement_string)
  • haystack.try_replace_all(sz::char_set(""), replacement_string)
  • haystack.transform(sz::look_up_table::identity())
  • haystack.transform(sz::look_up_table::identity(), haystack.data())

Levenshtein Edit Distance and Alignment Scores

Levenshtein and Hamming edit distance are provided for both byte-strings and UTF-8 strings. The latter will output the distance in Unicode code points, not bytes. Needleman-Wunsch alignment scores are only defined for byte-strings.

// Count number of substitutions in same length strings
sz::hamming_distance(first, second[, upper_bound]) -> std::size_t;
sz::hamming_distance_utf8(first, second[, upper_bound]) -> std::size_t;

// Count number of insertions, deletions and substitutions
sz::edit_distance(first, second[, upper_bound[, allocator]]) -> std::size_t;
sz::edit_distance_utf8(first, second[, upper_bound[, allocator]]) -> std::size_t;

// Substitution-parametrized Needleman-Wunsch global alignment score
std::int8_t costs[256][256]; // Substitution costs matrix
sz::alignment_score(first, second, costs[, gap_score[, allocator]) -> std::ptrdiff_t;

Sorting in C and C++

LibC provides qsort and STL provides std::sort. Both have their quarks. The LibC standard has no way to pass a context to the comparison function, that's only possible with platform-specific extensions. Those have different arguments order on every OS.

// Linux: https://linux.die.net/man/3/qsort_r
void qsort_r(void *elements, size_t count, size_t element_width, 
    int (*compare)(void const *left, void const *right, void *context),
    void *context);
// MacOS and FreeBSD: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/qsort_r.3.html
void qsort_r(void *elements, size_t count, size_t element_width, 
    void *context,
    int (*compare)(void *context, void const *left, void const *right));
// Windows conflicts with ISO `qsort_s`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/qsort-s?view=msvc-170
void qsort_s(id *elements, size_t count, size_t element_width, 
    int (*compare)(void *context, void const *left, void const *right),
    void *context);

C++ generic algorithm is not perfect either. There is no guarantee in the standard that std::sort won't allocate any memory. If you are running on embedded, in real-time or on 100+ CPU cores per node, you may want to avoid that. StringZilla doesn't solve the general case, but hopes to improve the performance for strings. Use sz_sort, or the high-level sz::sorted_order, which can be used sort any collection of elements convertible to sz::string_view.

std::vector<std::string> data({"c", "b", "a"});
std::vector<std::size_t> order = sz::sorted_order(data); //< Simple shortcut

// Or, taking care of memory allocation:
sz::sorted_order(data.begin(), data.end(), order.data(), [](auto const &x) -> sz::string_view { return x; });

Standard C++ Containers with String Keys

The C++ Standard Templates Library provides several associative containers, often used with string keys.

std::map<std::string, int, std::less<std::string>> sorted_words;
std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>> words;

The performance of those containers is often limited by the performance of the string keys, especially on reads. StringZilla can be used to accelerate containers with std::string keys, by overriding the default comparator and hash functions.

std::map<std::string, int, sz::string_view_less> sorted_words;
std::unordered_map<std::string, int, sz::string_view_hash, sz::string_view_equal_to> words;

Alternatively, a better approach would be to use the sz::string class as a key. The right hash function and comparator would be automatically selected and the performance gains would be more noticeable if the keys are short.

std::map<sz::string, int> sorted_words;
std::unordered_map<sz::string, int> words;

Compilation Settings and Debugging

SZ_DEBUG:

For maximal performance, the C library does not perform any bounds checking in Release builds. In C++, bounds checking happens only in places where the STL std::string would do it. If you want to enable more aggressive bounds-checking, define SZ_DEBUG before including the header. If not explicitly set, it will be inferred from the build type.

SZ_USE_X86_AVX512, SZ_USE_X86_AVX2, SZ_USE_ARM_NEON:

One can explicitly disable certain families of SIMD instructions for compatibility purposes. Default values are inferred at compile time.

SZ_DYNAMIC_DISPATCH:

By default, StringZilla is a header-only library. But if you are running on different generations of devices, it makes sense to pre-compile the library for all supported generations at once, and dispatch at runtime. This flag does just that and is used to produce the stringzilla.so shared library, as well as the Python bindings.

SZ_USE_MISALIGNED_LOADS:

By default, StringZilla avoids misaligned loads. If supported, it replaces many byte-level operations with word-level ones. Going from char-like types to uint64_t-like ones can significantly accelerate the serial (SWAR) backend. So consider enabling it if you are building for some embedded device.

SZ_AVOID_LIBC and SZ_OVERRIDE_LIBC:

When using the C header-only library one can disable the use of LibC. This may affect the type resolution system on obscure hardware platforms. Moreover, one may let stringzilla override the common symbols like the memcpy and memset with its own implementations. In that case you can use the LD_PRELOAD trick to prioritize it's symbols over the ones from the LibC and accelerate existing string-heavy applications without recompiling them.

SZ_AVOID_STL and SZ_SAFETY_OVER_COMPATIBILITY:

When using the C++ interface one can disable implicit conversions from std::string to sz::string and back. If not needed, the <string> and <string_view> headers will be excluded, reducing compilation time. Moreover, if STL compatibility is a low priority, one can make the API safer by disabling the overloads, which are subjectively error prone.

STRINGZILLA_BUILD_SHARED, STRINGZILLA_BUILD_TEST, STRINGZILLA_BUILD_BENCHMARK, STRINGZILLA_TARGET_ARCH for CMake users:

When compiling the tests and benchmarks, you can explicitly set the target hardware architecture. It's synonymous to GCC's -march flag and is used to enable/disable the appropriate instruction sets. You can also disable the shared library build, if you don't need it.

Quick Start: Rust 🦀

StringZilla is available as a Rust crate, with documentation available on docs.rs/stringzilla. To use the latest crate release in your project, add the following to your Cargo.toml:

[dependencies]
stringzilla = ">=3"

Or if you want to use the latest pre-release version from the repository:

[dependencies]
stringzilla = { git = "https://github.com/ashvardanian/stringzilla", branch = "main-dev" }

Once installed, all of the functionality is available through the stringzilla namespace. Many interfaces will look familiar to the users of the memchr crate.

use stringzilla::sz;

// Identical to `memchr::memmem::find` and `memchr::memmem::rfind` functions
sz::find("Hello, world!", "world") // 7
sz::rfind("Hello, world!", "world") // 7

// Generalizations of `memchr::memrchr[123]`
sz::find_char_from("Hello, world!", "world") // 2
sz::rfind_char_from("Hello, world!", "world") // 11

Unlike memchr, the throughput of stringzilla is high in both normal and reverse-order searches. It also provides no constraints on the size of the character set, while memchr allows only 1, 2, or 3 characters. In addition to global functions, stringzilla provides a StringZilla extension trait:

use stringzilla::StringZilla;

let my_string: String = String::from("Hello, world!");
let my_str = my_string.as_str();
let my_cow_str = Cow::from(&my_string);

// Use the generic function with a String
assert_eq!(my_string.sz_find("world"), Some(7));
assert_eq!(my_string.sz_rfind("world"), Some(7));
assert_eq!(my_string.sz_find_char_from("world"), Some(2));
assert_eq!(my_string.sz_rfind_char_from("world"), Some(11));
assert_eq!(my_string.sz_find_char_not_from("world"), Some(0));
assert_eq!(my_string.sz_rfind_char_not_from("world"), Some(12));

// Same works for &str and Cow<'_, str>
assert_eq!(my_str.sz_find("world"), Some(7));
assert_eq!(my_cow_str.as_ref().sz_find("world"), Some(7));

The library also exposes Levenshtein and Hamming edit-distances for byte-arrays and UTF-8 strings, as well as Needleman-Wunch alignment scores.

use stringzilla::sz;

// Handling arbitrary byte arrays:
sz::edit_distance("Hello, world!", "Hello, world?"); // 1
sz::hamming_distance("Hello, world!", "Hello, world?"); // 1
sz::alignment_score("Hello, world!", "Hello, world?", sz::unary_substitution_costs(), -1); // -1

// Handling UTF-8 strings:
sz::hamming_distance_utf8("αβγδ", "αγγδ") // 1
sz::edit_distance_utf8("façade", "facade") // 1

Quick Start: Swift 🍏

StringZilla can be added as a dependency in the Swift Package Manager. In your Package.swift file, add the following:

dependencies: [
    .package(url: "https://github.com/ashvardanian/stringzilla")
]

The package currently covers only the most basic functionality, but is planned to be extended to cover the full C++ API.

var s = "Hello, world! Welcome to StringZilla. 👋"
s[s.findFirst(substring: "world")!...] // "world! Welcome to StringZilla. 👋")    
s[s.findLast(substring: "o")!...] // "o StringZilla. 👋")
s[s.findFirst(characterFrom: "aeiou")!...] // "ello, world! Welcome to StringZilla. 👋")
s[s.findLast(characterFrom: "aeiou")!...] // "a. 👋")
s[s.findFirst(characterNotFrom: "aeiou")!...] // "Hello, world! Welcome to StringZilla. 👋"
s.editDistance(from: "Hello, world!")! // 29

Algorithms & Design Decisions 📚

StringZilla aims to optimize some of the slowest string operations. Some popular operations, however, like equality comparisons and relative order checking, almost always complete on some of the very first bytes in either string. In such operations vectorization is almost useless, unless huge and very similar strings are considered. StringZilla implements those operations as well, but won't result in substantial speedups.

Exact Substring Search

Substring search algorithms are generally divided into: comparison-based, automaton-based, and bit-parallel. Different families are effective for different alphabet sizes and needle lengths. The more operations are needed per-character - the more effective SIMD would be. The longer the needle - the more effective the skip-tables are. StringZilla uses different exact substring search algorithms for different needle lengths and backends:

  • When no SIMD is available - SWAR (SIMD Within A Register) algorithms are used on 64-bit words.
  • Boyer-Moore-Horspool (BMH) algorithm with Raita heuristic variation for longer needles.
  • SIMD algorithms are randomized to look at different parts of the needle.

On very short needles, especially 1-4 characters long, brute force with SIMD is the fastest solution. On mid-length needles, bit-parallel algorithms are effective, as the character masks fit into 32-bit or 64-bit words. Either way, if the needle is under 64-bytes long, on haystack traversal we will still fetch every CPU cache line. So the only way to improve performance is to reduce the number of comparisons. The snippet below shows how StringZilla accomplishes that for needles of length two.

/**
* @brief 2Byte-level equality comparison between two 64-bit integers.
* @return 64-bit integer, where every top bit in each 2byte signifies a match.
*/
SZ_INTERNAL sz_u64_vec_t _sz_u64_each_2byte_equal(sz_u64_vec_t a, sz_u64_vec_t b) {
sz_u64_vec_t vec;
vec.u64 = ~(a.u64 ^ b.u64);
// The match is valid, if every bit within each 2byte is set.
// For that take the bottom 15 bits of each 2byte, add one to them,
// and if this sets the top bit to one, then all the 15 bits are ones as well.
vec.u64 = ((vec.u64 & 0x7FFF7FFF7FFF7FFFull) + 0x0001000100010001ull) & ((vec.u64 & 0x8000800080008000ull));
return vec;
}
/**
* @brief Find the first occurrence of a @b two-character needle in an arbitrary length haystack.
* This implementation uses hardware-agnostic SWAR technique, to process 8 possible offsets at a time.
*/
SZ_INTERNAL sz_cptr_t _sz_find_2byte_serial(sz_cptr_t h, sz_size_t h_length, sz_cptr_t n) {
// This is an internal method, and the haystack is guaranteed to be at least 2 bytes long.
sz_assert(h_length >= 2 && "The haystack is too short.");
sz_cptr_t const h_end = h + h_length;
#if !SZ_USE_MISALIGNED_LOADS
// Process the misaligned head, to void UB on unaligned 64-bit loads.
for (; ((sz_size_t)h & 7ull) && h + 2 <= h_end; ++h)
if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h;
#endif
sz_u64_vec_t h_even_vec, h_odd_vec, n_vec, matches_even_vec, matches_odd_vec;
n_vec.u64 = 0;
n_vec.u8s[0] = n[0], n_vec.u8s[1] = n[1];
n_vec.u64 *= 0x0001000100010001ull; // broadcast
// This code simulates hyper-scalar execution, analyzing 8 offsets at a time.
for (; h + 9 <= h_end; h += 8) {
h_even_vec.u64 = *(sz_u64_t *)h;
h_odd_vec.u64 = (h_even_vec.u64 >> 8) | ((sz_u64_t)h[8] << 56);
matches_even_vec = _sz_u64_each_2byte_equal(h_even_vec, n_vec);
matches_odd_vec = _sz_u64_each_2byte_equal(h_odd_vec, n_vec);
matches_even_vec.u64 >>= 8;
if (matches_even_vec.u64 + matches_odd_vec.u64) {
sz_u64_t match_indicators = matches_even_vec.u64 | matches_odd_vec.u64;
return h + sz_u64_ctz(match_indicators) / 8;
}
}
for (; h + 2 <= h_end; ++h)
if ((h[0] == n[0]) + (h[1] == n[1]) == 2) return h;
return SZ_NULL;
}

Going beyond that, to long needles, Boyer-Moore (BM) and its variants are often the best choice. It has two tables: the good-suffix shift and the bad-character shift. Common choice is to use the simplified BMH algorithm, which only uses the bad-character shift table, reducing the pre-processing time. We do the same for mid-length needles up to 256 bytes long. That way the stack-allocated shift table remains small.

/**
* @brief Boyer-Moore-Horspool algorithm for exact matching of patterns up to @b 256-bytes long.
* Uses the Raita heuristic to match the first two, the last, and the middle character of the pattern.
*/
SZ_INTERNAL sz_cptr_t _sz_find_horspool_upto_256bytes_serial(sz_cptr_t h_chars, sz_size_t h_length, //
sz_cptr_t n_chars, sz_size_t n_length) {
sz_assert(n_length <= 256 && "The pattern is too long.");
// Several popular string matching algorithms are using a bad-character shift table.
// Boyer Moore: https://www-igm.univ-mlv.fr/~lecroq/string/node14.html
// Quick Search: https://www-igm.univ-mlv.fr/~lecroq/string/node19.html
// Smith: https://www-igm.univ-mlv.fr/~lecroq/string/node21.html
union {
sz_u8_t jumps[256];
sz_u64_vec_t vecs[64];
} bad_shift_table;
// Let's initialize the table using SWAR to the total length of the string.
sz_u8_t const *h = (sz_u8_t const *)h_chars;
sz_u8_t const *n = (sz_u8_t const *)n_chars;
{
sz_u64_vec_t n_length_vec;
n_length_vec.u64 = n_length;
n_length_vec.u64 *= 0x0101010101010101ull; // broadcast
for (sz_size_t i = 0; i != 64; ++i) bad_shift_table.vecs[i].u64 = n_length_vec.u64;
for (sz_size_t i = 0; i + 1 < n_length; ++i) bad_shift_table.jumps[n[i]] = (sz_u8_t)(n_length - i - 1);
}
// Another common heuristic is to match a few characters from different parts of a string.
// Raita suggests to use the first two, the last, and the middle character of the pattern.
sz_u32_vec_t h_vec, n_vec;
// Pick the parts of the needle that are worth comparing.
sz_size_t offset_first, offset_mid, offset_last;
_sz_locate_needle_anomalies(n_chars, n_length, &offset_first, &offset_mid, &offset_last);
// Broadcast those characters into an unsigned integer.
n_vec.u8s[0] = n[offset_first];
n_vec.u8s[1] = n[offset_first + 1];
n_vec.u8s[2] = n[offset_mid];
n_vec.u8s[3] = n[offset_last];
// Scan through the whole haystack, skipping the last `n_length - 1` bytes.
for (sz_size_t i = 0; i <= h_length - n_length;) {
h_vec.u8s[0] = h[i + offset_first];
h_vec.u8s[1] = h[i + offset_first + 1];
h_vec.u8s[2] = h[i + offset_mid];
h_vec.u8s[3] = h[i + offset_last];
if (h_vec.u32 == n_vec.u32 && sz_equal((sz_cptr_t)h + i, n_chars, n_length)) return (sz_cptr_t)h + i;
i += bad_shift_table.jumps[h[i + n_length - 1]];
}
return SZ_NULL;
}

In the C++ Standards Library, the std::string::find function uses the BMH algorithm with Raita's heuristic. Before comparing the entire string, it matches the first, last, and the middle character. Very practical, but can be slow for repetitive characters. Both SWAR and SIMD backends of StringZilla have a cheap pre-processing step, where we locate unique characters. This makes the library a lot more practical when dealing with non-English corpora.

* @brief Chooses the offsets of the most interesting characters in a search needle.
*
* Search throughput can significantly deteriorate if we are matching the wrong characters.
* Say the needle is "aXaYa", and we are comparing the first, second, and last character.
* If we use SIMD and compare many offsets at a time, comparing against "a" in every register is a waste.
*
* Similarly, dealing with UTF8 inputs, we know that the lower bits of each character code carry more information.
* Cyrillic alphabet, for example, falls into [0x0410, 0x042F] code range for uppercase [А, Я], and
* into [0x0430, 0x044F] for lowercase [а, я]. Scanning through a text written in Russian, half of the
* bytes will carry absolutely no value and will be equal to 0x04.
*/
SZ_INTERNAL void _sz_locate_needle_anomalies(sz_cptr_t start, sz_size_t length, //
sz_size_t *first, sz_size_t *second, sz_size_t *third) {
*first = 0;
*second = length / 2;
*third = length - 1;
//
int has_duplicates = //
start[*first] == start[*second] || //
start[*first] == start[*third] || //
start[*second] == start[*third];
// Loop through letters to find non-colliding variants.
if (length > 3 && has_duplicates) {
// Pivot the middle point left, until we find a character different from the first one.
for (; start[*second] == start[*first] && *second; --(*second)) {}
// Pivot the middle point right, until we find a character different from the first one.
for (; start[*second] == start[*first] && *second + 1 < *third; ++(*second)) {}
// Pivot the third (last) point left, until we find a different character.
for (; (start[*third] == start[*second] || start[*third] == start[*first]) && *third > (*second + 1);
--(*third)) {}
}
}

All those, still, have $O(hn)$ worst case complexity. To guarantee $O(h)$ worst case time complexity, the Apostolico-Giancarlo (AG) algorithm adds an additional skip-table. Preprocessing phase is $O(n+sigma)$ in time and space. On traversal, performs from $(h/n)$ to $(3h/2)$ comparisons. It however, isn't practical on modern CPUs. A simpler idea, the Galil-rule might be a more relevant optimizations, if many matches must be found.

Other algorithms previously considered and deprecated:

  • Apostolico-Giancarlo algorithm for longer needles. Control-flow is too complex for efficient vectorization.
  • Shift-Or-based Bitap algorithm for short needles. Slower than SWAR.
  • Horspool-style bad-character check in SIMD backends. Effective only for very long needles, and very uneven character distributions between the needle and the haystack. Faster "character-in-set" check needed to generalize.

§ Reading materials. Exact String Matching Algorithms in Java. SIMD-friendly algorithms for substring searching.

Levenshtein Edit Distance

Levenshtein distance is the best known edit-distance for strings, that checks, how many insertions, deletions, and substitutions are needed to transform one string to another. It's extensively used in approximate string-matching, spell-checking, and bioinformatics.

The computational cost of the Levenshtein distance is $O(n * m)$, where $n$ and $m$ are the lengths of the string arguments. To compute that, the naive approach requires $O(n * m)$ space to store the "Levenshtein matrix", the bottom-right corner of which will contain the Levenshtein distance. The algorithm producing the matrix has been simultaneously studied/discovered by the Soviet mathematicians Vladimir Levenshtein in 1965, Taras Vintsyuk in 1968, and American computer scientists - Robert Wagner, David Sankoff, Michael J. Fischer in the following years. Several optimizations are known:

  1. Space Optimization: The matrix can be computed in $O(min(n,m))$ space, by only storing the last two rows of the matrix.
  2. Divide and Conquer: Hirschberg's algorithm can be applied to decompose the computation into subtasks.
  3. Automata: Levenshtein automata can be effective, if one of the strings doesn't change, and is a subject to many comparisons.
  4. Shift-Or: Bit-parallel algorithms transpose the matrix into a bit-matrix, and perform bitwise operations on it.

The last approach is quite powerful and performant, and is used by the great RapidFuzz library. It's less known, than the others, derived from the Baeza-Yates-Gonnet algorithm, extended to bounded edit-distance search by Manber and Wu in 1990s, and further extended by Gene Myers in 1999 and Heikki Hyyro between 2002 and 2004.

StringZilla introduces a different approach, extensively used in Unum's internal combinatorial optimization libraries. The approach doesn't change the number of trivial operations, but performs them in a different order, removing the data dependency, that occurs when computing the insertion costs. This results in much better vectorization for intra-core parallelism and potentially multi-core evaluation of a single request.

Next design goals:

  • Generalize fast traversals to rectangular matrices.
  • Port x86 AVX-512 solution to Arm NEON.

§ Reading materials. Faster Levenshtein Distances with a SIMD-friendly Traversal Order.

Needleman-Wunsch Alignment Score for Bioinformatics

The field of bioinformatics studies various representations of biological structures. The "primary" representations are generally strings over sparse alphabets:

  • DNA sequences, where the alphabet is {A, C, G, T}, ranging from ~100 characters for short reads to 3 billion for the human genome.
  • RNA sequences, where the alphabet is {A, C, G, U}, ranging from ~50 characters for tRNA to thousands for mRNA.
  • Proteins, where the alphabet is made of 22 amino acids, ranging from 2 characters for dipeptide to 35,000 for Titin, the longest protein.

The shorter the representation, the more often researchers may want to use custom substitution matrices. Meaning that the cost of a substitution between two characters may not be the same for all pairs.

StringZilla adapts the fairly efficient two-row Wagner-Fisher algorithm as a baseline serial implementation of the Needleman-Wunsch score. It supports arbitrary alphabets up to 256 characters, and can be used with either BLOSUM, PAM, or other substitution matrices. It also uses SIMD for hardware acceleration of the substitution lookups. This however, does not yet break the data-dependency for insertion costs, where 80% of the time is wasted. With that solved, the SIMD implementation will become 5x faster than the serial one.

Memory Copying, Fills, and Moves

A lot has been written about the time computers spend copying memory and how that operation is implemented in LibC. Interestingly, the operation can still be improved, as most Assembly implementations use outdated instructions. Even performance-oriented STL replacements, like Meta's Folly v2024.09.23 focus on AVX2, and don't take advantage of the new masked instructions in AVX-512 or SVE.

In AVX-512, StringZilla uses non-temporal stores to avoid cache pollution, when dealing with very large strings. Moreover, it handles the unaligned head and the tails of the target buffer separately, ensuring that writes in big copies are always aligned to cache-line boundaries. That's true for both AVX2 and AVX-512 backends.

StringZilla also contains "drafts" of smarter, but less efficient algorithms, that minimize the number of unaligned loads, perfoming shuffles and permutations. That's a topic for future research, as the performance gains are not yet satisfactory.

§ Reading materials. memset benchmarks by Nadav Rotem. Cache Associativity by Sergey Slotin.

Random Generation

Generating random strings from different alphabets is a very common operation. StringZilla accepts an arbitrary Pseudorandom Number Generator to produce noise, and an array of characters to sample from. Sampling is optimized to avoid integer division, a costly operation on modern CPUs. For that a 768-byte long lookup table is used to perform 2 lookups, 1 multiplication, 2 shifts, and 2 accumulations.

/**
* @brief Uses two small lookup tables (768 bytes total) to accelerate division by a small
* unsigned integer. Performs two lookups, one multiplication, two shifts, and two accumulations.
*
* @param divisor Integral value larger than one.
* @param number Integral value to divide.
*/
SZ_INTERNAL sz_u8_t sz_u8_divide(sz_u8_t number, sz_u8_t divisor) {
static sz_u16_t const multipliers[256] = {
0, 0, 0, 21846, 0, 39322, 21846, 9363, 0, 50973, 39322, 29790, 21846, 15124, 9363, 4370,
0, 57826, 50973, 44841, 39322, 34329, 29790, 25645, 21846, 18351, 15124, 12137, 9363, 6780, 4370, 2115,
0, 61565, 57826, 54302, 50973, 47824, 44841, 42011, 39322, 36765, 34329, 32006, 29790, 27671, 25645, 23705,
21846, 20063, 18351, 16706, 15124, 13602, 12137, 10725, 9363, 8049, 6780, 5554, 4370, 3224, 2115, 1041,
0, 63520, 61565, 59668, 57826, 56039, 54302, 52614, 50973, 49377, 47824, 46313, 44841, 43407, 42011, 40649,
39322, 38028, 36765, 35532, 34329, 33154, 32006, 30885, 29790, 28719, 27671, 26647, 25645, 24665, 23705, 22766,
21846, 20945, 20063, 19198, 18351, 17520, 16706, 15907, 15124, 14356, 13602, 12863, 12137, 11424, 10725, 10038,
9363, 8700, 8049, 7409, 6780, 6162, 5554, 4957, 4370, 3792, 3224, 2665, 2115, 1573, 1041, 517,
0, 64520, 63520, 62535, 61565, 60609, 59668, 58740, 57826, 56926, 56039, 55164, 54302, 53452, 52614, 51788,
50973, 50169, 49377, 48595, 47824, 47063, 46313, 45572, 44841, 44120, 43407, 42705, 42011, 41326, 40649, 39982,
39322, 38671, 38028, 37392, 36765, 36145, 35532, 34927, 34329, 33738, 33154, 32577, 32006, 31443, 30885, 30334,
29790, 29251, 28719, 28192, 27671, 27156, 26647, 26143, 25645, 25152, 24665, 24182, 23705, 23233, 22766, 22303,
21846, 21393, 20945, 20502, 20063, 19628, 19198, 18772, 18351, 17933, 17520, 17111, 16706, 16305, 15907, 15514,
15124, 14738, 14356, 13977, 13602, 13231, 12863, 12498, 12137, 11779, 11424, 11073, 10725, 10380, 10038, 9699,
9363, 9030, 8700, 8373, 8049, 7727, 7409, 7093, 6780, 6470, 6162, 5857, 5554, 5254, 4957, 4662,
4370, 4080, 3792, 3507, 3224, 2943, 2665, 2388, 2115, 1843, 1573, 1306, 1041, 778, 517, 258,
};
// This table can be avoided using a single addition and counting trailing zeros.
static sz_u8_t const shifts[256] = {
0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, //
4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, //
5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, //
6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, //
};
sz_u32_t multiplier = multipliers[divisor];
sz_u8_t shift = shifts[divisor];
sz_u16_t q = (sz_u16_t)((multiplier * number) >> 16);
sz_u16_t t = ((number - q) >> 1) + q;
return (sz_u8_t)(t >> shift);
}

Sorting

For lexicographic sorting of strings, StringZilla uses a "hybrid-hybrid" approach with $O(n * log(n))$ and.

  1. Radix sort for first bytes exported into a continuous buffer for locality.
  2. IntroSort on partially ordered chunks to balance efficiency and worst-case performance.
    1. IntroSort begins with a QuickSort.
    2. If the recursion depth exceeds a certain threshold, it switches to a HeapSort.

A better algorithm is in development. Check #173 for design goals and progress updates.

Hashing

Warning

Hash functions are not cryptographically safe and are currently under active development. They may change in future minor releases.

Choosing the right hashing algorithm for your application can be crucial from both performance and security standpoint. In StringZilla a 64-bit rolling hash function is reused for both string hashes and substring hashes, Rabin-style fingerprints. Rolling hashes take the same amount of time to compute hashes with different window sizes, and are fast to update. Those are not however perfect hashes, and collisions are frequent. StringZilla attempts to use SIMD, but the performance is not yet satisfactory. On Intel Sapphire Rapids, the following numbers can be expected for N-way parallel variants.

  • 4-way AVX2 throughput with 64-bit integer multiplication (no native support): 0.28 GB/s.
  • 4-way AVX2 throughput with 32-bit integer multiplication: 0.54 GB/s.
  • 4-way AVX-512DQ throughput with 64-bit integer multiplication: 0.46 GB/s.
  • 4-way AVX-512 throughput with 32-bit integer multiplication: 0.58 GB/s.
  • 8-way AVX-512 throughput with 32-bit integer multiplication: 0.11 GB/s.

Next design goals:

  • Try gear-hash and other rolling approaches.

Why not CRC32?

Cyclic Redundancy Check 32 is one of the most commonly used hash functions in Computer Science. It has in-hardware support on both x86 and Arm, for both 8-bit, 16-bit, 32-bit, and 64-bit words. The 0x1EDC6F41 polynomial is used in iSCSI, Btrfs, ext4, and the 0x04C11DB7 in SATA, Ethernet, Zlib, PNG. In case of Arm more than one polynomial is supported. It is, however, somewhat limiting for Big Data usecases, which often have to deal with more than 4 Billion strings, making collisions unavoidable. Moreover, the existing SIMD approaches are tricky, combining general purpose computations with specialized instructions, to utilize more silicon in every cycle.

§ Reading materials. Comprehensive derivation of approaches Faster computation for 4 KB buffers on x86 Comparing different lookup tables Great open-source implementations. By Peter Cawley By Stephan Brumme

Other Modern Alternatives

MurmurHash from 2008 by Austin Appleby is one of the best known non-cryptographic hashes. It has a very short implementation and is capable of producing 32-bit and 128-bit hashes. The CityHash from 2011 by Google and the xxHash improve on that, better leveraging the super-scalar nature of modern CPUs and producing 64-bit and 128-bit hashes.

Neither of those functions are cryptographic, unlike MD5, SHA, and BLAKE algorithms. Most of cryptographic hashes are based on the Merkle-Damgård construction, and aren't resistant to the length-extension attacks. Current state of the Art, might be the BLAKE3 algorithm. It's resistant to a broad range of attacks, can process 2 bytes per CPU cycle, and comes with a very optimized official implementation for C and Rust. It has the same 128-bit security level as the BLAKE2, and achieves its performance gains by reducing the number of mixing rounds, and processing data in 1 KiB chunks, which is great for longer strings, but may result in poor performance on short ones.

All mentioned libraries have undergone extensive testing and are considered production-ready. They can definitely accelerate your application, but so may the downstream mixer. For instance, when a hash-table is constructed, the hashes are further shrunk to address table buckets. If the mixer looses entropy, the performance gains from the hash function may be lost. An example would be power-of-two modulo, which is a common mixer, but is known to be weak. One alternative would be the fastrange by Daniel Lemire. Another one is the Fibonacci hash trick using the Golden Ratio, also used in StringZilla.

Unicode, UTF-8, and Wide Characters

Most StringZilla operations are byte-level, so they work well with ASCII and UTF8 content out of the box. In some cases, like edit-distance computation, the result of byte-level evaluation and character-level evaluation may differ. So StringZilla provides following functions to work with Unicode:

  • sz_edit_distance_utf8 - computes the Levenshtein distance between two UTF-8 strings.
  • sz_hamming_distance_utf8 - computes the Hamming distance between two UTF-8 strings.

Java, JavaScript, Python 2, C#, and Objective-C, however, use wide characters (wchar) - two byte long codes, instead of the more reasonable fixed-length UTF32 or variable-length UTF8. This leads to all kinds of offset-counting issues when facing four-byte long Unicode characters. So consider transcoding with simdutf, if you are coming from such environments.

Contributing 👾

Please check out the contributing guide for more details on how to setup the development environment and contribute to this project. If you like this project, you may also enjoy USearch, UCall, UForm, and SimSIMD. 🤗

If you like strings and value efficiency, you may also enjoy the following projects:

  • simdutf - transcoding UTF8, UTF16, and UTF32 LE and BE.
  • hyperscan - regular expressions with SIMD acceleration.
  • pyahocorasick - Aho-Corasick algorithm in Python.
  • rapidfuzz - fast string matching in C++ and Python.

If you are looking for more reading materials on this topic, consider the following:

License 📜

Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.