Support for the BMI1 instruction set extension is almost universal by now. The extension was introduced in AMD Jaguar and Intel Haswell, both launched in 2013, i.e. 12 years ago.
Instead of doing stuff like (word >> bit_offset) & self.mask, in C or C++ I usually write _bextr_u64, or in modern C#, Bmi1.X64.BitFieldExtract. Note, however, that these intrinsics compile into 2 instructions, not one, because the CPU instruction takes its start/length arguments from the lower/higher bytes of a single 16-bit number.
lukefleed 19 hours ago
BEXTR basically does the same thing, yes. I'm sticking with the portable shift-and-mask, though. My bet is that LLVM is smart enough to see that pattern and emit a BEXTR on its own when the target supports BMI1.
Using the intrinsic directly would also kill portability. I'd need #[cfg] for ARM and runtime checks for older x86 CPUs, which adds complexity for a tiny, if any, gain. The current code just lets the compiler pick the best instruction on any architecture.
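For concreteness, here's a minimal sketch of the two options side by side (function names are just illustrative, not the crate's actual internals):

```rust
/// Portable shift-and-mask. With BMI1 enabled (e.g. -C target-cpu=haswell),
/// LLVM is free to lower this pattern to a single BEXTR.
fn extract_portable(word: u64, bit_offset: u32, bit_width: u32) -> u64 {
    let mask = if bit_width == 64 { u64::MAX } else { (1u64 << bit_width) - 1 };
    (word >> bit_offset) & mask
}

/// Explicit intrinsic: x86_64-only and requires BMI1. Note it still emits two
/// instructions, because BEXTR takes its start/length packed into a single
/// 16-bit control operand that has to be assembled first.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi1")]
unsafe fn extract_bextr(word: u64, bit_offset: u32, bit_width: u32) -> u64 {
    unsafe { core::arch::x86_64::_bextr_u64(word, bit_offset, bit_width) }
}
```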
Phemist 1 day ago
I am handling a lot of vectors of size 6 or less, with integer values lower than 500,000. I am packing each vector into a u128, which comes down to 6*19 bits + a 3-bit length field = 117 bits, with 11 bits left to spare.
I chose u128 instead of a [u64; 2] array because of the boundary-crossing issue I saw would otherwise happen, which a u128 avoids.
I am not very familiar with these specific bit manipulation instruction sets, so would you know if there is something similar that works on u128 instead of just u32/u64?
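Concretely, the packing looks roughly like this (simplified sketch; the exact field order doesn't matter for the question):

```rust
/// Pack up to 6 values (each < 2^19) plus a 3-bit length into a u128.
/// Layout here: bits 0..3 hold the length, then 19-bit fields above it.
fn pack(values: &[u64]) -> u128 {
    assert!(values.len() <= 6);
    let mut packed = values.len() as u128; // 3-bit length field
    for (i, &v) in values.iter().enumerate() {
        debug_assert!(v < (1 << 19));
        packed |= (v as u128) << (3 + 19 * i);
    }
    packed
}

/// Read the i-th 19-bit field back out with plain shift-and-mask.
fn get(packed: u128, i: usize) -> u64 {
    ((packed >> (3 + 19 * i)) & 0x7FFFF) as u64
}
```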
pjsg 1 day ago
I suspect that this unaligned read approach doesn't work for bit lengths of 59, 61, 62 and 63.
In the case of 63, reading out the value at offset 1 requires two reads: it needs 1 bit from the first byte, then 56 bits from the next 7 bytes, and finally 6 bits from the 9th byte. Nine bytes cannot be loaded in a single u64 load.
lukefleed 17 hours ago
That's a good catch! Thank you, you are right. I (incorrectly) assumed that a single u64 could always capture the entire bit_width-sized value read starting from byte_pos. However, as you said, this assumption breaks for some large bit widths.
I already patched it, thanks again.
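For reference, the shape of the fix is roughly this: fall back to a second load whenever the value would straddle more than 8 bytes (a simplified sketch, not the crate's exact code):

```rust
/// Read `bit_width` bits starting at absolute bit position `bit_pos` from a
/// little-endian byte buffer. Assumes the buffer has enough padding at the
/// end that the loads below never run past it. Sketch only.
fn read_bits(data: &[u8], bit_pos: usize, bit_width: u32) -> u64 {
    let byte_pos = bit_pos / 8;
    let bit_offset = (bit_pos % 8) as u32;
    let mask = if bit_width == 64 { u64::MAX } else { (1u64 << bit_width) - 1 };

    let lo = u64::from_le_bytes(data[byte_pos..byte_pos + 8].try_into().unwrap());

    // Fast path: the whole value fits in one unaligned 8-byte load.
    if bit_offset + bit_width <= 64 {
        return (lo >> bit_offset) & mask;
    }

    // Slow path (e.g. bit_width = 63, bit_offset = 7): the value straddles
    // 9 bytes, so the remaining high bits come from a second load.
    let hi = u64::from(data[byte_pos + 8]) << (64 - bit_offset);
    ((lo >> bit_offset) | hi) & mask
}
```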
fainpul 19 hours ago
Maybe this is something people commonly have to do, but I have never heard of it, and I would have wished the article started with an explanation of why simply casting the u64 values to a more appropriate type (u16 in this case) is not good enough. 6 of 16 bits would go to waste; it isn't obvious to me why that is considered substantial enough to go through all this trouble.
lukefleed 16 hours ago
For many applications, casting to u16 and wasting 6 bits is perfectly fine. The "trouble" is only worth it when you're operating at a scale where those wasted bits add up to gigabytes.
This is common in fields like bioinformatics, search engine indexing, or implementing other succinct data structures. In these areas, the entire game is about squeezing massive datasets into RAM to avoid slow disk I/O. Wasting 6 out of 16 bits means almost 40% of your memory holds nothing useful. That can be the difference between a server needing 64GB of RAM and one needing 100GB.
On top of that, as I mentioned in another comment, packing the data more tightly often makes random access faster than a standard Vec, not slower. Better cache locality means the CPU spends less time waiting for data from main memory, and that performance gain often outweighs the tiny cost of the bit-fiddling instructions.
gspr 17 hours ago
I've certainly found myself in a situation where a few percent of memory use would be the difference between holding all data in RAM and having to do IO to load it on demand. 6 of 16 bits wasted is enormous then.
robert3005 16 hours ago
You might be interested in https://github.com/spiraldb/fastlanes. It gives you fast bit-packed vectors, though it needs padding to 1024 elements to get the performance.
capyba 23 hours ago
Forgive me, much of the article went over my head, so I'm trying to understand: the article is excellently written, but I'm struggling to see why a `Vec<u8>` wouldn't have sufficed. And setting that aside, is there a way to extend this to floats? For instance, if I have a collection of values that I know will never exceed the bounds of +/- 10.0 with 1e-6 precision, can I use this technique to store them more efficiently?
dathinab 23 hours ago
> wouldn’t have sufficed?
Most of the time it will suffice and be the recommended solution (less complexity; no "oops, I thought I only needed 3 bits but at runtime needed 4" situations; one less dependency to review for soundness issues/bugs with every update).
While using a Vec<u64> for 3-bit numbers is a pretty strange/misleading example (you could just use a Vec<u8>), my guess is that they used it because many packed vectors operate on u64s internally and allow up to 64-bit values.
Anyway, while it often doesn't matter, sometimes storing a u8 where you only need a 3-bit number (or similar) makes a huge difference (e.g. when, with packing, your application state fits in memory and without it it doesn't and starts swapping).
You can do similar stuff for non-integer values, but things get more complicated because you can't just truncate/mask floats the way you can integers. Further complications come from decimal-based precision bounds and the question of whether you need perfect precision across the whole number range (floats aren't evenly spaced).
A common solution is to use fixed-point numbers, or in general to convert your floats to integers. For example, assuming a perfect precision requirement over your range, you could store a number in the range [-1e7; 1e7] and remember, by context/in code, that there is an implicit division by 1e6. Then you can store it in a 25-bit integer and bit-pack it (log2(1e7).ceil() + 1 sign bit).
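In code, for the +/- 10.0 with 1e-6 precision example from the question, that conversion would look roughly like this (the scale factor and rounding choice are just one option):

```rust
const SCALE: f64 = 1e6; // 1e-6 precision -> scale by 1e6

// 10.0 * 1e6 = 1e7 < 2^24, so 24 magnitude bits + 1 sign bit = 25 bits suffice.
fn encode(x: f64) -> i32 {
    debug_assert!((-10.0..=10.0).contains(&x));
    (x * SCALE).round() as i32
}

fn decode(q: i32) -> f64 {
    q as f64 / SCALE
}

// To store the result in an unsigned bit-packed vector, you'd typically add an
// offset first, e.g. (encode(x) + 10_000_000) as u64, so every value is >= 0.
```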
lukefleed 16 hours ago
The other replies nailed it, but I'll add my two cents.
Vec<u8> may be the right call for most use cases. This library, however, is for when even 8 bits is too much for values that only need 4. Another example: if all your values fit in 9 bits, you'd be forced to use a Vec<u16> and waste 7 bits for every single number. With this structure, you just use 9 bits. That's almost a 2x space saving, and at scale that makes a difference.
For floats, you'd use fixed-point math, just like the other replies said. Your example of +/- 10.0 with 1e-6 precision would mean multiplying by 1,000,000 and storing the result as a 25-bit signed integer. When you read it back, you just divide. It's a decent saving over a 32-bit float.
Dylan16807 20 hours ago
If you need absolute precision then floating point isn't going to be optimal. In particular, your constraints need pretty much all the precision a 32-bit float has to offer, and you'd still need a handful of exponent bits, so your total savings would be something like 4 bits.
simgt 22 hours ago
For the 8 bits case in the first benchmark, why isn't `Vec<u8>` just as fast as the rest? Is it due to the compiler emitting poorer instructions in that case or doing extra checks the other implementations don't?
lukefleed 16 hours ago
It's not about poorer instructions; a get_unchecked on a Vec<u8> is just a single memory access, which is as good as it gets. The difference is likely down to cache locality effects created by the benchmark loop itself.
The benchmark does a million random reads. For the FixedVec implementation with bit_width=8, the underlying storage is a Vec<u64>. This means the data is 8x more compact than the Vec<u8> baseline for the same number of elements.
When the random access indices are generated, they are more likely to fall within the same cache lines for the Vec<u64>-backed structure than for the Vec<u8>. Even though Vec<u8> has a simpler access instruction, it suffers more cache misses across the entire benchmark run.
tln 1 day ago
"The first step is to abstract the physical storage layer"
How deliciously rusty
burnt-resistor 15 hours ago
Tangentially (un)related: I recently had to implement an arbitrary bit vector Rust type backed by bytes, 0-padded on the MSB side. It also offers an AsRef<[bool]> implementation and an iterator. Probably going to add serde as a feature and crate it at some point. The hardest part was removing bit(s); push() and append() were cake.