A Stable Bloom Filter (SBF) implementation in Go, providing an approximate membership data structure with support for element decay over time.
It allows you to efficiently test whether an element is likely present in a set, with a configurable false positive rate, and automatically forgets old elements based on a decay mechanism.
- Approximate Membership Testing: Quickly check if an element is likely in the set.
- Element Decay: Automatically removes old elements over time to prevent filter saturation.
- Concurrent Access: Safe for use by multiple goroutines simultaneously.
- Customizable Parameters: Configure false positive rate, decay rate, and decay interval.
- Optimized for Performance: Efficient memory usage and fast operations.
- Scalability: Designed to handle large data sets and high-throughput applications.
```bash
go get github.com/Alfex4936/sbf-go
```
Here's how to create a new Stable Bloom Filter and use it to add and check elements:
```go
package main

import (
	"fmt"
	"time"

	"github.com/Alfex4936/sbf-go"
)

func main() {
	// Parameters for the Stable Bloom Filter
	expectedItems := uint32(1_000_000) // Expected number of items
	falsePositiveRate := 0.01          // Desired false positive rate (1%)

	// Create a Stable Bloom Filter with default decay settings
	sbfInstance, err := sbf.NewDefaultStableBloomFilter(expectedItems, falsePositiveRate, 0, 0)
	if err != nil {
		panic(err)
	}
	defer sbfInstance.StopDecay() // Ensure resources are cleaned up

	// Add an element
	element := []byte("example_element")
	sbfInstance.Add(element)

	// Check if the element is in the filter
	if sbfInstance.Check(element) {
		fmt.Println("Element is probably in the set.")
	} else {
		fmt.Println("Element is definitely not in the set.")
	}

	// Wait for some time to let decay happen
	time.Sleep(2 * time.Minute)

	// Check again after decay
	if sbfInstance.Check(element) {
		fmt.Println("Element is probably still in the set.")
	} else {
		fmt.Println("Element has likely decayed from the set.")
	}
}
```
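Note that with the default decay settings described below (a `decayRate` of `0.01` applied every minute), each of the element's set bits survives two decay rounds with probability 0.99², so the element still tests present with probability roughly 0.99^(2k); assuming the classic choice of about k = 7 hash functions for a 1% false positive target, that is around 87%, which is why the second check usually still reports the element as present. Longer waits or a higher `decayRate` make decay observable sooner.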
In this example, we'll use the Stable Bloom Filter to detect duplicates among a stream of user registrations. This can be useful in preventing duplicate entries, replay attacks, or filtering repeated events.
```go
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/Alfex4936/sbf-go"
)

func main() {
	// Parameters for the Stable Bloom Filter
	expectedItems := uint32(1_000_000) // Expected number of items
	falsePositiveRate := 0.01          // Desired false positive rate (1%)

	// Create a Stable Bloom Filter with default decay settings
	sbfInstance, err := sbf.NewDefaultStableBloomFilter(expectedItems, falsePositiveRate, 0, 0)
	if err != nil {
		panic(err)
	}
	defer sbfInstance.StopDecay() // Ensure resources are cleaned up

	// Seed the random number generator
	rand.Seed(time.Now().UnixNano())

	// Simulate a stream of usernames with potential duplicates
	totalUsers := 1_000_000
	maxUserID := totalUsers / 2 // Adjust to increase the chance of duplicates
	duplicateCount := 0

	for i := 0; i < totalUsers; i++ {
		// Generate a username (e.g., "user_123456")
		userID := rand.Intn(maxUserID)
		username := fmt.Sprintf("user_%d", userID)

		// Check if the username might have been seen before
		if sbfInstance.Check([]byte(username)) {
			// Likely a duplicate
			duplicateCount++
		} else {
			// New username, add it to the filter
			sbfInstance.Add([]byte(username))
		}
	}

	fmt.Printf("Total users processed: %d\n", totalUsers)
	fmt.Printf("Potential duplicates detected: %d\n", duplicateCount)

	// Estimate the false positive rate after processing
	estimatedFPR := sbfInstance.EstimateFalsePositiveRate()
	fmt.Printf("Estimated false positive rate: %.6f\n", estimatedFPR)
}
```
Output:

```
Total users processed: 1000000
Potential duplicates detected: 567435
Estimated false positive rate: 0.000107
```
Explanation:
- We simulate one million user registrations, with user IDs ranging from 0 to 499,999. This means duplicates are likely, as we have more registrations than unique user IDs.
- We use the SBF to check for duplicates:
  - If `Check` returns `true`, we increment the `duplicateCount`.
  - If `Check` returns `false`, we add the username to the filter.
- The high number of duplicates detected is expected due to the limited range of user IDs, not because of false positives.
- The estimated false positive rate is very low (`0.0107%`), indicating that almost all duplicates detected are actual duplicates; a quick sanity check on the duplicate count follows this list.
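As a sanity check (an analytical estimate, not library output): drawing n = 1,000,000 IDs uniformly from m = 500,000 possible values yields an expected m(1 − (1 − 1/m)^n) ≈ 500,000(1 − e⁻²) ≈ 432,332 distinct usernames, hence roughly 567,668 expected duplicates, which agrees closely with the observed 567,435.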
- High Throughput Systems: Applications that require fast insertion and query times with minimal memory overhead.
- Streaming Data: Scenarios where data is continuously flowing, and old data becomes less relevant over time.
- Duplicate Detection: Identifying duplicate events or entries without storing all elements.
- Cache Expiration: Probabilistically determining if an item is still fresh or should be re-fetched (see the sketch after this list).
- Approximate Membership Testing: When exact membership testing is less critical than speed and memory usage.
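As an illustration of the cache-expiration pattern, here is a minimal sketch (not part of the library; the in-memory `cache` map and `fetchFromOrigin` are hypothetical stand-ins): presence in the filter marks a key as recently fetched, and decay naturally expires that mark, triggering a re-fetch.

```go
package main

import (
	"fmt"

	"github.com/Alfex4936/sbf-go"
)

// Hypothetical cache and origin stubs, for illustration only.
var cache = map[string][]byte{}

func fetchFromOrigin(key string) []byte { return []byte("fresh:" + key) }

// getFresh treats SBF membership as "recently fetched": a miss (or a
// decayed entry) triggers a re-fetch and re-marks the key as fresh.
// Note: a false positive can skip a fetch, so this is probabilistic.
func getFresh(f *sbf.StableBloomFilter, key string) []byte {
	if !f.Check([]byte(key)) {
		data := fetchFromOrigin(key)
		cache[key] = data
		f.Add([]byte(key)) // decay will eventually expire this mark
		return data
	}
	return cache[key]
}

func main() {
	f, err := sbf.NewDefaultStableBloomFilter(100_000, 0.01, 0, 0)
	if err != nil {
		panic(err)
	}
	defer f.StopDecay()

	fmt.Printf("%s\n", getFresh(f, "user:42")) // first call fetches
	fmt.Printf("%s\n", getFresh(f, "user:42")) // likely served from cache
}
```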
- Exact Membership Required: Applications that cannot tolerate false positives or require exact deletions of elements.
- Small Data Sets: When the data set is small enough to be stored and managed with precise data structures.
- Sensitive Data: Scenarios where the cost of a false positive is too high (e.g., financial transactions, critical security checks).
- Complex Deletion Requirements: If you need to delete specific elements immediately, a Stable Bloom Filter is not suitable.
Parameters:

- `expectedItems`: The estimated number of unique items you expect to store in the filter. This helps calculate the optimal size of the filter.
- `falsePositiveRate`: The desired probability of false positives. Lowering this value reduces false positives but increases memory usage.
- `decayRate`: The probability that each bit in the filter will decay (be unset) during each decay interval. Default is `0.01` (1%).
- `decayInterval`: The time duration between each decay operation. Default is `1 * time.Minute`.

Choosing `decayRate` and `decayInterval`:

- Element Retention Time: If you want elements to persist longer in the filter, decrease the `decayRate` or increase the `decayInterval` (see the sketch after this list).
- High Insertion Rate: For applications with high insertion rates, you may need a higher `decayRate` or shorter `decayInterval` to prevent the filter from becoming saturated.
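For instance, to favor longer element retention you might halve the decay rate and lengthen the interval. A minimal sketch, assuming the constructor's last two arguments are `decayRate` and `decayInterval` (the examples above pass `0, 0` to select the defaults):

```go
package main

import (
	"time"

	"github.com/Alfex4936/sbf-go"
)

func main() {
	// Tuning decay for longer element retention. Assumption: the last two
	// arguments are decayRate and decayInterval, with zero values selecting
	// the defaults listed above.
	sbfInstance, err := sbf.NewDefaultStableBloomFilter(
		uint32(1_000_000), // expectedItems
		0.01,              // falsePositiveRate
		0.005,             // decayRate: half the 1% default, so bits decay more slowly
		5*time.Minute,     // decayInterval: decay runs five times less often
	)
	if err != nil {
		panic(err)
	}
	defer sbfInstance.StopDecay()
}
```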
The Stable Bloom Filter is designed to be scalable and can handle large data sets and high-throughput applications efficiently. Here's how:
- Thread-Safe Operations: The SBF implementation uses atomic operations, making it safe for concurrent use by multiple goroutines without additional locking mechanisms.
- High Throughput: Insertion (`Add`) and query (`Check`) operations are fast, with constant time complexity `O(k)`, where `k` is the number of hash functions. This allows the filter to handle a high rate of operations per second.
- Low Memory Footprint: The SBF is space-efficient, requiring minimal memory to represent large sets. Memory usage is directly related to the desired false positive rate and the expected number of items.
- Configurable Parameters: Adjusting the `falsePositiveRate` and `expectedItems` allows you to scale the filter to match your application's memory constraints and performance requirements.
- Sharding Filters: For extremely large data sets or to distribute load, you can partition your data and use multiple SBF instances (shards). Each shard handles a subset of the data, allowing the system to scale horizontally.
- Distributed Systems: In distributed environments, you can deploy SBF instances across multiple nodes, ensuring that each node maintains its own filter or shares filters through a coordination mechanism.
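For example, a sharded setup might look like the following sketch. The FNV-1a hash inside `getShardIndex` is an illustrative choice, not a library requirement: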
```go
package main

import (
	"fmt"
	"hash/fnv"

	"github.com/Alfex4936/sbf-go"
)

// getShardIndex determines which shard to use by hashing the element.
// FNV-1a is used here purely as an illustrative hash function.
func getShardIndex(element []byte, numShards int) int {
	h := fnv.New32a()
	h.Write(element)
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	expectedItems := uint32(1_000_000)
	falsePositiveRate := 0.01

	// Number of shards (e.g., based on the number of CPUs or nodes)
	numShards := 10
	sbfShards := make([]*sbf.StableBloomFilter, numShards)

	// Initialize each shard with its share of the expected items
	for i := 0; i < numShards; i++ {
		sbfInstance, err := sbf.NewDefaultStableBloomFilter(expectedItems/uint32(numShards), falsePositiveRate, 0, 0)
		if err != nil {
			panic(err)
		}
		sbfShards[i] = sbfInstance
		defer sbfInstance.StopDecay()
	}

	// Adding and checking elements
	element := []byte("example_element")
	shardIndex := getShardIndex(element, numShards)
	sbfShards[shardIndex].Add(element)
	if sbfShards[shardIndex].Check(element) {
		fmt.Println("Element is probably in the set.")
	}
}
```
Explanation:
- Sharding Logic: Elements are distributed among shards based on a hash function. This reduces the load on individual filters and allows the system to handle more data and higher throughput.
- Scalability: By adding more shards, you can scale horizontally to accommodate growing data volumes or increased performance demands.
- Consistent Hashing: Use consistent hashing to minimize data redistribution when adding or removing shards (a minimal hash-ring sketch follows this list).
- Synchronization: In some cases, you might need to synchronize filters or handle cross-shard queries, which can add complexity.
- Monitoring and Balancing: Monitor the load on each shard to ensure even distribution and adjust the sharding strategy if necessary.
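For the consistent-hashing consideration, here is a minimal hash-ring sketch, independent of the SBF API (the `replicas` count of virtual nodes per shard is a common trick to smooth the distribution):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a minimal consistent-hash ring: shard names are hashed onto a
// circle, and an element maps to the first shard clockwise from its hash.
type ring struct {
	points []uint32          // sorted hash positions on the circle
	shards map[uint32]string // position -> shard name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(shardNames []string, replicas int) *ring {
	r := &ring{shards: make(map[uint32]string)}
	for _, name := range shardNames {
		// Place several virtual nodes per shard for smoother balancing.
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s#%d", name, i))
			r.points = append(r.points, p)
			r.shards[p] = name
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// lookup returns the shard responsible for the element; removing one shard
// only remaps keys that pointed at that shard's virtual nodes.
func (r *ring) lookup(element []byte) string {
	h := hash32(string(element))
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the circle
	}
	return r.shards[r.points[i]]
}

func main() {
	r := newRing([]string{"shard-0", "shard-1", "shard-2"}, 100)
	fmt.Println(r.lookup([]byte("example_element")))
}
```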
- Memory Efficiency: Bloom filters are space-efficient, requiring minimal memory to represent large sets.
- Fast Operations: Both insertion (`Add`) and query (`Check`) operations are fast, with constant time complexity `O(k)`, where `k` is the number of hash functions.
- Concurrency: The implementation is safe for concurrent use by multiple goroutines without additional locking mechanisms (a benchmark sketch follows this list).
- Decay Overhead: The decay process runs in a separate goroutine. The overhead is minimal but should be considered in resource-constrained environments.
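To validate the throughput and concurrency claims on your own hardware, a micro-benchmark sketch along these lines may help (the test package name is illustrative; run it with `go test -bench=.`):

```go
package sbf_test

import (
	"strconv"
	"testing"

	"github.com/Alfex4936/sbf-go"
)

// BenchmarkAddCheck exercises Add and Check from multiple goroutines,
// which also sanity-checks the concurrency guarantees.
func BenchmarkAddCheck(b *testing.B) {
	f, err := sbf.NewDefaultStableBloomFilter(1_000_000, 0.01, 0, 0)
	if err != nil {
		b.Fatal(err)
	}
	defer f.StopDecay()

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			key := []byte("user_" + strconv.Itoa(i))
			f.Add(key)
			f.Check(key)
			i++
		}
	})
}
```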
- No Deletion of Specific Elements: You cannot remove specific elements from the filter. Elements decay over time based on the decay parameters.
- False Positives: The filter can return false positives (i.e., it may indicate that an element is present when it's not). The false positive rate is configurable but cannot be entirely eliminated.
- Not Suitable for Counting: If you need to count occurrences of elements, consider using a Counting Bloom Filter instead.
- Sharding Complexity: While sharding allows horizontal scaling, it introduces additional complexity in managing multiple filters and ensuring consistent hashing.
This project is licensed under the MIT License.