A real-time Bluesky Jetstream firehose consumer that filters and forwards posts to Kafka topics based on configurable rules.
- Connect to Bluesky's firehose via Jetstream
- Filter posts using configurable regex patterns
- Forward matched posts to Kafka topics
- Prometheus-compatible metrics endpoint
- Health monitoring endpoint
- Configurable historical backfill
- Docker-based development environment
Edit docker-config.json to set up rules for what you want to monitor, then run:
docker compose up --build -d
or
just start
and Docker Compose will do everything for you.
It's pretty efficient, resource-wise:
$ docker stats
CONTAINER ID   NAME               CPU %    MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O        PIDS
24aa1f3f6169   skygrep            33.67%   59.73MiB / 15.66GiB   0.37%    1.71GB / 23.5MB   169MB / 3.46MB   5
962906967b45   redpanda-console   0.01%    24.26MiB / 15.66GiB   0.15%    153kB / 269kB     167MB / 8.19kB   10
3d5bbc67bf81   redpanda-0         1.03%    1.824GiB / 15.66GiB   11.65%   2.85MB / 397kB    255MB / 7.21MB   3
- Clone the repository:
git clone https://codeberg.org/hrbrmstr/skygrep.git
cd skygrep
- Create a config.json file:
{
"jetstream": {
"endpoint": "wss://jetstream2.us-east.bsky.network/subscribe"
},
"kafka": {
"brokers": ["localhost:19092"]
},
"rules": [
{
"field": "text",
"pattern": "(?i)(bitcoin|crypto|eth|nft)",
"kafkaTopic": "crypto_posts"
},
{
"field": "text",
"pattern": "(?i)CVE-\\d{4}-\\d{4,}",
"kafkaTopic": "cve_mentions"
}
]
}
- Start the development environment:
just dev
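Each rule in config.json pairs a post field with a regular expression and a destination topic. The sketch below illustrates that idea in Python so you can try patterns before deploying them. It is not Skygrep's own code: the helper names `compile_rules` and `matching_topics` are hypothetical, and it assumes a post is a flat dictionary with a `text` key.

```python
import json
import re

# Same rules as the config.json example; only the "rules" key is used here.
CONFIG = json.loads(r"""
{
  "rules": [
    {"field": "text", "pattern": "(?i)(bitcoin|crypto|eth|nft)", "kafkaTopic": "crypto_posts"},
    {"field": "text", "pattern": "(?i)CVE-\\d{4}-\\d{4,}", "kafkaTopic": "cve_mentions"}
  ]
}
""")


def compile_rules(config):
    """Compile each rule's pattern once so matching a post is a plain search."""
    return [
        (rule["field"], re.compile(rule["pattern"]), rule["kafkaTopic"])
        for rule in config["rules"]
    ]


def matching_topics(post, rules):
    """Return the topics whose pattern matches the named field of the post."""
    topics = []
    for field, pattern, topic in rules:
        value = post.get(field, "")
        if isinstance(value, str) and pattern.search(value):
            topics.append(topic)
    return topics


RULES = compile_rules(CONFIG)
print(matching_topics({"text": "Patch CVE-2024-12345 now"}, RULES))
print(matching_topics({"text": "A new method for baking bread"}, RULES))
```

The second post lands in crypto_posts because the unanchored alternative `eth` matches inside "method". If that is too noisy for your use case, wrapping the alternation in word boundaries, for example `(?i)\b(bitcoin|crypto|eth|nft)\b`, narrows it to whole words.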
- just build: build the CLI
- just clean: clean up Docker resources (this also deletes the volume)
- just default: show tasks
- just dev: dev mode
- just health-check: monitor the health of Skygrep
- just reset: rebuild and run a fresh instance (this also deletes the volume)
- just start: start services
- just stop: stop Docker without deleting the volume
- just watch-metrics: watch metrics with live updates every 5 seconds
Access metrics at http://localhost:3030/metrics
Example response:
{
"crypto_posts": 42,
"cve_mentions": 7
}
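If you poll the metrics endpoint yourself, comparing two snapshots shows how many matches arrived in between. The following is a minimal sketch that assumes the counts are cumulative per topic, as the example response suggests; the function name `match_deltas` and the second snapshot's values are made up for illustration.

```python
import json


def match_deltas(previous, current):
    """Per-topic increase in match counts between two metrics snapshots."""
    return {
        topic: count - previous.get(topic, 0)
        for topic, count in current.items()
    }


# First snapshot is the example response; the second is hypothetical.
earlier = json.loads('{"crypto_posts": 42, "cve_mentions": 7}')
later = json.loads('{"crypto_posts": 50, "cve_mentions": 7, "new_topic": 3}')

print(match_deltas(earlier, later))
```

A topic that first appears in the later snapshot is treated as having started from zero.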
Access health status at http://localhost:3030/health
Example response:
{
"status": "healthy",
"uptime_ms": 20918,
"last_event_ms_ago": 0
}
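A monitoring script can use the health response to detect a stalled consumer. The sketch below checks both the reported status and how long ago the last event arrived. The 60 second idle threshold and the helper names are assumptions for illustration, not values defined by Skygrep.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://localhost:3030/health"


def is_healthy(body, max_idle_ms=60_000):
    """True if status is "healthy" and the last event is recent enough."""
    payload = json.loads(body)
    idle_ms = payload.get("last_event_ms_ago")
    return (
        payload.get("status") == "healthy"
        and isinstance(idle_ms, (int, float))
        and idle_ms <= max_idle_ms
    )


def check_health(url=HEALTH_URL, timeout=5):
    """Fetch the health endpoint and evaluate it (requires a running instance)."""
    with urlopen(url, timeout=timeout) as response:
        return is_healthy(response.read().decode("utf-8"))


# Evaluate the example response without touching the network.
example = '{"status": "healthy", "uptime_ms": 20918, "last_event_ms_ago": 0}'
print(is_healthy(example))
```

Call `check_health()` against a running instance; `is_healthy` on its own is handy for testing alert logic offline.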
The application consists of several key components:
- Jetstream Client: Connects to Bluesky's firehose and receives real-time posts
- Kafka Producer: Forwards matched posts to configured Kafka topics
- Rule Engine: Applies regex patterns to filter relevant posts
- Metrics Server: Exposes operational metrics and health status
- Redpanda: Kafka-compatible event streaming platform
  - Kafka API: localhost:19092
  - Schema Registry: localhost:18081
  - Admin API: localhost:19644
- Redpanda Console: Web UI for managing Kafka
  - Interface: http://localhost:9080
- Skygrep:
  - Health: http://localhost:3030/health
  - Metrics: http://localhost:3030/metrics
Command line flags:
- --hours: Number of hours to look back in history (default: 24)
- --port: HTTP server port (default: 3030)
- --help: Show help message
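Jetstream supports replay through a `cursor` query parameter expressed as a Unix timestamp in microseconds. One plausible way to turn a lookback such as `--hours` into a subscription URL is sketched below. Whether Skygrep builds its URL this way is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

JETSTREAM_ENDPOINT = "wss://jetstream2.us-east.bsky.network/subscribe"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def backfill_cursor(hours, now=None):
    """Unix time in microseconds for the moment `hours` before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    # Integer division of timedeltas avoids floating point rounding.
    return (start - EPOCH) // timedelta(microseconds=1)


def subscribe_url(hours, now=None, endpoint=JETSTREAM_ENDPOINT):
    """Subscription URL for post records, starting `hours` in the past."""
    params = {
        "wantedCollections": "app.bsky.feed.post",
        "cursor": backfill_cursor(hours, now),
    }
    return f"{endpoint}?{urlencode(params)}"


fixed_now = datetime(2025, 1, 2, tzinfo=timezone.utc)
print(subscribe_url(24, now=fixed_now))
```

Passing a fixed `now` keeps the result reproducible; omit it to compute the cursor relative to the current time.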
The application provides:
- Real-time metrics for rule matches
- Health status monitoring
- Graceful shutdown on SIGINT/SIGTERM
- Connection status logging
MIT