kritibehl 6 hours ago

Most distributed job queue implementations rely on lease timeouts to prevent duplicate execution. The assumption is: if a worker hasn't checked in by the deadline, it's safe to reassign the job.

The assumption is wrong. A slow worker can stall past the deadline, then finish the job and commit the result without ever seeing the expiry, while another worker has already picked up the "abandoned" job. Two workers, one job, undefined behavior.

Faultline uses two layers to prevent this:

Fencing tokens — every lease acquisition increments a monotonic counter. A worker holding generation 6 cannot commit once the job has been re-leased at generation 7: the application layer rejects the stale write before it reaches the database.
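A minimal sketch of the fencing-token check, using sqlite3 instead of PostgreSQL so it runs standalone; table and column names here are illustrative, not Faultline's actual schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE jobs (
    job_id        TEXT PRIMARY KEY,
    fencing_token INTEGER NOT NULL,
    status        TEXT NOT NULL DEFAULT 'pending',
    result        TEXT)""")
db.execute("INSERT INTO jobs (job_id, fencing_token) VALUES ('job-1', 0)")

def acquire_lease(job_id: str) -> int:
    """Every acquisition bumps the monotonic counter; the worker keeps the token."""
    db.execute("UPDATE jobs SET fencing_token = fencing_token + 1 WHERE job_id = ?",
               (job_id,))
    return db.execute("SELECT fencing_token FROM jobs WHERE job_id = ?",
                      (job_id,)).fetchone()[0]

def commit(job_id: str, token: int, result: str) -> bool:
    """Compare-and-set: the write lands only if the token is still current."""
    cur = db.execute(
        "UPDATE jobs SET status = 'done', result = ? "
        "WHERE job_id = ? AND fencing_token = ?",
        (result, job_id, token))
    return cur.rowcount == 1

stale = acquire_lease("job-1")   # worker A gets generation 1
fresh = acquire_lease("job-1")   # lease reclaimed: worker B gets generation 2
print(commit("job-1", stale, "A's result"))  # False: stale write rejected
print(commit("job-1", fresh, "B's result"))  # True: current generation commits
```

The key property is that rejection needs no clock comparison: the stale worker's token simply matches zero rows in the compare-and-set.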

UNIQUE(job_id, fencing_token) — a database constraint that makes duplicate commits physically impossible regardless of application logic. Even if there's a bug in the token check, the DB rejects the second insert.
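A sketch of the database-level backstop, again with sqlite3 standing in for PostgreSQL and an illustrative commits table: the unique constraint rejects a second insert for the same generation even when no application code intervenes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE commits (
    job_id        TEXT NOT NULL,
    fencing_token INTEGER NOT NULL,
    result        TEXT NOT NULL,
    UNIQUE (job_id, fencing_token))""")

db.execute("INSERT INTO commits VALUES ('job-1', 6, 'first attempt')")
try:
    # Even if a buggy token check lets this through, the constraint
    # rejects the duplicate insert at the database itself.
    db.execute("INSERT INTO commits VALUES ('job-1', 6, 'duplicate attempt')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

Defense in depth: the application-layer token check is the fast path, and the constraint is the guarantee that survives application bugs.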

I wanted to actually validate these guarantees rather than just reason about them, so I built a 29-assertion drill suite covering 16 failure scenarios (mid-execution crashes, stale worker commit attempts, concurrent reclaim races, retry exhaustion, network interruptions during commit).

Then I added FaultProxy — a configurable psycopg2 wrapper that injects latency, connection drops and query timeouts independently. I ran the full 500-reproduction suite at 0%, 5% and 10% fault injection rates. Results across 1,500 total race reproductions: 0 duplicate commits, 1,500 stale-write rejections confirmed.
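To illustrate the wrapping pattern (not FaultProxy's actual implementation, which wraps psycopg2 and injects latency, drops, and timeouts independently), here is a toy proxy around a DB-API cursor that injects one failure mode, connection drops, at a configurable rate:

```python
import random
import sqlite3

class ToyFaultProxy:
    """Illustrative fault injector: wraps a DB-API cursor and
    probabilistically raises before each query reaches the database."""

    def __init__(self, cursor, drop_rate: float, rng: random.Random):
        self._cursor = cursor
        self._drop_rate = drop_rate
        self._rng = rng

    def execute(self, sql, params=()):
        if self._rng.random() < self._drop_rate:
            # Simulate the connection dropping before the query is sent.
            raise ConnectionError("injected connection drop")
        return self._cursor.execute(sql, params)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER)")
proxy = ToyFaultProxy(db.cursor(), drop_rate=0.10, rng=random.Random(42))

failures = 0
for i in range(500):
    try:
        proxy.execute("INSERT INTO t VALUES (?)", (i,))
    except ConnectionError:
        failures += 1
print(failures)  # number of injected drops; roughly 10% of 500 attempts
```

Seeding the RNG makes each run reproducible, which is what lets a drill suite assert exact invariants (zero duplicate commits) rather than statistical ones.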

The repo has the full drill suite, FaultProxy implementation and instructions for running against a real PostgreSQL instance.

GitHub: https://github.com/kritibehl/faultline

Curious what failure modes others have hit in distributed job processing that aren't covered by the fencing token + DB constraint approach.