On implementing distributed locks

I’ve been thinking about distributed locks lately. Not because they’re particularly exciting, but because they’re one of those problems that seems simple until you actually try to implement them correctly. You need to prevent two processes from doing the same thing at the same time, across multiple machines. How hard could it be?

Turns out, it’s pretty hard.

The classic scenario: you have a distributed cron job that should only run once across your entire stack. Or maybe you’re processing payments and need to ensure only one service instance handles a specific transaction at any given time. Local locks won’t help you because you need coordination across machines.

Why Redis for Distributed Locks?

Before the flamewar starts over the word ‘Redis’: yes, there are other options. Zookeeper is the enterprise choice, battle-tested and reliable, and so is Consul. But Redis has a compelling advantage: you probably already have it. If you don’t, or you’d rather use something with a more permissive open source license, read ‘Valkey’ wherever I say Redis.

Redis is fast and supports atomic operations, which makes implementing locks straightforward. The key insight is that Redis’s SET command can atomically set a key only if it doesn’t exist, with an automatic expiration. That’s basically a lock primitive right there.

// The basic idea
await redis.set('my-lock', 'token', 'NX', 'PX', 10000);
// where
// NX: only set if not exists
// PX: expiration in milliseconds

Simple, right? But as always with distributed systems, the devil is in the details.

How Redis Locks Actually Work

The fundamental operation is SET key value NX PX milliseconds. This command does three things atomically:

Sets the key only if it doesn’t already exist (NX)
Sets an expiration time (PX)
Returns OK if it succeeded, nil if the key already existed

Here’s a basic lock implementation in TypeScript:

class RedisLock {
  private redis: Redis;
  private lockKey: string;
  private lockValue: string;
  private ttl: number;

  constructor(redis: Redis, resourceName: string, ttl: number = 10000) {
    this.redis = redis;
    this.lockKey = `lock:${resourceName}`;
    this.lockValue = crypto.randomUUID(); // unique token
    this.ttl = ttl;
  }

  async acquire(): Promise<boolean> {
    const result = await this.redis.set(
      this.lockKey,
      this.lockValue,
      'NX',
      'PX',
      this.ttl
    );
    return result === 'OK';
  }

  async release(): Promise<void> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;
    await this.redis.eval(script, 1, this.lockKey, this.lockValue);
  }
}

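Why a Lua script instead of just calling DEL from the SDK? Because you can’t blindly delete the key: what if your lock expired and someone else acquired it? The script deletes the key only if it still holds your token, and Redis runs the whole script atomically, so nothing can slip in between the GET and the DEL. (How is ownership checked? Every lock is stamped with a unique random token, and only the client that knows the token can release it.)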
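
To see the failure that the ownership check prevents, here is a small in-memory model. `FakeRedis`, its hand-driven clock, and the two release helpers are hypothetical stand-ins written for this illustration, not a real client API; they mimic only SET NX PX, GET, and DEL.

```typescript
// In-memory stand-in for the few Redis behaviors a lock relies on.
// Hypothetical: time is advanced by hand so expiry is deterministic.
class FakeRedis {
  private data = new Map<string, { value: string; expiresAt: number }>();
  now = 0; // fake clock, in milliseconds

  // Mimics SET key value NX PX ttl: 'OK' on success, null if the key exists.
  setNxPx(key: string, value: string, ttlMs: number): 'OK' | null {
    if (this.get(key) !== null) return null;
    this.data.set(key, { value, expiresAt: this.now + ttlMs });
    return 'OK';
  }

  get(key: string): string | null {
    const entry = this.data.get(key);
    if (!entry || entry.expiresAt <= this.now) {
      this.data.delete(key); // expired keys behave as if they never existed
      return null;
    }
    return entry.value;
  }

  del(key: string): number {
    return this.get(key) !== null && this.data.delete(key) ? 1 : 0;
  }
}

// Unsafe release: deletes whatever is stored under the key.
function releaseBlindly(redis: FakeRedis, key: string): number {
  return redis.del(key);
}

// Safe release: same logic as the Lua script, delete only if the token matches.
function releaseIfOwner(redis: FakeRedis, key: string, token: string): number {
  return redis.get(key) === token ? redis.del(key) : 0;
}

// Client A locks, stalls past the TTL, client B takes over, then A releases.
function scenario(release: (r: FakeRedis, key: string) => number): string | null {
  const redis = new FakeRedis();
  redis.setNxPx('lock:job', 'token-A', 1000);
  redis.now = 1500; // A's lock has expired by now
  redis.setNxPx('lock:job', 'token-B', 1000);
  release(redis, 'lock:job'); // A finally releases "its" lock
  return redis.get('lock:job'); // who holds the lock now?
}

const afterBlind = scenario((r, key) => releaseBlindly(r, key));
const afterChecked = scenario((r, key) => releaseIfOwner(r, key, 'token-A'));
console.log(afterBlind, afterChecked); // null token-B
```

With the blind delete, client B’s lock vanishes while B is still working. With the token check, A’s late release is a no-op.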

The Redlock Algorithm

The single instance approach works fine for many use cases, but what about when Redis itself fails? This is where things get interesting.

Salvatore Sanfilippo (antirez), Redis’s creator, proposed the Redlock algorithm.

The idea: instead of using a single Redis instance, use an odd number of independent instances (typically 5).

To acquire a lock:

Get the current time in milliseconds
Try to acquire the lock on all N instances sequentially
Calculate how long acquisition took
Consider the lock acquired only if:
- You got it on the majority of instances (floor(N/2) + 1, so 3 out of 5)
- The total acquisition time is less than the lock validity time

You also need to allow for drift between the clocks of the Redis instances, because each instance expires keys according to its own clock.

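interface Lock {
  token: string;
  validityTime: number; // milliseconds the lock can still be trusted for
}

class Redlock {
  private instances: Redis[];
  private quorum: number;

  constructor(instances: Redis[]) {
    this.instances = instances;
    this.quorum = Math.floor(instances.length / 2) + 1;
  }

  async acquire(resource: string, ttl: number): Promise<Lock | null> {
    const token = crypto.randomUUID();
    const startTime = Date.now();
    const driftFactor = 0.01; // 1% drift
    const drift = Math.round(ttl * driftFactor) + 2;

    // Ask every instance (in parallel here); an unreachable instance counts as a failure
    const promises = this.instances.map(async (redis) => {
      try {
        const result = await redis.set(
          `lock:${resource}`,
          token,
          'NX',
          'PX',
          ttl
        );
        return result === 'OK';
      } catch (err) {
        return false;
      }
    });

    const results = await Promise.all(promises);
    const successCount = results.filter(r => r).length;

    const elapsed = Date.now() - startTime;
    const validityTime = ttl - elapsed - drift;

    if (successCount >= this.quorum && validityTime > 0) {
      return { token, validityTime };
    }

    // No quorum, or too slow: undo any partial acquisitions
    await this.release(resource, token);
    return null;
  }

  async release(resource: string, token: string): Promise<void> {
    const script = `
      if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
      else
        return 0
      end
    `;
    // Best effort on every instance, including ones that didn't answer during acquire
    await Promise.all(
      this.instances.map((redis) =>
        redis.eval(script, 1, `lock:${resource}`, token).catch(() => null)
      )
    );
  }
}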

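The drift calculation is crucial. Clocks on different machines don’t tick at exactly the same rate, so a key can expire on one instance slightly before your own clock says it should. If your lock has a 10-second TTL, you can’t assume you have the full 10 seconds: you have the TTL, minus the time you spent acquiring the lock, minus a safety margin for drift.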
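
To put numbers on that, here is the same validity calculation as in the Redlock class above, pulled out into a standalone function. The function name and the acquisition times are made up for illustration; the 1% drift factor and the 2 ms constant come from the code above.

```typescript
// Remaining time (ms) the lock can be trusted for once it has been acquired.
function computeValidity(ttl: number, elapsed: number, driftFactor = 0.01): number {
  const drift = Math.round(ttl * driftFactor) + 2; // safety margin for clock drift
  return ttl - elapsed - drift;
}

// 10-second TTL, 150 ms spent talking to the instances:
// drift = round(10000 * 0.01) + 2 = 102, so validity = 10000 - 150 - 102
const validity = computeValidity(10000, 150);
console.log(validity); // 9748

// If acquisition is slow enough, nothing is left and the lock must be abandoned.
const tooSlow = computeValidity(10000, 9950);
console.log(tooSlow > 0); // false
```

Whatever work you do under the lock has to fit inside that validity window, not inside the TTL you asked for.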

Martin Kleppmann’s Critique

Martin Kleppmann (shoutout to the epic Designing Data-Intensive Applications!) wrote a widely read critique of Redlock, arguing that it doesn’t provide the safety guarantees you might think it does.

His main points:

Clock drift and process pauses (like GC pauses) can cause you to think you hold a lock when you don’t
Redlock doesn’t provide fencing tokens to prevent this

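What’s a fencing token, you ask? It’s a number that goes up every time the lock is granted; the protected resource rejects any write that carries a token older than one it has already seen. To see why you’d want one, imagine: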

Client A acquires the lock
Client A experiences a long GC pause
The lock expires
Client B acquires the lock
Client A wakes up, thinks it still has the lock, proceeds to modify shared resource
Both clients are now modifying the same resource
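
A fencing token closes this gap by making the resource itself reject stale writers. The sketch below is a hypothetical in-memory illustration (the `FencedStore` class and the token numbers 33 and 34 are made up for the example): each lock grant carries a number that only ever increases, and the store refuses any write whose token is lower than the highest it has seen.

```typescript
// Hypothetical storage service that enforces fencing tokens.
class FencedStore {
  private highestToken = 0;
  private value: string | null = null;

  // Accept the write only if the token is at least as new as any seen so far.
  write(token: number, value: string): boolean {
    if (token < this.highestToken) return false; // stale lock holder
    this.highestToken = token;
    this.value = value;
    return true;
  }

  read(): string | null {
    return this.value;
  }
}

const store = new FencedStore();

// Client A got the lock with token 33, then paused; its lock expired.
// Client B got the lock with token 34 and writes first.
const bWrite = store.write(34, 'written by B');

// Client A wakes up and tries to write with its old token.
const aWrite = store.write(33, 'written by A');

console.log(bWrite, aWrite, store.read()); // true false written by B
```

The random token used in the lock code earlier can’t play this role as it stands, because random values carry no order; the numbers have to come from something that hands them out in increasing sequence.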

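Antirez disagreed in his reply: he argued that Redlock’s timing assumptions are reasonable on well-run machines, and that the random lock token can serve for compare-and-set checks on the resource. The practical takeaway, which comes from Kleppmann’s post, is to ask whether you need the lock for efficiency or for correctness. If you need absolute correctness, use a consensus system like Zookeeper together with fencing tokens. But for many practical applications (preventing duplicate cron jobs, avoiding unnecessary work), a Redis-based lock is fine.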

Production Considerations

In production, you need to think beyond just acquiring and releasing locks.

Lock Extension

What if your work takes longer than expected? You need to extend the lock. Note that PEXPIRE sets a fresh TTL counted from now; it doesn’t add to whatever time is left:

async extend(additionalTime: number): Promise<boolean> {
  const script = `
    if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("pexpire", KEYS[1], ARGV[2])
    else
      return 0
    end
  `;
  const result = await this.redis.eval(
    script,
    1,
    this.lockKey,
    this.lockValue,
    additionalTime
  );
  return result === 1;
}

Retry Logic

You usually don’t want to fail immediately if you can’t acquire a lock. You could implement retry with exponential backoff:

async acquireWithRetry(
  maxRetries: number = 3,
  baseDelay: number = 100
): Promise<boolean> {
  for (let i = 0; i < maxRetries; i++) {
    if (await this.acquire()) {
      return true;
    }
    const delay = baseDelay * Math.pow(2, i) + Math.random() * 100;
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  return false;
}
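
To get a feel for the waits that formula produces, the sketch below computes the delay schedule on its own. `backoffDelays` is a hypothetical helper, and the jitter source is injectable (an assumption made here so the output is reproducible); the formula itself is the one used in acquireWithRetry.

```typescript
// Delay before each retry: exponential growth plus up to 100 ms of jitter.
function backoffDelays(
  maxRetries: number,
  baseDelay: number,
  random: () => number = Math.random
): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxRetries; i++) {
    delays.push(baseDelay * Math.pow(2, i) + random() * 100);
  }
  return delays;
}

// With jitter switched off, the defaults (3 retries, 100 ms base) give:
const noJitter = backoffDelays(3, 100, () => 0);
console.log(noJitter); // [ 100, 200, 400 ]

// With real jitter, each delay lands in a 100 ms window above those values.
const jittered = backoffDelays(3, 100);
const withinWindow = jittered.every(
  (d, i) => d >= 100 * Math.pow(2, i) && d < 100 * Math.pow(2, i) + 100
);
console.log(withinWindow); // true
```

With the defaults, a caller gives up after roughly 0.7 to 1 second of waiting in total. That is far shorter than the 10-second default TTL, so if lock holders can legitimately run for several seconds, the retry budget probably needs to be larger.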

Read-Write Locks

Sometimes you want multiple readers but exclusive writers. Redis doesn’t have this built in, but you can build it from a read counter and a write key.

One common approach to guarantee consistency is to only allow writes when there are no reader locks.

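class RWLock {
  // ttl bounds how long a write lock can live, in milliseconds
  constructor(private redis: Redis, private ttl: number = 10000) {}

  async acquireRead(resource: string): Promise<boolean> {
    // Readers are counted; a reader releases by DECR-ing the same counter
    const script = `
      if redis.call("get", "write:" .. KEYS[1]) then
        return 0
      else
        return redis.call("incr", "read:" .. KEYS[1])
      end
    `;
    const result = await this.redis.eval(script, 1, resource);
    return Number(result) > 0;
  }

  async acquireWrite(resource: string): Promise<boolean> {
    // Compare the reader count as a number: a counter that has dropped
    // back to "0" is still a non-empty string, which Lua treats as true
    const script = `
      if tonumber(redis.call("get", "read:" .. KEYS[1]) or "0") > 0 or
         redis.call("get", "write:" .. KEYS[1]) then
        return 0
      else
        return redis.call("set", "write:" .. KEYS[1], "1", "NX", "PX", ARGV[1])
      end
    `;
    const result = await this.redis.eval(script, 1, resource, this.ttl);
    return result === 'OK';
  }
}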

But wait, does this mean a writer has to wait until there are no read locks at all before it can get its exclusive lock? What if your application is read-intensive? Could a writer end up waiting forever?

Based on the above implementation, yes. But we could make locking fair with queues to solve this.

Making Locking Fair

The basic implementation above is what you could call unfair. Say a client acquires a lock and then releases it: nothing stops it from immediately reacquiring the lock while other clients starve.

For fairness, we need a queue. Luckily, we don’t need to adopt another system like RabbitMQ or Kafka: a Redis sorted set, scored by arrival time, works as one.

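async acquireFair(resource: string): Promise<boolean> {
  const queueKey = `queue:${resource}`;
  const lockKey = `lock:${resource}`;
  const token = crypto.randomUUID();
  const sleep = (ms: number) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms));

  // Join the queue; the score is our arrival time
  await this.redis.zadd(queueKey, Date.now(), token);

  // NOTE: a waiter that crashes stays in the queue and blocks everyone behind it;
  // production code also needs to expire stale queue entries
  while (true) {
    // Check if we're first in queue
    const first = await this.redis.zrange(queueKey, 0, 0);
    if (first[0] === token) {
      // now, we try to acquire
      const acquired = await this.redis.set(lockKey, token, 'NX', 'PX', this.ttl);
      if (acquired === 'OK') {
        await this.redis.zrem(queueKey, token);
        return true;
      }
    }

    // Not first yet, or the lock is still held: wait before polling again
    await sleep(100);
  }
}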

Alternatives to Redis

As mentioned before, there are many alternatives to Redis. In some cases, you don’t even need Redis or any other in-memory key-value store.

Database-Level Locks

If the main reason you’re acquiring locks is to guard database reads and writes, and you use Postgres or MySQL (or a similar modern database), you’re gonna want to hear this.

Advisory locks are locks the database provides for the application to use however it likes; the database attaches no meaning of its own to them. In Postgres, for example, they’re simple to use:

-- 12345 is a unique 64 bit number
SELECT pg_advisory_lock(12345);
-- do work
SELECT pg_advisory_unlock(12345);

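While a session holds the lock, any other session that calls pg_advisory_lock with the same key blocks until the holder unlocks it or disconnects. (The holding session itself can take the lock again; such requests stack, and each one needs its own unlock.) You can use this in your application by mapping every action that must not run concurrently to the same lock ID.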
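
Since the lock ID is just a number, you need a stable way to turn resource names into IDs. One possible approach, sketched below, uses the two-integer form `pg_advisory_lock(key1, key2)` and hashes a namespace and a resource name into 32-bit integers with FNV-1a. The helper names and the example namespace are hypothetical, and the hash works on UTF-16 code units for brevity. Different names can collide, in which case unrelated work will occasionally wait on the same lock.

```typescript
// 32-bit FNV-1a hash, returned as a signed int32 so it fits a Postgres integer.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash | 0;
}

// Map (namespace, resource) to the two integer keys of pg_advisory_lock(key1, key2).
function advisoryKeys(namespace: string, resource: string): [number, number] {
  return [fnv1a32(namespace), fnv1a32(resource)];
}

const keys = advisoryKeys('cron', 'nightly-report');
const sql = 'SELECT pg_advisory_lock($1, $2)'; // pass `keys` as the two parameters
console.log(sql, keys);
```

Every process that computes the keys from the same two strings ends up contending for the same lock, with no shared registry of IDs to maintain.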

Optimistic Concurrency

You may not even need locks. For many use cases, optimistic concurrency (a compare-and-swap on a version column) is a better fit than locking.

For example, if you want to update an order:

UPDATE orders
SET status = 'processing', version = version + 1
WHERE id = 123 AND version = 5;

If the update affects 0 rows, someone else got there first. In that case, you can re-read and retry or abort.
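
The same check-then-write logic can be modeled in application code. The sketch below is an in-memory illustration only (the `Order` type, `updateIfVersion`, and the sample data are hypothetical); in a real system the comparison and the write must happen in one atomic statement, as in the SQL above.

```typescript
interface Order {
  id: number;
  status: string;
  version: number;
}

// Mirrors: UPDATE ... SET status = ?, version = version + 1 WHERE id = ? AND version = ?
// Returns the number of rows affected (0 or 1).
function updateIfVersion(
  orders: Map<number, Order>,
  id: number,
  expectedVersion: number,
  status: string
): number {
  const order = orders.get(id);
  if (!order || order.version !== expectedVersion) return 0; // someone else got there first
  orders.set(id, { ...order, status, version: order.version + 1 });
  return 1;
}

const orders = new Map<number, Order>([
  [123, { id: 123, status: 'pending', version: 5 }],
]);

// Two workers both read version 5, then race to update.
const firstWorker = updateIfVersion(orders, 123, 5, 'processing');
const secondWorker = updateIfVersion(orders, 123, 5, 'processing');

console.log(firstWorker, secondWorker, orders.get(123)?.version); // 1 0 6
```

The second worker sees 0 rows affected, which is its cue to re-read the order and decide whether there is still anything left to do.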

Libraries Worth Using

Realistically speaking, you don’t need to build your own implementation unless it’s for learning.

The libraries below are battle-tested in production and handle edge cases you probably haven’t thought of yet, so they’re almost always a better idea than rolling your own.

node-redlock (Node.js): Full Redlock implementation with automatic extension
Redisson (Java): Excellent Redis client with distributed locks, maps, queues, etc.
redlock-py (Python): Python implementation of Redlock
go-redsync (Go): Clean Go implementation

Final Thoughts

Distributed locks are a tool, not a silver bullet. They work well for coordination in systems where:

You need to prevent duplicate work
Occasional failures are acceptable
You need low latency
You already have Redis

They’re not appropriate when:

You need absolute correctness
You’re protecting financial transactions
Failures have serious consequences

In those cases, you’re better off with a proper consensus system.

The 2022 Stack Overflow survey showed Redis is used by about 22% of professional developers. Many of those are using it for caching, but increasingly, teams are discovering its usefulness for coordination. The key is understanding the tradeoffs.

When I first implemented distributed locks, I thought the hard part was the code. It’s not. The hard part is choosing the right TTL, handling edge cases gracefully, and building systems that degrade nicely when locks fail. Redis makes the easy parts easy, but the hard parts are still hard.

References

Sanfilippo, S. (2015). Distributed locks with Redis. Redis Documentation. https://redis.io/docs/latest/develop/clients/patterns/distributed-locks/
Kleppmann, M. (2016, February 8). How to do distributed locking. Martin Kleppmann’s Blog. https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
Sanfilippo, S. (2016, February 9). Is Redlock safe? Antirez Weblog. http://antirez.com/news/101