A Way To Resolve Redelivered Messages Stuck In A NATS Jetstream

Fair warning: this is probably an approach that’ll only work for setups like mine, since the services I’m working on upsert1 NATS Jetstream consumers on startup.

We’re using NATS Jetstream to dispatch jobs to a pool of workers. Jobs come in via the job-inbox stream, and the workers publish results to a job-results stream. These streams are created with the WorkQueue policy and do not have a dead-letter queue configured.

The job-results stream has a durable consumer, configured with a deliver policy of all. When the service using this consumer starts, it upserts the consumer config, then begins processing messages earmarked for that consumer.
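For context, here’s a minimal sketch of what that startup upsert looks like with the Go client’s jetstream package. The consumer name is hypothetical, and error handling is trimmed to the basics:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// CreateOrUpdateConsumer is the "upsert": it creates the durable
	// consumer if it doesn't exist, or updates its config if it does.
	cons, err := js.CreateOrUpdateConsumer(ctx, "job-results", jetstream.ConsumerConfig{
		Durable:       "job-results-worker", // hypothetical consumer name
		DeliverPolicy: jetstream.DeliverAllPolicy,
		AckPolicy:     jetstream.AckExplicitPolicy,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Then start processing messages earmarked for this consumer.
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		// ... handle the job result ...
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cc.Stop()

	select {} // block until the process is stopped
}
```

This upsert-on-startup behaviour is what makes the fix at the end of this post possible: deleting the consumer and restarting the service brings it back automatically.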

Recently we started seeing a number of “stuck” messages in our job-results stream. Despite the consumer running correctly and keeping up with new messages, running nats stream ls would always report the stream as having 220 messages:

╭──────────────────────────────────────────────────────────────────────────────────────╮
│                                        Streams                                       │
├──────────────┬─────────────┬─────────────────────┬──────────┬─────────┬──────────────┤
│ Name         │ Description │ Created             │ Messages │ Size    │ Last Message │
├──────────────┼─────────────┼─────────────────────┼──────────┼─────────┼──────────────┤
│ job-results  │             │ 2024-10-15 10:00:00 │ 220      │ 169 KiB │ 238ms        │
│ job-inbox    │             │ 2024-10-15 10:00:00 │ 1041     │ 989 KiB │ 919ms        │
╰──────────────┴─────────────┴─────────────────────┴──────────┴─────────┴──────────────╯

Getting information about the consumer (via nats consumer info) showed the messages recorded as Redelivered, meaning they had failed to be acknowledged at least once.

State:

Last Delivered Message: Consumer sequence: 18,472,541 Stream sequence: 21,637,785 Last delivery: 2ms ago
Acknowledgment Floor: Consumer sequence: 18,472,539 Stream sequence: 21,637,783 Last Ack: 31ms ago
Outstanding Acks: 2 out of maximum 10,000
Redelivered Messages: 220
Unprocessed Messages: 0
Waiting Pulls: 100 of maximum 512

I wanted to get these messages replayed, but I couldn’t find an easy way to do so. I assumed that, since the consumer had a deliver policy of all, I only had to “restart” the consumer in some way and those messages would get redelivered.

But this was a misunderstanding of how these consumers operate. These consumers are naught but bits of state tracked by the server.2 Restarting the app is not the same as “restarting” the consumer, and wouldn’t force the consumer to read from the start of the stream.

Furthermore, I suspect these messages, which obviously failed once, continued to fail for some reason until the max delivery and back-off thresholds were reached. Since they were never acknowledged, and the stream didn’t have a dead-letter queue, they would simply remain in the stream until another consumer came along and processed them.
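Those thresholds live on the consumer config. A sketch of the relevant fields in the Go client’s jetstream.ConsumerConfig, with illustrative values (not our actual settings):

```go
cfg := jetstream.ConsumerConfig{
	Durable:       "job-results-worker", // hypothetical consumer name
	DeliverPolicy: jetstream.DeliverAllPolicy,
	AckPolicy:     jetstream.AckExplicitPolicy,
	// After MaxDeliver failed delivery attempts the server stops
	// redelivering the message. With no dead-letter handling, it
	// just sits in the stream, unacknowledged.
	MaxDeliver: 5,
	// Delays between redelivery attempts; MaxDeliver must be at
	// least the number of BackOff entries.
	BackOff: []time.Duration{time.Second, 10 * time.Second, time.Minute},
}
```

Once a message exhausts MaxDeliver, the consumer moves on; that matches the symptom of a constant Redelivered count alongside a consumer that’s otherwise keeping up.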

Which leads to how I got these messages replayed. There are a few advanced ways to do this, such as setting up a brand new consumer, which is only really possible if the consumers have different subject filters. But the solution I found was to simply recreate the consumer: shut down the system using that consumer, delete it using nats consumer rm, then restart the system. The consumer will be recreated with its sequence set to that of the messages still in the stream, and the messages will be replayed.
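The steps above, as CLI commands (the consumer name is hypothetical; substitute your own):

```shell
# 1. Stop the service that owns the consumer, so it doesn't
#    re-upsert the consumer while we're deleting it.

# 2. Optionally confirm the Redelivered count first.
nats consumer info job-results job-results-worker

# 3. Delete the durable consumer. This only removes the server-side
#    consumer state; the messages in the stream are untouched.
nats consumer rm job-results job-results-worker

# 4. Restart the service. Its startup upsert recreates the consumer
#    with a deliver policy of all, and the stuck messages are replayed.
```

This only works cleanly here because the service upserts the consumer on startup; without that, you’d need to recreate the consumer by hand with the same config.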

Obviously this won’t be an approach that works for everyone, but I’m surprised how hard it was to find others suggesting it (here’s an example of one other person proposing this). So consider this post evidence that this might be a viable approach to getting these messages replayed.

  1. Create or update

  2. Someone I work with described these consumers as “topic subscriptions that live on the NATS server.” This really cleared up how I should understand Jetstream consumers, and I’m surprised that NATS doesn’t have a statement like this in big bold letters in their documentation.