2026-04-22
11 min read

A single race condition silently killed my entire AI pipeline.

#MongoDB #Concurrency #SpringBoot #RaceCondition #BackendDebugging #AtomicUpsert #DataConsistency #Microservices


When you are building distributed backend systems, the hardest bugs are rarely the ones that crash your app.

The hardest bugs are the silent ones, where everything looks healthy, but the system output is fundamentally wrong.

I recently hit this exact problem while building an AI insight engine in a Spring Boot and MongoDB microservices setup. The feature pipeline was straightforward:

User behavior -> state processing -> AI trigger evaluation -> LLM call -> generated coaching insight.

But my AI engine never executed. Not even once.


The Symptoms

At first glance, nothing was broken:

  • Triggers fired at the right moments (session complete, intent updates, usage changes).
  • Conditional logic for trigger thresholds was properly coded.
  • LLM provider integration was configured and reachable.

Still, the decision engine kept concluding: "No significant change detected, skip generation."

That made no sense because fresh behavioral data was definitely entering the system.


The Investigation

The system depended on a collection called InsightTriggerState, intended to maintain exactly one document per user per day.

Its responsibility was to track daily trigger state and prevent unnecessary token burn when delta changes were too small.

When I inspected the collection, I found the core issue: duplicate records for the same (userId, date) pair.

In some cases there were two or even three documents for one user-day key.

Because MongoDB query order is not guaranteed without explicit sorting, the application often read an older, stale document while real updates landed on a newer duplicate.

Result:

  • Trigger engine read stale state
  • Calculated no meaningful delta
  • Silently skipped AI invocation

No exceptions. No crashes. Just silently wrong behavior.


Root Cause: A Race Condition in Create-or-Load Logic

The problematic pattern looked like this:

// Race: the existence check and the save are two separate round trips,
// with no lock or constraint between them.
InsightTriggerState state = repository.findByUserIdAndDate(userId, today)
    .orElseGet(() -> repository.save(new InsightTriggerState(userId, today)));

This is fine in single-threaded execution. It is unsafe in concurrent conditions.

Two threads can do this simultaneously:

  1. Thread A checks -> no document
  2. Thread B checks -> no document
  3. Thread A inserts
  4. Thread B inserts

Now duplicates exist, and state consistency is gone.
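The interleaving above can be reproduced deterministically without a database. The sketch below simulates the check-then-insert pattern against an in-memory list standing in for the collection, using latches to force both threads to complete their check before either inserts; the key format and class names are illustrative, not the real schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class CheckThenInsertRace {
    static final List<String> store = new ArrayList<>();

    // Mirrors findByUserIdAndDate(...).orElseGet(() -> save(...)):
    // the check and the insert are two separate, non-atomic steps.
    static void createIfAbsent(String key, CountDownLatch bothChecked) throws InterruptedException {
        boolean exists;
        synchronized (store) { exists = store.contains(key); } // step 1: check
        bothChecked.countDown();
        bothChecked.await();                                   // both threads have now seen "absent"
        if (!exists) {
            synchronized (store) { store.add(key); }           // step 2: insert -> duplicate
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothChecked = new CountDownLatch(2);
        String key = "user42|2026-04-22"; // hypothetical user-day key
        Runnable worker = () -> {
            try { createIfAbsent(key, bothChecked); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(store.size()); // two documents for one key
    }
}
```

Even with every individual store access synchronized, the composite check-then-insert sequence is still racy, which is exactly why locking inside the repository call does not help.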


The Fix (4-Layer Approach)

I fixed the issue in four layers to prevent both current and future corruption.

1) Data Cleanup

First, I removed existing duplicate records using an aggregation-driven cleanup script, retaining one valid document per user-day key.

Without cleanup, code fixes alone would still read poisoned data.

2) Database Constraint

I added a compound unique index on:

{ userId: 1, date: 1 }

This enforces data integrity at the database layer. If a concurrent insert happens again, MongoDB throws DuplicateKeyException instead of silently allowing corruption.
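In Spring Data MongoDB, one way to declare this index is on the entity itself. This is a configuration sketch: the field list is a placeholder, and note that since Spring Data MongoDB 3.0 automatic index creation is off by default and must be enabled (for example via spring.data.mongodb.auto-index-creation=true) or the index created explicitly.

```java
import org.springframework.data.mongodb.core.index.CompoundIndex;
import org.springframework.data.mongodb.core.mapping.Document;

// Sketch of the entity; the real class carries the full trigger state.
@Document("insightTriggerState")
@CompoundIndex(name = "user_date_unique", def = "{'userId': 1, 'date': 1}", unique = true)
public class InsightTriggerState {
    private String userId;
    private String date;
    // trigger state fields, timestamps, deltas...
}
```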

3) Atomic Upsert

I replaced the non-atomic find + save flow with a single database-level atomic operation using findAndModify with upsert(true).

// Matches on the (userId, date) key; with upsert(true), MongoDB builds the
// new document from these equality criteria when no match exists.
Query query = new Query(Criteria.where("userId").is(userId).and("date").is(today));

// setOnInsert only applies when the upsert creates the document,
// so an existing document's createdAt is never overwritten.
Update update = new Update().setOnInsert("createdAt", Instant.now());

InsightTriggerState state = mongoTemplate.findAndModify(
    query,
    update,
    FindAndModifyOptions.options().upsert(true).returnNew(true), // return the post-upsert document
    InsightTriggerState.class
);

This delegates concurrency resolution to MongoDB, which is exactly where it belongs.

4) Defensive Guardrail

I added a defensive fetch fallback: if duplicates are ever detected due to migration anomalies or manual writes, the service logs a critical warning and deterministically selects the latest valid state record.

So even in degraded conditions, the pipeline stays correct and observable.
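The fallback amounts to a small selection helper. Again a sketch under assumed names (TriggerState, updatedAt): warn loudly when duplicates appear, then pick the most recently updated document so reads stay deterministic.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

class DefensiveFetch {
    // Illustrative shape; the real entity carries the full trigger state.
    record TriggerState(String id, Instant updatedAt) {}

    // If a fetch unexpectedly returns duplicates, log a critical warning and
    // resolve to the latest document instead of trusting arbitrary query order.
    static TriggerState resolve(List<TriggerState> found) {
        if (found.size() > 1) {
            System.err.println("CRITICAL: " + found.size()
                + " InsightTriggerState documents for one user-day key");
        }
        return found.stream()
            .max(Comparator.comparing(TriggerState::updatedAt))
            .orElseThrow();
    }

    public static void main(String[] args) {
        var stale = new TriggerState("a", Instant.parse("2026-04-22T08:00:00Z"));
        var fresh = new TriggerState("b", Instant.parse("2026-04-22T09:00:00Z"));
        System.out.println(resolve(List.of(stale, fresh)).id());
    }
}
```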


Outcome

After deployment:

  • State returned to strict 1-user, 1-day consistency
  • Trigger calculations became stable and deterministic
  • AI generation began firing correctly based on real behavior delta

The backend stopped making silent wrong decisions and started producing expected insights immediately after eligible user activity.


Engineering Takeaway

Never rely on application-layer branching to guarantee uniqueness in concurrent systems.

Push both constraints and atomicity into the database engine:

  • enforce unique indexes
  • use atomic upserts
  • add defensive fallback logic for resilience

Silent data-consistency bugs are more dangerous than crashes because they look like success while producing incorrect business outcomes.

That is exactly why this bug took time to catch and why fixing it at the data layer was non-negotiable.