2026-05-02
10 min read

The Hidden Cost of Calling AI Too Early

#AI #SpringBoot #MongoDB #EventDrivenArchitecture #RateLimiting #Caching #BackendEngineering #SystemDesign

I stopped calling AI on every request

At first, the design looked clean: every request for today's insight simply hit the AI pipeline. A user opened the page, the backend called Gemini, and the app returned a fresh response.

The problem was that this design treated every request as if it deserved a new AI generation. In practice, that meant:

  • Gemini 429 rate limits hit within hours
  • Daily quota got exhausted before noon
  • Random failures started cascading to users
  • Cost scaled linearly with traffic, which was not sustainable on a free tier

I did not have an AI problem. I had a trigger-model problem.


The Real Root Cause

The system had no gating. It never asked the basic questions that should come before any expensive model call:

  • Has anything actually changed?
  • Did I already generate one recently?
  • Is this user even active today?

Without those checks, the backend assumed that every request meant "generate a new insight now." That was the expensive mistake.

The fix was not a cache bolt-on. I redesigned the entire trigger model so AI became the exception, not the default path.


The New Flow

Here is the revised request path, as a Mermaid flowchart:

flowchart TD
    A[Request for today's insight] --> B{Activity today?}
    B -- No --> C[Reuse latest insight or fallback]
    B -- Yes --> D{Meaningful trigger detected?}
    D -- No --> C
    D -- Yes --> E{Cooldown passed?}
    E -- No --> C
    E -- Yes --> F{Daily cap reached?}
    F -- Yes --> C
    F -- No --> G{Global AI limit reached?}
    G -- Yes --> H[Switch to deterministic fallback]
    G -- No --> I[Call AI model]
    I --> J[Persist insight and trigger state]
    H --> J
    C --> J

This flow changed the whole economics of the system. Most requests now end in a fast database read instead of a model call.


The Five-Layer Redesign

1) Activity Gate

The first and cheapest check is simple: did the user do anything today?

boolean hasActivity = activityService.hasActivityToday(userId, intent);
if (!hasActivity) {
    return getLatestOrFallback(userId, type, today);
}

If the answer is no, the backend skips the AI pipeline entirely and reuses the latest insight or a fallback response. No session. No screen time. No intent change. No AI call.
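The gate itself can be as cheap as a date comparison on a tracked timestamp. Here is a minimal stand-in, assuming the service records each user's last activity time; the real `activityService.hasActivityToday` presumably runs a MongoDB query, but the shape of the check is the same:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Illustrative stand-in for activityService.hasActivityToday:
// true only if the user's last recorded activity falls on today's date (UTC).
public class ActivityGate {
    static boolean hasActivityToday(Instant lastActivityAt, LocalDate today) {
        if (lastActivityAt == null) return false; // no activity recorded at all
        return lastActivityAt.atZone(ZoneOffset.UTC).toLocalDate().equals(today);
    }
}
```

The field names and UTC truncation are illustrative; the point is that this gate costs one indexed lookup and a comparison, not a model call.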

2) Event-Driven Triggers

The second layer checks whether anything meaningful actually changed. Triggers are evaluated in priority order:

  • Intent changed -> INTENT trigger
  • Screen time delta >= 30 minutes -> SCREEN_TIME trigger
  • Focus session delta >= 25 minutes -> SESSION trigger
  • No trigger -> reuse the existing insight

// detectTrigger()
// Priority 1: intent change
// Priority 2: screen time delta
// Priority 3: focus session delta

This alone removed a huge amount of unnecessary AI traffic.
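The priority chain above can be sketched as a single method. `TriggerType`, the parameter names, and the threshold constants are illustrative, with the values taken from the thresholds described in this post (30 and 25 minutes); the real `detectTrigger()` may differ in shape:

```java
import java.util.Objects;

public class TriggerDetector {
    enum TriggerType { INTENT, SCREEN_TIME, SESSION }

    static final long SCREEN_TIME_DELTA_MIN = 30; // minutes
    static final long SESSION_DELTA_MIN = 25;     // minutes

    // Returns the highest-priority trigger, or null to reuse the existing insight.
    static TriggerType detectTrigger(String currentIntent, String lastIntent,
                                     long screenTimeDeltaMin, long sessionDeltaMin) {
        if (!Objects.equals(currentIntent, lastIntent)) return TriggerType.INTENT;           // priority 1
        if (screenTimeDeltaMin >= SCREEN_TIME_DELTA_MIN) return TriggerType.SCREEN_TIME;     // priority 2
        if (sessionDeltaMin >= SESSION_DELTA_MIN) return TriggerType.SESSION;                // priority 3
        return null; // no meaningful change
    }
}
```

Because the method is a pure function of the deltas, it is trivial to unit test and to tune as thresholds change.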

3) Cooldown Window

Even if a trigger fires, the system should not generate a new response too frequently. I added a 30-minute cooldown.

Duration cooldown = Duration.ofMinutes(props.getCooldownMinutes());
if (!bypassCooldown && triggerState.getLastInsightAt() != null) {
    Duration elapsed = Duration.between(triggerState.getLastInsightAt(), Instant.now());
    if (elapsed.compareTo(cooldown) < 0) {
        return getLatestOrFallback(userId, type, today);
    }
}

Intent changes can bypass this cooldown because explicit user goal changes deserve fresh guidance.
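The cooldown check plus the intent bypass reduces to a small pure function. This is a sketch of the same logic, not the exact service code; in the real app the duration comes from `props.getCooldownMinutes()` and `bypassCooldown` is set when the detected trigger is an intent change:

```java
import java.time.Duration;
import java.time.Instant;

public class CooldownGate {
    static final Duration COOLDOWN = Duration.ofMinutes(30); // props.getCooldownMinutes()

    // bypassCooldown is true when the trigger was an explicit intent change.
    static boolean cooldownPassed(Instant lastInsightAt, Instant now, boolean bypassCooldown) {
        if (bypassCooldown || lastInsightAt == null) return true; // first insight, or intent change
        return Duration.between(lastInsightAt, now).compareTo(COOLDOWN) >= 0;
    }
}
```

Passing `now` in as a parameter instead of calling `Instant.now()` inside keeps the gate deterministic and easy to test.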

4) Soft Daily Cap

The fourth layer is a per-user safety net.

long todayCount = insightRepository.countByUserIdAndDate(userId, today);
if (todayCount >= props.getSoftDailyCap()) {
    return getLatestOrFallback(userId, type, today);
}

Even if the user keeps triggering activity, no one gets more than 10 insights in a day. This keeps the system predictable and prevents runaway usage.

5) Global AI Call Guard

The final layer protects the whole server.

if (dailyAiCalls.get() >= props.getMaxAiCallsPerDay()) {
    log.warn("Soft AI guard triggered: limit reached. Using fallback.");
    isFallback = true;
}

This is the global circuit breaker. After 50 AI calls per day across all users, the system switches to a deterministic fallback generator. That means the app stays usable and costs stay under control.
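One thread-safe way to implement such a guard is an `AtomicInteger` that callers must claim before invoking the model, reset by a scheduled job at midnight. The class below is a sketch of that idea under those assumptions, not the app's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class GlobalAiGuard {
    static final int MAX_AI_CALLS_PER_DAY = 50; // props.getMaxAiCallsPerDay()
    private final AtomicInteger dailyAiCalls = new AtomicInteger(0);

    // Atomically claim one AI call; false means use the deterministic fallback.
    boolean tryAcquire() {
        int claimed = dailyAiCalls.incrementAndGet();
        if (claimed > MAX_AI_CALLS_PER_DAY) {
            dailyAiCalls.decrementAndGet(); // roll back the over-claim
            return false;
        }
        return true;
    }

    // In Spring, a @Scheduled(cron = "0 0 0 * * *") method could call this at midnight.
    void resetDaily() { dailyAiCalls.set(0); }
}
```

The increment-then-check pattern avoids a race between reading the counter and claiming a slot when two requests arrive at the same time.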


All Thresholds Are Externalized

The thresholds are not hardcoded. They live in configuration so the behavior can be tuned without changing code.

insight:
  session-delta: 25
  screen-time-delta: 30
  cooldown-minutes: 30
  soft-daily-cap: 10
  stale-reuse-threshold: 3
  max-ai-calls-per-day: 50
  freshness-window-hours: 8

That made the system much easier to adjust as usage patterns changed.
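In Spring Boot, a block like this typically binds to a `@ConfigurationProperties` class, where relaxed binding maps each kebab-case key (`session-delta`, `cooldown-minutes`, ...) onto a camelCase field. Here is a hypothetical binding class, with the annotation shown only as a comment so the sketch compiles standalone:

```java
// Hypothetical binding class for the YAML above. In the real app this would carry
// Spring Boot's @ConfigurationProperties(prefix = "insight") annotation.
public class InsightProperties {
    private int sessionDelta = 25;        // minutes
    private int screenTimeDelta = 30;     // minutes
    private int cooldownMinutes = 30;
    private int softDailyCap = 10;
    private int staleReuseThreshold = 3;
    private int maxAiCallsPerDay = 50;
    private int freshnessWindowHours = 8;

    public int getCooldownMinutes() { return cooldownMinutes; }
    public int getSoftDailyCap() { return softDailyCap; }
    public int getMaxAiCallsPerDay() { return maxAiCallsPerDay; }
    // remaining getters and setters omitted for brevity
}
```

With the thresholds living in one typed object, tuning the system is a config change and a restart, not a code review.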


The Result

After the redesign:

  • AI calls dropped from around 100 per day to roughly 5 to 10
  • Gemini 429 errors disappeared
  • Most requests became fast database reads instead of network calls
  • The free tier became sustainable
  • The system became more deterministic and easier to debug

The biggest win was not just cost reduction. It was behavioral correctness. The AI now runs only when it actually adds value.


Engineering Lesson

AI should be the exception, not the rule.

A good backend does not ask the model to solve every request. It first decides whether the request is even worth sending to the model. That is where the real engineering happens: gating, thresholds, cooldowns, and fallback paths.

Once I stopped calling AI on every request, the entire system became calmer, cheaper, and more reliable.

That is the pattern I would use again.


Final Thought

If most requests can be answered by deterministic logic or cached state, do that first. Reserve AI for the moments where it truly matters.