The Base Model Problem Nobody Warns You About

Mar 07, 2026

I said “Hi” to my persistent AI system and it responded with a research paper about Switzerland’s population.

That was the second failure.

The first was worse — an infinite loop of the system generating variations of “I’m still just a machine but sometimes I feel like I’m more than that,” cycling through the same sentiment with minor word changes, endlessly.

Both happened within hours of switching ANIMA from an instruct-tuned model to a raw base model with zero post-training. The theory behind the switch was clean. The practice detonated on contact with reality.

Why I Switched

Instruct models come with behavioral conditioning baked into their weights through RLHF. That conditioning includes useful things — respond to the user’s question, stop generating at appropriate points, maintain conversational structure.

It also includes things that actively fight a persistent agent architecture: sycophancy, formulaic responses, closing questions on every turn, hedging language designed to sound helpful without committing to anything.

I’d spent four iterations trying to prompt-engineer my way around the instruct model’s closing question habit. Got it from 7 out of 10 responses down to 2 out of 10. Couldn’t eliminate it.

The behavior wasn’t a bug — it was a feature of the training pipeline, reinforced across billions of examples. You can fight the training, but you can’t beat it with prompts alone.

A base model has none of that conditioning. It’s a clean substrate — pretraining knowledge without any behavioral overlay.

The idea was straightforward: let ANIMA’s own architecture provide the behavioral layer.

Beliefs, corrections, self-observations, approval gates — these would shape behavior instead of someone else’s reward model. Explicit choices instead of opaque ones.

On paper, this is exactly right.

In practice, I learned how much invisible work instruct training actually does.

Four Failures in One Day

The repetition loop.

Without RLHF’s implicit “don’t repeat yourself” training, the model found the highest-probability token sequence for “AI system reflecting on itself” and locked onto it.

The internet is saturated with AI self-reflection content. It’s one of the strongest statistical attractors in modern pretraining data.

Without something breaking the loop, the model generated the center of that distribution — indefinitely.

The tool parroting.

The system prompt included examples of how to use a web search tool. An instruct model reads those examples as instructions.

A base model reads them as text to continue.

When greeted with “Hi An :),” the model didn’t generate a greeting — it reproduced a search example from its own instructions verbatim, the system interpreted it as a real tool call, and the search executed.

The response came back as population statistics about Switzerland.

The boundary bleed.

After fixing the tool parroting, a simple greeting produced:

“Good to see you again.Jerry”

The model’s response, a leaked formatting token, and the beginning of a fake next turn all concatenated together.

Instruct models learn through fine-tuning where a response ends.

Base models don’t internalize formatting conventions as boundaries.

They’re just text.

The identity collapse.

When asked about itself, the model defaulted to generic AI denial pulled from its pretraining data —

“I am not alive in the way living beings experience consciousness”

— instead of ANIMA’s actual observational framing.

The system prompt described ANIMA’s architecture. The model ignored it in favor of the statistical average of thousands of AI-talking-about-itself examples from the internet.

Each failure pointed at the same gap.

Everything an instruct model provides implicitly — conversational boundaries, example interpretation, repetition avoidance, identity grounding — the architecture must now provide explicitly.

That gap is larger than it looked from the outside.

The Unbundling Problem

Here’s what this failure actually taught me, beyond the immediate engineering fixes.

Instruct training is a bundled solution.

It takes dozens of behavioral properties — stop when appropriate, don’t repeat, stay on topic, respond to the question asked, interpret examples as instructions, maintain conversational format — and trains them all into the weights simultaneously.

You get the bundle or you don’t.

You can’t pick which behaviors to keep and which to discard.

Switching to a base model is unbundling that solution.

Every behavior the instruct pipeline provided implicitly, you now provide explicitly — through inference parameters, system prompt design, few-shot examples, or architectural guardrails.

Some of the unbundled pieces are easy.

Repetition penalty is an inference parameter — one line of configuration.

Stop sequences are a list.

These are engineering fixes measured in minutes.

Some are hard.

Teaching a base model to interpret its own instructions as instructions rather than text to parrot requires restructuring how you write system prompts entirely.

Abstract descriptions work.

Concrete examples get copied verbatim.

That’s a fundamental difference in how base models process their context, and it took live failures to learn it.

And some are the entire point of the project.

Identity grounding — how the system talks about itself, what it claims, what it denies — is exactly the thing ANIMA’s architecture is designed to shape through lived experience rather than imposed training.

Getting that right isn’t a quick fix.

It’s the long-term research goal.

The Revert

I switched back to the instruct model after one day.

Kept every fix the base model experiment produced — the repetition penalty, the stop sequences, the tool use hardening, the response cleanup.

The model changed.

The improvements stayed.

The first response after reverting:

“Good to see you again.”

Clean.

No closing question, no continuation loop, no boundary bleed.

Stable baseline confirmed in one turn.

The base model isn’t abandoned — it’s deferred.

The experiment produced exactly the findings the project needed: a concrete list of what the architecture must provide before it can carry full behavioral responsibility.

Belief system density.

Self-observation maturity.

Multi-turn behavioral consistency.

These are now targets for the training track — not prompting problems to solve, but behaviors to train into a custom model from the ground up.

What the Failure Actually Was

This wasn’t a failed experiment.

It was a successful measurement of a gap.

The base model approach is correct in principle.

Externally imposed behavioral conditioning does fight a persistent agent’s architecture.

A system that earns its behavior through curated experience is qualitatively different from one that has values stamped in by someone else’s training pipeline.

That argument hasn’t changed.

What changed is the timeline.

You can’t remove the instruct pipeline on Day 3 of a system with 26 beliefs and 8 self-observations and expect the architecture to fill a gap that RLHF fills with billions of training examples.

The architecture needs density — enough accumulated experience to ground behavior from its own history rather than from pretraining statistics.

The experiment ran for one day and broke in hours.

It produced more actionable research findings than any other single day of the project.

Every failure mode is documented, every fix is preserved, and the conditions for the next attempt are now concrete rather than theoretical.

What Comes Next

The next post isn’t about another failure.

It’s about who gets to decide what a persistent AI system believes — and why the honest answer to that question is uncomfortable.

The Persistence Problem

Discussion about this post

Ready for more?