Catching bots that don't want to be caught: inside a two-stage neural detection stack

How a two-stage neural stack catches stealth browsers, proxied scrapers, and LLM agents that pass every fingerprint check, and where it hits a wall.

Jun 22, 2026 15 min read
Illustration of a two-stage neural bot detection stack separating human and bot browser sessions

TL;DR

Most automated traffic is easy to stop. A user-agent regular expression and a CAPTCHA catch the bottom of the market. The traffic that matters does not live there. Stealth-patched browsers, residential-proxied scrapers, and LLM-driven agents run real browser engines, carry clean fingerprints, and move a pointer the way a person does. A clean fingerprint does not prove a human, and a datacenter address does not prove a bot.

This is a field report on a detection stack built for that adversary: a deterministic rule layer plus two compact neural models, one reading fingerprint and network signals and one reading behavior, combined by a logical OR. The behavioral model's frozen embedding separates bots from humans almost perfectly. A 16-architecture search shows the performance ceiling comes from the signal, not the model. And the honest result: on the full production population at a strict false-positive budget, recall sits below the bar the stack clears on its training distribution, for reasons of population composition rather than model quality.

Specific detection signals, feature names, and thresholds are withheld by design, so the engineering is reproducible in spirit without turning into an evasion guide.

The adversary worth engineering against

A user-agent regex and a CAPTCHA stop the bottom of the automation market and pass the rest through. The adversaries worth building against are not in that bottom tier. They run real Chromium. They patch the fingerprint surface that naive checks read. They proxy through residential address space. The most capable are language-model agents that pursue a goal across a multi-step flow rather than replaying a fixed script.

The defining property of this adversary is that no single observable settles the question. This is the same reason CAPTCHAs no longer do real work against motivated automation, and why stealth and anti-detect browsers exist as a product category. You cannot read the answer off one value. You have to combine independent signals and accept that each one, on its own, is wrong some of the time.

The detection stack: a deterministic layer and two compact models

The stack decides whether a visitor is automated with a two-stage cascade preceded by a deterministic rule layer.

  • Deterministic rule layer. Encodes the handful of conditions that constitute proof of automation on their own.
  • Model 1, fingerprint (MorphNet). Scores a single page evaluation from device, browser, and network signals.
  • Model 2, behavior (GammaNet). Scores an entire session from behavioral telemetry, and takes the first model's score as an input.

The final label is the logical OR of all three. A visitor is flagged if the rule layer fires, or if either model fires. That structure has a consequence that shapes everything else: because the rule layer already catches the cases it can prove, the models do not need to. The models train only on the residual, the traffic the rules cannot settle, which is both the hard part of the problem and the part where a model earns its place.
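The OR structure can be sketched in a few lines. This is a minimal illustration and not the deployed logic: the names `Verdict` and `decide` are hypothetical, and the 0.5 thresholds are placeholders, since the real operating points are withheld.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    rule_fired: bool
    fingerprint_fired: bool
    behavior_fired: bool

    @property
    def is_bot(self) -> bool:
        # Logical OR: any single layer firing flags the visitor.
        return self.rule_fired or self.fingerprint_fired or self.behavior_fired


def decide(rule_fired: bool,
           fingerprint_score: float,
           behavior_score: float,
           fingerprint_threshold: float = 0.5,  # placeholder, not a real operating point
           behavior_threshold: float = 0.5) -> Verdict:  # placeholder as well
    """Combine the rule layer and the two model scores into one verdict."""
    return Verdict(
        rule_fired=rule_fired,
        fingerprint_fired=fingerprint_score >= fingerprint_threshold,
        behavior_fired=behavior_score >= behavior_threshold,
    )
```

Keeping the three flags separate in the verdict, rather than returning a single boolean, makes it possible to see afterwards which layer caught a given session.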

The bar held throughout is recall of at least 0.90 at a false-positive rate of at most 0.01, measured on real production traffic at the session level.
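To make the bar concrete, here is one way recall at a fixed false-positive budget can be computed from raw scores. It is a sketch under assumptions: higher scores mean "more bot-like", a session is flagged when its score is strictly above the threshold, and the function names are illustrative.

```python
import math


def threshold_at_fpr(human_scores, max_fpr):
    """Return a threshold t such that the share of humans scoring above t is <= max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))  # humans we may flag within the budget
    if allowed >= len(ranked):
        return float("-inf")
    # Flagging only scores strictly above this value flags at most `allowed` humans.
    return ranked[allowed]


def recall_at_fpr(bot_scores, human_scores, max_fpr=0.01):
    """Recall on bots at the threshold fixed by the human false-positive budget."""
    threshold = threshold_at_fpr(human_scores, max_fpr)
    caught = sum(score > threshold for score in bot_scores)
    return caught / len(bot_scores), threshold
```

The threshold is chosen from human scores alone, so the false-positive budget is held first and recall is whatever remains at that threshold.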

How the evaluation stays honest

The numbers below only mean something because of how the evaluation is built, so it is worth stating the guards first.

Training and evaluation data are real page evaluations grouped into sessions. Labels are deterministic, applied only when a signal constitutes proof; anything ambiguous is left unlabeled rather than guessed, which keeps label noise out of the ground truth. For the stealth classes that evade fingerprinting by construction, ground truth comes from a controlled traffic-generation harness that runs known automation against instrumented targets, so the label is known by provenance rather than inferred.

Three leakage guards keep the reported numbers honest:

  1. The signals the deterministic rule layer reads are excluded from model training, so the model learns the residual instead of re-memorizing what the rules already own.
  2. The events that directly announce an automation agent are stripped from the behavioral stream before features are computed, so the behavioral model learns motion and timing structure rather than the presence of a tell-tale event type.
  3. The cascade is wired with out-of-fold scores: the first model's contribution to the second is its cross-validated score, never a prediction from a model that has already seen the session.

Every split is by session, never by row, since multiple evaluations share a session and a row-level split would let the same session appear on both sides. All results use a fixed seed.
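The session-level split and the out-of-fold wiring can be sketched together. This is an illustration under assumptions, not the production pipeline: rows are `(session_id, features, label)` tuples, folds are assigned by hashing the session ID with a fixed seed, and `fit` is any hypothetical callable that takes training rows and returns a scoring function.

```python
import hashlib


def fold_of(session_id: str, n_folds: int, seed: int = 0) -> int:
    """Deterministically assign a whole session to one fold."""
    digest = hashlib.sha256(f"{seed}:{session_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def out_of_fold_scores(rows, fit, n_folds: int = 5, seed: int = 0):
    """Score every row with a model that never saw that row's session.

    rows: list of (session_id, features, label)
    fit:  callable(train_rows) -> callable(features) -> score
    """
    folds = [fold_of(session_id, n_folds, seed) for session_id, _, _ in rows]
    scores = [None] * len(rows)
    for k in range(n_folds):
        # Train on every session outside fold k, then score only fold k.
        train_rows = [row for row, fold in zip(rows, folds) if fold != k]
        model = fit(train_rows)
        for i, (row, fold) in enumerate(zip(rows, folds)):
            if fold == k:
                scores[i] = model(row[1])
    return scores
```

Because the fold depends only on the session ID and the seed, every evaluation from one session lands on the same side of the split, and the scores passed to a downstream model come from a model that was fit without that session.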

Model 1: MorphNet, the fingerprint model

The first model reads roughly 90 per-evaluation features describing fingerprint coherence, transport-layer characteristics, and network context. The first production version was a gradient-boosted tree ensemble. It was accurate on easy traffic, heavy to ship into a latency-sensitive request path, and hard to introspect. It was replaced with a from-scratch network.

MorphNet gives a tiny network the expressive shape of a tree ensemble: a learned linear reduction into a compact internal space, then per-channel activations where each channel fits its own response curve, then a low-rank interaction layer that lets pairs of signals modulate each other, a weighted read-out, and a short funnel to a single output. The whole model is about 6,500 parameters and trains in roughly 25 seconds on a laptop GPU.

Two latent failures in the first draft were caught at design review, before any training code was written. One was a terminal softmax over a single output, which is mathematically constant and carries no gradient, so the model could not learn. The other was an interaction read-out that collapsed to a constant vector, a dead layer that contributes nothing. A third lesson came from training: the normalization choice is load-bearing. Batch normalization's running statistics dropped validation AUC from 0.99 to 0.50, and layer normalization is the correct fit for the single-example model that runs in production.
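The softmax failure is easy to reproduce in isolation. The sketch below uses plain Python rather than the training framework, and the function names are illustrative: a softmax over one logit returns 1.0 whatever the input, while a sigmoid over the same logit still responds to it. The layer normalization helper shows why that choice suits single-example inference: it uses only the statistics of the example in front of it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_norm(x, eps=1e-5):
    """Normalize one example across its own features; no batch statistics involved."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# A single-output softmax is constant, so the loss cannot depend on the logit.
single_output = [softmax([z])[0] for z in (-5.0, 0.0, 5.0)]   # always 1.0
# A sigmoid on the same logits varies, so a gradient can flow.
sigmoid_output = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
```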

On an honest session-level split, MorphNet results:

| Metric | Value |
| --- | --- |
| AUC | 0.9994 |
| Recall at operating point | 0.9959 |
| Recall across 5 seeds | 0.9962 ± 0.0007 |
| Inference-vs-training numerical agreement | within 5e-7 |

Every seed clears the gate, and the agreement between the exported inference graph and the training framework means the model served is the model trained.
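A parity check of this kind can be as simple as comparing the two sets of outputs on the same inputs. The sketch below assumes both runtimes have already produced lists of scores; the function names are hypothetical and the default tolerance mirrors the 5e-7 figure reported above.

```python
def max_abs_diff(reference, candidate):
    """Largest absolute difference between two equally long score lists."""
    if len(reference) != len(candidate):
        raise ValueError("score lists must have the same length")
    return max(abs(a - b) for a, b in zip(reference, candidate))


def outputs_agree(training_scores, served_scores, tolerance=5e-7):
    """True when the exported model reproduces the training framework within tolerance."""
    return max_abs_diff(training_scores, served_scores) <= tolerance
```

Running this on a fixed batch of held-out inputs after every export is a cheap way to catch a mismatch before it reaches the request path.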

The contrast with the old baseline is sharpest on the residual hard set, the traffic the rule layer cannot settle. On a held-out stress split of 6,004 sessions:

| Model | Recall on residual hard set |
| --- | --- |
| Earlier gradient-boosted version A | 0.139 |
| Earlier gradient-boosted version B | 0.039 |
| MorphNet | 0.996 |

The rebuild is not a marginal gain on this slice. It is the difference between a model that works on the hard cases and two that do not.

Model 2: GammaNet, the behavioral model

Behavior is harder than fingerprint, so the second model has more structure. It reads 179 aggregate behavioral inputs describing the shape of a session, and it is built in two pieces.

The front end is the encoder half of an autoencoder. It maps the cleaned, standardized inputs down through a hidden layer to a 32-dimensional latent on the unit hypersphere. This encoder is trained once, without labels, and then frozen. The back end is a small supervised head: each latent channel passes through a compact learned nonlinearity, initialized so training begins from the frozen representation, followed by a short linear funnel to a single output. The model is about 14,000 parameters, the bulk of them in the frozen encoder and only a few hundred trainable in the head.
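The shape of the encoder's forward pass can be sketched as follows. Only the 179 inputs, the single hidden layer, and the 32-dimensional unit-norm latent come from the description above. The hidden width of 64, the tanh activation, the random weights, and every name in the code are assumptions for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def l2_normalize(v, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / max(norm, eps) for c in v]


def init_params(n_in=179, n_hidden=64, n_latent=32, seed=0):
    """Random stand-in weights; a real encoder would load frozen, trained ones."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 1.0 / math.sqrt(cols)) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "w2": matrix(n_latent, n_hidden), "b2": [0.0] * n_latent,
    }


def encode(x, params):
    """Standardized inputs -> hidden layer -> latent on the unit sphere."""
    hidden = [math.tanh(h) for h in linear(x, params["w1"], params["b1"])]  # assumed activation
    return l2_normalize(linear(hidden, params["w2"], params["b2"]))
```

Because every latent has unit length, cosine similarity between two sessions reduces to a dot product, which is what makes the geometric measurements on the embedding straightforward.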

Freezing the encoder is the decision that makes this model behave. An encoder trained to reconstruct generalizes to live traffic; one trained to separate the classes as sharply as possible does not. On a held-out session-level split, GammaNet reaches an AUC of 0.9989 and recall of 0.9964 at the operating point, within a hair of a freshly retrained tree baseline on the identical split. On a live replay against known bots, its model-only recall was 99.4%, against 96.4% for the previously deployed behavioral model.

The behavioral embedding separates almost perfectly

The most striking artifact of the project is the geometry of GammaNet's frozen latent. Roughly 17,000 sessions were mapped through the encoder to see where bots and humans land. They sit in opposite regions of the sphere.

| Metric | Value | What it means |
| --- | --- | --- |
| Nearest-neighbour purity (k=1) | 99.6% | Almost every point's nearest neighbour shares its class |
| Centroid cosine separation | -0.99 | Bot and human centroids sit nearly antipodal on the sphere |
| Mean local entropy (k=20) | 0.010 | Almost no class mixing in local neighbourhoods |

In practice there is no confusion zone. The classes sit near opposite poles, the region between them holds almost no probability mass, and the model is confident almost everywhere.
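The three geometry metrics in the table can be computed as sketched below. Several details are assumptions, because the source does not state them: neighbours are ranked by cosine similarity on unit vectors, a point is never its own neighbour, labels are 0 for human and 1 for bot, and entropy is measured in bits. The brute-force neighbour search is quadratic, which is fine for a toy example but not for 17,000 sessions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neighbour_labels(points, labels, i, k):
    """Labels of the k most cosine-similar points to point i, excluding i itself."""
    ranked = sorted(((dot(points[i], p), j) for j, p in enumerate(points) if j != i),
                    reverse=True)
    return [labels[j] for _, j in ranked[:k]]


def nn_purity(points, labels):
    """Share of points whose single nearest neighbour has the same label."""
    hits = sum(neighbour_labels(points, labels, i, 1)[0] == labels[i]
               for i in range(len(points)))
    return hits / len(points)


def centroid_cosine(points, labels):
    """Cosine between the class-0 and class-1 centroids; -1 means antipodal."""
    centroids = []
    for cls in (0, 1):
        members = [p for p, label in zip(points, labels) if label == cls]
        centroids.append([sum(axis) / len(members) for axis in zip(*members)])
    a, b = centroids
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def mean_local_entropy(points, labels, k):
    """Average binary entropy (bits) of the class mix in each k-neighbourhood."""
    total = 0.0
    for i in range(len(points)):
        neighbours = neighbour_labels(points, labels, i, k)
        p = sum(neighbours) / len(neighbours)  # share of class 1 nearby
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
    return total / len(points)
```

A purity near 1, a centroid cosine near -1, and a local entropy near 0 together describe the same picture from three angles: two tight clusters at opposite poles with almost nothing in between.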

The projection below makes that geometry tangible: each point is a session mapped through the frozen encoder, and the synthetic stress variations pull away from the real-field cluster rather than blending into it.

Figure: Interactive 3D UMAP projection of GammaNet's behavioral embedding: 1,000 synthetic realistic-bot variations, colored by GammaNet score, set against real field traffic. A Tier-0 stress view of the latent geometry, not labeled ground truth.

The separation is not the work of a single direction. When each latent axis is scored on its own, discriminative power is spread across the representation, with a dozen or more axes each carrying real signal rather than one axis doing the work. That distribution matters for a security model. A representation that concentrated its signal in one axis would also concentrate the adversary's target. A distributed one does not offer that single point to push on.
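Scoring each axis on its own can be done with a rank-based AUC per latent coordinate. The sketch below is one way to do it, with assumed conventions: labels are 1 for bot and 0 for human, ties count half, and because the sign of a latent axis is arbitrary, each axis reports `max(auc, 1 - auc)`.

```python
def auc(positive_values, negative_values):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in positive_values for n in negative_values)
    return wins / (len(positive_values) * len(negative_values))


def per_axis_auc(latents, labels):
    """Orientation-free AUC of each latent axis taken alone."""
    results = []
    for axis in range(len(latents[0])):
        bots = [z[axis] for z, label in zip(latents, labels) if label == 1]
        humans = [z[axis] for z, label in zip(latents, labels) if label == 0]
        score = auc(bots, humans)
        results.append(max(score, 1.0 - score))
    return results


def informative_axes(latents, labels, min_auc=0.8):
    """Indices of axes whose stand-alone AUC clears an illustrative cutoff."""
    return [i for i, a in enumerate(per_axis_auc(latents, labels)) if a >= min_auc]
```

The 0.8 cutoff is a placeholder for illustration. The useful output is the count of axes that carry signal on their own: a long list indicates a distributed representation, a single entry indicates a concentrated one.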

The architecture search, and a wall

This is the point where a writeup usually declares victory. Here the next step was the opposite: an attempt to beat the model. If 32 dimensions separate the classes this cleanly, perhaps a richer embedding would raise live performance further.

The test was a bake-off of 16 model families on the behavioral feature set, scored on a battery that included the live production task. The set was deliberately broad: variational autoencoders of several kinds, supervised-contrastive and hyperspherical objectives, rotation-parameterized and feature-group-factored encoders, quaternion and octonion hypercomplex backbones, denoising and contractive variants, and a self-attention transformer over feature blocks.

Two findings came back, and both were clear.

First, the live ceiling is invariant to architecture. Every family, from a plain tied autoencoder to an octonion network to a transformer, landed in a narrow band, and the best of them only matched the existing encoder. Architectural sophistication did not move the limit.

Second, sharper in-distribution separation tracked worse live generalization. The models with the crispest training-distribution split, the aggressive contrastive objectives that reached perfect recall on the training data, were among the weakest on live traffic. Over-separating the training distribution is overfitting under another name, which is exactly why the encoder is frozen rather than reshaped by the classifier.

Then the control that settled the question: discard the embedding entirely and train classifiers directly on the raw features. An unrestricted gradient-boosted tree given every raw feature with no bottleneck, a linear model on the same features, and the compact embedding all land at essentially the same live AUC: a hard ceiling we measured rather than assumed. The live ceiling is a property of what you measure rather than of how you model it. That is a more useful result than another decimal point of AUC, because it says, with evidence, that further modeling work on the current signals is a dead end, and that the way forward is a new and independent signal.

One footnote became a model. Since the best generalizers were the generative heads, the team built one. VMFNet scores a session with a per-class von Mises-Fisher mixture on the sphere, comparing the likelihood under a human mixture against a bot mixture. At about a third of the size of the prior embedding stack, it reaches the same raw-feature ceiling. Generative heads generalized more stably than discriminative ones, the same lesson that led to freezing the encoder.
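The scoring rule behind a von Mises-Fisher mixture head can be written down directly. The sketch below is a generic implementation of the standard vMF density and a log-likelihood ratio between two mixtures, not VMFNet itself: the component weights, mean directions, and concentrations would come from fitting, which is not shown, and all function names are hypothetical. The Bessel function is evaluated with its power series, which is adequate for moderate concentrations (roughly up to 100) but not for very large ones.

```python
import math


def log_bessel_i(order, x, terms=200):
    """log of the modified Bessel function I_order(x), x > 0, via its power series."""
    log_half_x = math.log(x / 2.0)
    log_terms = [(2 * m + order) * log_half_x
                 - math.lgamma(m + 1) - math.lgamma(m + order + 1)
                 for m in range(terms)]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def vmf_log_pdf(x, mean_direction, kappa):
    """Log density of a von Mises-Fisher distribution at unit vector x."""
    d = len(x)
    order = d / 2.0 - 1.0
    log_norm = (order * math.log(kappa)
                - (d / 2.0) * math.log(2.0 * math.pi)
                - log_bessel_i(order, kappa))
    return log_norm + kappa * sum(m * v for m, v in zip(mean_direction, x))


def mixture_log_pdf(x, components):
    """components: list of (weight, mean_direction, kappa); weights sum to 1."""
    logs = [math.log(w) + vmf_log_pdf(x, mu, kappa) for w, mu, kappa in components]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


def bot_log_likelihood_ratio(x, bot_mixture, human_mixture):
    """Positive when x is more likely under the bot mixture than the human one."""
    return mixture_log_pdf(x, bot_mixture) - mixture_log_pdf(x, human_mixture)
```

A generative head of this form scores a session by how well each class explains it, rather than by where it falls relative to a learned boundary.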

Defense in depth: layers that fail independently

In-distribution accuracy is a weak guarantee on its own, so each model was red-teamed internally against input perturbation and partial telemetry. The perturbation budgets are withheld, since those numbers amount to a recipe, but the qualitative picture guides the design. The fingerprint model degrades gracefully under moderate input noise and tolerates incomplete telemetry, which is what a request-path model needs when it sometimes sees partial data.

The conclusion the design acts on is that no single model should carry the system. That is why the stack is a cascade rather than one classifier. An adversary who degrades one model's strongest evidence still faces the deterministic layer and a second model that reads an entirely different family of signals, and the final decision fires if any of the three does. Resilience here comes from holding several uncorrelated tells at once rather than one strong one. Defeating the stack means defeating all of them together.

Production footprint

Because the stack was designed for the request path from the start, the neural version is light as well as accurate.

| Measure | Tree stack | Neural stack | Change |
| --- | --- | --- | --- |
| On-disk size (ONNX) | 6.465 MB | 0.079 MB | about 82x smaller |
| Added inference RAM | +89 MB | +14 MB | about 6.4x lighter |
| Throughput | 0.56 M rows/s | 1.74 M rows/s | about 3x faster |

The metric that matters holds steady while the on-disk footprint drops by close to two orders of magnitude. The entire deployed neural stack is 79 kilobytes, small enough to treat as negligible against the rest of the request path.
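The ratios in the table follow directly from the raw figures, and the on-disk size is in line with the parameter counts quoted earlier. The size estimate below is a back-of-the-envelope check that assumes 4-byte float32 weights and ignores graph overhead. The storage format is not stated in this report, so treat that part as an assumption.

```python
tree = {"size_mb": 6.465, "ram_mb": 89.0, "rows_per_s": 0.56e6}
neural = {"size_mb": 0.079, "ram_mb": 14.0, "rows_per_s": 1.74e6}

size_ratio = tree["size_mb"] / neural["size_mb"]               # about 82x smaller
ram_ratio = tree["ram_mb"] / neural["ram_mb"]                  # about 6.4x lighter
throughput_ratio = neural["rows_per_s"] / tree["rows_per_s"]   # about 3x faster

# Rough size check: approximate parameter counts times an assumed 4 bytes each.
approx_params = 6_500 + 14_000
approx_weight_bytes = approx_params * 4  # 82,000 bytes, the same order as 79 KB
```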

Per-class coverage, and the honest number

On the gold-standard evaluation, 5,997 labeled bots and 7,964 humans drawn from production logs with session-level splits and leakage guards, the deployed cascade catches every adversary class at full recall.

| Adversary class | Deployed-stack recall |
| --- | --- |
| Basic automation | 100% |
| Hard stealth browsers | 100% |
| LLM agents | 100% |
| Stealth + fingerprint patch | 100% |
| Unmapped live bots | 100% |

That includes the hard classes: stealth browsers with patched fingerprints, language-model agents, and live bots that behave like ordinary users. Sessions that an earlier model scored near zero now score near one after a targeted fix for the distribution they came from.

Then the result that keeps the work honest. At the strict operating point, recall on the full production population sits below the bar the stack clears on its training distribution. The reason is worth stating plainly, because it is not a model-quality problem. The gap is about population composition and the cost of a strict false-positive budget. A real human population contains large groups whose traffic resembles automation at that budget, so holding it necessarily leaves recall on the table. The architecture search already showed that almost all of the separating signal the current features contain has been extracted. Closing the remaining gap calls for a new and independent signal, or for population-specific operating points, rather than a better classifier on the same inputs.
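The composition effect can be reproduced with a toy example. All scores below are synthetic and invented purely for illustration. They are not production data, and the population names are placeholders. One human population scores low, another scores high enough to resemble automation, and both contain the same bots. A single global threshold at a 1% budget is set by the high-scoring humans, while per-population thresholds hold the same budget in each group and recover more bots.

```python
import math


def threshold_at_fpr(human_scores, max_fpr):
    """Threshold such that the share of humans scoring strictly above it is <= max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked))
    return ranked[allowed] if allowed < len(ranked) else float("-inf")


def count_above(scores, threshold):
    return sum(score > threshold for score in scores)


# Synthetic scores for illustration only.
humans = {
    "population_a": [i / 200 for i in range(100)],        # low-scoring humans
    "population_b": [0.5 + i / 250 for i in range(100)],  # humans that resemble automation
}
bots = {
    "population_a": [0.60, 0.70, 0.95],
    "population_b": [0.60, 0.70, 0.95],
}
BUDGET = 0.01

all_humans = [s for scores in humans.values() for s in scores]
all_bots = [s for scores in bots.values() for s in scores]

# One global threshold: dominated by the high-scoring human population.
global_threshold = threshold_at_fpr(all_humans, BUDGET)
global_recall = count_above(all_bots, global_threshold) / len(all_bots)

# One threshold per population, each holding the same budget.
per_population_threshold = {name: threshold_at_fpr(scores, BUDGET)
                            for name, scores in humans.items()}
per_population_recall = sum(
    count_above(scores, per_population_threshold[name]) for name, scores in bots.items()
) / len(all_bots)
```

The classifier is identical in both cases. Only the operating point changes, which is why the gap is described as a matter of population composition and not of model quality.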

What this means for teams running detection

Strip the specifics and a few operating lessons carry over to any team trying to separate humans from automation that is built to look human:

  1. Fingerprint-only and network-only checks miss the adversary that matters. The classes that cost you money pass those checks by design. Behavioral and runtime signals are where they separate.
  2. The ceiling is the signal, not the model. When a model search converges on the same number across 16 families, more modeling will not help. The next gain comes from a new, independent signal captured where the bot actually executes.
  3. Build layers that fail independently. A cascade of a deterministic layer plus uncorrelated models is harder to evade than one strong classifier, because an attacker has to defeat all of them at once.
  4. Pick operating points per population and per action. A single global threshold at a strict false-positive budget leaves recall on the table. Scope the decision to the action, and tune the budget to who you are protecting.

This is also where cside fits operationally. cside watches the browser runtime in real time, captures the device and real client IP, surfaces runtime script and automation signals, reads fingerprint coherence and session continuity, and flags AI agents and stealth browsers inside the page, then exposes those signals through an API so you can drive an allow, monitor, challenge, or block decision in your own workflow. That browser-layer telemetry is exactly the independent signal a fingerprint-only or network-only stack runs out of, and the layer where a human, a good bot, and a malicious agent finally stop looking alike. For the signal mechanics underneath that decision, see the guide to detecting AI agent traffic and how to block AI agents on your website.

Figures and metrics in this report are drawn from cached model artifacts and production-traffic evaluations. Specific detection signals, feature names, and decision thresholds are withheld by design.

Avneh
AI Researcher

Making machines learn. Applied math major currently developing the next generation of bot detection models at cside.

Frequently Asked Questions

Because the adversaries worth engineering against run real Chromium, patch the fingerprint surface so naive checks read clean, and proxy through residential address space. A clean fingerprint does not prove a human, and a datacenter address does not prove a bot. No single observable settles the question, so detection has to combine fingerprint coherence, network context, and behavior rather than trusting any one signal.

Behavioral detection scores an entire session from how it moves through your site: the shape and timing of actions rather than the identity it claims. In the stack described here, a frozen autoencoder maps about 179 aggregate behavioral inputs into a 32-dimensional latent space where bots and humans land in opposite regions, and a small supervised head reads that geometry. Behavior is harder to fake convincingly than a user-agent string or a single fingerprint value, which is why it separates the hard classes.

On this problem, yes. Two compact models, roughly 6,500 and 14,000 parameters, match or beat a heavy gradient-boosted baseline while shipping at about one eightieth of the on-disk size. A bake-off of 16 model families showed the live ceiling is the same for a plain autoencoder and a transformer. The limit comes from the signal you collect, not the size or sophistication of the classifier, so a small model that fits the request path is the practical choice.

Population composition and the cost of a strict false-positive budget. At that budget, recall on the full production population sits below the internal bar cleared on the training distribution, even though every named adversary class was caught at full recall on the gold-standard set. A real human population contains large groups whose traffic resembles automation, so holding the budget necessarily leaves recall on the table. Closing the gap calls for a new and independent signal, not a better classifier on the same inputs.
