Catching bots that don't want to be caught: inside a two-stage neural detection stack

How a two-stage neural stack catches stealth browsers, proxied scrapers, and LLM agents that pass every fingerprint check, and where it hits a wall.

Jun 22, 2026 • 15 min read

Avneh, AI Researcher

TL;DR

Most automated traffic is easy to stop. A user-agent regular expression and a CAPTCHA catch the bottom of the market. The traffic that matters does not live there. Stealth-patched browsers, residential-proxied scrapers, and LLM-driven agents run real browser engines, carry clean fingerprints, and move a pointer the way a person does. A clean fingerprint does not prove a human, and a datacenter address does not prove a bot.

This is a field report on a detection stack built for that adversary: a deterministic rule layer plus two compact neural models, one reading fingerprint and network signals and one reading behavior, combined by a logical OR. The behavioral model's frozen embedding separates bots from humans almost perfectly. A 16-architecture search shows the performance ceiling comes from the signal, not the model. And the honest result: on the full production population at a strict false-positive budget, recall sits below the bar the stack clears on its training distribution, for reasons of population composition rather than model quality.

Specific detection signals, feature names, and thresholds are withheld by design, so the engineering is reproducible in spirit without turning into an evasion guide.

The adversary worth engineering against

A user-agent regex and a CAPTCHA stop the bottom of the automation market and pass the rest through. The adversaries worth building against are not in that bottom tier. They run real Chromium. They patch the fingerprint surface that naive checks read. They proxy through residential address space. The most capable are language-model agents that pursue a goal across a multi-step flow rather than replaying a fixed script.

The defining property of this adversary is that no single observable settles the question. This is the same reason CAPTCHAs no longer do real work against motivated automation, and why stealth and anti-detect browsers exist as a product category. You cannot read the answer off one value. You have to combine independent signals and accept that each one, on its own, is wrong some of the time.

The detection stack: a deterministic layer and two compact models

The stack decides whether a visitor is automated with a two-stage cascade preceded by a deterministic rule layer.

Deterministic rule layer. Encodes the handful of conditions that constitute proof of automation on their own.
Model 1, fingerprint (MorphNet). Scores a single page evaluation from device, browser, and network signals.
Model 2, behavior (GammaNet). Scores an entire session from behavioral telemetry, and takes the first model's score as an input.

The final label is the logical OR of all three. A visitor is flagged if the rule layer fires, or if either model fires. That structure has a consequence that shapes everything else: because the rule layer already catches the cases it can prove, the models do not need to. The models train only on the residual, the traffic the rules cannot settle, which is both the hard part of the problem and the part where a model earns its place.
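
As a rough illustration of the OR-combination described above, the sketch below flags a visitor when any of the three layers fires. The names `decide` and `Verdict` and the 0.5 thresholds are placeholders invented for this example; the real thresholds are withheld.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    flagged: bool
    reasons: tuple  # names of the layers that fired, in cascade order


def decide(rule_fired, fingerprint_score, behavior_score,
           fingerprint_threshold=0.5, behavior_threshold=0.5):
    """Flag the visitor if the rule layer fires or either model crosses its threshold.

    The thresholds are hypothetical defaults, not production values.
    """
    checks = (
        ("rule_layer", rule_fired),
        ("fingerprint_model", fingerprint_score >= fingerprint_threshold),
        ("behavior_model", behavior_score >= behavior_threshold),
    )
    reasons = tuple(name for name, fired in checks if fired)
    return Verdict(flagged=bool(reasons), reasons=reasons)


# A clean fingerprint does not clear the session if behavior fires.
print(decide(rule_fired=False, fingerprint_score=0.1, behavior_score=0.9))
```

Returning the list of layers that fired, rather than a bare boolean, keeps the decision auditable: a reviewer can see whether a flag came from proof or from a model.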

The bar held throughout is recall of at least 0.90 at a false-positive rate of at most 0.01, measured on real production traffic at the session level.
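
One way to read that bar: fix the threshold from the human score distribution so the false-positive budget holds, then measure recall at that threshold. A minimal sketch, assuming higher scores mean more bot-like and that a session is flagged when its score is strictly above the threshold. The function names and toy scores are illustrative, not production values.

```python
import math


def threshold_at_fpr(human_scores, max_fpr):
    """Lowest threshold at which the share of humans scoring above it stays within max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    # Number of humans the budget allows us to flag (epsilon guards float rounding).
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    # Flagging strictly above this score flags at most `allowed` humans.
    return ranked[allowed]


def recall_at_fpr(bot_scores, human_scores, max_fpr):
    """Return (threshold, recall, realized false-positive rate)."""
    t = threshold_at_fpr(human_scores, max_fpr)
    recall = sum(s > t for s in bot_scores) / len(bot_scores)
    fpr = sum(s > t for s in human_scores) / len(human_scores)
    return t, recall, fpr


# Toy session-level scores, one score per session.
humans = [i / 1000 for i in range(200)]
bots = [0.15, 0.5, 0.9, 0.95]
threshold, recall, fpr = recall_at_fpr(bots, humans, max_fpr=0.01)
print(threshold, recall, fpr)
```

The threshold is derived from human scores alone, so the false-positive budget holds by construction and recall is whatever the bot distribution leaves above it.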

How the evaluation stays honest

The numbers below only mean something because of how the evaluation is built, so it is worth stating the guards first.

Training and evaluation data are real page evaluations grouped into sessions. Labels are deterministic, applied only when a signal constitutes proof; anything ambiguous is left unlabeled rather than guessed, which keeps label noise out of the ground truth. For the stealth classes that evade fingerprinting by construction, ground truth comes from a controlled traffic-generation harness that runs known automation against instrumented targets, so the label is known by provenance rather than inferred.

Three leakage guards keep the reported numbers honest:

The signals the deterministic rule layer reads are excluded from model training, so the model learns the residual instead of re-memorizing what the rules already own.
The events that directly announce an automation agent are stripped from the behavioral stream before features are computed, so the behavioral model learns motion and timing structure rather than the presence of a tell-tale event type.
The cascade is wired with out-of-fold scores: the first model's contribution to the second is its cross-validated score, never a prediction from a model that has already seen the session.

Every split is by session, never by row, since multiple evaluations share a session and a row-level split would let the same session appear on both sides. All results use a fixed seed.
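
The session-level split and the out-of-fold wiring can be sketched together. The version below is a simplified stand-in, not the production pipeline: it assigns folds by hashing a session ID (which is deterministic, so it plays the role of a fixed seed), and `fit` is a hypothetical callback that trains on rows and returns a scoring function.

```python
import hashlib
from collections import defaultdict


def fold_of(session_id, n_folds):
    """Stable fold assignment: every row of a session hashes to the same fold."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def out_of_fold_scores(rows, fit, n_folds=5):
    """rows: list of (session_id, features, label); fit(train_rows) -> score(features).

    Each row is scored by a model that never saw any row from the same session.
    """
    row_folds = [fold_of(session_id, n_folds) for session_id, _, _ in rows]
    by_fold = defaultdict(list)
    for index, fold in enumerate(row_folds):
        by_fold[fold].append(index)

    scores = [None] * len(rows)
    for held_out, indices in by_fold.items():
        # Train only on rows whose session falls in a different fold.
        train = [row for row, fold in zip(rows, row_folds) if fold != held_out]
        score = fit(train)
        for index in indices:
            scores[index] = score(rows[index][1])
    return scores
```

Feeding these scores to the second model means its first-stage input at training time looks like what it will see in production: a score from a model that has not memorized the session.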

Model 1: MorphNet, the fingerprint model

The first model reads roughly 90 per-evaluation features describing fingerprint coherence, transport-layer characteristics, and network context. The first production version was a gradient-boosted tree ensemble. It was accurate on easy traffic, heavy to ship into a latency-sensitive request path, and hard to introspect. It was replaced with a from-scratch network.

MorphNet gives a tiny network the expressive shape of a tree ensemble: a learned linear reduction into a compact internal space, then per-channel activations where each channel fits its own response curve, then a low-rank interaction layer that lets pairs of signals modulate each other, a weighted read-out, and a short funnel to a single output. The whole model is about 6,500 parameters and trains in roughly 25 seconds on a laptop GPU.

Two latent failures in the first draft were caught at design review, before any training code was written. One was a terminal softmax over a single output, which is mathematically constant and carries no gradient, so the model could not learn. The other was an interaction read-out that collapsed to a constant vector, a dead layer that contributes nothing. A third lesson came from training: the normalization choice is load-bearing. Batch normalization's running statistics dropped validation AUC from 0.99 to 0.50, and layer normalization is the correct fit for the single-example model that runs in production.
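
The softmax failure is easy to verify by hand. Softmax normalizes across its outputs, so with a single output it divides a value by itself. The sketch below contrasts it with a sigmoid, the usual choice for a single-logit binary output; the post does not say which replacement MorphNet uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for logit in (-5.0, 0.0, 5.0):
    # softmax([z]) is [1.0] for every z, so its gradient with respect to z is zero.
    print(logit, softmax([logit]), sigmoid(logit))
```

Whatever the logit, the single-output softmax returns 1.0, while the sigmoid moves with the logit and therefore carries gradient.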

On an honest session-level split, MorphNet produces the following results:

Metric	Value
AUC	0.9994
Recall at operating point	0.9959
Recall across 5 seeds	0.9962 ± 0.0007
Inference-vs-training numerical agreement	within 5e-7

Every seed clears the gate, and the agreement between the exported inference graph and the training framework means the model served is the model trained.

The contrast with the old baseline is sharpest on the residual hard set, the traffic the rule layer cannot settle. On a held-out stress split of 6,004 sessions:

Model	Recall on residual hard set
Earlier gradient-boosted version A	0.139
Earlier gradient-boosted version B	0.039
MorphNet	0.996

The rebuild is not a marginal gain on this slice. It is the difference between a model that works on the hard cases and two that do not.

Model 2: GammaNet, the behavioral model

Behavior is harder than fingerprint, so the second model has more structure. It reads 179 aggregate behavioral inputs describing the shape of a session, and it is built in two pieces.

The front end is the encoder half of an autoencoder, which maps the cleaned, standardized inputs down through a hidden layer to a 32-dimensional latent on the unit hypersphere. This encoder is trained once, without labels, and then frozen. The back end is a small supervised head: each latent channel passes through a compact learned nonlinearity, initialized so training begins from the frozen representation, followed by a short linear funnel to a single output. The model is about 14,000 parameters, the bulk of them in the frozen encoder and only a few hundred trainable in the head.

Freezing the encoder is the decision that makes this model behave. An encoder trained to reconstruct generalizes to live traffic; one trained to separate the classes as sharply as possible does not. On a held-out session-level split, GammaNet reaches an AUC of 0.9989 and recall of 0.9964 at the operating point, within a hair of a freshly retrained tree baseline on the identical split. On a live replay against known bots, its model-only recall was 99.4%, against 96.4% for the previously deployed behavioral model.

The behavioral embedding separates almost perfectly

The most striking artifact of the project is the geometry of GammaNet's frozen latent. Roughly 17,000 sessions were mapped through the encoder to see where bots and humans land. They sit in opposite regions of the sphere.

Metric	Value	What it means
Nearest-neighbour purity (k=1)	99.6%	Almost every point's nearest neighbour shares its class
Centroid cosine separation	-0.99	Bot and human centroids sit nearly antipodal on the sphere
Mean local entropy (k=20)	0.010	Almost no class mixing in local neighbourhoods

In practice there is no confusion zone. The classes sit near opposite poles, the region between them holds almost no probability mass, and the model is confident almost everywhere.
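
For readers who want to run the same kind of check on their own embeddings, the three metrics in the table can be computed roughly as follows. This is a brute-force sketch under assumptions the post does not confirm: neighbours are ranked by cosine similarity, and local entropy is measured in bits over the class labels of the k nearest neighbours.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def neighbours(points, i, k):
    """Indices of the k points most cosine-similar to point i, excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: cosine(points[i], points[j]), reverse=True)[:k]


def nn_purity(points, labels):
    """Share of points whose single nearest neighbour has the same label."""
    hits = sum(labels[neighbours(points, i, 1)[0]] == labels[i] for i in range(len(points)))
    return hits / len(points)


def centroid_cosine(points, labels, class_a, class_b):
    """Cosine between the two class centroids; -1 means antipodal."""
    def centroid(cls):
        members = [p for p, label in zip(points, labels) if label == cls]
        return [sum(column) / len(members) for column in zip(*members)]
    return cosine(centroid(class_a), centroid(class_b))


def mean_local_entropy(points, labels, k):
    """Average label entropy (bits) in each point's k-neighbourhood; 0 means no mixing."""
    total = 0.0
    for i in range(len(points)):
        counts = Counter(labels[j] for j in neighbours(points, i, k))
        total += -sum((c / k) * math.log2(c / k) for c in counts.values())
    return total / len(points)


# Toy data: two tight clusters near opposite poles.
humans = [(1.0, 0.01 * i, 0.0) for i in range(4)]
bots = [(-1.0, 0.01 * i, 0.0) for i in range(4)]
points, labels = humans + bots, ["human"] * 4 + ["bot"] * 4
print(nn_purity(points, labels), centroid_cosine(points, labels, "bot", "human"),
      mean_local_entropy(points, labels, k=2))
```

The pairwise loop is quadratic in the number of points, which is fine for a toy set; at tens of thousands of sessions a vectorized or approximate nearest-neighbour search would be the practical choice.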

The projection below makes that geometry tangible: each point is a session mapped through the frozen encoder, and the synthetic stress variations pull away from the real-field cluster rather than blending into it.

3D UMAP projection of GammaNet's behavioral embedding: 1,000 synthetic realistic-bot variations, colored by GammaNet score, set against real field traffic. A Tier-0 stress view of the latent geometry, not labeled ground truth.

The separation is not the work of a single direction. When each latent axis is scored on its own, discriminative power is spread across the representation, with a dozen or more axes each carrying real signal rather than one axis doing the work. That distribution matters for a security model. A representation that concentrated its signal in one axis would also concentrate the adversary's target. A distributed one does not offer that single point to push on.
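
Scoring each axis on its own can be done in several ways; the post does not say which was used. One plausible version, sketched here as an assumption, treats each latent coordinate as a one-dimensional score and computes its AUC, folded so that an axis that separates in either direction counts as informative.

```python
def auc(bot_values, human_values):
    """Probability that a random bot value exceeds a random human value (ties count half)."""
    wins = 0.0
    for b in bot_values:
        for h in human_values:
            if b > h:
                wins += 1.0
            elif b == h:
                wins += 0.5
    return wins / (len(bot_values) * len(human_values))


def per_axis_power(bot_latents, human_latents):
    """AUC of each latent axis used alone; 0.5 is chance, 1.0 is perfect separation."""
    n_axes = len(bot_latents[0])
    powers = []
    for axis in range(n_axes):
        a = auc([z[axis] for z in bot_latents], [z[axis] for z in human_latents])
        powers.append(max(a, 1.0 - a))  # direction of separation does not matter
    return powers


# Toy 2-D latents: axis 0 separates the classes, axis 1 carries nothing.
bot_latents = [(0.9, 0.1), (0.8, 0.3)]
human_latents = [(-0.7, 0.1), (-0.9, 0.3)]
print(per_axis_power(bot_latents, human_latents))
```

A profile where many axes sit well above 0.5 is the distributed case the paragraph describes; a profile with one axis near 1.0 and the rest near 0.5 is the concentrated case.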

The architecture search, and a wall

This is the point where a writeup usually declares victory. Instead, the next step was an attempt to beat the model. If 32 dimensions separate the classes this cleanly, perhaps a richer embedding would raise live performance further.

The test was a bake-off of 16 model families on the behavioral feature set, scored on a battery that included the live production task. The set was deliberately broad: variational autoencoders of several kinds, supervised-contrastive and hyperspherical objectives, rotation-parameterized and feature-group-factored encoders, quaternion and octonion hypercomplex backbones, denoising and contractive variants, and a self-attention transformer over feature blocks.

Two findings came back, and both were clear.

First, the live ceiling is invariant to architecture. Every family, from a plain tied autoencoder to an octonion network to a transformer, landed in a narrow band, and the best of them only matched the existing encoder. Architectural sophistication did not move the limit.

Second, sharper in-distribution separation tracked worse live generalization. The models with the crispest training-distribution split, the aggressive contrastive objectives that reached perfect recall on the training data, were among the weakest on live traffic. Over-separating the training distribution is overfitting under another name, which is exactly why the encoder is frozen rather than reshaped by the classifier.

Then the control that settled the question: discard the embedding entirely and train classifiers directly on the raw features. An unrestricted gradient-boosted tree given every raw feature with no bottleneck, a linear model on the same features, and the compact embedding all land at essentially the same live AUC: a hard ceiling we measured rather than assumed. The live ceiling is a property of what you measure rather than of how you model it. That is a more useful result than another decimal point of AUC, because it says, with evidence, that further modeling work on the current signals is a dead end, and that the way forward is a new and independent signal.

One footnote became a model. Since the best generalizers were the generative heads, the team built one. VMFNet scores a session with a per-class von Mises-Fisher mixture on the sphere, comparing the likelihood under a human mixture against a bot mixture. At about a third of the size of the prior embedding stack, it reaches the same raw-feature ceiling. Generative heads generalized more stably than discriminative ones, the same lesson that led to freezing the encoder.
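
The scoring rule behind a per-class von Mises-Fisher mixture can be sketched in a few lines. The example below is a simplified stand-in rather than VMFNet itself: it works in 3 dimensions, where the normalizing constant has the closed form kappa / (4 * pi * sinh(kappa)), whereas a 32-dimensional latent would need a modified Bessel function in the normalizer. The mixture weights, mean directions, and concentrations are hypothetical.

```python
import math


def log_vmf_3d(x, mu, kappa):
    """Log density of a von Mises-Fisher distribution on the unit sphere in 3-D (kappa > 0)."""
    dot = sum(a * b for a, b in zip(x, mu))
    # log(sinh(kappa)) written so it does not overflow for large kappa.
    log_sinh = kappa + math.log1p(-math.exp(-2.0 * kappa)) - math.log(2.0)
    return math.log(kappa) - math.log(4.0 * math.pi) - log_sinh + kappa * dot


def log_mixture(x, components):
    """components: list of (weight, mu, kappa). Log-sum-exp over the weighted components."""
    terms = [math.log(w) + log_vmf_3d(x, mu, kappa) for w, mu, kappa in components]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def bot_log_likelihood_ratio(x, bot_mixture, human_mixture):
    """Positive when the point is more likely under the bot mixture than the human one."""
    return log_mixture(x, bot_mixture) - log_mixture(x, human_mixture)


# Hypothetical single-component mixtures centred on opposite poles.
bot_mixture = [(1.0, (1.0, 0.0, 0.0), 10.0)]
human_mixture = [(1.0, (-1.0, 0.0, 0.0), 10.0)]
print(bot_log_likelihood_ratio((1.0, 0.0, 0.0), bot_mixture, human_mixture))
print(bot_log_likelihood_ratio((0.0, 1.0, 0.0), bot_mixture, human_mixture))
```

Because each class is modeled by its own density, a point far from both mixtures yields a ratio near zero rather than a confident label, which is one reason a generative head can behave more conservatively off-distribution.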

Defense in depth: layers that fail independently

In-distribution accuracy is a weak guarantee on its own, so each model was red-teamed internally against input perturbation and partial telemetry. The perturbation budgets are withheld, since those numbers amount to a recipe, but the qualitative picture guides the design. The fingerprint model degrades gracefully under moderate input noise and tolerates incomplete telemetry, which is what a request-path model needs when it sometimes sees partial data.

The conclusion the design acts on is that no single model should carry the system. That is why the stack is a cascade rather than one classifier. An adversary who degrades one model's strongest evidence still faces the deterministic layer and a second model that reads an entirely different family of signals, and the final decision fires if any of the three does. Resilience here comes from holding several uncorrelated tells at once rather than one strong one. Defeating the stack means defeating all of them together.
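
The arithmetic behind that claim is short. If the layers really do fail independently, the chance of evading an OR-combined stack is the product of the per-layer evasion chances. The sketch below works that through; the probabilities are made-up illustrations, not measurements, and correlated layers would do worse than the product suggests.

```python
from math import prod


def joint_evasion_probability(per_layer_evasion):
    """Chance of slipping past every layer, assuming the layers fail independently."""
    return prod(per_layer_evasion)


def stack_recall(per_layer_recall):
    """Recall of an OR-combined stack under the same independence assumption."""
    return 1.0 - prod(1.0 - r for r in per_layer_recall)


# Hypothetical attacker who has degraded each layer to a different degree.
evasion = [0.5, 0.2, 0.1]  # rule layer, fingerprint model, behavior model
print(joint_evasion_probability(evasion))
print(stack_recall([1.0 - p for p in evasion]))
```

The product only holds to the extent that the layers read different families of signals, which is why the design emphasizes uncorrelated tells rather than several views of the same one.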

Production footprint

Because the stack was designed for the request path from the start, the neural version is light as well as accurate.

Measure	Tree stack	Neural stack	Change
On-disk size (ONNX)	6.465 MB	0.079 MB	about 82x smaller
Added inference RAM	+89 MB	+14 MB	about 6.4x lighter
Throughput	0.56 M rows/s	1.74 M rows/s	about 3x faster

The metric that matters holds steady while on-disk size drops by close to two orders of magnitude. The entire deployed neural stack is 79 kilobytes, small enough to treat as negligible against the rest of the request path.

Per-class coverage, and the honest number

On the gold-standard evaluation, 5,997 labeled bots and 7,964 humans drawn from production logs with session-level splits and leakage guards, the deployed cascade catches every adversary class at full recall.

Adversary class	Deployed-stack recall
Basic automation	100%
Hard stealth browsers	100%
LLM agents	100%
Stealth + fingerprint patch	100%
Unmapped live bots	100%

That includes the hard classes: stealth browsers with patched fingerprints, language-model agents, and live bots that behave like ordinary users. Sessions that an earlier model scored near zero now score near one after a targeted fix for the distribution they came from.

Then the result that keeps the work honest. At the strict operating point, recall on the full production population sits below the bar the stack clears on its training distribution. The reason is worth stating plainly, because it is not a model-quality problem. The gap is about population composition and the cost of a strict false-positive budget. A real human population contains large groups whose traffic resembles automation at that budget, so holding it necessarily leaves recall on the table. The architecture search already showed that almost all of the separating signal the current features contain has been extracted. Closing the remaining gap calls for a new and independent signal, or for population-specific operating points, rather than a better classifier on the same inputs.
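
The composition effect can be reproduced with toy numbers. In the sketch below, the same bot scores are evaluated at the same 1% false-positive budget against two human populations: one without automation-like humans and one that includes a small group of them. All scores and group sizes are invented for illustration.

```python
import math


def strict_threshold(human_scores, max_fpr):
    """Lowest threshold that keeps the share of humans scoring above it within max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    return ranked[allowed] if allowed < len(ranked) else float("-inf")


def recall(bot_scores, threshold):
    return sum(s > threshold for s in bot_scores) / len(bot_scores)


# Hypothetical scores: most humans score low, a small group resembles automation.
typical_humans = [i / 10000 for i in range(1000)]
lookalike_humans = [0.6 + i / 200 for i in range(50)]
bots = [0.3, 0.5, 0.65, 0.8, 0.9]

narrow = strict_threshold(typical_humans, max_fpr=0.01)
full = strict_threshold(typical_humans + lookalike_humans, max_fpr=0.01)

print("typical humans only:", narrow, recall(bots, narrow))
print("full population:    ", full, recall(bots, full))
```

The model and the bots are identical in both runs. Only the human population changes, and the threshold needed to protect the automation-like group is what removes recall.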

What this means for teams running detection

Strip the specifics and a few operating lessons carry over to any team trying to separate humans from automation that is built to look human:

Fingerprint-only and network-only checks miss the adversary that matters. The classes that cost you money pass those checks by design. Behavioral and runtime signals are where they separate.
The ceiling is the signal, not the model. When a model search converges on the same number across 16 families, more modeling will not help. The next gain comes from a new, independent signal captured where the bot actually executes.
Build layers that fail independently. A cascade of a deterministic layer plus uncorrelated models is harder to evade than one strong classifier, because an attacker has to defeat all of them at once.
Pick operating points per population and per action. A single global threshold at a strict false-positive budget leaves recall on the table. Scope the decision to the action, and tune the budget to who you are protecting.

This is also where cside fits operationally. cside watches the browser runtime in real time: it captures the device and real client IP, surfaces runtime script and automation signals, reads fingerprint coherence and session continuity, and flags AI agents and stealth browsers inside the page. It then exposes those signals through an API so you can drive an allow, monitor, challenge, or block decision in your own workflow. That browser-layer telemetry is exactly the independent signal a fingerprint-only or network-only stack runs out of, and the layer where a human, a good bot, and a malicious agent finally stop looking alike. For the signal mechanics underneath that decision, see the guide to detecting AI agent traffic and how to block AI agents on your website.
