In an earlier post we made a claim: no matter how a script draws the cursor, whether it teleports straight to the target, follows a bezier curve, or runs a physics library like WindMouse or NaturalMouse, a motion model still catches it. The shape of a movement is only half of what a hand leaves behind. We showed the same thing holds for Playwright and browserless.io's humanlike API.
That invites the obvious question from the other side of the table. What if you stopped hand-drawing the shape and learned the rest? We built the tool that does it. It is called human_nav: a research red-team tool that synthesizes human-like cursor, scroll, and keystroke motion to stress-test behavioral bot detection. This is what it revealed about the sites that lean hardest on behavioral biometrics.
The short version:
- Off-the-shelf humanizers (bezier curves, WindMouse, NaturalMouse) get caught 97% to 100% of the time by a desktop motion model. Geometry was never the answer.
- human_nav throws out hand-tuned geometry. Three small reinforcement-learning policies generate cursor paths, scroll windows, and keystroke timing, trained on real human motion: cadence, reversals, the pause between words.
- Against the frozen detectors it was trained on, it reaches the human band. Against a live, drifted detector the gap reopens. Scroll separability sits around AUC 0.77, the cursor policy can still land above a live threshold, and keystroke cadence can fall outside the human range.
- The result cuts toward defenders. Shape is a solved problem for the attacker. What still separates a policy from a person is cadence stability, coherence across channels, and a detector that keeps moving.
Why these are the hard targets
The bot defenses that survive a competent operator are not the ones checking whether the browser is headless. That fight was won and lost years ago. The defenses that still bite watch how the session behaves: the micro-timing of a scroll, the hesitation in a cursor, the rhythm of typing into a search box. Big marketplaces, professional networks, and high-churn social feeds, the Amazons and LinkedIns and Reddits of the web, lean on this layer, because it is the one a stealth browser and a clean fingerprint do not get past for free.
That makes them the right thing to point a behavioral red-team tool at. Not to break them, but to find out how much of that behavioral layer is real security and how much is a speed bump that folds the moment the motion comes from something better than a bezier curve. So we built the better thing.
How human_nav works
The idea is narrow. Do not script motion. Sample it from a policy that learned what human motion looks like. Three separate models, each a small reinforcement-learning policy running locally, each owning one channel.
Every move, scroll, and keystroke is routed through a local policy server before it touches the page. The automation asks for "move to B" or "type this," and the policy hands back the exact point-by-point motion to replay.
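To make the replay step concrete, here is a minimal sketch of turning a policy response into absolute pointer positions. It assumes the response is a list of relative (dx, dy, dt ms) steps, the format the cursor channel emits. The function name `replay_points` and the sample steps are hypothetical, not part of human_nav.

```python
from typing import Iterable, List, Tuple

Step = Tuple[float, float, float]   # (dx px, dy px, dt ms), relative to the previous point
Point = Tuple[float, float, float]  # (x px, y px, t ms), absolute


def replay_points(start: Tuple[float, float], steps: Iterable[Step]) -> List[Point]:
    """Accumulate relative steps into absolute points with timestamps."""
    x, y = start
    t = 0.0
    points: List[Point] = [(x, y, t)]
    for dx, dy, dt in steps:
        x += dx
        y += dy
        t += dt
        points.append((x, y, t))
    return points


# Hypothetical policy output: three steps toward a target.
steps = [(12.0, 3.0, 34.0), (20.0, 5.0, 35.0), (8.0, 2.0, 36.0)]
path = replay_points((100.0, 200.0), steps)
```

An automation layer would then dispatch one pointer move per point, waiting until each timestamp before sending the next.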
| Channel | Observation → action | Emits | Params |
|---|---|---|---|
| Cursor | 19 → 3 | A→B path of (dx, dy, dt ms) steps | ~40k |
| Scroll | 13 → 2 | wheel window of (dy, dt) ticks | ~9k |
| Typing | 33 → 2 | per-key (hold, flight) timing | ~35k |
Each one is a compact actor-critic MLP trained with PPO: pure PyTorch, CPU, a tanh-squashed Gaussian policy, running observation normalization, an entropy bonus, a target-KL early stop. None of that is exotic. The optimizer was never the point. The reward is.
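For readers who have not met a tanh-squashed Gaussian policy, the sketch below shows the standard construction for a single action dimension in plain Python: sample from a Gaussian, squash with tanh, and correct the log-probability for the change of variables. It illustrates the general technique, not the training code itself; the function names and the small `1e-6` stabilizer are assumptions.

```python
import math
import random


def sample_tanh_gaussian(mean: float, log_std: float, rng: random.Random):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (a, log_prob(a))."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)
    a = math.tanh(u)
    # Log-density of the pre-squash Gaussian sample.
    log_prob_u = (
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: subtract log|da/du| = log(1 - tanh(u)^2).
    # The 1e-6 term is a numerical stabilizer (an assumption of this sketch).
    log_prob_a = log_prob_u - math.log(1.0 - a * a + 1e-6)
    return a, log_prob_a


def deterministic_action(mean: float) -> float:
    """Squashed mean action, with no sampling noise."""
    return math.tanh(mean)
```

The squash keeps every action inside (-1, 1), so it can be rescaled linearly to a bounded physical range such as a pixel or millisecond window.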
Here is what sets it apart from every off-the-shelf humanizer. A bezier library optimizes the look of a path. These policies are graded by frozen copies of cside's own detectors, and rewarded in logit space for producing motion those detectors read as human. They are not drawing a nicer curve. They are solving for the thing the curve was only ever a stand-in for.
Inside the cursor policy
The cursor agent sees a 19-dimensional observation: the vector to the target, its last (dx, dy, dt), the step index, cumulative path length, net displacement, running speed and timing statistics, a direction-change count, a straightness ratio, and a next-waypoint lookahead. That accumulator-heavy state is deliberate. The detectors read aggregate kinematics over the whole path, so the policy gets handed running sufficient statistics for exactly the aggregates it is being judged on. It acts by emitting a (dx, dy, dt) triple each step, clipped to ±40 px and 4 to 40 ms.
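A rough sketch of a few of those aggregates, and of the action clip, follows. The exact feature definitions are not given here, so the details are assumptions: the clip is applied per axis, straightness is net displacement divided by path length, and a "direction change" is a heading turn of more than 45 degrees between consecutive steps.

```python
import math
from typing import Dict, Iterable, Tuple

Step = Tuple[float, float, float]  # (dx px, dy px, dt ms)


def clip_step(dx: float, dy: float, dt: float) -> Step:
    """Clip to +/-40 px (assumed per axis) and 4 to 40 ms."""
    def clamp(v: float, lo: float, hi: float) -> float:
        return max(lo, min(hi, v))
    return clamp(dx, -40.0, 40.0), clamp(dy, -40.0, 40.0), clamp(dt, 4.0, 40.0)


def path_aggregates(
    steps: Iterable[Step], turn_threshold_rad: float = math.pi / 4
) -> Dict[str, float]:
    """Whole-path statistics: length, net displacement, straightness, turns."""
    path_length = 0.0
    net_x = net_y = 0.0
    direction_changes = 0
    prev_heading = None
    for dx, dy, _dt in steps:
        segment = math.hypot(dx, dy)
        path_length += segment
        net_x += dx
        net_y += dy
        if segment == 0.0:
            continue  # a zero-length step has no heading
        heading = math.atan2(dy, dx)
        if prev_heading is not None:
            delta = heading - prev_heading
            # Wrap the turn angle into [0, pi] before thresholding.
            turn = abs(math.atan2(math.sin(delta), math.cos(delta)))
            if turn > turn_threshold_rad:
                direction_changes += 1
        prev_heading = heading
    net_displacement = math.hypot(net_x, net_y)
    straightness = net_displacement / path_length if path_length > 0 else 1.0
    return {
        "path_length": path_length,
        "net_displacement": net_displacement,
        "straightness": straightness,
        "direction_changes": direction_changes,
    }
```

Under these definitions a perfectly straight path scores a straightness of 1.0, and a right-angle detour of equal legs scores about 0.71.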
The reward is a gate. A reach bonus of 20 dominates everything, so the policy first learns to actually arrive at B. Only on arrival does it collect a detector reward, set to the human-margin of the binding critic, the worse of two frozen scorers: cursor_v2 (an MLP head, threshold around 0.992) and cursor_v1 (a LightGBM head, threshold around 0.828). Drop both below threshold at once and it earns a realism bonus on top. Training walks a curriculum: pure navigation first, then rising detector pressure against both critics together. Fighting two detectors at once is what keeps the paths clean and smoothly curved, instead of collapsing into an artifact that fools one scorer and looks broken to the other.
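The gate can be sketched as a small function. The reach bonus of 20 and the two approximate thresholds come from the text; everything else is an assumption made for illustration: that detector scores are bot probabilities where lower is more human, that the human-margin is the logit-space distance below the threshold, and that the realism bonus is worth 5. The `detector_weight` argument stands in for the curriculum's rising detector pressure.

```python
import math

REACH_BONUS = 20.0     # from the text
V2_THRESHOLD = 0.992   # cursor_v2, approximate value from the text
V1_THRESHOLD = 0.828   # cursor_v1, approximate value from the text
REALISM_BONUS = 5.0    # hypothetical value, not given in the text


def logit(p: float, eps: float = 1e-6) -> float:
    """Log-odds of p, clamped away from 0 and 1 so it stays finite."""
    p = min(1.0 - eps, max(eps, p))
    return math.log(p / (1.0 - p))


def human_margin(score: float, threshold: float) -> float:
    """Positive when the score sits below the threshold, in logit units."""
    return logit(threshold) - logit(score)


def episode_reward(
    reached: bool, score_v2: float, score_v1: float, detector_weight: float = 1.0
) -> float:
    """Gated reward: nothing from the detectors unless the path arrives."""
    if not reached:
        return 0.0
    margin_v2 = human_margin(score_v2, V2_THRESHOLD)
    margin_v1 = human_margin(score_v1, V1_THRESHOLD)
    binding = min(margin_v2, margin_v1)  # the worse of the two critics
    reward = REACH_BONUS + detector_weight * binding
    if margin_v2 > 0.0 and margin_v1 > 0.0:
        reward += REALISM_BONUS  # both critics read the path as human
    return reward
```

Taking the minimum of the two margins is what makes the worse critic binding: improving the score on one detector earns nothing until the other catches up.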
Six A→B paths taken straight from the policy, run deterministically, show the cursor accelerating through the middle and easing into the target: the slow-down on approach that a hand makes, rather than a constant-velocity glide. Median straightness across them is about 0.99, over 12 to 26 points, at roughly 38 ms a step. Per-step timing sits nearly flat in the mid-30s of milliseconds. The policy found that steady micro-timing, not jitter, is what the frozen detectors read as human.
This solution is a sharp, deterministic needle. The policy's mean action is what lands in the human band. Sample it stochastically, or add your own noise on top, and realism falls apart. Randomness is exactly what naive automation reaches for, and here it is the tell. The win is also narrow. It is an adversarial exploit of one frozen detector, not certified-human motion.
Scroll and typing
The scroll policy emits (dy, log1p(dt)) per wheel tick. Timing is generated in log space and mapped back, so one policy covers everything from sub-10 ms bursts to second-long settle pauses. It samples its scroll tasks, the length, net distance, and reversals, from a bank of real human scrolls. Each generated window eases in and out instead of ramping in a straight line, and direction reversals land at longer gaps, a person pausing before they correct.
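The log-space timing trick is easy to show in isolation. This sketch assumes dt is measured in milliseconds and that the policy's second output is log1p(dt); the helper names are hypothetical.

```python
import math
from typing import List, Tuple


def encode_dt(dt_ms: float) -> float:
    """Compress a gap in milliseconds into log space."""
    return math.log1p(dt_ms)


def decode_dt(z: float) -> float:
    """Map a log-space output back to milliseconds, never below zero."""
    return max(0.0, math.expm1(z))


def decode_window(ticks: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Convert (dy, log1p(dt)) ticks into (dy, dt_ms) ticks."""
    return [(dy, decode_dt(z)) for dy, z in ticks]


# An 8 ms burst gap and a 1000 ms settle pause end up under five units apart.
burst, settle = encode_dt(8.0), encode_dt(1000.0)
```

In raw milliseconds those two gaps differ by a factor of 125. In log space they are about 2.2 and 6.9, a range a single bounded output can cover comfortably.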
The typing policy runs on a 33-dimensional observation whose tail is a one-hot of the current and next-two key categories (letter, digit, space, edit), and emits a (hold, flight) pair per key. Hold times stay near 150 ms, but the flight gaps carry the signal.
| Keystroke timing | Value |
|---|---|
| Hold time | ~150 ms |
| Intra-word flight | 250 to 350 ms |
| Word-boundary flight | ~850 ms |
The policy learned to pause between words, a rhythm a fixed inter-key delay never produces. That is the whole idea. Swap hand-authored geometry for a learned policy per channel, grade it against a real detector, and let it find the parts of human motion a person writing mouse.move() would never think to encode: the slow-down on approach, the pause before a correction, the beat between words.
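As an illustration of the lookahead tail in the typing observation, the sketch below one-hot encodes the category of the current key and the next two. With four categories, that is 12 values under this reading. The category rules are assumptions: anything that is not a letter, digit, or space is treated as "edit" here, and positions past the end of the text are padded with zeros.

```python
from typing import List, Sequence

CATEGORIES = ("letter", "digit", "space", "edit")


def key_category(key: str) -> str:
    """Bucket a key into one of the four categories (rules are assumed)."""
    if key == " ":
        return "space"
    if len(key) == 1 and key.isdigit():
        return "digit"
    if len(key) == 1 and key.isalpha():
        return "letter"
    return "edit"  # e.g. "Backspace"; in this sketch punctuation lands here too


def one_hot(category: str) -> List[float]:
    return [1.0 if c == category else 0.0 for c in CATEGORIES]


def lookahead_tail(keys: Sequence[str], i: int) -> List[float]:
    """One-hot categories for keys i, i+1, i+2, zero-padded past the end."""
    tail: List[float] = []
    for j in (i, i + 1, i + 2):
        if j < len(keys):
            tail.extend(one_hot(key_category(keys[j])))
        else:
            tail.extend([0.0] * len(CATEGORIES))
    return tail


# The tail for the first key of "a 1": letter, then space, then digit.
tail = lookahead_tail(list("a 1"), 0)
```

Seeing that a space is coming up is what lets a policy lengthen the flight gap at a word boundary instead of applying one delay to every key.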
Where these paths land
One way to see what the policy bought us: take a pile of generated paths, from the naive humanizers and from human_nav, boil each down to ten kinematic features, and project the whole thing into three dimensions with PCA. The straight-jitter and bezier families sit in their own tight blobs, because there are only so many ways to draw a smooth-looking curve. The policy spreads across a wider, messier region, closer to how real hand motion scatters, which is the property a fixed curve cannot produce.
Read this honestly. The projection covers 440 paths from four synthetic generators, with 77% of the variance held in three axes. It compares synthetic generators against each other, not against real human captures. So the finding is "the policy occupies a different, wider region than off-the-shelf humanizers," not "the policy is indistinguishable from a human."
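The "variance held in three axes" figure is the explained variance ratio of the top three principal components. A dependency-free sketch of that calculation follows, using power iteration with deflation on the covariance matrix. In practice a numerical library would do this; the toy data below has four made-up columns, not the ten kinematic features, and all names are hypothetical.

```python
import math
from typing import List


def covariance(rows: List[List[float]]) -> List[List[float]]:
    """Sample covariance matrix of the columns of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def top_eigenvalues(matrix: List[List[float]], k: int, iters: int = 500) -> List[float]:
    """Largest k eigenvalues of a small symmetric PSD matrix."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    values = []
    for _ in range(k):
        # Non-uniform start so it is unlikely to be orthogonal to the target.
        v = [1.0 + 0.1 * i for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(m[a][b] * v[b] for b in range(d)) for a in range(d))
        values.append(max(lam, 0.0))
        # Deflate: remove the component just found.
        for a in range(d):
            for b in range(d):
                m[a][b] -= lam * v[a] * v[b]
    return values


def explained_variance_ratio(rows: List[List[float]], k: int = 3) -> float:
    """Share of total variance captured by the top k principal components."""
    cov = covariance(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    return sum(top_eigenvalues(cov, k)) / total if total > 0 else 0.0


# Toy data: three uncorrelated columns with variances in ratio 9:4:1, one constant.
toy_rows = [
    [3.0 * (-1) ** i, 2.0 * (-1) ** (i // 2), 1.0 * (-1) ** (i // 4), 0.0]
    for i in range(8)
]
ratio_3 = explained_variance_ratio(toy_rows, k=3)
```

A ratio of 0.77 means the three plotted axes drop roughly a quarter of the variance, which is one more reason to treat the picture as a summary and not a proof.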
How close it actually gets
Here is where honesty matters more than the headline. Against the frozen detectors the policies were trained on, they win. That is what "trained against them" means, and on its own it proves almost nothing. The real test is a detector the policy has never seen, and ideally one that has drifted since. There the picture is mixed, and the mix is the finding.
| Channel | Against the frozen detector | Against a live, drifted detector |
|---|---|---|
| Cursor | In the human band | Can land above a live threshold. Not certified human. |
| Scroll | In the human band | Separable at roughly AUC 0.77. Realistic, not invisible. |
| Typing | In the human band | Pause-rate and speed can fall outside the human range. |
Read that table the way a defender should. A learned policy is a big step up from a bezier curve. It closes most of the geometry gap that off-the-shelf humanizers never touch. But "plausibly human against the detector I trained on" is not "human." The moment the detector on the other side is one it never studied, or one that has moved since training, the residual signal comes back. Motion realism decays under drift, and drift is the one thing a defender fully controls.
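For anyone unsure how to read a figure like AUC 0.77: it is the probability that a randomly chosen synthetic sample scores higher than a randomly chosen human one. A value of 0.5 means the detector cannot tell them apart, and 1.0 means it always can. A minimal pairwise sketch, assuming higher scores mean "more bot-like" and using made-up scores:

```python
from typing import Sequence


def auc(human_scores: Sequence[float], synthetic_scores: Sequence[float]) -> float:
    """P(synthetic score > human score) over all pairs; ties count as half."""
    wins = 0.0
    for s in synthetic_scores:
        for h in human_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return wins / (len(human_scores) * len(synthetic_scores))


# Made-up scores, only to show the two extremes.
perfectly_separable = auc([0.1, 0.2], [0.8, 0.9])
indistinguishable = auc([0.1, 0.9], [0.1, 0.9])
```

The pairwise loop is quadratic in the number of samples. It is fine for illustration, but a rank-based implementation is the usual choice at scale.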
What this means if you run behavioral detection
A few things follow, and they are the reason we build red-team tools at all.
First, do not ship a behavioral check and freeze it. Any policy trained against a static detector will eventually match it. The most effective defense in that table is the right-hand column: recapture and retrain on a schedule. That is not upkeep, it is the actual mechanism. A detector that moves faster than an attacker can retrain is one they never converge on.
Second, score coherence, not channels in isolation. A policy that nails cursor motion and a policy that nails typing are still two separate samplers. The correlations a real person produces between moving, scrolling, and typing are much harder to fake than any single channel, because nobody trained a policy on the joint distribution. That seam is where an ensemble of good single-channel fakes comes apart.
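One simple way to picture a coherence signal is the correlation between per-window activity in different channels. The sketch below is purely illustrative and is not how any production detector is described here: the window features (cursor travel, scroll distance, key count) and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cross_channel_coherence(
    cursor_px: Sequence[float], scroll_px: Sequence[float], key_count: Sequence[float]
) -> Dict[str, float]:
    """Pairwise correlations between per-window activity in each channel."""
    return {
        "cursor_scroll": pearson(cursor_px, scroll_px),
        "cursor_keys": pearson(cursor_px, key_count),
        "scroll_keys": pearson(scroll_px, key_count),
    }
```

Three samplers that never see each other's output have no reason to reproduce whatever cross-channel structure real sessions carry, which is the seam this kind of feature probes.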
Third, keep behavioral as one layer, not the layer. Behavioral motion is powerful, and it is also the layer most exposed to a determined humanizer. Pair it with fingerprint, network, and TLS signals, the way cside's stack decides is_bot as a combined call instead of trusting any one model, and an operator has to beat every layer at once, not only the one they sank a learned policy into. That is the case for a cascade, and it is why bot detection holds when a single check would fold.
Responsible disclosure. This writeup covers the technique and its limits at the level of outcomes. It leaves out the detector internals, the exact thresholds, the per-channel feature definitions, and any tuning procedure, the same line our other public posts hold. The tool is not distributed. Anything here that touches an outside platform was shared with that platform before publication.
The point of building the attacker
It is easy to read human_nav as an evasion tool that happens to live in a detection shop. We built it for the opposite reason. The only way to know whether a behavioral defense is real security or a speed bump is to build the best attacker you can and measure exactly where it stops working. The answer here is useful either way. Learned motion beats geometry, and it still does not beat a detector that keeps moving and reads more than one channel at once. That is an uncomfortable result if you are selling "humanlike" automation, and a reassuring one if your job is keeping the bots out.
cside shows you every script and session touching your site, including the automation that has learned to move like a hand. See what's actually running in your users' browsers.





