At cside, we place a strong emphasis on memory safety and performance. All of our core services are written in Rust, including the Edge service.
The Edge sits on a sensitive boundary. It has to be fast, safe, and resilient, because it is part of the path where we collect high-quality signals for products like VPN detection and bot detection.
For many web applications, the simplest TLS setup is to let a cloud load balancer terminate TLS and forward plain HTTP to the application. That is a good default when the application only needs the final HTTP request.
The Edge has a different job.
Some detection signals exist before the request becomes ordinary HTTP, at Layer 4 (the transport layer) rather than Layer 7. If TLS is fully terminated before traffic reaches the Edge, those signals are no longer available in the same form. So for this part of the platform, the Edge has to operate at Layer 4 and manage the TLS path directly.
That gives cside better signal quality for detection, but it also means the Edge needs to handle real-world TLS behavior itself.
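As a rough illustration of what is visible before TLS termination, the sketch below classifies the first bytes a client sends by looking at the TLS record header and the handshake message type. The post does not say which signals the Edge extracts, so `FirstBytes` and `classify_first_bytes` are hypothetical names. Only the byte layout (content type 0x16, legacy version, two-byte length, then handshake type 0x01 for a ClientHello) is taken from the TLS record format.

```rust
/// Outcome of inspecting the first bytes received on a new connection.
/// Hypothetical type for illustration; not the Edge's real representation.
#[derive(Debug, PartialEq)]
enum FirstBytes {
    /// A TLS handshake record whose first message is a ClientHello.
    ClientHello { record_len: u16 },
    /// Too few bytes to decide yet; the caller should keep reading.
    NeedMoreData,
    /// Something other than a TLS ClientHello, for example plain HTTP.
    NotTls,
}

fn classify_first_bytes(buf: &[u8]) -> FirstBytes {
    const HANDSHAKE_RECORD: u8 = 0x16; // record content type "handshake"
    const TLS_MAJOR_VERSION: u8 = 0x03; // major byte of the legacy record version
    const CLIENT_HELLO: u8 = 0x01; // handshake message type
    const RECORD_HEADER_LEN: usize = 5; // type (1) + version (2) + length (2)

    // Check each byte as soon as it is present so non-TLS traffic is rejected early.
    if let Some(&content_type) = buf.first() {
        if content_type != HANDSHAKE_RECORD {
            return FirstBytes::NotTls;
        }
    }
    if let Some(&major) = buf.get(1) {
        if major != TLS_MAJOR_VERSION {
            return FirstBytes::NotTls;
        }
    }
    // The handshake message type is the first byte after the record header.
    if buf.len() <= RECORD_HEADER_LEN {
        return FirstBytes::NeedMoreData;
    }
    if buf[RECORD_HEADER_LEN] != CLIENT_HELLO {
        return FirstBytes::NotTls;
    }
    FirstBytes::ClientHello {
        record_len: u16::from_be_bytes([buf[3], buf[4]]),
    }
}
```

None of these bytes survive as-is once a load balancer has terminated TLS and forwarded plain HTTP, which is the point the paragraph above makes.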
This post is about one small reliability fix in that path: moving TLS handshake work out of the Axum accept path and into bounded Tokio tasks.
What started failing
The patch itself was not large, but understanding the failure took some digging.
The Edge was healthy. Certificates were loading. The port was reachable. Most traffic behaved normally.
But under heavier load, some HTTPS checks and client connections appeared to hang or time out.
At first, that can look like a TLS problem. In practice, the important pattern was more specific: some clients opened a connection but did not complete the TLS handshake.
That is normal on an internet-facing service. Public endpoints see incomplete connections all the time:
client connects, then sends nothing
client starts TLS, then disappears
client starts TLS, then stalls
Those failures were not the surprising part.
The surprising part was how much impact one incomplete handshake could have on nearby healthy traffic.
Before the fix
Before the fix, one part of the Edge TLS path did too much work in a single step.
Conceptually, it behaved like this:
accept one connection
finish TLS work for that connection
then accept the next connection
In simplified Rust, the shape was roughly this:
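impl axum::serve::Listener for EdgeTlsListener {
    // Simplified: associated types, local_addr(), and error handling are omitted.
    async fn accept(&mut self) -> (TlsStream, SocketAddr) {
        let (mut tcp_stream, addr) = self.tcp_listener.accept().await;
        // Both awaits below depend on the client; a stalled client stalls accept().
        let hello = read_tls_hello(&mut tcp_stream).await;
        let tls_stream = complete_tls_handshake(hello, tcp_stream).await;
        (tls_stream, addr)
    }
}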
That code is easy to reason about, but it puts the whole TLS handshake inside accept().
For Axum, accept() is the front door. While it is awaiting one connection's handshake, the server cannot receive the next connection from that listener.
The result is head-of-line blocking.
If one connection started TLS and then stalled, the Edge waited for that connection's timeout before moving on. During that wait, healthy connections could be delayed behind it.
The problem was not:
TLS cannot work
It was:
one incomplete TLS handshake can delay later healthy handshakes
That distinction mattered.
Increasing the timeout would not have fixed the issue. It would have made the slow path hold the line for longer.
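To make the arithmetic concrete, here is a small model of the two behaviors. It is a sketch under simplifying assumptions, not a measurement of the Edge: all connections are assumed to arrive at the same instant, and the function names, the 5 ms handshake cost, and the 10-second timeout are illustrative values that do not come from the real service.

```rust
/// Cost of one handshake attempt in milliseconds. A stalled client (`None`)
/// never finishes on its own, so it costs the full timeout.
fn handshake_cost(actual_ms: Option<u64>, timeout_ms: u64) -> u64 {
    actual_ms.map_or(timeout_ms, |ms| ms.min(timeout_ms))
}

/// Finish time of each connection when handshakes run one at a time inside
/// accept(): every connection waits for all earlier ones.
fn serial_finish_times(costs_ms: &[u64]) -> Vec<u64> {
    let mut clock = 0u64;
    costs_ms
        .iter()
        .map(|cost| {
            clock += cost;
            clock
        })
        .collect()
}

/// Finish time of each connection when every handshake runs independently:
/// each one pays only its own cost.
fn independent_finish_times(costs_ms: &[u64]) -> Vec<u64> {
    costs_ms.to_vec()
}
```

With the sequence healthy, stalled, healthy, healthy and a 10,000 ms timeout, the serial model finishes the connections at 5, 10,005, 10,010, and 10,015 ms, while the independent model finishes them at 5, 10,000, 5, and 5 ms. Doubling the timeout to 20,000 ms pushes the two healthy connections behind the stalled one out to 20,010 and 20,015 ms in the serial model, which is why a longer timeout makes the problem worse rather than better.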
The fix
The fix was to separate accepting a connection from completing its TLS handshake.
The Edge now accepts new connections quickly and handles each TLS handshake independently. A stalled or incomplete handshake can still time out, but it does not block later healthy connections from making progress.
Conceptually, the new flow looks like this:
accept connections quickly
handle each TLS handshake independently
return only completed secure connections to request handling
The new shape uses Tokio tasks plus a channel of completed secure connections:
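// Bounded channel that carries only connections whose handshake completed.
let (ready_tx, ready_rx) = tokio::sync::mpsc::channel(limit);
tokio::spawn(async move {
    loop {
        // Simplified: accept() errors are not handled here.
        let (tcp_stream, addr) = tcp_listener.accept().await;
        let ready_tx = ready_tx.clone();
        // One task per connection, so a stalled handshake blocks only itself.
        tokio::spawn(async move {
            if let Some(tls_stream) = finish_tls(tcp_stream).await {
                let _ = ready_tx.send((tls_stream, addr)).await;
            }
        });
    }
});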
Then the Axum-facing listener becomes much smaller:
impl axum::serve::Listener for EdgeTlsListener {
    async fn accept(&mut self) -> (TlsStream, SocketAddr) {
        self.ready_rx
            .recv()
            .await
            .expect("TLS accept loop terminated")
    }
}
The important part is the boundary: accept() no longer performs the slow handshake work itself. It receives connections whose handshakes have already completed.
So the failure mode changed from this:
one stalled handshake
-> delays the next connection
to this:
one stalled handshake
-> times out independently
-> healthy connections continue
That is the important reliability improvement.
The fix did not remove timeouts. Timeouts are still necessary. An incomplete handshake should not live forever.
The fix changed where the timeout is paid. A bad connection now pays its own timeout instead of making other connections pay for it.
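The isolation property can be reproduced without any network code. The sketch below is a stand-in, not the Edge implementation: it uses standard-library threads and `std::sync::mpsc` in place of Tokio tasks and `tokio::sync::mpsc`, and it simulates handshakes with sleeps. `finish_handshake` and `ready_order` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Simulated handshake. `Some(delay)` completes after `delay`; `None` models a
/// client that never finishes and is abandoned once `timeout` has elapsed.
fn finish_handshake(id: u32, behavior: Option<Duration>, timeout: Duration) -> Option<u32> {
    match behavior {
        Some(delay) if delay <= timeout => {
            thread::sleep(delay);
            Some(id)
        }
        _ => {
            // The stalled connection pays its own timeout on its own thread.
            thread::sleep(timeout);
            None
        }
    }
}

/// Accepts every connection immediately, runs each handshake on its own
/// thread, and returns the ids of completed handshakes in the order they
/// reached the ready channel.
fn ready_order(conns: Vec<(u32, Option<Duration>)>, timeout: Duration) -> Vec<u32> {
    let (ready_tx, ready_rx) = mpsc::channel();
    for (id, behavior) in conns {
        let ready_tx = ready_tx.clone();
        thread::spawn(move || {
            if let Some(done) = finish_handshake(id, behavior, timeout) {
                // Only completed handshakes are handed to request handling.
                let _ = ready_tx.send(done);
            }
        });
    }
    // Drop the original sender so the receiver ends once every handshake
    // thread has finished and dropped its clone.
    drop(ready_tx);
    ready_rx.iter().collect()
}
```

If connection 1 stalls and connections 2 and 3 are healthy, `ready_order` yields 2 and 3 (in either order) even though the stalled connection was accepted first. The stalled connection simply never appears on the ready channel.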
Keeping it bounded
There is a second important part of the fix.
If every new connection can create unlimited work, then the service becomes responsive but not safe under pressure. So the Edge also bounds the number of TLS handshakes that can be in flight at the same time.
The simplified version looks like this:
let permits = Arc::new(tokio::sync::Semaphore::new(limit));
loop {
    let (tcp_stream, addr) = tcp_listener.accept().await;
    // Wait for a free permit before spawning, so at most `limit` handshakes run at once.
    let permit = permits
        .clone()
        .acquire_owned()
        .await
        .expect("TLS semaphore closed");
    let ready_tx = ready_tx.clone();
    tokio::spawn(async move {
        let _permit = permit; // released when this task ends
        if let Some(tls_stream) = finish_tls(tcp_stream).await {
            let _ = ready_tx.send((tls_stream, addr)).await;
        }
    });
}
The permit is owned by the task. When the task finishes, whether the handshake succeeded, failed, or timed out, Rust drops the permit and returns capacity to the semaphore. That keeps the concurrency bound tied to the lifetime of the actual handshake work.
That gives us both properties we wanted:
slow handshakes do not block healthy handshakes
and:
slow handshakes cannot create unbounded work
This is the kind of tradeoff we care about in the Edge: improve reliability without giving up predictable resource usage.
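The permit-per-task pattern can be checked in isolation. The sketch below is an assumption-laden stand-in: the standard library has no semaphore, so `Permits` and `Permit` are small hypothetical types built from a `Mutex` and a `Condvar` that mimic the acquire-then-drop behavior of `tokio::sync::Semaphore` and its owned permit. Threads and sleeps replace Tokio tasks and real handshakes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (illustrative stand-in for tokio's Semaphore).
struct Permits {
    available: Mutex<usize>,
    freed: Condvar,
}

/// Owned permit: dropping it returns capacity to the semaphore.
struct Permit(Arc<Permits>);

impl Permits {
    fn new(limit: usize) -> Arc<Self> {
        Arc::new(Permits { available: Mutex::new(limit), freed: Condvar::new() })
    }

    /// Blocks the caller until a permit is free, then takes it.
    fn acquire_owned(self: &Arc<Self>) -> Permit {
        let mut available = self.available.lock().unwrap();
        while *available == 0 {
            available = self.freed.wait(available).unwrap();
        }
        *available -= 1;
        Permit(Arc::clone(self))
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.0.available.lock().unwrap() += 1;
        self.0.freed.notify_one();
    }
}

/// Runs `connections` simulated handshakes with at most `limit` in flight and
/// returns the highest number that were observed running at the same time.
fn max_in_flight(connections: usize, limit: usize) -> usize {
    let permits = Permits::new(limit);
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut workers = Vec::new();
    for _ in 0..connections {
        // The accept loop waits here when `limit` handshakes are already running.
        let permit = permits.acquire_owned();
        let (in_flight, peak) = (Arc::clone(&in_flight), Arc::clone(&peak));
        workers.push(thread::spawn(move || {
            let _permit = permit; // held until this handshake finishes
            let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(5)); // simulated handshake work
            in_flight.fetch_sub(1, Ordering::SeqCst);
        }));
    }
    for worker in workers {
        worker.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Calling `max_in_flight(20, 3)` never reports more than 3, because a worker only starts after the loop has taken a permit and only gives it back when the worker's closure ends. One design consequence worth noting: because the permit is acquired before spawning, a full semaphore pauses the accept loop itself, so excess connections wait in the listener's backlog instead of piling up as tasks.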
Making failure visible
We also tightened an internal failure path.
If the part of the Edge responsible for accepting secure connections ever stops unexpectedly, the service should not silently wait forever. Silent hangs are hard to operate and hard to reason about.
The channel makes this state explicit. If all senders are gone and no completed connections remain buffered, recv() returns None. That should not be treated like normal idle time, so the final accept() body replaces the earlier .expect() with an explicit, logged failure:
self.ready_rx.recv().await.unwrap_or_else(|| {
    tracing::error!("TLS accept loop terminated");
    panic!("TLS accept loop terminated")
})
The failure path now becomes visible immediately instead of turning into a hidden wait.
That does not change normal customer traffic, but it makes the system easier to trust during incidents.
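The difference between "idle" and "producer gone" can be shown with the standard-library channel, which behaves analogously: `std::sync::mpsc` reports a disconnected channel as an error where Tokio's `recv()` returns `None`. The sketch below is illustrative only; `AcceptState` and `poll_ready` are hypothetical names, and the `u32` payload stands in for a completed secure connection.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// The three states a consumer of the ready channel can observe.
#[derive(Debug, PartialEq)]
enum AcceptState {
    /// A completed connection is available.
    Ready(u32),
    /// Nothing is ready yet, but the accept loop is still alive.
    Idle,
    /// Every sender has been dropped: the accept loop has stopped.
    LoopTerminated,
}

/// Waits up to `wait` for a completed connection and reports which of the
/// three states applies, instead of treating a dead producer as idle time.
fn poll_ready(ready_rx: &Receiver<u32>, wait: Duration) -> AcceptState {
    match ready_rx.recv_timeout(wait) {
        Ok(conn) => AcceptState::Ready(conn),
        Err(RecvTimeoutError::Timeout) => AcceptState::Idle,
        Err(RecvTimeoutError::Disconnected) => AcceptState::LoopTerminated,
    }
}
```

While a sender exists and nothing has been sent, `poll_ready` reports `Idle`. After the last sender is dropped and the buffer is drained, it reports `LoopTerminated` immediately rather than waiting, which is the signal a service can log and act on.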
What we learned
The main lesson is that timeouts are not enough if the timeout is paid in the wrong place.
A timeout around TLS work sounds reasonable. But if one slow connection can make unrelated connections wait behind it, the timeout becomes shared pain.
The better model is:
accept quickly
isolate slow work
bound concurrency
make unexpected failure visible
Another lesson is that internet-facing services should treat incomplete connections as normal. Clients disconnect. Health checks retry. Networks flap. Some handshakes never finish.
The Edge should not assume the internet is tidy.
Before the fix:
one incomplete handshake
-> nearby healthy traffic can wait
After the fix:
one incomplete handshake
-> isolated timeout
-> healthy traffic continues
The final patch did not make bad connections disappear.
It made the Edge handle them in the right place, with the right bounds, while keeping the performance and memory safety guarantees we expect from our Rust services.







