Rust async on single vs. multi-core machines
Berend De Schouwer
Who Is This Post For?
Rust programmers tracking down strange behaviour that doesn’t always show up in debuggers or tracers.
What Was Rust Used For?
I wanted to connect browsers to terminal programs. Think running a terminal in a browser like hterm or ajaxterm.
One side of the websocket runs a program that may at any time send data. In between there are pauses that stretch from milliseconds to hours.
The other side is the same.
This is a perfect fit for asynchronous programming. It’s also a candidate for memory leaks over time.
Both problems were tackled using Rust.
Problem Experienced
Sometimes the Rust program would stop. The program would still run in the background, running epoll(7), indicating that an async wait was running.
The program would not crash, and would not run away on the CPU.
The last statement executed:
debug!("This runs");
Err("This does not run!")
}
Which is strange, to say the least.
This would only happen on single-core machines. On machines with two or more cores, it would run fine.
This would happen on multiple target architectures and multiple OSes.
It would go into an infinite loop on Err(“…”) on single-core machines.
More About the Program
The program runs two concurrent asynchronous tasks, and waits for either task to stop.
It does that because the network side or the terminal side could stop and close the connection. So it basically runs:
let tty_to_websocket = task::spawn(tty_to_websocket());
let websocket_to_tty = task::spawn(websocket_to_tty());
try_join!(tty_to_websocket, websocket_to_tty);
try_join! should wait for either task to stop with an error.
I’ve set up both tasks to return an error even on successful completion. This is because join! might wait for both to stop, and it’s possible for either side to stop without the other noticing.
try_join! never completes, because Err() never completes, which is strange.
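To make the intended behaviour concrete, here is a minimal, std-only sketch of the "return on the first error" semantics. It is not the real try_join! macro: try_join_sketch and NoopWake are hypothetical names, and the loop busy-polls instead of sleeping in epoll the way a real runtime would.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; good enough for a busy-polling demo.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for `try_join!`: polls both futures in turn and
/// returns the first `Err`, or `Ok(())` once both have finished with `Ok`.
fn try_join_sketch<A, B>(a: A, b: B) -> Result<(), String>
where
    A: Future<Output = Result<(), String>>,
    B: Future<Output = Result<(), String>>,
{
    let mut a = Box::pin(a);
    let mut b = Box::pin(b);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let (mut a_done, mut b_done) = (false, false);

    loop {
        if !a_done {
            if let Poll::Ready(result) = a.as_mut().poll(&mut cx) {
                result?; // an error from either side ends the wait at once
                a_done = true;
            }
        }
        if !b_done {
            if let Poll::Ready(result) = b.as_mut().poll(&mut cx) {
                result?;
                b_done = true;
            }
        }
        if a_done && b_done {
            return Ok(());
        }
    }
}
```

If one side returned Ok while the other stayed pending, this loop would never return. That is why both tasks report an error even when they finish normally.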
What Does Err() Do?
Err() ends the function. In Rust that also runs the destructors, like an object-oriented program might. Let’s say you have a function like:
fn error() -> Result<(), &'static str> {
    let number: i32 = 2;
    Err("error")
}
When Err() runs, Rust de-allocates number. The memory is returned.
For an int this is simple, but it can be more complicated.
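A more complicated case is a value with its own destructor. The sketch below uses a hypothetical Resource type (not from the original program) that records when its Drop implementation runs, to show that returning Err still runs the destructor of every local.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical resource that records when its destructor runs.
struct Resource {
    name: &'static str,
    log: Rc<RefCell<Vec<String>>>,
}

impl Drop for Resource {
    fn drop(&mut self) {
        self.log.borrow_mut().push(format!("dropped {}", self.name));
    }
}

/// Always fails, like the tasks in this post.
fn fails(log: Rc<RefCell<Vec<String>>>) -> Result<(), &'static str> {
    let _tty = Resource { name: "tty", log: Rc::clone(&log) };
    log.borrow_mut().push("returning Err".to_string());
    Err("error")
    // `_tty` goes out of scope here, so its `drop` runs as the function returns.
}

/// Runs `fails` and returns the recorded events in order.
fn run_demo() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let result = fails(Rc::clone(&log));
    assert!(result.is_err());
    let events = log.borrow().clone();
    events
}
```

Calling run_demo() yields the events "returning Err" followed by "dropped tty". For a std::fs::File, that destructor is where the file descriptor gets closed.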
What Is A TTY?
A TTY, for a program, is a file descriptor on a character device. It’s a bi-directional stream of data, typically keyboard in and text out.
The C API to use one uses a file descriptor. One way to get such a file descriptor is forkpty(3), which has a Rust crate.
Most Rust code wants a std::fs::File, not a raw file descriptor, so it needs to be converted:
use std::os::unix::io::FromRawFd;
let rust_fd = unsafe { std::fs::File::from_raw_fd(c_fd) };
The first bell is unsafe {}. The code is indeed unsafe because we’re working with a raw file descriptor.
The second bell is in the documentation for from_raw_fd. The documentation is in bold in the original:
This function consumes ownership of the specified file
descriptor. The returned object will take responsibility
for closing it when the object goes out of scope.
Where Is The Bug?
The bug happens because both tasks need a std::fs::File. One to read the TTY, and one to write to it.
Both tasks consume ownership, and both tasks take responsibility for closing it.
Both destroy the rust_fd and hence close the c_fd, when the tasks run Err().
Expected Bug
The expected bug is that the second task to close won’t be able to close. The second task should get EBADF (bad file descriptor).
However, this is not the bug experienced.
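For reference, the expected failure can be reproduced in isolation. This is a minimal Unix-only sketch, assuming /dev/null as a stand-in for the TTY; write_after_double_wrap is a hypothetical name. It deliberately breaks the ownership rule of from_raw_fd, so treat it as a demonstration only.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::mem;
use std::os::unix::io::{FromRawFd, IntoRawFd};

/// Wraps one raw fd in two `File`s, drops the first, then writes through
/// the second. Returns the OS error code of the failed write, if any.
fn write_after_double_wrap() -> std::io::Result<Option<i32>> {
    // Assumption: /dev/null stands in for the TTY; any writable file would do.
    let c_fd = OpenOptions::new().write(true).open("/dev/null")?.into_raw_fd();

    // NOT sound: two owners for one descriptor, mirroring the bug on purpose.
    let first = unsafe { File::from_raw_fd(c_fd) };
    let mut second = unsafe { File::from_raw_fd(c_fd) };

    drop(first); // the destructor closes c_fd

    // The descriptor is gone, so this write is expected to fail.
    let code = match second.write(b"x") {
        Ok(_) => None,
        Err(e) => e.raw_os_error(),
    };

    mem::forget(second); // avoid a second close() on the same number
    Ok(code)
}
```

In a process with no other threads opening files, the write fails and the returned code is EBADF (9 on Linux). If another thread opens a file in between, the fd number can be reused and the write may land somewhere else entirely.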
Experienced Bug
The experienced bug is that on single core machines the program just stops, and keeps calling epoll(), which is something Rust does at a low level for async functions.
This makes it harder to debug, since there is no panic!, no crash.
Real Bug
The real bug is that on machines with two or more cores, the program continues fine. It should not continue.
On two or more cores, it should behave the same as on single core machines.
Solution
The solution is to skip running the destructor.
When it’s just a file descriptor, it can be enough to run
mem::forget(rust_fd);
That stops the hang. We now need to take responsibility for closing the descriptor ourselves and run
try_join!(tty_to_websocket, websocket_to_tty);
close(c_fd);
to prevent leaking file descriptors.
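Put together, the pattern might look like the following std-only sketch. The names write_without_closing and shared_fd_demo are hypothetical, /dev/null again stands in for the TTY, and the final close is done by re-adopting the fd into a File and dropping it, as a stand-in for calling close(c_fd) directly.

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::mem;
use std::os::unix::io::{FromRawFd, IntoRawFd, RawFd};

/// Writes through a temporary `File` view of `c_fd` without closing it.
fn write_without_closing(c_fd: RawFd, data: &[u8]) -> std::io::Result<()> {
    // Assumption: c_fd is open and the caller closes it exactly once.
    let mut rust_fd = unsafe { File::from_raw_fd(c_fd) };
    let result = rust_fd.write_all(data);
    mem::forget(rust_fd); // skip the destructor so c_fd stays open
    result
}

/// Two users share one fd; the fd is closed once, at the end.
fn shared_fd_demo() -> std::io::Result<()> {
    // Assumption: /dev/null stands in for the TTY.
    let c_fd = OpenOptions::new().write(true).open("/dev/null")?.into_raw_fd();

    let first = write_without_closing(c_fd, b"from task one");
    let second = write_without_closing(c_fd, b"from task two");

    // Re-adopt the fd so that dropping this File performs the single close().
    drop(unsafe { File::from_raw_fd(c_fd) });

    first.and(second)
}
```

The results are collected before the close so that an early return cannot leak the descriptor. std::mem::ManuallyDrop would be another way to suppress the destructor.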
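If you wrap the fd in a buffered reader, don’t forget to de-allocate the buffer. Calling mem::forget on the reader itself would leak its buffer as well, so (assuming reader is a BufReader) unwrap the file first and forget only that:
let file = reader.into_inner();
mem::forget(file);
to prevent a memory leak.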