May 26, 2025
Fixing two copy-on-write bugs in the NBD server.
What is What?
NBD is a network block device protocol. It has some overlap with iSCSI,
and a little with NFS. The protocol is much simpler than either, and has
one extra feature: copy-on-write.
Copy-on-write allows sharing the same file with multiple machines, and
only writing changes back to disk. This can save a lot of storage space
in certain situations.
There are two bugs, and two one-line fixes presented here.
The Bugs
Sequential Read after Write
For the first case, it’s enough to read and write more than one
sequential block. The second and subsequent blocks read will read into
the wrong offset of the buffer, and copy invalid data to the client.
I use a 4096 block size in this example, but I’ve used others. I did
that to match the filesystem, but for the test I don’t even need a filesystem.
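The tests below assume /dev/nbd0 is already attached to a copy-on-write export. My exact setup isn't shown here, but a minimal sketch could look like this (the export name, backing file path and hostname are examples, and the nbd-client invocation varies a little between versions):
# /etc/nbd-server/config (sketch)
[generic]
[cowtest]
    exportname = /srv/nbd/backing-file
    copyonwrite = true
    # sparse_cow = true    # for the second bug below
# on the client
nbd-client -N cowtest server.example.com /dev/nbd0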
The Test
export OFFSET=0
export COUNT=3 # anything >= 1
dd if=/dev/urandom of=testdata bs=4096 count=$COUNT # random data
dd if=testdata of=/dev/nbd0 bs=4096 seek=$OFFSET count=$COUNT
dd if=/dev/nbd0 of=compdata bs=4096 skip=$OFFSET count=$COUNT
sum testdata compdata
The data, testdata and compdata, will be different.
If the kernel does a partition check when /dev/nbd0 is mounted, this
test will fail with COUNT=1 as well.
Sparse Write at the Wrong Offset
For the second case, with sparse_cow=true, we need to repeat the test
with an offset > 0. expwrite() calls write() instead of pwrite().
The Test
export OFFSET=100
export COUNT=3 # anything >= 1
dd if=/dev/urandom of=testdata bs=4096 count=$COUNT # random data
dd if=testdata of=/dev/nbd0 bs=4096 seek=$OFFSET count=$COUNT
dd if=/dev/nbd0 of=compdata bs=4096 skip=$OFFSET count=$COUNT
sum testdata compdata
The first time it’s run, it will result in an Input/Output error.
The second time it’s run, it will work.
The Patches
diff --git a/nbd-orig/nbd-server.c b/nbd-patched/nbd-server.c
index 92fd141..18e5ddd 100644
--- a/nbd-orig/nbd-server.c
+++ b/nbd-patched/nbd-server.c
@@ -1582,6 +1582,7 @@ int expread(READ_CTX *ctx, CLIENT *client) {
if (pread(client->difffile, buf, rdlen, client->difmap[mapcnt]*DIFFPAGESIZE+offset) != rdlen) {
goto fail;
}
+ ctx->current_offset += rdlen;
confirm_read(client, ctx, rdlen);
} else { /* the block is not there */
if ((client->server->flags & F_WAIT) &&
(client->export == NULL)){
diff --git a/nbd-orig/nbd-server.c b/nbd-patched/nbd-server.c
index 92fd141..9a57ad5 100644
--- a/nbd-orig/nbd-server.c
+++ b/nbd-patched/nbd-server.c
@@ -1669,7 +1669,7 @@ int expwrite(off_t a, char *buf, size_t len,
CLIENT *client, int fua) {
if(ret < 0 ) goto fail;
}
memcpy(pagebuf+offset,buf,wrlen) ;
- if (write(client->difffile, pagebuf, DIFFPAGESIZE) != DIFFPAGESIZE)
+ if (pwrite(client->difffile, pagebuf, DIFFPAGESIZE, client->difmap[mapcnt]*DIFFPAGESIZE) != DIFFPAGESIZE)
goto fail;
}
if (!(client->server->flags & F_COPYONWRITE))
May 25, 2025
A number of programs will not log to syslog, only logging to a file
they control, or to a custom log server. This post describes how to
get one such program to write to syslog.
TL;DR
Setup a FIFO
/etc/systemd/system/journal-pipe-modsecurity.socket
[Unit]
Description=Journal Pipe
Documentation=man:systemd-journald.service(8) man:journald.conf(5)
DefaultDependencies=no
Before=sockets.target
IgnoreOnIsolate=yes
[Socket]
ListenFIFO=/run/systemd/journal/pipes/modsecurity
ReceiveBuffer=8M
Accept=no
Service=journal-pipe-modsecurity.service
SocketMode=0660
Timestamping=us
# Access control for the FIFO
SocketUser=some-user
SocketGroup=some-group
Setup a Service
/etc/systemd/system/journal-pipe-modsecurity.service
[Unit]
Description=Journal Pipe for Modsecurity
After=network.target journal-pipe-modsecurity.socket
Requires=journal-pipe-modsecurity.socket
[Service]
Type=simple
StandardInput=fd:journal-pipe-modsecurity.socket
ExecStart=/usr/bin/systemd-cat --identifier=modsecurity
TimeoutStopSec=5
Group=some-group
PrivateTmp=yes
DynamicUser=yes
ProtectHome=yes
[Install]
WantedBy=default.target
modsecurity.conf
SecAuditLogType Serial
SecAuditLog /run/systemd/journal/pipes/modsecurity
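To put these in place, reload systemd, start the socket, and check that entries arrive. The identifier matches the --identifier passed to systemd-cat above:
systemctl daemon-reload
systemctl enable --now journal-pipe-modsecurity.socket
journalctl -t modsecurity -f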
Problem Description
Some programs will only log to a file, or to a custom logger. There is
no way to configure them to write to syslog, or a standard logger.
Good system administrators want to use syslog, or another standard
logger. In this post, we’re going to get Modsecurity to log to syslog,
which is an often requested feature.
Assumptions
We have an application that can write logs, and we are interested
in those logs.
That application insists on writing logs to a file, or to a custom log
manager. A custom log manager means yet another application installed,
and configured, and storage allocated.
When the application is configured to write logs to a file, the application
must be configured to rotate the logs based on time or size, and must
manage the log size.
What is Syslog?
Syslog is a Unix-y standardised logger. For this post, it’s only important
that it’s standardised across the OS. In fact, we’re actually going to
use journald, and not syslog.
Syslog accepts logs on a number of interfaces.
- A file interface at /dev/log
- A socket interface
- A UDP network interface
Syslog then writes the logs using a timestamp, a hostname, and an
application and PID where possible.
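A quick way to see that format, and to confirm your logger accepts messages, is the standard logger(1) tool (the tag here is arbitrary):
logger -t example "hello, syslog"
journalctl -t example -n 1   # or tail /var/log/syslog on a classic syslog setup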
Various syslog (and journal) servers exist that can
- Limit diskspace for logs
- Send logs to another drive, or another server, or a printer
- Manage logs before the filesystem is read-write, reducing lost logs
- Rotate logs without losing logs
- Throttle logs (unfortunately losing some) to prevent DoS attacks
What is Modsecurity?
Modsecurity is a web application firewall (WAF). It follows rules, and
prevents or allows HTTP requests.
Logically, it sits between the web server and a web application. Usually
the web server that runs Modsecurity acts as a proxy in front of a Java
or PHP web application.
Think of it as an anti-virus or firewall for the web.
Why Do We Want Syslog?
Standardising the logs gives a number of benefits.
Viewing Multiple Logs, Ordered
It’s often useful to view multiple logs, ordered sequentially, to track
bugs or security problems. This is doubly true for a program like
Modsecurity, that hooks into a webserver — usually as a proxy — and
an application server.
When something goes wrong, we can get the HTTP context from the webserver
logs, the application context from the application server logs, and
the security logs from the WAF, all ordered.
We also have timestamps in the same format.
Space Allocation
If the logs are all in a specific system, we can decide to allocate space
on a specific drive, optimised for writes, separate from any database.
Legal Data Retention
We may need to keep certain logs for legal reasons.
Separate Secure Storage
Syslog servers can send their data to a separate server. In the best
cases, to a server without an IP number. This means that any attackers
cannot delete the logs.
That’s a major security help.
Compartementilisation
The syslog server does not have to run as the same user as the application.
That means that when an attacker breaches security, and has access rights
similar to the application, the attacker does not have permissions to
delete the logs.
That’s also a major security help.
Separate Backups
If your application writes and manages your logs, chances are your backup
system has to backup and restore those logs. It’s almost always better
to manage the backups of logs, and the backups of the application data separate.
Pet Peeve
Applications that do their own logging is a pet peeve of mine. There
are almost always bugs.
Logging is harder than you think, and mistakes are common.
For example, Tomcat — in it’s recommended configuration — will still
claim diskspace of long-since deleted logs.
Apr 19, 2025
A description of a firmware bug in external USB storage that
causes disk error reports, and how to avoid them.
Problem Description
When you connect an external USB drive you may see:
sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
sd 0:0:0:0: [sda] tag#0 Sense Key : Illegal Request [current]
sd 0:0:0:0: [sda] tag#0 Add. Sense: Invalid command operation code
sd 0:0:0:0: [sda] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 00 00 22 00 00 00 06 00 00
critical target error, dev sda, sector 34 op 0x3:(DISCARD) flags 0x0 phys_seg 1 prio class 0
critical target error, dev sda, sector 40 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
If the following are true, you have the same problem:
- You see DISCARD or UNMAP
- You see Write same(16), and
- The external storage is a spinning disk
- You delete a lot of data, or reformat the disk
Explanation
These commands are all variants of what is also known as trim,
which tells an SSD to mark a data area as re-usable.
Spinning disks do not support trim, and so reject it.
The kernel is attempting to run the command because the USB enclosure
told the kernel that it supports the commands writesame16 or unmap.
The error is reported because that command fails.
Root Cause
The root cause is that the USB enclosure supports the command in firmware,
and the spinning disk does not. The USB enclosure fails to negotiate
this command with the harddrive before negotiating it with the OS.
This happens when manufacturers use the same USB chips for SSD and
spinning disk drives on the cheap.
Solution
The solution is to tell the kernel that this does not work, for this
device, by using a udev rule, eg.
/etc/udev/rules.d/99-cheap-disk.rules
ACTION=="add", SUBSYSTEM=="scsi_disk", SUBSYSTEMS=="scsi",
ATTRS{vendor}=="WD", ATTRS{model}=="My Passport *",
OPTIONS="log_level=debug",
PROGRAM="/usr/bin/logger -t udev/99-cheap-disk Found cheap disk",
ATTR{provisioning_mode}="disabled"
In my case, the bad hard drive was a “Western Digital My Passport 2626”,
with revision 1034.
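After adding the rule, you can apply it without rebooting and confirm it took effect. The commands below assume a standard udev and sysfs layout:
udevadm control --reload
udevadm trigger --action=add --subsystem-match=scsi_disk
cat /sys/class/scsi_disk/*/provisioning_mode   # should now read "disabled" for the USB disk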
Thanks
For the exact same problem and solution for SAN/NAS, see:
Chris Hofstaedtler
Feb 09, 2025
A deep dive on firmware bugs that prevent Garmin Index S2
scales from connecting to encrypted Wifi networks. I describe
some of the problems, a solution of sorts if you’re a network
administrator, and a guess as to the root cause. I do not,
in this article, reverse engineer the firmware.
Problem Description
Garmin Index S2 scales are notorious for not connecting reliably to
a number of Wifi networks. For example, read the
Garmin Forums. Quite often the scale won't connect, or will connect and not sync,
and its display doesn't make the fault clear, nor is it well documented.
For other frustrations, you can read my previous blog post
Garmin Index Scale Firmware Problems
Official requirements
The official requirements are:
- 2.4 GHz (no 5.0 or 6.0)
- 802.11 ac, b, g or n (no 802.1x)
- Channels 1-11 only
- No hidden SSIDs
- Security: Unencrypted, WPA and WPA2
- Passwords must be at least 8 characters
Some of these requirements are simply due to the age and power of the
embedded controller. It was never going to support 5.0 GHz, for instance.
What we (customers) want
Compliant Wifi
- Encrypted, at least WPA2
- Channels 1-13
A Wifi access point can typically be configured for a specific channel, or
for all channels. No Wifi access points allow you to specify a channel range.
If you are outside the US, this means:
- Exactly one channel, or
- Channels 1-13.
So you are reduced to running one channel. Luckily the scale does actually
connect on channels 1-13. It’s doubtful the chip would be certified in
the countries it’s sold, if it doesn’t. So we will ignore the first problem.
What worked yesterday, to work today
The scale frequently gets stuck, and the usual support response is to
reset the scale.
Resetting the scale is difficult
- It requires tapping a button on the bottom, and simultaneously
viewing the top display.
- Testing requires putting significant weight on top of the scale,
while your finger is still tapping the button on the bottom.
Resetting the scale is unnecessary
Frequently the scale is actually still working. The problem is that
the display isn’t communicating to you, the user, what’s happening.
Quite often it’s busy, and you should simply wait. It does this by
flashing an hourglass icon and then switching the display
off. To the average user, this looks like the scale fails to switch on.
The correct thing to do is wait 5 minutes.
Updating Garmin Documentation
Let's first update the documentation. Let's create a useful Wifi manual
for the Garmin Index S2 Wifi connection status.
There are three icons:
- Wifi (a signal-strength icon)
- Sync (a spinning sync icon)
- Done (a check mark)
Wifi connecting
While the Wifi icon blinks, the scale is connecting to Wifi. If it stops blinking,
it has connected to Wifi, and your WPA2 password works.
As soon as this happens, you no longer need to reset your scale.
Data Syncing
While data is syncing, the sync icon is animating. At this point, the scale is
talking over the network to Garmin servers.
Under certain conditions this can take a long time, and the display
will switch off. It will still be syncing in the background, though.
Note: If the display switches off in this state, it's only power-saving
the display; the scale has not switched off. If you power on the scale,
you will see an hourglass. Wait 5 minutes. Be patient.
Done
It’s done.
Actual Syncing Problems
If the data never syncs, read on.
If the data never syncs, you may have:
- Firewall issues (not if you didn’t have them yesterday)
- ISP issues (ping connect.garmin.com)
- Hit a firmware or controller bug (the rest of this blog post)
Note: At this point the Wifi is connected. The scale found
your Wifi, your SSID, and has negotiated encryption.
Firewall Issues
This is simple.
If you don’t know what a firewall is, you didn’t break it.
If you haven’t modified your router settings —- if your ISP allows it —-
you didn’t break it.
If you have edited your firewall, roll back, try again.
ISP Issues
Do the normal tests. Connect to any other site
and check that Garmin is up.
If these work, your ISP is probably not down.
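Concretely, any generic reachability check will do, for example:
ping -c 3 connect.garmin.com
curl -sI https://connect.garmin.com | head -n 1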
Scale Sync Network Activity
How does the scale sync with Garmin Connect?
I’m glad you asked.
In order, the scale uses the following protocols.
- DHCP
- DNS
- NTP
- HTTP
- HTTPS
These steps are fairly normal and expected. What is so unexpected is
that step 5 fails when all the other steps work, and how it fails.
DHCP
This is how the scale gets an IP, and part of the Wifi negotiation.
It’s fairly standard.
The most notable is that the client identifier is GarminIntern
and the host name is WINC-00-00.
DNS
It then does a DNS lookup, to do an NTP sync. It looks up two machines:
- time.google.com
- time.garmin.com
This is notable because time.garmin.com does not exist.
The scale will do more DNS lookups as we go along. They tend to work fine.
NTP
The scale syncs it’s clock. This, again, is normal. It’s necessary because
it will later use HTTPS, and that requires a valid clock.
time.garmin.com doesn’t exist, but it doesn’t stop the clock
from syncing. The scale also does normal NTP on time.google.com
and another NTP call on clock.garmin.com on port 4123.
HTTP
The scale proceeds to send POST /OBN/OBNServlet to
gold.garmin.com. This seems to be mainly to get a Cloudflare
response, for example CF-RAY.
This doesn’t usually fail, but the errors start here. The reason it
doesn’t fail outright is that it will retry, and eventually a retry
will work before the clock runs out.
HTTPS
Now the scale starts sending data to:
- services.garmin.com
- api.gcs.garmin.com
- connectapi.garmin.com
- omt.garmin.com
At this point the errors accumulate, and eventually the clock does run
out. The errors slow down the connection to the point where the scale
fails to send it’s data inside the 5 minute timeout.
Accumulated Problems
So what exactly fails? Once data flows, the scale fails to receive
the server's TCP ACK packets about 80% of the time. If too many packets
are missed, the server closes the connection, and the scale tries again.
Once the scale retries too many times, it gives up.
Since this happens ±80% of the time, and multiple connections are made,
the scale fails very often. Every now and then it works.
Problem Details
TCP 101
TCP was created to transfer data without having the application worry
about reliability. Data gets chopped up, usually in ±1500
bytes. If it needs to get chopped up further, the
OS gets notified.
TCP also takes care of putting the data back together again. This can
be more difficult than just Packet 1 + Packet 2.
- Packet 2 can arrive before Packet 1
- Packet 2 can get lost, requiring retransmission
Let’s look at this in a bit more detail. Let’s say the scale wants
to send 5000 bytes of data. It gets chopped up into 1500-byte chunks.
| Time  | Sender | Recipient | Sequence | Length | Acknowledgement |
|-------|--------|-----------|----------|--------|-----------------|
| 00:01 | Scale  | Server    | 0        | 1500   | 0               |
| 00:02 | Server | Scale     | 0        | 0      | 1500            |
| 00:03 | Scale  | Server    | 1500     | 1500   | 0               |
| 00:04 | Server | Scale     | 0        | 0      | 3000            |
| 00:05 | Scale  | Server    | 3000     | 1500   | 0               |
| 00:06 | Server | Scale     | 0        | 0      | 4500            |
| 00:07 | Scale  | Server    | 4500     | 500    | 0               |
| 00:08 | Server | Scale     | 0        | 0      | 5000            |
As you can see, the acknowledgements tell the scale how much data
the server has received, and from where to continue.
TCP 102
Of course, you can’t just start sending data. You have to tell the
server you want to send data, and the server must accept, so it goes
something like:
- Handshake
  - SYN client → server
  - SYN/ACK server → client
  - ACK client → server
- Data, as above.
  - ACK client → server
  - ACK server → client
  - …
- Stop
  - FIN/ACK client → server
  - FIN/ACK server → client
  - ACK client → server
What is Observed
The scale starts sending data, but the server’s packets aren’t received.
The scale then eventually resends packets. Something like:
| Time  | Sender | Recipient | Sequence | Length | Acknowledgement |
|-------|--------|-----------|----------|--------|-----------------|
| 00:01 | Scale  | Server    | 0        | 1500   | 0               |
| 00:02 | Server | Scale     | 0        | 0      | 1500            |
| 00:03 | Scale  | Server    | 1500     | 1500   | 0               |
| 00:04 | Server | Scale     | 0        | 0      | 3000            |
| 00:06 | Server | Scale     | 0        | 0      | 3000            |
| 00:10 | Server | Scale     | 0        | 0      | 3000            |
| 00:18 | Server | Scale     | 0        | 0      | 3000            |
| 00:34 | Server | Scale     | 0        | 0      | 3000            |
| 00:35 | Scale  | Server    | 3000     | 1500   | 0               |
| 00:36 | Server | Scale     | 0        | 0      | 4500            |
Note: The time between the server's retries increases exponentially as the server asks for more
data. If this happens too often, the scale times out, and no data is
sent.
This means the scale doesn’t receive the ACK packets from the server.
What Else is Dropped?
At the start, nothing. DHCP, DNS and NTP all work. Once the scale
starts using HTTP (over TCP), packet drops start. This is usually about
15 seconds after the scale connects to the Wifi network.
However, once the packet drops start, other packets are dropped too.
- Ping
  - ICMP packets to check if the scale is up.
- ARP
  - Ethernet packets to match an IP to a MAC address.
ARP being dropped is very interesting. When ARP goes, everything
stops until ARP is answered. This would indicate that Wifi encryption
updates might get dropped too.
TCP 201
The retransmitted acknowledgements need not come from the actual
server, although they look like they do. They can and often do come
from a router or firewall in the middle as a performance optimization.
Workarounds that Don’t Work
Different Wifi
I am in the lucky position to try multiple Wifi routers, so I did.
I tried 3 different ones.
All Alone
Because I tried multiple routers, the scale was the only device on the
network. There was no traffic congestion, and no competition.
This did not help.
Different Encryption on Wifi
Encrypted Wifi can be WPA or WPA2, and support different
authentication methods and encryption standards. WPA uses TKIP,
WPA2 uses CCMP. Nope, this did not help.
Note: Different router hardware might not let you set some
protocols, since it may be handled in the Wifi network hardware.
Quality of Service
Boost Garmin IP networks. Boost the scale. Boost empty ACKs and
retransmitted ACKs. Nope.
Different Channel on Wifi
The Garmin Index S2 Scale officially only supports channels 1-11.
Since I’m not in the US, Wifi equipment usually uses channels 1-13, and
uses different frequency blocks.
Note: To get certified for sale in non-US countries (like the EU),
the scale would be tested by the local regulator, and must pass local
wireless regulations. Therefore I do not believe the official
documentation: the scale would be illegal for sale.
However, I did try hard coding channels 1, 6, 11, and different country
regulations on the Wifi router. This did not make a difference.
Different Garmin
The various Garmin services resolve to multiple IP addresses. This is
likely for load balancing.
Modify the DNS on the router to supply specific ones. This didn’t help.
Note: This might not make as much of a difference as you think,
since they are behind Cloudflare, and the CDN will intercept and
reroute these connections.
Computer Captcha
Since it’s behind Cloudflare, the scale might be hitting Cloudflare’s
captcha protection. Maybe HTTP and HTTPS don’t work unless the scale
can prove it’s human.
Network dumps prove this is not what’s happening. They do reveal
Garmin’s ID and other Cloudflare details, that I will not post in
this blog.
Adjusting the MTU
A common problem in network stacks is that they don’t notice when the
MTU needs adjusting. This can and does happen when the transport changes,
for example when it changes from Wifi to Cable.
Yes, I did try adjusting this, both larger and smaller.
And I did check for the Do Not Fragment (DF) flag and Fragmentation Needed messages.
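For reference, lowering the MTU on a Linux-based router looks something like this (the interface name is an example):
ip link set dev wlan0 mtu 1400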
Hardcode ARP
This is easy, and it removes some network packets from the equation.
It does not fix the problem, though.
Workarounds that Work
No Encryption on Wifi
This works perfectly, but it’s not acceptable in my household.
No encryption does result in no retransmissions, though.
Hack the RTO
The retransmissions are also called TCP RTO (retransmission timeout).
We can reduce this, and increase the number of RTO packets that the router
sends, to greatly increase our chances of sending a retransmission at
the moment the scale is listening again.
This does work, but it requires a few things:
- A custom kernel
- Enable custom kernel
- A proxy
Custom Kernel
We need a custom kernel because we’re going to move these values out
of the TCP specification. Since this network isn’t used for anything
else, let’s do it.
There are a number of queries on mailing lists for this. However,
there aren’t a great many final solutions. As I said, it’s outside spec.
I’m including a diff here, in case other people want it. This
- Increases the number of retries, so it doesn’t stop before the
scale times out.
- Decreases the highest timeout value, so we will retry much quicker.
- Sets every connection to thin, meaning that the timeouts increase
linearly instead of exponentially.
This diff is for Linux 6.12, and should work on Debian Stable.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b3917af30..cce7a5350 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -90,14 +90,14 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCP_URG_NOTYET 0x0200
#define TCP_URG_READ 0x0400
-#define TCP_RETR1 3 /*
+#define TCP_RETR1 8 /*
* This is how many retries it does before it
* tries to figure out if the gateway is
* down. Minimal RFC value is 3; it corresponds
* to ~3sec-8min depending on RTO.
*/
-#define TCP_RETR2 15 /*
+#define TCP_RETR2 30 /*
* This should take at least
* 90 minutes to time out.
* RFC1122 says that the limit is 100 sec.
@@ -138,8 +138,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCP_DELACK_MIN 4U
#define TCP_ATO_MIN 4U
#endif
-#define TCP_RTO_MAX ((unsigned)(120*HZ))
-#define TCP_RTO_MIN ((unsigned)(HZ/5))
+#define TCP_RTO_MAX ((unsigned)(5*HZ))
+#define TCP_RTO_MIN ((unsigned)(HZ/10))
#define TCP_TIMEOUT_MIN (2U) /* Min timeout for TCP timers in jiffies */
#define TCP_TIMEOUT_MIN_US (2*USEC_PER_MSEC) /* Min TCP timeout in microsecs */
@@ -226,7 +226,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
#define TCP_NAGLE_PUSH 4 /* Cork is overridden for already queued data */
/* TCP thin-stream limits */
-#define TCP_THIN_LINEAR_RETRIES 6 /* After 6 linear retries, do exp. backoff */
+#define TCP_THIN_LINEAR_RETRIES 60 /* After 6 linear retries, do exp. backoff */
/* TCP initial congestion window as per rfc6928 */
#define TCP_INIT_CWND 10
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index b65cd417b..e5656e919 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -639,7 +639,7 @@ void tcp_retransmit_timer(struct sock *sk)
*/
if (sk->sk_state == TCP_ESTABLISHED &&
(tp->thin_lto || READ_ONCE(net->ipv4.sysctl_tcp_thin_linear_timeouts)) &&
- tcp_stream_is_thin(tp) &&
+ //tcp_stream_is_thin(tp) &&
icsk->icsk_retransmits <= TCP_THIN_LINEAR_RETRIES) {
icsk->icsk_backoff = 0;
icsk->icsk_rto = clamp(__tcp_set_rto(tp),
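For completeness, building and installing a patched kernel on Debian might look roughly like this. The sources can come from the archive or kernel.org, and the file names are examples:
# in the unpacked linux-6.12 source tree
patch -p1 < tcp-rto-hack.diff
make olddefconfig
make -j"$(nproc)" bindeb-pkg
dpkg -i ../linux-image-*.deb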
Enable custom kernel
To get the full effect, you may need to enable thin streams on some
Linux distributions.
echo 1 > /proc/sys/net/ipv4/tcp_thin_linear_timeouts
Add it to /etc/sysctl.conf or it’s subdirectories
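For example, to make it persistent (the file name is arbitrary):
echo 'net.ipv4.tcp_thin_linear_timeouts = 1' > /etc/sysctl.d/99-tcp-thin.conf
sysctl --system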
Proxy
We only control TCP RTO on local sockets and some NAT-ed connections, so
we need to setup a proxy. I tried this first with HTTP, using Squid, and
it worked. However I also needed to do HTTPS, and then encryption
certificates make it hard work.
However, I’m not looking inside the packets. I don’t need to decrypt them,
just forward them. I’m just interested in modifying TCP RTO, so I can
treat HTTPS like any other TCP Socket.
The easiest way to do this is with a systemd socket or xinetd redirect
and NAT. You will need one per Garmin IP address.
Example xinetd config:
service garmin-gold_garmin_com
{
type = UNLISTED
socket_type = stream
protocol = tcp
wait = no
user = nobody
bind = 0.0.0.0
port = 3129
only_from = **my_network**
redirect = gold.garmin.com 80
}
and a nat rule:
table inet filter {
    chain prerouting {
        type nat hook prerouting priority -100; policy accept;
        ip saddr **scaleip** ip daddr { gold.garmin.com } tcp dport { 80 } dnat to 10.43.0.1:3129
    }
}
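Loading the pieces then looks something like this (the file name is an example; repeat the xinetd service and the nft rule once per Garmin IP address and port):
nft -f garmin-redirect.nft
systemctl restart xinetd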
Probable Cause
This only happens when the network is encrypted. The scale is not
recording the server’s data, and the scale is also not sending errors back.
This could happen if:
- the controller is too slow to decrypt
- the firmware decides the packets are corrupt
- an interrupt goes missing
- not enough RAM, hence forced to drop
Which exactly it is is unknown, and where you draw the line between
controller and firmware can be a grey area. For example, in most
computers, the network card will perform some calculations on the
packets instead of the OS. This is called hardware off-loading,
but in practice it happens in firmware.
Dec 07, 2024
tl;dr
It used to be possible to use a DSLR as a webcam, or in OBS, on Linux,
and with the move to pipewire this broke.
Here’s the code to make it work again:
gst-launch-1.0 \
clockselect \
v4l2src \
device=/dev/video0 \
! queue ! videoconvert \
! pipewiresink mode=provide stream-properties="properties,media.class=Video/Source,media.role=Camera" \
client-name=DSLR
background
On Linux, you can use your DSLR as a webcam, and have any application use
it. What you need is:
* gphoto2
* ffmpeg
* v4l2loopback (kernel module)
The recipe can be slightly different based on camera functionality like
frame rate and resolution, but it basically follows
gphoto2 --set-config liveviewsize=2 \
--stdout --capture-movie \
| ffmpeg -i - \
-re \
-vcodec rawvideo \
-pix_fmt yuv420p \
-threads 2 \
-f v4l2 /dev/video0
Most DSLRs are supported.
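The recipe assumes /dev/video0 is a v4l2loopback device; loading the module might look like this (the options shown are common but optional):
sudo modprobe v4l2loopback video_nr=0 card_label="DSLR" exclusive_caps=1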
With the move to pipewire, another step is needed. This blog post
describes that step.
You may need to change:
- liveviewsize=2
  - Different cameras will have different options
- -re
  - -re should make ffmpeg match the native framerate, which may save CPU cycles
clock bug
The output of the gst command should have an advancing clock. If the
clock remains stuck on 00:00:00, you will need to add clockselect
to the gst-launch
command at the top.
This is related to Pipewire Regression 4389
when
Right now, December 2024. Firefox, OBS and Chrome are moving to pipewire.
The move has happened in some distributions, and is in progress in others.
firefox
Look in about:config
for media.webrtc.camera.allow-pipewire
chrome
Look in chrome://flags
for Pipewire Camera Support
who
pipewire aims to improve the handling of audio
and video on Linux, and it’s pretty good at that.
You’ll need version 1.2.6 or later for this to work. Look for
libpw-v4l2.so
.
what
A DSLR camera, attached via USB, that you previously used with gphoto2
and v4l2loopback.
where
Look for a section like:
Video
 ├─ Devices:
 │      56. ...
 │
 ├─ Sinks:
 │
 ├─ Sources:
 │  *   64. USB2.0 FHD UVC WebCam (V4L2)
 │      93. DSLR
You want your DSLR to show up under Sources.
You can then get more information using pw-cli info 93
why
Update to pipewire. It really is better.
how
After running gphoto2 the way you normally do, run the command
at the top. You can name your DSLR something else if you like.
Test it using wpctl status and pw-cli info.
Apr 05, 2024
xz and openssh
There was a briefly successful attempt to add a backdoor to /usr/sbin/sshd
on Linux.
There are a lot of discussions of how, when, what and why already on the
Internet. I’ll include a few links below.
Most posts fall in one of two camps:
- brief description
- detailed description, starting with a .m4 file.
This post instead will try to turn this into a story. What are the
attacker’s goals, and how can they best be achieved?
Hopefully this different perspective will make it understandable to
a different audience, which will help prevent this in the future.
goals
The attacker's goal is to add a backdoor to some software that is both
commonly installed, and has privileged access.
Some criteria for a good back door:
- The door is hidden
- The door can be used unnoticed
- The door has a lock, so others can’t use it
- Because it’s open source, the door plans are hidden too.
location of the plans
The obvious location is OpenSSH. However, that project is very well
run. That means hiding the plans becomes difficult.
Well, where else can we hide it? Let’s look:
/lib64/ld-linux-x86-64.so.2
libcrypt.so.1 => /lib64/libcrypt.so.1
libaudit.so.1 => /lib64/libaudit.so.1
libpam.so.0 => /lib64/libpam.so.0
...
liblzma.so.5 => /lib64/glibc-hwcaps/x86-64-v3/liblzma.so.1.2.3
libzstd.so.1 => /lib64/glibc-hwcaps/x86-64-v3/libzstd.so.4.5.6
...
It turned out that liblzma, from xz utils, is a good choice. On
a different day, or a different project, one of the others could
have worked.
chapter 1: place the door
front door
If we modify liblzma, how does that open a door in /usr/sbin/sshd?
For that, we need ifunc(). This behaves a little like $LD_PRELOAD,
in that we can override functions. It has advantages for the attack:
- It’s less well known. Most setuid binaries filter
$LD_PRELOAD
- It’s better hidden.
Let’s give a quick ifunc() example:
int add_numbers_fast(int x, int y) {
return x + y; /* add */
}
int add_numbers_slow(int x, int y) {
return x + y; /* add */
}
int add_numbers(int x, int y)
__attribute__((ifunc ("resolve_add_numbers")));
static void *resolve_add_numbers(void)
{
if (1)
return add_numbers_fast;
else
return add_numbers_slow;
}
Compile with
gcc -fPIC -shared add_numbers.c -o add_numbers.so
This has two implementations of add_numbers(), and based on
some criteria — like CPU — we pick one.
This is fairly common for performance-critical functions. This
is also true for some encryption functions in OpenSSH.
We would then use it with
#include <stddef.h>
#include <stdio.h>
extern int add_numbers(int x, int y);
int main(int argc, char *argv[]) {
int x, y;
x = y = 3;
printf("add_numbers(%d, %d) = %d\n",
x, y,
add_numbers(x, y));
return 0;
}
Output:
add_numbers(3, 3) = 6
backdoor
Suppose the library includes another implementation, like so:
extern int add_numbers(int x, int y);
int add_numbers_fake(int x, int y) {
return x * y; /* multiply instead! */
}
int add_numbers(int x, int y)
__attribute__((ifunc ("resolve_add_numbers")));
static void *resolve_add_numbers(void)
{
return add_numbers_fake;
}
Now we get a modified output:
add_numbers(3, 3) = 9
a note on compiling
To compile the examples, it’s best to split compilation and linking, like
gcc -fPIC -shared add_numbers.c -o add_numbers.so
gcc -fPIC -shared add_numbers_backdoor.c -o add_numbers_backdoor.so
gcc -c ifunc_example.c -o ifunc_example.o
ld /usr/lib64/crti.o /usr/lib64/crtn.o /usr/lib64/crt1.o \
-lc ifunc_example.o \
add_numbers_backdoor.so add_numbers.so \
-dynamic-linker /lib64/ld-linux-x86-64.so.2 \
-o ifunc_example
Then you can swap implementations by swapping the order of
add_numbers.so and add_numbers_backdoor.so
chapter 2: hiding the door
Now that we have something to include, we need to hide it. We cannot
add random function names to xz utils: it will get noticed.
A number of systems do automatic checks, like:
$ file *
add_numbers_backdoor.c: C source, ASCII text
add_numbers_backdoor.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=6883795af4a391d0f9cb256aea233498a37ba668, with debug_info, not stripped
add_numbers.o: ELF 64-bit LSB relocatable, x86-64, version 1 (GNU/Linux), not stripped
ifunc_example.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
Adding a binary blob that matches an ELF object will be suspicious.
What we can do is add the file encrypted. In xz this is easy, because
it includes binary test data.
A very simple encryption scheme that shifts a -> b, b -> c, c -> d, …
# encrypt
echo 'hello world' | tr '[a-z]' '[b-z]a'
ifmmp xpsme
To decrypt
# decrypt
echo ifmmp xpsme | tr '[a-z]' 'z[a-y]'
hello world
Now we simply need to decrypt it before compiling. We can add the
following to either the .m4 file or the Makefile. Our choice.
cat myfile | tr '[a-z]' 'z[a-y]' > myfile.c
gcc myfile.c
chapter 3: hiding the use
Now that we can include some code in OpenSSH, where is the best place
to put it?
- before login, so we’re still root
- before sandboxing, so we have full access
- after encryption started, so the commands are encrypted
This is why the exploit overrides RSA_public_decrypt.
links
Links to the actual files, and diagnosis of the actual files:
Mar 01, 2024
Finger and Webfinger
Finger and Webfinger answer the same question: “what information is
available about this user?”
This blog was written because both were harder to get going than necessary.
Webfinger
Webfinger does something similar. Query my mastodon handle,
@berend@emptybox.deschouwer.co.za, and you should get something like
{
"aliases": [
"https://emptybox.deschouwer.co.za/nextcloud/index.php/index.php/apps/social/@berend",
"https://emptybox.deschouwer.co.za/nextcloud/index.php/u/berend"
],
"links": [
{
"href": "https://emptybox.deschouwer.co.za/nextcloud/index.php/u/berend",
"rel": "http://webfinger.net/rel/profile-page",
"type": "text/html"
}
],
"subject": "berend@emptybox.deschouwer.co.za"
}
It’s a good API to find more information about a specific user.
Finger
Finger is old. It’s from 1991, 5 years before http.
You can see it in action by running
finger berend@berend.deschouwer.co.za
The original finger server was horribly insecure. This server is
not running that.
Webfinger Back
The Webfinger backend is provided by the Social app on Nextcloud.
I’m going to document some of the pitfalls here, since:
- An installation step isn’t documented
- Some of the error reporting is misleading.
- All of the official help is “configure your webserver/proxy”, even
when the answer isn’t that.
- google-ing this doesn’t help, since everything redirects you back
to webserver configuration help.
Once, for Nextcloud
Yes, you do need to configure your webserver.
Under your nextcloud instance, when logged in as admin, navigate to security.
If Nextcloud Admin Security Check complains with a similar message
Your web server is not configured correctly to resolve “/.well-known/caldav”.
More information can be found on our documentation.
Your web server is not configured correctly to resolve “/.well-known/carddav”.
More information can be found on our documentation.
You need to fix your webserver or proxy. Follow Google. The solutions are complete.
Configuring Nextcloud
Twice, for the Social App in Nextcloud
Nope, you don’t need to do it twice.
If the Social App complains with a similar message
.well-known/webfinger isn't properly set up!
Social needs the .well-known automatic discovery to be properly set up.
If Nextcloud is not installed in the root of the domain, it is often the
case that Nextcloud can't configure this automatically
The problem is not your webserver. It’s the Social App
Configuring the App (the missing installation step)
You will need the CLI occ from Nextcloud.
You may have used it to perform backups or upgrades. It’s in the webroot,
and you will usually run it something like:
sudo -u www /var/www/nextcloud/occ backup
First, look at the config. Run:
cd /var/www/nextcloud/
sudo -u www php ./occ config:list
Look for the Social app.
"social": {
"address": "https:\/\/emptybox.deschouwer.co.za\/",
"enabled": "yes",
"url": "https:\/\/emptybox.deschouwer.co.za\/nextcloud\/index.php\/apps\/social\/",
"social_url": "https:\/\/emptybox.deschouwer.co.za\/nextcloud\/index.php\/index.php\/apps\/social\/",
"cloud_url": "https:\/\/emptybox.deschouwer.co.za\/nextcloud\/"
},
Ensure that these values are OK. If they are not, run
sudo -u www php ./occ config:app:set --value https://emptybox.deschouwer.co.za/overthere social url
Three, misleading errors
Webfinger not supported
Now you get to try:
http --follow 'https://emptybox.deschouwer.co.za/.well-known/webfinger?resource=acct%3Aberend'
If you get 404 and
{
"message": "webfinger not supported"
}
Don’t panic! It doesn’t mean webfinger isn’t supported, it means the account specified isn’t a valid account.
You didn’t specify a domain. Add a domain.
Webfinger is empty and 404
http --follow 'https://emptybox.deschouwer.co.za/.well-known/webfinger?resource=acct%3Aberend%40example.com'
If you get a 404 and an empty body, don't panic! It means that the
specified account (berend@example.com) is not on this server.
Your Nextcloud instance isn't example.com, so you should specify an
account you are authoritative for.
Finger Back
The backend for 1991 finger is a simple script that serves my whoami page.
Since the actual finger servers are insecure, I decided to re-implement
it. Since a re-implementation risks being even more insecure, I simplified everything.
systemd to the rescue
This is the reason for writing this section. systemd can really help us
lock it down.
[Unit]
Description=Finger Per-Connection Server
[Service]
ExecStart=/usr/.../.py # Keep some things secret
StandardInput=socket
# Send errors to the logs, instead of to the caller
StandardError=journal
SyslogIdentifier=finger
# Don't run as root
DynamicUser=yes
# Don't read /tmp, /dev, /home
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
# Restrict for DoS
MemoryMax=100M
Nice=19
IOSchedulingClass=best-effort
IOSchedulingPriority=7
CPUQuota=50%
IOWeight=25
# Restrict OS calls
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM
ProtectSystem=strict
RestrictSUIDSGID=true
MemoryDenyWriteExecute=true
# Don't load calls that are frequently used by exploits
InaccessiblePaths=/usr/bin/at
InaccessiblePaths=/usr/bin/bash
InaccessiblePaths=/usr/bin/sh
InaccessiblePaths=/usr/bin/wget
InaccessiblePaths=/usr/bin/curl
InaccessiblePaths=/usr/bin/ssh
InaccessiblePaths=/usr/bin/scp
InaccessiblePaths=/usr/bin/perl
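The service above expects systemd to hand it one connection on stdin. The matching socket unit isn't shown in this post; a minimal sketch (my assumption, not the exact unit I run) would be something like the following, with the service installed as a template (finger@.service), since Accept=yes spawns one instance per connection:
[Unit]
Description=Finger Socket
[Socket]
ListenStream=79
Accept=yes
[Install]
WantedBy=sockets.target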
User
The first level of security is to not run as root.
No public directories
No public /tmp or any such directory. If you do hack it, you're in a sandbox.
Limited Access
Then we lock it down a bit for DoS reasons. We run it low priority,
with limited RAM, so you don’t take down the entire machine.
Too many details
Then we block access to some other programs. Some of these are for example purposes.
It’s really, really nice that systemd allows us to lock it down
like this.
Multiple users? Nope
The original finger backend allows you to query any user that exists,
and changes the output based on whether they are logged in.
Not for me. I give you the same answer, no matter what.
Redirects? Nope
You could query users on another server. Connecting to example.com,
you could ask it about joe@smith.com. Not on my server.
Memory Leaks? Nope
You could send very large user names. On my server, you get 512 bytes,
then I disconnect.
Since we don’t care about the data, we discard it.
discard = os.read(0, 512)
DoS? Nope
You can’t keep the finger socket open for very long. If it doesn’t receive
data very soon, it will disconnect.
def timeout(signum, frame):
raise Exception("Timed out")
signal.signal(signal.SIGALRM, timeout)
signal.alarm(3)
DDoS? Yep
DDoS is always a “yep”.
More details
I’d give you more details, but here aren’t. We timeout after 3 seconds,
we read a maximum of 512 bytes of data, and we always give the exact same response.
Feb 22, 2024
What Question Am I Answering?
A question came up in a coding class, about why in Rust there’s a borrow,
and there’s a an error when borrowing.
You may have seen it like (copied from the rust book):
error[E0499]: cannot borrow `s` as mutable more than once at a time
--> src/main.rs:5:14
|
4 | let r1 = &mut s;
| ------ first mutable borrow occurs here
5 | let r2 = &mut s;
| ^^^^^^ second mutable borrow occurs here
6 |
7 | println!("{}, {}", r1, r2);
| -- first borrow later used here
For more information about this error, try `rustc --explain E0499`.
error: could not compile `ownership` due to previous error
The Rust book does explain what borrow is, and the theory behind
why it’s allowed and not allowed.
I’m showing the risks practical example.
How Am I Answering It?
With as little computer theory as possible.
- No assembler
- No diagrams
- No multiple languages
- short, simple examples
- No theory until after the bug is shown
Caveat
The answer is in C. It’s in C because Rust won’t let me break
borrow, not even in an unsafe {} block.
It’s all in plain C, though. No assembler, and no C++. As
straightforward C as I can make it, specifically to allow for
plain examples. As few shortcuts as possible are taken.
I’m not trying to write a C-programmer’s C code. I’m trying to
write an example that shows the problem, for non-C programmers.
It should not be difficult to follow coming from Rust.
Inspiration
The inspiration comes from the C max() function, which is actually
a macro.
That is great for explaining the differences between functions and macros,
and it’s great for explaining pass-by-code instead of value or reference.
It’s also great for explaining surprises.
First Examples
First, we give example code for pass-by-value and pass-by-reference.
Pass-by-reference is also called borrow in Rust. The examples are simple,
and we do not yet discuss the difference.
For now, it’s simply about C syntax.
#include <stdio.h>
int double_by_value(int x) {
x = x * 2;
return x;
}
int double_by_reference(int *x) {
*x = *x * 2;
return *x;
}
int main() {
int x;
x = 5;
printf("double_by_value(x) = %d\n", double_by_value(x));
/* double_by_value(x) = 10 */
x = 5;
printf("double_by_reference(x) = %d\n", double_by_reference(&x));
/* double_by_reference(x) = 10 */
return 0;
}
The code is almost identical. You can see the syntax difference, but no
functional difference yet.
The logic is the same, the input is the same, and the output is the same.
Let’s make it a tiny bit more challenging, and add a second variable.
#include <stdio.h>
int add_by_value(int x, int y) {
x = x + y;
return x;
}
int add_by_reference(int *x, int *y) {
*x = *x + *y;
return *x;
}
int main() {
int x, y;
x = y = 5;
printf("add_by_value(x, y) = %d\n", add_by_value(x, y));
/* add_by_value(x, y) = 10 */
x = y = 5;
printf("add_by_reference(x, y) = %d\n", add_by_reference(&x, &y));
/* add_by_reference(x, y) = 10 */
return 0;
}
The logic is the same, the input is the same, and the output is the same.
Put it together
Add the two and we get…
#include <stdio.h>
/* Previous example code here */
int add_and_double_by_value(int x, int y) {
x = double_by_value(x);
y = double_by_value(y);
x = x + y;
return x;
}
int add_and_double_by_reference(int *x, int *y) {
*x = double_by_reference(x);
*y = double_by_reference(y);
*x = *x + *y;
return *x;
}
int main() {
int x, y;
x = y = 5;
printf("add_and_double_by_value(x, y) = %d\n", add_and_double_by_value(x, y));
/* add_and_double_by_value(x, y) = 20 */
x = y = 5;
printf("add_and_double_by_reference(x, y) = %d\n", add_and_double_by_reference(&x, &y));
/* add_and_double_by_reference(x, y) = 20 */
return 0;
}
No surprises yet. Both examples print the same, correct answer.
The logic is the same, the input is the same, and the output is the same.
Surprise
Now let’s make it simpler…
#include <stdio.h>
/* Previous example code here */
int main() {
int x;
x = 5;
printf("add_and_double_by_value(x, x) = %d\n", add_and_double_by_value(x, x));
/* add_and_double_by_value(x, x) = 20 */
x = 5;
printf("add_and_double_by_reference(x, x) = %d\n", add_and_double_by_reference(&x, &x));
/* add_and_double_by_reference(x, x) = 40 */
return 0;
}
The logic is the same, the input is the same, but the output is different.
Why is it 40?
Double borrow
So what happened? Why did removing a variable, y, result in the “wrong” answer?
Pass by value copies the value, and passes the value, not the variable. The original is
never modified.
Pass by reference (or borrow), passes a reference, not a copy. The original is modified.
Look at
int add_and_double_by_reference(int *x, int *y) {
*x = double_by_reference(x);
*y = double_by_reference(y);
*x = *x + *y;
return *x;
}
When x is doubled, at *x = double_by_reference(x);, and *x is also *y,
y is also doubled. x is now 10, as expected, but y is also 10.
Then y is doubled. And since *y is also *x, x is doubled again. Now both
variables are 20.
Tada!
This is (one of the reasons) why Rust won’t let you borrow the same variable twice.
What do you do instead?
Borrow once, or .clone() /* copy */.
Bonus time
So why have pass by reference or borrow at all?
- It’s faster, when the data is large.
- It’s faster for streamed data.
- It’s not possible in assembler pass complicated variables by value. Manual copies
must be made, and the language you use might not implement implicit copies.
For extra points, do the same with threads.
Jan 25, 2023
Who Is This Post For?
Anyone who has tried to connect a Garmin smart device to the
cloud with Garmin Connect.
Anyone who has tried to connect any smart device that has no
keyboard to any cloud anywhere.
What Device
A Garmin Index S2 scale. This is a human weight scale that can
upload your weight to the cloud.
The scale connects to the cloud directly via wifi. It can connect
to a phone via bluetooth, but only for wifi configuration. Everything
after that goes via wifi.
Even small things like 12/24 hour clock configuration go via wifi.
The scale has no keyboard, and a limited screen, so configuration
is via a phone.
Problem Experienced
The scale does not send any weight data to the cloud. The
scale does not receive any configuration changes from the cloud.
The scale does display weight data, and a wifi signal strength icon.
The scale does not display any errors. There is no indication
that anything is wrong.
Garmin Connect App on the phone does not display any errors.
It claims everything is working.
More About Garmin Connect
- Garmin Connect means two different things:
- A Phone App, used to configure various devices
- A Web App, used to view the data
More About Errors Not Displayed
On the Scale
The scale doesn’t show wifi errors. It can display some text,
like your name, but doesn’t even show E123, or any other error
code. The user believes it works.
The scale does show a wifi signal strength icon (triangle,
4 lines for strength), and displays a sync icon (twirling circle).
Neither icon changes to an error (a red X, for example). The user
is led to believe it works.
On the Phone
The phone does not show any errors, wifi or otherwise.
The phone does display setup complete.
The phone app let’s you test wifi. It’s a bit complicated,
but you can. It will then display Wifi Connected OK.
Suspected Error
Eventually, after re-configuring hundreds of times, I suspect
a network error. I suspect the scale is trying to connect
to garmin.com on a non-standard port.
It’s not that, but it gives…
The Only Error Displayed
Hours later, I finally look on the router. I’m looking
at the network traffic, and see:
4-Way handshake failed for ifindex: 3, reason: 15
KEY_SEQ not returned in GET_KEY reply
So I know wifi isn’t working. The WPA2 handshake
is not completing even though the phone app thinks
the scale’s wifi is OK.
Non-specified Requirements
Bluetooth or Wifi? Bluetooth and Wifi
When you configure the scale, you should eventually
see a message on Garmin Connect on the phone that
sync completed, along with a green icon.
If you do not see this, the scale is on a different
wifi network than the phone.
Why would that be? In my case the phone is on a 5G
network (because it can) and the scale is on a 2.4G
network (because it can’t)
Until you put the phone on the same network as the
scale, the last message you will see is:
configuration completed, not sync completed,
and this means that the scale is not working.
At this point, the scale isn’t syncing to the
Garmin cloud. It’s just synced it’s configuration
with the phone.
WPA? WPA2? WPA1.5
The scale claims to support WPA2 on a 2.4G network.
For a device that was first announced in 2020, this is
atrocious, but that's true for a lot of these
smart devices.
Even then, it doesn’t work with the entire gamut
of WPA2 protocols.
In my case, I had to swap from iwd to
wpa_supplicant to connect the scale.
Questions for Garmin
- Why do you display that the wifi network tested OK
when the 4-way handshake failed? Couldn’t you add
a connection to https://test.garmin.com or something?
- Why do you not display a sync error message on either,
but preferably both the scale and phone app?
- Why do you not check that the phone and scale wifi
are the same? The setup fails silently otherwise.
- Why do you display configuration complete, when
it isn’t yet complete?
Dec 27, 2022
Who Is This Post For?
Rust programmers tracking down strange behaviours that doesn’t
always show up in debuggers or tracers
What Was Rust Used For?
I wanted to connect browsers to terminal programs. Think
running a terminal in a browser like hterm or ajaxterm.
One side of the websocket runs a program that may at any
time send data. In between there are pauses that stretch
from milliseconds to hours.
The other side is the same.
This is a perfect fit for asynchronous programming. It’s also
a candidate for memory leaks over time.
Both problems were tackled using Rust.
Problem Experienced
Sometimes the Rust program would stop. The program would
still run in the background, running epoll(7), indicating
that an async wait was running.
The program would not crash, and would not run away on the
CPU.
The last statement executed:
debug!("This runs");
Err("This does not run!")
}
Which is strange, to say the least.
This would only happen on single-core machines. On
machines with two or more cores, it would run fine.
This would happen on multiple target architectures,
and multiple OSes.
It would go into an infinite loop on Err(“…”) on
single core machines.
More About the Program
The program runs two parallel asynchronous threads, and waits
for either thread to stop.
It does that because the network side or the terminal side
could stop and close the connection. So it basically runs:
task::spawn(tty_to_websocket());
task::spawn(websocket_to_tty());
try_join!(tty_to_websocket, websocket_to_tty);
try_join! should wait for either task to stop with an error.
I’ve setup both tasks to throw an error even on successful
completion. This is because join! might wait for both
to stop, and it’s possible for either side to stop without the
other noticing.
try_join! never completes, because Err() never completes,
which is strange.
What Does Err() Do?
Err() ends the function. In Rust that also runs the
destructors, like an object-oriented program might. Let's
say you have a function like:
fn error() -> Result<(), &'static str> {
    let number: i32 = 2;
    Err("error")
}
When Err() runs, Rust de-allocates number. The memory
is returned.
For an int this is simple, but it can be more complicated.
What Is A TTY?
A TTY, for a program, is a file descriptor on a character device.
It’s a bi-directional stream of data, typically keyboard in and
text out.
The C API to use one uses a file descriptor. One way to get
such a file descriptor is forkpty(3), which has a Rust crate.
Most Rust code wants a std::fs::File, not a raw file
descriptor, so it needs to be converted:
use std::os::unix::io::FromRawFd;
let rust_fd = unsafe { std::fs::File::from_raw_fd(c_fd) };
The first bell is unsafe {}. The code is indeed unsafe
because we’re working with a raw file descriptor.
The second bell is in the documentation for from_raw_fd.
The documentation is in bold in the original:
This function consumes ownership of the specified file
descriptor. The returned object will take responsibility
for closing it when the object goes out of scope.
Where Is The Bug?
The bug happens because both tasks need a std::fs::File.
One to read the TTY, and one to write to it.
Both tasks consume ownership, and both tasks take responsibility
for closing it.
Both destroy the rust_fd and hence close the c_fd, when
the tasks run Err().
Expected Bug
The expected bug is that the second task to close won’t be able
to close. The second task should get EBADF (bad file descriptor).
However, this is not the bug experienced.
Experienced Bug
The experienced bug is that on single core machines the program
just stops, and keeps calling epoll(), which is something
Rust does at a low level for async functions.
This makes it harder to debug, since there is no panic!, no crash.
Real Bug
The real bug is that on machines with two or more cores, the
program continues fine. It should not continue.
On two or more cores, it should behave the same as on single
core machines.
Solution
The solution is to skip running the destructor.
When it’s just a file descriptor, it can be enough to run
Now we have stopped the crash. We need to take responsibility
and run
try_join!(tty_to_websocket, websocket_to_tty);
close(c_fd);
to prevent leaking file descriptors.
If you wrap the fd in a buffer, don’t forget to de-allocate
the buffer by running:
let buffer = reader.buffer();
drop(buffer);
to prevent a memory leak.