So there I was, staring at my terminal at 2:15 AM last Thursday. Our main database replica on a heavily loaded Ubuntu 24.04 LTS box had just flatlined. No warnings. Just dead.
The junior dev on call was panicking in Slack, convinced we’d been hit by ransomware. We hadn’t. Someone just fundamentally misunderstood how Linux handles file descriptors and group permissions, and a botched deployment script brought the whole house of cards down.
Managing enterprise Linux environments isn’t about memorizing flags for the tar command. It’s about understanding the underlying mechanics of the OS so you don’t panic when things break. Because they will break.
The Permission Nightmare
Let’s talk about users and permissions. Everyone learns chmod 777 when they’re starting out. It’s the duct tape of server administration. But applying it recursively to a shared application directory because a deployment script failed? That’s how you cause a massive outage.
What happened Thursday night was a classic Access Control List (ACL) conflict. The deployment script was running as a service account, but a developer had manually created some directories earlier that day.
Here is a massive edge case that bites people constantly: the interaction between default ACLs and standard POSIX permissions. If you set a default ACL on a directory to ensure a specific group always gets access, and then someone runs a standard chmod 755 on a file inside it, the ACL mask gets recalculated. Suddenly your service account can’t write to the file anymore, even though the ACL says it should.
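You can watch this quirk happen in a throwaway directory. This is a minimal sketch, assuming the acl tools are installed and your filesystem supports ACLs; I'm using my own primary group instead of a real `developers` group:

```shell
# Demonstrate the mask recalculation quirk in a scratch directory.
dir=$(mktemp -d)
grp=$(id -gn)                       # stand-in for the "developers" group
setfacl -d -m "g:$grp:rwx" "$dir"   # default ACL on the directory
touch "$dir/app.conf"               # new file inherits the group ACL entry
chmod 755 "$dir/app.conf"           # an "innocent" chmod rewrites the ACL mask
getfacl "$dir/app.conf"             # group entry now shows: g:<grp>:rwx  #effective:r-x
rm -rf "$dir"
```

The group entry still *says* rwx, but the `#effective:` annotation shows what the recalculated mask actually allows. That annotation is the first thing to look for when an ACL "should" work and doesn't.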
I spent three hours fighting this specific quirk back in February on a different cluster. The fix isn't another round of chmod. You have to wipe the conflicting entries and reapply the default ACLs properly.

# Strip the conflicting extended ACL entries (setfacl -b leaves the
# standard owner/group/other bits alone)
setfacl -b /var/www/shared_app
# Reapply the default ACL so newly created files inherit group access
setfacl -d -m g:developers:rwx /var/www/shared_app
# And grant the group access to the directory itself
setfacl -m g:developers:rwx /var/www/shared_app
But the permissions mess was only half the problem. The application couldn’t write to disk, which caused it to start dumping gigabytes of error logs locally.
Filesystems Don’t Care About Your Feelings
Which brings us to why the database actually died. The disk was 100% full. Or at least, that’s what df -h claimed.
I ran du -sh * on the root directory. We were only using 40GB out of the 200GB volume. The math wasn’t mathing. The dev in Slack was completely lost at this point.
This is a fundamental concept of Linux filesystems (we were running ext4 on kernel 6.8.0-31-generic). When you delete a file that a running process is actively writing to, the file is only removed from the directory tree. You can't see it with ls. But the kernel doesn't free the inode or the disk blocks until the application actually closes the file descriptor.
Someone had noticed the error log getting huge and just ran rm error.log to free up space. The application kept running, kept writing to the deleted file, and filled the disk invisibly.
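You can reproduce the ghost-file behaviour safely in any shell session; nothing here is specific to our incident, it's just the kernel's unlink semantics:

```shell
# Hold a file open, delete it, and watch the kernel keep the inode alive.
tmp=$(mktemp)
exec 3>"$tmp"              # open fd 3 for writing in this shell
echo "still being written" >&3
rm "$tmp"                  # unlink: the name is gone, the blocks are not
readlink /proc/$$/fd/3     # prints the old path with " (deleted)" appended
cat /proc/$$/fd/3          # the data is still readable through /proc
exec 3>&-                  # closing the fd is what finally frees the blocks
```

That `/proc/<pid>/fd/<n>` path is also your escape hatch: if someone deletes a log you still need, you can copy it back out of /proc before restarting the process.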
Finding ghost files is easy once you know the trick.
# +L1 lists open files with a link count below one — i.e. unlinked
# but still held open; the grep just trims the output down
lsof +L1 | grep deleted
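If lsof isn't installed (minimal server images often skip it), you can get the same answer straight from /proc. This is a rough equivalent, not a polished tool, and without root it only sees your own processes:

```shell
# Walk every readable process's fd table looking for unlinked-but-open files.
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        *'(deleted)'*) printf '%s -> %s\n' "$fd" "$target" ;;
    esac
done
```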
Sure enough, a 160GB deleted log file was sitting there attached to PID 4492. I restarted the offending application service. The file descriptor closed, the kernel finally released the blocks, and disk usage dropped from 100% to 18% instantly. Our disk I/O wait times also plummeted from a terrifying 400ms down to 12ms.
Stop Staring at top

You can’t manage servers by just SSHing in and running top when things get slow. It lies to you about memory usage anyway, thanks to how the kernel handles the page cache.
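The number that actually matters is MemAvailable, which the kernel exports directly. "Free" memory on a healthy server is supposed to be low, because the page cache is doing its job:

```shell
# MemAvailable is the kernel's own estimate of how much memory can be
# reclaimed for new workloads without swapping (available since 3.14).
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
```

If MemAvailable is healthy, a tiny MemFree is a feature, not a leak.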
For immediate triage, I use htop (specifically 3.3.0 or later since they improved the I/O tab). But for actual enterprise monitoring, you need metrics you can query historically.
My current workflow relies heavily on Prometheus with node_exporter, but I don’t just use the default metrics. The real value comes from the textfile_collector. I write custom bash scripts that check our specific application states—like whether the shared NFS mounts are actually writable, not just mounted—and dump those results to a text file that Prometheus scrapes.
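Here's the shape of one of those checks, stripped down. The metric name, mount point, and output path are examples, not our production values — the textfile directory in particular depends on what you pass to node_exporter's --collector.textfile.directory flag:

```shell
#!/bin/bash
# Sketch of a textfile_collector check: is the mount actually writable,
# not just present in the mount table?
MOUNT="${1:-/mnt/shared}"
OUT="${2:-/var/lib/node_exporter/textfile_collector/nfs_writable.prom}"

# Probe with a real write, since a stale NFS handle can pass every
# read-only check and still reject writes.
probe="$MOUNT/.write_probe.$$"
if touch "$probe" 2>/dev/null && rm -f "$probe"; then
    writable=1
else
    writable=0
fi

# Write to a temp file in the same directory and rename, so node_exporter
# never scrapes a half-written metrics file.
tmp=$(mktemp "${OUT}.XXXXXX")
printf 'nfs_mount_writable{mount="%s"} %d\n' "$MOUNT" "$writable" > "$tmp"
mv "$tmp" "$OUT"
```

Run it from cron or a systemd timer; alert on the metric being 0, and separately on its staleness so a dead cron job doesn't read as "everything is writable".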
If you want to see exactly what’s dragging your disk down in real-time, skip the basic tools and use eBPF. I keep a few bpftrace one-liners handy for exactly this reason. If you suspect an application is doing weird things with the filesystem, trace the open calls directly.
# Trace which processes are opening files and what they are opening
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
The Systemd Isolation Trap
While I was fixing the application service, I noticed another massive flaw in how it was deployed. The systemd unit file was completely barebones. Just an ExecStart and a User directive.
Modern Linux administration requires isolating your services. If that application had been compromised, it had read access to almost the entire filesystem. I spent ten minutes rewriting the unit file before bringing it back up.
[Service]
ExecStart=/usr/local/bin/app-binary
User=appuser
Group=appgroup
Restart=on-failure
# The stuff that actually matters
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/var/log/app /var/www/shared_app
By adding ProtectSystem=strict, the entire filesystem becomes read-only to that service, apart from the explicit paths defined in ReadWritePaths (and the API filesystems like /dev and /proc). If I had set this up originally, the rogue deployment script wouldn't have been able to trash the permissions in the first place, and the application wouldn't have been able to fill the root disk with error logs. It would have just crashed cleanly.
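There's more headroom here if you want it. These are all real systemd directives, but I haven't run this particular service under them — treat them as candidates to test, not a drop-in guarantee:

```ini
[Service]
# Further hardening worth evaluating for a service like this
NoNewPrivileges=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
RestrictSUIDSGID=yes
# Empty value drops every capability from the bounding set
CapabilityBoundingSet=
```

Run systemd-analyze security against the unit afterwards; it scores the exposure and tells you exactly which directives are still missing.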
I committed the systemd changes to our config repo, watched the database replication catch up, and verified the I/O metrics were stable.
I closed my laptop and went back to sleep. The post-mortem could wait until morning.