So there I was, staring at my terminal at 2:15 AM last Thursday. Our main database replica on a heavily loaded Ubuntu 24.04 LTS box had just flatlined. No warnings. Just dead.
The junior dev on call was panicking in Slack, convinced we’d been hit by ransomware. We hadn’t. Someone just fundamentally misunderstood how Linux handles file descriptors and group permissions, and a botched deployment script brought the whole house of cards down.
Managing enterprise Linux environments isn’t about memorizing flags for the tar command. It’s about understanding the underlying mechanics of the OS so you don’t panic when things break. Because they will break.
The Permission Nightmare
Let’s talk about users and permissions. Everyone learns chmod 777 when they’re starting out. It’s the duct tape of server administration. But applying it recursively to a shared application directory because a deployment script failed? That’s how you cause a massive outage.
What happened Thursday night was a classic Access Control List (ACL) conflict. The deployment script was running as a service account, but a developer had manually created some directories earlier that day.
Here is a massive edge case that bites people constantly: the interaction between default ACLs and standard POSIX permissions. If you set a default ACL on a directory to ensure a specific group always gets access, and then someone runs a standard chmod 755 on a file inside it, the ACL mask gets recalculated. Suddenly your service account can’t write to the file anymore, even though the ACL says it should.
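You can watch this quirk happen in a throwaway directory. This is a minimal sketch, assuming the acl tools are installed and your filesystem supports ACLs; I'm using my own primary group instead of a real `developers` group:

```shell
# Demonstrate the mask recalculation quirk in a scratch directory.
dir=$(mktemp -d)
grp=$(id -gn)                       # stand-in for the "developers" group
setfacl -d -m "g:$grp:rwx" "$dir"   # default ACL on the directory
touch "$dir/app.conf"               # new file inherits the group ACL entry
chmod 755 "$dir/app.conf"           # an "innocent" chmod rewrites the ACL mask
getfacl "$dir/app.conf"             # group entry now shows: g:<grp>:rwx  #effective:r-x
rm -rf "$dir"
```

The group entry still *says* rwx, but the `#effective:` annotation shows what the recalculated mask actually allows. That annotation is the first thing to look for when an ACL "should" work and doesn't.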
I spent three hours fighting this specific quirk back in February on a different cluster. The fix isn't another round of chmod. You have to wipe the conflicting entries and reapply the default ACLs properly.

# Strip the conflicting extended ACL entries (setfacl -b leaves the
# standard owner/group/other bits alone)
setfacl -b /var/www/shared_app
# Reapply the default ACL so newly created files inherit group access
setfacl -d -m g:developers:rwx /var/www/shared_app
# And grant the group access to the directory itself
setfacl -m g:developers:rwx /var/www/shared_app
But the permissions mess was only half the problem. The application couldn’t write to disk, which caused it to start dumping gigabytes of error logs locally.
Filesystems Don’t Care About Your Feelings
Which brings us to why the database actually died. The disk was 100% full. Or at least, that’s what df -h claimed.
I ran du -sh * on the root directory. We were only using 40GB out of the 200GB volume. The math wasn’t mathing. The dev in Slack was completely lost at this point.
This is a fundamental concept of Linux filesystems (we were running ext4 on kernel 6.8.0-31-generic). When you delete a file that a running process is actively writing to, the file is only removed from the directory tree. You can't see it with ls. But the kernel doesn't free the inode or the disk blocks until the application actually closes the file descriptor.
Someone had noticed the error log getting huge and just ran rm error.log to free up space. The application kept running, kept writing to the deleted file, and filled the disk invisibly.
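You can reproduce the ghost-file behaviour safely in any shell session; nothing here is specific to our incident, it's just the kernel's unlink semantics:

```shell
# Hold a file open, delete it, and watch the kernel keep the inode alive.
tmp=$(mktemp)
exec 3>"$tmp"              # open fd 3 for writing in this shell
echo "still being written" >&3
rm "$tmp"                  # unlink: the name is gone, the blocks are not
readlink /proc/$$/fd/3     # prints the old path with " (deleted)" appended
cat /proc/$$/fd/3          # the data is still readable through /proc
exec 3>&-                  # closing the fd is what finally frees the blocks
```

That `/proc/<pid>/fd/<n>` path is also your escape hatch: if someone deletes a log you still need, you can copy it back out of /proc before restarting the process.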
Finding ghost files is easy once you know the trick.
# +L1 lists open files with a link count below one — i.e. unlinked
# but still held open; the grep just trims the output down
lsof +L1 | grep deleted
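If lsof isn't installed (minimal server images often skip it), you can get the same answer straight from /proc. This is a rough equivalent, not a polished tool, and without root it only sees your own processes:

```shell
# Walk every readable process's fd table looking for unlinked-but-open files.
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        *'(deleted)'*) printf '%s -> %s\n' "$fd" "$target" ;;
    esac
done
```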
Sure enough, a 160GB deleted log file was sitting there attached to PID 4492. I restarted the offending application service. The file descriptor closed, the kernel finally released the blocks, and disk usage dropped from 100% to 18% instantly. Our disk I/O wait times also plummeted from a terrifying 400ms down to 12ms.
Stop Staring at top

You can’t manage servers by just SSHing in and running top when things get slow. It lies to you about memory usage anyway, thanks to how the kernel handles the page cache.
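The number that actually matters is MemAvailable, which the kernel exports directly. "Free" memory on a healthy server is supposed to be low, because the page cache is doing its job:

```shell
# MemAvailable is the kernel's own estimate of how much memory can be
# reclaimed for new workloads without swapping (available since 3.14).
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
```

If MemAvailable is healthy, a tiny MemFree is a feature, not a leak.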
For immediate triage, I use htop (specifically 3.3.0 or later since they improved the I/O tab). But for actual enterprise monitoring, you need metrics you can query historically.
My current workflow relies heavily on Prometheus with node_exporter, but I don’t just use the default metrics. The real value comes from the textfile_collector. I write custom bash scripts that check our specific application states—like whether the shared NFS mounts are actually writable, not just mounted—and dump those results to a text file that Prometheus scrapes.
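Here's the shape of one of those checks, stripped down. The metric name, mount point, and output path are examples, not our production values — the textfile directory in particular depends on what you pass to node_exporter's --collector.textfile.directory flag:

```shell
#!/bin/bash
# Sketch of a textfile_collector check: is the mount actually writable,
# not just present in the mount table?
MOUNT="${1:-/mnt/shared}"
OUT="${2:-/var/lib/node_exporter/textfile_collector/nfs_writable.prom}"

# Probe with a real write, since a stale NFS handle can pass every
# read-only check and still reject writes.
probe="$MOUNT/.write_probe.$$"
if touch "$probe" 2>/dev/null && rm -f "$probe"; then
    writable=1
else
    writable=0
fi

# Write to a temp file in the same directory and rename, so node_exporter
# never scrapes a half-written metrics file.
tmp=$(mktemp "${OUT}.XXXXXX")
printf 'nfs_mount_writable{mount="%s"} %d\n' "$MOUNT" "$writable" > "$tmp"
mv "$tmp" "$OUT"
```

Run it from cron or a systemd timer; alert on the metric being 0, and separately on its staleness so a dead cron job doesn't read as "everything is writable".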
If you want to see exactly what’s dragging your disk down in real-time, skip the basic tools and use eBPF. I keep a few bpftrace one-liners handy for exactly this reason. If you suspect an application is doing weird things with the filesystem, trace the open calls directly.
# Trace which processes are opening files and what they are opening
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
The Systemd Isolation Trap
While I was fixing the application service, I noticed another massive flaw in how it was deployed. The systemd unit file was completely barebones. Just an ExecStart and a User directive.
Modern Linux administration requires isolating your services. If that application had been compromised, it had read access to almost the entire filesystem. I spent ten minutes rewriting the unit file before bringing it back up.
[Service]
ExecStart=/usr/local/bin/app-binary
User=appuser
Group=appgroup
Restart=on-failure
# The stuff that actually matters
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/var/log/app /var/www/shared_app
By adding ProtectSystem=strict, the entire filesystem becomes read-only to that service, apart from the explicit paths defined in ReadWritePaths (and the API filesystems like /dev and /proc). If I had set this up originally, the rogue deployment script wouldn't have been able to trash the permissions in the first place, and the application wouldn't have been able to fill the root disk with error logs. It would have just crashed cleanly.
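There's more headroom here if you want it. These are all real systemd directives, but I haven't run this particular service under them — treat them as candidates to test, not a drop-in guarantee:

```ini
[Service]
# Further hardening worth evaluating for a service like this
NoNewPrivileges=yes
ProtectKernelTunables=yes
ProtectControlGroups=yes
RestrictSUIDSGID=yes
# Empty value drops every capability from the bounding set
CapabilityBoundingSet=
```

Run systemd-analyze security against the unit afterwards; it scores the exposure and tells you exactly which directives are still missing.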
I committed the systemd changes to our config repo, watched the database replication catch up, and verified the I/O metrics were stable.
I closed my laptop and went back to sleep. The post-mortem could wait until morning.