Fixing SSH and NSG Headaches on Azure Linux 3.0

I was staring at a “Connection refused” terminal prompt at 11pm last Tuesday. My Standard_D2s_v5 instance was running. The Azure portal showed green checkmarks everywhere. But my SSH key just wouldn’t authenticate against the new Azure Linux VM.

I spent two hours trying to figure out what went wrong before I realized I'd missed one tiny detail in the deployment script, and that detail failed silently. Exactly the kind of thing that makes you want to throw your laptop out the window. In fairness, the fix was straightforward once I stopped trusting the CLI output blindly.

Microsoft built Azure Linux (you probably remember it as CBL-Mariner) to be incredibly lightweight. I use it heavily for AKS node pools, but lately, I’ve been spinning up standalone instances for utility servers. It strips out all the bloat. The footprint is tiny. But because it’s so locked down by default, getting your initial access right is a massive headache if you rely on old Ubuntu muscle memory.

The SSH Key Injection Trap

When you deploy a standard Linux VM on Azure, the platform is pretty forgiving with how you pass the public key. Azure Linux 3.0? Not so much. If you’re using Azure CLI 2.68.0 like I am, you might try passing the key directly in the VM create command while also passing a custom data script.

Here is the catch. If you use a cloud-init file to do some post-provisioning setup and ALSO pass the --ssh-key-values flag in the CLI, you can hit a nasty race condition. I tested this on five different deployments. Three of them locked me out completely. The cloud-init process was silently overwriting the authorized_keys file depending on the exact millisecond the extension triggered.
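If you genuinely need cloud-init for post-provisioning work, the safer pattern is to keep key material out of the payload entirely and let the platform own authorized_keys. A minimal custom-data file along these lines (the package choice is just an illustration) has no users or ssh_authorized_keys section for cloud-init to fight over:

```yaml
#cloud-config
# Post-provisioning steps only. Deliberately no users: or
# ssh_authorized_keys: section, so cloud-init never rewrites the
# authorized_keys file the platform injects.
package_update: true
packages:
  - jq
runcmd:
  - systemctl enable --now sshd
```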


The reliable way to handle this is to separate the concerns. If you are scripting this out in bash, let the create command own the key injection and keep key material out of your custom data payload entirely.

# The safe way to inject the key without cloud-init conflicts
az vm create \
  --resource-group rg-utility-prod \
  --name util-linux-01 \
  --image MicrosoftCBLMariner:cbl-mariner:cbl-mariner-2-gen2:latest \
  --admin-username azureuser \
  --assign-identity \
  --generate-ssh-keys \
  --public-ip-sku Standard

But honestly, you shouldn’t be using the CLI for the final state anyway. I moved all of this to Bicep. When you define the SSH keys in Bicep for Azure Linux, you have to structure the linuxConfiguration block exactly like this, or the deployment engine ignores it entirely without throwing a validation error.

osProfile: {
  computerName: vmName
  adminUsername: adminUsername
  linuxConfiguration: {
    disablePasswordAuthentication: true
    ssh: {
      publicKeys: [
        {
          path: '/home/${adminUsername}/.ssh/authorized_keys'
          keyData: sshPublicKey
        }
      ]
    }
  }
}
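For reference, the identifiers that fragment leans on are just ordinary parameters. Declared at the top of the same Bicep file, they would look something like this (names and the default value are my own convention, not a required shape):

```bicep
// Hypothetical parameter declarations matching the osProfile fragment above.
param vmName string
param adminUsername string = 'azureuser'
param sshPublicKey string
```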

Notice the path. If you try to get clever and inject it into a shared root directory, Azure Linux will reject the connection. It enforces strict ownership on that .ssh directory right from boot.
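If you ever need to sanity-check a home directory by hand, the expected layout is a 700 .ssh directory and a 600 authorized_keys file owned by the login user. A small helper along these lines makes that scriptable; the function name is mine, and stat -c assumes GNU coreutils, which Azure Linux ships:

```shell
#!/usr/bin/env bash
# check_ssh_perms: return 0 if an .ssh directory has the permissions
# Azure Linux enforces (700 on the directory, 600 on authorized_keys).
check_ssh_perms() {
  local ssh_dir="$1"
  [ "$(stat -c '%a' "$ssh_dir")" = "700" ] || return 1
  [ "$(stat -c '%a' "$ssh_dir/authorized_keys")" = "600" ] || return 1
}
```

Run it over SSH once you do get in, or bake it into a provisioning health check.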

The Network Security Group Reality Check

So you get the key injected properly. Now you need to open port 22 so you can actually reach the box. You attach a Network Security Group (NSG) and add an inbound rule.

Here is the weird part that no one talks about. When you run the command to update the NSG rule, the CLI returns a success message really fast. Usually around 12 to 14 seconds.


You immediately try to SSH. ssh: connect to host 20.x.x.x port 22: Connection timed out.

And you panic. You think the key injection failed again, assume the VM is bricked, and tear the whole thing down to start over. But don’t do that. I wasted an entire afternoon on this exact loop.

I benchmarked this on our staging cluster last week. The Azure CLI reports success the moment the control plane accepts the request. Actually propagating that rule down to the virtual NIC? That averaged 3 minutes and 18 seconds across my runs. In effect, the API is lying to you about the state of the network data plane. You just have to wait.

And when you do open that port, stop leaving it exposed to the entire internet. I still see people using --source-address-prefixes '*' in production environments. Azure Linux is secure by default, but it can’t protect you from your own bad network rules. Lock it down to your specific office IP or your VPN gateway.

az network nsg rule create \
  --resource-group rg-utility-prod \
  --nsg-name nsg-utility-01 \
  --name Allow-SSH-Strict \
  --protocol Tcp \
  --direction Inbound \
  --priority 100 \
  --source-address-prefixes 198.51.100.45 \
  --source-port-ranges '*' \
  --destination-address-prefixes '*' \
  --destination-port-ranges 22 \
  --access Allow

A Better Workflow

In my own CI/CD pipelines, I completely stopped trying to SSH immediately after the Terraform or Bicep apply finishes. Instead, I hook a simple retry loop into the deployment script that attempts a raw socket connection to port 22 using netcat before it even attempts the SSH handshake.

It polls every 10 seconds. And once the NSG actually propagates and the socket opens, then I initiate the real SSH connection to run my Ansible playbooks. That single change dropped our pipeline failure rate from about 40% down to zero.
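The loop itself is nothing fancy. Here is a sketch of what mine looks like; the function name and the default 10-second interval are my choices, and nc -z simply tests whether the socket opens without sending data:

```shell
#!/usr/bin/env bash
# wait_for_port HOST PORT [RETRIES] [DELAY]: poll until a TCP socket opens.
# Returns 0 once the port accepts a connection, 1 if it never does.
wait_for_port() {
  local host="$1" port="$2" retries="${3:-30}" delay="${4:-10}"
  local i
  for ((i = 1; i <= retries; i++)); do
    if nc -z -w 2 "$host" "$port" 2>/dev/null; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}
```

Once wait_for_port comes back clean, the real SSH connection (and the Ansible run behind it) almost never fails.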

I expect Microsoft will eventually tighten up the CLI feedback loop so “Succeeded” actually means the network path is open. But until then, build a sleep timer or a polling loop into your deployment scripts. Save yourself the late-night troubleshooting.
