Terraform Best Practices for Enterprise Cloud Infrastructure

After years of working with Terraform across multiple enterprise environments, I've learned that while Infrastructure as Code (IaC) is powerful, it requires discipline and best practices to maintain at scale. Here are the key practices that have served me well in production environments.

1. Structure Your Code for Scale

Directory Organization

The foundation of maintainable Terraform code starts with proper organization:

terraform/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   ├── compute/
│   └── storage/
├── shared/
│   ├── variables.tf
│   └── outputs.tf
└── policies/
    └── security/

This structure separates concerns and makes it easy for team members to understand the codebase at a glance.

Use Modules Religiously

Every reusable piece of infrastructure should be a module. This isn't just about code reuse—it's about creating reliable, tested building blocks:

module "azure_vm" {
  # Path is relative to the calling configuration, e.g. environments/dev/
  source = "../../modules/compute/virtual-machine"

  vm_name           = var.vm_name
  resource_group    = var.resource_group
  subnet_id         = var.subnet_id
  vm_size           = var.vm_size
  admin_username    = var.admin_username

  tags = local.common_tags
}
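
The module call above passes local.common_tags without showing where it comes from. A minimal sketch of how such a local might be declared follows; the variable names environment and owner and the tag keys are assumptions for illustration, not part of the original configuration.

```hcl
# Hypothetical inputs used only to build the shared tag map
variable "environment" {
  description = "Deployment environment, e.g. dev, staging, or prod."
  type        = string
}

variable "owner" {
  description = "Team or person responsible for these resources."
  type        = string
}

locals {
  # Tags applied to every resource; passed to modules as local.common_tags
  common_tags = {
    environment = var.environment
    owner       = var.owner
    managed_by  = "terraform"
  }
}
```

Defining the tag map once and passing it into every module keeps tagging consistent, which also helps with cost reporting later.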

2. State Management Strategy

Remote State is Non-Negotiable

Never store Terraform state locally in production. Use a remote backend that supports state locking; the azurerm backend locks state through Azure Blob Storage leases:

terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "terraformstateaccount"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}

Environment Isolation

Each environment should have its own state file. This prevents accidental changes from affecting the wrong environment and allows for independent evolution.
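
One way to get that isolation, sketched here on the assumption that each directory under environments/ is its own root configuration, is to give every environment a distinct state key (or a separate storage account altogether). The file name and the dev key below are illustrative.

```hcl
# environments/dev/backend.tf (hypothetical file name)
terraform {
  backend "azurerm" {
    resource_group_name  = "terraform-state-rg"
    storage_account_name = "terraformstateaccount"
    container_name       = "tfstate"
    key                  = "dev.terraform.tfstate" # unique per environment
  }
}
```

Backend blocks cannot reference variables, so the key is either written literally in each environment or left out of the block and supplied at init time, for example with terraform init -backend-config="key=dev.terraform.tfstate".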

3. Security First

Sensitive Data Handling

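Never hardcode secrets in your Terraform files. Read them from Azure Key Vault or a similar service instead. Keep in mind that values read this way are still written to the state file, which is one more reason to restrict access to the remote backend: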

data "azurerm_key_vault_secret" "db_password" {
  name         = "database-password"
  key_vault_id = var.key_vault_id
}

# azurerm_mssql_server supersedes the deprecated azurerm_sql_server resource
resource "azurerm_mssql_server" "main" {
  name                         = var.sql_server_name
  resource_group_name          = var.resource_group_name
  location                     = var.location
  version                      = "12.0"
  administrator_login          = var.admin_username
  administrator_login_password = data.azurerm_key_vault_secret.db_password.value
}
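
When a secret does have to pass through a variable or an output, marking it sensitive keeps it out of plan and apply output. The sketch below uses a hypothetical variable named app_api_key; the sensitive flag only redacts CLI output and does not encrypt the value in state.

```hcl
variable "app_api_key" {
  description = "API key supplied at runtime, e.g. via the TF_VAR_app_api_key environment variable."
  type        = string
  sensitive   = true
}

output "app_api_key" {
  description = "Re-exported for a calling module; shown as (sensitive value) in CLI output."
  value       = var.app_api_key
  sensitive   = true # required because the value derives from a sensitive variable
}
```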

Least Privilege Access

Configure your Terraform service principal with minimal required permissions. Use custom roles when Azure built-in roles are too broad.
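
As a rough sketch of what a narrower custom role could look like, the example below defines a role limited to virtual network management. The role name and the chosen actions are assumptions; the right action list depends on what your Terraform configuration actually manages.

```hcl
data "azurerm_subscription" "current" {}

resource "azurerm_role_definition" "terraform_network_deployer" {
  name        = "terraform-network-deployer" # hypothetical role name
  scope       = data.azurerm_subscription.current.id
  description = "Lets the Terraform service principal manage virtual networks only."

  permissions {
    # Illustrative action list; trim or extend to match the resources you deploy
    actions = [
      "Microsoft.Network/virtualNetworks/*",
      "Microsoft.Resources/subscriptions/resourceGroups/read",
    ]
    not_actions = []
  }

  assignable_scopes = [
    data.azurerm_subscription.current.id,
  ]
}
```

Assigning a role like this at resource group scope rather than subscription scope narrows the blast radius further.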

4. Version Control and CI/CD

Terraform Versioning

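Pin your Terraform and provider versions to a bounded range, and commit the .terraform.lock.hcl file so every run selects the same provider versions:

terraform {
  # An open-ended ">= 1.0" would also accept a future major release
  required_version = ">= 1.0, < 2.0"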
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

Automated Testing

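Implement automated testing with tools like Terratest. Note that a test like this provisions real Azure resources, so run it against a dedicated test subscription:

package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestTerraformAzureExample(t *testing.T) {
    // Retry automatically on known transient errors
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/azure-vm",
        Vars: map[string]interface{}{
            "resource_group_name": "test-rg",
            "vm_name":             "test-vm",
        },
    })

    // Deferred so the resources are destroyed even if apply fails
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
}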

5. Documentation and Standards

Self-Documenting Code

Use meaningful variable names and descriptions:

variable "vm_size" {
  description = "Size of the virtual machine. Must be one of the approved sizes in the validation rule."
  type        = string
  default     = "Standard_B2s"

  validation {
    condition = contains([
      "Standard_B1s", "Standard_B2s", "Standard_D2s_v3"
    ], var.vm_size)
    error_message = "The vm_size must be one of: Standard_B1s, Standard_B2s, Standard_D2s_v3."
  }
}
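
The same pattern works for free-form strings, where an allow-list is not practical. The sketch below validates a name against a regular expression; the naming rule itself (lowercase letters, digits, and hyphens, 3 to 15 characters) is an assumed convention, not an Azure requirement taken from this article.

```hcl
variable "vm_name" {
  description = "Name of the virtual machine: lowercase letters, digits, and hyphens, 3 to 15 characters."
  type        = string

  validation {
    # can() turns a regex mismatch error into false instead of aborting evaluation
    condition     = can(regex("^[a-z0-9-]{3,15}$", var.vm_name))
    error_message = "The vm_name must be 3 to 15 characters of lowercase letters, digits, or hyphens."
  }
}
```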

README Templates

Every module should have a comprehensive README with:
- Purpose and use cases
- Input variables
- Output values
- Usage examples
- Requirements and dependencies

6. Performance and Cost Optimization

Use Data Sources Wisely

Reference infrastructure that already exists through data sources rather than recreating it:

data "azurerm_subnet" "existing" {
  name                 = "existing-subnet"
  virtual_network_name = "existing-vnet"
  resource_group_name  = "existing-rg"
}
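
To show how the lookup is consumed, here is a minimal sketch that attaches a new network interface to the subnet found by the data block above. The resource name app-nic and the two variables are assumptions for illustration.

```hcl
variable "location" {
  description = "Azure region for the network interface."
  type        = string
}

variable "resource_group_name" {
  description = "Resource group that will hold the network interface."
  type        = string
}

resource "azurerm_network_interface" "app" {
  name                = "app-nic" # hypothetical name
  location            = var.location
  resource_group_name = var.resource_group_name

  ip_configuration {
    name                          = "internal"
    # ID comes from the existing subnet looked up by the data source above
    subnet_id                     = data.azurerm_subnet.existing.id
    private_ip_address_allocation = "Dynamic"
  }
}
```

Because the subnet is only read, Terraform never tries to modify or destroy it, and ownership stays with whichever team manages the network.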

Implement Cost Controls

Use Azure Policy and Terraform to enforce cost controls:

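data "azurerm_subscription" "current" {}

# The generic azurerm_policy_assignment is not available in azurerm 3.x;
# use the scope-specific resource instead
resource "azurerm_subscription_policy_assignment" "vm_size_restriction" {
  name                 = "restrict-vm-sizes"
  subscription_id      = data.azurerm_subscription.current.id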
  policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/cccc23c7-8427-4f53-ad12-b6a63eb452b3"

  parameters = jsonencode({
    listOfAllowedSKUs = {
      value = ["Standard_B1s", "Standard_B2s", "Standard_D2s_v3"]
    }
  })
}

Common Pitfalls to Avoid

  1. Circular Dependencies: Design your modules to avoid circular references
  2. Over-Engineering: Start simple and add complexity only when needed
  3. Ignoring State: Always run terraform plan and review the proposed changes before applying, especially in production
  4. Provider Confusion: Be explicit about which provider version you're using

Conclusion

Terraform is incredibly powerful, but with great power comes great responsibility. These practices have helped me manage infrastructure across multiple Azure subscriptions and environments while maintaining security, reliability, and team productivity.

The key is to start with good practices from day one—it's much harder to refactor poor Terraform code than it is to write it correctly from the beginning.


What are your favorite Terraform best practices? Have you encountered any of these challenges in your infrastructure projects? Let me know in the comments or reach out on LinkedIn.