This post is what I wish I had read before I recently replaced SSH access to the bastion in my Terraform project with SSM & EC2 Instance Connect.

Motivation

I'm a bit paranoid about security. Not as paranoid as Amelia, who's a cybersecurity consultant, but paranoid nonetheless.
I didn't always care about security, though; it took getting burnt.
Seven years ago, while working for an early-stage startup, I woke up one day to an email from DigitalOcean saying a compromised server had been shut down by their team.
An attacker had found an insecure open port, gained access to the server, and enrolled it in a botnet by connecting it to a C&C server.
I was new to managing servers then; DigitalOcean didn't have their firewall feature yet, and I didn't even know about firewalls anyway.
Determined to protect our servers from future attacks, I spent the next few days learning about security, iptables, firewalls, and general server hardening.

I've come a long way since that self-taught server-hardening crash course, but I still have a strong drive for security, and that's what motivated this exercise.

My use case

I had a Terraform configuration with:

  • An ECS cluster in a private subnet behind an ALB.
  • A Postgres RDS instance in a private subnet.
  • A bastion in a public subnet.

With this configuration, I would connect to the database and to EC2 instances using SSH tunneling via the bastion.

The networking module

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.69.0"

  name = "staff-advocacy-vpc"
  cidr = var.vpc_cidr
  azs  = var.availability_zones

  # ALB & Bastion
  public_subnets = var.public_subnets_cidr

  # ECS CLUSTER
  private_subnets = var.private_subnets_cidr

  # RDS CLUSTER
  database_subnets = var.database_subnets_cidr

  # DNS
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Terraform   = "true"
    Environment = var.environment
  }
}

The database module

locals {
  create_test_resources = true
}

module "db" {
  source  = "terraform-aws-modules/rds/aws"
  version = "2.20.0"

  vpc_security_group_ids = [module.postgres_security_group.this_security_group_id]
  create_db_subnet_group = false
  db_subnet_group_name   = local.create_test_resources ? var.subnet_group_name : ""

  username       = var.database_username
  password       = var.database_password
  port           = var.database_port
  identifier     = var.identifier
  name           = var.database_name
  engine         = "postgres"
  engine_version = var.database_engine_version

  create_db_option_group    = false
  create_db_parameter_group = false

  allocated_storage  = var.db_storage
  instance_class     = var.db_instance_class
  maintenance_window = var.db_maintenance_window
  backup_window      = var.db_backup_window

  tags = {
    Terraform   = "true"
    Environment = var.environment
  }
}

module "postgres_security_group" {
  source  = "terraform-aws-modules/security-group/aws//modules/postgresql"
  version = "~> 3.0"

  name   = "${var.identifier}-sg"
  vpc_id = var.vpc_id
  
  # using computed_* here to get around count issues.
  ingress_cidr_blocks = var.vpc_cidr_block
  computed_ingress_cidr_blocks = var.vpc_cidr_block
  number_of_computed_ingress_cidr_blocks = 1

  ingress_rules       = ["postgresql-tcp"]

  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["http-80-tcp", "https-443-tcp"]

  tags = {
    Terraform   = "true"
    Name        = "${var.environment}-rds-sg"
    Environment = var.environment
  }
}

The bastion module

module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.16.0"

  ami                         = "ami-0e2e14798f7a300a1"
  name                        = var.name
  associate_public_ip_address = true
  instance_type               = "t2.small"
  vpc_security_group_ids      = [module.bastion_security_group.this_security_group_id]
  subnet_ids                  = var.vpc_public_subnets
  key_name                    = var.bastion_key_name
}

module "bastion_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "${var.name}-sg"
  vpc_id = var.vpc_id

  ingress_cidr_blocks = ["0.0.0.0/0"]
  ingress_rules       = ["ssh-tcp"]

  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["postgresql-tcp", "http-80-tcp", "https-443-tcp"]
}

Fargate cluster

The full cluster module includes a lot more than the cluster and ALB resource definitions, but the rest is irrelevant here.

module "ecs_cluster" {
  source = "terraform-aws-modules/ecs/aws"

  name               = "${var.name}-${var.environment}"
  container_insights = true

  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy = [{
    capacity_provider = "FARGATE"
    weight            = "1"
  }]

  tags = {
    Environment = var.environment
  }
}

resource "aws_lb" "main" {
  name               = "${var.name}-alb-${var.environment}"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.vpc_public_subnets

  enable_deletion_protection = false

  tags = {
    Name        = "${var.name}-alb-${var.environment}"
    Environment = var.environment
  }
}

resource "aws_lb" "main" {
  name               = "${var.name}-alb-${var.environment}"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.vpc_public_subnets

  enable_deletion_protection = false

  tags = {
    Name        = "${var.name}-alb-${var.environment}"
    Environment = var.environment
  }
}

resource "aws_alb_target_group" "main" {
  name        = "${var.name}-tg-${var.environment}"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    healthy_threshold   = "3"
    interval            = "30"
    protocol            = "HTTP"
    matcher             = "200"
    timeout             = "3"
    path                = var.health_check_path
    unhealthy_threshold = "2"
  }

  tags = {
    Name        = "${var.name}-tg-${var.environment}"
    Environment = var.environment
  }
}

# Redirect to https listener
resource "aws_alb_listener" "http" {
  load_balancer_arn = aws_lb.main.id
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = 443
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# Redirect traffic to target group
resource "aws_alb_listener" "https" {
  load_balancer_arn = aws_lb.main.id
  port              = 443
  protocol          = "HTTPS"

  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = var.alb_tls_cert_arn

  default_action {
    target_group_arn = aws_alb_target_group.main.id
    type             = "forward"
  }
}

To connect to my database, I use local SSH port forwarding:

# Run in the foreground
ssh -N ubuntu@<bastion_ip> -L 8888:rds-db.weoweio.ap-southeast-2.rds.amazonaws.com:5432

# Run in the background
ssh -Nf ubuntu@<bastion_ip> -L 9999:rds-db.weoweio.ap-southeast-2.rds.amazonaws.com:5432
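With the tunnel up, the database is reachable on localhost. For example, a quick sanity check with psql through the foreground tunnel above (the username and database name placeholders are illustrative):

# Connect to the RDS instance through the forwarded local port
psql -h localhost -p 8888 -U <database_username> -d <database_name>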

This setup has always worked, and with fail2ban, GuardDuty, and IP allowlisting via security groups, it has served me well.

Challenges with this configuration

  • I have to keep OpenSSH updated to mitigate emerging attacks that target vulnerabilities in outdated versions.
  • I have to manage SSH keys, and SSH key management is hard, especially with large teams.
  • Auditing SSH sessions is painful.
  • Keeping a public IP/DNS on the bastion increases my attack surface.

The solution

EC2 Instance Connect

When I read about EC2 Instance Connect in 2019 as a way to connect to an EC2 instance with temporary SSH keys, I was excited.

EC2 Instance Connect lets me manage SSH access via IAM.

It works by letting a user push a temporary public key to the EC2 instance using a CLI; the user then has 60 seconds to authenticate with the matching private key.

So instead of long-term SSH keys that live on the bastion, I can use temporary, disposable keys that are automagically discarded after 60 seconds.
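Because access is now governed by IAM, you can also restrict who may push a key, to which instance, and as which OS user. A minimal sketch of such a policy (the data source name, var.region, var.account_id, and the ubuntu user are illustrative assumptions):

# Illustrative: allow pushing temporary keys only to the bastion instance, only as the "ubuntu" OS user
data "aws_iam_policy_document" "bastion_connect_users" {
  statement {
    actions   = ["ec2-instance-connect:SendSSHPublicKey"]
    resources = ["arn:aws:ec2:${var.region}:${var.account_id}:instance/${module.bastion.id[0]}"]

    condition {
      test     = "StringEquals"
      variable = "ec2:osuser"
      values   = ["ubuntu"]
    }
  }
}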

SSM

Using EC2 Instance Connect eliminates all of the problems listed above except the last one.

Using the AWS Systems Manager AWS-StartSSHSession document, I can take security a notch higher by eliminating the public IP and DNS on the bastion entirely.
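Under the hood, this is plain SSH run through a ProxyCommand that opens a Session Manager session instead of a TCP connection to a public address. The snippet below is the standard host entry from the Session Manager documentation; dropped into ~/.ssh/config, it routes any ssh to an instance ID through SSM:

# ~/.ssh/config
Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"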

How to do this in Terraform

To do this in Terraform, I have to:

  • Configure IAM permissions for the bastion: add an IAM instance profile that includes two managed policies.
    • arn:aws:iam::aws:policy/EC2InstanceConnect
    • arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  • Install EC2 Instance Connect on the bastion or update my AMI. EC2 Instance Connect comes pre-installed on certain AMIs (Amazon Linux 2 2.0.20190618 or later and Ubuntu 20.04 or later).
  • Install the EC2 Instance Connect CLI in my local environment.
  • Install the Session Manager plugin for the AWS CLI in my local environment (install commands for both are shown after this list).
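The local installs are straightforward; the pip package is the published ec2instanceconnectcli, and the Session Manager plugin command below assumes macOS with Homebrew (other platforms use the installers from the AWS docs):

# EC2 Instance Connect CLI (also provides the mssh wrapper)
pip install ec2instanceconnectcli

# Session Manager plugin for the AWS CLI (macOS, via Homebrew)
brew install --cask session-manager-plugin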

And some cleanup steps:

  • Remove the SSH key pair on the bastion.
  • Remove the SSH ingress rules from my bastion security group.
  • Move the bastion from the public subnet to a private subnet.

The changes to the bastion module


module "bastion" {
  source  = "terraform-aws-modules/ec2-instance/aws"
  version = "2.16.0"

  ami                 = "ami-0e2e14798f7a300a1"
  name                        = var.name
  associate_public_ip_address = true
  instance_type               = "t2.small"
  vpc_security_group_ids      = [module.bastion_security_group.this_security_group_id]
  
  # Move the bastion into a private subnet
  subnet_ids                  = var.vpc_private_subnets
  
  # remove the redundant ssh key pair
  key_name                    = var.bastion_key_name
}

module "bastion_security_group" {
  source  = "terraform-aws-modules/security-group/aws"
  version = "3.1.0"

  name   = "${var.name}-sg"
  vpc_id = var.vpc_id

  # remove redundant ssh ingress rule
  ingress_cidr_blocks = ["0.0.0.0/0"]
  ingress_rules       = ["ssh-tcp"]

  egress_cidr_blocks = ["0.0.0.0/0"]
  egress_rules       = ["postgresql-tcp", "http-80-tcp", "https-443-tcp"]
}


module "ec2_connect_role_policy" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
  version = "~> 3.7.0"

  role_name               = "${var.name}-ec2-connect-role"
  role_requires_mfa       = false
  create_role             = true
  create_instance_profile = true

  trusted_role_services   = ["ec2.amazonaws.com"]
  custom_role_policy_arns = [
    "arn:aws:iam::aws:policy/EC2InstanceConnect",
    "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
  ]
}

SSH Tunneling

To set up the SSH tunnel, I:

  • Generate a temporary SSH key.
  • Use ec2-instance-connect to upload the key to the bastion.
  • SSH to the server over Session Manager using SSH ProxyCommand and the Session Manager plugin for the AWS CLI.

# Generate a temporary ssh key. 
ssh-keygen -t rsa -f /ssh_key -N ''

# Use ec2-instance-connect to upload the key to the bastion
aws ec2-instance-connect send-ssh-public-key --instance-id <instance_id> --instance-os-user <os_user> --availability-zone <az> --ssh-public-key file:///ssh_key.pub

# SSH to the server over Session Manager using the AWS CLI Session Manager plugin
ssh <os_user>@<instance_id> -i /ssh_key -Nf \
  -L 9999:<rds_endpoint>:5432 \
  -o "StrictHostKeyChecking=no" \
  -o "UserKnownHostsFile=/dev/null" \
  -o ProxyCommand="aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters portNumber=%p --region <region>"

My primary concern here is RDS access. I haven't needed to SSH into a Fargate-managed container in the ECS cluster, but I imagine it would work much the same way.