Configuring Amazon CloudWatch Agent to Consume Prometheus Endpoint Metrics

Jesse Aranki

February 20, 2024

Leveraging CloudWatch for Prometheus Metrics

Co-authored by Aashish Jolly

HashiCorp Terraform Enterprise (TFE) is a self-hosted distribution of Terraform Cloud. It offers enterprises a private instance of the Terraform Cloud application with no resource limits and additional enterprise-grade architectural features like audit logging and SAML single sign-on.

The Terraform Enterprise metrics service collects several runtime metrics. You can use this data to observe your installation in real time, and you can monitor and alert on these metrics to detect anomalous incidents, performance degradation, and utilization trends. Terraform Enterprise aggregates these metrics on a 5-second interval and keeps them in memory for 15 seconds.

Monitoring is not without its challenges if you’re running in EC2. This blog post covers how to set up the Amazon CloudWatch agent to consume TFE metrics from the exposed Prometheus endpoint and forward them to CloudWatch, without the need to run a Prometheus server.

In this blog post, I will work with TFE FDO (Flexible Deployment Options), as this is now the recommended way to deploy and run TFE.

Enabling metrics on TFE

Before we begin, let’s ensure the environment is set up correctly. If you haven’t enabled the metrics endpoint, you can do so by adding the following lines to the environment and ports sections of your Docker Compose file.

environment: # append these to your existing environment section
  TFE_METRICS_ENABLE: true
  TFE_METRICS_HTTP_PORT: 9090
  TFE_METRICS_HTTPS_PORT: 9091

ports: # append these to your existing ports section
  - "9090:9090"
  - "9091:9091"

For a full list of available FDO options, please take a look at the HashiCorp TFE documentation here. TFE_METRICS_HTTP_PORT and TFE_METRICS_HTTPS_PORT are not strictly required, as 9090 and 9091 are the default values, but it’s worth knowing them when looking at the solution as a whole.

Now that’s done, recreate the container so the new settings take effect:

docker compose up -d

Validate that it works by curling the endpoint on the TFE server:

curl "http://localhost:9090/metrics?format=prometheus"
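If you also want to confirm the HTTPS listener on port 9091, a quick check might look like the sketch below; the -k flag assumes the endpoint presents a certificate your host doesn’t trust (for example, a self-signed one), and that the same path is served over TLS.

# HTTPS metrics listener check; -k skips certificate verification
# (assumes a self-signed or otherwise untrusted certificate on port 9091)
curl -k "https://localhost:9091/metrics?format=prometheus"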

Setting up Amazon CloudWatch agent

Dependencies

The IAM role on your instance profile needs the following permissions to upload logs and metrics to Amazon CloudWatch (a policy sketch follows the list):

- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents
- cloudwatch:PutMetricData
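A minimal identity policy granting these actions might look like the sketch below; the broad "Resource": "*" scoping is an assumption for illustration, so tighten it to specific log groups or add conditions where your security posture requires it.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudWatchAgentUploadsExampleOnly",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}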

Now that the endpoint is exposed and we have confirmed we can hit it from the host, we need to configure the scrape job and the CloudWatch agent:

Create a scrape job file

  1. In the CloudWatch agent config directory, create a prometheus directory containing a prometheus.yml file.

cd /opt/aws/amazon-cloudwatch-agent/etc/ && mkdir prometheus
cd prometheus && vi prometheus.yml

  2. Copy the code snippet below into the file:

global:
  scrape_interval: 10s
  scrape_timeout: 10s
scrape_configs:
  - job_name: tfe
    sample_limit: 10000
    static_configs:
      - targets:
        - "localhost:9090"
    scheme: http
    metrics_path: '/metrics'
    params:
      format: ['prometheus']

  • The params section effectively creates the URL http://localhost:9090/metrics?format=prometheus. If you leave the format parameter off the query, TFE outputs the metrics in JSON format instead, which the agent can’t parse properly; the quick check after this list shows the difference.
  • job_name is important to note: even though we will only be using a single job for our needs, the job label is referenced later in the CloudWatch agent config, where we can use it to declare as many metrics as we want.
  • scheme is HTTP for simplicity, and because all of the traffic stays within the instance until the CloudWatch agent sends it to AWS, which happens over an encrypted connection.
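To see what the format parameter changes in practice, you can compare the two responses from the host; this is just a quick sanity check against the endpoint we validated earlier.

# With the format parameter: Prometheus exposition format, which the agent expects
curl -s "http://localhost:9090/metrics?format=prometheus" | head

# Without it: TFE returns the metrics as JSON, which this pipeline can't use
curl -s "http://localhost:9090/metrics" | head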

Update CloudWatch agent config file

Navigate to your CloudWatch agent config file, the default location being:

  • Linux: /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
  • Windows: C:\Program Files\Amazon\AmazonCloudWatchAgent\amazon-cloudwatch-agent.json

Add the following code to the config file. If you already have a logs stanza, merge the prometheus block into your existing metrics_collected section:

"logs": {
  "metrics_collected": {
    "prometheus": {
      "log_group_name": "tfe",
      "prometheus_config_path": "/opt/aws/amazon-cloudwatch-agent/etc/prometheus/prometheus.yml",
      "emf_processor": {
        "metric_declaration_dedup": true,
        "metric_namespace": "TFE_Metrics",
        "metric_unit": {
          "tfe_run_count": "Gauge",
          "tfe_container_cpu_usage_user_ns": "Counter",
          "tfe_container_cpu_usage_kernel_ns": "Counter"
        },
        "metric_declaration": [
          {
            "source_labels": [
              "job"
            ],
            "label_matcher": "tfe",
            "dimensions": [
              [
                "run_type",
                "host"
              ]
            ],
            "metric_selectors": [
              "^tfe_run_count$"
            ]
          },
          {
            "source_labels": [
              "job"
            ],
            "label_matcher": "^tfe$",
            "dimensions": [
              [
                "name",
                "host"
              ]
            ],
            "metric_selectors": [
              "^tfe_container_cpu_usage_user_ns$",
              "^tfe_container_cpu_usage_kernel_ns$"
            ]
          }
        ]
      }
    }
  }
}

Let’s take a closer look at the config file above; it will do the following:

  1. Create the metric namespace TFE_Metrics and upload the metrics to that namespace
  2. Upload three metrics:
  • tfe_run_count – number of running containers used for Terraform operations (runs and plans)
    – the uploaded metric uses the host and run_type dimensions to show separate counts for plan and apply
  • tfe_container_cpu_usage_user_ns – running count, in nanoseconds, of the total amount of time processes in the container have spent in userspace
  • tfe_container_cpu_usage_kernel_ns – running count, in nanoseconds, of the total amount of time processes in the container have spent in kernel space
    – the uploaded metrics use the host and name dimensions
    – both the kernel and user metrics are uploaded to the same dimension view within the namespace
  3. Upload the metrics to CloudWatch Logs under the tfe log group

Note: by default, the log group is created with never-expiring retention. If you want to customise this, you can change the retention after the log group is created, or create the log group ahead of time with whatever settings you wish.
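For example, to adjust the retention after the log group exists, something like the following should work; the 30-day value is purely illustrative.

# Set a 30-day retention policy on the tfe log group (30 days is an arbitrary example)
aws logs put-retention-policy \
  --log-group-name tfe \
  --retention-in-days 30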

It’s also important to note that the metric_unit section of the config file needs to match the metric type shown in the HashiCorp documentation linked below.

Now that the configuration is done, you can restart the CloudWatch agent and verify it works.

systemctl restart amazon-cloudwatch-agent
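A couple of quick checks can confirm the agent picked up the new configuration; the commands below are one way to do it, assuming the standard Linux install path and credentials that allow cloudwatch:ListMetrics. It can take a few minutes for the first metrics to appear.

# Confirm the agent service restarted cleanly
systemctl status amazon-cloudwatch-agent

# Ask the agent itself for its status (standard Linux install path assumed)
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status

# Once metrics have shipped, they should show up in the TFE_Metrics namespace
aws cloudwatch list-metrics --namespace TFE_Metrics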

The official AWS documentation that runs through Prometheus metric gathering in greater detail can be found here.

For a full list of metrics you can consume and send to CloudWatch, check out the HashiCorp documentation here.

Conclusion

This blog post has outlined the process of setting up the Amazon CloudWatch agent to consume metrics from the HashiCorp Terraform Enterprise (TFE) metrics service. With the CloudWatch agent in place, organizations can monitor and analyze TFE metrics in near real time, enabling them to detect anomalies and optimize performance.
