Cloud Monitoring

Amazon CloudWatch Metrics

  • CloudWatch provides metrics for every service in AWS
  • A metric is a variable to monitor (CPUUtilization, NetworkIn...)
  • Metrics have timestamps
  • Can create CloudWatch dashboards of metrics

Important Metrics

  • EC2 instances: CPU Utilization, Status Checks, Network (not RAM)
    • Default metrics every 5 minutes
    • Option for Detailed Monitoring ($$$): metrics every 1 minute
  • EBS volumes: Disk Reads/Writes
  • S3 buckets: BucketSizeBytes, NumberOfObjects, AllRequests
  • Billing: Total Estimated Charge (only in us-east-1)
  • Service Limits: how much you've been using a service API
  • Custom metrics: push your own metrics
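
As a rough illustration of pushing a custom metric, the sketch below builds one data point in the shape that the boto3 call put_metric_data accepts. The helper names, the "MyApp" style of namespace, the MemoryUtilization metric and the instance ID are assumptions made for this example. The actual push needs boto3 and AWS credentials, so it is kept in a separate function that the example does not call.

```python
from datetime import datetime, timezone


def build_custom_metric(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": k, "Value": v} for k, v in sorted((dimensions or {}).items())
        ],
        # Every data point carries a timestamp; default to "now" in UTC.
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def push_metric(namespace, datum):
    """Send the data point to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=[datum])


# Hypothetical metric: RAM usage, which EC2 does not report by default.
datum = build_custom_metric(
    "MemoryUtilization",
    63.5,
    unit="Percent",
    dimensions={"InstanceId": "i-0123456789abcdef0"},
)
print(datum["MetricName"], datum["Value"], datum["Unit"])
```

Calling push_metric("MyApp", datum) would then publish the point under a namespace of your choosing.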

Amazon CloudWatch Alarms

  • Alarms are used to trigger notifications for any metric
  • Alarm actions
    • Auto Scaling: increase or decrease EC2 instances "desired" count
    • EC2 Actions: stop, terminate, reboot or recover an EC2 instance
    • SNS notifications: send a notification to an SNS topic
  • Various options (sampling, %, max, min, etc.)
  • Can choose the period over which to evaluate an alarm
  • Example: create a billing alarm on the CloudWatch Billing metric
  • Alarm States: OK, INSUFFICIENT_DATA, ALARM
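
The billing alarm example can be sketched as a set of parameters for the boto3 call put_metric_alarm. The alarm name, the 6-hour period, the threshold and the SNS topic ARN are assumptions for illustration. The alarm_state function is a deliberately simplified model of the three states, not how CloudWatch evaluates alarms internally.

```python
def billing_alarm_params(threshold_usd, sns_topic_arn):
    """Parameters for a CloudWatch alarm on the estimated-charges metric."""
    return {
        "AlarmName": f"billing-over-{threshold_usd}-usd",  # hypothetical name
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 6 * 60 * 60,  # evaluate over 6-hour periods (in seconds)
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic on ALARM
    }


def create_billing_alarm(params):
    """Create the alarm (needs boto3 and credentials); billing data is in us-east-1."""
    import boto3

    boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)


def alarm_state(datapoints, threshold):
    """Simplified model of the three alarm states for one evaluation period."""
    if not datapoints:
        return "INSUFFICIENT_DATA"
    return "ALARM" if max(datapoints) > threshold else "OK"


params = billing_alarm_params(50, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
print(params["AlarmName"])
print(alarm_state([12.4, 61.0], threshold=50))
```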

Amazon CloudWatch Logs

  • CloudWatch Logs can collect logs from:
    • Elastic Beanstalk: collection of logs from application
    • ECS: collection from containers
    • AWS Lambda: collection from function logs
    • CloudTrail: based on a filter
    • CloudWatch log agents: on EC2 machines or on-premises servers
    • Route 53: logs DNS queries
  • Enables real-time monitoring of logs
  • Adjustable CloudWatch Logs retention

CloudWatch Logs for EC2

  • By default, no logs from your EC2 instance will go to CloudWatch
  • You need to run a CloudWatch agent on EC2 to push the log files you want
  • Make sure IAM permissions are correct
  • The CloudWatch log agent can be set up on-premises too
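
To make "push the log files you want" concrete, here is a minimal sketch that generates the "logs" section of a CloudWatch agent configuration file. The helper name, the nginx log path and the log group name are made up for the example. The instance (or on-premises server) still needs IAM permissions to create log streams and put log events.

```python
import json


def agent_logs_config(files):
    """Build the "logs" section of a CloudWatch agent config.

    `files` maps a log file path to the log group it should be sent to.
    """
    collect_list = [
        {
            "file_path": path,
            "log_group_name": group,
            "log_stream_name": "{instance_id}",  # placeholder the agent fills in
        }
        for path, group in files.items()
    ]
    return {"logs": {"logs_collected": {"files": {"collect_list": collect_list}}}}


# Hypothetical example: ship one nginx access log.
config = agent_logs_config({"/var/log/nginx/access.log": "my-app/nginx-access"})
print(json.dumps(config, indent=2))
```

The printed JSON can be saved as the agent's configuration file; only the files listed there are sent to CloudWatch Logs.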

Amazon EventBridge (formerly CloudWatch Events)

  • Schedule: Cron jobs (Scheduled scripts)
  • Event Pattern: Event rules to react to a service doing something
  • Trigger Lambda functions, send SQS/SNS messages...
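
The two kinds of rules can be sketched as follows. The matches function is a simplified, hypothetical re-implementation of event pattern matching written only to show the idea (exact-value matching, nested fields); real EventBridge patterns support more operators. The sample event and instance ID are made up.

```python
def matches(pattern, event):
    """Simplified EventBridge-style matching: every pattern key must match.

    Lists hold the allowed values; nested dicts are matched recursively.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Event pattern: react when an EC2 instance is terminated.
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["terminated"]},
}

# Schedule: a cron expression with six fields (this one: 08:00 UTC every day).
schedule = "cron(0 8 * * ? *)"

event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "terminated"},
}
print(matches(pattern, event))
```

A rule holds either the pattern or the schedule, plus a target such as a Lambda function or an SNS topic.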



  • Schema Registry: model event schema
  • You can archive events (all or filtered) sent to an event bus (indefinitely or for a set period)
  • Ability to replay archived events


AWS CloudTrail

  • Provides governance, compliance and audit for your AWS Account
  • CloudTrail is enabled by default
  • Get a history of events/API calls made within your AWS Account by:
    • Console
    • SDK
    • CLI
    • AWS Services
  • Can put logs from CloudTrail into CloudWatch Logs or S3
  • A trail can be applied to All Regions (default) or a single Region.
  • If a resource is deleted in AWS, investigate CloudTrail first
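
As a sketch of "investigate CloudTrail first", the snippet below scans records shaped like CloudTrail log entries for delete-like API calls that mention a given resource. The function name and the sample records (users, times, instance ID) are invented for illustration, and searching the request parameters as text is a simplification.

```python
def find_deletions(records, resource_id):
    """Return (time, who, action) for delete-like API calls mentioning a resource."""
    hits = []
    for record in records:
        name = record.get("eventName", "")
        if not name.startswith(("Delete", "Terminate")):
            continue
        # Simplification: search the request parameters as text.
        if resource_id not in str(record.get("requestParameters")):
            continue
        who = record.get("userIdentity", {}).get("arn", "unknown")
        hits.append((record.get("eventTime"), who, name))
    return hits


# Made-up records in the shape of CloudTrail log entries.
records = [
    {
        "eventTime": "2024-05-01T10:15:00Z",
        "eventName": "RunInstances",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
        "requestParameters": {"instanceType": "t3.micro"},
    },
    {
        "eventTime": "2024-05-02T18:42:00Z",
        "eventName": "TerminateInstances",
        "userIdentity": {"arn": "arn:aws:iam::123456789012:user/bob"},
        "requestParameters": {
            "instancesSet": {"items": [{"instanceId": "i-0123456789abcdef0"}]}
        },
    },
]
print(find_deletions(records, "i-0123456789abcdef0"))
```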


AWS X-Ray

  • Debugging in Production, the good old way:
    • Test locally
    • Add log statements everywhere
    • Re-deploy in production
  • Log formats differ across applications and log analysis is hard.
  • Debugging: one big monolith "easy", distributed services "hard"
  • No common view of your entire architecture
  • The solution is X-Ray

AWS X-Ray advantages

  • Troubleshooting performance (bottlenecks)
  • Understand dependencies in a microservice architecture
  • Pinpoint service issues
  • Review request behaviour
  • Find errors and exceptions
  • Are we meeting our time SLAs?
  • Where am I throttled?
  • Identify users that are impacted
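
To give a feel for how trace data answers these questions, the sketch below works on a hand-written record shaped like an X-Ray segment (a name, start and end times in epoch seconds, and subsegments for downstream calls). The service names and timings are made up, and the two helpers are illustrative only; they are not part of the X-Ray SDK.

```python
def slowest_subsegment(segment):
    """Return (name, seconds) of the slowest direct subsegment, or None."""
    timings = [
        (sub["name"], sub["end_time"] - sub["start_time"])
        for sub in segment.get("subsegments", [])
    ]
    return max(timings, key=lambda t: t[1]) if timings else None


def failed_subsegments(segment):
    """Names of subsegments flagged as fault, error, or throttle."""
    return [
        sub["name"]
        for sub in segment.get("subsegments", [])
        if sub.get("fault") or sub.get("error") or sub.get("throttle")
    ]


# Made-up segment for one request to a hypothetical "checkout-api" service.
segment = {
    "name": "checkout-api",
    "start_time": 1700000000.0,
    "end_time": 1700000002.0,
    "subsegments": [
        {"name": "DynamoDB", "start_time": 1700000000.25, "end_time": 1700000000.5},
        {
            "name": "payments-service",
            "start_time": 1700000000.5,
            "end_time": 1700000001.75,
            "error": True,
            "throttle": True,
        },
    ],
}
print(slowest_subsegment(segment))
print(failed_subsegments(segment))
```

Here the bottleneck and the throttled call are the same downstream service, which is the kind of finding the X-Ray console surfaces visually.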

Amazon CodeGuru (decommissioned)

  • An ML-powered service for automated code reviews and application performance recommendations
  • Provides two functionalities
    • CodeGuru Reviewer: automated code reviews for static code analysis (development)
    • CodeGuru Profiler: visibility/recommendations about application performance during runtime (production)


Amazon CodeGuru Reviewer

  • Identify critical issues, security vulnerabilities, and hard-to-find bugs
  • Example: common coding best practices, resource leaks, security detection, input validation
  • Uses Machine Learning and automated reasoning
  • Hard-learned lessons across millions of code reviews on 1000s of open-source and Amazon repositories
  • Supports Java and Python
  • Integrates with GitHub, Bitbucket and AWS CodeCommit

Amazon CodeGuru Profiler

  • Helps understand the runtime behaviour of your application
  • Example: identify if your application is consuming excessive CPU capacity on a logging routine
  • Features:
    • Identify and remove code inefficiencies
    • Improve application performance (e.g., reduce CPU utilization)
    • Decrease compute costs
    • Provides a heap summary (identify which objects are using up memory)
    • Anomaly Detection
  • Supports applications running on AWS or on-premises
  • Minimal overhead on application

AWS Health Dashboard - Service History

  • Shows the health of all services in all Regions
  • Shows historical information for each day
  • Has an RSS feed you can subscribe to
  • Previously called AWS Service Health Dashboard

AWS Account Health Dashboard - Your Account

  • Previously called AWS Personal Health Dashboard (PHD)
  • AWS Account Health Dashboard provides alerts and remediation guidance when AWS is experiencing events that may impact you
  • While the Service Health Dashboard displays the general status of AWS services, Account Health Dashboard gives you a personalized view into the performance and availability of the AWS services underlying your AWS resources
  • The dashboard displays relevant and timely information to help you manage events in progress and provides proactive notification to help you plan for scheduled activities.
  • Can aggregate data from an entire AWS Organization
  • Global Service
  • Shows how AWS outages directly impact you and your AWS resources
  • Keywords: alerts, remediation, proactive notifications, scheduled activities

Monitoring Summary

  • CloudWatch
    • Metrics: monitor the performance of AWS services and billing metrics
    • Alarms: automate notifications, perform EC2 actions, notify SNS based on a metric
    • Logs: collect log files from EC2 instances, servers, Lambda functions...
    • Events (or EventBridge): react to events in AWS, or trigger a rule on a schedule
  • CloudTrail: audit API calls made within your AWS Account
  • CloudTrail Insights: automated analysis of your CloudTrail Events
  • X-Ray: trace requests made through your distributed applications
  • AWS Health Dashboard: status of all AWS services across all regions
  • AWS Account Health Dashboard: AWS events that impact your infrastructure
  • Amazon CodeGuru: automated code reviews and application performance recommendations
