Databases in AWS

Databases

  • Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits
  • Sometimes you want to store data in a database instead
  • You can structure the data
  • You build indexes to efficiently query/search through the data
  • You define relationships between your datasets
  • Databases are optimized for a purpose and come with different features, shapes and constraints

Relational Databases

  • Data looks like Excel spreadsheets: tables of rows and columns, with links between tables to model relationships and to normalize the data as required.
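
As a minimal sketch of what "links between tables" means, the example below uses Python's built-in sqlite3 module as a local stand-in for a relational engine. The table and column names (students, enrollments) are made up for illustration.

```python
import sqlite3

# In-memory SQLite database, used only as a local stand-in for a relational engine.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Two tables; enrollments.student_id links each row back to a student.
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE enrollments ("
    " student_id INTEGER NOT NULL REFERENCES students(id),"
    " course TEXT NOT NULL)"
)

conn.executemany(
    "INSERT INTO students (id, name) VALUES (?, ?)",
    [(1, "Ana"), (2, "Ben")],
)
conn.executemany(
    "INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
    [(1, "Databases"), (1, "Networking"), (2, "Databases")],
)

# An index so that lookups by course do not have to scan the whole table.
conn.execute("CREATE INDEX idx_enrollments_course ON enrollments(course)")

# Follow the relationship with a JOIN.
rows = conn.execute(
    "SELECT s.name, e.course FROM students s"
    " JOIN enrollments e ON e.student_id = s.id"
    " WHERE e.course = ? ORDER BY s.name",
    ("Databases",),
).fetchall()
print(rows)  # [('Ana', 'Databases'), ('Ben', 'Databases')]
```

Student names are stored once and referenced by id, which is the normalization idea in miniature.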

NoSQL Databases

  • NoSQL = non-SQL = non-relational databases
  • NoSQL databases are purpose-built for specific data models and have flexible schemas for building modern applications
  • Benefits:
    • Flexibility: easy to evolve data model
    • Scalability: designed to scale-out by using distributed clusters
    • High-performance: optimized for a specific data model
    • Highly functional: types optimized for the data model
  • Examples: key-value, document, graph, in-memory, search databases
  • NoSQL data example: JSON
    • JSON = JavaScript Object Notation
    • JSON is a common form of data that fits into a NoSQL model
    • Data can be nested
    • Fields can change over time
    • Support for new types: arrays, etc.
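
The points about nesting, changing fields and arrays can be sketched with Python's built-in json module. The two documents below are invented examples; the second one has fields that the first one lacks, which a flexible schema allows.

```python
import json

# Two hypothetical user documents with different shapes.
doc_v1 = {"name": "Ana", "age": 30}
doc_v2 = {
    "name": "Ben",
    "age": 41,
    "address": {"city": "Lisbon", "country": "PT"},  # nested object
    "cars": ["Audi", "Fiat"],                        # array
}


def summarize(doc):
    """Round-trip a document through JSON text and read optional fields safely."""
    restored = json.loads(json.dumps(doc))
    city = restored.get("address", {}).get("city", "unknown")
    car_count = len(restored.get("cars", []))
    return restored["name"], city, car_count


print(summarize(doc_v1))  # ('Ana', 'unknown', 0)
print(summarize(doc_v2))  # ('Ben', 'Lisbon', 2)
```

The reader of the data, not the storage layer, decides how to handle a missing field.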

Databases & shared responsibility on AWS

  • AWS offers managed services for different types of databases
  • Benefits include:
    • Quick Provisioning, High Availability, Vertical and Horizontal Scaling
    • Automated Backup & Restore, Operations, Updates
    • Operating System Patching is handled by AWS
    • Monitoring, alerting
  • Note: many database technologies could be run on EC2, but then you must handle resiliency, backup, patching, high availability, fault tolerance and scaling yourself

Amazon RDS

  • RDS stands for Relational Database Service
  • It's a managed DB service for databases that use SQL as a query language
  • It allows you to create databases in the cloud that are managed by AWS
    • Postgres
    • MySQL
    • MariaDB
    • Oracle
    • Microsoft SQL Server
    • IBM DB2
    • Aurora (AWS Proprietary database)
  • Advantages of using RDS over hosting a database on EC2
    • RDS is an AWS managed service:
      • Automated provisioning, OS patching
      • Continuous backups and restore to a specific timestamp (Point in Time Restore)
      • Monitoring dashboards
      • Read replicas for improved read performance
      • Multi AZ setup for DR (Disaster Recovery)
      • Maintenance windows for upgrades
      • Scaling capability (vertical and horizontal)
      • Storage backed by EBS
    • But you can't SSH into your instances

RDS Solution Architecture



Amazon Aurora

  • Aurora is a proprietary technology from AWS (not open sourced)
  • PostgreSQL and MySQL are both supported as Aurora DB
  • Aurora is "AWS cloud optimized" and claims a 5x performance improvement over MySQL on RDS and over 3x the performance of Postgres on RDS
  • Aurora storage automatically grows in increments of 10 GB, up to 128 TB
  • Aurora costs about 20% more than RDS, but is more efficient

Amazon Aurora Serverless

  • Automated database instantiation and auto-scaling based on actual usage
  • PostgreSQL and MySQL are both supported as Aurora Serverless DB
  • No capacity planning needed
  • Least management overhead
  • Pay per second, can be more cost-effective
  • Use cases: good for infrequent, intermittent or unpredictable workloads

RDS Deployments: Read Replicas, Multi-AZ

  • Read Replicas:
    • Scale the read workload of your DB
    • Can create up to 15 Read Replicas
    • Data is only written to the main DB
  • Multi-AZ:
    • Failover in case of AZ outage (high availability)
    • Data is only read/written to the main database
    • Can only have 1 other AZ as failover
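
To make the read replica idea concrete, here is a toy in-memory model, not an RDS API. The class name `ReplicatedDatabase` and its methods are hypothetical, and the explicit `replicate()` call is a simplification standing in for the replication that the service performs on its own.

```python
import itertools


class ReplicatedDatabase:
    """Toy model: one writable primary plus read-only replicas."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Writes only ever go to the primary.
        self.primary[key] = value

    def replicate(self):
        # Copy the primary's state to every replica.
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)

    def read(self, key):
        # Reads are spread across the replicas in round-robin order.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)


db = ReplicatedDatabase(replica_count=2)
db.write("user:1", "Ana")
before = db.read("user:1")  # the replica has not received the write yet
db.replicate()
after = db.read("user:1")
print(before, after)  # None Ana
```

A Multi-AZ standby differs from this model in that it serves no reads at all; it only takes over if the main database's AZ fails.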

RDS Deployments: Multi-Region

  • Multi-Region (Read Replicas)
    • Disaster recovery in case of region issue
    • Local performance for global reads
    • Replication cost

Amazon ElastiCache Overview

  • It is a managed Redis or Memcached service (in-memory data stores, not relational databases)
  • Caches are in-memory databases with high performance, low latency
  • Helps reduce the load on databases for read-intensive workloads
  • AWS takes care of OS maintenance/patching, optimizations, setup, configuration, monitoring, failure recovery and backups
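
One common way an application uses a cache to take read load off a database is to check the cache first and query the database only on a miss (often called cache-aside). The sketch below is an assumption about application code, not something ElastiCache prescribes: a plain dict stands in for Redis or Memcached, and `SlowDatabase` is a hypothetical class that counts its queries.

```python
class SlowDatabase:
    """Hypothetical database that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = rows
        self.query_count = 0

    def get(self, key):
        self.query_count += 1
        return self.rows.get(key)


def get_with_cache(key, cache, database):
    if key in cache:
        return cache[key]          # cache hit: the database is not touched
    value = database.get(key)      # cache miss: fall back to the database
    if value is not None:
        cache[key] = value         # remember the value for later reads
    return value


database = SlowDatabase({"product:1": "keyboard"})
cache = {}
first = get_with_cache("product:1", cache, database)
second = get_with_cache("product:1", cache, database)
print(first, second, database.query_count)  # keyboard keyboard 1
```

Two reads reach the application, but only one reaches the database.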

DynamoDB

  • Fully managed, highly available with replication across 3 AZs
  • NoSQL database - not a relational database
  • Scales to massive workloads, distributed "serverless" database
  • Millions of requests per second, trillions of rows, 100s of TB of storage
  • Fast and consistent in performance
  • Single-digit millisecond latency - low latency retrieval
  • Integrated with IAM for security, authorization and administration
  • Low cost and auto scaling capabilities
  • Standard & Infrequent Access (IA) Table class
  • Type of data:
    • key/value database
    • each row can have a different schema (a different set of columns). Rows are known as items, and their columns are known as attributes.
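
The "different attributes per item" idea can be sketched with a toy table. This is not the DynamoDB API: `ToyTable`, `put_item` and `get_item` are hypothetical names for a local model in which only the key attribute is required and everything else is free-form.

```python
class ToyTable:
    """Toy key/value table: items need the key attribute, nothing else is fixed."""

    def __init__(self, partition_key):
        self.partition_key = partition_key
        self.items = {}

    def put_item(self, item):
        if self.partition_key not in item:
            raise ValueError(f"item is missing key attribute {self.partition_key!r}")
        self.items[item[self.partition_key]] = dict(item)

    def get_item(self, key_value):
        return self.items.get(key_value)


users = ToyTable(partition_key="user_id")
users.put_item({"user_id": "u1", "name": "Ana"})
users.put_item({"user_id": "u2", "name": "Ben", "age": 41, "tags": ["admin"]})

# Each item carries its own set of attributes.
attribute_names = {key: sorted(item) for key, item in users.items.items()}
print(attribute_names)
# {'u1': ['name', 'user_id'], 'u2': ['age', 'name', 'tags', 'user_id']}
```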

DynamoDB Accelerator - DAX

  • Fully Managed in-memory cache for DynamoDB
  • 10x performance improvement (from single-digit millisecond latency to microsecond latency) when accessing your DynamoDB tables
  • Secure, highly scalable & highly available
  • Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases

DynamoDB - Global Tables

  • Make a DynamoDB table accessible with low latency in multiple Regions
  • Active-Active replication (read/write to any AWS Region)

Redshift Database

  • Redshift is based on PostgreSQL, but it's not used for OLTP
  • It's OLAP - online analytical processing (analytics and data warehousing)
  • Load data once every hour, not every second
  • 10x better performance than other data warehouses, scales to PBs of data
  • Columnar storage of data (instead of row based)
  • Massively Parallel Query Execution (MPP), highly available
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as Amazon QuickSight or Tableau integrate with it for data visualization
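
Why columnar storage suits analytics can be illustrated with plain Python lists. This is only a layout sketch with invented order data; it says nothing about how Redshift stores data internally beyond the row-versus-column idea.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
]

# Summing one field still walks through every whole record.
row_total = sum(row["amount"] for row in rows)

# Columnar layout: the values of each column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over "amount" now only needs to read that one column.
column_total = sum(columns["amount"])

print(row_total, column_total)  # 68.0 68.0
```

Analytical queries typically aggregate a few columns over many rows, which is the access pattern the columnar layout favors.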

Redshift Serverless

  • Automatically provisions and scales data warehouse underlying capacity
  • Run analytics workloads without managing data warehouse infrastructure
  • Pay only for what you use (save costs)
  • Use cases: reporting, dashboarding applications, real-time analytics

Amazon EMR

  • EMR stands for "Elastic MapReduce"
  • EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
  • The clusters can be made of hundreds of EC2 instances
  • Also supports Apache Spark, HBase, Presto, Flink, etc.
  • EMR takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot instances
  • Use cases: data processing, machine learning, web indexing, big data

Amazon Athena

  • Serverless query service to perform analytics against S3 objects
  • Uses standard SQL language to query the files
  • Supports CSV, JSON, ORC, Avro, Parquet (built on Presto)
  • Pricing: $5 per TB of data scanned
  • Use compressed or columnar data for cost-savings (less data scanned)
  • Use cases: business intelligence/analytics/reporting, analyzing and querying VPC Flow Logs, CloudTrail trails, etc.
  • If you need serverless SQL to analyze S3 objects, use Athena
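
The pricing rule above is simple arithmetic, sketched below. The assumptions are mine: 1 TB is treated as 1024^4 bytes, any per-query minimums or rounding are ignored, and the 8x reduction from compressed columnar data is a hypothetical ratio chosen for illustration.

```python
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte
PRICE_PER_TB_USD = 5.0     # price per TB scanned, as quoted in these notes


def scan_cost_usd(bytes_scanned):
    """Cost of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


raw_csv_bytes = 2 * BYTES_PER_TB
columnar_bytes = raw_csv_bytes // 8  # hypothetical 8x less data scanned

print(scan_cost_usd(raw_csv_bytes))   # 10.0
print(scan_cost_usd(columnar_bytes))  # 1.25
```

Because the bill follows bytes scanned, anything that shrinks the scan shrinks the cost by the same factor.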

Amazon QuickSight

  • It is a BI tool in AWS
  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Use cases:
    • Business analytics
    • Building visualizations
    • Perform ad-hoc analysis
    • Get business insights using data
  • Integrated with RDS, Aurora, Athena, Redshift, S3, etc.


DocumentDB

  • Aurora is an "AWS implementation" of PostgreSQL/MySQL
  • DocumentDB is the same for MongoDB (which is a NoSQL database)
  • MongoDB is used to store, query and index JSON data
  • Fully managed, highly available with replication across 3 AZs
  • DocumentDB storage automatically grows in increments of 10 GB
  • Automatically scales to workloads with millions of requests per second

Amazon Neptune

  • Fully managed graph database
  • A popular graph dataset would be a social network
    • Users have friends
    • Posts have comments
    • Comments have likes from users
    • Users share and like posts...
  • Highly available, with replication across 3 AZs and up to 15 read replicas
  • Build and run applications working with highly connected datasets - optimized for these complex and hard queries
  • Can store billions of relations and query the graph with millisecond latency
  • Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking
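
A "highly connected" query such as friends of friends can be sketched over a small adjacency map. This is a plain Python illustration with invented users, not a Neptune query language example.

```python
# Hypothetical social graph: each user maps to the set of their friends.
friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dan"},
    "cara": {"ana", "dan", "eve"},
    "dan": {"ben", "cara"},
    "eve": {"cara"},
}


def friends_of_friends(graph, user):
    """Users two hops away who are not already direct friends."""
    direct = graph.get(user, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {user}


print(sorted(friends_of_friends(friends, "ana")))  # ['dan', 'eve']
```

In a relational model the same question needs a self-join per hop, which is why relationship-heavy workloads are a natural fit for a graph database.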

Amazon Timestream

  • Fully managed, fast, scalable, serverless time series database
  • Automatically scales up/down to adjust capacity
  • Store and analyze trillions of events per day
  • Claimed to be up to 1,000 times faster at 1/10th the cost of relational databases
  • Built-in time series analytics functions (helps you identify patterns in your data in near real-time)
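
A typical time series operation is grouping readings into fixed time buckets and aggregating each bucket. The sketch below does this in plain Python on invented sensor readings; it illustrates the idea only and does not use Timestream's own functions.

```python
from collections import defaultdict

# Hypothetical sensor readings as (seconds since start, temperature).
readings = [(0, 20.0), (30, 22.0), (60, 21.0), (90, 25.0), (150, 30.0)]


def average_per_bucket(points, bucket_seconds):
    """Average the values that fall into each fixed-width time bucket."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


print(average_per_bucket(readings, 60))  # {0: 21.0, 60: 23.0, 120: 30.0}
```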

Amazon Managed Blockchain (Decentralized)

  • Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
  • Amazon Managed Blockchain is a managed service to:
    • Join public blockchain networks
    • Or create your own scalable private network
  • Compatible with the frameworks Hyperledger Fabric & Ethereum

AWS Glue

  • Managed extract, transform, and load (ETL) service
  • Useful to prepare and transform data for analytics
  • Fully serverless service
  • Glue Data Catalog: catalog of datasets
    • can be used by Athena, Redshift, EMR
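
The three ETL stages can be sketched with the Python standard library. This is a generic illustration, not Glue code: the CSV content, the field names and the cleaning rules (drop rows without an amount, upper-case the currency) are all invented for the example.

```python
import csv
import io
import json

# Hypothetical raw export with one incomplete row.
raw_csv = "order_id,amount,currency\n1,20.00,usd\n2,,usd\n3,12.50,eur\n"


def extract(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop rows without an amount and normalize types."""
    cleaned = []
    for record in records:
        if not record["amount"]:
            continue
        cleaned.append({
            "order_id": int(record["order_id"]),
            "amount": float(record["amount"]),
            "currency": record["currency"].upper(),
        })
    return cleaned


def load(records):
    """Load: serialize to JSON Lines, ready for an analytics store."""
    return "\n".join(json.dumps(record) for record in records)


cleaned_orders = transform(extract(raw_csv))
print(load(cleaned_orders))
```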

DMS - Database Migration Service

  • Quickly and securely migrate databases to AWS; the service is resilient and self-healing
  • The source database remains available during the migration
  • Supports:
    • Homogeneous migrations: e.g. Oracle to Oracle
    • Heterogeneous migrations: e.g. Microsoft SQL Server to Aurora

Databases & Analytics Summary in AWS

  • Relational Database - OLTP: RDS & Aurora (SQL)
  • Differences between Multi-AZ, Read Replicas, Multi-Region
  • In-memory Database - ElastiCache
  • Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
  • Warehouse - OLAP: Redshift (SQL)
  • Hadoop Cluster: EMR
  • Athena: query data on Amazon S3 (serverless & SQL)
  • QuickSight: dashboards on your data (serverless)
  • DocumentDB: "Aurora for MongoDB" (JSON - NoSQL database)
  • Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
  • Glue: Managed ETL (Extract Transform Load) and Data Catalog service
  • Database Migration: DMS
  • Neptune: graph database
  • Timestream: time-series database




 





