Databases in AWS

Databases

Storing data on disk (EFS, EBS, EC2 Instance Store, S3) can have its limits
Sometimes, you want to store data in a database instead
You can structure the data
You build indexes to efficiently query/search through the data
You define relationships between your datasets
Databases are optimized for a purpose and come with different features, shapes and constraints

Relational Databases

Data is organized in tables that look like Excel spreadsheets, with links between tables to express relationships and to normalize the data as required.
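
A minimal sketch of tables, a link between them, and an index, using Python's built-in sqlite3 module as a stand-in for a managed relational engine. The table and column names (students, enrollments) are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; a stand-in for any relational engine.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Two tables linked by a foreign key (the "link between tables").
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE enrollments ("
    " student_id INTEGER NOT NULL REFERENCES students(id),"
    " course TEXT NOT NULL)"
)
# An index so lookups by student do not scan the whole table.
conn.execute("CREATE INDEX idx_enrollments_student ON enrollments(student_id)")

conn.executemany("INSERT INTO students VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [(1, "Cloud 101"), (1, "Databases"), (2, "Cloud 101")],
)

def courses_for(name):
    """Join the two tables to list a student's courses, sorted by title."""
    rows = conn.execute(
        "SELECT e.course FROM students s"
        " JOIN enrollments e ON e.student_id = s.id"
        " WHERE s.name = ? ORDER BY e.course",
        (name,),
    ).fetchall()
    return [course for (course,) in rows]

print(courses_for("Ana"))
```

Storing each student once and referring to them by id from the enrollments table is the normalization the notes mention: a name change touches a single row.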

NoSQL Databases

NoSQL = non-SQL = non relational databases
NoSQL databases are purpose built for specific data models and have flexible schemas for building modern applications
Benefits:

Flexibility: easy to evolve data model
Scalability: designed to scale-out by using distributed clusters
High-performance: optimized for a specific data model
Highly functional: types optimized for the data model

Examples: key-value, document, graph, in-memory, search databases
NoSQL data example: JSON

JSON = JavaScript Object Notation
JSON is a common form of data that fits into a NoSQL model
Data can be nested
Fields can change over time
Support for new types: arrays, etc.
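
To make the JSON points concrete, here is a small sketch with Python's built-in json module. The user record and its field names are hypothetical; it only shows nesting, an array field, and a field added later.

```python
import json

# A hypothetical user record: nested object plus an array field.
user_v1 = {
    "name": "Ana",
    "address": {"city": "Lisbon", "country": "PT"},
    "interests": ["cloud", "databases"],
}

# Later the application adds a field; older documents simply lack it.
user_v2 = dict(user_v1, plan="premium")

def city_of(document):
    """Read a nested field, tolerating documents that do not have it."""
    return document.get("address", {}).get("city")

# Round-trip through JSON text, as a document store would.
encoded = json.dumps(user_v2, sort_keys=True)
decoded = json.loads(encoded)
print(encoded)
```

No schema migration is needed for the new "plan" field: documents with and without it can sit side by side, which is the flexibility the notes refer to.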

Databases & shared responsibility on AWS

AWS offers to manage different databases for us
Benefits include:

Quick Provisioning, High Availability, Vertical and Horizontal Scaling
Automated Backup & Restore, Operations, Updates
Operating System Patching is handled by AWS
Monitoring, alerting

Note: many database technologies could be run on EC2, but you must handle resiliency, backup, patching, high availability, fault tolerance and scaling yourself.

Amazon RDS

RDS stands for Relational Database Service
It's a managed DB service for databases that use SQL as a query language
It allows you to create databases in the cloud that are managed by AWS

Postgres
MySQL
MariaDB
Oracle
Microsoft SQL Server
IBM DB2
Aurora (AWS Proprietary database)

Advantage of using RDS over hosting database on EC2

RDS is an AWS managed service:

Automated provisioning, OS patching
Continuous backups and restore to specific timestamp (Point in Time Restore)!
Monitoring dashboards
Read replicas for improved read performance
Multi AZ setup for DR (Disaster Recovery)
Maintenance windows for upgrades
Scaling capability (vertical and horizontal)
Storage backed by EBS

But you can't SSH into your instances

RDS Solution Architecture

Amazon Aurora

Aurora is a proprietary technology from AWS (not open sourced)
PostgreSQL and MySQL are both supported as Aurora DB
Aurora is "AWS cloud optimized" and claims 5x performance improvement over MySQL on RDS, over 3x the performance of Postgres on RDS
Aurora storage automatically grows in increments of 10 GB, up to 128 TB
Aurora costs about 20% more than RDS - but is more efficient

Amazon Aurora Serverless

Automated database instantiation and auto-scaling based on actual usage
PostgreSQL and MySQL are both supported as Aurora Serverless DB
No capacity planning needed
Least management overhead
Pay per second, can be more cost-effective
Use cases: good for infrequent, intermittent or unpredictable workloads

RDS Deployments: Read Replicas, Multi-AZ

Read Replicas:

Scale the read workload of your DB
Can create up to 15 Read Replicas
Data is only written to the main DB

Multi-AZ:

Failover in case of AZ outage (high availability)
Data is only read/written to the main database
Can only have 1 other AZ as failover
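
A toy model of the Read Replica idea, as a sketch only: writes go to one main node and reads are spread over the replicas. The class and method names are hypothetical, replication is instantaneous here (real read replicas replicate asynchronously and can lag), and at least one replica is assumed.

```python
import itertools

class ReplicatedDatabase:
    """Toy model: one main node takes writes, replicas serve reads."""

    def __init__(self, replica_count):
        # Assumption: replica_count is at least 1.
        self.main = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        # Writes only ever go to the main database...
        self.main[key] = value
        # ...and are then copied to every replica (instantly in this toy).
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads are spread across the replicas in round-robin order.
        index = next(self._next_replica)
        return index, self.replicas[index].get(key)
```

A Multi-AZ standby differs from this picture: it receives the same writes but serves no traffic until a failover promotes it.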

RDS Deployments: Multi-Region

Multi-Region (Read Replicas)

Disaster recovery in case of region issue
Local performance for global reads
Replication cost

Amazon ElastiCache Overview

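It is a managed Redis or Memcached service: in the same way RDS provides managed relational databases, ElastiCache provides managed in-memory caches (it is not a relational database)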
Caches are in-memory databases with high performance, low latency
Helps reduce load on databases for read-intensive workloads
AWS takes care of OS maintenance/patching, optimizations, setup, configuration, monitoring, failure recovery and backups
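
One common way a cache takes read load off a database is the cache-aside pattern, sketched below. Plain dictionaries stand in for both the cache and the database, and the class name is hypothetical; a real setup would talk to Redis or Memcached through a client library.

```python
class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, database):
        self.database = database   # stand-in for a slower backing database
        self.cache = {}            # stand-in for an in-memory cache
        self.database_reads = 0    # counts how often the database is hit

    def get(self, key):
        if key in self.cache:              # cache hit: no database access
            return self.cache[key]
        self.database_reads += 1           # cache miss: query the database
        value = self.database[key]
        self.cache[key] = value            # populate the cache for next time
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)          # invalidate the stale cached copy
```

Repeated reads of the same key reach the database only once, until a write invalidates the cached entry.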

DynamoDB

Fully managed, highly available with replication across 3 AZ
NoSQL database - not a relational database
Scales to massive workloads, distributed "serverless" database
Millions of requests per second, trillions of rows, 100s of TB of storage
Fast and consistent in performance
Single-digit millisecond latency - low latency retrieval
Integrated with IAM for security, authorization and administration
Low cost and auto scaling capabilities
Standard & Infrequent Access (IA) Table class
Type of data:

key/value database
Each row is known as an item, and its columns are known as attributes. Items in the same table can have different attributes (a different column count per row).
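
A sketch of the item/attribute model using a plain dictionary as the table. This is not the DynamoDB API; the function names mirror its vocabulary for readability only, and the "user_id" key is a made-up primary key.

```python
# Toy table keyed by a primary key attribute, hypothetically "user_id".
table = {}

def put_item(item):
    """Store an item; only the primary key attribute is mandatory."""
    if "user_id" not in item:
        raise ValueError("item is missing the primary key 'user_id'")
    table[item["user_id"]] = dict(item)

def get_item(user_id):
    """Key/value lookup: fetch one item directly by its primary key."""
    return table.get(user_id)

# Two items in the same table with different sets of attributes.
put_item({"user_id": "u1", "name": "Ana", "age": 31})
put_item({"user_id": "u2", "name": "Bo", "interests": ["cloud"]})
```

The second item has no "age" and the first has no "interests", yet both live in the same table and are retrieved the same way, by key.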

DynamoDB Accelerator - DAX

Fully Managed in-memory cache for DynamoDB
10x performance improvement (from single-digit millisecond latency to microsecond latency) when accessing your DynamoDB tables
Secure, highly scalable & highly available
Difference with ElastiCache at the CCP level: DAX is only used for and is integrated with DynamoDB, while ElastiCache can be used for other databases

DynamoDB - Global Tables

Make a DynamoDB table accessible with low latency in multiple Regions
Active-Active replication (read/write to any AWS Region)

Redshift Database

Redshift is based on PostgreSQL, but it's not used for OLTP
It's OLAP - online analytical processing (analytics and data warehousing)
Load data once every hour, not every second
10x better performance than other data warehouses, scales to PBs of data
Columnar storage of data (instead of row based)
Massively Parallel Query Execution (MPP), highly available
Pay as you go based on the instances provisioned
Has a SQL interface for performing the queries
BI tools such as Amazon QuickSight or Tableau integrate with it for data visualization
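
Why columnar storage suits analytics can be shown with a small sketch. The sales records are hypothetical, and the two functions only illustrate which data each layout has to touch for an aggregate; they say nothing about how Redshift stores data internally.

```python
# The same three hypothetical sales records, first in a row-based layout.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

def to_columns(records):
    """Pivot row-based records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}

columns = to_columns(rows)

def total_row_based(records):
    # Row layout: every full record is visited to read a single field.
    return sum(record["amount"] for record in records)

def total_columnar(cols):
    # Column layout: only the "amount" column is read.
    return sum(cols["amount"])
```

Both functions return the same total, but the columnar version never looks at order_id or region, which matters when a table has many columns and billions of rows.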

Redshift Serverless

Automatically provisions and scales data warehouse underlying capacity
Run analytics workloads without managing data warehouse infrastructure
Pay only for what you use (save costs)
Use cases: reporting, dashboarding applications, real-time analytics

Amazon EMR

EMR stands for "Elastic MapReduce"
EMR helps create Hadoop clusters (Big Data) to analyze and process vast amounts of data
The clusters can be made of hundreds of EC2 instances
Also supports Apache Spark, HBase, Presto, Flink and others
EMR takes care of all the provisioning and configuration
Auto-scaling and integrated with Spot instances
Use cases: data processing, machine learning, web indexing, big data
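
The "MapReduce" in the name refers to a programming model: map each input to key/value pairs, group the pairs by key, then reduce each group. Below is a single-machine word count sketch of that model; the function names are illustrative, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))
```

Because each line is mapped independently and each key is reduced independently, the work splits naturally across hundreds of instances.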

Amazon Athena

Serverless query service to perform analytics against S3 objects
Uses standard SQL language to query the files
Supports CSV, JSON, ORC, Avro, Parquet (built on Presto)
Pricing: $5 per TB of data scanned
Use compressed or columnar data for cost-savings (less scan)
Use cases: business intelligence/analytics/reporting, analyzing and querying VPC Flow Logs, CloudTrail trails, etc.
If you need serverless SQL to analyze S3 objects, use Athena
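
Since pricing depends on data scanned, a quick estimate shows why compressed or columnar data saves money. This is arithmetic only, using the $5 per TB figure from these notes; the 2 TB dataset and the 10% scan ratio are made-up numbers, and 1 TB is assumed to be 1024^4 bytes.

```python
PRICE_PER_TB_USD = 5.0          # figure quoted in the notes
BYTES_PER_TB = 1024 ** 4        # assumption: 1 TB counted as 2**40 bytes

def query_cost_usd(bytes_scanned):
    """Estimated cost of one query from the amount of data it scans."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD

# Hypothetical query over 2 TB of raw CSV.
full_scan = 2 * BYTES_PER_TB
# Hypothetical: columnar, compressed data lets the same query scan 10% as much.
reduced_scan = full_scan // 10

print(query_cost_usd(full_scan), query_cost_usd(reduced_scan))
```

Under these assumptions the query drops from about $10 to about $1, purely because less data is read.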

Amazon QuickSight

It is a BI tool in AWS
Serverless machine learning-powered business intelligence service to create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Use cases:

Business analytics
Building visualizations
Perform ad-hoc analysis
Get business insights using data

Integrated with RDS, Aurora, Athena, Redshift, S3 and others

DocumentDB

Aurora is an "AWS implementation" of PostgreSQL/MySQL
DocumentDB is the same for MongoDB (which is a NoSQL database)
MongoDB is used to store, query and index JSON data
Fully Managed, highly available with replication across 3 AZ
DocumentDB storage automatically grows in increments of 10 GB
Automatically scales to workloads with millions of requests per second

Amazon Neptune

Fully managed graph database
A popular graph dataset would be a social network

Users have friends
Posts have comments
Comments have likes from users
Users share and like posts...

Highly available across 3 AZ, with up to 15 read replicas
Build and run applications working with highly connected datasets - optimized for these complex and hard queries
Can store up to billions of relations and query the graph with milliseconds latency
Highly available with replication across multiple AZs
Great for knowledge graphs (Wikipedia), fraud detection, recommendation engines, social networking
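
A sketch of the kind of query a graph database is built for: "who is within N hops of this user?". The friendship data is hypothetical and the traversal is a plain breadth-first search over an in-memory adjacency list, not a Neptune query.

```python
from collections import deque

# Hypothetical friendship graph stored as an adjacency list.
friends = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana"},
    "dee": {"bo", "eli"},
    "eli": {"dee"},
}

def within_hops(graph, start, max_hops):
    """Breadth-first search: everyone reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(person, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[person] + 1
                queue.append(neighbour)
    del distance[start]  # the start user is not their own friend
    return distance
```

In a relational database the same question needs one self-join per hop, which is why highly connected data is the typical case for a graph engine.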

Amazon Timestream

Fully managed, fast, scalable, serverless time series database
Automatically scales up/down to adjust capacity
Store and analyze trillions of events per day
Claimed to be up to 1,000 times faster at as little as 1/10th the cost of relational databases
Built-in time series analytics functions (helps you identify patterns in your data in near real-time)
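
A typical time series operation is averaging readings per fixed time window. The sketch below does this in plain Python on hypothetical CPU readings; a time series database offers such functions built in and applies them at far larger scale.

```python
from collections import defaultdict

def average_per_bucket(points, bucket_seconds):
    """Group (timestamp, value) points into fixed windows and average each."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Round the timestamp down to the start of its window.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical CPU readings as (seconds since start, percent) pairs.
readings = [(0, 10.0), (20, 30.0), (65, 50.0), (119, 70.0), (120, 90.0)]

print(average_per_bucket(readings, 60))
```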

Amazon Managed Blockchain (Decentralized)

Blockchain makes it possible to build applications where multiple parties can execute transactions without the need for a trusted, central authority.
Amazon Managed Blockchain is a managed service to:

Join public blockchain networks
Or create your own scalable private network

Compatible with the frameworks Hyperledger Fabric & Ethereum

AWS Glue

Managed extract, transform, and load (ETL) service
Useful to prepare and transform data for analytics
Fully serverless service
Glue Data Catalog: catalog of datasets

Can be used by Athena, Redshift, EMR
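
The three ETL steps can be sketched with the standard library alone. The CSV content, field names and cleaning rules below are hypothetical, and this is not how Glue jobs are written; it only shows what extract, transform and load each do.

```python
import csv
import io
import json

# Extract: hypothetical raw CSV export with messy values.
RAW_CSV = "order_id,amount,country\n1, 19.90 ,pt\n2,,us\n3,5.00,US\n"

def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: drop rows without an amount, fix types, normalise countries."""
    cleaned = []
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue  # skip incomplete rows
        cleaned.append({
            "order_id": int(record["order_id"]),
            "amount": float(amount),
            "country": record["country"].strip().upper(),
        })
    return cleaned

def load(records):
    """Load: serialise as JSON Lines, one analytics-ready record per line."""
    return "\n".join(json.dumps(record) for record in records)

output = load(transform(extract(RAW_CSV)))
print(output)
```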

DMS - Database Migration Service

Quickly and securely migrate databases to AWS; the service is resilient and self-healing
The source database remains available during the migration
Supports:

Homogeneous migrations: e.g. Oracle to Oracle
Heterogeneous migrations: e.g. Microsoft SQL Server to Aurora

Databases & Analytics Summary in AWS

Relational Database - OLTP: RDS & Aurora (SQL)
Differences between Multi-AZ, Read Replicas, Multi-Region
In-memory Database - ElastiCache
Key/Value Database: DynamoDB (serverless) & DAX (cache for DynamoDB)
Warehouse - OLAP: Redshift (SQL)
Hadoop Cluster: EMR
Athena: query data on Amazon S3 (serverless & SQL)
QuickSight: dashboards on your data (serverless)
DocumentDB: "Aurora for MongoDB" (JSON - NoSQL database)
Amazon Managed Blockchain: managed Hyperledger Fabric & Ethereum blockchains
Glue: Managed ETL (Extract Transform Load) and Data Catalog service
Database Migration: DMS
Neptune: graph database
Timestream: time-series database


AWS Practitioner Certification notes
