Written By Anwar Haq
NoSQL
In Part I of this article, we explored the distinctions between SQL and NoSQL databases, alongside the features and use cases of SQL databases. This second part will focus on NoSQL databases. While NoSQL databases have existed for years, their popularity surged with the advent of cloud computing. This rise is largely attributed to the ease and cost-effectiveness of launching, experimenting with, and discontinuing cloud services. AWS offers a variety of NoSQL database services, each possessing unique features that make it suitable for specific use cases, despite some overarching similarities.
DynamoDB
Amazon DynamoDB is a serverless NoSQL database service, capable of scaling to millions of requests per second. As a schema-less database, it's ideal for storing data in JSON format. When designing tables, we must define a partition key, which is crucial not only for querying but also for horizontally partitioning data across DynamoDB's infrastructure.
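To make the partition key concrete, here is a minimal sketch using the boto3 Python SDK. The UserSessions table and its pk/sk attributes are hypothetical names chosen for illustration, not anything prescribed by DynamoDB.

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table and key names, used only for illustration.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("UserSessions")

# Write one JSON-like item; attributes outside the key schema are free-form.
table.put_item(
    Item={
        "pk": "user#123",            # partition key: determines the physical partition
        "sk": "session#2025-01-15",  # optional sort key: orders items within a partition
        "device": "ios",
        "last_seen": "2025-01-15T10:42:00Z",
    }
)

# Queries target a single partition key value, which keeps reads fast and cheap.
response = table.query(KeyConditionExpression=Key("pk").eq("user#123"))
for item in response["Items"]:
    print(item)

Designing the partition key around the application's access patterns, as above, is what allows DynamoDB to spread data and traffic evenly across its partitions.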
Important Features
Amazon DynamoDB offers several key features that make it a powerful choice for high-performance, scalable applications:
Massive Performance Scale: DynamoDB achieves immense performance, handling millions of requests per second with sub-millisecond response times, by strategically leveraging data partitioning, optimized data distribution, and effective indexing.
Global Tables for Disaster Recovery and Performance: We can use Global Tables to automatically replicate DynamoDB tables across different AWS regions. This provides a robust disaster recovery (DR) strategy and simultaneously enhances application performance by serving global users from a geographically closer region.
Flexible Capacity Modes: DynamoDB tables can be configured to run in one of two capacity modes (a short configuration sketch follows this feature list):
On-Demand Capacity Mode: Ideal for unpredictable or spiky workloads, as it automatically scales to accommodate traffic without requiring capacity planning. While it's generally costlier per request, we only pay for what we use.
Provisioned Capacity Mode: Suitable for predictable workloads, where we specify our expected read and write throughput. To handle any unexpected spikes or changes in demand, this mode can be combined with autoscaling to dynamically adjust provisioned capacity.
Real-time Data Streams: Any changes made to the DynamoDB tables can be captured and streamed in real-time using either DynamoDB Streams or Kinesis Data Streams for DynamoDB. This enables real-time analytics, replication, and integration with other services.
DynamoDB Accelerator (DAX): For applications requiring even faster performance, DAX provides an in-memory caching layer. By configuring DAX, we can achieve microsecond-level response times and significantly reduce the cost of repeated data reads from DynamoDB tables.
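As a rough illustration of how the two capacity modes are chosen at table-creation time, here is a boto3 sketch; the table names and throughput figures are made up for the example.

import boto3

client = boto3.client("dynamodb")

# On-demand capacity: no throughput planning, pay per request.
client.create_table(
    TableName="OrdersOnDemand",  # hypothetical name
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)

# Provisioned capacity: we declare expected read/write throughput up front
# (and can pair the table with Application Auto Scaling to absorb spikes).
client.create_table(
    TableName="OrdersProvisioned",  # hypothetical name
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
)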
Amazon DynamoDB with DAX (source)
Use Cases
DynamoDB is an excellent choice for applications demanding sub-millisecond latency and high scalability. It supports a wide array of use cases, including:
Current State and Checkpointing: Ideal for capturing the real-time state of applications or checkpointing progress during streaming or ETL processes.
User Profiles: Effective for storing user information, enabling tailored marketing campaigns for a large subscriber base.
Session Management: Perfect for managing session data for a high volume of concurrent users, for example tracking player sessions and leaderboards in online gaming.
IoT Data Ingestion: Well-suited for ingesting and storing data from Internet of Things (IoT) devices due to its high write throughput.
Who uses DynamoDB: Over a million customers use DynamoDB, including McAfee, Genesys, and Venmo.
Limitations to Consider
While Amazon DynamoDB excels in many areas, it's essential to be aware of its specific limitations:
Item Size Limit: Each individual item (row) stored in DynamoDB cannot exceed 400 KB in size. This includes the size of both the attribute names and their values.
Costly Full Scans: Performing a full scan of a large DynamoDB table is highly inefficient and costly. To maintain performance and manage expenses, the table and index designs must ensure that queries retrieve only a small, targeted number of items.
No Built-in Joins: DynamoDB does not have a native mechanism for performing table joins. This is a critical architectural difference from relational databases and emphasises the importance of defining our application's access patterns in advance so we can design tables to denormalize data and avoid the need for complex joins at query time.
Limited Transaction Support: While DynamoDB offers transactional APIs, its support for complex, multi-item, or multi-table transactions is more limited compared to traditional relational databases. Applications requiring strong ACID properties across many items or tables might need careful design consideration.
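To illustrate the shape of that transactional support, below is a small boto3 sketch; the Orders and Inventory tables, their keys, and the condition are hypothetical. All actions in the list succeed or fail together, but a single transaction is limited to a bounded number of items, unlike arbitrarily large relational transactions.

import boto3

client = boto3.client("dynamodb")

# Place an order and decrement stock atomically; if the stock check fails,
# neither write is applied.
client.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "Orders",
                "Item": {"pk": {"S": "order#42"}, "order_status": {"S": "PLACED"}},
            }
        },
        {
            "Update": {
                "TableName": "Inventory",
                "Key": {"pk": {"S": "sku#9"}},
                "UpdateExpression": "SET stock = stock - :one",
                "ConditionExpression": "stock >= :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
)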
DocumentDB
Amazon DocumentDB is a NoSQL document database service with MongoDB compatibility. Like MongoDB, it stores data as JSON documents, and we can connect to it using standard MongoDB drivers. DocumentDB is available in both fully managed (instance-based) and serverless deployments.
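As a quick illustration of the MongoDB-driver compatibility, the following pymongo sketch connects to a DocumentDB cluster and works with JSON documents. The endpoint, credentials, database, and collection names are placeholders, and the TLS options shown are commonly used ones rather than a definitive configuration.

from pymongo import MongoClient

# Placeholder endpoint and credentials; DocumentDB is typically reached over TLS
# with the Amazon CA bundle and retryable writes disabled.
client = MongoClient(
    "mongodb://myuser:mypassword@my-docdb-cluster.cluster-xxxx.us-east-1.docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
)

db = client["catalog"]        # hypothetical database
products = db["products"]     # hypothetical collection

# Insert and read JSON documents exactly as we would with MongoDB.
products.insert_one({"sku": "A-100", "name": "Espresso Machine", "price": 249.0})
print(products.find_one({"sku": "A-100"}))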
Important Features
Amazon DocumentDB offers several key features that provide high availability, scalability, and performance:
High Durability and Availability: Similar to Amazon Aurora, DocumentDB clusters store six copies of data across three different Availability Zones (AZs). This architecture ensures high data availability and fault tolerance, protecting the data against failures.
Read Replicas: Fully managed (instance-based) DocumentDB clusters can support up to 15 read replicas. These replicas efficiently offload read traffic, significantly improving our application's read throughput and scaling capabilities.
Scalability Options:
DocumentDB Elastic Clusters: These clusters are designed for massive workloads, capable of handling millions of reads and writes per second and providing petabytes of storage capacity.
Provisioned Clusters: Standard provisioned DocumentDB clusters support up to 128 TiB of storage.
Global Clusters for Disaster Recovery: DocumentDB supports Global Clusters, enabling us to create one primary cluster and up to five secondary clusters in different AWS Regions. This setup significantly simplifies disaster recovery and provides low-latency reads for globally distributed applications.
Persistent Page Cache: DocumentDB maintains its page cache in a separate process, distinct from the database engine itself. This means the cache survives database failures. When the database instance restarts or comes back online, its buffer cache is already "warm," allowing for immediate high performance without a lengthy warm-up period.
Amazon DocumentDB Architecture (source)
Use Cases
Amazon DocumentDB is an excellent choice for a variety of scenarios:
Document Store: It excels at storing diverse information like patient records, exam results, or course details.
E-commerce Platforms: Ideal for managing user profiles, product catalogs, orders, and customer reviews.
Content Management: Effectively handles various content types, including articles, metadata, and social media posts.
Recommendation Engines: Perfect for storing user preferences, behavior patterns, and historical data to power personalized recommendations.
Homogeneous Migration: Provides a straightforward path for migrating existing on-premises MongoDB deployments to a fully managed cloud environment.
Who uses DocumentDB: Customers include Dow Jones, the BBC, and FINRA.
Limitations to Consider
While Amazon DocumentDB offers many advantages, it's important to understand its specific limitations:
Document Size Limit: A single document in DocumentDB has a maximum size of 16 MB. While significantly larger than DynamoDB's 400 KB limit, this is not unlimited, and we'll need to consider how to manage very large embedded objects or arrays.
Limited Join Support: Although DocumentDB provides the $lookup operator for joining collections, it has limitations compared to traditional relational database joins. For example, it does not support correlated subqueries within $lookup operations, which can restrict certain complex query patterns.
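For illustration, here is what a simple, non-correlated $lookup might look like with pymongo; the orders and customers collections and the customer_id field are hypothetical.

from pymongo import MongoClient

# Reuse a DocumentDB connection as in the earlier sketch; the URI is a placeholder.
db = MongoClient("mongodb://user:pass@my-docdb-endpoint:27017/?tls=true")["sales"]

# Join each order to its customer document by equality on a single field.
pipeline = [
    {
        "$lookup": {
            "from": "customers",
            "localField": "customer_id",
            "foreignField": "_id",
            "as": "customer",
        }
    },
    {"$unwind": "$customer"},
]

for doc in db["orders"].aggregate(pipeline):
    print(doc["_id"], doc["customer"]["name"])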
Keyspaces
Amazon Keyspaces is a serverless, wide-column NoSQL database service with Apache Cassandra compatibility. It provides virtually limitless throughput and storage. Unlike database systems that support a single writer with multiple readers, Keyspaces (like Cassandra) supports multiple writers, which makes it a good option for ingesting massive amounts of data. As the name implies, Keyspaces organizes tables into groups called keyspaces; a keyspace groups related tables that are relevant for one or more applications.
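The sketch below shows what working with Keyspaces might look like through the open-source Cassandra Python driver and CQL. The keyspace, table, and column names are hypothetical, and the connection is deliberately simplified: a real Keyspaces connection requires TLS on port 9142 plus SigV4 or service-specific credentials, which are omitted here.

from datetime import datetime, timezone
from cassandra.cluster import Cluster

# Simplified connection; production code adds TLS and authentication.
cluster = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142)
session = cluster.connect()

# A keyspace groups related tables; "iot" and "sensor_readings" are hypothetical.
session.execute("""
    CREATE TABLE IF NOT EXISTS iot.sensor_readings (
        device_id    text,
        reading_time timestamp,
        temperature  double,
        PRIMARY KEY ((device_id), reading_time)
    )
""")

# Many writers can insert into the same table in parallel.
session.execute(
    "INSERT INTO iot.sensor_readings (device_id, reading_time, temperature) "
    "VALUES (%s, %s, %s)",
    ("sensor-17", datetime.now(timezone.utc), 21.4),
)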
Important Features
Amazon Keyspaces offers several key features that make it a highly scalable, available, and resilient choice for demanding workloads:
Massive Parallel Writes: Due to its underlying architecture (inspired by Apache Cassandra), Amazon Keyspaces is designed to enable massive parallel writes. This allows for extremely high write throughput, accommodating applications with very high data ingestion rates.
Multi-Region Replication (Active-Active): Amazon Keyspaces provides multi-region replication, implementing a robust data resiliency architecture. Data is actively replicated, asynchronously and typically within a second, across independent and geographically separate AWS Regions. This means each Region can perform both reads and writes in isolation, offering continuous availability and low-latency access for global users.
High Durability and Availability: For built-in durability and high availability, Amazon Keyspaces automatically stores three copies of data across multiple Availability Zones (AZs) within a single region.
Point-in-Time Recovery (PITR): We can restore a Keyspaces table to any specific moment in time within the last 35 days. This point-in-time recovery capability is accessible via the AWS Management Console, AWS SDKs, AWS CLI, or through CQL (Cassandra Query Language).
 Amazon Keyspaces Architecture (source)
Use Cases
Amazon Keyspaces is a good choice for a number of use cases like:
Large, Fast-Growing Datasets: Ideal for applications managing extensive and rapidly expanding data, such as log analytics or clickstream analysis.
IoT Data Streams: Well-suited for ingesting continuous data streams from IoT devices, scaling horizontally as more devices are added.
Content Delivery Networks (CDNs): Effective for caching frequently accessed content globally, as seen in services like Netflix video streaming or YouTube content caching.
Migration from Apache Cassandra: Provides a straightforward path for migrating existing Apache Cassandra workloads to a fully managed cloud service.
Who uses Keyspaces: Customers include Intuit and BankBazaar.
Limitations to Consider
Following are some limitations to consider when working with Amazon Keyspaces:
Limited Cassandra Compatibility: Although Amazon Keyspaces is compatible with Apache Cassandra's API, it doesn't support every Cassandra feature or function. This means that if our existing Cassandra application relies on certain specialised or less common APIs, we might need to adjust code during migration.
Maximum Row Size: Each individual row in an Amazon Keyspaces table is limited to a maximum size of 1 MB. While this is substantial for many use cases, it's crucial to design the data model to avoid exceeding this limit, especially for very wide rows or rows containing large embedded objects.
No Table Joins (CQL): Amazon Keyspaces uses Cassandra Query Language (CQL), which, despite its SQL-like syntax, does not support table joins. This fundamental difference from relational databases means we must design tables to denormalise data and anticipate the application's access patterns upfront, rather than relying on joining data at query time.
Neptune
Amazon Neptune is a high-performance, purpose-built graph database service available in both fully managed and serverless deployments. It stores graph data as vertices (data items) and edges, where the types of edges define the relationships between vertices. Neptune allows us to build various types of graphs, including knowledge graphs, fraud detection graphs, and social networking graphs.
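As a rough sketch of the vertex-and-edge model, the following gremlinpython snippet adds two vertices and a relationship, then traverses it. The endpoint, labels, and property names are hypothetical, and because Neptune is only reachable from within its VPC, the code would need to run from inside that network (for example, on an EC2 instance or Lambda function).

from gremlin_python.driver.client import Client

# Placeholder Neptune endpoint; 8182 is the default port.
gremlin = Client(
    "wss://my-neptune-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "g",
)

# Create a person, a product, and a "purchased" edge between them.
gremlin.submit(
    "g.addV('person').property('name','Alice').as('a')"
    ".addV('product').property('name','Espresso Machine').as('p')"
    ".addE('purchased').from('a').to('p')"
).all().result()

# Follow the relationship: what has Alice purchased?
names = gremlin.submit(
    "g.V().has('person','name','Alice').out('purchased').values('name')"
).all().result()
print(names)

gremlin.close()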
Important Features
Amazon Neptune offers several key features that ensure high performance, availability, and advanced capabilities for graph databases:
Highly Durable Storage: Neptune stores six copies of the data across three different Availability Zones (AZs) within its cluster volumes. This data is typically replicated with a latency of less than 100 milliseconds, ensuring high durability and fault tolerance.
Low-Latency Read Replicas: We can create up to 15 read replicas across multiple AZs. These replicas provide very low-latency read access because they share the same underlying storage volume as the primary instance.
Global Database for Disaster Recovery: Neptune Global Database allows us to set up highly available, disaster-proof graph databases. We can create one primary cluster and up to five secondary clusters in different AWS Regions, with each secondary cluster supporting up to 16 read replicas.
Automatic Indexing: Neptune automatically indexes all graph data. This means we don't need to manually configure indexes; the service optimizes query performance out-of-the-box.
Neptune ML for Graph Analytics: With Neptune ML, we can build and train machine learning models directly on graph data. This enables advanced analytics such as predicting new relationships with Graph Neural Networks (GNNs) or identifying patterns using Knowledge Graph Embeddings.
 Amazon Neptune ML Workflow (source)
Use Cases
Amazon Neptune is a versatile graph database capable of implementing various types of graphs for diverse use cases:
Knowledge Graphs: Ideal for storing interconnected information to answer complex queries and derive insights from structured and semi-structured data.
Identity Graphs: Used to map relationships between different entities, such as customers and their preferred products, or social connections. This information can then be powerfully leveraged to generate personalized recommendations.
Fraud Graphs: Effective for building networks of activities, like credit card purchases, to detect suspicious transactions and patterns indicative of fraud.
Medical Research: Valuable for mapping complex genomic data, identifying disease correlations, understanding drug interactions, and accelerating research discoveries.
Migration: Provides a straightforward path for migrating data from self-managed graph databases like Neo4j and Apache TinkerPop Gremlin.
Who uses Neptune: Customers include ADP, MovieStarPlanet, and Marinus Analytics.
Limitations to Consider
While Amazon Neptune is an excellent choice for many graph database scenarios, it's important to understand its specific limitations:
Not for OLTP or OLAP Workloads: Neptune is a purpose-built graph database; it's not designed for Online Transaction Processing (OLTP) workloads that require frequent, small, atomic transactions, nor for Online Analytical Processing (OLAP) workloads that involve complex aggregations on large, structured datasets.
No Manual Index Control: Neptune automatically indexes graph data. This simplifies management, but users cannot manually create or drop indexes, which might limit fine-tuned optimization for very specific query patterns.
VPC-Only Access: Amazon Neptune is designed to be accessed only from within an Amazon Virtual Private Cloud (VPC). This means we cannot connect to it directly from the public internet without setting up secure network configurations like a VPN or a bastion host.
OpenSearch
Amazon OpenSearch is a service designed for storing and analyzing large volumes of log data, scaling up to 25 petabytes. Available in both fully managed and serverless implementations, OpenSearch stores data as JSON documents. These documents are then indexed to enable various types of searches, including exact matches, close matches, and vector searches.
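Below is a minimal sketch with the opensearch-py client that indexes one JSON document and runs a full-text match query against it; the domain endpoint, credentials, index name, and fields are placeholders for illustration.

from opensearchpy import OpenSearch

# Placeholder domain endpoint and credentials.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

# Index a JSON document; refresh=True makes it searchable immediately
# (handy for a demo, not for bulk ingestion).
client.index(
    index="app-logs",
    body={"service": "checkout", "level": "ERROR", "message": "payment gateway timeout"},
    refresh=True,
)

# Full-text search over the indexed documents.
result = client.search(
    index="app-logs",
    body={"query": {"match": {"message": "timeout"}}},
)
for hit in result["hits"]["hits"]:
    print(hit["_source"])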
Important Features
Amazon OpenSearch offers several key features for scalable search and analytics:
Massive Scalability and Cost-Effective Storage: OpenSearch lets us store up to 25 petabytes of data across more than a thousand nodes. To save storage costs, it provides UltraWarm and cold storage tiers specifically for read-only data. We can also roll up indexes into more condensed forms for further storage efficiency.
Integrated Visualisation: We can visualise our data directly using OpenSearch Dashboards, which is built into the service.
Real-time Data Streaming: OpenSearch integrates seamlessly with Amazon Kinesis, allowing us to stream large volumes of data into our OpenSearch domains in real time.
Cross-Cluster Replication: For disaster recovery or distributing search workloads, cross-cluster replication enables us to replicate indexes from one OpenSearch domain to another.
Machine Learning for Enhanced Search: OpenSearch supports integrating machine learning models to improve search relevance and capabilities. We can either load these models directly into OpenSearch or call them from a remote platform.
Use Cases
Amazon OpenSearch is a versatile service well-suited for a variety of use cases, including:
Log Analytics: Efficiently store and analyze large volumes of logs, errors, and events for operational insights and troubleshooting.
Full-Text Search: Implement powerful search functionalities, similar to those found in Google or Wikipedia, for applications with vast amounts of textual data.
Observability: Achieve real-time monitoring of system performance, application errors, and threat detection, providing immediate insights into operational health.
Recommendation Engines: Quickly search through product catalogs and user preferences to generate personalized suggestions for e-commerce platforms or content services.
Geospatial Search: Perform location-based queries to find nearby places, services, or assets, enabling features for mapping and location-aware applications.
Amazon OpenSearch (source)
Who uses OpenSearch: Customers include Aro and Deputy.
Limitations to Consider
Amazon OpenSearch is not suitable for OLTP workloads.
Timestream
Timestream for LiveAnalytics will not be available for new customers, effective 20 June 2025.
When it comes to managing vast amounts of time-series data, Amazon offers a specialised database service called Amazon Timestream. This service is designed for efficient storage and analysis of the massive volumes of time-series data generated daily.
Amazon Timestream provides two distinct options:
Amazon Timestream for LiveAnalytics (serverless): This serverless option is designed for real-time analytics.
Amazon Timestream for InfluxDB (provisioned): This provisioned service is ideal for users familiar with InfluxDB.
As Timestream for LiveAnalytics is no longer available for new customers, our focus will be on the features and uses of Timestream for InfluxDB. Timestream for InfluxDB is built upon version 2.x of the open-source InfluxDB database. It fully supports Flux and InfluxQL for querying data (SQL is supported from InfluxDB version 3 onwards). Data is written to Timestream for InfluxDB using the Line Protocol format, which looks like this:
measurement,tagsets fieldsets timestamp
In this format, a point comprises a measurement name, tag set, field set, and timestamp. A collection of these points then forms a time series.
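For illustration, the sketch below writes one such point with the influxdb-client Python library; the endpoint, token, organization, and bucket are placeholders, and the comment shows the equivalent line protocol.

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details for a Timestream for InfluxDB instance.
client = InfluxDBClient(
    url="https://my-influxdb-endpoint:8086",
    token="my-token",
    org="my-org",
)
write_api = client.write_api(write_options=SYNCHRONOUS)

# Equivalent to the line protocol point:
#   cpu,host=web-01 usage_percent=42.5 <timestamp>
point = Point("cpu").tag("host", "web-01").field("usage_percent", 42.5)
write_api.write(bucket="metrics", record=point)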
Amazon Timestream-InfluxDB (source)
Important Features
Amazon Timestream for InfluxDB offers a robust set of features for managing and analyzing time-series data:
High Ingestion Rates: Depending on the chosen instance size, Timestream for InfluxDB can handle the ingestion of hundreds of thousands of records per second, making it suitable for high-volume data streams.
Data Organization and Retention:
Data within InfluxDB is organized into buckets.
We can define lifecycle rules at the bucket level to automatically purge or archive old data, helping manage storage costs and data relevance.
Buckets are further grouped under an Organization, providing a logical separation for different teams or projects.
Nanosecond Precision: InfluxDB supports storing data with nanosecond precision, which is crucial for applications requiring extremely granular time-series analysis, such as high-frequency sensor readings or financial data.
Seamless Grafana Integration: Timestream for InfluxDB integrates easily with Grafana, a popular open-source platform for data visualization and dashboards. This allows users to create rich, interactive dashboards to monitor and analyze their time-series data.
Flexible Data Ingestion with Telegraf: Data can be efficiently populated into InfluxDB using Telegraf, an open-source data collection agent developed by InfluxData. Telegraf supports a wide array of input plugins, enabling it to collect data from various sources and send it to Timestream for InfluxDB.
Use Cases
Amazon Timestream for InfluxDB is a highly versatile database service, purpose-built for efficient storage and analysis of time-series data across a wide range of applications. Its capabilities make it ideal for the following types of data and scenarios:
System and Infrastructure Monitoring: Collecting and analyzing operational metrics such as CPU utilization, memory consumption, storage I/O, and network activity from servers, virtual machines, and containers. This enables real-time performance monitoring, troubleshooting, and capacity planning.
Industrial Telemetry and Asset Management: Storing and processing data from industrial sensors and equipment (e.g., temperature, pressure, vibration, flow rates). This facilitates predictive maintenance, operational efficiency improvements, and streamlined equipment management in manufacturing, energy, and other industrial sectors.
Internet of Things (IoT) Data: Ingesting and analyzing high-volume data streams from IoT sensors and devices. This includes environmental data, smart home device metrics, vehicle telemetry, and more, enabling real-time insights and automated actions.
User Behavior and Website Analytics (Clickstream Data): Capturing and analyzing user interactions on websites and applications, such as page views, clicks, scrolls, and session durations. This provides valuable insights into user behavior, content popularity, and website performance.
Financial Market Data: Storing and querying historical and real-time stock prices, trading volumes, and other financial instrument data over time. This supports financial analysis, algorithmic trading, and market research.
Who uses Timestream: Timestream users include companies like Autodesk, CleanAir, and GrafanaLabs (link).
Limitations to Consider
While Amazon Timestream for InfluxDB offers powerful capabilities for time-series data, it's important to be aware of certain limitations when planning your implementation:
Version and Query Language: Amazon Timestream for InfluxDB is currently based on InfluxDB version 2.x. A key point to note is that InfluxDB version 2.x does not natively support SQL for querying data. SQL support was introduced starting with InfluxDB version 3.x. Therefore, users will primarily rely on Flux or InfluxQL for data interaction.
Maximum Storage Capacity: The current maximum storage capacity for a single Timestream for InfluxDB instance is 16 TiB. For use cases requiring storage beyond this limit, you would need to consider sharding your data across multiple instances or exploring alternative architectural patterns.
Immutable Compute and Storage: Currently, Amazon Timestream for InfluxDB does not allow you to modify the compute or storage configurations of existing instances after they have been created.
S3
Amazon Simple Storage Service (S3) is one of the oldest and most foundational services provided by Amazon Web Services (AWS). It is an object storage service designed for highly scalable, secure, and durable data storage. Despite not being a database itself, S3 is a vital component in many modern database architectures. S3 is a global service, meaning that data stored in S3 buckets can be accessed from anywhere in the world, provided the necessary permissions are granted.
Important Features
Amazon S3 offers a comprehensive set of features designed for robust, scalable, and cost-effective object storage:
Exceptional Durability and Availability: S3 provides industry-leading durability of 99.999999999% (11 nines), meaning our data is highly protected against loss. It also offers high availability of 99.99%, ensuring the data is accessible when we need it. These high levels of durability and availability are achieved through redundant storage across multiple facilities within an AWS Region.
Robust Security and Access Control: S3 is inherently very secure. We have fine-grained control over data access using access policies, which can be defined at various levels, from broad permissions to highly specific object-level controls. To enhance security further, S3 supports various encryption options for data both in transit and at rest, including server-side encryption with S3-managed keys, AWS Key Management Service (KMS), or customer-provided keys.
Lifecycle Management: S3 offers several storage classes, each optimised for different access patterns and cost requirements (e.g., S3 Standard, S3 Intelligent-Tiering, S3 Glacier). We can leverage S3 Lifecycle rules to automatically transition objects between these storage classes based on predefined policies (e.g., move data to a colder tier after 30 days) and ultimately expire them when no longer needed, optimising storage costs; a brief lifecycle-rule sketch follows this feature list.
Regional Buckets and Cross-Region Replication: While S3 is considered a global service in terms of its management plane, buckets are created within a specific AWS Region. For disaster recovery, data sovereignty, or low-latency access from different geographical locations, the Cross-Region Replication (CRR) feature allows us to automatically replicate data between buckets located in two different AWS Regions.
High-Performance Data Lakes: S3 Tables, built on the Apache Iceberg open table format, provide specialised storage designed for high-performance data processing. They offer ACID-compliant storage, crucial for maintaining data consistency, and Apache Iceberg enables robust data management features such as schema evolution, partition evolution, and time travel for data lakes.
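Returning to lifecycle management, here is a brief boto3 sketch of a rule that transitions and expires objects; the bucket name, prefix, transition tier, and day counts are illustrative choices rather than recommendations.

import boto3

s3 = boto3.client("s3")

# Move objects under exports/ to a colder tier after 30 days and expire them
# after a year; all names and numbers here are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)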
Use Cases
Amazon S3's versatility and robust features make it suitable for a wide array of use cases, particularly in data management and analytics. Here are some of its most common applications:
Long-Term Storage: Perhaps its most prevalent use case in the context of databases, S3 serves as an ideal, cost-effective, and highly durable solution for long-term storage of database backups and archival data. Its extreme durability ensures that our critical backups are safely preserved.
Data Lake Foundation: S3 is widely recognized as the foundation for data lakes, capable of storing vast quantities of structured, semi-structured (e.g., JSON, XML), and unstructured (e.g., images, videos, logs) data in its native format. Services like AWS Lake Formation and Amazon EMR (for big data processing) frequently use S3 as their primary storage layer for building and managing data lakes.
Querying with Amazon Athena: S3 is deeply integrated with Amazon Athena, a serverless query service that allows us to analyse data directly in S3 using standard SQL. This enables data analysts and data scientists to run ad-hoc queries on large datasets stored in S3 without needing to load them into a separate database (see the sketch below).
Extended Storage for Databases: Many database management systems can leverage S3 as an extended storage layer. For example, Amazon Redshift (a data warehouse service) can query data stored in S3 using Redshift Spectrum. This allows users to combine data from their Redshift clusters with vast datasets in S3, providing a unified analytics experience without needing to move all data into Redshift.
Storing Binary Data: For applications dealing with binary large objects (BLOBs) such as images, videos, documents, or audio files, it's often more efficient to store the binary data directly in S3, while only keeping the metadata (e.g., file path, size, type) in a traditional database. This approach reduces the load on the database, improves database performance, and leverages S3's cost-effectiveness and scalability for large file storage.
Amazon S3 with Athena and Lake Formation (source)
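As a small illustration of the Athena integration mentioned above, the boto3 sketch below starts a query against data in S3; the database, table, and result location are hypothetical.

import boto3

athena = boto3.client("athena")

# Athena reads the underlying files directly from S3 and writes results back to S3.
run = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "access_logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/queries/"},
)
print("Query execution ID:", run["QueryExecutionId"])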
Who uses S3: Millions of customers, including Salesforce, Siemens, and Bloomberg.
Limitations to Consider
While Amazon S3 is a highly versatile and powerful storage service with a broad range of applications, it's important to understand its inherent limitations to ensure it aligns with your specific workload requirements.
Not a Traditional Database or Block Storage: Amazon S3 functions as an object storage service, not a block storage solution or a relational/NoSQL database. This means it's optimised for storing discrete, complete data objects (like files) rather than providing raw block-level access or complex database querying capabilities. Latency is also higher than a purpose-built database's: typical response times are in the tens to hundreds of milliseconds.
Request Rate Considerations: While S3 is designed for massive scale, it has specific request rate limits that can impact performance if exceeded. Currently, the documented limits are approximately 3,500 PUT, POST, or DELETE requests per second per prefix, and 5,500 GET or HEAD requests per second per prefix. Effective data organization and request distribution across prefixes are crucial for high-throughput workloads.
Immutable Bucket Properties: Once an S3 bucket has been created, its name cannot be changed, and its AWS Region cannot be altered. If you need to rename a bucket or relocate its data to a different AWS Region, you must create an entirely new bucket with the desired name or in the new region, and then transfer all your existing data to it.