Designing a scalable AWS database service using serverless compute, tiered storage, and Redis caching.

Modern SaaS platforms often face a tricky challenge early on: how do you manage thousands of customer databases without exploding infrastructure costs?
The default answer is usually a managed database per tenant. But that approach quickly becomes expensive, operationally complex, and difficult to scale. Instead, I (along with my teammates at NYU) explored a different design: building a serverless, multi-tenant database platform using lightweight databases and cloud-native storage.
The goal was simple: design a system that provides strong tenant isolation, scalable migrations, and high availability while keeping costs dramatically lower than traditional managed databases.
Most SaaS platforms start with a simple architecture: one database per customer or one large shared database.
Both approaches have tradeoffs.
Provisioning a dedicated database instance per tenant offers strong isolation but becomes extremely expensive at scale. Running thousands of database instances means paying for compute, storage, and replication even when tenants are idle.
Shared databases reduce cost but introduce operational complexity:

- difficult schema migrations
- tenant isolation challenges
- noisy neighbor performance issues
Traditional cloud databases solve some of these problems, but the cost curve grows quickly. For example, provisioning managed databases for thousands of tenants can cost tens of thousands of dollars per month.
This led to an interesting question:
What if each tenant database were just a lightweight file, orchestrated by serverless infrastructure?
That idea became the foundation for the architecture.
The system is built entirely with AWS serverless components and distributed storage, focusing on three goals:

- tenant isolation
- low operational cost
- elastic scaling
Each tenant is provisioned as a separate SQLite database file (.db) stored in object storage, while orchestration is handled through serverless APIs and metadata services.
The architecture is built using a set of AWS services that together provide tenant provisioning, query execution, schema migrations, replication, and storage tiering.
API Gateway acts as the public entry point to the system. All client requests such as tenant creation, query execution, and schema migrations pass through API Gateway before being routed to backend Lambda functions.
Lambda functions serve as the compute layer of the platform. Different Lambda handlers are responsible for tasks such as executing queries, provisioning new tenant databases, performing schema migrations, managing replication, and handling storage tier transitions.
DynamoDB stores system metadata including tenant identifiers, API keys, schema versions, database locations, and migration state. This metadata layer allows the platform to orchestrate thousands of tenant databases efficiently.
S3 stores SQLite database files and schema templates. It also acts as cold storage for inactive tenants, allowing the system to store large numbers of databases cheaply while maintaining durability.
EFS is used as the hot storage layer for active tenant databases. Because it provides low-latency shared filesystem access, Lambda functions can read and write tenant database files directly during query execution.
Redis is used as an in-memory query result cache. Frequently executed read queries are temporarily cached, allowing the system to return results without repeatedly accessing the underlying database files.
SQS queues coordinate asynchronous workflows such as schema migrations and replication updates. FIFO queues ensure migrations for a tenant are processed sequentially, preventing concurrent modification of the same database.
SNS is used to broadcast database snapshot events after write operations. These events trigger downstream replication tasks, allowing multiple replicas to update asynchronously.
Route 53 manages DNS routing and health checks for the system’s API endpoints. If the primary endpoint fails, Route 53 automatically redirects traffic to a secondary region to maintain availability.
EventBridge triggers scheduled background jobs such as the cold storage manager, which periodically scans tenant activity and moves inactive databases from EFS to S3.
CloudWatch provides logging, metrics, and monitoring for the platform. It tracks system health, storage transitions, migration execution, and overall request latency.
Together, these services create a fully serverless control plane for database provisioning and management.
When a new tenant is created:

1. A request hits API Gateway
2. A Lambda function provisions a new SQLite database
3. Schema templates are applied
4. Metadata is stored in DynamoDB
5. The database file is uploaded to S3
This design provides complete tenant isolation, since each tenant operates on its own database file.
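The provisioning flow can be sketched in a few lines of Python. This is a simplified local sketch, not the project's actual handlers: a dict stands in for the DynamoDB metadata table, a temp directory stands in for storage, and the schema template and `provision_tenant` name are illustrative assumptions.

```python
import os
import sqlite3
import tempfile
import uuid

# Illustrative schema template; in the real system these live in S3.
SCHEMA_TEMPLATE = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT);
"""

metadata = {}  # stands in for the DynamoDB metadata table

def provision_tenant(storage_dir: str) -> str:
    """Create an isolated SQLite file for a new tenant and record its metadata."""
    tenant_id = uuid.uuid4().hex
    db_path = os.path.join(storage_dir, f"{tenant_id}.db")
    conn = sqlite3.connect(db_path)       # creates the .db file
    conn.executescript(SCHEMA_TEMPLATE)   # apply the schema template
    conn.close()
    metadata[tenant_id] = {"db_path": db_path, "schema_version": 1, "tier": "HOT"}
    return tenant_id

storage = tempfile.mkdtemp()
tid = provision_tenant(storage)
print(metadata[tid]["tier"])  # prints "HOT"
```

Because every tenant gets its own file, isolation falls out of the filesystem rather than row-level access control.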
One of the key challenges in multi-tenant systems is managing storage cost. If every tenant database lives on low-latency storage, the cost grows linearly with the number of tenants.
To solve this, the system uses two storage tiers:

| Storage | Purpose |
|---|---|
| Amazon EFS | Hot storage for active tenants |
| Amazon S3 | Cold storage for inactive tenants |
Active tenants are stored on EFS, which provides low-latency NFS access suitable for transactional workloads. Typical read latency is around 1–2 ms.
Inactive tenants are moved to S3, which provides durable and significantly cheaper object storage, but with higher access latency around 100–200 ms when databases need to be restored.
This separation allows the system to optimize both performance and cost.
One of the hardest problems in multi-tenant systems is schema evolution.
Updating schemas across thousands of databases can easily introduce downtime or inconsistent states.
To solve this, the system implements a queue-based migration pipeline.
Migration requests are first processed by a handler Lambda, which prepares migration tasks and sends them to a FIFO SQS queue. A separate worker Lambda consumes those tasks sequentially.
This architecture ensures:

- migrations are processed in order
- no two workers update the same tenant simultaneously
- failures are isolated per tenant
Supported migration operations include:

- creating tables
- renaming tables
- adding columns
- dropping tables
More complex operations are intentionally avoided to keep migrations safe and predictable.
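A worker consuming these tasks can translate each supported operation into a single SQLite DDL statement. The task dict shape and function names below are illustrative assumptions, not the system's actual message format:

```python
import sqlite3

def migration_sql(task: dict) -> str:
    """Translate one migration task into a SQLite DDL statement."""
    op = task["op"]
    if op == "create_table":
        cols = ", ".join(f"{name} {ctype}" for name, ctype in task["columns"])
        return f"CREATE TABLE {task['table']} ({cols})"
    if op == "rename_table":
        return f"ALTER TABLE {task['table']} RENAME TO {task['new_name']}"
    if op == "add_column":
        return f"ALTER TABLE {task['table']} ADD COLUMN {task['column']} {task['ctype']}"
    if op == "drop_table":
        return f"DROP TABLE {task['table']}"
    raise ValueError(f"unsupported migration op: {op}")

def apply_migrations(db_path: str, tasks: list[dict]) -> None:
    """Apply a tenant's migration tasks in order, in a single transaction."""
    conn = sqlite3.connect(db_path)
    with conn:  # one transaction: a failure rolls back and stays isolated per tenant
        for task in tasks:
            conn.execute(migration_sql(task))
    conn.close()
```

Keeping the operation set this small is what makes the translation a one-liner per case; anything requiring a table rebuild (e.g. dropping a column on older SQLite) would complicate the worker considerably.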
To ensure resilience, the system implements read replicas and automatic failover.
After every write operation:

1. A snapshot of the database is created
2. Metadata about the snapshot is published to an SNS topic
3. Multiple SQS queues fan out replication jobs
4. Lambda workers update read replicas asynchronously
This approach decouples replication from the write path, keeping write latency low while maintaining eventual consistency.
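The snapshot event only needs enough metadata for replica workers to fetch and apply the new copy. A minimal sketch of the fan-out, with plain lists standing in for SNS and the SQS queues (all field names here are illustrative assumptions):

```python
import json
import time

replica_queues = {"replica-1": [], "replica-2": []}  # stand-ins for SQS queues

def publish_snapshot_event(tenant_id: str, snapshot_key: str, version: int) -> dict:
    """Build a snapshot event and fan it out to every replica queue."""
    event = {
        "tenant_id": tenant_id,
        "snapshot_key": snapshot_key,  # e.g. an S3 object key for the snapshot
        "version": version,            # lets workers discard stale snapshots
        "published_at": time.time(),
    }
    message = json.dumps(event)
    for queue in replica_queues.values():  # SNS -> SQS fan-out
        queue.append(message)
    return event

publish_snapshot_event("tenant-42", "snapshots/tenant-42/v7.db", 7)
```

Carrying a monotonically increasing version in the event is what makes the asynchronous workers safe: a replica that sees version 7 after already applying version 8 can simply drop the message.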
To handle failures, Route 53 health checks monitor the primary API endpoint. If a failure occurs, traffic is automatically routed to a standby endpoint in another region.
The result is a system that remains available even during infrastructure outages.
Storage cost becomes a major issue when you manage millions of tenant databases.
Instead of storing everything in a single storage layer, the system uses tiered storage:

| Storage Layer | Purpose |
|---|---|
| S3 | Cold storage for inactive tenants |
| EFS | Hot storage for frequently accessed tenants |
| Redis | In-memory cache for repeated queries |
Inactive tenant databases are automatically moved to S3 to reduce cost. When they become active again, they are rehydrated back into EFS.
On top of that, a Redis caching layer stores results of frequently executed queries. Cache hits can return results in under one millisecond, reducing load on the storage layer.
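The caching layer follows the standard cache-aside pattern. In the sketch below a dict stands in for Redis and sqlite3 for the tenant database; a production version would use a Redis client such as redis-py with a TTL on each key.

```python
import json
import sqlite3

cache = {}  # stands in for Redis; a real client would also set a TTL per key

def cached_query(conn: sqlite3.Connection, tenant_id: str, sql: str):
    """Cache-aside read: check the cache first, fall back to the database."""
    key = f"{tenant_id}:{sql}"      # cache key scoped per tenant
    if key in cache:                # cache hit: skip the database entirely
        return json.loads(cache[key])
    rows = [list(r) for r in conn.execute(sql)]
    cache[key] = json.dumps(rows)   # cache miss: store the serialized result
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")
cached_query(conn, "t1", "SELECT id, email FROM users")  # first call hits SQLite
cached_query(conn, "t1", "SELECT id, email FROM users")  # second call is served from cache
```

Note that writes would need to invalidate (or version) the tenant's cached keys, otherwise reads after a write could return stale results.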
A scheduled Lambda function periodically scans tenant metadata stored in DynamoDB. If a tenant database has not been accessed within a configurable threshold (for example 30 days), it is moved to cold storage.
The workflow looks like this:

1. A scheduled Lambda checks the last_accessed timestamp for each tenant.
2. Databases that exceed the inactivity threshold are uploaded from EFS → S3.
3. The database is removed from EFS to free up space.
4. Metadata is updated to mark the tenant as COLD.
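The demotion scan itself is a small piece of logic over tenant metadata. Below, a dict stands in for the DynamoDB table and the actual EFS→S3 copy is elided; the 30-day threshold is the configurable value mentioned above, and the field names are illustrative:

```python
from datetime import datetime, timedelta, timezone

INACTIVITY_THRESHOLD = timedelta(days=30)  # configurable, per the text

def find_cold_candidates(tenants: dict, now: datetime) -> list:
    """Return ids of hot tenants whose last access exceeds the threshold."""
    return [
        tid for tid, meta in tenants.items()
        if meta["tier"] == "HOT"
        and now - meta["last_accessed"] > INACTIVITY_THRESHOLD
    ]

now = datetime.now(timezone.utc)
tenants = {
    "t1": {"tier": "HOT", "last_accessed": now - timedelta(days=45)},
    "t2": {"tier": "HOT", "last_accessed": now - timedelta(days=2)},
    "t3": {"tier": "COLD", "last_accessed": now - timedelta(days=90)},
}
for tid in find_cold_candidates(tenants, now):
    # the real handler would copy the EFS file to S3 and delete it from EFS here
    tenants[tid]["tier"] = "COLD"
print(sorted(t for t, m in tenants.items() if m["tier"] == "COLD"))  # ['t1', 't3']
```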
Operational metrics such as tier transitions and reclaimed storage are tracked using CloudWatch dashboards.
When a request is made for a tenant stored in cold storage:

1. API Gateway invokes a Rehydrate Lambda.
2. The tenant database is downloaded from S3 → EFS.
3. Metadata is updated to mark the tenant as HOT.
4. Queries resume normally.
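Rehydration is the mirror image: before executing a query, a handler ensures the tenant file is present on the hot tier. A local sketch using two directories to stand in for S3 and EFS (the helper name `ensure_hot` is an illustrative assumption):

```python
import os
import shutil
import tempfile

def ensure_hot(tenant_id: str, meta: dict, s3_dir: str, efs_dir: str) -> str:
    """Copy a cold tenant database back to the hot tier and flip its metadata."""
    hot_path = os.path.join(efs_dir, f"{tenant_id}.db")
    if meta["tier"] == "COLD":
        # S3 -> EFS download, simulated here with a local copy
        shutil.copy(os.path.join(s3_dir, f"{tenant_id}.db"), hot_path)
        meta["tier"] = "HOT"  # queries now run against the hot copy
    return hot_path

s3_dir, efs_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
with open(os.path.join(s3_dir, "t1.db"), "wb") as f:
    f.write(b"fake sqlite bytes")  # stand-in for an archived database file
meta = {"tier": "COLD"}
path = ensure_hot("t1", meta, s3_dir, efs_dir)
print(meta["tier"])  # prints "HOT"
```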
This design keeps hot tenants fast while allowing inactive tenants to be stored cheaply.
To evaluate the effectiveness of tiered storage, consider a system with:

- 1 million tenants
- 5 GB per tenant database
- 10 reads and 5 writes per day
If all databases were stored on EFS, annual infrastructure costs would exceed $1.7 million.
If everything were stored on S3, storage would be cheaper but request and transfer costs would dominate, reaching over $10 million annually.
A hybrid tiered strategy dramatically improves efficiency:

| Storage Strategy | Estimated Annual Cost |
|---|---|
| All EFS | ~$1.7M |
| All S3 | ~$10.7M |
| Tiered (EFS + S3) | ~$1.0M |
The hybrid approach ensures the system cost scales with tenant activity rather than total tenant count.
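A back-of-the-envelope model makes the shape of these curves concrete. Everything below is an assumption for illustration: the unit prices are representative list prices (they vary by region and change over time), each S3 access is modeled as moving the whole 5 GB file in 16 MB chunks (one request per chunk), and the hot fraction is a guess. The absolute numbers therefore differ from the estimates above; what the model shows is how each strategy scales.

```python
TENANTS = 1_000_000
GB_PER_TENANT = 5
READS_PER_DAY, WRITES_PER_DAY = 10, 5

# Assumed unit prices (representative, region-dependent -- check current pricing):
EFS_GB_MONTH = 0.30     # EFS standard storage, $/GB-month
S3_GB_MONTH = 0.023     # S3 standard storage, $/GB-month
S3_GET = 0.0004 / 1000  # $ per GET request
S3_PUT = 0.005 / 1000   # $ per PUT request
HOT_FRACTION = 0.10     # assumed share of tenants active at any time

total_gb = TENANTS * GB_PER_TENANT
reads_yr = TENANTS * READS_PER_DAY * 365
writes_yr = TENANTS * WRITES_PER_DAY * 365
parts_per_file = GB_PER_TENANT * 1024 // 16  # 16 MB chunks per full-file access

all_efs = total_gb * EFS_GB_MONTH * 12                  # storage cost only
all_s3 = (total_gb * S3_GB_MONTH * 12                   # storage
          + reads_yr * parts_per_file * S3_GET          # chunked reads
          + writes_yr * parts_per_file * S3_PUT)        # chunked writes
tiered = (HOT_FRACTION * total_gb * EFS_GB_MONTH * 12   # hot tier on EFS
          + (1 - HOT_FRACTION) * total_gb * S3_GB_MONTH * 12)  # cold tier on S3
          # (cold-tier request costs omitted for brevity)

assert tiered < all_s3 < all_efs  # tiering wins under this model
```

Under these assumptions all-EFS is dominated by storage, all-S3 by per-request costs on writes, and the tiered strategy pays hot-tier prices only for the active fraction.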
The architecture delivered strong improvements in both cost and scalability.
For tenant provisioning, the system reduces database cost dramatically compared with traditional managed instances:

| Approach | Monthly cost per tenant |
|---|---|
| Managed database per tenant | $7–8 |
| Shared database model | $0.10–0.20 |
| File-based architecture | < $0.01 |
This represents a cost reduction of over 99% compared with a managed database per tenant, while still maintaining strong tenant isolation.
Performance improvements were also significant:

| Access Path | Typical Latency |
|---|---|
| Redis cache hit | < 1 ms |
| EFS hot storage | 1–3 ms |
| S3 cold rehydration | 100–200 ms |
The tiered storage architecture also reduces storage costs by 60–80%, since only active tenants remain in the hot storage layer.
Building a scalable database service does not always require heavyweight distributed databases or expensive managed clusters.
By combining:

- serverless compute
- lightweight databases
- object storage
- queue-based orchestration
- caching layers

it *is* possible to build a cost-efficient, highly scalable multi-tenant database platform using simple cloud primitives.
This architecture shows that with the right design choices, systems can scale to massive workloads while keeping both cost and operational complexity under control. If you'd like to take a look at the implementation of such a service, check out the GitHub repository on my profile, and make sure to ⭐ star it as well!