This post is part of our introduction series for our technology. Learn the background here or forever wonder how baseball cards, the macarena, and a cloud-native AMS are related.
In the intro to this series, we talked about our big ask. Design an AMS that knows you, connects you, and continuously improves. Anytime. Anywhere. Always.
Brilliant! Okay... now how do we do that? Up to this point we’ve been focused on application logic. But at the heart of any AMS is a database. Right? ...Or is it? Should be a simple yes or no, right?
Actually not anymore. Today, before we can answer, we have to ask a different question: “What is a database?”
A database can be defined as “a structured set of data held in a computer, especially one that is accessible in various ways.” By that definition, yes, there’s a database. But it’s probably not what you think of when you hear the word “database”.
If you have been in or around technology in the last 40 years and you hear the word “database,” what you think of is probably a relational database management system (RDBMS). This style of database was first described in 1970 and was the powerhouse of data management throughout the PC and dot-com era. You may be familiar with names that have become synonymous with an RDBMS such as Oracle, SQL Server, MySQL, and maybe, just maybe, the first one I worked with, dBase.
Relational databases do some things very, very well. So well in fact, that a relational database became the default choice for any kind of data management for any kind of application across much of the technology world. It was asked to store everything. To do everything. If this sounds kind of familiar, you probably already read our serverless microservices post.
I remember reviewing a system in the early 2000s that held the numbers one through ten in a relational database table. To populate a drop-down. As if we would invent a new number - say “spleven” to go after six and we would be relieved because we wouldn’t have to update code. But that’s how ubiquitous the relational database had become - it was used everywhere for every problem without thinking and never questioned.
And then challenges started to appear. The first one was internet scale. Instead of building systems for one department or one company with hundreds or thousands of users, we started building one system for ten companies or a thousand companies with millions or tens of millions of users. This meant a lot more people trying to access or update data at the same time. In the RDBMS this created “contention.” People came up with lots of creative ways to solve for contention: queues, sharding, clustering, load balancing, read-replicas, reporting instances, and more. Together, these were fairly effective. But it took a lot of time and effort to get right. Time and effort that is better spent building features for customers.
The second problem was created by some of the other major advances we’ve been talking about in this series. Over the last several years, teams all over the world have been working to decompose their old systems, “monoliths,” into smaller and smaller pieces culminating in the serverless microservices architecture we talked about here. The benefits were obvious and overwhelming. Orders of magnitude increases in reliability and development speed. We’re talking about tasks that used to take a week done in an hour. That’s transformational change. But many systems relying on a single relational database to power them had a built-in Achilles heel that would prevent them from fully unlocking that advantage.
You see, in an RDBMS, parts of the data (cities, for instance) are related to other parts of the data (states). Hence “relational” database. Looking at the entire dataset, you see an interwoven spider web of these relationships known as a data schema. The trouble is, when you have to change the schema in one place, there are a lot of ways that change may impact other places. So you manage and version and try to keep the schema in sync with changes to the code. This is totally antithetical to what we’re trying to achieve with a decoupled microservices architecture where we want to be able to change one little part very rapidly and know for certain nothing else has been affected.
As a side note, somewhere out there is a support group for people who have been charged with building a version management system for a large RDBMS fleet and I just want you to know that your heroism has not gone unnoticed or unappreciated.
Just to be clear, I’m not saying an RDBMS is bad. I’m not even saying an RDBMS isn’t the best solution in the world for some problems. I’m just saying it shouldn’t be the default, do-everything, never-ask-why solution anymore. When the RDBMS conquered the world, we lived in an era when storage was very, very expensive. That computer I had in the 80s didn’t even have a hard drive. When we did finally get one, my parents spent hundreds of dollars for 1 MB of storage. 1 MB. Just for fun, try saving this page offline from your browser. I’ll bet it’s more than 1 MB. I like fun, so I tried it with one of the other blog posts on this site. It was bigger than 3 of those hard drives could have held.
An RDBMS is a very efficient way to store data. Imagine you wanted to store all the people in the US and what state they live in. You could store “California”, “Georgia”, or “Mississippi” in each record. Or you could store a short code in each record, “CA”, “GA”, or “MS,” and keep each full state name just once in a separate lookup table. This is called normalization and it makes the total amount of data you need to store much smaller. Doesn’t seem like a big difference? Well, it is if you live in an era when one GB of storage costs 1 million dollars, as it did when the relational database was proposed. Today, that same GB would cost two cents. More than 7 orders of magnitude cheaper.
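If you want to check that math yourself, here is a quick sketch. The two prices are the rough figures quoted above, not precise market data, and the byte counts are a simplified illustration that assumes one byte per character.

```python
import math

# Both prices are the rough figures quoted in the post, not precise market data.
cost_per_gb_then = 1_000_000.00  # dollars per GB when the relational model was proposed
cost_per_gb_now = 0.02           # dollars per GB today

price_ratio = cost_per_gb_then / cost_per_gb_now
orders_of_magnitude = math.log10(price_ratio)

# Hypothetical sizing: bytes needed to store the state on each record,
# assuming one byte per character.
full_names = ["California", "Georgia", "Mississippi"]
state_codes = ["CA", "GA", "MS"]
bytes_as_names = sum(len(name) for name in full_names)
bytes_as_codes = sum(len(code) for code in state_codes)

print(f"Storage is about {price_ratio:,.0f} times cheaper")
print(f"That is {orders_of_magnitude:.1f} orders of magnitude")
print(f"Three records: {bytes_as_names} bytes as names, {bytes_as_codes} bytes as codes")
```

The ratio works out to roughly fifty million, which sits between 7 and 8 orders of magnitude. The per-record savings from the codes are real but tiny, which is exactly why they mattered so much more when every byte was expensive.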
An RDBMS is also a very good way to run ad-hoc queries to create reports. Suppose you want to build a report that is going to show contacts and you want to display the name of the state they live in. You tell the RDBMS to execute a “join” matching the state code and loading the corresponding full state name for each record. That matching process uses the CPU. More on that in a bit.
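Here is a minimal sketch of that join, using the SQLite engine that ships with Python as a stand-in for a full RDBMS. The table names, column names, and sample contacts are made up for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE states (
        code TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE contacts (
        id INTEGER PRIMARY KEY,
        full_name TEXT NOT NULL,
        state_code TEXT NOT NULL REFERENCES states(code)
    );
    INSERT INTO states (code, name) VALUES
        ('CA', 'California'), ('GA', 'Georgia'), ('MS', 'Mississippi');
    INSERT INTO contacts (full_name, state_code) VALUES
        ('Ada Example', 'CA'), ('Ben Example', 'GA'), ('Cy Example', 'MS');
""")

# The join matches each contact's state code to the full state name.
rows = conn.execute("""
    SELECT c.full_name, s.name
    FROM contacts AS c
    JOIN states AS s ON s.code = c.state_code
    ORDER BY c.id
""").fetchall()

for full_name, state_name in rows:
    print(f"{full_name} lives in {state_name}")
```

Each contact stores only the two-letter code. The database does the matching work at query time, and that matching is the part that costs CPU.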
An RDBMS, however, is not always the best way to load information to populate a web app. As an example, data for an order record may be spread across all kinds of tables in a relational database. These all have to be joined together when you ask to view your order history. As we’ve seen, that uses the CPU which has limited capacity shared across the entire database. This presents challenges such as a big task choking off resources for other queries which have to wait. As they’re waiting more queries come in and soon things cascade until they seize up. Again, there have been lots of methods created for managing this kind of contention. But better than all of those techniques is to simply avoid the problem in the first place.
In a NoSQL table, all the information related to that order is stored in one place. You can load everything about an order with no join and the bare minimum CPU. But wait a minute, you say. Isn’t data duplicated? Yes, in our example above, we would write the full “California”, “Georgia”, or “Mississippi” in every order - denormalized data. But as we also pointed out above, storage is cheap. And with the public cloud, for all practical purposes, it is infinite. Infinite, really? Well, if you started buying stuff right now, you’d run out of money long before AWS runs out of room to store your order data - duplicate state names and all.
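To make the idea concrete, here is a rough sketch in which a plain Python dictionary stands in for a NoSQL table. This is not how DynamoDB or any real service is called. The function names, keys, and order contents are all hypothetical, and the point is only the shape of the data.

```python
# A plain dict stands in for a NoSQL table here: one key maps to one complete item.
orders_table = {}

def put_order(item):
    """Store the whole order under its key, the way a key-value store would."""
    orders_table[item["order_id"]] = item

def get_order(order_id):
    """Fetch everything about an order with a single key lookup and no join."""
    return orders_table[order_id]

put_order({
    "order_id": "order-1001",
    "customer_name": "Ada Example",
    "state": "California",  # full name repeated on every order: denormalized
    "items": [
        {"description": "Annual membership", "quantity": 1, "price": "150.00"},
        {"description": "Conference ticket", "quantity": 2, "price": "75.00"},
    ],
})

order = get_order("order-1001")
print(order["customer_name"], order["state"], len(order["items"]))
```

One lookup returns the customer, the state, and every line item together. The trade-off is visible in the data itself: “California” gets written again on every order that ships there.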
The other benefit of having all the data in one place is that a change to the schema doesn’t affect anything else. In fact, a NoSQL table is “schemaless” so there is no schema to update at all. This fits perfectly with our serverless microservices architecture designed for ultra-rapid development and continuous deployment.
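As a small illustration of what “schemaless” buys you, here is a sketch that again uses a dictionary as a stand-in for a NoSQL table. The member records and the preferred_name attribute are invented for the example.

```python
# A dict stands in for a NoSQL table; every name below is hypothetical.
members_table = {}

# An older item, written before anyone thought about preferred names.
members_table["member-1"] = {"member_id": "member-1", "name": "Ada Example"}

# A newer item with an extra attribute. No migration or ALTER TABLE was needed.
members_table["member-2"] = {
    "member_id": "member-2",
    "name": "Benjamin Example",
    "preferred_name": "Ben",
}

def display_name(member_id):
    """Use the newer attribute when present, otherwise fall back to the name."""
    member = members_table[member_id]
    return member.get("preferred_name", member["name"])

print(display_name("member-1"))
print(display_name("member-2"))
```

Old and new items sit side by side, and only the code that cares about the new attribute has to know it exists. The flip side is that the reading code now has to handle items that lack it, which is what the fallback in display_name is doing.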
AWS’s NoSQL service, DynamoDB, has so many other benefits (point-in-time recovery, encryption at rest, on-demand throughput, global data replication, and even ACID transactions) that we knew it fit our vision for Rhythm far better than one big relational database. We also knew that our serverless microservice design would still allow us to use an RDBMS in very targeted ways should we encounter a situation in the future where the strengths of a relational database fit better. That’s right, when you have one thousand little parts working together, there’s no rule that says they have to store data in the same way. In fact, last year AWS’s CTO Werner Vogels wrote a fantastic blog post describing how AWS envisioned exactly this sort of purpose-built database decision.
At this point there might be something gnawing at the back of your brain. I mentioned how an RDBMS is excellent for reporting, but how does that work for a system like Rhythm where the data is in NoSQL storage and not being powered by a big relational database? Well, I’m glad you might have asked. When storage is cheap and infinite, nothing says data has to live in only one place. In an upcoming post in this series we’ll look at how the radical shift in the availability of storage and the birth of MapReduce systems like Hadoop have given rise to another new phenomenon allowing teams to deliver and scale faster than ever - the data lake.