On Architecture – Part 2: Data Storage and Access
Introduction
This is the second post in a new series on software architecture; the first one, on high-level system communication, is here. This time I’m going to talk a bit about data storage and how to access the data.
Data Storage and Organisation
In the early days, pretty much everyone was using relational databases for storing their data. I believe a great many still do, but a number of other solutions have emerged:
- Time series databases
- Graph databases
- Document databases
- Key-value stores
These are just a few. Essentially, almost anything can be modelled with a relational database, but other kinds of databases allow some things to be modelled, and, especially, queried, much better.
So, this is a crucial decision to make: what kind of database engine to use. After that, a number of other decisions will arise, such as:
- The tech specs for the server
- The actual model definition
- Partitioning strategies (this, of course, can evolve over time)
- Scaling strategy (scale up/down, clustering)
- Monitoring
- Deployment (cloud, on premises)
- Maintenance (for performance)
- Getting rid of old data
It is not uncommon to have a specific database used for product definition and then use some sort of ETL process to get the data into its final destination, which can use a different kind of database engine. Think, for example, of a relational database where the product catalog is defined in a normalised format and an ETL process that picks it up, transforms it, and turns it into a set of documents in a document database. This ETL process needs to be defined and put in place; a number of tools exist for this purpose (SSIS, Talend Open Studio), or it can be implemented manually, something that I’d strongly advise against.
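Just to make the idea concrete, here is a minimal sketch of such an ETL step in Python: it reads a normalised product catalog from a relational database (SQLite stands in for the real engine) and reshapes each product into a single self-contained document, ready to be loaded into a document database. The table names, columns and the `load_document` stub are all invented for the example.

```python
import json
import sqlite3


def extract_products(conn):
    """Extract: read the normalised catalog (products plus their prices per market)."""
    for pid, name, category in conn.execute("SELECT id, name, category FROM product"):
        prices = conn.execute(
            "SELECT market, amount FROM product_price WHERE product_id = ?", (pid,)
        ).fetchall()
        yield pid, name, category, prices


def transform(pid, name, category, prices):
    """Transform: denormalise one product into a single document."""
    return {
        "_id": pid,
        "name": name,
        "category": category,
        "prices": [{"market": m, "amount": a} for m, a in prices],
    }


def load_document(doc):
    """Load: stand-in for an insert into a document database (driver or HTTP call)."""
    print(json.dumps(doc))


def run_etl(conn):
    for row in extract_products(conn):
        load_document(transform(*row))


if __name__ == "__main__":
    # Tiny in-memory catalog so the sketch runs on its own.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER, name TEXT, category TEXT)")
    conn.execute("CREATE TABLE product_price (product_id INTEGER, market TEXT, amount REAL)")
    conn.execute("INSERT INTO product VALUES (1, 'Widget', 'Tools')")
    conn.execute("INSERT INTO product_price VALUES (1, 'EU', 9.90)")
    run_etl(conn)
```

A dedicated tool gives you scheduling, error handling and monitoring out of the box, which is exactly what a hand-rolled script like this tends to lack.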
Lastly, it is worth mentioning that the times when companies had a single database for all their needs are long gone! It is perfectly fine and usual to have multiple databases, so do not feel obliged to fit everything into a single one. Pick what’s best for each use case, including each microservice. If needed, implement a service that synchronises/reconciles the data that requires it among the different databases.
Data Access
Another aspect that we must consider is: how do we access the data? Should we use plain SQL (in the case of relational databases, or the equivalent in other kinds), or use stored procedures? Going further, still for relational databases, should we trust Object-Relational Mappers (O/RMs) such as Entity Framework Core or NHibernate for the job, or do we prefer to write the queries by hand? Mind you, modern versions of these tools can use either SQL or stored procedures, or both. By the way, some O/RMs also support accessing non-relational databases; for example, Entity Framework Core can access Cosmos DB and, more recently, MongoDB. Other possibilities for accessing data might include OData or GraphQL, especially when we don’t want to allow direct access to the database engine via a database connection; instead, everything goes through HTTP(S). These two are similar, but GraphQL also includes support for subscriptions, something that might be worth looking at. Key-value databases such as Redis are widely used to keep sessions and caches in web applications.
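For the Redis case, here is a minimal session/cache helper as a sketch, assuming the `redis` Python client and a locally reachable server; the key naming and the TTL are arbitrary choices for the example.

```python
import json

import redis  # third-party client, assumed installed (pip install redis)

# Assumes a Redis server on localhost; adjust host/port for a real deployment.
cache = redis.Redis(host="localhost", port=6379)


def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    # Store the session as a JSON blob that expires automatically after the TTL.
    cache.set(f"session:{session_id}", json.dumps(data), ex=ttl_seconds)


def load_session(session_id: str) -> dict | None:
    # Return the session if it is still alive, otherwise None.
    raw = cache.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```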
Because of the way applications work, it may make sense to use different database models for querying and for updates (an approach often referred to as CQRS), as this may result in better performance. It may involve some kind of process to reconcile the two, either on a schedule or whenever a change occurs.
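A very small sketch of the idea follows: a normalised write model and a flattened read model that is refreshed whenever an order changes. SQLite and the schema are invented for illustration only; in practice the two models often live in different engines.

```python
import sqlite3

# Write model: normalised tables, optimised for consistent updates.
write_db = sqlite3.connect(":memory:")
write_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
write_db.execute("CREATE TABLE order_lines (order_id INTEGER, total REAL)")

# Read model: one flat row per order, optimised for querying.
read_db = sqlite3.connect(":memory:")
read_db.execute(
    "CREATE TABLE order_summary (order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)


def refresh_read_model(order_id: int) -> None:
    """Reconcile a single order from the write model into the read model."""
    (customer,) = write_db.execute(
        "SELECT customer FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    (total,) = write_db.execute(
        "SELECT COALESCE(SUM(total), 0) FROM order_lines WHERE order_id = ?", (order_id,)
    ).fetchone()
    read_db.execute(
        "INSERT OR REPLACE INTO order_summary VALUES (?, ?, ?)", (order_id, customer, total)
    )


# On every change (or on a schedule), refresh the affected summary rows.
write_db.execute("INSERT INTO orders VALUES (1, 'ACME')")
write_db.execute("INSERT INTO order_lines VALUES (1, 9.90)")
refresh_read_model(1)
print(read_db.execute("SELECT * FROM order_summary").fetchall())
```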
Security
How do we secure access to each database? Do we use credentials or, for example, Windows authentication, whenever possible? As regards the actual data, should we go for fine-grained security, such as row-level security, or even column-based? And do we need to encrypt columns (data at rest) or just use encryption on the connection (data in transit)?
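As an illustration of protecting data at rest, here is a minimal sketch of encrypting a sensitive column before it ever reaches the database. It assumes the `cryptography` package, which is my choice for the example; any symmetric encryption library would do, and key management is deliberately glossed over.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# In a real system the key comes from a key vault/HSM, never from source code.
key = Fernet.generate_key()
fernet = Fernet(key)


def encrypt_column(value: str) -> bytes:
    """Encrypt a sensitive value (e.g. a national id) before storing it."""
    return fernet.encrypt(value.encode("utf-8"))


def decrypt_column(stored: bytes) -> str:
    """Decrypt the value read back from the database."""
    return fernet.decrypt(stored).decode("utf-8")


ciphertext = encrypt_column("123-45-6789")
assert decrypt_column(ciphertext) == "123-45-6789"
```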
Monitoring
How can we monitor our databases, and what needs to be monitored: is it just growth and physical limits, or also the errors that may occur? Should we also look for deadlocks? Do we need to preemptively look for slow queries or operations?
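One simple way to start looking preemptively for slow operations is a periodic probe that times a representative query and warns when it crosses a threshold. A minimal sketch, with SQLite standing in for the real engine and an arbitrary threshold:

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.INFO)
SLOW_QUERY_THRESHOLD_SECONDS = 0.5  # arbitrary threshold for this example


def probe(conn: sqlite3.Connection, query: str) -> None:
    """Time a representative query and log a warning when it is suspiciously slow."""
    started = time.perf_counter()
    conn.execute(query).fetchall()
    elapsed = time.perf_counter() - started
    if elapsed > SLOW_QUERY_THRESHOLD_SECONDS:
        logging.warning("Slow query (%.2fs): %s", elapsed, query)
    else:
        logging.info("Query OK (%.2fs)", elapsed)


# Run the probe on a schedule (cron, a background job, the monitoring stack, etc.).
conn = sqlite3.connect(":memory:")
probe(conn, "SELECT 1")
```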
Deployment
Now that we have everything defined, where do we put it? Is it going to be on premises (inside our company) or on some cloud? Maybe the latter is a better choice these days, as the costs can be considerably lower and it also offers other useful features when it comes to scalability and maintenance. If we are going to use cloud-based deployment, there may be legal and other concerns as well, and it may open up possibilities such as geo-replication of all or part of the data.
Maintenance
When everything is set and running, it is our job to keep it that way, as smoothly as possible. This involves a number of operations, for example:
- Rebuilding/setting up new indexes depending on the current usage trends
- Updating statistics
- Backing up data (full, incremental; see the sketch after this list)
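For the backup item, here is a minimal sketch of a full backup using Python’s built-in SQLite backup API. SQLite is just a stand-in, since a real engine would come with its own backup tooling, and the paths are invented for the example.

```python
import sqlite3
from datetime import datetime, timezone


def full_backup(source_path: str, backup_dir: str) -> str:
    """Take a full backup of an SQLite database into a timestamped file."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target_path = f"{backup_dir}/backup-{stamp}.db"
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(target_path)
    with target:
        source.backup(target)  # copies the whole database page by page
    source.close()
    target.close()
    return target_path


# Example usage (paths are illustrative):
# full_backup("catalog.db", "/var/backups")
```

An incremental strategy would only copy what changed since the last backup, trading restore simplicity for smaller, faster backup runs.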
Dealing with old data is also usually regarded as part of maintenance.
Old Data
As for old data, what shall we do: leave everything online, or move data that is no longer (or less frequently) needed to some offline (or cold) storage? And should it still be possible to access this data?
We may first need to run an analysis to figure out which old data, and which of its dependencies, can be safely moved or purged; this by itself can be a daunting task. Then we may need multiple retention levels, possibly with different expirations, and even with manual confirmation, at the end of which the data is deleted or moved to a cold storage location, where it is essentially offline.
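To make this a bit more tangible, here is a minimal sketch of an archival job that exports old rows to a cold-storage file and only then deletes them. The table, the cut-off and the export format are all invented for the example; a real job would also handle dependencies between tables.

```python
import csv
import sqlite3
from datetime import datetime, timedelta, timezone


def archive_old_orders(conn: sqlite3.Connection, archive_path: str, keep_days: int = 365) -> int:
    """Export orders older than the cut-off to a CSV file, then delete them."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=keep_days)).isoformat()
    old_rows = conn.execute(
        "SELECT id, customer, created_at FROM orders WHERE created_at < ?", (cutoff,)
    ).fetchall()

    # Write the archive first, so nothing is deleted unless the export succeeded.
    with open(archive_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "customer", "created_at"])
        writer.writerows(old_rows)

    with conn:  # delete inside a transaction once the export is on disk
        conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))
    return len(old_rows)
```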
Conclusion
As you can see, there is plenty of food for thought. I have merely scratched the surface; there’s a lot more to dig into here. As always, let me hear your thoughts.