💾Storage
How we store our data
Database
Database servers are high-performance machines suitable for real-time usage. Database instances store our current data, which is required for system operations.
There are many different SQL engines that can be used to power an SQL server. While all engines share the standard syntax and functions, each also has its own distinct helper functions and non-standard syntax. Our SQL database engine of choice is MySQL.
Each environment should have its own database server and instance with the same engine. This is so we can trial server configurations in one environment without potentially disrupting other environments.
Data Backup Strategy
It's important that we back up our production database at least daily so that we can recover data in a disaster scenario. At all times, we should keep at least the last 7 days' worth of backups. The frequency of backups and the data retention period are subject to review depending on system usage.
A data backup is a snapshot of a database at a moment in time.
Users' Right to Erasure
Users have the right to data erasure from our database and data warehouse, but not from our backups. We require our backups as a safety net for disaster recovery. Furthermore, data shouldn't be deleted from backups as:
this could potentially corrupt a backup.
it allows for a potential attack scenario where an attacker could delete all of a user's data on their behalf with no possibility of recovery.
Handling Disaster Recovery
In a disaster scenario where our production database needs to be restored from a backup, we would also be restoring previously deleted user data. This data needs to be re-deleted to uphold users' right to erasure. This can be achieved by maintaining a list of all the deleted primary keys for each model. This deletion list can then be cross-referenced with the restored data to select the previously deleted keys for re-deletion (sketched below).
Primary keys must never be recycled.
Our data warehouse can track if/when a primary key was scheduled for deletion.
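A minimal sketch of the re-deletion step, assuming the deletion list is available as a collection of primary keys per model (the function and variable names are illustrative):

```python
def redelete_after_restore(model, deleted_pks):
    """Remove restored rows whose primary keys are on the model's deletion list."""
    model.objects.filter(pk__in=deleted_pks).delete()
```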
Data Integrity
When writing data to a database as part of a larger process, it's common for an error to be raised during a subsequent step. However, if not managed properly, this can lead to data integrity issues.
Atomic Transactions
MySQL (as well as most SQL engines) allows you to create a collection of SQL statements called a transaction. Transactions can be made to "roll back" all previous statements if a statement raises an error. In other words, it's all or nothing. This is critical to maintaining data integrity.
Atomicity is the "all or nothing" nature of transactions. Atomic transactions are those which roll back all previous statements if a statement raises an error.
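In our Django code, the same all-or-nothing behaviour can be expressed with transaction.atomic. A minimal sketch, where User and Profile stand in for real models defined elsewhere:

```python
from django.db import transaction


def create_user_with_profile(name, email):
    """Either both rows are written, or neither is."""
    with transaction.atomic():
        user = User.objects.create(name=name, email=email)
        # If this second statement raises, the user INSERT above is rolled back too.
        Profile.objects.create(user=user)
    return user
```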
Django supports making every HTTP request atomic; we should enable this by default.
"It works like this. Before calling a view function, Django starts a transaction. If the response is produced without problems, Django commits the transaction. If the view produces an exception, Django rolls back the transaction."
Data Warehouse
Data warehouses store all historical data, which is required for business intelligence (BI). Databases are not suitable for this purpose because their high performance comes at a high storage cost.
| | Database | Data warehouse |
| --- | --- | --- |
| Suited for | system operations | business intelligence |
| Ease of access | central / easy | not central / not easy |
| Frequency of access | frequent | infrequent |
| Speed of access | fast | slow |
| Cost (relative to size) | expensive | cheap |
| Data table relations | enforced | not enforced |
| Data stored | current | historical |
| Data CRUD operations | create, read, update, delete | create, read, delete |
BigQuery
Our data warehouse of choice is Google's BigQuery (BQ). Our BQ project is found here.
BigQuery uses the GoogleSQL engine.
Data Tables
As a counterpart to our production database, we should have a BQ dataset named "production". Within the dataset, we should have a BQ table for each table present in the database, with the same name for traceability. Data will be transferred from the production tables to their counterpart BQ tables.
A dataset is a collection of data tables.
We can optionally have datasets for our non-production environments too, where each dataset is named after its source environment. Although this data will not produce useful insights, it may be useful for testing purposes. However, the size of these datasets should be managed to avoid unnecessary expenditure.
How to Design a Table's Schema
In its simplest form, a BQ table's schema must be a subset of its database counterpart's schema. We must consider the data's sensitivity before transferring it. Sensitive columns should be omitted by default, unless explicitly required. Non-sensitive columns should be included by default, unless explicitly not required.
New columns can be created in a BQ table that aren't present in its counterpart database table. This may be useful to save the result of an expensive calculation or to retain some useful metadata of an omitted column. Metadata columns should be prefixed with the original column's name for traceability.
Sensitive data is data that can be used to identify individual persons; it may be found in a single column or across multiple columns combined.
Columns containing personally identifiable data must be omitted in BQ tables by default. Be wary that while individual columns may not contain personally identifiable data, data from 2 or more columns may be enough to identify individual persons.
Example of data rows in a database table and its BQ table counterpart (column names are illustrative):

Database: production -> Table: user_user

| id | name | email | favourite colour | favourite food |
| --- | --- | --- | --- | --- |
| 1 | Stefan Kairinos | stefan.kairinos@codeforlife.com | red | lentils |
| 2 | Florian Aucomte | florian.aucomt1@codeforlife.com | red | quiche |

Dataset: production -> Table: user_user

| id | email_domain | favourite colour |
| --- | --- | --- |
| 1 | codeforlife.com | red |
| 2 | codeforlife.com | red |
A table's primary key must always be transferred to support JOINs in SQL queries.
How to Create a Table
Compose a new query in BQ Studio:
Write a CREATE TABLE SQL statement (a sketch follows these steps).
Save the SQL statement to the project with naming convention "create {TABLE_NAME}". This is useful if the table needs to be created in multiple environments' dataset.
Execute the SQL statement.
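A minimal sketch of such a statement for an illustrative table; the dataset, columns and types are assumptions, not our real schema. The same DDL can be composed in BQ Studio or, as shown here, run with the BigQuery Python client:

```python
from google.cloud import bigquery

CREATE_USER_USER = """
CREATE TABLE production.user_user (
  id INT64 NOT NULL,
  email_domain STRING,
  favourite_colour STRING,
  last_saved_at DATETIME,
  PRIMARY KEY (id) NOT ENFORCED  -- see "How to Create Data Table Relations" below
)
"""

client = bigquery.Client()
client.query(CREATE_USER_USER).result()  # runs the DDL in our BQ project
```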
There are other ways to create a table but they do not support setting all the creation options we require. For example, BQ Studio has a graphical schema editor.
How to Create Data Table Relations
An important distinction between databases and data warehouses is that, in a data warehouse, keys can NOT be enforced. However, it's still worth defining non-enforced keys when creating tables, as it allows the GoogleSQL engine to join data tables more efficiently.
Primary and foreign keys can not be enforced and are merely suggestions.
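For example, a non-enforced foreign key can be declared when creating a child table. A minimal sketch, assuming the illustrative user_user table above and a hypothetical game_session table:

```python
CREATE_GAME_SESSION = """
CREATE TABLE production.game_session (
  id INT64 NOT NULL,
  user_id INT64 NOT NULL,
  started_at DATETIME,
  PRIMARY KEY (id) NOT ENFORCED,
  -- Not enforced by BigQuery, but usable by the query optimiser when joining.
  FOREIGN KEY (user_id) REFERENCES production.user_user (id) NOT ENFORCED
)
"""
```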
Scheduled Queries
Our scheduled queries are how we transfer data from our production database to our data warehouse. Data is not transferred continuously, as it is to our production database, but rather at regular intervals and in bulk. Queries must be scheduled to pull all data modified since the last transfer from a production table into its counterpart BQ table.
A query is a statement that specifically retrieves data. Statements, on the other hand, cover a broader set of operations, such as INSERT, UPDATE, DELETE, CREATE, ALTER, and DROP.
Scheduling Strategy
To transfer only data that has been created or updated since the last transfer, every data model will have a last_saved_at date-time field:
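A minimal sketch of such a field on an abstract Django base model (the class name is illustrative):

```python
from django.db import models


class WarehousedModel(models.Model):
    """Base class for models whose data is transferred to the warehouse."""

    # Set to the current time every time the object is saved.
    last_saved_at = models.DateTimeField(auto_now=True)

    class Meta:
        abstract = True
```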
auto_now automatically sets the field to now every time the object is saved.
When scheduling transfers, the SQL queries must filter for data saved since the last transfer.
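A minimal sketch of what one such scheduled query might look like (GoogleSQL), assuming a daily schedule and a federated EXTERNAL_QUERY connection to the production MySQL instance; the connection ID, columns and 24-hour window are illustrative, not our actual setup:

```python
TRANSFER_USER_USER = """
INSERT INTO production.user_user (id, email_domain, favourite_colour, last_saved_at)
SELECT id, email_domain, favourite_colour, last_saved_at
FROM EXTERNAL_QUERY(
  "our-project.europe-west2.production-mysql",
  -- The inner query runs on MySQL and only selects rows saved since the previous daily run.
  "SELECT id, SUBSTRING_INDEX(email, '@', -1) AS email_domain, favourite_colour, last_saved_at FROM user_user WHERE last_saved_at >= NOW() - INTERVAL 1 DAY"
)
"""
```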
When scheduling queries, it's important to consider how often a table's data should be transferred. Furthermore, queries should be scheduled during non-peak hours to minimise stress on our production database.
By default, queries should be scheduled daily at 07:00 UTC.
When scheduling transfers, keep in mind the schedules of related data tables. If we require a snapshot of some data and its relations, then we need to sync the schedules of all the related data tables.
Data Deletion Strategy
When we wish to delete any data rows, we shouldn't delete them immediately, as then we won't be able to sync the last data values to our data warehouse. Rather, we should schedule the data for deletion at some point after the next scheduled transfer. This allows us to warehouse when data was deleted and what its values were at the time of deletion.
To support scheduled deletions, every data model will have a:
delete_after date-time field, which will be null if it's not scheduled for deletion.
delete_wait attribute to set how far in the future a data row should be scheduled for deletion.
delete function which overrides Django's delete function to set the deletion schedule (see the sketch after this list).
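A minimal sketch of these three on an abstract Django base model; the class name and the 3-day wait are illustrative:

```python
from datetime import timedelta

from django.db import models
from django.utils import timezone


class ScheduledDeletionModel(models.Model):
    """Base class for models whose rows are deleted on a schedule."""

    # Null while the row is not scheduled for deletion.
    delete_after = models.DateTimeField(null=True, blank=True)

    # How far in the future a row is scheduled for deletion,
    # e.g. 2-3x the transfer interval (3 days for a daily transfer).
    delete_wait = timedelta(days=3)

    class Meta:
        abstract = True

    def delete(self, *args, **kwargs):
        # Schedule the deletion instead of deleting the row immediately.
        self.delete_after = timezone.now() + self.delete_wait
        self.save()
```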
It's recommended to schedule deletions after the next 2-3 scheduled transfers so that, if the first data transfer fails, we have a small window of opportunity to fix the problem. If we don't manage to fix the problem within the window, we must uphold our obligation to delete the data and lose that piece of history in our warehouse.
Deleting from our Database
Using CRON jobs, we can delete from our database any data whose scheduled deletion time is now or in the past.
Depending on the data table, a CRON job should run frequently so that we don't store data long after its scheduled deletion. By default, a job should be scheduled every hour.
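A minimal sketch of the clean-up such a job could run (e.g. from a Django management command); the function name is illustrative:

```python
from django.utils import timezone


def delete_scheduled_rows(model):
    """Hard-delete every row whose scheduled deletion time has passed.

    QuerySet.delete() issues a bulk SQL DELETE and does not call the overridden
    Model.delete(), so the rows are removed for real rather than re-scheduled.
    """
    model.objects.filter(delete_after__lte=timezone.now()).delete()
```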
Unscheduling Deletions
We can optionally build flows in our system that allow users to unschedule data deletions.
Deleting from our Data Warehouse
We may wish to delete data that's too old to be useful for our BI operations, which also saves storage costs. If so, we can create scheduled queries to delete all rows where last_saved_at is older than some threshold.
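A minimal sketch of such a scheduled clean-up query (GoogleSQL); the table and the 3-year threshold are illustrative assumptions:

```python
DELETE_OLD_ROWS = """
DELETE FROM production.user_user
WHERE last_saved_at < DATETIME_SUB(CURRENT_DATETIME(), INTERVAL 3 YEAR)
"""
```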
If a user's personal data is stored in our warehouse, then we'll need to set it to null using a scheduled query that applies a similar filter.