💾 Storage
How we store our data
Database
Database servers are high-performance machines suitable for real-time usage. Database instances store our current data, which is required for system operations.
Each environment should have its own database server and instance with the same engine. This is so we can trial server configurations in one environment without potentially disrupting other environments.
Data Backup Strategy
It's important that we back up our production database at least daily so that we can recover data in a disaster scenario. At all times, we should keep at least the last 7 days' worth of backups. The backup frequency and data retention period are subject to review depending on system usage.
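Purely as an illustration of this policy (the real backups may well be handled by managed tooling; the backup location and mysqldump usage below are assumptions), a daily CRON job could dump the database and prune anything older than 7 days:

"""Hypothetical daily backup job: dump the database, keep only the last 7 days."""
import subprocess
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_DIR = Path("/var/backups/db")  # assumed backup location
RETENTION = timedelta(days=7)

# Dump the production database to a timestamped file
# (assumes mysqldump with credentials provided via a .my.cnf file).
backup_file = BACKUP_DIR / f"production-{datetime.now():%Y-%m-%d}.sql"
with backup_file.open("w") as f:
    subprocess.run(["mysqldump", "production"], stdout=f, check=True)

# Prune backups older than the retention period.
for path in BACKUP_DIR.glob("production-*.sql"):
    if datetime.now() - datetime.fromtimestamp(path.stat().st_mtime) > RETENTION:
        path.unlink()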
Users' Right to Erasure
Users have the right to erasure of their data from our database and data warehouse, but not from our backups. We require our backups as a safety net for disaster recovery. Furthermore, data shouldn't be deleted from backups because:
this could potentially corrupt a backup.
it allows for a potential attack scenario where an attacker could delete all of a user's data on their behalf with no possibility of recovery.
Handling Disaster Recovery
In a disaster scenario where our production database needs to be restored from a backup, we would be restoring previously deleted user data. This data needs to be re-deleted to uphold users' right to erasure. This can be achieved by maintaining a list of all the deleted primary keys for each model. This deletion list can then be cross-referenced with the backup to select the previously deleted keys (see the sketch below).
Read more about our disaster recovery process.
Primary keys must never be recycled.
Our data warehouse can track if/when a primary key was scheduled for deletion.
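As a rough sketch of the cross-referencing step: once the deletion list is available as a table in the restored database (however it's sourced, e.g. exported from the data warehouse), the previously deleted rows can be re-deleted. The deleted_key_log table and its columns below are hypothetical names for illustration.

-- Hypothetical sketch: re-delete restored rows whose primary keys appear in the deletion list.
DELETE FROM user_user
WHERE id IN (
    SELECT pk
    FROM deleted_key_log
    WHERE model = 'user_user'
);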
Data Integrity
When writing data to a database as part of a larger process, it's common for an error to be raised during a subsequent step. However, if not managed properly, this can lead to data integrity issues.
Atomic Transactions
MySQL (like most SQL engines) allows you to group a collection of SQL statements into a transaction. A transaction can be made to "roll back" all previous statements if any statement raises an error. In other words, it's all or nothing. This is critical to maintaining data integrity.
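For example (with a hypothetical account table), the two statements below either both take effect or neither does:

START TRANSACTION;
UPDATE account SET balance = balance - 10 WHERE id = 1;
UPDATE account SET balance = balance + 10 WHERE id = 2;
-- If either statement raises an error, ROLLBACK undoes both; otherwise commit.
COMMIT;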
Django supports making all HTTP requests atomic via the ATOMIC_REQUESTS setting on each database connection. This should be enabled by default.
"It works like this. Before calling a view function, Django starts a transaction. If the response is produced without problems, Django commits the transaction. If the view produces an exception, Django rolls back the transaction."
Data Warehouse
Data warehouses store all historical data, which is required for business intelligence (BI). Databases are not suitable for this purpose because their high performance comes at a high storage cost.
|  | Database | Data warehouse |
| --- | --- | --- |
| suited for operations | system | business intelligence |
| ease of access | central / easy | not central / not easy |
| frequency of access | frequent | infrequent |
| speed of access | fast | slow |
| cost (relative to size) | expensive | cheap |
| data table relations | enforced | not enforced |
| data stored | current | historical |
| data CRUD operations | create, read, update, delete | create, read, delete |
BigQuery
Our data warehouse of choice is Google's BigQuery (BQ). Our BQ project is found here.
Data Tables
As a counterpart to our production database, we should have a BQ dataset named "production". Within the dataset, we should have a BQ table, with the same name for traceability, for each table present in the database. Data will be transferred from the production tables to their counterpart BQ tables.
How to Design a Table's Schema
In its simplest form, a BQ table's schema must be a subset of its database counterpart's. We must consider the data's sensitivity before transferring it. Sensitive columns should be omitted by default, unless explicitly required. Non-sensitive columns should be included by default, unless explicitly not required.
New columns can be created in a BQ table that aren't present in its counterpart database table. This may be useful to save the result of an expensive calculation or to retain some useful metadata of an omitted column. Metadata columns should be prefixed with the original column's name for traceability.
Columns containing personally identifiable data must be omitted in BQ tables by default. Be wary that while individual columns may not contain personally identifiable data, data from 2 or more columns may be enough to identify individual persons.
Example of data rows in a database table and its BQ table counterpart (the column headings below are illustrative):
Database: production -> Table: user_user

| id | name | email | favourite colour | favourite food |
| --- | --- | --- | --- | --- |
| 1 | Stefan Kairinos | stefan.kairinos@codeforlife.com | red | lentils |
| 2 | Florian Aucomte | florian.aucomt1@codeforlife.com | red | quiche |
Dataset: production -> Table: user_user

| id | email domain | favourite colour |
| --- | --- | --- |
| 1 | codeforlife.com | red |
| 2 | codeforlife.com | red |
Tables' primary keys must always be transferred to support JOINs in SQL queries.
How to Create a Table
Compose a new query in BQ Studio:

Write a CREATE TABLE SQL statement:
CREATE TABLE production.user_user (
    first_name STRING(30),
    last_name STRING(30)
);
Save the SQL statement to the project with the naming convention "create {TABLE_NAME}". This is useful if the table needs to be created in multiple environments' datasets.
Execute the SQL statement.
How to Create Data Table Relations
An important distinction between databases and data warehouses is that, in a data warehouse, keys can NOT be enforced. However, it's still worth defining non-enforced keys when creating tables, as it allows Google's SQL engine to join data tables more efficiently.
CREATE TABLE production.user_user (
    id INT NOT NULL,
    first_name STRING(30),
    last_name STRING(30),
    PRIMARY KEY (id) NOT ENFORCED
);
Primary and foreign keys can not be enforced and are merely suggestions.
Scheduled Queries
Scheduled queries are how we transfer data from our production database to our data warehouse. Data is not transferred continuously, as it is to our production database, but rather at regular intervals and in bulk. Queries must be scheduled to pull all data modified since the last transfer from a production table into its counterpart BQ table.
Scheduling Strategy
To transfer only data that has been created or updated since the last transfer, every data model will have a last_saved_at date-time field:
from django.db import models


class BaseModel(models.Model):
    """The base model to be inherited by all models."""

    # Automatically set to the current time every time the model is saved.
    last_saved_at = models.DateTimeField(auto_now=True)
    ...
When scheduling transfers, the SQL queries must filter for data saved since the last transfer.
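As a sketch of what such a scheduled query might look like, assuming a Cloud SQL federated connection to the production database and a daily schedule (the connection ID, column list and interval are placeholders):

MERGE production.user_user AS bq
USING (
    SELECT *
    FROM EXTERNAL_QUERY(
        -- Placeholder connection to the production MySQL database.
        "projects/our-project/locations/europe-west2/connections/production-db",
        -- Pull only the rows saved since the last (daily) transfer.
        "SELECT id, first_name, last_name, last_saved_at FROM user_user WHERE last_saved_at > NOW() - INTERVAL 1 DAY"
    )
) AS db
ON bq.id = db.id
WHEN MATCHED THEN
    UPDATE SET
        first_name = db.first_name,
        last_name = db.last_name,
        last_saved_at = db.last_saved_at
WHEN NOT MATCHED THEN
    INSERT (id, first_name, last_name, last_saved_at)
    VALUES (db.id, db.first_name, db.last_name, db.last_saved_at);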
When scheduling queries, it's important to consider how often a table's data should be transferred. Furthermore, queries should be scheduled during non-peak hours to minimise stress on our production database.
When scheduling transfers, keep in mind the schedules of related data tables. If we require a snapshot of some data and its relations, then we need to sync the schedules of all the related data tables.
Data Deletion Strategy
When we wish to delete any data rows, we shouldn't delete them immediately, as then we won't be able to sync the last data values to our data warehouse. Rather, we should schedule the data for deletion at some point after the next scheduled transfer. This allows us to record in the warehouse when data was deleted and what its values were at the time of deletion.
To support scheduled deletions, every data model will have a:
delete_after date-time field, which will be null if it's not scheduled for deletion.
delete_wait attribute to set how far in the future a data row should be scheduled for deletion.
delete function which overrides Django's delete function to set the deletion schedule.
from datetime import timedelta

from django.db import models
from django.utils.timezone import now


class BaseModel(models.Model):
    """The base model to be inherited by all models."""

    # How far in the future a row should be scheduled for deletion.
    delete_wait = timedelta(days=3)
    # When the row should actually be deleted; null if deletion isn't scheduled.
    delete_after = models.DateTimeField(null=True)

    def delete(self):
        # Don't remove the row now; schedule it for deletion instead.
        self.delete_after = now() + self.delete_wait
        self.save()
    ...
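For example, calling delete() on an instance of a model inheriting from BaseModel (such as the ExampleModel used in the deletion job below) only schedules the deletion:

instance = ExampleModel.objects.get(pk=1)
instance.delete()  # The row isn't removed; delete_after is set to 3 days from now.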
Deleting from our Database
Using CRON jobs, we can delete from our database any data whose scheduled deletion is now or in the past.
from django.utils.timezone import now

# QuerySet.delete() bypasses the overridden Model.delete(), so these rows really are removed.
ExampleModel.objects.filter(delete_after__lte=now()).delete()
Depending on the data table, its CRON job should run frequently enough that we don't store data long after its scheduled deletion. By default, a job should be scheduled every hour.
Unscheduling Deletions
We can optionally build flows in our system that allow users to unschedule data deletions.
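As a minimal sketch, assuming the flow has already loaded the relevant instance, unscheduling is just a matter of clearing the schedule:

instance.delete_after = None  # Clear the deletion schedule...
instance.save()  # ...so the CRON job above no longer picks up this row.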
Deleting from our Data Warehouse
We may wish to delete data that's too old for our BI operations, and to save storage costs. If so, we can create scheduled queries to delete all rows where last_saved_at is older than some threshold.
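For example, a scheduled query along these lines (the 3-year threshold is only a placeholder):

-- Delete warehouse rows that haven't been saved within the retention period.
DELETE FROM production.user_user
WHERE DATE(last_saved_at) < DATE_SUB(CURRENT_DATE(), INTERVAL 3 YEAR);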
If a user's personal data is stored in our warehouse, then we'll need to set it to null using a scheduled query which applies a similar filter.
UPDATE production.user_user
SET email = NULL
WHERE delete_after <= CURRENT_TIMESTAMP();