💾Storage

How we store our data

Database

Database servers are high performance machines suitable for real-time usage. Database [instances] store our current data, which is required for system operations.

There are many different SQL engines which can be used to power your SQL server. While each engine shares the standard syntax and functions, they also have their own distinct helper functions and non-standard syntax. Our SQL database engine of choice is MySQL.

Each environment should have its own database server and instance with the same engine. This is so we can trial server configurations in one environment without potentially disrupting other environments.

Data Backup Strategy

It's important that we backup our production database at least daily so that we may recover data in a disaster scenario. At all times, we should keep at least the last 7 days worth of backups. The frequency of backups and data retention period is subject to re-review depending on system usage.

A data backup is a snapshot of a database at a moment in time.

Users' Right to Erasure

Users have the right to [data] erasure from our database and data warehouse but not our backups. We require our backups as a safety net for disaster recovery. Furthermore, data shouldn't be deleted from backups as:

  1. this could potentially corrupt a backup.

  2. it allows for a potential attack scenario where an attacker could delete all of a user's data on their behalf with no possibility of recovery.

Handling Disaster Recovery

In a disaster scenario where our production database needs to be restored from a backup, we would be restoring previously deleted user data. This data needs to be re-deleted to uphold users' right to erasure. This can be achieved by maintaining a list of all the deleted primary keys for each model. This deletion list can then be cross referenced with the backup to select previously deleted keys.

Primary keys must never be recycled.

Our data warehouse can track if/when a primary key was scheduled for deletion.

Data Integrity

When writing data to a database as part of a larger process, it's common for an error to be raised during a proceeding step. However, if not managed properly, this can lead to data integrity issues.

Example

Let's say it's a requirement of our system that a user must verify their email address whenever a new user is created. Therefore, we would first need to create the user on our database and then send them a verification email once their user account has been created.

If an error was raised when sending their verification email, we should undo the creation of their user account and inform the user of the unsuccessful creation. Otherwise, they would never be able to verify their account or register again since their account would already exist.

Atomic Transactions

MySQL (as well as most SQL engines) allows you to create a collection of SQL statements called a transaction. Transactions can be made to "roll back" all previous statements if a statement raises an error. In other words, it's all or nothing. This is critical to maintaining data integrity.

Atomicity is the "all or nothing" nature of transactions. Atomic transactions are those which roll back all previous statements if a statement raises an error.

Django supports making all HTTP requests atomic. This should be enabled by default.

"It works like this. Before calling a view function, Django starts a transaction. If the response is produced without problems, Django commits the transaction. If the view produces an exception, Django rolls back the transaction."

Data Warehouse

Data warehouses store all historical data, which is required for business intelligence (BI). Databases are not suitable for this purpose as they are high performance and come at a high cost.

Example

Say we store our user's favourite colour in a database table and the user changes it from red to green.

  • In our database table we would update the value in an existing row. The previous value would no longer be stored - just the latest value.

  • In our data warehouse we would insert the value in a new row. The previous value would be stored (in a previous row).

BigQuery

Our data warehouse of choice is Google's BigQuery (BQ). Our BQ project is found here.

BigQuery uses the GoogleSQL engine.

Data Tables

As a counterpart to our production database, we should have a BQ dataset named "production". Within the dataset, we should have a BQ table for each table present in the database that has the same name for traceability. Data will be transferred from the production tables to their counterpart BQ table.

A dataset is a collection of data tables.

We can optionally have datasets for our non-production environments too, where each dataset is named after its source environment. Although this data will not produce useful insights, it may be useful for testing purposes. However, the size of these datasets should be managed to avoid unnecessary expenditure.

How to Design a Table's Schema

In its simplest form, a BQ table's schema must be a subset of its database counterpart. We must consider data's sensitivity before transferring. Sensitive columns should be omitted by default, unless explicitly required. Non-sensitive columns should be included by default, unless explicitly not required.

New columns can be created in a BQ table that aren't present in their counterpart database table. This may be useful to save the result of an expensive calculation or retain some useful metadata of an omitted column. Metadata columns should be prepended with the original column's name for traceability.

Sensitive data is data that can be used to identify individual persons and may be found in one or multiple columns together.

Columns containing personally identifiable data must be omitted in BQ tables by default. Be wary that while individual columns may not contain personally identifiable data, data from 2 or more columns may be enough to identify individual persons.


Example of data rows in a database table and its BQ table counterpart.

Database: production -> Table: user_user

Dataset: production -> Table: user_user

Example Breakdown
  • The name column was dropped entirely as it contained sensitive data and we did not wish to collect any metadata from it.

  • The email column contains sensitive data so it couldn't be transferred directly. However, we wished to retain the domain of each email in the new metadata column email_domain.

  • The favourite_colour column doesn't contain sensitive data and we wish to track this data over time. The column can be transferred directly.

  • The favourite_food column doesn't contain sensitive data but we don't wish to track this data. This column has been omitted.

Tables' primary key must always be transferred to support JOIN's in SQL queries.

How to Create a Table

  1. Compose a new query in BQ Studio:

  1. Write a CREATE TABLE SQL statement:

CREATE TABLE production.user_user (
  first_name VARCHAR(30),
  last_name VARCHAR(30)
);
  1. Save the SQL statement to the project with naming convention "create {TABLE_NAME}". This is useful if the table needs to be created in multiple environments' dataset.

  2. Execute the SQL statement.

There are other ways to create a table but they do not support setting all the creation options we require. For example, BQ Studio has a graphical schema editor.

How to Create Data Table Relations

An important distinction between databases and data warehouses is that keys can NOT be enforced. However, it's still worth defining non-enforced keys when creating tables as it allows Google's SQL engine to more efficiently join data tables.

CREATE TABLE production.user_user (
  id INT NOT NULL,
  first_name VARCHAR(30),
  last_name VARCHAR(30),
  PRIMARY KEY (id) NOT ENFORCED
);

Primary and foreign keys can not be enforced and are merely suggestions.

Scheduled Queries

Our scheduled queries is how we transfer data from our production database to our data warehouse. Data is not transferred continuously, like our production database, but rather in regular intervals and in bulk. Queries must be scheduled to pull in all modified data since the last transfer from a production table into its counterpart BQ table.

A query is a statement that specifically retrieves data and helps clarify their unique roles within the SQL language. On the other hand, statements include a broader set of operations, such as INSERT, UPDATE, DELETE, CREATE, ALTER, and DROP.

Scheduling Strategy

To transfer only data that has been created or updated since the last transfer, every data model will have a last_saved_at date-time field:

from django.db import models

class BaseModel(models.Model):
    """The base model to be inherited by all models."""
    last_saved_at = models.DateTimeField(auto_now=True)
    ...

auto_now automatically sets the field to now every time the object is saved.

When scheduling transfers, the SQL queries must filter for data saved since the last transfer.

Example

Say we scheduled a transfer from our user_user table at 06:00 everyday:

SELECT *
FROM EXTERNAL_QUERY("db_server.production", '''
  SELECT id,
    first_name,
    last_name,
    last_saved_at
  FROM production.user_user
  WHERE DATE(last_saved_at) >= CURDATE() - 1
    AND TIME(last_saved_at) >= "06:00:00";
''');

This will retrieve all users that were updated since 06:00 yesterday.

When scheduling queries, it's important to consider how often a table's data should be transferred. Furthermore, queries should be scheduled during non-peak hours to minimise stress on our production database.

By default, queries should be scheduled daily at 07:00 UTC.

When scheduling transfers, keep in mind the schedules of related data tables. If we require a snapshot of some data and its relations, then we need to sync the schedules of all the related data tables.

Data Deletion Strategy

When we wish to delete any data rows, we shouldn't delete them immediately as then we won't be able to sync the last data values to our data warehouse. Rather, we should schedule the data for deletion at some point after the next scheduled transfer. This allows us to warehouse when data was deleted data and what the values where at the time of deletion.

To support scheduled deletions, every data model will have a:

  1. delete_after date-time field, which will be null if it's not scheduled for deletion.

  2. delete_wait attribute to set how far in the future a data row should be scheduled for deletion.

  3. delete function which overrides Django's delete function to set the deletion schedule.

from datetime import timedelta
from django.db import models
from django.utils.timezone import now

class BaseModel(models.Model):
    """The base model to be inherited by all models."""
    
    delete_wait = timedelta(days=3)
    delete_after = models.DateTimeField(null=True)
    
    def delete(self):
        self.delete_after = now() + self.delete_wait
        self.save()
    ...

It's recommended to schedule deletions after the next 2-3x scheduled transfers so that if the first data transfer fails, we have a small window of opportunity to fix the problem. If we don't manage to fix the problem in the window, we must uphold our obligation to delete the data and lose that piece of history in our warehouse.

Deleting from our Database

Using CRON jobs, we can delete any data whose scheduled deletion is now or in the past from our database.

from django.utils.timezone import now

ExampleModel.objects.filter(delete_after__lte=now()).delete()

Depending on the data table, a CRON job should run frequently so that we don't store data long after its scheduled deletion. By default, a job should be scheduled every hour.

Unscheduling Deletions

We can optionally build flows in our system that allow users to unschedule data deletions.

Example

After a user schedules their account for deletion, they may be auto-logged out and redirected to the home page. Then, we present a message explaining that their account has been scheduled for deletion but if they log in again before it's been deleted, its deletion will be unscheduled. Facebook does something similar to this.

Deleting from our Data Warehouse

We may wish to delete data that's too old for our BI operations and to save storage costs. If so, we can create scheduled queries to delete all rows where last_saved_at is less than some threshold.

If a user's personal data is stored in our warehouse, then we'll to set it to null using a scheduled query which applies a similar filter.

UPDATE user_user
SET email = NULL
WHERE delete_after <= NOW();

Last updated