Encrypting Integer Primary Keys in Python
UUIDs are popular among developers for a number of reasons. These include:
- They can be generated by seperate actors with a very low chance of collision
- They do not leak information about a system
This article will ignore the first point, and examine the second in more detail.
Sequential Primary Keys
Consider an application that needs to manage a table of orders. With a sequential primary key, the schema for the table might look like this:
CREATE TABLE orders (
id INT(11) NOT NULL AUTO_INCREMENT,
# Other fields...
PRIMARY KEY (id)
);
If we had a REST API route that retrieved these orders, we would see queries such as GET /order/5
and GET /order/167
. This leaks unnecessary information about our application to the outside world, it might allow a rival to work out how many orders we had processed and learn confidential information about our business. It also opens up another attack surface, as it makes it easy for an attacker to predict the identifier of past and future orders.
Using a UUID as a Primary Key
Using a UUID as a primary key mitigates some of these problems. Depending on how they are generated, UUIDs can be predictable (although much less so than integer primary keys). However, they do not leak information about table volume. Our REST API would now see queries such as GET /order/123e4567-e89b-42d3-a456-556642440000
and GET /order/cd41294a-afb0-11df-bc9b-00241dd75637
. Our table schema might look like this:
CREATE TABLE orders (
id VARCHAR(36),
# Other fields...
PRIMARY KEY (id)
)
If we were using UUID_TO_BIN()
and BIN_TO_UUID()
, we might be able to store the UUID more efficiently as a binary field:
CREATE TABLE orders (
id BINARY(16),
# Other fields...
PRIMARY KEY (id)
)
Performance Problems
The problem with using a UUID as a primary key is that some database storage engines, such as InnoDB, use an index-organized data storage technique. This means that when inserting randomly distributed values (such as a UUID) the engine has to insert data randomly throughout the table, which is not very efficient. For these storage engines, it is much more performant to use a sequential primary key.
Some application developers solve this problem by using a combination of both fields together:
CREATE TABLE orders (
id INT(11) NOT NULL AUTO_INCREMENT,
uuid VARCHAR(36) NOT NULL UNIQUE,
# Other fields...
PRIMARY KEY (id)
);
Another option, is to revert back to our original schema design with a simple sequential primary key and use cryptography to encrypt the identifier before it is exported from the application.
Encrypting Integer Primary Keys with Python
As a solution to this problem, I have created a Python class that can be used to encrypt and decrypt primary keys. The class depends on the cryptography module. A demonstration of the API within the Python shell is shown below:
>>> PrimaryKeyEncryptor.generate_secret()
'134810bef265eabbfcc33c8956b9c870'
>>> PrimaryKeyEncryptor('134810bef265eabbfcc33c8956b9c870').encrypt(1)
'ec68e9d12ae27ce7ce9bad7488ecf665'
>>> PrimaryKeyEncryptor('134810bef265eabbfcc33c8956b9c870').encrypt(2)
'd2f225fd542442c6e12ab19943749391'
>>> PrimaryKeyEncryptor('134810bef265eabbfcc33c8956b9c870').decrypt('ec68e9d12ae27ce7ce9bad7488ecf665')
1
>>> PrimaryKeyEncryptor('134810bef265eabbfcc33c8956b9c870').decrypt('d2f225fd542442c6e12ab19943749391')
2
Our REST API would now see queries such as GET /order/ec68e9d12ae27ce7ce9bad7488ecf665
and GET /order/d2f225fd542442c6e12ab19943749391
. Another side benefit of this technique, is that it allows identifiers to be validated without touching the database.
In the example below, I change a single bit of the cipher text and the encryptor can recognise that it is not valid:
>>> PrimaryKeyEncryptor('134810bef265eabbfcc33c8956b9c870').decrypt('d2f225fd542442c6e12ab19943749393')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/test.py", line 105, in decrypt
raise ValueError('The encrypted primary key is invalid')
ValueError: The encrypted primary key is invalid
This works because the block size of the algorithm chosen (AES-128) is 128 bits, but we only have 64 bits to encrypt. This means that we have space to encrypt it twice. When we decrypt the cipher text, if the first 64 bits do not match the last 64 bits we know the cipher text is invalid.
The code for this class is shown below:
"""This module contains a python class that was designed for encrypting sequential integer primary keys
Typical usage example:
# Generate a secret (done once)
secret = PrimaryKeyEncryptor.generate_secret()
# Instantiate encryptor
encryptor = PrimaryKeyEncryptor(secret)
# Encrypt
encrypted = encryptor.encrypt(primary_key)
# Decrypt
decrypted_primary_key = encryptor.decrypt(encrypted)
"""
from os import urandom
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher
from cryptography.hazmat.primitives.ciphers.algorithms import AES as Algorithm
from cryptography.hazmat.primitives.ciphers.modes import ECB as Mode
class PrimaryKeyEncryptor:
"""This class uses AES-128 in ECB mode to generate a secure mapping between a 64 bit integer and a 32 character hex
string.
Note:
In many other contexts, the use of AES-128 in ECB mode is insecure. Please do not use this example for anything
other than the explit purpose of encrypting sequential integer primary keys generated by a storage engine.
"""
def __init__(self, secret: str):
"""Initialize a PrimaryKeyEncryptor class.
Args:
secret: A 32 character (16 byte) hex string which is used as a key for encryption and decryption.
Raises:
ValueError: If the secret is not a 32 byte hex string.
"""
secret_bytes = bytes.fromhex(secret)
if len(secret_bytes) != 16:
raise ValueError('The secret for the PrimaryKeyEncryptor must be 16 bytes in hexadecimal format')
algorithm = Algorithm(secret_bytes)
mode = Mode()
self.cipher = Cipher(algorithm, mode, backend=default_backend())
@staticmethod
def generate_secret() -> str:
"""Generate a secret that can be used to instantiate the PrimaryKeyEncryptor class.
Returns:
A 32 charachter (16 byte) hex string which is suitable to use as a key for encryption and decryption.
"""
return urandom(16).hex()
def encrypt(self, primary_key: int) -> str:
"""Encrypt an integer primary key
Args:
primary_key: An integer to encrypt, this cannot be more than 64 bits in size.
Returns:
The encrypted integer primary key as a 32 charachter (16 byte) hex string.
Raises:
ValueError: If the primary key is not an integer of 64 bits or less.
"""
primary_key_bytes = primary_key.to_bytes(8, byteorder='big')
encryptor = self.cipher.encryptor()
cipher_bytes = encryptor.update(primary_key_bytes * 2) + encryptor.finalize()
return cipher_bytes.hex()
def decrypt(self, encrypted_primary_key: str) -> int:
"""Encrypt an integer primary key
Args:
encrypted_primary_key: The encrypted integer primary key as a 32 character (16 byte) hex string.
Returns:
The decrypted integer primary key.
Raises:
ValueError: If the encrypted primary key is not valid.
"""
cipher_bytes = bytes.fromhex(encrypted_primary_key)
if len(cipher_bytes) != 16:
raise ValueError('The encrypted primary key must be 16 bytes in hexadecimal format')
decryptor = self.cipher.decryptor()
plain_bytes = decryptor.update(cipher_bytes) + decryptor.finalize()
if plain_bytes[:8] != plain_bytes[8:]:
raise ValueError('The encrypted primary key is invalid')
return int.from_bytes(plain_bytes[:8], byteorder='big')
Ordered UUIDs
When I shared this post on Twitter, somebody pointed out another alternative:
There is also the possibility to rearrange the structure of the uuid to have the time part up front. We have done this with @ramsey /uuidhttps://t.co/qeuWs9jZO4https://t.co/q0WhjfZGRb
— Andreas 'Ando 🦆' Frömer (@icanhazstring) May 17, 2020
The basic concept is that a UUID is constructed of many different components, including a timestamp. However, the structure of a UUID does not ensure that UUIDs appear in chronological order when they are sorted. By rearranging the UUID such that this is true, it is possible to improve their performance as a primary key quite significantly. A blog post on Percona covers this topic with benchmarks in detail.