How to Create PostgreSQL Test Data

Introduction#

Developing high quality software inevitably requires some testing data.

You could be:

  • Integration testing your application for correctness and regressions
  • Testing the bounds of your application in your QA process
  • Testing the performance of queries as the size of your dataset increases

Whatever the case, testing data is an integral part of the software development lifecycle and of everyday developer workflow. In this article, we'll explore three different methods for generating test data for a Postgres database.

Setup#

In this example we'll be using Docker to host our Postgres database.

To get started, you'll need to install Docker and start a container running Postgres:

% docker run -p 5432:5432 -d -e POSTGRES_PASSWORD=1234 -e POSTGRES_USER=postgres -e POSTGRES_DB=dev postgres

As you can see, we've set very insecure default credentials. This is not meant to be a robust / productionised instance, but it'll do for our testing harness.

Our Schema#

In this example we'll set up a very simple schema. We're building a basic app where we have a bunch of companies, and those companies have contacts.

CREATE TABLE companies(
    company_id SERIAL PRIMARY KEY,
    company_name VARCHAR(255) NOT NULL
);

CREATE TABLE contacts(
    contact_id SERIAL PRIMARY KEY,
    company_id INT,
    contact_name VARCHAR(255) NOT NULL,
    phone VARCHAR(20),
    email VARCHAR(100),
    CONSTRAINT fk_company
        FOREIGN KEY(company_id)
        REFERENCES companies(company_id)
);

This schema captures some business logic of our app. We have unique primary keys, we have foreign key constraints, and we have some domain-specific data types which have 'semantic meaning'. For example, the random string _SX Æ A-ii is not a valid phone number.

Let's get started.

Manual Insertion#

The first approach, which works well when you're starting your project, is to literally insert all the data you need by hand: just write a SQL script with a bunch of INSERT statements. The only thing to really think about is the insertion order, so that you don't violate foreign key constraints.

INSERT INTO companies(company_name)
VALUES ('BlueBird Inc'),
       ('Dolphin LLC');

INSERT INTO contacts(company_id, contact_name, phone, email)
VALUES (1, 'John Doe', '(408)-111-1234', 'john.doe@bluebird.dev'),
       (1, 'Jane Doe', '(408)-111-1235', 'jane.doe@bluebird.dev'),
       (2, 'David Wright', '(408)-222-1234', 'david.wright@dolphin.dev');

So here we're inserting directly into our database. This method is straightforward, but it does not scale when you need more data or as the complexity of your schema increases. Also, testing for edge cases requires hard-coding those edge cases into the inserted data - resulting in a linear amount of work for the bugs you want to catch.

| contact_id | company_id | contact_name | phone | email |
|------------|------------|--------------|-------|-------|
| 1 | 1 | John Doe | (408)-111-1234 | john.doe@bluebird.dev |
| 2 | 1 | Jane Doe | (408)-111-1235 | jane.doe@bluebird.dev |
| 3 | 2 | David Wright | (408)-222-1234 | david.wright@dolphin.dev |
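
To illustrate that point, every edge case you want to cover means hand-writing yet another row. A minimal sketch (the values and cases below are made up for illustration, not from the article) might look like this:

-- Hypothetical edge-case rows: a company with no contacts, a contact with
-- no phone number, and a contact name right at the 255-character limit.
INSERT INTO companies(company_name)
VALUES ('Orphaned Holdings');

INSERT INTO contacts(company_id, contact_name, phone, email)
VALUES (1, 'No Phone Person', NULL, 'no.phone@bluebird.dev'),
       (2, repeat('x', 255), '(408)-333-0000', 'long.name@dolphin.dev');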

Using generate_series to automate the process#

Since you're a programmer, you don't like manual work. You like things to be seamless and most importantly automated!

Postgres comes with a handy function called generate_series which, ...drum roll... generates series! We can use this to generate as much data as we want without writing it by hand.
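
If you haven't come across it before, here's a minimal sketch of what generate_series returns on its own (output shown as comments):

SELECT * FROM generate_series(1, 5);
-- generate_series
-- ---------------
--               1
--               2
--               3
--               4
--               5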

Let's use generate_series to create 100 companies and 100 contacts:

INSERT INTO companies(company_name)
SELECT md5(random()::text)
FROM generate_series(1,100);
INSERT INTO contacts(company_id, contact_name, phone, email)
SELECT id, md5(random()::text), md5(random()::text)::varchar(20), md5(random()::text)
FROM generate_series(1,100) id;
contact_idcompany_idcontact_namephoneemail
1181cc02c106b7c30d4e2b032c91cdb75ad056f1eee1dca55db03ccd0da2eef81aaa02d6ba15ef4551fb9f
22d2b0112bc9bbec85c5229a4b4f28a35007ba86b1dc24cdadfd247404f5b502084563f2ac20c29ed0e584
3364005702ecaff9f489e8074d6a718aae50db9534b58e0616cd343ea36293665aa1ac38e7d6371893046a
44202e87bc3d0c8c080048b2c0138c709b65f6ea317bd0f2c950dc8b8d9b92916f4cf77c38308f6ac4391b
558b2fd25d7b95158df5af671cb32557553e6ddc67aabe7164ce9aed32035400a7500203352f3597d2548f

We generated 100 companies and 100 contacts here, and the types are correct, but the output is underwhelming. First of all, every company has exactly one contact, and more importantly, the actual data looks completely useless.

If you care about your data being semantically correct (i.e. the text in your phone column actually being a phone number), you need to get more sophisticated.

We could define functions ourselves to generate names, phone numbers, emails etc., but why re-invent the wheel?
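
Just to illustrate the wheel we'd be re-inventing, here is a rough sketch of that do-it-yourself route - a throwaway phone-number function plus an insert that picks a random company_id, so companies no longer get exactly one contact each. The function name and formats are hypothetical, and it assumes the 100 companies from the previous step already exist:

-- A hand-rolled, purely illustrative phone-number generator.
CREATE OR REPLACE FUNCTION random_phone() RETURNS text AS $$
  SELECT '(' || (100 + floor(random() * 900))::int || ')-'
             || (100 + floor(random() * 900))::int || '-'
             || (1000 + floor(random() * 9000))::int;
$$ LANGUAGE sql;

INSERT INTO contacts(company_id, contact_name, phone, email)
SELECT (1 + floor(random() * 100))::int,   -- random FK, assuming company_ids 1-100 exist
       'contact_' || id,
       random_phone(),
       'contact_' || id || '@example.com'
FROM generate_series(1, 500) AS id;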

Using a data generator like Synth#

Synth is an open-source project designed to solve the problem of creating realistic testing data. It has integration with Postgres, so you won't need to write any SQL.

Synth uses declarative configuration files (just JSON, don't worry) to define how data should be generated. To install the synth binary, refer to the installation page.

The first step in using Synth is to create a workspace. A workspace is just a directory in your filesystem that tells Synth this is where you are going to store configuration:

$ mkdir workspace && cd workspace && synth init

Next we want to create a namespace (basically a stand-alone data model) for this schema. We do this by simply creating a subdirectory and Synth will treat it as a separate schema:

$ mkdir my_app

Now comes the fun part! Using Synth's configuration language we can specify how our data is generated. Let's start with the smaller table companies.

To tell Synth that companies is a table (or collection, in Synth lingo) we'll create a new file my_app/companies.json.

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "number",
            "id": {}
        },
        "company_name": {
            "type": "string",
            "faker": {
                "generator": "company"
            }
        }
    }
}

Here we're telling Synth that we have 2 columns, company_id and company_name. The first is a number, the second is a string, and the contents of the JSON object define the constraints of the data.

If we sample some data using this data model we get the following:

$ synth generate my_app/ --size 2
{
    "companies": [
        {
            "company_id": 1,
            "company_name": "Campbell Ltd"
        },
        {
            "company_id": 2,
            "company_name": "Smith PLC"
        }
    ]
}

Now we can do the same thing for the contacts table by creating a file my_app/contacts.json. Here we have the added complexity of a foreign key constraint to the companies table, but we can solve it easily using Synth's same_as generator.

{
    "type": "array",
    "length": {
        "type": "number",
        "constant": 1
    },
    "content": {
        "type": "object",
        "company_id": {
            "type": "same_as",
            "ref": "companies.content.company_id"
        },
        "contact_name": {
            "type": "string",
            "faker": {
                "generator": "name"
            }
        },
        "phone": {
            "type": "string",
            "faker": {
                "generator": "phone_number",
                "locales": ["en_GB"]
            }
        },
        "email": {
            "type": "string",
            "faker": {
                "generator": "email"
            }
        }
    }
}

There is quite a bit going on here - to get an in-depth understanding of the Synth configuration, I'd recommend reading the comprehensive docs. There are tons of cool features which this schema can't really showcase!

Now that we have both of our tables modelled in Synth, we can generate data into Postgres:

$ synth generate my_app/ --to postgres://postgres:1234@localhost:5432/dev

Taking a look at the contacts table:

| contact_id | company_id | contact_name | phone | email |
|------------|------------|--------------|-------|-------|
| 1 | 1 | Carrie Walsh | +44(0)117 496 0785 | espinozabetty@hotmail.com |
| 2 | 2 | Brittany Flores | +441632 960 480 | osharp@mcdaniel.com |
| 3 | 3 | Tammy Rodriguez | 01632960737 | brenda82@ward.org |
| 4 | 4 | Amanda Marks | (0808) 1570096 | hwilcox@gonzalez.com |
| 5 | 5 | Kimberly Delacruz MD | +44(0)114 4960207 | pgarcia@thompson.com |
| 6 | 6 | Jordan Williamson | (0121) 4960483 | jamesmiles@weber.org |
| 7 | 7 | Nicholas Williams | (0131) 496 0974 | fordthomas@gmail.com |

Much better :)

Conclusion#

We explored 3 different ways to generate data.

  • Manual Insertion: Fine to get you started. If your needs are basic, it's the path of least effort to a working dataset.
  • Postgres generate_series: This method scales better than manual insertion - but if you care about the contents of your data and have foreign key constraints, you'll need to write quite a bit of bespoke SQL by hand.
  • Synth: Synth has a small learning curve, but when you need realistic test data at scale it removes most of the manual labour.

In the next post we'll explore how to subset your existing database for testing purposes. And don't worry if you have sensitive / personal data - we'll cover that too.

Create realistic test data for your web app

So we've all been in this situation. You're building a web app, you're super productive in your stack and you can move quickly - however, generating lots of data to see what your app will look like with enough users and traffic is a pain.

Either you're going to spend a lot of time manually inputting data or you're going to write some scripts to generate that data for you. There must be a better way.

In this post we're going to explore how we can solve this problem using the open-source project Synth. Synth is a state-of-the-art declarative data generator - you tell Synth what you want your data to look like and Synth will generate that data for you.

This tutorial is going to use a simple MERN (Mongo Express React Node) web-app as our test subject, but really Synth is not married to any specific stack.

I'm going to assume you're working on macOS or Linux (Windows support coming soon 🤞) and that you have NodeJS, Yarn and Docker installed.

For this example we'll be running Synth version 0.3.2.

Getting started#

As a template, we'll use a repository which will give us scaffolding for the MERN app. I picked this example because it shows how to get started quickly with a MERN stack, where the end product is a usable app you can write in 10 minutes. For our purposes, we don't really need to build it from scratch, so let's just clone the repo and avoid writing any code ourselves.

git clone https://github.com/samaronybarros/movies-app.git && cd movies-app

Next, we'll be using Docker to run an ephemeral version of our database locally. Docker is great for getting started quickly with popular software, and luckily for us MongoDB has an image on the Docker registry. So let's set up an instance of MongoDB to run locally (no username / password):

docker run -d --name mongo-on-docker -p 27017:27017 mongo

Starting the Web App#

The repository we just cloned contains a working end-to-end web-app running on a MERN stack. It's a super simple CRUD application enabling the user to add / remove some movie reviews which are persisted on a MongoDB database.

The app consists of 2 main components, a nodejs server which lives under the movies-app/server/ sub-directory, and a React front-end which lives under the movies-app/client sub-directory.

The client and server talk to each other using a standard HTTP API under /movie.

So let's get started and run the back-end:

cd server && yarn install && node index.js

And then the client (you'll need two terminals here 🤷):

cd client && yarn install && yarn start

Cool! If you navigate to http://localhost:8000/ you should see the React app running 🙂

Let's add some movies by hand#

Hold the phone. Why are we adding movies by hand when we have a tool to generate data for us?

Well, by adding a little bit of test data by hand, we can then use Synth to infer the structure of the data and create as many movies as we want. Otherwise we would have to write the entire data definition (what we call a schema) by hand.

So, let's add a couple of movies manually using the Web UI.

[Screenshot: adding a couple of movies through the web UI]

Ok, so now that we have a couple of movies, let's get started with Synth!

Synth#

In the following section we will cover how Synth fits into the Web App development workflow:

  1. First we'll install the Synth binary
  2. Then we'll initialize a Synth workspace in our repo to host our data model
  3. Next we'll ingest data from MongoDB into Synth
  4. And finally we'll generate a bunch of fake data with Synth and load it back into Mongo

Installing Synth#

To install Synth on MacOS / Linux, visit the docs and choose the appropriate installation for your OS. If you are feeling adventurous, you can even build from source!

Declarative Data Generation#

Synth uses a declarative data model to specify how data is generated.

Hmmm, so what is a declarative model you may ask? A declarative model, as opposed to an imperative model, is where you 'declare' your desired end state and the underlying program will figure out how to get there.

On the other hand, an imperative model (which is what we are mostly used to) gives step-by-step instructions on how to get to our end state. Most popular programming languages, like Java or C, are imperative - your code is step-by-step instructions on how to reach an end state.

Languages and frameworks like SQL, React or Terraform are declarative. You don't specify how to get to your end state; you just specify what you want, and the underlying program will figure out how to get there.

With Synth, you specify what your desired dataset should look like, not how to make it. Synth figures out how to build it for you 😉
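
SQL itself is a handy illustration of the difference. The query below (against a hypothetical movies table) only declares the shape of the result; the query planner decides how to scan, filter and sort to produce it:

-- Declarative: we say *what* we want...
SELECT name, rating
FROM movies
WHERE rating >= 8
ORDER BY rating DESC;
-- ...but never *how* to compute it - no loops, no explicit index lookups.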

Creating a Workspace#

A workspace represents a set of synthetic data namespaces managed by Synth. Workspaces are marked by a .synth/ sub-directory.

A workspace can have zero or more namespaces, where the namespaces are just represented as sub-directories. All information pertaining to a workspace is in its directory.

So let's create a sub-directory called data/ and initialize our Synth workspace:

movies-app $ mkdir data && cd data && synth init

Namespaces#

The namespace is the top-level abstraction in Synth. Namespaces are the equivalent of Schemas in SQL-land. Fields in a namespace can refer to other fields in a namespace - but you cannot reference data across namespaces.

Namespaces, in turn, have collections, which are kind of like tables in SQL-land. A visual example of the namespace/collection hierarchy can be seen below.

[Figure: a workspace containing namespaces, each of which contains collections]

To create a namespace, we need to feed some data into Synth.

Feeding Data into Synth#

There are two steps to feed data into Synth from our MongoDB instance:

  1. We need to export data from MongoDB into a format that Synth can ingest. Luckily for us, Synth supports JSON out of the box, so this can be done quite easily with the mongoexport command - a lightweight tool that ships with MongoDB to enable quick dumps of the database via the CLI. We need to specify a little bit of metadata, such as the database we want to export from using --db cinema, the collection using --collection, and the specific fields we are interested in using --fields name,rating,time. We want the data from mongoexport to be in a JSON array so that Synth can easily parse it, so let's specify the --jsonArray flag.
  2. Next, we need to create a new Synth namespace using the synth import command. synth import supports a --from flag if you want to import from a file, but if this is not specified it will default to reading from stdin. We need to feed the output of the mongoexport command into Synth. To do this we can use the convenient Bash pipe | to redirect the stdout from mongoexport into Synth's stdin.

docker exec -i mongo-on-docker mongoexport \
    --db cinema \
    --collection movies \
    --fields name,rating,time \
    --forceTableScan \
    --jsonArray | synth import cinema --collection movies

Synth runs an inference step on the JSON data that it's fed, trying to infer the structure of the data. Next, Synth automatically creates the cinema namespace by creating the cinema/ sub-directory and populating it with the collection movies.json.

$ tree -a data/
data/
├── .synth
│   └── config.toml
└── cinema
    └── movies.json

We can now use this namespace to generate some data:

$ synth generate cinema/
{
    "movies": [
        {
            "_id": {
                "$oid": "2D4p4WBXpVTMrhRj"
            },
            "name": "2pvj5fas0dB",
            "rating": 7.5,
            "time": [
                "TrplCeFShATp2II422rVdYQB3zVx"
            ]
        },
        {
            "_id": {
                "$oid": "mV57kUhvdsWUwiRj"
            },
            "name": "Ii7rH2TSjuUiyt",
            "rating": 2.5,
            "time": [
                "QRVSMW"
            ]
        }
    ]
}

So now we've generated data with the same schema as the original - but the values of the data points don't really line up with the semantic meaning of our dataset. For example, the time array is just garbled text, not actual times of the day.

The last step is to tweak the Synth schema and create some realistic-looking data!

Tweaking the Synth schema#

So let's open cinema/movies.json in our favorite text editor and take a look at the schema:

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
            "low": 1,
            "high": 4,
            "step": 1
        }
    },
    "content": {
        "type": "object",
        "time": {
            "type": "array",
            "length": {
                "type": "number",
                "subtype": "u64",
                "range": {
                    "low": 1,
                    "high": 2,
                    "step": 1
                }
            },
            "content": {
                "type": "one_of",
                "variants": [
                    {
                        "weight": 1.0,
                        "type": "string",
                        "pattern": "[a-zA-Z0-9]*"
                    }
                ]
            }
        },
        "name": {
            "type": "string",
            "pattern": "[a-zA-Z0-9]*"
        },
        "_id": {
            "type": "object",
            "$oid": {
                "type": "string",
                "pattern": "[a-zA-Z0-9]*"
            }
        },
        "rating": {
            "type": "number",
            "subtype": "f64",
            "range": {
                "low": 7.0,
                "high": 10.0,
                "step": 1.0
            }
        }
    }
}

There is a lot going on here but let's break it down.

The top-level object (which represents our movies collection) is of type array - where the content of the array is an object with 4 fields, _id, name, time, and rating.

We can completely remove the _id field, since this is automatically managed by MongoDB, and get started making our data look real. You may want to keep the Generators Reference open here as you follow along.

Rating#

First let's change the rating field. Our app can only accept numbers between 0 and 10 inclusive in increments of 0.5. So we'll use the Number::Range content type to represent this and replace the existing value:

{
    "range": {
        "high": 10,
        "low": 0,
        "step": 0.5
    },
    "subtype": "f64",
    "type": "number"
}

Time#

The time field has been correctly detected as an array of values. First of all, let's say a movie can be shown up to 5 times a day, so we'll change the high field at time.length.range to 6 (high is exclusive). At this stage the values are just random strings, so let's use the String::DateTime content type instead to generate hours of the day.

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
            "low": 1,
            "high": 6,
            "step": 1
        }
    },
    "content": {
        "type": "one_of",
        "variants": [
            {
                "weight": 1.0,
                "type": "string",
                "date_time": {
                    "subtype": "naive_time",
                    "format": "%H:%M",
                    "begin": "12:00",
                    "end": "23:59"
                }
            }
        ]
    }
}

Name#

Finally, the movie name field should be populated with realistic looking movie names.

Under the hood, Synth uses the Python Faker library to generate so-called 'semantic types' (think credit card numbers, addresses, license plates etc.). Unfortunately Faker does not have movie names, so we can use a random text generator with a capped output size instead.

So let's use the String::Faker content type to generate some fake movie names!

{
    "type": "string",
    "faker": {
        "generator": "text",
        "max_nb_chars": 20
    }
}

Final Schema#

So, making all the changes above, we can use our beautiful finished schema to generate data for our app:

{
    "type": "array",
    "length": {
        "type": "number",
        "subtype": "u64",
        "range": {
            "low": 1,
            "high": 2,
            "step": 1
        }
    },
    "content": {
        "type": "object",
        "name": {
            "type": "string",
            "faker": {
                "generator": "text",
                "max_nb_chars": 20
            }
        },
        "time": {
            "optional": false,
            "type": "array",
            "length": {
                "type": "number",
                "subtype": "u64",
                "range": {
                    "low": 1,
                    "high": 6,
                    "step": 1
                }
            },
            "content": {
                "type": "one_of",
                "variants": [
                    {
                        "weight": 1.0,
                        "type": "string",
                        "date_time": {
                            "subtype": "naive_time",
                            "format": "%H:%M",
                            "begin": "00:00",
                            "end": "23:59"
                        }
                    }
                ]
            }
        },
        "rating": {
            "range": {
                "high": 10,
                "low": 0,
                "step": 0.5
            },
            "subtype": "f64",
            "type": "number"
        }
    }
}

$ synth generate cinema/ --size 5
{
    "movies": [
        {
            "name": "Tonight somebody.",
            "rating": 7,
            "time": [
                "15:17"
            ]
        },
        {
            "name": "Wrong investment.",
            "rating": 7.5,
            "time": [
                "22:56"
            ]
        },
        {
            "name": "Put public believe.",
            "rating": 5.5,
            "time": [
                "20:32",
                "21:06",
                "16:15"
            ]
        },
        {
            "name": "Animal firm public.",
            "rating": 8.5,
            "time": [
                "20:06",
                "20:25"
            ]
        },
        {
            "name": "Change member reach.",
            "rating": 8.0,
            "time": [
                "12:36",
                "14:34"
            ]
        }
    ]
}

Ah, much better!

Generating data from Synth into MongoDB#

So now that we can generate as much correct data as we want, let's point Synth at MongoDB and let loose the dogs of war.

This step can be broken into two parts:

  1. Run the synth generate command with our desired collection movies and specifying the number of records we want using the --size field.
  2. Pipe stdout to the mongoimport command, mongoexport's long-lost cousin. Again, we specify the database we want to import into with --db cinema and the specific collection movies. We also pass the --jsonArray flag to notify mongoimport that it should expect a JSON array.

synth generate cinema/ \
    --collection movies \
    --size 1000 \
    | docker exec -i mongo-on-docker mongoimport \
        --db cinema \
        --collection movies \
        --jsonArray

And voila! Our app now has hundreds of valid movies in our database!

[Screenshot: the web app populated with generated movies]

Conclusion#

This post was a summary of how you can use Synth to generate realistic looking test data for your Web App. In the next part of this tutorial, we'll explore how we can use Synth to generate relational data, i.e. where you have references between collections in your database.

To check out the Synth source code, you can visit the Synth repo on GitHub, and to join the conversation, hop on the Synth Discord server.