Latest News
Start your free PG Nitrous Trial Get started ›
Come play in the new interactive Swarm64 DA query sandbox start here ›
Unleash PostgreSQL Parallel Query Read more ›

Modern data analysis with PostgreSQL – JSONB throws Window functions out the…

Posted in How to, PostgreSQL

In this blog post I would like to take a look at a classical data layout paradigm and how transactional database features, such as fast UPDATEs and JSON/JSONB types, can make analytics  easier. 

JSON and JSONB, i.e. Binary JSON or – as some call it – “JSON better”, are not new to PostgreSQL. Yet, they are probably still an underutilized feature. Not only do they enable PostgreSQL to act as a pretty impressive document store database, they also allow to rethink some of the traditional data analysis and data warehousing patterns.

This example is based on a customer case. But as the details are confidential, we had to “translate” it slightly to a different problem and context.

The old way: append-only into two tables

Let’s say you are running a big online shop and timely delivery to your customers is very important. You may be interested to analyse how many orders, if any, get lost between being loaded into the delivery vehicle for the last mile transport and arriving at the customer.

Traditionally you would have an order table and you would have a table which contains all order shipment status updates. Let’s call these two tables orders and order_shipment_status. They are joined by order_id in a one-to-many relationship.

Inserting status updates is very straightforward. Note that for history tracking purposes a status update does not actually UPDATE but INSERT a new value:

INSERT INTO order_shipment_status
VALUES (order_id, current_timestamp, 'loaded into van');

Now, when you want to find out if your last status was ‘lost’ and the status immediately prior was ‘loaded into van’ one needs to do a bit of work. This is how I would solve it to keep the query as easy, logical, and legible as possible.

SELECT id order_id FROM ( 
   LAG(id,1) OVER (ORDER BY id, timestamp) prev_id, 
   LEAD(id,1) OVER (ORDER BY id, timestamp) next_id, 
   LAG(status,1) OVER (ORDER BY id, timestamp) prev_status 
   FROM orders, order_shipment_status 
   WHERE = order_shipment_status.order_id 
) AS subquery 
id = prev_id  
AND (next_id <> id OR next_id IS NULL) 
AND prev_status = 'loaded into van' 
AND status = 'lost'; 

In the inner subquery we are working with the Window functions LAG and LEAD. For starters, it is great that PostgreSQL supports advanced SQL like that. But they are not cheap when it comes to resource requirements and query execution time.

So we are creating a subquery that joins orders and order_shipment_status and then, for every id, it picks the id before and after as well as the status and the previous status. This allows us to find if ‘lost’ was the last status ((next_id <> id OR next_id IS NULL)) and ‘loaded into van’ came immediately before (prev_status = ‘loaded into van’ AND status = ‘lost’).

This query can be very resource heavy as it joins two of the largest tables in the database and executes a Window function. Indexes do not help you much here.

The new way: fine grained UPDATEs to a JSONB array

The order_shipment_status table is really a series of shipment updates. Every order can have many updates – but couldn’t this be stored in an array-like structure? Of course it can, and the tool of choice – thanks to its wide variety of functions – would be JSON, or more specifically JSONB. I will not be going into the intricate differences between the two, but the advantage of JSONB is a more structured and more compact storage format behind the scenes.

We could do away with a separate order_shipment_status table, instead we would have a shipment_status column of type JSONB directly in our orders table. INSERTs are now UPDATEs and look like:

UPDATE orders 
   SET shipment_status = shipment_status ||'{"tstamp":current_timestamp,"status":"added"}'::jsonb 
   WHERE id = order_id; 

With every status update we are using the append operator || to add another element to our JSONB array. In contrast to appending rows to a shipment status table, we are appending each status update to an individual JSONB field inside the orders table. Each element has two parameters: ‘tstamp’ and ‘status’.

Pro tip: You may want to set the fillfactor parameter of your orders table below 100. 80 would be a good start. This means 20% of each 8K data block (“heap page”) are left empty and prevents PostgreSQL from having to create new pages all the time when updating the JSONB field:

ALTER TABLE orders SET (fillfactor = 80);

Now let us look at our query again: Show me all order ids where the last status is ‘lost’ and the one before is ‘loaded into van’.

SELECT id order_id 
FROM orders 
WHERE shipment_status->-1->>'status' = 'lost' 
AND shipment_status->-2->>'status' = 'loaded into van'; 

Nice. No JOIN, no Window functions. We are using the backwards iterator of JSON and JSONB: jsonarray->-1 means the last element of the array, -2 the second last etc. ->> directly returns our JSON(B) data as text for easy comparison.

Now, we can make an optimization to above, let’s add the following to the end of the query:

AND shipment_status @> '[{"status":"loaded into van"},{"status":"lost"}]'::jsonb; 

The @> operator checks if specific key value pairs are present in the JSON. At first this may seem redundant. But, adding this condition, we can now use a GIN Index (Generalized Inverted Index) and be sure it gets used. The GIN index is perfect for finding key-value pairs in JSON(B). Kudos to the PostgreSQL community and its wonderfully diverse index types.

CREATE INDEX shipment_status_gin ON orders USING gin (shipment_status); 

Assuming the number of lost parcels is few in comparison to the overall number of shipments – I sure hope for this business that it is – an index is perfect as we save a lot of table scanning.

So our final query looks like:

SELECT id order_id 
FROM orders 
WHERE shipment_status->-1->>'status' = 'lost' 
AND shipment_status->-2->>'status' = 'loaded into van' 
AND shipment_status @> '[{"status":"loaded into van"},{"status":"lost"}]'::jsonb; 

We are using an index to reduce the quantity of data early, avoid a JOIN, a subquery and three Window functions as well as having significantly shortened our query.

Food for thought

Is it time to rethink the good old “append only database” paradigm -only ever adding rows to tables- when it comes to systems of analysis? How about keeping the full history directly in the row where it belongs? When I first saw this JSONB concept at a customer I thought: This is genius. And it was giving them a competitive advantage – being able to answer complex queries like this much faster for their clients’ dashboards.

Do we always have to hoard data in separate tables, keep primary keys stored in both places to join later and do all the heavy lifting when the read queries run? Should we not rethink the pattern and keep an array of updates as exactly that: A JSONB array of updates?

Instead of appending new rows to a separate table we merely update an existing field in the main row. As a result we can easily index the data, save JOINs and avoid Window Functions.

I think this blogpost highlights some of the great features that make PostgreSQL so immensely popular. The next growth area is using its wonderful transactional features more heavily for data warehousing. Enhancing your analytics with features intended for transaction processing – something PostgreSQL is world class at.

PS: There are much more advanced JSON(B) operators and for this first example we are choosing some simplified examples. Let us know @Swarm64 on Twitter if we should dive deeper in a follow up post.