Columnar storage is a proven way to speed up database query performance by greatly reducing I/O during query execution. It’s been a fundamental feature in analytic relational databases such as Vertica, Amazon Redshift, Snowflake, and many others for more than a decade. It’s also a fundamental feature of the PostgreSQL query-acceleration extensions of Swarm64 DA.
In Swarm64 DA 5.0, compressed columnar storage is implemented as a columnstore index, which can easily be applied to any standard PostgreSQL table to speed up queries.Get a free Swarm64 DA developer license
The columnstore index caches the data of the related table in a compressed column-oriented format, thus providing data locality for similar values. A query that uses the columnstore index benefits from more efficient I/O and improved CPU utilization. Both these properties lead to faster query execution.
Why columnstore indexing is important for PostgreSQL
The columnstore index provides a more performant way for PostgreSQL to retrieve data from large tables in queries that use a subset of all the columns in the table.
Columnstore indexing can be very useful if you are doing any of these things with PostgreSQL:
- Migrating applications from Microsoft SQL Server, which also implements columnstore indexes
- Supporting mixed workloads, for example user reporting on an OLTP database or accelerating the occasional table scan
- Improving query and reporting performance, especially at scale, for concurrent users
- Implementing data warehouses and analytical databases on free, open source PostgreSQL
What is columnar storage, and how does it accelerate queries?
Columnar storage is an established DBMS concept, but let’s start with a high-level primer on how columnar storage accelerates querying. Consider the following table containing data about pets:
In row-oriented databases such as PostgreSQL, data pages (or blocks) store the values of an entire row together, sequentially for each column in the row. The rows-to-data pages ratio might be more or less than one, depending on the number and type of columns per row and the data page size setting. For illustration purposes, let’s assume that each data page in our Pets database contains one row in our table:
Row-oriented storage is optimal for OLTP performance, where transactions frequently read and write entire records. However, if you query one column of the row-store table, for example PetType, you see that all columns are read, which means the query wastes a lot of time scanning and filtering irrelevant data.
Column-oriented storage solves this problem by grouping values of the same column (PetType) together on disk:
For that same query on the PetType column, only the values of the relevant column are scanned, which results in fewer reads from storage and faster query results. In this simple case, each page read from storage contains PetType values for five rows instead of one, so I/O would be reduced by 5x.
But wait, there’s less! (compressed columns)
Since columns contain data of the same type, and especially when they contain relatively few unique values, they lend themselves well to compression. This further decreases I/O (and query time) by squeezing even more column values into each data page. In fact, compressed columns are often small enough to remain cached entirely in memory for very fast query performance.
Accelerating TPC-H with PostgreSQL columnstore indexing
When we run a 1-terabyte (SF1000) TPC-H benchmark on PostgreSQL with the Swarm64 DA extensions installed, you can see the effects of the columnstore indexing (along with other PostgreSQL acceleration features in Swarm64 DA, such as increased parallelism). The time required to run the whole benchmark is reduced by 19x with Swarm64 DA, with some queries executing as much as 60x faster:
The following chart shows how much columnstore indexing reduces I/O. As the I/O bars show in the chart below, with columnstore indexes applied, 17x less data is read by PostgreSQL during the benchmark. This reduces scan time by up to 20x in some TPC-H queries.
Creating a columnstore index in PostgreSQL with Swarm64 DA
As mentioned, columnstore indexes can be defined for any PostgreSQL table. Let’s suppose we have the following table definition:
CREATE TABLE orders ( o_orderkey bigint NOT NULL PRIMARY KEY, o_custkey int NOT NULL, o_orderstatus "char" NOT NULL, o_totalprice double precision NOT NULL, o_orderdate date NOT NULL, o_orderpriority character varying(15) NOT NULL, o_clerk character varying(15) NOT NULL, o_shippriority int NOT NULL, o_comment character varying(79) );
A columnstore index should cover your most commonly accessed columns. Let’s suppose we want a columnstore index that covers all columns in the orders table except for an infrequently referenced o_comment column. In this case, we use the following create index statement:
CREATE INDEX orders_columnstore_idx ON orders USING columnstore( o_orderkey , o_custkey , o_orderstatus , o_totalprice , o_orderdate , o_orderpriority , o_clerk , o_shippriority );
The specific order of the columns in the CREATE INDEX statement does not matter. The new columnstore index will initialize as soon as you execute a VACUUM statement or whenever autovacuum next kicks in.
Columnstore indexes work like most any other PostgreSQL index
After the columnstore index is created, the PostgreSQL planner uses it automatically in your queries. Like other PostgreSQL indexes, the columnstore index has full transaction safety, crash-safety, replication support, and it benefits from an often-vacuumed database for optimal performance.
Try it out, and let us know what you think
That’s a high-level overview of columnstore indexing and columnar data storage for PostgreSQL.
In upcoming blog posts we will detail other aspects of columnstore indexing like compression and additional columnstore indexing best practices.
Columnstore indexing is available in Swarm64 DA 5.0. We’d love to hear your feedback. If you’d like to try it out for free, you can request a copy here:Get Swarm64 DA for free