Powering up Postgres for Open Source Data Warehousing & Analytics

Swarm64 CEO and co-founder, Thomas Richter, was recently interviewed by Tobias Macey on the Data Engineering Podcast.

They dove deeply into PostgreSQL and how Swarm64 DA extends it to handle data warehousing and analytics projects, thus giving people a fast, open source alternative to commercial data warehouse databases like Oracle, SQL Server, Amazon Redshift, and others.

Here are a few highlights from the interview:

How did Swarm64 get started?

Thomas: “We started Swarm64 to help free, open source PostgreSQL become a choice to handle query-intensive analytic and data warehousing workloads that previously, only proprietary databases could handle. PostgreSQL already has a fantastic feature set, and we lift its query performance to a level that’s competitive with expensive proprietary databases like Oracle, Netezza, and SQL Server.“

What are some of the bottlenecks that prevent Postgres from being used more widely in data warehousing and analytics?

Thomas: “Postgres has been around for over 30 years, and it’s a really mature and powerful product. It’s extremely versatile, but it usually struggles as the quantity of data grows into the terabytes range. We see four areas that Swarm64 improves because otherwise they challenge PostgreSQL’s ability to handle larger scale analytics and reporting projects:

Number one, parallelism, which was added 20 years into PostgreSQL’s life, at version 9.6. So it’s still relatively immature, especially when you compare it to commercial databases which master MPP (massively parallel processing) on systems with dozens of CPU cores. PostgreSQL has made a great start here, but it is really not going as far as we believe it should. That’s why we extend it with query plans and execution that utilize multicore systems much more effectively and keep processing parallelized throughout the entire duration of the query.

Two, scanning large amounts of data. I mean, you could index your data, but traditional indexes help mainly with look-up queries, not queries that scan ranges of data. Plus, it’s extra work and extra storage space required to index. We believe PostgreSQL should just scan faster so you don’t have to index and tune so much to get adequate performance.

Three, complex concurrent queries, which really make your bottlenecks visible. When you run multiple queries concurrently, you’ll see your individual PostgreSQL workers are kind of scrambling and competing with each other for I/O. It’s not uncommon for query times to increase by 5x if you double the number of clients executing them. That’s a problem if you’re trying to grow your user base.

And finally, difficult query patterns. This isn’t unique to PostgreSQL. Certain query patterns are difficult and slow, or even impossible—what we call “never come back queries.” You see lots of community debate on how to rewrite queries for better performance. But sometimes you can’t rewrite them, like SQL within an application that you cannot change. And maybe you’d rather not spend the time rewriting; you really want the DBMS to be more forgiving and smarter about how it plans and executes your query in the first place.“

Listen to the podcast

Does increasing parallelism shift the bottlenecks to disk I/O in terms of the overall throughput of the database?

Thomas: “Yes, the thing that happens when you move to MPP, is you run into a bottleneck caused by ‘how fast can you fetch the data?’ We reduced the I/O by creating our own, compressed, hybrid row-columnar storage format stored in foreign tables that moves data in and out via larger pages than PostgreSQL uses. More recently, we achieve this through new column-store indexes that act as a cache on native PostgreSQL tables, which is an alternative to the older foreign table method of columnar storage in Swarm64 DA.”

Beta test PostgreSQL Column-Store Indexing…

Can you talk through some of the details about creating an extension rather than a fork for PostgreSQL?

Thomas: PostgreSQL has been used as the basis for a lot of data warehousing databases like Amazon Redshift, IBM Netezza, and Pivotal Greenplum. All of them forked Postgres, and effectively forced PostgreSQL users to give up using free, open source PostgreSQL for analytic projects.

We made a conscious choice to deliver Swarm64 as an extension to PostgreSQL instead of building our own fork. PostgreSQL is one of the most extensible databases there is. We’ve found that extending PostgreSQL through its standard APIs makes it very easy for us to support different PostgreSQL-compatible tooling and also different versions of PostgreSQL—even commercial versions like EnterpriseDB Postgres.

Now let me explain a few of the things our extensions do. Mainly, we help PostgreSQL query planning and execution. We have functions that give the query planner more and better options—like a buffet. The query planner will choose the best option for the query.

Swarm64 also helps with query execution. Besides reducing I/O with columnar access, we provide ways for the query to stay parallel for longer. We also offer our own join implementation, specifically optimized for joining very large amounts of data, like if you want to join tables that have a few billion rows with tables that have a few million, or even a few billion rows themselves.

So, we constantly look for better ways to handle different query patterns and build those into each new release of Swarm64.”

When is Swarm64 not a good choice, and someone might be better suited using vanilla PostgreSQL?

Thomas: “If your problem is small, or your systems small, I think we’re probably not the right choice. Parallelism becomes a little pointless on 4- or 8-core servers, because there aren’t that many cores to go around in the first place. Swarm64 tends to be more relevant starting at 100s of gigabytes of data, on up well into the terabytes. That’s definitely a range where we feel very, very comfortable.”

What keeps you motivated to continue investing energy in Swarm64?

Thomas: “Our work at Swarm64 is really engaging because it’s a little bit like playing Rocky Balboa. You’re like the small guy, and you’re going in, and you’re fighting to win the title against some of those really heavyweight champions, and I find that quite rewarding. It’s a big challenge, but that’s where the fun is as well.”

Try Swarm64 DA for free