
BigQuery compresses data under the hood for you on an ongoing basis but your queries are still billed as if you are scanning uncompressed bytes. CompressionīigQuery: proprietary compression that is opaque to the user and handled by the ColumnIO columnar format (Colossus under the hood). Data is stored in a hybrid columnar format (PAX) with aggressive metadata caching. Snowflake: Proprietary columnar format, in-memory / SSD / object store running on compute / object storage in your cloud of choice. RA3 nodes include both a hot query cache and an extensive metadata cache. RA3 separates compute and storage, whilst all other node types colocalise your compute and storage together. Redshift: Proprietary but is generally SSD (dc1, dc2) / HDD (ds1, ds2) or mixed including S3 based (for RA3) using a proprietary columnar format. Complete separation of distributed compute and storage. Storage layerĪll 3 databases have implementations of hot / warm / cold storage – this pertains to internal storage rather than external storage (e.g., external tables, federated sources).īigQuery: Proprietary, stored on the Colossus filesystem using ColumnIO as a storage format. Hybrid columnar system inspired by C-Store, MonetDB among others. Snowflake: Proprietary compute engine with intelligent predicate pushdown + smart caching running on commodity virtual machines (AWS, GCP or Azure) depending on your cloud choice. Don’t let the proximity to Postgres fool you, it’s more of a distant second cousin. Redshift: Proprietary fork of ParAccel (which was partly forked from Postgres) running on AWS virtual machines.
%20-%20Redshift.png)
This runs on Borg, a cybernetic life-form that Google has imprisoned inside conveniently located data centers in various regions. Let’s get started! Compute layerīigQuery: Runs on distributed compute. If you feel strongly that I’ve missed or misrepresented something, have made a grievous error or would just like to have a chat – please get in touch. Spoiler alert: the performance differences are marginal between the different providers. Fivetran has also posted a performance comparison (with some caveats that they note). Each provider publishes their own TPC-H / encryption, support for third party tools and TPC-DS benchmarks – have a look for these. That’s a mammoth effort but is something I’d love to do at some point. – This doesn’t cover every possible database you can (or could) use for analytics. All 3 products have unique feature sets but we don’t have enough space to cover all of it (unfortunately). – A recommendation to pick a specific product or a comprehensive evaluation of all features. I’m in the process of adding more sections over time!

This guide covers: compute, storage, compression, deployment, pricing, scalability, data access, encryption, support for third party tools, query language, user defined functions, federated queries, materialised views, caching, streaming, data sources, maintenance and scheduling.
ATHENA VS REDSHIFT UPDATE
I can’t mention any features that are currently under NDA but will update as soon as they become either public.

Where applicable I’ve noted if a feature or functionality is in alpha / beta / in-preview. – I’ve attempted to make the information as accurate as possible, but some details may be condensed for simplicity. – An up to date guide (hopefully with regular updates as new features are released or changed) – Designed as a fair feature comparison between the different products For updates, changes and errata please see the Changelog at the end of this post.
