swebloha.blogg.se - Athena vs redshift

ATHENA VS REDSHIFT UPDATE

BigQuery compresses data under the hood for you on an ongoing basis but your queries are still billed as if you are scanning uncompressed bytes. CompressionīigQuery: proprietary compression that is opaque to the user and handled by the ColumnIO columnar format (Colossus under the hood). Data is stored in a hybrid columnar format (PAX) with aggressive metadata caching. Snowflake: Proprietary columnar format, in-memory / SSD / object store running on compute / object storage in your cloud of choice. RA3 nodes include both a hot query cache and an extensive metadata cache. RA3 separates compute and storage, whilst all other node types colocalise your compute and storage together. Redshift: Proprietary but is generally SSD (dc1, dc2) / HDD (ds1, ds2) or mixed including S3 based (for RA3) using a proprietary columnar format. Complete separation of distributed compute and storage. Storage layerĪll 3 databases have implementations of hot / warm / cold storage – this pertains to internal storage rather than external storage (e.g., external tables, federated sources).īigQuery: Proprietary, stored on the Colossus filesystem using ColumnIO as a storage format. Hybrid columnar system inspired by C-Store, MonetDB among others. Snowflake: Proprietary compute engine with intelligent predicate pushdown + smart caching running on commodity virtual machines (AWS, GCP or Azure) depending on your cloud choice. Don’t let the proximity to Postgres fool you, it’s more of a distant second cousin. Redshift: Proprietary fork of ParAccel (which was partly forked from Postgres) running on AWS virtual machines.

This runs on Borg, a cybernetic life-form that Google has imprisoned inside conveniently located data centers in various regions. Let’s get started! Compute layerīigQuery: Runs on distributed compute. If you feel strongly that I’ve missed or misrepresented something, have made a grievous error or would just like to have a chat – please get in touch. Spoiler alert: the performance differences are marginal between the different providers. Fivetran has also posted a performance comparison (with some caveats that they note). Each provider publishes their own TPC-H / encryption, support for third party tools and TPC-DS benchmarks – have a look for these. That’s a mammoth effort but is something I’d love to do at some point. – This doesn’t cover every possible database you can (or could) use for analytics. All 3 products have unique feature sets but we don’t have enough space to cover all of it (unfortunately). – A recommendation to pick a specific product or a comprehensive evaluation of all features. I’m in the process of adding more sections over time!

This guide covers: compute, storage, compression, deployment, pricing, scalability, data access, encryption, support for third party tools, query language, user defined functions, federated queries, materialised views, caching, streaming, data sources, maintenance and scheduling.

ATHENA VS REDSHIFT UPDATE

I can’t mention any features that are currently under NDA but will update as soon as they become either public.

Where applicable I’ve noted if a feature or functionality is in alpha / beta / in-preview. – I’ve attempted to make the information as accurate as possible, but some details may be condensed for simplicity. – An up to date guide (hopefully with regular updates as new features are released or changed) – Designed as a fair feature comparison between the different products For updates, changes and errata please see the Changelog at the end of this post.