ClickHouse: Refreshing Take on Materialized Views

Working on my Open Source MMP I have been seeing how much of it will work with ClickHouse. Unfortunately combining multiple streaming data sources in ClickHouse was proving difficult as data would semi randomly not join correctly.

This led to a potential fix by using the very recent ClickHouse refreshable materialized views. These refreshable views work much closer to how materialized views work in other databases like PostgreSQL but come with additional benefits like setting the refreshes inside the materialized view logic itself.

Official Documentation:
https://clickhouse.com/docs/en/sql-reference/statements/create/view#refreshable-materialized-view

At the time of writing, refreshable views are new enough that they require setting this to enable support. Since this is needed per session in which a table is created, you will need to use this setting. This shouldn’t be necessary after Q2 2024.

SET allow_experimental_refreshable_materialized_view = 1;

Then to make the view refreshable you just add REFRESH EVERY {INT} {TIME} like this:

CREATE MATERIALIZED VIEW attribute_impressions_mv
REFRESH EVERY 5 SECOND
TO attributed_impressions -- Specify the destination table
AS
WITH
merged_impression_event AS (
    -- Ranked rows by impression time
    SELECT 
            app.event_time AS app_event_time,
            app.store_id AS store_id,
            app.event_id,
            ...

What I then discovered though was that once a refreshable view is used, it will replace the entire contents of the destination table. This means that they likely will need further management in the future as the data grows. Additionally, they cannot be combined with the columnar data inserts I was using previously with AggregatingMergeTree.

So, once you add a refreshable view, all higher level aggregations will likely all have to be refreshable as well.

In the end, I was able to use refreshable views to solve my problem, but I am not certain they wont cause delays later due to the way they seem to refresh across all data. Perhaps I’ll have to find a way to narrow the range they aggregate.