Data Refresh
Refresh Modes​
Spice supports three modes to refresh/update local data from a connected data source. full
is the default mode.
Mode | Description | Example |
---|---|---|
full | Replace/overwrite the entire dataset on each refresh | A table of users |
append | Append/add data to the dataset on each refresh | Append-only, immutable datasets, such as time-series or log data |
changes | Apply incremental changes | Customer order lifecycle table |
Example:
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
acceleration:
refresh_mode: full
refresh_check_interval: 10m
Append​
Using refresh_mode: append
requires the use of a time_column
dataset parameter, specifying a column to compare the local acceleration against the remote source. Data will be incrementally refreshed where the time_column
value in the remote source is greater-than (gt) the max(time_column)
value in the local acceleration.
E.g.
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
time_column: created_at
acceleration:
refresh_mode: append
refresh_check_interval: 10m
If late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See acceleration.refresh_append_overlap
.
Changes (CDC)​
Datasets configured with acceleration refresh_mode: changes
requires a Change Data Capture (CDC) supported data connector. Initial CDC support in Spice is supported by the Debezium data connector.
Ready State​
Supported in refresh_mode | Any |
Required | No |
Default Value | on_load |
By default, Spice will return an error for queries against an accelerated dataset that is still loading its initial data. The endpoint /v1/ready
is used in production deployments to control when queries are sent to the Spice runtime.
The ready state for an accelerated dataset can be configured using the ready_state
parameter in the dataset configuration.
ready_state: on_load
: Default. The dataset is considered ready after the initial load of the accelerated data. For file-based accelerated datasets that have existing data, this will be ready immediately. Queries against this dataset before the data is loaded will return an error.ready_state: on_registration
: The dataset is considered ready when the dataset is registered in Spice, even before the initial data is loaded. Queries against this dataset before the data is loaded will automatically fallback to the federated source. Once the data is loaded, queries will be served from the acceleration.
Example:
datasets:
- from: s3://my_bucket/my_dataset
name: my_dataset
ready_state: on_load # or on_registration
acceleration:
enabled: true
Filtered Refresh​
Typically only a working subset of an entire dataset is used in an application or dashboard. Use these features to filter refresh data, creating a smaller subset for faster processing and to reduce the data transferred and stored locally.
- Refresh SQL - Specify the filter as arbitrary SQL to be pushed down to the remote source.
- Refresh Data Window - Filters data from the remote source outside the specified time window.
Refresh SQL​
Supported in refresh_mode | Any |
Required | No |
Default Value | Unset |
Refresh SQL supports specifying filters for data accelerated from the connected source using arbitrary SQL.
Filters will be pushed down to the remote source when possible, so only the requested data will be transferred over the network.
Example:
datasets:
- from: databricks:my_dataset
name: accelerated_dataset
acceleration:
enabled: true
refresh_mode: full
refresh_check_interval: 10m
refresh_sql: |
SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
The refresh_sql
parameter can be updated at runtime on-demand using PATCH /v1/datasets/:name/acceleration
. This change is temporary and will revert to the spicepod.yml
definition at the next runtime restart.
Columns can be selected in the query via the SELECT
clause, but only column names are supported. Arbitrary expressions or aliases are not supported.
Example:
curl -i -X PATCH \
-H "Content-Type: application/json" \
-d '{
"refresh_sql": "SELECT city, state FROM accelerated_dataset WHERE city = 'Bellevue'"
}' \
127.0.0.1:8090/v1/datasets/accelerated_dataset/acceleration
Queries that return zero results will fallback to the behavior specified by the on_zero_results
parameter, and will not have the refresh_sql
applied to the results from the fallback. The refresh_sql
only applies to acceleration refresh tasks.
For the complete reference, view the refresh_sql
section of datasets.
- When
refresh_mode: changes
is specified, Refresh SQL can only modify the selected columns and cannot apply filters. - Running queries while using refresh SQL will not fallback to the source if any query returns more than zero rows, even when querying against columns that are not explicitly filtered by the refresh SQL. This may result in queries returning partial data, depending on the filters applied in the refresh SQL.
- Refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
- Refresh SQL modifications made via API are temporary and will revert after a runtime restart.