GCS Data Connector

The GCS Data Connector enables federated SQL queries on files stored in Google Cloud Storage. Both gcs:// and gs:// URI schemes are accepted.

When the URI points to a folder, all files in that folder are loaded.

File formats are specified using the file_format parameter, as described in File Formats.

datasets:
  - from: gs://my-bucket/taxi_sample.csv
    name: gcs_test
    params:
      gcs_service_account_path: /etc/spice/gcs-key.json
      file_format: csv

Configuration

from

Defines the GCS URI to a folder or object. Both schemes are supported and equivalent:

  • from: gs://<bucket>/<path>
  • from: gcs://<bucket>/<path>

Example: from: gs://my-bucket/path/to/file.parquet

name

Defines the dataset name, which is used as the table name within Spice.

Example:

datasets:
  - from: gs://my-bucket/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv

SELECT COUNT(*) FROM cool_dataset;

+----------+
| count(*) |
+----------+
| 6001215  |
+----------+

The dataset name cannot be a reserved keyword.

params

Basic parameters

| Parameter name | Description |
| --- | --- |
| file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet, csv, json. Refer to File Formats for details. |
| allow_http | Allow insecure HTTP connections. Defaults to false. |
| client_timeout | Optional. Timeout for GCS client operations. |
| hive_partitioning_enabled | Enable partitioning using hive-style partitioning from the folder structure. Defaults to false. |
| schema_source_path | Specifies the URL used to infer the dataset schema. Defaults to the most recently modified file. |
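
As a rough illustration of how the basic parameters combine, the sketch below loads a Hive-partitioned Parquet folder. The bucket name, the folder layout, the schema file path, and the 60s timeout are all assumptions made for the example, and the duration format for client_timeout is assumed to match the one used by the retry parameters (5s, 1m).

```yaml
datasets:
  - from: gs://my-bucket/events/
    name: events
    params:
      file_format: parquet
      # Assumed layout: gs://my-bucket/events/year=2024/month=01/part-0.parquet
      hive_partitioning_enabled: true
      # Hypothetical file to infer the schema from, overriding the default
      # (the most recently modified file).
      schema_source_path: gs://my-bucket/events/year=2024/month=01/part-0.parquet
      # Assumed value and duration format.
      client_timeout: 60s
      gcs_application_default_credentials: true
```

Pinning schema_source_path can be useful when the newest file in a folder is not representative of the rest; leaving it unset keeps the default behavior described in the table.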

Authentication parameters

The following authentication parameters are mutually exclusive, so only one can be set at a time. The runtime fails to start if more than one is specified:

  • gcs_service_account_path
  • gcs_service_account_key
  • gcs_application_default_credentials
  • gcs_skip_signature

If none of these are set, the connector accesses the bucket without explicit credentials. For public buckets, set gcs_skip_signature: true to skip request signing.

| Parameter name | Description |
| --- | --- |
| gcs_service_account_path | Path to a GCS service account JSON key file. |
| gcs_service_account_key | GCS service account JSON key as a string. |
| gcs_application_default_credentials | Set to true to use Google Application Default Credentials. If GOOGLE_APPLICATION_CREDENTIALS is set, that path is used. Defaults to false. |
| gcs_skip_signature | Set to true to skip signing requests. Use for public buckets. |
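
To make the mutual-exclusivity rule concrete, here is a minimal sketch with exactly one authentication parameter set. The commented-out line shows the kind of second method that, per the rule above, would stop the runtime from starting. The bucket and key path are placeholders.

```yaml
datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      file_format: parquet
      # Exactly one authentication parameter is set.
      gcs_application_default_credentials: true
      # Uncommenting a second method, such as the line below, would
      # violate the rule and prevent the runtime from starting:
      # gcs_service_account_path: /etc/spice/gcs-key.json
```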

Retry parameters

| Parameter name | Description |
| --- | --- |
| gcs_max_retries | Maximum number of retries. Defaults to 3. |
| gcs_retry_timeout | Total timeout for retries (e.g., 5s, 1m). |
| gcs_backoff_initial_duration | Initial retry delay (e.g., 5s). |
| gcs_backoff_max_duration | Maximum retry delay (e.g., 1m). |
| gcs_backoff_base | Exponential backoff base (e.g., 0.1). |
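
A minimal sketch of a dataset with the retry parameters set, assuming the duration strings from the table are accepted as written. The specific values are illustrative only, and the gcs_backoff_base of 2 is an assumption chosen for the example rather than a documented default.

```yaml
datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      file_format: parquet
      gcs_application_default_credentials: true
      # Give up after 3 retries (the documented default).
      gcs_max_retries: 3
      # Upper bound on the total time spent retrying.
      gcs_retry_timeout: 1m
      # Delay before the first retry, and the cap on any single delay.
      gcs_backoff_initial_duration: 5s
      gcs_backoff_max_duration: 1m
      # Assumed value: growth factor applied to the delay between retries.
      gcs_backoff_base: 2
```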

Authentication

The GCS connector supports four mutually exclusive authentication modes, as detailed in Authentication parameters.

Service account JSON file

Configure a service account by setting gcs_service_account_path to the file path of a downloaded service account JSON key:

datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      gcs_service_account_path: /etc/spice/gcs-key.json
      file_format: parquet

To create the key file, follow the Google Cloud documentation for service account keys and grant the service account roles/storage.objectViewer (or higher) on the bucket via the Cloud Storage IAM settings.

Service account JSON content

When mounting a key file is not practical (e.g., when the key is held in a secret store), pass the JSON contents directly via gcs_service_account_key:

datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      gcs_service_account_key: ${secrets:GCS_SERVICE_ACCOUNT_JSON}
      file_format: parquet

The value should be the full JSON key as a single string, ideally provided through a supported secret store.

Application Default Credentials (ADC)

To use Application Default Credentials, set gcs_application_default_credentials: true. This applies, for example, when running inside Google Cloud with attached service accounts (GKE Workload Identity, Compute Engine metadata, etc.) or when using gcloud auth application-default login locally:

datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      gcs_application_default_credentials: true
      file_format: parquet

If the GOOGLE_APPLICATION_CREDENTIALS environment variable is set to a service account JSON key path, that file is used. Otherwise, the ADC chain searches the well-known locations described in the Google Cloud documentation.

Public buckets

For unauthenticated access to a public bucket, set gcs_skip_signature: true:

datasets:
  - from: gs://public-bucket/data/
    name: public_data
    params:
      gcs_skip_signature: true
      file_format: parquet

Supported file formats

Specify the file format using the file_format parameter. See File Formats for details.

Examples

Reading a Parquet folder with a service account key file

datasets:
  - from: gs://my-bucket/trips/2024/
    name: taxi_trips
    params:
      gcs_service_account_path: /etc/spice/gcs-key.json
      file_format: parquet

Reading a CSV file with the service account JSON inlined from a secret

datasets:
  - from: gs://my-bucket/taxi_sample.csv
    name: taxi_sample
    params:
      gcs_service_account_key: ${secrets:GCS_SERVICE_ACCOUNT_JSON}
      file_format: csv

Reading from a public bucket

datasets:
  - from: gs://public-bucket/sample.parquet
    name: sample
    params:
      gcs_skip_signature: true
      file_format: parquet

Hive-partitioned dataset

datasets:
  - from: gs://my-bucket/events/
    name: events
    params:
      gcs_application_default_credentials: true
      file_format: parquet
      hive_partitioning_enabled: true

Secrets

Spice integrates with multiple secret stores to help manage sensitive data securely. For detailed information on supported secret stores, refer to the secret stores documentation.

gcs_service_account_path and gcs_service_account_key are marked as secrets and can be supplied through any supported secret store using the ${secrets:KEY} replacement syntax.
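
As a hedged sketch of how this might fit together, the example below assumes an environment-variable secret store and a hypothetical secret named GCS_SERVICE_ACCOUNT_JSON that holds the full JSON key. The shape of the secrets section is an assumption here; the secret stores documentation is the authority on its exact syntax.

```yaml
# Assumed secret store declaration: read secrets from environment variables.
secrets:
  - from: env
    name: env

datasets:
  - from: gs://my-bucket/data/
    name: my_data
    params:
      # Resolved at startup from the hypothetical GCS_SERVICE_ACCOUNT_JSON secret.
      gcs_service_account_key: ${secrets:GCS_SERVICE_ACCOUNT_JSON}
      file_format: parquet
```

Keeping the key in a secret store rather than in the spicepod itself avoids committing credentials to version control.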