Version: Latest (v1.11)

File Formats

File-based data connectors — including s3://, abfs://, file://, ftp://, sftp://, and others — support multiple structured and document file formats. This page details the format-specific parameters available for each.

Common Parameters

These parameters apply across multiple file formats.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file_format | String | Inferred | Selects the file reader. If omitted, format is inferred from the file extension. See Supported Formats. |
| file_extension | String | Derived | Overrides the file extension filter used when listing files. Defaults to the extension matching the resolved format. |
| schema_infer_max_records | Integer | 1000 | Maximum number of records scanned to infer the schema. |
| file_compression_type | String | UNCOMPRESSED | File-level compression for CSV, TSV, and JSON files. Valid values: GZIP, BZIP2, XZ, ZSTD, UNCOMPRESSED. |
| hive_partitioning_enabled | Boolean | false | Enables Hive-style partition discovery from directory structure. |
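
Hive-style layouts encode partition values as key=value directory names. The sketch below is a minimal Python illustration of the kind of information such discovery reads from a path. It is not Spice's implementation; the function name and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def hive_partitions(path):
    """Collect key=value directory segments from a file path (illustrative only)."""
    partitions = {}
    # Only directory segments are inspected; the file name itself is skipped.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical dataset layout: sales/year=2024/month=01/part-0.parquet
example = hive_partitions("sales/year=2024/month=01/part-0.parquet")
print(example)
```

Values are kept as strings in this sketch; how partition values are typed in practice is not covered here.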

Supported Formats

The file_format parameter accepts these values:

| Value | Reader | Default Extension | Notes |
| --- | --- | --- | --- |
| parquet | Apache Parquet | .parquet | |
| csv | CSV | .csv | Uses csv_* parameters. |
| tsv | TSV (tab-delimited) | .tsv | Uses tsv_* parameters. Delimiter is tab. |
| json | JSON | .json | Uses json_format to control parsing mode. |
| jsonl | JSON Lines | .jsonl | Line-delimited JSON. |

When file_format is omitted, Spice infers the format from the dataset path extension. If the extension does not match one of the values above, a configuration error is returned.
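
The resolution rule can be pictured with a short Python sketch. It assumes an exact, case-sensitive match on the final extension and uses a hypothetical function name and error message; the extension mapping is taken from the Supported Formats table, and Spice's actual behavior may differ in edge cases.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extension -> file_format value, copied from the Supported Formats table.
EXTENSION_TO_FORMAT = {
    ".parquet": "parquet",
    ".csv": "csv",
    ".tsv": "tsv",
    ".json": "json",
    ".jsonl": "jsonl",
}


def resolve_file_format(path: str, file_format: Optional[str] = None) -> str:
    """Return the explicit file_format if given, else infer it from the path extension."""
    if file_format is not None:
        return file_format
    # Assumption of this sketch: only the final suffix is checked, case-sensitively.
    extension = PurePosixPath(path).suffix
    if extension not in EXTENSION_TO_FORMAT:
        raise ValueError(f"cannot infer file_format from extension {extension!r}")
    return EXTENSION_TO_FORMAT[extension]
```

In this sketch an explicit file_format always wins, so a file with an unrecognized extension can still be read by naming the format directly.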

Parquet

Spice reads any Parquet file regardless of the compression codec or data encoding.

Supported compression codecs:

Supported data encodings:

CSV

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| csv_has_header | Boolean | true | Whether the first row contains column headers. |
| csv_quote | Char | " | Character used to quote fields containing special characters. |
| csv_escape | Char | none | Character used to escape special characters within a field. |
| csv_delimiter | Char | , | Character used to separate fields. |
| csv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |
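
These parameters correspond to common CSV dialect options. As an analogy only, the Python standard library's csv module exposes similar settings; the sample data below is hypothetical, and Spice's reader may treat edge cases differently.

```python
import csv
import io

# Hypothetical sample with a non-default delimiter; the quoted field contains the delimiter.
sample = 'id;name\n1;"Doe; Jane"\n2;Smith\n'

reader = csv.reader(
    io.StringIO(sample),
    delimiter=";",    # plays the role of csv_delimiter
    quotechar='"',    # plays the role of csv_quote
    escapechar=None,  # plays the role of csv_escape (none)
)
rows = list(reader)

# With a header row (csv_has_header: true), the first row names the columns.
header, records = rows[0], rows[1:]
print(header)
print(records)
```

The quoted field keeps its embedded delimiter, which is the case the quote character exists to handle.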

TSV

TSV (tab-separated values) is a first-class format. Set file_format: tsv or use a .tsv file extension. The delimiter is always tab and cannot be changed.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| tsv_has_header | Boolean | true | Whether the first row contains column headers. |
| tsv_quote | Char | " | Character used to quote fields containing special characters. |
| tsv_escape | Char | none | Character used to escape special characters within a field. |
| tsv_schema_infer_max_records | Integer | 1000 | Deprecated. Use schema_infer_max_records instead. Maximum records scanned for schema inference. |

JSON

Set file_format: json for JSON files. Use the json_format parameter to select the parsing mode.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| json_format | String | jsonl | Parsing mode. jsonl, ndjson, and ldjson parse line-delimited JSON. array parses a top-level JSON array. |
| flatten_json | Boolean | false | When true, nested JSON objects are flattened with . as a separator (e.g., address.city). |
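
The difference between the two parsing modes can be sketched in Python. This is an illustration of the input shapes each mode expects, not Spice's reader; the function name is hypothetical, and skipping blank lines is an assumption of the sketch.

```python
import json


def parse_records(text, json_format="jsonl"):
    """Parse JSON text into a list of records according to the parsing mode."""
    if json_format in ("jsonl", "ndjson", "ldjson"):
        # Line-delimited: each non-blank line is one standalone JSON value.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if json_format == "array":
        # Array mode: the whole document is a single top-level JSON array.
        records = json.loads(text)
        if not isinstance(records, list):
            raise ValueError("expected a top-level JSON array")
        return records
    raise ValueError(f"unsupported json_format: {json_format!r}")


line_delimited = '{"id": 1}\n{"id": 2}\n'
as_array = '[{"id": 1}, {"id": 2}]'
```

Both inputs describe the same two records; only the framing of the file differs.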

Setting file_format: jsonl uses the DataFusion JSON Lines reader directly, without json_format or flatten_json support.
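
For files read with file_format: json, the effect of flatten_json can be pictured with a minimal Python sketch. It assumes that only nested objects are expanded and that arrays and scalars are left unchanged; the function name and sample record are hypothetical, and this is not Spice's implementation.

```python
def flatten(record, parent="", sep="."):
    """Flatten nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the joined prefix along.
            flat.update(flatten(value, name, sep))
        else:
            # Scalars and arrays are kept as-is (an assumption of this sketch).
            flat[name] = value
    return flat


nested = {"id": 1, "address": {"city": "Oslo", "geo": {"lat": 59.9}}}
flat = flatten(nested)
print(flat)
```

Each nested field becomes one column-like key, such as address.city in the example above.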