w.statement_execution: Statement Execution

class databricks.sdk.service.sql.StatementExecutionAPI

The Databricks SQL Statement Execution API can be used to execute SQL statements on a SQL warehouse and fetch the result.

Getting started

We suggest beginning with the [Databricks SQL Statement Execution API tutorial].

Overview of statement execution and result fetching

Statement execution begins by issuing a :method:statementexecution/executeStatement request with a valid SQL statement and warehouse ID, along with optional parameters such as the data catalog and output format. If no other parameters are specified, the server will wait for up to 10s before returning a response. If the statement has completed within this timespan, the response will include the result data as a JSON array and metadata. Otherwise, if no result is available after the 10s timeout expired, the response will provide the statement ID that can be used to poll for results by using a :method:statementexecution/getStatement request.

You can specify whether the call should behave synchronously, asynchronously or start synchronously with a fallback to asynchronous execution. This is controlled with the wait_timeout and on_wait_timeout settings. If wait_timeout is set between 5-50 seconds (default: 10s), the call waits for results up to the specified timeout; when set to 0s, the call is asynchronous and responds immediately with a statement ID. The on_wait_timeout setting specifies what should happen when the timeout is reached while the statement execution has not yet finished. This can be set to either CONTINUE, to fallback to asynchronous mode, or it can be set to CANCEL, which cancels the statement.

In summary: - Synchronous mode (wait_timeout=30s and on_wait_timeout=CANCEL): The call waits up to 30 seconds; if the statement execution finishes within this time, the result data is returned directly in the response. If the execution takes longer than 30 seconds, the execution is canceled and the call returns with a CANCELED state. - Asynchronous mode (wait_timeout=0s and on_wait_timeout is ignored): The call doesn’t wait for the statement to finish but returns directly with a statement ID. The status of the statement execution can be polled by issuing :method:statementexecution/getStatement with the statement ID. Once the execution has succeeded, this call also returns the result and metadata in the response. - [Default] Hybrid mode (wait_timeout=10s and on_wait_timeout=CONTINUE): The call waits for up to 10 seconds; if the statement execution finishes within this time, the result data is returned directly in the response. If the execution takes longer than 10 seconds, a statement ID is returned. The statement ID can be used to fetch status and results in the same way as in the asynchronous mode.

Depending on the size, the result can be split into multiple chunks. If the statement execution is successful, the statement response contains a manifest and the first chunk of the result. The manifest contains schema information and provides metadata for each chunk in the result. Result chunks can be retrieved by index with :method:statementexecution/getStatementResultChunkN which may be called in any order and in parallel. For sequential fetching, each chunk, apart from the last, also contains a next_chunk_index and next_chunk_internal_link that point to the next chunk.

A statement can be canceled with :method:statementexecution/cancelExecution.

Fetching result data: format and disposition

To specify the format of the result data, use the format field, which can be set to one of the following options: JSON_ARRAY (JSON), ARROW_STREAM ([Apache Arrow Columnar]), or CSV.

There are two ways to receive statement results, controlled by the disposition setting, which can be either INLINE or EXTERNAL_LINKS:

  • INLINE: In this mode, the result data is directly included in the response. It’s best suited for

smaller results. This mode can only be used with the JSON_ARRAY format.

  • EXTERNAL_LINKS: In this mode, the response provides links that can be used to download the result data

in chunks separately. This approach is ideal for larger results and offers higher throughput. This mode can be used with all the formats: JSON_ARRAY, ARROW_STREAM, and CSV.

By default, the API uses format=JSON_ARRAY and disposition=INLINE.

Limits and limitations

Note: The byte limit for INLINE disposition is based on internal storage metrics and will not exactly match the byte count of the actual payload.

  • Statements with disposition=INLINE are limited to 25 MiB and will fail when this limit is exceeded. -

Statements with disposition=EXTERNAL_LINKS are limited to 100 GiB. Result sets larger than this limit will be truncated. Truncation is indicated by the truncated field in the result manifest. - The maximum query text size is 16 MiB. - Cancelation might silently fail. A successful response from a cancel request indicates that the cancel request was successfully received and sent to the processing engine. However, an outstanding statement might have already completed execution when the cancel request arrives. Polling for status until a terminal state is reached is a reliable way to determine the final state. - Wait timeouts are approximate, occur server-side, and cannot account for things such as caller delays and network latency from caller to service. - To guarantee that the statement is kept alive, you must poll at least once every 15 minutes. - The results are only available for one hour after success; polling does not extend this. - The SQL Execution API must be used for the entire lifecycle of the statement. For example, you cannot use the Jobs API to execute the command, and then the SQL Execution API to cancel it.

[Apache Arrow Columnar]: https://arrow.apache.org/overview/ [Databricks SQL Statement Execution API tutorial]: https://docs.databricks.com/sql/api/sql-execution-tutorial.html

cancel_execution(statement_id: str)

Requests that an executing statement be canceled. Callers must poll for status to see the terminal state. Cancel response is empty; receiving response indicates successful receipt.

Parameters:

statement_id – str The statement ID is returned upon successfully submitting a SQL statement, and is a required reference for all subsequent calls.

execute_statement(statement: str, warehouse_id: str [, byte_limit: Optional[int], catalog: Optional[str], disposition: Optional[Disposition], format: Optional[Format], on_wait_timeout: Optional[ExecuteStatementRequestOnWaitTimeout], parameters: Optional[List[StatementParameterListItem]], query_tags: Optional[List[QueryTag]], row_limit: Optional[int], schema: Optional[str], wait_timeout: Optional[str]]) StatementResponse

Execute a SQL statement and optionally await its results for a specified time.

Use case: small result sets with INLINE + JSON_ARRAY

For flows that generate small and predictable result sets (<= 25 MiB), INLINE responses of JSON_ARRAY result data are typically the simplest way to execute and fetch result data.

Use case: large result sets with EXTERNAL_LINKS

Using EXTERNAL_LINKS to fetch result data allows you to fetch large result sets efficiently. The main differences from using INLINE disposition are that the result data is accessed with URLs, and that there are 3 supported formats: JSON_ARRAY, ARROW_STREAM and CSV compared to only JSON_ARRAY with INLINE.

** URLs**

External links point to data stored within your workspace’s internal storage, in the form of a URL. The URLs are valid for only a short period, <= 15 minutes. Alongside each external_link is an expiration field indicating the time at which the URL is no longer valid. In EXTERNAL_LINKS mode, chunks can be resolved and fetched multiple times and in parallel.


### Warning: Databricks strongly recommends that you protect the URLs that are returned by the `EXTERNAL_LINKS` disposition.

When you use the EXTERNAL_LINKS disposition, a short-lived, URL is generated, which can be used to download the results directly from . As a short-lived is embedded in this URL, you should protect the URL.

Because URLs are already generated with embedded temporary s, you must not set an Authorization header in the download requests.

The EXTERNAL_LINKS disposition can be disabled upon request by creating a support case.

See also [Security best practices].


StatementResponse contains statement_id and status; other fields might be absent or present depending on context. If the SQL warehouse fails to execute the provided statement, a 200 response is returned with status.state set to FAILED (in contrast to a failure when accepting the request, which results in a non-200 response). Details of the error can be found at status.error in case of execution failures.

[Security best practices]: https://docs.databricks.com/sql/admin/sql-execution-tutorial.html#security-best-practices

Parameters:
  • statement – str The SQL statement to execute. The statement can optionally be parameterized, see parameters. The maximum query text size is 16 MiB.

  • warehouse_id

    str Warehouse upon which to execute a statement. See also [What are SQL warehouses?]

    [What are SQL warehouses?]: https://docs.databricks.com/sql/admin/warehouse-type.html

  • byte_limit – int (optional) Applies the given byte limit to the statement’s result size. Byte counts are based on internal data representations and might not match the final size in the requested format. If the result was truncated due to the byte limit, then truncated in the response is set to true. When using EXTERNAL_LINKS disposition, a default byte_limit of 100 GiB is applied if byte_limit is not explicitly set.

  • catalog

    str (optional) Sets default catalog for statement execution, similar to [USE CATALOG] in SQL.

    [USE CATALOG]: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-use-catalog.html

  • disposition

    Disposition (optional) The fetch disposition provides two modes of fetching results: INLINE and EXTERNAL_LINKS.

    Statements executed with INLINE disposition will return result data inline, in JSON_ARRAY format, in a series of chunks. If a given statement produces a result set with a size larger than 25 MiB, that statement execution is aborted, and no result set will be available.

    NOTE Byte limits are computed based upon internal representations of the result set data, and might not match the sizes visible in JSON responses.

    Statements executed with EXTERNAL_LINKS disposition will return result data as external links: URLs that point to cloud storage internal to the workspace. Using EXTERNAL_LINKS disposition allows statements to generate arbitrarily sized result sets for fetching up to 100 GiB. The resulting links have two important properties:

    1. They point to resources _external_ to the Databricks compute; therefore any associated authentication information (typically a personal access token, OAuth token, or similar) _must be removed_ when fetching from these links.

    2. These are URLs with a specific expiration, indicated in the response. The behavior when attempting to use an expired link is cloud specific.

  • format

    Format (optional) Statement execution supports three result formats: JSON_ARRAY (default), ARROW_STREAM, and CSV.

    Important: The formats ARROW_STREAM and CSV are supported only with EXTERNAL_LINKS disposition. JSON_ARRAY is supported in INLINE and EXTERNAL_LINKS disposition.

    When specifying format=JSON_ARRAY, result data will be formatted as an array of arrays of values, where each value is either the string representation of a value, or null. For example, the output of SELECT concat(‘id-’, id) AS strCol, id AS intCol, null AS nullCol FROM range(3) would look like this:

    ` [ [ "id-1", "1", null ], [ "id-2", "2", null ], [ "id-3", "3", null ], ] `

    When specifying format=JSON_ARRAY and disposition=EXTERNAL_LINKS, each chunk in the result contains compact JSON with no indentation or extra whitespace.

    When specifying format=ARROW_STREAM and disposition=EXTERNAL_LINKS, each chunk in the result will be formatted as Apache Arrow Stream. See the [Apache Arrow streaming format].

    When specifying format=CSV and disposition=EXTERNAL_LINKS, each chunk in the result will be a CSV according to [RFC 4180] standard. All the columns values will have string representation similar to the JSON_ARRAY format, and null values will be encoded as “null”. Only the first chunk in the result would contain a header row with column names. For example, the output of SELECT concat(‘id-’, id) AS strCol, id AS intCol, null as nullCol FROM range(3) would look like this:

    ` strCol,intCol,nullCol id-1,1,null id-2,2,null id-3,3,null `

    [Apache Arrow streaming format]: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format [RFC 4180]: https://www.rfc-editor.org/rfc/rfc4180

  • on_wait_timeoutExecuteStatementRequestOnWaitTimeout (optional) When wait_timeout > 0s, the call will block up to the specified time. If the statement execution doesn’t finish within this time, on_wait_timeout determines whether the execution should continue or be canceled. When set to CONTINUE, the statement execution continues asynchronously and the call returns a statement ID which can be used for polling with :method:statementexecution/getStatement. When set to CANCEL, the statement execution is canceled and the call returns with a CANCELED state.

  • parameters

    List[StatementParameterListItem] (optional) A list of parameters to pass into a SQL statement containing parameter markers. A parameter consists of a name, a value, and optionally a type. To represent a NULL value, the value field may be omitted or set to null explicitly. If the type field is omitted, the value is interpreted as a string.

    If the type is given, parameters will be checked for type correctness according to the given type. A value is correct if the provided string can be converted to the requested type using the cast function. The exact semantics are described in the section [cast function] of the SQL language reference.

    For example, the following statement contains two parameters, my_name and my_date:

    ` SELECT * FROM my_table WHERE name = :my_name AND date = :my_date `

    The parameters can be passed in the request body as follows:

    ` { …, “statement”: “SELECT * FROM my_table WHERE name = :my_name AND date = :my_date”, “parameters”: [ { “name”: “my_name”, “value”: “the name” }, { “name”: “my_date”, “value”: “2020-01-01”, “type”: “DATE” } ] } `

    Currently, positional parameters denoted by a ? marker are not supported by the Databricks SQL Statement Execution API.

    Also see the section [Parameter markers] of the SQL language reference.

    [Parameter markers]: https://docs.databricks.com/sql/language-manual/sql-ref-parameter-marker.html [cast function]: https://docs.databricks.com/sql/language-manual/functions/cast.html

  • query_tags

    List[QueryTag] (optional) An array of query tags to annotate a SQL statement. A query tag consists of a non-empty key and, optionally, a value. To represent a NULL value, either omit the value field or manually set it to null or white space. Refer to the SQL language reference for the format specification of query tags. There’s no significance to the order of tags. Only one value per key will be recorded. A sequence in excess of 20 query tags will be coerced to 20. Example:

    { …, “query_tags”: [ { “key”: “team”, “value”: “eng” }, { “key”: “some key only tag” } ] }

  • row_limit – int (optional) Applies the given row limit to the statement’s result set, but unlike the LIMIT clause in SQL, it also sets the truncated field in the response to indicate whether the result was trimmed due to the limit or not.

  • schema

    str (optional) Sets default schema for statement execution, similar to [USE SCHEMA] in SQL.

    [USE SCHEMA]: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-use-schema.html

  • wait_timeout

    str (optional) The time in seconds the call will wait for the statement’s result set as Ns, where N can be set to 0 or to a value between 5 and 50.

    When set to 0s, the statement will execute in asynchronous mode and the call will not wait for the execution to finish. In this case, the call returns directly with PENDING state and a statement ID which can be used for polling with :method:statementexecution/getStatement.

    When set between 5 and 50 seconds, the call will behave synchronously up to this timeout and wait for the statement execution to finish. If the execution finishes within this time, the call returns immediately with a manifest and result data (or a FAILED state in case of an execution error). If the statement takes longer to execute, on_wait_timeout determines what should happen after the timeout is reached.

Returns:

StatementResponse

get_statement(statement_id: str) StatementResponse

This request can be used to poll for the statement’s status. StatementResponse contains statement_id and status; other fields might be absent or present depending on context. When the status.state field is SUCCEEDED it will also return the result manifest and the first chunk of the result data. When the statement is in the terminal states CANCELED, CLOSED or FAILED, it returns HTTP 200 with the state set. After at least 12 hours in terminal state, the statement is removed from the warehouse and further calls will receive an HTTP 404 response.

NOTE This call currently might take up to 5 seconds to get the latest status and result.

Parameters:

statement_id – str The statement ID is returned upon successfully submitting a SQL statement, and is a required reference for all subsequent calls.

Returns:

StatementResponse

get_statement_result_chunk_n(statement_id: str, chunk_index: int) ResultData

After the statement execution has SUCCEEDED, this request can be used to fetch any chunk by index. Whereas the first chunk with chunk_index=0 is typically fetched with :method:statementexecution/executeStatement or :method:statementexecution/getStatement, this request can be used to fetch subsequent chunks. The response structure is identical to the nested result element described in the :method:statementexecution/getStatement request, and similarly includes the next_chunk_index and next_chunk_internal_link fields for simple iteration through the result set. Depending on disposition, the response returns chunks of data either inline, or as links.

Parameters:
  • statement_id – str The statement ID is returned upon successfully submitting a SQL statement, and is a required reference for all subsequent calls.

  • chunk_index – int

Returns:

ResultData