Operations
Full Example
With the TQL API, users can query a dataset, more precisely the records of a dataset together with their cells and their keys.
The following code sets up a TQL environment and executes a simple query:
final DatasetReader<String> reader = dataset.reader(); // 1
TqlEnvironment env = new TqlEnvironment(reader, // 2
"Comedians", // 3
CellDefinition.defineString("FirstName"), // 4
CellDefinition.defineString("LastName"));
try (ResultStream resultStream =
env.query("SELECT * FROM Comedians").stream()) { // 5
resultStream.forEach(System.out::println); // 6
}
1 | A DatasetReader is required to read the contents from a Dataset. |
2 | A TqlEnvironment is created, which takes as parameter the DatasetReader, |
3 | A String alias for referencing the Dataset in a query, |
4 | and var-args of CellDefinition, which defines the input schema used for processing the Record instances of the Dataset. Another variant uses a Set of cell definitions. |
5 | The query method is used to submit a TQL query string. Within that query string the alias of the TqlEnvironment instance is used to refer to the Dataset. The query takes all cell definitions into account that are defined in the input schema. If the query is valid, the query method returns a Result instance. The Result instance provides a stream method which delivers a ResultStream instance. |
6 | As ResultStream extends the java.util.stream.Stream interface, the results can be consumed in a streaming fashion, in this example by printing all results as terminal operation. The results themselves are instances of CellCollection, which is a collection of Cell instances. |
Setup of Environment
As illustrated in the previous example, a TqlEnvironment instance is based on a DatasetReader, an alias, and a set of CellDefinition instances. Note that TqlEnvironment does not have an explicit lifecycle; it depends on the lifecycle of the Dataset and its associated DatasetManager.
Read Access
In order to run queries, the DatasetReader must be pointing to a Dataset that has not been closed. If the Dataset or the associated DatasetManager have been closed, query execution will fail with an exception.
Alias for Dataset
The mandatory alias for the Dataset defines the name under which the Dataset is referenced in a query. The alias must be a non-empty string; if it contains invalid characters, e.g. "name with space" or "?Id", an exception is thrown.
Input Schema
A TqlEnvironment uses either an array or a set of CellDefinition instances to define the schema, i.e. the structured view of the data that is used for querying a Dataset.
Consumption of Records
A query can only refer to the cell definitions of that schema, i.e., cells whose definitions are not included in the schema cannot be queried. Each Record of a Dataset is processed with respect to the defined input schema: for each cell definition of the input schema, it is checked whether the record contains a corresponding cell. If so, the cell value is used in further query processing; if not, NULL is used. For example, let the input schema be "Name" of type String, "Age" of type Integer, and "Weight" of type Double. Then the dataset entry [(Name,String,"Moe"), (Age,Integer,42)] is internally processed as ["Moe",42,NULL]. NULL is also used if the name of a cell definition in the input schema matches the name of an existing cell only case-insensitively, because cell definition names are handled case-sensitively.
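The projection described above can be sketched in plain Java, independent of the TQL API (the class and method names here are illustrative only, not part of the API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SchemaProjection {

    // Projects a record onto an input schema: for each schema column, the
    // cell value is taken if a cell with that exact (case-sensitive) name
    // exists, otherwise null is used.
    static List<Object> project(List<String> schema, Map<String, Object> record) {
        List<Object> row = new ArrayList<>();
        for (String column : schema) {
            row.add(record.get(column)); // null if no cell with that exact name
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("Name", "Age", "Weight");
        // A record without a "Weight" cell:
        Map<String, Object> record = Map.of("Name", "Moe", "Age", 42);
        System.out.println(project(schema, record)); // [Moe, 42, null]
    }
}
```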
Constraints on Cell Definitions
The names of the cell definitions in the input schema must be unique. The name "key" is reserved; it is used for accessing the key of a record. If two or more definitions share the same name, the input schema is rejected, as TQL requires unique columns. The schema is also rejected if names are equal only case-insensitively, e.g. "Age" and "age".
To deal with such ambiguous names, the API offers a manual and an automatic resolution approach.
Automatic Resolution of Ambiguities
The automatic resolution approach resolves ambiguities by introducing aliases for conflicting definitions. The method resolveAmbiguousNames can be applied to an array or a set of cell definitions, returning a set of cell definitions with new alias names. The alias name itself is built by appending "_" and the type in uppercase to the name. If the automatic resolution step introduces new ambiguities, an exception is thrown and the ambiguities have to be resolved manually.
The following example illustrates this approach:
CellDefinition<Integer> ageInt = CellDefinition.defineInt("Age"); // 1
CellDefinition<String> ageString = CellDefinition.defineString("Age");
TqlEnvironment env = new TqlEnvironment(reader,
"Comedians",
TqlEnvironment.resolveAmbiguousNames(ageInt, ageString)); // 2
1 | The cell definitions are ambiguous as they have the same name. |
2 | The method resolveAmbiguousNames automatically resolves ambiguities by appending the type to the name of ambiguous cell definitions. In this example the generated aliases would be "Age_INT" and "Age_STRING". |
Manual Resolution of Ambiguities
The manual approach is to introduce an alias with a unique name for a cell definition. The method as maps an existing cell definition into a new cell definition under an alias.
The following example illustrates this approach:
CellDefinition<Integer> ageInt = CellDefinition.defineInt("Age"); // 1
CellDefinition<String> ageString = CellDefinition.defineString("Age");
TqlEnvironment env = new TqlEnvironment(reader,
"Comedians",
TqlEnvironment.as("Age_Resolved", ageString), ageInt); // 2
1 | The cell definitions are ambiguous as they have the same name. |
2 | The method as introduces an alias for a CellDefinition. The alias name must be unique. In this example, the alias is named "Age_Resolved", which no longer conflicts with the other cell definition named "Age". |
The aliases, either introduced by the manual or the automatic approach, are then used for further query processing.
Sampling of Cell Definitions
When setting up a TqlEnvironment, the contents of the dataset and its structure may be unknown. In order to get an impression of the data, sampling can be used: drawing a sample of records from the dataset and inspecting their cell definitions provides a reasonable starting point for understanding the structure of the data. For that purpose, TqlEnvironment provides the method sampleCellDefinitions. This method takes as parameters a DatasetReader and the sample size, which must be greater than or equal to zero; if zero, the complete Dataset is used as the sample. Given such a sample, the union of all cell definitions of the records in the sample is determined and returned as a set.
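A minimal sketch of such a sampling step (assuming the reader variable from the earlier examples; the sample size of 100 is arbitrary):

```java
// Draw a sample of at most 100 records and inspect their cell definitions.
Set<CellDefinition<?>> sampled =
    TqlEnvironment.sampleCellDefinitions(reader, 100);
sampled.forEach(cd -> System.out.println(cd.name() + ": " + cd.type()));
```

After reviewing the sampled definitions, a suitable subset can be passed to the TqlEnvironment constructor.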
Note: Sampling should only be used to get a first understanding of the dataset structure. As the contents of the dataset may change dynamically, the sampled cell definitions may change as well. Also, the set of cell definitions returned from sampling may contain ambiguities, i.e., definitions with the same name but different types, or definitions whose names are equal only case-insensitively. As a consequence, the sample should not be used directly as the input schema, but should be inspected beforehand.
Inclusion of Record Key
Each record of a dataset consists of a key and a set of cells. By default, querying the key is not supported. To query the key, the method TqlEnvironment.includeKeys has to be called on a TqlEnvironment instance. Then the key can be used like every other cell definition of the input schema; the name of the corresponding column is "key".
The following example shows how to include the record key in query processing:
TqlEnvironment env = new TqlEnvironment(reader,
"Comedians",
CellDefinition.defineString("FirstName"),
CellDefinition.defineString("LastName"))
.includeKeys(); // 1
Result result = env.query("SELECT key FROM Comedians"); // 2
1 | Given a TqlEnvironment instance, the key can be included by calling includeKeys in a fluent style. |
2 | Now a query can access the key of a record, using the "key" column. |
Note: If efficient key access is required for an application, the key should be additionally included in the value part of the record with an appropriate indexing setup.
Once the TqlEnvironment has been set up completely, it can be used to run an arbitrary number of queries. As the TqlEnvironment does not maintain any mutable state, it can also be used concurrently.
Querying of a Dataset
The method query of a TqlEnvironment instance takes as input the query string, which is based on the SQL-like semantics of TQL API. Within that query, all cell definitions of the input schema can be accessed, analogously to columns in a table being accessed in a SQL query. Note that within the TQL query, all identifiers for columns, functions, etc. are handled case-insensitively.
The TQL API is targeted at reading from a Dataset and analyzing its contents. Therefore, read-only SQL operations are allowed, while data creation and manipulation operations such as CREATE TABLE or UPDATE are not. The query interface offers common SQL operations like filtering, projection, aggregation, or grouping. A query can only operate on one dataset; it cannot operate on multiple datasets using joins or other n-ary operations.
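For instance, a grouped aggregation over the environment from the first example might look as follows (a sketch; the exact set of supported functions is described in the TQL reference):

```java
try (ResultStream stream =
         env.query("SELECT LastName, COUNT(*) FROM Comedians GROUP BY LastName")
            .stream()) {
    stream.forEach(System.out::println);
}
```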
Each query has an output schema, describing what the query results look like. The output schema has to be compliant with the TCStore type system; otherwise the query method fails with an exception. For example, a query like SELECT CAST(Age AS BIGDECIMAL) AS FailType FROM Comedians will fail. In such a case the query has to be adapted so that the output schema only contains types available in the TCStore type system. When called with a valid query string, the method query returns an instance of type Result. This instance is used for consuming the results and accessing the output schema.
Consumption of results
A Result instance provides access to the results of the query as well as to the schema of these results.
Streaming of Query Results
The actual consumption of the results is provided by the method stream, which returns an instance of type ResultStream. Calling stream multiple times returns ResultStream instances that are executed independently of each other. As ResultStream extends java.util.stream.Stream, the results can be consumed in a streaming fashion using arbitrary subsequent intermediate or terminal operations. The results themselves are each a collection of cells. Note that the underlying query processing is done lazily, i.e., the results are not pre-computed in advance, but computed on demand when the next result is to be consumed. As with java.util.stream.Stream, each ResultStream instance can only be consumed once. The stream has to be closed in order to free resources; when using a terminal operation other than iterator or spliterator, it is closed automatically. However, it is good practice to close the stream explicitly.
The following example uses a try-with-resources statement to automatically close the stream after consuming all results.
try (ResultStream stream =
env.query("SELECT * FROM Comedians").stream()) {
stream.forEach(System.out::println);
}
Output Schema of Query Results
The result schema can be obtained by calling the method cellDefinitions on a Result instance, which returns a Collection<CellDefinition<?>>. The cell definitions of the output schema can be used to retrieve the corresponding cell values from a query result entry. When composing the query results, NULL values are translated to absent cells in the resulting cell collection. For example, if the output schema includes the cell definition "LastName" but its result value is NULL, the resulting cell collection does not include that cell.
Note: The alias-based resolution of ambiguities in the input schema may also affect the output schema. For example, resolving ambiguities automatically for input schema with CellDefinition<Integer> named "Age" and CellDefinition<Long> named "Age" will result for query "SELECT * FROM Source" in output schema CellDefinition<Integer> named "Age_INT" and CellDefinition<Long> named "Age_LONG".
Example
The following example illustrates the interplay of query schema and query results by plotting a table, using the result schema as header and the results as subsequent rows.
Result result = env.query("SELECT * FROM Comedians");
Collection<CellDefinition<?>> resultSchema = result.cellDefinitions();
try (ResultStream resultStream = result.stream()) {
String header = resultSchema.stream() // 1
.map(cd -> String.format(" %10s ", cd.name()))
.collect(Collectors.joining("|"));
String rows = resultStream // 2
.map(row ->
resultSchema.stream()
.map(row::get)
.map(o -> String.format(" %10s ", o.orElse(null)))
.collect(Collectors.joining("|"))
)
.collect(Collectors.joining("\n"));
System.out.println(String.format("%s%n%s", header, rows)); // 3
}
1 | Using stream operations, the header is built using the CellDefinition instances of the result schema. |
2 | Using stream operations, the query results are consumed and converted into a rows string. Note that the cell collections are indexed by the cell definitions of the output schema. As mentioned earlier, if the value of a result column is NULL, the cell collection delivered as the result does not include such a cell. When using the corresponding cell definition to get that cell from the cell collection, an Optional.empty() object is returned. |
3 | Finally, header and rows are printed. |
Insights into Query Execution
Each TQL query is parsed and translated into a physical query execution plan. That plan describes how logical operators are represented as physical operators, which indexes are used, and whether all the data needs to be scanned. In case of performance bottlenecks of a query, its query execution plan can be examined to determine which physical operators are used or where indexes might speed up the execution.
For that reason, ResultStream also provides access to a representation of that query plan. The method explainQuery is an intermediate stream operation that takes a java.util.function.Consumer as its only parameter. That consumer consumes an Object instance whose toString method returns a textual representation of the query execution plan. The object, as well as the content and form of the string representation, are subject to change without notice. The plan is only available once the ResultStream has been closed, either by calling the method close explicitly or by using a terminal operation (other than iterator or spliterator).
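A sketch of obtaining the plan (assuming the env variable from the earlier examples; note that the consumer is invoked only once the stream has been closed):

```java
try (ResultStream stream =
         env.query("SELECT * FROM Comedians").stream()) {
    stream.explainQuery(plan -> System.out.println(plan)) // textual plan via toString
          .forEach(result -> { /* consume results */ });
} // stream closed here; the plan consumer has been invoked
```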
Note: The method explainQuery can only be called directly on a ResultStream instance. Adding subsequent stream operations to a ResultStream and then calling explainQuery is not supported.
For more details on the query execution plan and its string-based representation, see section Performance Considerations.