Providing Dataset Path and Datatype Information in a Schema
Schemas provide three types of metadata to simplify or improve how RAQL queries interacts with XML or CSV datasets:
Path information to define what elements are rows for XML datasets.
Datatype information for each column in the dataset for XML or CSV datasets. For columns that are a date type, this can also include the lexical format for the data.
The delimiter character and whether the dataset has column names in the first row for CSV datasets.
See
Dataset Schema Syntax for information on how to define a dataset schema.
The scope of this schema is based on where you define it: within a single mashup or as a global
Presto attribute that can be used in any number of mashups. See
Dataset Schemas Defined in Mashups and
Global Dataset Schemas as
Presto
Attributes for information.
Dataset Schema Syntax
Schemas for datasets define a variable name for the schema, a set of columns with datatype and optional format information and an optional set of options for a dataset. The schema syntax is in the form:
define dataset variable-name(column-name datatype [format][, column-name datatype [format], ...]) [with options option-name= value, option-name= value, ...]
For example:
define dataset stocks (symbol string,
date datetime "yyyy-MM-dd",
open decimal,
close decimal,
high decimal,
low decimal,
volume decimal)
with options record="/stocks/stock"
See
Valid
RAQL
Datatypes for the types you can use in dataset schemas.
The
format metadata for date or time type columns accepts any lexical pattern that is valid for the Java
SimpleDate class. For the most common patterns you can use, see the
Date Formatter function for the Transformer block in
Wires.
There are three different options you can specify:
record=/path/to/row identifies the elements within the dataset, starting from the root, that should be used as rows. This uses the same syntax as paths you specify in a From clause,
excluding the variable name.
delimiter="character" identifies the delimiter used in CSV datasets when it is not the default delimiter (commas).
hasHeader=[true|false] indicates whether CSV datasets have column names as the first row. The default is true.
Dataset Schemas Defined in Mashups
You can declare a dataset schema in the mashup that loads a dataset using the EMML<variable> statement and a type of schema. For example:
<mashup xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:schemaLocation='http://www.openmashup.org/schemas/v1.0/EMML/
../schemas/EMMLPrestoSpec.xsd'
xmlns='http://www.openmashup.org/schemas/v1.0/EMML'
xmlns:macro='http://www.openmashup.org/schemas/v1.0/EMMLMacro'
xmlns:presto='http://www.jackbe.com/v1.0/EMMLPrestoExtensions'
name='xmlSchema'>
<output name='result' type='document' />
<variable name="stockType" type="schema">
define dataset stocks (symbol string,
date datetime,
open decimal,
close decimal,
high decimal,
low decimal,
volume decimal,
adjclose decimal)
</variable>
<variable name="stocks" type="variable:stockType" stream="true"/>
<directinvoke method='GET' stream='true' outputvariable='stocks'
endpoint='http://mdc.jackbe.com/prestodocs/data/stocks.xml'/>
<raql outputvariable='result'>
select symbol, "date", open, close, volume from stocks
where extract_year("date") = 2011 order by close
</raql>
</mashup>
The variable named
stockType defines a schema for the stock dataset introduced in
Use an
In-Memory Store
to Store and Load Datasets for
Presto Analytics in Getting Started. This variable is then
referenced in the variable named
stocks, using a
type of
variable:stockType, that will hold the dataset once it is loaded. The
type identifies the named variable containing the schema for this dataset.
The primary advantage of having the dataset defined is that RAQL queries now know datatypes so that filter conditions in Where, sorting criteria in Order By and functions or calculations in Over or Group By clauses work seamlessly without having to cast columns to the right datatype.
This is an example of this same mashup without schema information:
Sorting is defined on a numeric column, but because no datatype information is available from the original XML source the sort order in the result is wrong. But run the same query with schema information now available and the results are now sorted correctly:
Global Dataset Schemas as Presto Attributes
If a dataset will be used in many RAQL queries, you can define a schema for the dataset as a Presto global attribute that can be easily used in different mashups.
Presto administrators can create global attributes in the
Admin Console.
See Manage
Presto Global Attributes for instructions. For dataset schemas, the value of the
Presto global attribute is the full definition of the schema. In the following example:
The attribute name is yahooSearchSchema and the full definition of the schema is a single string as the attribute value.
Once you have the dataset schema defined as Presto global attribute you can use it in a mashup in a the <variable> statement with a name in the form global.attribute-name and a type of schema. This allows the mashup to use the global attribute to supply the schema definition. The following example retrieves the schema defined above from the Presto global attribute named yahooSearchSchema:
<mashup name="globalAttrSchema"
xmlns="http://www.openmashup.org/schemas/v1.0/EMML"
xmlns:macro="http://www.openmashup.org/schemas/v1.0/EMMLMacro"
xmlns:presto="http://www.jackbe.com/v1.0/EMMLPrestoExtensions"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openmashup.org/schemas/v1.0/EMML/../schemas/EMMLPrestoSpec.xsd">
<output name="result" type="document"/>
<variable default="coffee" name="query" type="string"/>
<variable default="98102" name="zip" type="string"/>
<variable name="appid" type="string"
default=".kcC72DV34FYTpAGuwwbV8YGI.DsMBQ0RB9eZARS621ecnHq33c.g1XJV93a64hrdaM3" />
<variable default="20" name="results" type="string"/>
<variable name="global.yahooSearchSchema" type="schema"/>
<variable name="searchResults" type="variable:global.yahooSearchSchema"/>
<invoke inputvariables="appid,query,zip,results" operation="getData"
outputvariable="searchResults" service="YahooLocalSearchREST"/>
<raql outputvariable="result">
select Title, Address, City, State, Phone, Latitude, Longitude,
Distance
from searchResults
</raql>
</mashup>
Then add the variable to hold the dataset and reference the schema variable using a type of variable:global.attribute-name, that will hold the dataset once it is loaded. The type identifies the named variable containing the schema for this dataset, In this example, the variable searchResults has a type that pulls in the global.yahooSearchSchema global attribute containing the schema definition.
In this example query when no schema is used, the mashup shows results of a single row even though the query to Yahoo Local Search asked for up to 20 results:
When the schema is added, supplying specific path information to rows in Yahoo’s results, the query now retrieves all 20 results: