Configuration

This page explains all configuration options available in SPARQLess. Before exploring the individual options, familiarize yourself with the basic concepts of SPARQLess described here.

Using the builder

The easiest way of building the SPARQLess Config object is by using the builder:

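// Node's built-in path module, needed for path.join below.
// SPARQLessConfigBuilder must also be imported from the SPARQLess package.
import * as path from 'path';

const config = new SPARQLessConfigBuilder()
    // SPARQL endpoint that SPARQLess will observe and query
    .sparqlEndpoint('https://data.gov.cz/sparql')
    .observation({
        // Save collected observations as a Turtle file
        observationsOutputPath: path.join(__dirname, '../../observations.ttl'),
    })
    .server({
        // Override the default port (4000)
        port: 4242,
    })
    .build();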

You may of course create the Config object manually, but using the builder lets you make use of pre-defined sane defaults, allowing you to only specify the configuration values you care about.

The defaults specify a console logger and configure the server to run on port 4000. They also set many other options to values suitable for general use, such as enabling hot reloading.

The rest of this document describes the values as present in the Config type, but all of the information is also applicable to the builder.

General configuration

The root configuration type is Config, which exposes the following properties:

interface Config {
    endpoint: SPARQLEndpointDefinition;
    logger?: winston.Logger;
    observation: ObservationConfig;
    postprocessing: PostprocessingConfig;
    schema?: SchemaConfig;
    server: ServerConfig;
    hotReload: HotReloadConfig;
    modelCheckpoint?: ModelCheckpointConfig;
}

The endpoint property is mandatory, since it contains the SPARQL endpoint which you want SPARQLess to run against. An endpoint looks like this:

{
    url: 'https://dev.nkod.opendata.cz/sparql',
    name: 'NKOD',
}

Other options are mandatory as well, so the recommended approach is to use the SPARQLessConfigBuilder, which supplies sane defaults. You can then configure individual options through the builder as required and leave the remaining mandatory options at their defaults.

Configuring a logger is also highly recommended: the bootstrapping process can take a very long time, and it helps to know exactly what is happening. The logging framework of choice is winston, so you are free to pass in any winston logger. If you want a sensible default, you can use the DEFAULT_LOGGER exposed by SPARQLess. The SPARQLessConfigBuilder uses this default logger automatically.

Specialized configuration

The remaining configuration values are more specialized, and they affect individual components of SPARQLess.

All of these options have sane default values when used with the SPARQLessConfigBuilder, so it is recommended to first try not defining them (and thereby using the defaults). If you find that you want to adjust the behavior of SPARQLess afterwards, then you can look into modifying these values.

Observation

The observation property of Config configures the endpoint observation phase, and it can contain the following values:

interface ObservationConfig {
    maxPropertyCount: number | undefined;
    propertySampleSize: number | undefined;
    ontologyPrefixIri: string;
    shouldDetectNonArrayProperties: boolean | undefined;
    shouldCountProperties: boolean | undefined;
    observationsOutputPath: string | undefined;
}

maxPropertyCount sets a maximum number of properties to be examined when performing observations which count or enumerate many properties. Unless you need to have the most accurate schema in the GraphQL endpoint from the very start, it is recommended to set this value. It will significantly speed up observations on large datasets. A good default value is 1000, and for best results, combine this value with hot reloading, where each iteration of hot reloading increases it by an order of magnitude.
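The suggested growth of maxPropertyCount across hot reloading iterations can be illustrated with a small sketch. The helper maxPropertyCountSchedule is hypothetical and not part of SPARQLess; it only shows which values you would end up with when starting from 1000 and multiplying by ten on each iteration.

```typescript
// Hypothetical helper, not part of SPARQLess: lists the maxPropertyCount
// value used in each hot reloading iteration, growing tenfold every time.
function maxPropertyCountSchedule(initial: number, iterations: number): number[] {
    const schedule: number[] = [];
    let current = initial;
    for (let i = 0; i < iterations; i++) {
        schedule.push(current);
        current *= 10; // one order of magnitude per iteration
    }
    return schedule;
}

const schedule = maxPropertyCountSchedule(1000, 3);
console.log(schedule); // [ 1000, 10000, 100000 ]
```

The first iteration finishes quickly on a small limit, while later iterations gradually approach a complete picture of the dataset.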

When analyzing the range of each attribute and association, a sample of up to propertySampleSize occurrences is selected, and their types are used to determine that property's type. Setting propertySampleSize is highly recommended, with a reasonable default being 100 or 1000. In rare cases this setting may cause the generated schema to miss some return types for some properties. Leaving it unlimited, however, may cause errors during observation of large datasets, when the process cannot allocate enough memory to hold all of the observations.
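The trade-off described above can be sketched as follows. This is an illustration of the idea only, not the way SPARQLess performs its observations: the function inferPropertyTypes and the sample data are assumptions made for this example.

```typescript
// Illustrative sketch only; this is not SPARQLess's observation code.
// Collects the distinct types seen among at most `sampleSize` occurrences
// of a property. An undefined sampleSize means "look at everything".
function inferPropertyTypes(
    occurrenceTypes: string[],
    sampleSize: number | undefined,
): Set<string> {
    const sample =
        sampleSize === undefined
            ? occurrenceTypes
            : occurrenceTypes.slice(0, sampleSize);
    return new Set(sample);
}

// Made-up data: the rare xsd:date value appears only as the fourth occurrence.
const occurrences = ['xsd:string', 'xsd:string', 'xsd:string', 'xsd:date'];

const sampled = inferPropertyTypes(occurrences, 3); // misses xsd:date
const unlimited = inferPropertyTypes(occurrences, undefined); // sees both types
```

With a sample size of 3 the rare type is missed, which corresponds to a return type missing from the generated schema. The unlimited variant finds it, but has to hold every occurrence in memory.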

ontologyPrefixIri sets the IRI for the ontology created during observation. You should not need to modify this value from the default http://skodapetr.eu/ontology/sparql-endpoint/, unless you wish to save the observations themselves and use them for other purposes.

By default, all properties in the created GraphQL schema are arrays, since in RDF, any property can be specified multiple times. Setting shouldDetectNonArrayProperties enables observations that flag properties as scalars, so that a property is only typed as an array if it actually contains multiple values.
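To make the effect on the schema concrete, here is a minimal sketch of the decision, assuming the observation yields the number of values each subject has for a given property. The functions isArrayProperty and graphqlFieldType are hypothetical names and do not describe SPARQLess internals.

```typescript
// Hypothetical sketch, not SPARQLess internals.
// `valueCountsPerSubject` holds, for one property, how many values
// each observed subject has for that property.
function isArrayProperty(valueCountsPerSubject: number[]): boolean {
    return valueCountsPerSubject.some((count) => count > 1);
}

function graphqlFieldType(
    typeName: string,
    valueCountsPerSubject: number[],
    shouldDetectNonArrayProperties: boolean,
): string {
    // Without detection, every property stays an array.
    if (!shouldDetectNonArrayProperties) {
        return `[${typeName}]`;
    }
    return isArrayProperty(valueCountsPerSubject) ? `[${typeName}]` : typeName;
}

const withoutDetection = graphqlFieldType('String', [1, 1, 1], false); // "[String]"
const scalarDetected = graphqlFieldType('String', [1, 1, 1], true); // "String"
const arrayDetected = graphqlFieldType('String', [1, 3], true); // "[String]"
```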

The created GraphQL schema contains comments with additional information about the data: for each property, a comment specifies how many times the property occurs in the dataset. This can be helpful when first exploring the schema and deciding whether a property is important. Setting shouldCountProperties to true enables the counting of properties; otherwise their counts are set to 0.

It is recommended to initially set both shouldCountProperties and shouldDetectNonArrayProperties to false to ensure fast startup time, but to set them to true in the hot reloading config. That way, the necessary observations will be carried out in the background while you can already explore and query the dataset.
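The recommendation above amounts to two observation configurations: a fast one for startup and a thorough one for the background. The sketch below repeats the ObservationConfig interface from this page so that it is self-contained. The numeric values are the suggested defaults mentioned earlier, and the names fastStartup and thorough are illustrative. How the second object is passed to hot reloading depends on HotReloadConfig, which this page does not describe, so that step is left out.

```typescript
// Copied from the ObservationConfig definition on this page.
interface ObservationConfig {
    maxPropertyCount: number | undefined;
    propertySampleSize: number | undefined;
    ontologyPrefixIri: string;
    shouldDetectNonArrayProperties: boolean | undefined;
    shouldCountProperties: boolean | undefined;
    observationsOutputPath: string | undefined;
}

// Fast initial observation: skip the expensive counting and scalar detection.
const fastStartup: ObservationConfig = {
    maxPropertyCount: 1000,
    propertySampleSize: 100,
    ontologyPrefixIri: 'http://skodapetr.eu/ontology/sparql-endpoint/',
    shouldDetectNonArrayProperties: false,
    shouldCountProperties: false,
    observationsOutputPath: undefined, // do not save observations to disk
};

// Thorough observation intended for the hot reloading phase:
// same limits, but with both expensive observations enabled.
const thorough: ObservationConfig = {
    ...fastStartup,
    shouldDetectNonArrayProperties: true,
    shouldCountProperties: true,
};
```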

If set, observationsOutputPath dictates the path to which the collected observations should be written. They are saved as a Turtle RDF file, so a .ttl suffix is recommended. If this option is undefined, the observations will not be saved to disk.

Postprocessing

Schema

The schema configuration only contains one option:

interface SchemaConfig {
    graphqlSchemaOutputPath: string | undefined;
}

If you set graphqlSchemaOutputPath to a valid file path, the GraphQL schema will be saved to this path when it is generated. This can be useful if you want to use visualization tools to help you explore the created GraphQL endpoint.

Server

The server configuration contains one option:

interface ServerConfig {
    port: number;
}

The port option will configure the port where the GraphQL endpoint will be available. The default value is 4000. If you visit this port in the browser, you will get access to an instance of Apollo Studio Explorer, which will let you visually build GraphQL queries and examine the GraphQL schema.

Model Checkpointing

Read more about model checkpointing here.