Configuration

Configuration of the RDF Server #

SemSpect Databases #

SemSpect can serve multiple databases at a time. In this context, a database is a SemSpect index on disk, either already existing or still to be generated from a set of RDF sources, that SemSpect loads into main memory. A database can be either managed or static.

  • Static databases are defined when SemSpect starts, either by a set of RDF sources or by an already existing SemSpect index directory. These databases are static in the sense that they cannot be changed in any way once SemSpect is running. However, they allow a wide range of configuration options, which are described below in this document.

  • Managed databases can be created, changed or deleted at any time while SemSpect is running. There is a REST API to interact with SemSpect in managed mode.

A fine-grained configuration of SemSpect is available through parameters and the special semspect-spring.sh or semspect-spring.bat scripts. In general, there are two ways to specify parameters for SemSpect:

  • Using Spring application parameters: For example

    ./semspect-spring.sh --server.port=8080 --semspect.rdf.databases[0].mode=load [...]

  • Using an external configuration file (.yaml or .properties):

    ./semspect-spring.sh --spring.config.additional-location="path/to/config_file"
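
As a minimal sketch of the two approaches, the script below writes a small YAML file and prints two start commands that should be equivalent, assuming the usual Spring Boot mapping between nested YAML keys and dotted parameters. The file name semspect-example.yaml and the index path are placeholders. Arguments containing "[0]" are quoted because an unquoted "[0]" can be interpreted by the shell as a glob pattern.

```shell
#!/bin/bash
# Placeholder file name and index path; adjust both to your setup.
CONFIG_FILE="semspect-example.yaml"

# Approach 2: settings in an external YAML configuration file.
cat > "$CONFIG_FILE" <<'EOF'
server.port: 8080
semspect.rdf:
  databases:
    - database: "default-db"
      mode: load
      indicesDirectory: "path/to/indices/directory"
EOF
CONFIG_FILE_CMD="./semspect-spring.sh --spring.config.additional-location=$CONFIG_FILE"

# Approach 1: the same settings as Spring application parameters.
# The single quotes keep the shell from treating "[0]" as a glob pattern.
SPRING_ARGS_CMD="./semspect-spring.sh --server.port=8080"
SPRING_ARGS_CMD="$SPRING_ARGS_CMD '--semspect.rdf.databases[0].database=default-db'"
SPRING_ARGS_CMD="$SPRING_ARGS_CMD '--semspect.rdf.databases[0].mode=load'"
SPRING_ARGS_CMD="$SPRING_ARGS_CMD '--semspect.rdf.databases[0].indicesDirectory=path/to/indices/directory'"

# Print both commands instead of running them (each would start a server).
printf '%s\n' "$SPRING_ARGS_CMD" "$CONFIG_FILE_CMD"
```

The configuration file tends to be easier to maintain once more than a handful of parameters are set.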

The remainder of this page describes all available configuration options for generating indices, loading indices, and starting SemSpect, using the second approach (a YAML configuration file).

Configuration of Managed Databases #

Any run of SemSpect using ./semspect-spring.sh (even without any arguments) launches SemSpect in managed mode and will generate the files of all managed databases in a new folder of the current JVM working directory. The storage folder of the managed databases can be specified as follows:

semspect.rdf:
  # managed.indicesDirectory
  #
  # storage folder for the SemStore indices that are managed over the REST API
  #
  # default: new folder in the JVM working directory
  managed:
    indicesDirectory: "path/where/indices/are/managed"

For more details on the REST API for managing databases, see the dedicated page on the managed mode in SemSpect.

Configuration of Static Databases #

Since SemSpect is able to manage multiple static databases in parallel, the basic structure of our SemSpect server YAML configuration for static databases looks as follows:

semspect.rdf:
  # databases
  #
  # Array of databases that are initialized on startup.
  #
  # default: empty array
  databases:
      # database (required param)
      # 
      # Name of the database - must be unique within the configuration.
      # Allowed characters:
      # alphanumeric characters and '_', '.', '-', '~'
      # (regex: ^[a-zA-Z0-9_.~-]+$ )
      #
    - database: "database-1" 
      description: "Description of DB 1 example"
      
      # mode (required param)
      # 
      # Startup mode for the specified database
      # (described in more detail in the subsequent sections)
      # 
      # options: 
      # - generate: generate indices to the "indicesDirectory" based
      #   upon the configuration of the "indexing" parameter
      # - load: load indices of "indicesDirectory" and initialize 
      #   them based upon the configuration of the "exploration" parameter
      mode: generate
      indicesDirectory: "path/to/db1-indices"
      indexing:
        ... # indexing configuration - only considered if mode is set to "generate"
      exploration:
        ... # configuration for exploration - considered when indices are loaded into main memory
      
    - database: "database-2"
      mode: load
      description: "Description of DB 2 example"
      indicesDirectory: "path/to/db2-indices"
      indexing:
        ...
      exploration:
        ...
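
The naming rule for the "database" parameter can be checked before startup. The following sketch uses a hypothetical helper, is_valid_database_name, which is not part of SemSpect; it only tests a name against the allowed characters listed above and does not check uniqueness or any further constraints SemSpect might apply.

```shell
#!/bin/bash
# Hypothetical helper, not part of SemSpect: succeeds if the name consists
# only of alphanumeric characters, '_', '.', '-' and '~'.
is_valid_database_name() {
  # The hyphen is placed last in the bracket expression so that it is
  # taken literally instead of forming a character range.
  printf '%s\n' "$1" | grep -Eq '^[a-zA-Z0-9_.~-]+$'
}

# Example names; the last two contain characters outside the allowed set.
for name in "database-1" "db_2.test~a" "my database" "db/3"; do
  if is_valid_database_name "$name"; then
    echo "ok:      $name"
  else
    echo "invalid: $name"
  fi
done
```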

Index Generation #

################################
# required generation parameters
################################

# "databases[0]" can be used to specify the first entry in the databases array
# (used here to highlight the full path of the nested YAML parameter)
semspect.rdf.databases[0].database: default-db   # the default
semspect.rdf.databases[0].mode: generate
  
################################
# optional generation parameters
################################

# rdfDataSources
#
# Supported data sources:
# - plain files: *.ttl, *.ttls, *.owl, *.nt, *.n3, *.rdf, *.hdt, *.jsonld,
#   *.brf, *.rj, *.trig, *.trigs,
# - compressed files: *.bz2, *.gz
# - archives: *.zip
# - directories
# - URLs: supports plain files or compressed files (no archives).
#   - Note: The HTTP content type must indicate the correct RDF format,
#     otherwise RDF Turtle is assumed.
# - Graph store content URLs:
#   - GraphDB (tested with v10.5.1):
#       http://localhost:7200/repositories/some-store/statements
#   - RDFox (works with >= v6.3a):
#      http://localhost:12110/datastores/some-store/content?default
#
# default: empty array
semspect.rdf.databases[0].indexing.rdfDataSources:
  - DATA_SOURCE_1
  - ...
  - DATA_SOURCE_N

# indicesDirectory
#
# storage folder for the SemStore indices
#
# default: new folder in the JVM working directory
semspect.rdf.databases[0].indicesDirectory: "path/to/directory"

# parsingStrategy
#
# options:
# - ONE_PASS (default):
#   - generate base structures (triples, dictionary) in a single pass
#     over the provided RDF datasets
#   - consumes more main memory because uncompressed dictionary and triples
#     are loaded simultaneously into memory
# - TWO_PASS:
#   - generate base structures (triples, dictionary) in two iterations
#     over the provided RDF datasets
#   - dictionary is generated in first pass, triples in second
#   - consumes less main memory since dictionary is compressed during
#     generation on demand
semspect.rdf.databases[0].indexing.parsingStrategy: ONE_PASS

# numberOfThreads
#
# default: available processors of machine
semspect.rdf.databases[0].indexing.numberOfThreads: 4

# terminateAfterIndexing
#
# default: false
semspect.rdf.databases[0].indexing.terminateAfterIndexing: false

# validateParsedResources
#
# default: false
semspect.rdf.databases[0].indexing.validateParsedResources: false

# iriDictionarySectionType
#
# Which type of dictionary section should be used for all IRIs.
# An uncompressed type requires more space but might lead to a higher
# performance when applying string filters.
#
# options:
# - PLAIN_FRONT_CODING (default)
# - UNCOMPRESSED_STRINGS
semspect.rdf.databases[0].indexing.iriDictionarySectionType: PLAIN_FRONT_CODING

# stringLiteralDictionarySectionType
#
# Which type of dictionary section should be used for all string literals.
# An uncompressed type requires more space but might lead to a higher
# performance when applying string filters.
#
# options:
# - PLAIN_FRONT_CODING (default)
# - UNCOMPRESSED_STRINGS
semspect.rdf.databases[0].indexing.stringLiteralDictionarySectionType: PLAIN_FRONT_CODING

# disableRDFSDomainRangeEntailment
#
# When set to true all entailments implied by rdfs:domain as well as rdfs:range
# are ignored.
#
# default: false
semspect.rdf.databases[0].indexing.disableRDFSDomainRangeEntailment: false

#########################
# experimental features
#########################

# translateSKOSToRDFS
#
# This option introduces a class hierarchy to reflect the SKOS concept
# taxonomy. A concept class is generated for each SKOS concept that has a
# narrower concept, and each skos:broader, skos:narrower, and skos:exactMatch
# relation is translated to the corresponding rdfs:subClassOf axioms.
# Moreover, all root and leaf concepts of the derived hierarchy are assigned new classes
# "Root Concept" and "Leaf Concept" respectively.
# Example:
#   For ":super-c skos:narrower :sub-c", we derive ":sub-c rdfs:subClassOf :super-c".
#   For ":sub-c skos:broader :super-c", we derive ":sub-c a :super-c".
#
# default: false
semspect.rdf.databases[0].indexing.translateSKOSToRDFS: false

# indexReificationTriples
#
# If set to "true", reification triples that use the RDF or OWL vocabulary, such as "_:xxx rdf:subject :some-subject",
# are indexed and can then be explored. However, as rdf:Statement and owl:Axiom are disabled by default, it may be useful
# to additionally set the parameter "[...].exploration.showStatementClassInTree" (as specified in the load configuration).
# RDF Reification vocabulary: https://www.w3.org/TR/rdf11-mt/#reification
# OWL Annotation vocabulary: https://www.w3.org/TR/owl2-quick-reference/#Annotations
#
# default: false
semspect.rdf.databases[0].indexing.indexReificationTriples: false
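
Put together, a one-off index build might be configured as in the following sketch, which uses the nested YAML form from the static database example. The file name generate-config.yaml, the data source data/example.ttl and the index path are placeholders. TWO_PASS and terminateAfterIndexing are set only for illustration; that terminateAfterIndexing stops SemSpect once indexing has finished is an assumption based on the parameter name.

```shell
#!/bin/bash
# Placeholder file name; adjust to your setup.
CONFIG_FILE="generate-config.yaml"

cat > "$CONFIG_FILE" <<'EOF'
semspect.rdf:
  databases:
    - database: "default-db"
      mode: generate
      indicesDirectory: "path/to/db-indices"
      indexing:
        rdfDataSources:
          - "data/example.ttl"
        # two iterations over the data, but less main memory needed
        parsingStrategy: TWO_PASS
        # assumption: SemSpect exits once the indices have been written
        terminateAfterIndexing: true
EOF

# Start the index generation only if the start script is in the current directory.
if [ -x ./semspect-spring.sh ]; then
  ./semspect-spring.sh --spring.config.additional-location="$CONFIG_FILE"
else
  echo "semspect-spring.sh not found; wrote $CONFIG_FILE only"
fi
```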

Load Indices and Start SemSpect #

#############################
# required loading parameters
#############################

# "databases[0]" can be used to specify the first entry in the databases array
# (used here to highlight the full path of the nested YAML parameter)
semspect.rdf.databases[0].database: default-db
semspect.rdf.databases[0].mode: load
semspect.rdf.databases[0].indicesDirectory: "path/to/indices/directory"

Additional Initialization Parameters (All Optional) #

Server Settings #

##################################
# additional server settings
##################################

# port
#
# default: 8080
server.port: 8080

# x-frame support
# The x-frame functionality is a way of combining and organizing web-based
# documents together on a single webpage through the use of frames (HTML
# elements that allow you to display content from another source, such as
# another webpage or a video). By default, the x-frame functionality is
# disabled by Spring Boot. The following parameter defines frame-ancestors
# so that SemSpect can be included in other HTML pages
# (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/frame-ancestors).
#
# default: disabled; below are some examples
server.frame-ancestors:
  # local HTML files are allowed to integrate SemSpect in an x-frame
  - "file://*"
  # any subdomain of "example.org" is allowed to integrate SemSpect in an x-frame
  - "https://*.example.org"

# context path
#
# Determines at which context path the content will be served by the server.
# For instance, if it is set to "/dataset-x", SemSpect gets hosted on "localhost:8080/dataset-x".
# This option might be helpful to distinguish different running instances of
# SemSpect not merely by their port.
#
# default: /
server.servlet.context-path: /

Exploration Settings #

##################################
# additional exploration settings
##################################

# explorationMenuWithoutCount
#
# If enabled, the exploration menu does not show the distinct number of connected
# resources. It only shows whether at least one resource of a particular class is
# connected to the given group (increases performance).
#
# default: false
semspect.rdf.databases[0].exploration.explorationMenuWithoutCount: false

# numberOfThreads
#
# default: number of processors of machine
semspect.rdf.databases[0].exploration.numberOfThreads: 4

# cacheGroups
#
# default: true
semspect.rdf.databases[0].exploration.cacheGroups: true

# showClassesAndPropertiesAsResources
#
# If set to true, rdf:Property and rdfs:Class are shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showClassesAndPropertiesAsResources: false

# showTopClassInTree
#
# If set to true, the top class rdfs:Resource is shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showTopClassInTree: false

# showStatementClassInTree
#
# If set to true, the classes rdf:Statement and owl:Axiom are shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showStatementClassInTree: false

# logMemoryUsage
#
# Logs the used main memory to the file
# INDICES_DIRECTORY/exploration/log/memoryConsumption.csv
#
# default: false
semspect.rdf.databases[0].exploration.logMemoryUsage: false

# explorationMenuComputationMethod
#
# options:
# - ROARING_BITMAPS_PER_CLASS
# - SORTED_ITERATION_PER_CLASS
# - INDIVIDUAL_QUERIES_PER_CLASS
# - INDIVIDUAL_QUERIES
# - ROARING_BITMAPS
# - DYNAMICALLY_DETERMINED_PER_CLASS
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.explorationMenuComputationMethod: DYNAMICALLY_DETERMINED

# explorationMenuWithoutCountComputationMethod
#
# options:
# - PER_CLASS
# - GLOBAL
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.explorationMenuWithoutCountComputationMethod: DYNAMICALLY_DETERMINED

# predecessorCountComputationMethod
#
# options:
# - SORTED_ITERATION
# - HASH_SET
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.predecessorCountComputationMethod: DYNAMICALLY_DETERMINED

# filterComputationMethod
#
# options:
# - QUERY_PER_INDIVIDUAL
# - INDEX_ITERATION
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.filterComputationMethod: DYNAMICALLY_DETERMINED

# sortingMethod
#
# options:
# - INDEX_ITERATION
# - SORTING_ON_THE_FLY
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.sortingMethod: DYNAMICALLY_DETERMINED
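
The server and exploration settings can be combined with the required loading parameters in one file. The sketch below assumes that indices already exist at the placeholder path; the file name load-config.yaml, port 8081, the context path /dataset-x and the chosen exploration flags are example values only.

```shell
#!/bin/bash
# Placeholder file name; adjust to your setup.
CONFIG_FILE="load-config.yaml"

cat > "$CONFIG_FILE" <<'EOF'
server.port: 8081
server.servlet.context-path: /dataset-x
semspect.rdf:
  databases:
    - database: "default-db"
      mode: load
      indicesDirectory: "path/to/indices/directory"
      exploration:
        # show only whether connected resources exist, not how many
        explorationMenuWithoutCount: true
        # show rdf:Statement and owl:Axiom in the class tree
        showStatementClassInTree: true
        logMemoryUsage: true
EOF

echo "Expected address with this configuration: localhost:8081/dataset-x"

# Start SemSpect only if the start script is in the current directory.
if [ -x ./semspect-spring.sh ]; then
  ./semspect-spring.sh --spring.config.additional-location="$CONFIG_FILE"
else
  echo "semspect-spring.sh not found; wrote $CONFIG_FILE only"
fi
```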

IRI Prefix Configuration #

In order to shorten resource IRIs in the UI, SemStore collects all prefixes that are defined in the provided RDF datasets. Furthermore, a list of commonly used RDF prefixes is added by default. Note that prefixes given explicitly in the RDF datasets take priority over the defaults. To examine and modify the IRI-to-prefix map, edit the file INDICES_DIRECTORY/exploration/config/iriToPrefixMap.yaml. Changes take effect at the next startup of SemSpect.

Script Configuration #

All scripts can be configured through environment variables:

  • Path of the java command: SEMSPECT_JAVA_PATH
    (default: "java")
  • SemSpect specific JDK options: SEMSPECT_JDK_OPTIONS
    (default: <empty>, overrides JDK_JAVA_OPTIONS and JAVA_TOOL_OPTIONS)
  • Installation directory: SEMSPECT_HOME
    (default: <script location>)

In addition, the smart scripts (semspect.sh and semspect.bat) provide variables for the configuration and output paths:

  • Location of the semspect configuration: SEMSPECT_CONFIG_PATH
    (default: <SEMSPECT_HOME>/semspect-config/semspect-config.yaml)
  • Location of the semstore configuration: SEMSTORE_CONFIG_PATH
    (default: <SEMSPECT_HOME>/semspect-config/default-semstore-config.yaml)
  • Location of the indices: SEMSTORE_INDICES_DIR
    (default: <SEMSPECT_HOME>/semspect-indices/)
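
As an illustration, the variables can be exported before calling a script. All paths below are placeholders, and the heap size is an arbitrary example; -Xmx is a standard JVM option rather than a SemSpect-specific one.

```shell
#!/bin/bash
# Placeholder values; adjust all paths to your installation.
export SEMSPECT_JAVA_PATH="/path/to/jdk/bin/java"
export SEMSPECT_HOME="$PWD"

# Standard JVM option for the maximum heap size (example value).
export SEMSPECT_JDK_OPTIONS="-Xmx8g"

# Only evaluated by the smart scripts (semspect.sh and semspect.bat).
export SEMSTORE_INDICES_DIR="$SEMSPECT_HOME/my-indices/"

# Run the smart script only if it exists in SEMSPECT_HOME.
if [ -x "$SEMSPECT_HOME/semspect.sh" ]; then
  "$SEMSPECT_HOME/semspect.sh" run test-data.ttl
else
  echo "semspect.sh not found in $SEMSPECT_HOME"
fi
```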

Dossier Configuration #

For each dataset, we can define the rendering type of data property values such that URLs become links or image URLs show the respective image etc. This is done in a separate dossier configuration.

As an example, consider the following data:

@prefix :     <http://www.example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Company
  a rdfs:Class .

:derivo
  a      :Company ;
  :name  "Derivo" ;
  :image "https://www.derivo.de/fileadmin/externe_websites/ext.derivo/images/layout/logo_deutsch_01.png" ;
  :email "info@derivo.de" ;
  :url   "https://www.derivo.de/home/" .

Now, to specify the UI types of the respective properties (called attributes in the configuration), a YAML dossier configuration is required:

version: 1
databases:
  - database: default-db
    attributes:
      - id: 'http://www.example.org/image'
        type: 'IMAGE' # images or GIFs
      - id: 'http://www.example.org/url'
        type: 'URL'
      - id: 'http://www.example.org/email'
        type: 'EMAIL'

Depending on the startup script, SemSpect reads the dossier configuration by default from the following paths:

  • semspect.sh: SEMSPECT_HOME/config/datasets/SOME_DATASET/dossier.yaml
  • semspect-spring.sh: INDICES_DIRECTORY/exploration/config/dossier.yaml
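
For a database started with semspect-spring.sh, the default location could be populated as sketched below. The value of INDICES_DIRECTORY is a placeholder for the "indicesDirectory" of the database, and the single attribute is taken from the example above.

```shell
#!/bin/bash
# Placeholder: the "indicesDirectory" configured for the database.
INDICES_DIRECTORY="path/to/db1-indices"
CONFIG_DIR="$INDICES_DIRECTORY/exploration/config"

# Create the configuration folder if it does not exist yet.
mkdir -p "$CONFIG_DIR"

cat > "$CONFIG_DIR/dossier.yaml" <<'EOF'
version: 1
databases:
  - database: default-db
    attributes:
      - id: 'http://www.example.org/url'
        type: 'URL'
EOF

echo "Wrote $CONFIG_DIR/dossier.yaml"
```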

Alternatively, a different file can be specified using the following JVM parameter (not available as a Spring parameter):

-Dde.derivo.semspect.server.configuration.dossier.path

Example:

#!/bin/bash
export SEMSPECT_JDK_OPTIONS=-Dde.derivo.semspect.server.configuration.dossier.path="path/to/dossier-config.yaml"
./semspect.sh run test-data.ttl

[Figure: result in SemSpect]

Facet Configuration (Experimental) #

In RDF SemSpect, facets are configurable filters for groups, shown in separate sections of the dossier on the right-hand side. They behave in the same way as the class filters and can be collapsed or expanded in the case of a hierarchy. In fact, facets display classes from the class schema. A facet configuration defines which parts of the class schema, and in which combination, are shown as a separate facet. This is useful to distinguish between the backbone class schema and classes that provide a “second perspective” on the data. As an example, consider a data set that deals with clothes, their sales status and usage type (the matching RDF data is listed further below).

If our main focus is on the type of clothing (shirts, trousers etc.), it might be useful to declare sales and seasonal information as facets. This has the advantage that the class tree and the exploration menu become shorter (facets are not shown in the class tree and exploration menu). However, the facet classes are still available for filtering on those groups that contain individuals of those classes.

Please note that all specified classes of a facet no longer appear in the class hierarchy shown on the left, unless they also occur in a branch that is not declared as a facet.

Facets are defined in a facet configuration file. Each facet definition consists of one or more so-called facet values and a facet name. There are two types of facets:

  • simple (default): only the listed classes are used as facet values, not their subclasses (unless they are also listed). This setting is mostly relevant for Neo4j SemSpect where the label hierarchies are extrapolated from the data and might not be semantically meaningful (this does not happen in RDF SemSpect).
    Remark: unlisted subclasses of a simple facet class will be moved up one level in the class tree and histograms (or hidden if they appear in a neighbour branch).

  • subtree: the listed classes and all their subclasses are used as facet values.
    Remark: subclasses of a subtree facet class that also appear in other non-facet branches of the class hierarchy can still be found there.

In the example above, the following facet definition was given:

version: 1
databases:
  - database: default-db
    facets:
      - name: 'Product type'
        values: [ 'http://www.semspect.de/test#Special_offer',
                  'http://www.semspect.de/test#Remaining_item' ]
        type: simple
      - name: 'Seasonal classification'
        values: [ 'http://www.semspect.de/test#Seasonal_type' ]
        type: subtree

Matching RDF data of this example:

@prefix : <http://www.semspect.de/test#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Textiles rdf:type rdfs:Class .
:Trousers rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Pullover rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Shirts rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Dresses rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Suits rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .

:Special_offer rdf:type rdfs:Class .
:Remaining_item rdf:type rdfs:Class .

:Seasonal_type rdf:type rdfs:Class .
:Summer rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Winter rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Transitional rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Spring rdf:type rdfs:Class ; rdfs:subClassOf :Transitional .
:Autumn rdf:type rdfs:Class ; rdfs:subClassOf :Transitional .

:trouser1 rdf:type :Trousers ; rdf:type :Winter .
:trouser2 rdf:type :Trousers ; rdf:type :Winter ; rdf:type :Special_offer .
:trouser3 rdf:type :Trousers ; rdf:type :Autumn .
:trouser4 rdf:type :Trousers ; rdf:type :Spring .
:trouser5 rdf:type :Trousers ; rdf:type :Winter .
:short1 rdf:type :Trousers ; rdf:type :Summer ; rdf:type :Special_offer .
:short2 rdf:type :Trousers ; rdf:type :Summer .

:pullover1 rdf:type :Pullover ; rdf:type :Autumn .
:pullover2 rdf:type :Pullover ; rdf:type :Autumn .
:pullover3 rdf:type :Pullover ; rdf:type :Winter ; rdf:type :Remaining_item .
:pullover4 rdf:type :Pullover ; rdf:type :Winter .

:shirt1 rdf:type :Shirts ; rdf:type :Autumn ; rdf:type :Special_offer .
:shirt2 rdf:type :Shirts ; rdf:type :Winter .
:shirt3 rdf:type :Shirts ; rdf:type :Spring .
:tshirt1 rdf:type :Shirts ; rdf:type :Summer .
:tshirt2 rdf:type :Shirts ; rdf:type :Summer .

:dress1 rdf:type :Dresses ; rdf:type :Autumn .
:dress2 rdf:type :Dresses ; rdf:type :Winter .
:dress3 rdf:type :Dresses ; rdf:type :Autumn ; rdf:type :Remaining_item .

:suit1 rdf:type :Suits ; rdf:type :Autumn .
:suit2 rdf:type :Suits ; rdf:type :Winter ; rdf:type :Special_offer .
:suit3 rdf:type :Suits ; rdf:type :Summer .
:suit4 rdf:type :Suits ; rdf:type :Summer .

:matches rdf:type rdf:Property .

:trouser1 :matches :pullover1 .
:trouser1 :matches :shirt1 .
:trouser2 :matches :pullover1 .
:trouser2 :matches :pullover2 .
:trouser3 :matches :pullover3 .
:trouser4 :matches :pullover4 .
:trouser4 :matches :shirt3 .
:trouser4 :matches :shirt2 .

Depending on the startup script, SemSpect reads the facet definitions from the following paths (if they exist):

  • semspect.sh: SEMSPECT_HOME/config/datasets/SOME_DATASET/facets.yaml
  • semspect-spring.sh: INDICES_DIRECTORY/exploration/config/facets.yaml

Alternatively, a different file can be specified by a JVM parameter:

-Dde.derivo.semspect.server.configuration.facets.path=/path/to/semspect_facets.yaml
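
By analogy with the dossier example, the facet path parameter can be passed through SEMSPECT_JDK_OPTIONS. The sketch below writes one of the facet definitions from above to a placeholder file in the current directory and sets the option accordingly; the file location and the test-data.ttl input are assumptions for illustration.

```shell
#!/bin/bash
# Placeholder location of the facet configuration file.
FACETS_FILE="$PWD/semspect_facets.yaml"

cat > "$FACETS_FILE" <<'EOF'
version: 1
databases:
  - database: default-db
    facets:
      - name: 'Seasonal classification'
        values: [ 'http://www.semspect.de/test#Seasonal_type' ]
        type: subtree
EOF

# Note: a path containing spaces would need additional quoting here.
export SEMSPECT_JDK_OPTIONS="-Dde.derivo.semspect.server.configuration.facets.path=$FACETS_FILE"

# Start SemSpect only if the smart script is in the current directory.
if [ -x ./semspect.sh ]; then
  ./semspect.sh run test-data.ttl
else
  echo "semspect.sh not found; wrote $FACETS_FILE only"
fi
```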

SemStore Statistics #

SemStore collects statistics while generating all indices as well as during the exploration of a corresponding dataset. These statistics are stored in a subdirectory of the specified indices folder and can be visualized with our SemStore statistics Python application that is available on DockerHub. To execute the respective application with Docker, take a look at the scripts located in the tools/semstore-statisics/ folder of the SemSpect installation directory.

To generate plots for a single indices directory:

./semstore-eval.sh ./path/to/indices/directory

To carry out the meta-evaluation for several index directories, the shell script ./semstore-meta-eval.sh must be adapted accordingly.