Configuration

Configuration of the RDF Server #

Main Configuration File #

The main SemSpect configuration file is optional. It is a YAML file (default name: semspect_config.yaml) with the following structure:

---
version: 2                         # format of the configuration file

# optional license configuration
license:
  file: "path to semspect license" # default semspect.lic
  session-keep-alive-minutes: 15   # default 15
  warn-before-session-expiration-minutes: 3 # default 3

By default, the smart scripts use the provided configuration file ./smart-config/semspect_config.yaml.

License File Location #

To set the file name and location of your SemSpect license file (default is ./semspect.lic in the same folder as the SemSpect server configuration YAML file) use:

license ➞ file:[path to semspect license]

Note: Relative path declarations are resolved relative to the directory of the SemSpect server configuration YAML file.
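As a rough illustration of this rule, the following Python sketch resolves a license path against the directory of the configuration file. The helper name resolve_license_path and the example paths are hypothetical; the sketch only mirrors the rule stated in the note and is not SemSpect's actual implementation.

```python
import posixpath


def resolve_license_path(config_file, license_file="semspect.lic"):
    """Resolve license_file relative to the directory containing config_file.

    Absolute license paths are returned unchanged (apart from normalization).
    """
    config_dir = posixpath.dirname(config_file)
    # posixpath.join discards config_dir when license_file is absolute
    return posixpath.normpath(posixpath.join(config_dir, license_file))


# Default license name, a relative path, and an absolute path (all hypothetical)
print(resolve_license_path("/opt/semspect/semspect_config.yaml"))
print(resolve_license_path("/opt/semspect/semspect_config.yaml", "licenses/team.lic"))
print(resolve_license_path("/opt/semspect/semspect_config.yaml", "/etc/semspect/team.lic"))
```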

Session Expiry #

The number of concurrent SemSpect users is encoded within the license file. To set the time in minutes after which the session of an inactive user expires (default: 15), use:

license ➞ session-keep-alive-minutes:[integer]

Moreover, to help prevent the session from expiring and possibly being given to another user when all available sessions are in use, a warning message before expiration can be enabled as follows:

license ➞ warn-before-session-expiration-minutes:[integer]
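To make the interplay of the two settings concrete, here is a small Python sketch of the resulting timeline. It assumes that the warning appears warn-before-session-expiration-minutes before the session would expire, which is how the parameter name reads; the function name session_timeline is hypothetical.

```python
def session_timeline(keep_alive_minutes=15, warn_before_minutes=3):
    """Return (minutes of inactivity until the warning, minutes until expiry).

    Assumption: the warning is shown warn_before_minutes before expiry.
    """
    warning_after = keep_alive_minutes - warn_before_minutes
    return warning_after, keep_alive_minutes


# With the defaults, the warning would appear after 12 idle minutes
# and the session would expire after 15 idle minutes.
print(session_timeline())
print(session_timeline(keep_alive_minutes=30, warn_before_minutes=5))
```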

SemSpect Databases #

SemSpect can serve multiple databases at a time. A database in this context means an existing (or to be generated) SemSpect index on disc from a set of RDF sources that is loaded into main memory by SemSpect. A database can either be managed or static.

  • Static databases are defined at startup of SemSpect, either by a set of RDF sources or by an already existing SemSpect index directory. These databases are static in the sense that they can no longer be changed in any way. However, they allow a wide range of configuration options, which are described below in this document.

  • Managed databases can be created, changed or deleted at any time while SemSpect is running. There is a REST API to interact with SemSpect in managed mode.

A fine-grained configuration of SemSpect is available through parameters and the special semspect-server.sh (or semspect-server.bat) script. In general, there are two ways to specify parameters for SemSpect:

  • Using Spring application parameters: For example

    ./semspect-server.sh --server.port=8080 --semspect.rdf.databases[0].mode=load [...]

  • Using an external configuration file (.yaml or .properties):

    ./semspect-server.sh --spring.config.additional-location="path/to/config_file"
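Both ways address the same parameters: a nested key in a YAML file corresponds to a dotted path on the command line, and list entries are addressed by an index such as databases[0]. The following Python sketch illustrates that correspondence by flattening a nested structure into command-line arguments. The helper names flatten and to_cli_args are hypothetical, and the sketch is only meant to show the naming scheme, not how SemSpect or Spring parse the configuration.

```python
def flatten(node, prefix=""):
    """Yield (property_path, value) pairs for a nested dict/list structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(value, path)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


def to_cli_args(config):
    """Render a nested configuration as a list of --path=value arguments."""
    args = []
    for path, value in flatten(config):
        if isinstance(value, bool):
            value = str(value).lower()  # YAML/Spring spelling: true/false
        args.append(f"--{path}={value}")
    return args


# Nested form, as it would appear in a YAML file (values taken from this page)
config = {
    "server": {"port": 8080},
    "semspect.rdf": {
        "databases": [
            {
                "database": "default-db",
                "mode": "load",
                "indicesDirectory": "path/to/indices/directory",
            },
        ],
    },
}

for arg in to_cli_args(config):
    print(arg)
```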

The following sections describe all available configuration options for generating indices, loading indices, and starting SemSpect, using the second approach (a YAML configuration file).

Configuration of Managed Databases #

Any run of SemSpect using ./semspect-server.sh (even without any arguments) launches SemSpect in managed mode and generates the files of all managed databases in a new folder in the current JVM working directory. The storage folder of the managed databases can be specified as follows:

semspect.rdf:
  # managed.indicesDirectory
  #
  # storage folder for the SemStore indices that are managed over the REST API
  #
  # default: new folder in the JVM working directory
  managed:
    indicesDirectory: "path/where/indices/are/managed"

For more details on the REST API for managing databases, see the dedicated page on the managed mode in SemSpect.

Configuration of Static Databases #

Since SemSpect is able to manage multiple static databases in parallel, the basic structure of our SemSpect server YAML configuration for static databases looks as follows:

semspect.rdf:
  # databases
  #
  # Array of databases that are initialized on startup.
  #
  # default: empty array
  databases:
      # database (required param)
      # 
      # Name of the database - must be unique within configuration. 
      # Allowed characters: 
      # alphanumeric characters, '_', '.', '-', '~'
      # (regex: ^[a-zA-Z0-9_.~-]+$ )
      #
    - database: "database-1" 
      description: "Description of DB 1 example"
      
      # mode (required param)
      # 
      # Startup mode for the specified database
      # (described in more detail in the subsequent sections)
      # 
      # options: 
      # - generate: generate indices to the "indicesDirectory" based
      #   upon the configuration of the "indexing" parameter
      # - load: load indices of "indicesDirectory" and initialize 
      #   them based upon the configuration of the "exploration" parameter
      mode: generate
      indicesDirectory: "path/to/db1-indices"
      indexing:
        ... # indexing configuration - only considered if mode is set to "generate"
      exploration:
        ... # configuration for exploration - considered when indices are loaded into main memory
      
    - database: "database-2"
      mode: load
      description: "Description of DB 2 example"
      indicesDirectory: "path/to/db2-indices"
      indexing:
        ...
      exploration:
        ...
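Before starting the server, it can be handy to check the database names against the naming rule stated in the comment of the database parameter. The following Python sketch does this for a list of names. The pattern is written from the listed allowed characters (with the hyphen placed last in the character class so that it is read literally), and the function name check_database_names is hypothetical; the validation performed by SemSpect itself remains authoritative.

```python
import re

# Alphanumeric characters plus '_', '.', '~' and '-' (hyphen last, so it is literal)
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.~-]+$")


def check_database_names(names):
    """Return a list of human-readable problems; an empty list means all names pass."""
    problems = []
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"invalid characters in name: {name!r}")
        if name in seen:
            problems.append(f"duplicate name: {name!r}")
        seen.add(name)
    return problems


print(check_database_names(["database-1", "database-2"]))
print(check_database_names(["database-1", "my db", "database-1"]))
```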

Index Generation #

################################
# required generation parameters
################################

# "databases[0]" can be used to specify the first entry in the databases array
# (used here to highlight the full path of the nested YAML parameter)
semspect.rdf.databases[0].database: default-db   # the default
semspect.rdf.databases[0].mode: generate
  
################################
# optional generation parameters
################################

# rdfDataSources
#
# Supported data sources:
# - plain files: *.ttl, *.ttls, *.owl, *.nt, *.n3, *.rdf, *.hdt, *.jsonld,
#   *.brf, *.rj, *.trig, *.trigs,
# - compressed files: *.bz2, *.gz
# - archives: *.zip
# - directories
# - URLs: supports plain files or compressed files (no archives).
#   - Note: The HTTP content type must indicate the correct RDF format,
#     otherwise it is defaulted to RDF Turtle.
# - Graph store content URLs:
#   - GraphDB (tested with v10.5.1):
#       http://localhost:7200/repositories/some-store/statements
#   - RDFox (works with >= v6.3a):
#      http://localhost:12110/datastores/some-store/content?default
#
# default: empty array
semspect.rdf.databases[0].indexing.rdfDataSources:
  - DATA_SOURCE_1
  - ...
  - DATA_SOURCE_N

# indicesDirectory
#
# storage folder for the SemStore indices
#
# default is a new folder in JVM working directory
semspect.rdf.databases[0].indicesDirectory: "path/to/directory"

# parsingStrategy
#
# options:
# - ONE_PASS (default):
#   - generate base structures (triples, dictionary) in a single pass
#     over the provided RDF datasets
#   - consumes more main memory because uncompressed dictionary and triples
#     are loaded simultaneously into memory
# - TWO_PASS:
#   - generate base structures (triples, dictionary) in two iterations
#     over the provided RDF datasets
#   - dictionary is generated in first pass, triples in second
#   - consumes less main memory since dictionary is compressed during
#     generation on demand
semspect.rdf.databases[0].indexing.parsingStrategy: ONE_PASS

# numberOfThreads
#
# default: available processors of machine
semspect.rdf.databases[0].indexing.numberOfThreads: 4

# terminateAfterIndexing
#
# default: false
semspect.rdf.databases[0].indexing.terminateAfterIndexing: false

# validateParsedResources
#
# default: false
semspect.rdf.databases[0].indexing.validateParsedResources: false

# iriDictionarySectionType
#
# Which type of dictionary section should be used for all IRIs.
# An uncompressed type requires more space but might lead to a higher
# performance when applying string filters.
#
# options:
# - PLAIN_FRONT_CODING (default)
# - UNCOMPRESSED_STRINGS
semspect.rdf.databases[0].indexing.iriDictionarySectionType: PLAIN_FRONT_CODING

# stringLiteralDictionarySectionType
#
# Which type of dictionary section should be used for all string literals.
# An uncompressed type requires more space but might lead to a higher
# performance when applying string filters.
#
# options:
# PLAIN_FRONT_CODING (default)
# UNCOMPRESSED_STRINGS
semspect.rdf.databases[0].indexing.stringLiteralDictionarySectionType: PLAIN_FRONT_CODING

# disableRDFSDomainRangeEntailment
#
# When set to true all entailments implied by rdfs:domain as well as rdfs:range
# are ignored.
#
# default: false
semspect.rdf.databases[0].indexing.disableRDFSDomainRangeEntailment: false

#########################
# experimental features
#########################

# translateSKOSToRDFS
#
# This option introduces a class hierarchy to reflect the SKOS concept
# taxonomy. A concept class is generated for each SKOS concept that has a narrower concept,
# and each skos:broader, skos:narrower, and skos:exactMatch relation is translated to the
# corresponding rdfs:subClassOf axiom.
# Moreover, all root and leaf concepts of the derived hierarchy are assigned new classes
# "Root Concept" and "Leaf Concept" respectively.
# Example:
#   For ":super-c skos:narrower :sub-c", we derive ":sub-c rdfs:subClassOf :super-c".
#   For ":sub-c skos:broader :super-c", we derive ":sub-c a :super-c".
#
# default: false
semspect.rdf.databases[0].indexing.translateSKOSToRDFS: false

# indexReificationTriples
#
# If set to "true", reification triples that use the RDF or OWL vocabulary, such as "_:xxx rdf:subject :some-subject",
# are indexed and can then be explored. However, as rdf:Statement and owl:Axiom are disabled by default, it may be useful
# to additionally set the parameter "[...].exploration.showStatementClassInTree" (as specified in the load configuration).
# RDF Reification vocabulary: https://www.w3.org/TR/rdf11-mt/#reification
# OWL Annotation vocabulary: https://www.w3.org/TR/owl2-quick-reference/#Annotations
#
# default: false
semspect.rdf.databases[0].indexing.indexReificationTriples: false

Load Indices and Start SemSpect #

#############################
# required loading parameters
#############################

# "databases[0]" can be used to specify the first entry in the databases array
# (used here to highlight the full path of the nested YAML parameter)
semspect.rdf.databases[0].database: default-db
semspect.rdf.databases[0].mode: load
semspect.rdf.databases[0].indicesDirectory: "path/to/indices/directory"

Additional Initialization Parameters (All Optional) #

Server Settings #

##################################
# additional server settings
##################################

# port
#
# default: 8080
server.port: 8080

# x-frame support
# The x-frame functionality is a way of combining and organizing web-based
# documents together on a single webpage through the use of frames (HTML
# elements that allow you to display content from another source, such as
# another webpage or a video). By default, the x-frame functionality is
# disabled by Spring Boot, however, by using the following parameter, it is
# possible to define frame-ancestors to include SemSpect in other HTML pages
# (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Security-Policy/frame-ancestors) .
#
# default: disabled; below are some examples
server.frame-ancestors:
  # local HTML files are allowed to integrate SemSpect in an x-frame
  - "file://*"
  # any subdomain of "example.org" is allowed to integrate SemSpect in an x-frame
  - "https://www.*.example.org"

# context path
#
# Determines at which context path the content will be served by the server.
# For instance, if it is set to "/dataset-x", SemSpect gets hosted on "localhost:8080/dataset-x".
# This option might be helpful to distinguish different running instances of
# SemSpect not merely by their port.
#
# default: /
server.servlet.context-path: /

# basic auth
#
# Enables Basic HTTP Authentication for a single user. 
# 
# Generate the password_hash via the Spring Boot CLI's encodepassword command.
# For installation and usage, see: https://docs.spring.io/spring-boot/cli
# E.g: ./spring encodepassword my_password
#
# default: disabled
#spring.security:
#    user:
#      name: 'username'
#      password: 'password_hash' 

Exploration Settings #

##################################
# additional exploration settings
##################################

# explorationMenuWithoutCount
#
# If enabled, the exploration menu does not show the distinct number of connected
# resources. It only shows whether at least one resource of a particular class is
# connected to the given group (increases performance).
#
# default: false
semspect.rdf.databases[0].exploration.explorationMenuWithoutCount: false

# numberOfThreads
#
# default: number of processors of machine
semspect.rdf.databases[0].exploration.numberOfThreads: 4

# cacheGroups
#
# default: true
semspect.rdf.databases[0].exploration.cacheGroups: true

# showClassesAndPropertiesAsResources
#
# If set to true, rdf:Property and rdfs:Class are shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showClassesAndPropertiesAsResources: false

# showTopClassInTree
#
# If set to true, the top class rdfs:Resource is shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showTopClassInTree: false

# showStatementClassInTree
#
# If set to true, the classes rdf:Statement and owl:Axiom are shown in the class tree.
#
# default: false
semspect.rdf.databases[0].exploration.showStatementClassInTree: false

# logMemoryUsage
#
# Logs the used main memory to the file
# INDICES_DIRECTORY/exploration/log/memoryConsumption.csv
#
# default: false
semspect.rdf.databases[0].exploration.logMemoryUsage: false

# explorationMenuComputationMethod
#
# options:
# - ROARING_BITMAPS_PER_CLASS
# - SORTED_ITERATION_PER_CLASS
# - INDIVIDUAL_QUERIES_PER_CLASS
# - INDIVIDUAL_QUERIES
# - ROARING_BITMAPS
# - DYNAMICALLY_DETERMINED_PER_CLASS
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.explorationMenuComputationMethod: DYNAMICALLY_DETERMINED

# explorationMenuWithoutCountComputationMethod
#
# options:
# - PER_CLASS
# - GLOBAL
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.explorationMenuWithoutCountComputationMethod: DYNAMICALLY_DETERMINED

# predecessorCountComputationMethod
#
# options:
# - SORTED_ITERATION
# - HASH_SET
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.predecessorCountComputationMethod: DYNAMICALLY_DETERMINED

# filterComputationMethod
#
# options:
# - QUERY_PER_INDIVIDUAL
# - INDEX_ITERATION
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.filterComputationMethod: DYNAMICALLY_DETERMINED

# sortingMethod
#
# options:
# - INDEX_ITERATION
# - SORTING_ON_THE_FLY
# - DYNAMICALLY_DETERMINED (default)
semspect.rdf.databases[0].exploration.sortingMethod: DYNAMICALLY_DETERMINED

IRI Prefix Configuration #

In order to shorten resource IRIs in the UI, SemStore collects all prefixes that have been defined in the provided RDF datasets. Furthermore, a list of commonly deployed RDF prefixes is added by default. Note that the explicitly given prefixes of the RDF datasets have a higher priority than the defaults. To examine and modify the IRI-to-prefix map, inspect the file INDICES-DIRECTORY/exploration/config/iriToPrefixMap.yaml. The changes will be applied after the next startup of SemSpect.

Script Configuration #

All scripts can be configured through environment variables:

  • Path of the java command: SEMSPECT_JAVA_PATH
    (default: "java")
  • SemSpect specific JDK options: SEMSPECT_JDK_OPTIONS
    (default: <empty>, overrides JDK_JAVA_OPTIONS and JAVA_TOOL_OPTIONS)
  • Installation directory: SEMSPECT_HOME
    (default: <script location>)

Moreover, settings for the configuration and output paths are available for the smart scripts (semspect-smart.sh & semspect-smart.bat):

  • Location of the semspect configuration: SEMSPECT_CONFIG_PATH
    (default: <SEMSPECT_HOME>/smart-config/semspect_config.yaml)
  • Location of the semstore configuration: SEMSTORE_CONFIG_PATH
    (default: <SEMSPECT_HOME>/smart-config/semstore_config.yaml)
  • Location of the indices: SEMSTORE_INDICES_DIR
    (default: <SEMSPECT_HOME>/smart-indices/)
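The defaults listed above can be summarized in a short Python sketch that computes the effective settings for a given environment. The function name resolve_script_settings and the example directories are hypothetical, and the sketch only restates the documented defaults; it is not how the scripts themselves are implemented.

```python
import posixpath


def resolve_script_settings(env, script_dir):
    """Return the effective script settings for an environment mapping.

    env        -- mapping of environment variable names to values
    script_dir -- location of the script, used as the default SEMSPECT_HOME
    """
    home = env.get("SEMSPECT_HOME", script_dir)
    defaults = {
        "SEMSPECT_JAVA_PATH": "java",
        "SEMSPECT_JDK_OPTIONS": "",
        "SEMSPECT_HOME": home,
        "SEMSPECT_CONFIG_PATH": posixpath.join(home, "smart-config", "semspect_config.yaml"),
        "SEMSTORE_CONFIG_PATH": posixpath.join(home, "smart-config", "semstore_config.yaml"),
        "SEMSTORE_INDICES_DIR": posixpath.join(home, "smart-indices") + "/",
    }
    # An explicitly set variable wins over the documented default
    return {name: env.get(name, default) for name, default in defaults.items()}


settings = resolve_script_settings({"SEMSPECT_HOME": "/opt/semspect"}, "/tmp/scripts")
for name, value in settings.items():
    print(f"{name}={value}")
```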

Dossier Configuration #

For each dataset, we can define the rendering type of data property values such that URLs become links or image URLs show the respective image etc. This is done in a separate dossier configuration.

As an example, consider the following data:

@prefix :     <http://www.example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Company
  a rdfs:Class .

:derivo
  a      :Company ;
  :name  "Derivo" ;
  :image "https://www.derivo.de/fileadmin/externe_websites/ext.derivo/images/layout/logo_deutsch_01.png" ;
  :email "info@derivo.de" ;
  :url   "https://www.derivo.de/home/" .

Now, to specify the UI types of the respective properties (called attributes in the configuration), a YAML dossier configuration is required:

version: 1
databases:
  - database: default-db
    attributes:
      - id: 'http://www.example.org/image'
        type: 'IMAGE' # images or GIFs
      - id: 'http://www.example.org/url'
        type: 'URL'
      - id: 'http://www.example.org/email'
        type: 'EMAIL'

Depending on the startup script, SemSpect reads the dossier configuration by default from the following paths:

  • semspect-smart.sh: SEMSPECT_HOME/smart-config/datasets/SOME_DATASET/dossier.yaml
  • semspect-server.sh: $(pwd)/server-config/dossier.yaml

Alternatively, a different file can be specified using the following JVM parameter (not available as a Spring parameter):

-Dde.derivo.semspect.server.configuration.dossier.path

Example:

#!/bin/bash
export SEMSPECT_JDK_OPTIONS=-Dde.derivo.semspect.server.configuration.dossier.path="path/to/dossier-config.yaml"
./semspect-smart.sh run test-data.ttl


Facet Configuration (Experimental) #

In RDF SemSpect, facets are configurable filters for groups, shown in separate sections of the dossier on the right-hand side. They behave in the same way as the class filters and can be collapsed or expanded in the case of a hierarchy. In fact, facets display classes from the class schema. With the help of a facet configuration, one can define which parts of the class schema, and in which combination, are shown as a separate facet. This is useful to distinguish between the backbone class schema and classes that provide a “second perspective” on the data. As an example, consider a dataset that deals with clothes, their sales status and usage type.

If our main focus is on the type of clothing (shirt, trousers, etc.), it might be useful to declare sales and seasonal information as facets. This has the advantage that the class tree and exploration menu become shorter (facets are not shown in the class tree and exploration menu). However, the facet classes are still available for filtering on those groups that contain individuals of those classes.

Please note that all specified classes of a facet no longer appear in the class hierarchy shown on the left, unless they also occur in a branch that is not declared as a facet.

Facets are defined in a facet configuration file. Each facet definition consists of one or more so-called facet values and a facet name. There are two types of facets:

  • simple (default): only the listed classes are used as facet values, not their subclasses (unless they are also listed). This setting is mostly relevant for Neo4j SemSpect where the label hierarchies are extrapolated from the data and might not be semantically meaningful (this does not happen in RDF SemSpect).
    Remark: unlisted subclasses of a simple facet class will be moved up one level in the class tree and histograms (or hidden if they appear in a neighbour branch).

  • subtree: the listed classes and all their subclasses are used as facet values.
    Remark: subclasses of a subtree facet class that appear in other non-facet branches of the class hierarchy can still be found there.

In the example above, the following facet definition was given:

version: 1
databases:
  - database: default-db
    facets:
      - name: 'Product type'
        values: [ 'http://www.semspect.de/test#Special_offer',
                  'http://www.semspect.de/test#Remaining_item' ]
        type: simple
      - name: 'Seasonal classification'
        values: [ 'http://www.semspect.de/test#Seasonal_type' ]
        type: subtree

The matching RDF data of this example:

@prefix : <http://www.semspect.de/test#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Textiles rdf:type rdfs:Class .
:Trousers rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Pullover rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Shirts rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Dresses rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .
:Suits rdf:type rdfs:Class ; rdfs:subClassOf :Textiles .

:Special_offer rdf:type rdfs:Class .
:Remaining_item rdf:type rdfs:Class .

:Seasonal_type rdf:type rdfs:Class .
:Summer rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Winter rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Transitional rdf:type rdfs:Class ; rdfs:subClassOf :Seasonal_type .
:Spring rdf:type rdfs:Class ; rdfs:subClassOf :Transitional .
:Autumn rdf:type rdfs:Class ; rdfs:subClassOf :Transitional .

:trouser1 rdf:type :Trousers ; rdf:type :Winter .
:trouser2 rdf:type :Trousers ; rdf:type :Winter ; rdf:type :Special_offer .
:trouser3 rdf:type :Trousers ; rdf:type :Autumn .
:trouser4 rdf:type :Trousers ; rdf:type :Spring .
:trouser5 rdf:type :Trousers ; rdf:type :Winter .
:short1 rdf:type :Trousers ; rdf:type :Summer ; rdf:type :Special_offer .
:short2 rdf:type :Trousers ; rdf:type :Summer .

:pullover1 rdf:type :Pullover ; rdf:type :Autumn .
:pullover2 rdf:type :Pullover ; rdf:type :Autumn .
:pullover3 rdf:type :Pullover ; rdf:type :Winter ; rdf:type :Remaining_item .
:pullover4 rdf:type :Pullover ; rdf:type :Winter .

:shirt1 rdf:type :Shirts ; rdf:type :Autumn ; rdf:type :Special_offer .
:shirt2 rdf:type :Shirts ; rdf:type :Winter .
:shirt3 rdf:type :Shirts ; rdf:type :Spring .
:tshirt1 rdf:type :Shirts ; rdf:type :Summer .
:tshirt2 rdf:type :Shirts ; rdf:type :Summer .

:dress1 rdf:type :Dresses ; rdf:type :Autumn .
:dress2 rdf:type :Dresses ; rdf:type :Winter .
:dress3 rdf:type :Dresses ; rdf:type :Autumn ; rdf:type :Remaining_item .

:suit1 rdf:type :Suits ; rdf:type :Autumn .
:suit2 rdf:type :Suits ; rdf:type :Winter ; rdf:type :Special_offer .
:suit3 rdf:type :Suits ; rdf:type :Summer .
:suit4 rdf:type :Suits ; rdf:type :Summer .

:matches rdf:type rdf:Property .

:trouser1 :matches :pullover1 .
:trouser1 :matches :shirt1 .
:trouser2 :matches :pullover1 .
:trouser2 :matches :pullover2 .
:trouser3 :matches :pullover3 .
:trouser4 :matches :pullover4 .
:trouser4 :matches :shirt3 .
:trouser4 :matches :shirt2 .
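
The difference between the two facet types can be sketched on the class hierarchy of this example. The following Python snippet computes which classes end up as facet values for a simple and for a subtree facet. The function name facet_values is hypothetical, class IRIs are abbreviated to their local names, and the hierarchy is stored as one superclass per class, which is sufficient for this dataset but a simplification in general.

```python
# rdfs:subClassOf relations of the seasonal classes in the example data
SUBCLASS_OF = {
    "Summer": "Seasonal_type",
    "Winter": "Seasonal_type",
    "Transitional": "Seasonal_type",
    "Spring": "Transitional",
    "Autumn": "Transitional",
}


def facet_values(listed, facet_type, subclass_of):
    """Return the set of classes used as facet values.

    simple  -- only the listed classes
    subtree -- the listed classes and all their (transitive) subclasses
    """
    values = set(listed)
    if facet_type == "subtree":
        changed = True
        while changed:  # repeat until no further subclass is added
            changed = False
            for child, parent in subclass_of.items():
                if parent in values and child not in values:
                    values.add(child)
                    changed = True
    return values


print(sorted(facet_values(["Special_offer", "Remaining_item"], "simple", SUBCLASS_OF)))
print(sorted(facet_values(["Seasonal_type"], "subtree", SUBCLASS_OF)))
```

Declaring Seasonal_type as a simple facet instead would yield only Seasonal_type itself as a facet value.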

Depending on the startup script, SemSpect reads the facet definitions from the following paths (if they exist):

  • semspect-smart.sh: SEMSPECT_HOME/smart-config/datasets/SOME_DATASET/facets.yaml
  • semspect-server.sh: $(pwd)/server-config/facets.yaml

Alternatively, a different file can be specified by a JVM parameter:

-Dde.derivo.semspect.server.configuration.facets.path=/path/to/semspect_facets.yaml

Category Configuration (Experimental) #

Remark: Since this configuration is shared by the Neo4j and RDF versions of SemSpect, it uses the generic terms categories and id. In the RDF case this should be understood as classes and IRI.

Captions and descriptions provide essential textual representations of resources within an exploration.

Caption and description

These representations are either the value of a user-selected data property of the resource or the IRI of the resource. Captions are used as primary text elements, appearing as titles or flag labels when specific resources are highlighted. Descriptions are secondary text elements that are shown as supplementary information under headers or in tooltips. The caption and description for each class can be set via the context menu within the class tree.

Categories Config

A configuration file can be added to define the default properties for captions and descriptions of some classes, which can be overridden by the user via the context menu within the class tree.

As an example, consider the following dataset:

@prefix :     <http://www.example.org/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:City rdfs:subClassOf :Place.
:Country rdfs:subClassOf :Place.
:Coordinate rdfs:subClassOf :Place.

:germany a :Country;
  rdfs:label "Germany";
  :countryName "DE".
:coordinate1 a :Coordinate;
  :coordinate "48.398537, 9.992537" ;
  :name "Ulmer Münster" .
:ulm a :City;
  rdfs:label "Ulm";
  :postalCode "89073".

We use Java regular expressions to define which data property should be used as the default caption/description.

databases:
  - database: default-db
    categories:
      - id: 'semspect::top' # special category 'TOP'. Fallback if no other configuration matches.
        description: 
          regex:
            - "\\Qsemspect::object_id\\E" # semspect::object_id is a special reference for the IRI of a resource
      - id: 'http://www.example.org/Place'
        caption:
          regex:
            - "(?i).*name" #  use some property ending on name
            - "\\Qhttp://www.w3.org/2000/01/rdf-schema#label\\E" # if no such property exists use rdfs:label 
      - id: 'http://www.example.org/City'
        caption: 
          regex:
            - '.*postalCode' # ending with postalCode 

We use a “nearest matching super-class” strategy to select which caption/description to use.

We first select a matching class:

  • If a regular expression was defined for the class itself, and it has at least one property matching one of the regular expressions, the class itself is a match.
  • If no regular expression was defined for the class itself or none of its properties matched, we collect the parent classes and repeat the process with these parent classes level by level, until at least one match is found among the ancestors.
    • If multiple matches occur in the same iteration, we select an ancestor class in a deterministic but unspecified manner.
  • If no match is found, we default to the definition of ‘semspect::top’.
    • If ‘semspect::top’ was not defined, rdfs:label is used as a default.

We then select a matching property:

  • The first matching regular expression is selected.
  • If multiple properties match the selected regular expression, we select one of them in a deterministic but unspecified manner.

In the example above:

  • The default caption of :City is :postalCode since the regular expression .*postalCode matches.
  • The default caption of :Country is :countryName since the nearest superclass is :Place and the regex (?i).*name matches.
  • The default caption of :Coordinate is :name since the nearest superclass is :Place and the regex (?i).*name matches.
  • The default descriptions of all classes are the IRIs because it’s defined as default in semspect::top.
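
The selection strategy described above can be sketched in Python for the caption case. This is a minimal illustration under several assumptions, not SemSpect's implementation: the function names are hypothetical, a regular expression has to match the whole property IRI, the properties tested are those used by the resources of the class in question, Java's \Q...\E quoting is replaced by re.escape, and ties are broken by alphabetical order (the documentation only says the choice is deterministic).

```python
import re

EX = "http://www.example.org/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Class hierarchy and caption regexes of the example configuration
PARENTS = {
    EX + "City": [EX + "Place"],
    EX + "Country": [EX + "Place"],
    EX + "Coordinate": [EX + "Place"],
}
CAPTION_REGEXES = {
    EX + "Place": [r"(?i).*name", re.escape(RDFS_LABEL)],
    EX + "City": [r".*postalCode"],
}
# Data properties used by the resources of each class in the example dataset
PROPERTIES = {
    EX + "Country": [RDFS_LABEL, EX + "countryName"],
    EX + "Coordinate": [EX + "coordinate", EX + "name"],
    EX + "City": [RDFS_LABEL, EX + "postalCode"],
}


def first_match(candidate, properties, regexes):
    """Return a property matching the first applicable regex of candidate, or None."""
    for pattern in regexes.get(candidate, []):
        matching = sorted(p for p in properties if re.fullmatch(pattern, p))
        if matching:
            return matching[0]
    return None


def default_caption(cls, properties, parents, regexes):
    """Walk up the class hierarchy level by level until some class yields a match."""
    level, visited = [cls], set()
    while level:
        for candidate in sorted(level):
            prop = first_match(candidate, properties, regexes)
            if prop is not None:
                return prop
        visited.update(level)
        level = sorted({p for c in level for p in parents.get(c, [])} - visited)
    # Fallbacks: the special category semspect::top, then rdfs:label
    prop = first_match("semspect::top", properties, regexes)
    return prop if prop is not None else RDFS_LABEL


for name in ("City", "Country", "Coordinate"):
    cls = EX + name
    print(name, "->", default_caption(cls, PROPERTIES[cls], PARENTS, CAPTION_REGEXES))
```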

Remark: most of the use cases should be easily configurable with regex lists ordered by preference for the caption and description for semspect::top (or for a handful of top classes indicating provenance of the data in heterogeneous datasets).

Depending on the startup script, SemSpect reads the category definitions from the following paths (if they exist):

  • semspect-smart.sh: SEMSPECT_HOME/smart-config/datasets/SOME_DATASET/categories.yaml
  • semspect-server.sh: $(pwd)/server-config/categories.yaml

Alternatively, a different file can be specified by a JVM parameter:

-Dde.derivo.semspect.server.configuration.category.path=/path/to/semspect_categories.yaml
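
The dossier, facet, and category configuration files each have their own JVM parameter. The following Python sketch assembles a value for SEMSPECT_JDK_OPTIONS that sets several of them at once. It assumes that the scripts accept multiple space-separated JVM options in this variable, which this page does not state explicitly; the function name jdk_options is hypothetical, and paths containing spaces are not handled.

```python
# JVM parameter names as given on this page
JVM_PARAMETERS = {
    "dossier": "de.derivo.semspect.server.configuration.dossier.path",
    "facets": "de.derivo.semspect.server.configuration.facets.path",
    "categories": "de.derivo.semspect.server.configuration.category.path",
}


def jdk_options(**paths):
    """Build a space-separated list of -D options for the given configuration files."""
    options = []
    for kind, path in paths.items():
        options.append(f"-D{JVM_PARAMETERS[kind]}={path}")
    return " ".join(options)


# Value that could be exported as SEMSPECT_JDK_OPTIONS (paths are placeholders)
print(jdk_options(
    dossier="path/to/dossier-config.yaml",
    facets="/path/to/semspect_facets.yaml",
))
```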

SemStore Statistics #

SemStore collects statistics while generating all indices as well as during the exploration of a corresponding dataset. These statistics are stored in a subdirectory of the specified indices folder and can be visualized with our SemStore statistics Python application that is available on DockerHub. To execute the respective application with Docker, take a look at the scripts located in the tools/semstore-statisics/ folder of the SemSpect installation directory.

To generate plots for a single indices directory:

./semstore-eval.sh ./path/to/indices/directory

To carry out the meta-evaluation for several index directories, the shell script ./semstore-meta-eval.sh must be adapted accordingly.