Resource Considerations

This section helps you estimate the memory SemSpect requires and the indexing speed you can expect, based on empirical values.

The following is a statistical evaluation of the indexing speed and memory requirements of RDF SemSpect. We considered 26 datasets with up to 352M triples. The benchmark was run with different JVM memory allocations in order to roughly estimate how much memory is needed to support a desired number of triples. It also compares our two indexing methods, the “one pass” and “two pass” approaches: the index (compressed dictionary and triples) is generated either in a single parsing iteration (“one pass”: faster, higher memory consumption) or in two separate parsing iterations (“two pass”: slower, lower memory consumption).
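
To make the difference between the two approaches concrete, here is a minimal Python sketch of dictionary-encoding triples in one versus two parsing iterations. It is a toy illustration and not SemSpect's actual implementation: the function names, the `parse` callable and the ID assignment are all assumptions, and it models neither compression nor the memory behaviour measured in this benchmark. It only shows how often the input has to be parsed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Encoded = Tuple[int, ...]


def index_one_pass(parse: Callable[[], Iterable[Triple]]) -> Tuple[Dict[str, int], List[Encoded]]:
    """Parse once: assign term IDs and encode triples in the same iteration."""
    ids: Dict[str, int] = {}
    encoded: List[Encoded] = []
    for triple in parse():
        # setdefault assigns the next free ID the first time a term is seen.
        encoded.append(tuple(ids.setdefault(term, len(ids)) for term in triple))
    return ids, encoded


def index_two_pass(parse: Callable[[], Iterable[Triple]]) -> Tuple[Dict[str, int], List[Encoded]]:
    """Parse twice: first build the dictionary, then encode the triples."""
    terms = set()
    for triple in parse():  # pass 1: collect terms only
        terms.update(triple)
    ids = {term: i for i, term in enumerate(sorted(terms))}
    encoded = [tuple(ids[term] for term in triple) for triple in parse()]  # pass 2
    return ids, encoded


def decode(ids: Dict[str, int], encoded: List[Encoded]) -> List[Tuple[str, ...]]:
    """Map encoded triples back to their terms (used to check the round trip)."""
    terms_by_id = {i: term for term, i in ids.items()}
    return [tuple(terms_by_id[i] for i in triple) for triple in encoded]
```

Both functions produce an index that decodes to the same triples. The two-pass variant calls `parse` twice, which is the reason it is slower on the same input.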

The experiments used SemSpect v16.0.1 and ran on an Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz (6 cores, 2 threads per core) with 64 GB DDR3 @ 1334MHz.

Memory Requirements #

The following table should be read as a lookup table: given a certain amount of memory allocated to the JVM (JVM Memory (GB)), how many triples can you expect to index (Max. Num. Triples) with RDF SemSpect? Please treat any comparison of the memory consumption of “one pass” against “two pass” for a specific memory setting with caution, as the results often refer to a different number of datasets. The table also lists the initial memory consumption when loading an existing index into SemSpect for exploration. Since SemSpect uses caching for performance reasons, the memory consumption during exploration will increase over time up to the given allocation limit.

| Approach | JVM Memory (GB) | Num. Datasets | Max. Num. Triples | Max. Num. Classes | Max. Num. Instances | Max. Memory Consumption, Indexing (MB) | Max. Memory Consumption, Exploration, initially after a restart (MB) |
|---|---|---|---|---|---|---|---|
| One Pass | 1 | 16 | 5,344,375 | 809 | 776,845 | 787 | 185 |
| Two Pass | 1 | 16 | 5,344,414 | 809 | 776,845 | 741 | 184 |
| One Pass | 5 | 19 | 19,903,402 | 809 | 1,583,073 | 2,839 | 238 |
| Two Pass | 5 | 19 | 19,903,402 | 809 | 1,583,073 | 4,174 | 238 |
| One Pass | 10 | 22 | 72,820,690 | 809 | 5,401,200 | 9,749 | 1,485 |
| Two Pass | 10 | 23 | 152,528,023 | 809 | 19,837,857 | 10,011 | 2,749 |
| One Pass | 15 | 22 | 72,820,690 | 809 | 5,401,200 | 10,446 | 1,485 |
| Two Pass | 15 | 24 | 158,962,783 | 10,193 | 19,837,857 | 14,827 | 3,633 |
| One Pass | 20 | 23 | 152,528,023 | 809 | 19,837,857 | 19,852 | 2,846 |
| Two Pass | 20 | 25 | 257,831,425 | 10,193 | 50,567,464 | 19,421 | 5,833 |
| One Pass | 30 | 24 | 158,962,783 | 10,193 | 19,837,857 | 28,582 | 3,762 |
| Two Pass | 30 | 25 | 257,831,425 | 10,193 | 50,567,464 | 28,736 | 5,834 |
| One Pass | 40 | 24 | 158,962,783 | 10,193 | 19,837,857 | 36,022 | 4,043 |
| Two Pass | 40 | 26 | 352,417,591 | 10,193 | 50,567,464 | 40,689 | 17,059 |
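
The lookup described above can be sketched in a few lines of Python. The figures are copied from the table; the names `BENCHMARK` and `smallest_benchmarked_allocation` are hypothetical and not part of SemSpect. The result is only the smallest benchmarked JVM allocation at which a dataset of at least the given size was indexed, so it is a conservative hint from this benchmark, not a guarantee for your data.

```python
from typing import Optional

# (approach, JVM memory in GB, max. number of triples indexed in the benchmark)
BENCHMARK = [
    ("one pass", 1, 5_344_375), ("two pass", 1, 5_344_414),
    ("one pass", 5, 19_903_402), ("two pass", 5, 19_903_402),
    ("one pass", 10, 72_820_690), ("two pass", 10, 152_528_023),
    ("one pass", 15, 72_820_690), ("two pass", 15, 158_962_783),
    ("one pass", 20, 152_528_023), ("two pass", 20, 257_831_425),
    ("one pass", 30, 158_962_783), ("two pass", 30, 257_831_425),
    ("one pass", 40, 158_962_783), ("two pass", 40, 352_417_591),
]


def smallest_benchmarked_allocation(num_triples: int, approach: str) -> Optional[int]:
    """Return the smallest benchmarked JVM allocation (GB) whose maximum
    indexed triple count covers num_triples, or None if the benchmark
    contains no such row for the given approach."""
    candidates = [
        gb for name, gb, max_triples in BENCHMARK
        if name == approach and max_triples >= num_triples
    ]
    return min(candidates) if candidates else None


print(smallest_benchmarked_allocation(100_000_000, "two pass"))  # 10
print(smallest_benchmarked_allocation(100_000_000, "one pass"))  # 20
```

A result of `None` means only that the benchmark has no matching row, for example 300,000,000 triples with the “one pass” approach.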

Indexing Throughput #

The indexing throughput can be used to estimate how long the initial generation of an index takes for a given RDF dump file before it can be explored in SemSpect. Assuming a dataset of 10,000,000 triples and the “one pass” approach, the average required time is 10,000,000 / 50,500 ≈ 198 s ≈ 3.3 min. As the table shows, the maximum indexing throughput is much higher than the mean (by a factor of more than 3), because the speed for an individual dataset depends on its inherent characteristics, such as the depth of the class and property hierarchy or the number of object property assertions, in connection with the reasoning mode of SemSpect. Once an index has been created and saved on disk, loading it into memory for all subsequent calls of SemSpect takes only a fraction of the indexing time.

| Approach | Mean (triples/second) | Maximum (triples/second) |
|---|---|---|
| One pass | 50.5K | 182.0K |
| Two pass | 31.0K | 104.3K |
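
The time estimate described above is a simple division, sketched here in Python with the throughput values from the table. The names `THROUGHPUT` and `estimate_indexing_seconds` are hypothetical helpers, not part of SemSpect, and the result is a rough estimate because throughput varies considerably between datasets.

```python
# Triples per second, taken from the throughput table above.
THROUGHPUT = {
    "one pass": {"mean": 50_500, "max": 182_000},
    "two pass": {"mean": 31_000, "max": 104_300},
}


def estimate_indexing_seconds(num_triples: int, approach: str, stat: str = "mean") -> float:
    """Estimate the indexing time in seconds as triples / (triples per second)."""
    return num_triples / THROUGHPUT[approach][stat]


seconds = estimate_indexing_seconds(10_000_000, "one pass")
print(f"{seconds:.0f} s (about {seconds / 60:.1f} min)")  # 198 s (about 3.3 min)
```

Passing `stat="max"` gives the best case observed in the benchmark, which for the same 10,000,000 triples is about 55 seconds with the “one pass” approach.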