Comparison of Linked Data Triplestores: Developing the Methodology

It’s impossible to tell but almost certainly the latter.

This is of course equally true for query 7.

One interesting point to think about is how these stores may perform in a clustered environment.

As mentioned, AnzoGraph is the only OLAP database in this comparison, so in theory it should perform significantly better once clustered.

This is of course important when analysing big data.

Another problem with this comparison is the scalability of the data.

How these triplestores perform as they transition from a single node to a clustered environment is often important for large scale or high growth companies.

To tackle this, I will build a data generator alongside my query generators, allowing us to scale from 10 triples to billions.

Query 7:

This query (found here) finds all people born in Berlin before 1900.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?name ?birth ?death ?person
WHERE {
  ?person dbo:birthPlace :Berlin .
  ?person dbo:birthDate ?birth .
  ?person foaf:name ?name .
  ?person dbo:deathDate ?death .
  FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name

This is a simple extract and filter query that is extremely common.

With a simple query like this across 245 million triples, the maximum time difference is just over 100ms.

I learned a great deal from the feedback following my last comparison but this experiment has really opened my eyes to how difficult it is to find the “best” solution.

Next Steps

I learned recently that benchmarks require significantly more than three warm-up runs.

In my benchmark I will run around 1,000.

Of course, this causes problems if my queries do not have random seeds, so it is clear from this article that I will have at least one random seed in each query template.

Many queries will have multiple random seeds to ensure query caching isn't storing optimisations that could slow down later queries.

For example, if one query gathers all football players in Peru and this is followed by a search for all la canne players in China, the cached optimisation from the first query could slow down the second.
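As a rough sketch of what such a template could look like (the %%SPORT%% and %%COUNTRY%% placeholders and the exact DBpedia properties are illustrative assumptions, not final templates):

# Hypothetical query template; the generator substitutes random seeds
# for %%SPORT%% and %%COUNTRY%% before every run.
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX :     <http://dbpedia.org/resource/>

SELECT ?person ?name
WHERE {
  ?person a dbo:Athlete .
  ?person dbo:sport :%%SPORT%% .      # e.g. :Association_football, :La_canne
  ?person dbo:birthPlace ?place .
  ?place dbo:country :%%COUNTRY%% .   # e.g. :Peru, :China
  ?person foaf:name ?name .
}

With a fresh pair of seeds on each execution, a store cannot simply replay a cached plan or result that happens to suit one particular query.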

I really want to test the scalability of each solution so alongside my query generator I will create a data generator (this allows clustering evaluation).

Knowledge graphs are rarely static so in my benchmark I will have insert, delete and construct queries.
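For reference, a minimal sketch of those three forms (the resource names are made up for the example, and each snippet would be sent as its own request with its own PREFIX declarations):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX :    <http://dbpedia.org/resource/>

# Insert: add new triples (SPARQL Update)
INSERT DATA { :Example_Person dbo:birthPlace :Berlin . }

# Delete: remove matching triples (SPARQL Update)
DELETE WHERE { :Example_Person dbo:birthPlace ?place . }

# Construct: build a new graph from matched patterns (read-only query)
CONSTRUCT { ?person dbo:birthPlace :Berlin . }
WHERE     { ?person dbo:birthPlace :Berlin . }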

I will use full text search where possible instead of regex.
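To make that concrete, here is a sketch of the two styles; the full-text predicate shown is Virtuoso's bif:contains, and other stores expose their own full-text syntax:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Regex filter: portable SPARQL, but typically scans every literal
SELECT ?person ?name
WHERE {
  ?person foaf:name ?name .
  FILTER regex(?name, "Berlin", "i")
}

# Full-text search: Virtuoso syntax shown as one example; it uses the
# store's text index rather than scanning literals
SELECT ?person ?name
WHERE {
  ?person foaf:name ?name .
  ?name bif:contains "Berlin" .
}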

I will not use order-less limits as these are not used in production.

My queries will be realistic.

If the generated data were real, the queries would return useful insight into it.

This ensures that I am not testing something that has been left unoptimised for good reason.

I will work with vendors to fully optimise each system.

Systems are optimised for different structures of data by default, which affects the results and therefore needs to change.

Full optimisation, for the data and queries I create, by system experts ensures a fair comparison.

Conclusion

Fairly benchmarking RDF systems is more convoluted than it initially seems.

Following my next steps with a similar methodology, I believe a fair benchmark will be developed.

The next challenge is evaluation metrics… I will turn to literature and use-case experience for this but suggestions would be very welcome!

AnzoGraph is the fastest if you sum the times (even if you switch regex for full-text index times where possible).

Stardog is the fastest if you sum all query times (including 5a and 5b) but ignore loading time.

Virtuoso is the fastest if you ignore loading time and switch regex for full-text index times where possible…

If this was a fair experiment, which of these results would be the “best”? It of course depends on the use case, so I will have to come up with a few use cases to assess the results of my future benchmark for multiple purposes.

All feedback and suggestions are welcome; I'll get to work on my generators.

Appendix

Below I have listed each triplestore (in alphabetical order) alongside which version, query method and load method I used:

AnzoGraph
Version: r201901292057.beta
Queried with:
azgi -silent -timer -csv -f /my/query.rq
Loaded with:
azgi -silent -f -timer /my/load.rq

Blazegraph
Version: 2.1.5
Queried with: Rest API
Loaded with: Using the dataloader Rest API by sending a dataloader.txt file.

GraphDB
Version: GraphDB-free 8.8.1
Queried with: Rest API
Loaded with:
loadrdf -f -i repoName -m parallel /path/to/data/directory
It is important to note that with GraphDB I switched to a Parallel garbage collector while loading, which will be the default in the next release.

Stardog
Version: 5.3.5
Queried with:
stardog query myDB query.rq
Loaded with:
stardog-admin db create -n repoName /path/to/my/data/*.ttl.gz

Virtuoso
Version: VOS 7.2.4.2
Queried within isql-v:
SPARQL PREFIX … rest of query … ;
Loaded within isql-v:
ld_dir ('directory', '*.*', 'http://dbpedia.org') ;
Then I ran a load script that ran three loaders in parallel.
It is important to note with Virtuoso that I used:
BufferSize = 1360000
DirtyBufferSize = 1000000
This was a recommended switch in the default virtuoso.ini file.
