Chapter 4. Advanced repository configuration

Table of Contents

4.1. Basic setup
4.1.1. The repository id and title
4.1.2. The Sail stack
4.2. Native Sail Indexing
4.3. Custom inferencing
4.3.1. XML syntax
4.3.2. Example
4.3.3. Configuration
4.3.4. Notes and Hints
4.4. Change Tracking

When setting up a repository in Sesame, you can make a number of choices: should the repository support versioning or security, or should it be as fast as possible? What database will it use, or will it be in-memory?

In this chapter, we look at several of these configuration options in more detail.

4.1. Basic setup

The setup for each Sesame repository is configured using Configure Sesame!. As we have already seen in Server administration, this configuration tool allows tweaking of numerous parameters, which we will discuss in more detail here.

4.1.1. The repository id and title

In the repository tab (Figure 3.6, “The "Repository" tab”), the repository id and title are declared. The id is how the repository will be known by Sesame: all client access will need to use this identifier.

The title is for human convenience and can be used to give a short description of the repository's purpose. Clients such as the web interface use it to represent the repository to the end user.

4.1.2. The Sail stack

The most important part of the repository configuration is the sail stack, which can be found in the "repository details" screen (Figure 3.7, “The "Repository details" window”). Here, you configure where the actual repository storage resides, whether features such as inferencing, security and versioning should be used, and what additional options are needed.

The sail stack is represented top-to-bottom. In the example, we see two sail declarations: org.openrdf.sesame.sailimpl.sync.SyncRdfSchemaRepository and org.openrdf.sesame.sailimpl.rdbms.RdfSchemaRepository. The first sail is stacked on top of the second one (which means that it operates by calling methods on the Sail underneath it). The second sail is the base sail: it is the lowest of the stack and does not operate on another sail, but directly on the actual data source. In this example, the base sail is an RDF Schema-aware driver for a relational database that supports (currently) MySQL (3.23.47 and higher), PostgreSQL (7.0.2 and higher) and Oracle 9i.

The SyncRdfSchemaRepository is optional, but we strongly recommend using it. This Sail handles concurrent access issues; without it, Sesame would behave unpredictably when several users access the repository simultaneously.
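The stacking idea can be illustrated with a minimal Python sketch. It is not Sesame code: the class names MemoryStore and SyncLayer are hypothetical stand-ins for a base sail and a synchronizing sail stacked on top of it, and the lock-per-call strategy is an assumption made for illustration only.

```python
import threading


class MemoryStore:
    """Hypothetical base sail: operates directly on the data (here, a set)."""

    def __init__(self):
        self._statements = set()

    def add(self, s, p, o):
        self._statements.add((s, p, o))

    def size(self):
        return len(self._statements)


class SyncLayer:
    """Hypothetical stacked sail: serializes access, then calls the sail below."""

    def __init__(self, below):
        self._below = below
        self._lock = threading.Lock()

    def add(self, s, p, o):
        with self._lock:
            self._below.add(s, p, o)

    def size(self):
        with self._lock:
            return self._below.size()


def demo():
    # Top-to-bottom: SyncLayer is stacked on MemoryStore, the base of the stack.
    stack = SyncLayer(MemoryStore())

    def worker(n):
        for i in range(100):
            stack.add(f"s{n}", "p", f"o{i}")

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stack.size()
```

Clients only ever talk to the top of the stack; each layer adds one concern and delegates the rest downward.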

Other base sails to choose from include:

  • org.openrdf.sesame.sailimpl.rdbms.RdfRepository: a non-inferencing driver for relational database storage.
  • org.openrdf.sesame.sailimpl.omm.versioning.VersioningRdbmsSail: an inferencing driver for relational database storage that supports change tracking.
  • org.openrdf.sesame.sailimpl.memory.RdfRepository: a non-inferencing driver for storage in main memory.
  • org.openrdf.sesame.sailimpl.memory.RdfSchemaRepository: an inferencing driver for storage in main memory that supports RDF and RDF Schema entailment.
  • org.openrdf.sesame.sailimpl.nativerdf.NativeRdfRepository: a non-inferencing driver for storage directly on disk.

All base sails that work on relational databases need a number of parameters to function:

  • jdbcDriver identifies the JDBC (Java Data Base Connectivity) driver that is to be used to access the database. In the example, com.mysql.jdbc.Driver, the standard MySQL JDBC driver, is used.
  • jdbcUrl identifies the location of the database through a URL. The precise syntax of this URL is DBMS-dependent. An example URL for a MySQL database would be jdbc:mysql://localhost:3306/testdb. This specifies a database named testdb on a MySQL server running on localhost, which uses port 3306 for communication. The last part of the URL identifies the name of the database (in this case testdb). Note that this is the name of the database as it is known to the DBMS, and that it is not related to the Sesame repository id (though it might be convenient to assign them identical names).
  • user identifies a username with which Sesame can access the database. This must therefore be a user which is known to the DBMS, and which has been granted access rights (see also Server administration).
  • password identifies a password with which Sesame can access the database. This must therefore be a password that matches the username configured in the user parameter.
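To make the parts of the example jdbcUrl explicit, the following Python sketch splits it into its components. The function name describe_jdbc_url is hypothetical, and the sketch only handles URLs of the jdbc:dbms://host:port/database shape shown above; since the syntax is DBMS-dependent, other drivers may use a different layout.

```python
from urllib.parse import urlsplit


def describe_jdbc_url(jdbc_url):
    """Split a jdbc:<dbms>://<host>:<port>/<database> URL into its parts."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)
    # The remainder looks like an ordinary URL, e.g. mysql://localhost:3306/testdb
    parts = urlsplit(jdbc_url[len(prefix):])
    return {
        "dbms": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # The database name as known to the DBMS, not the Sesame repository id.
        "database": parts.path.lstrip("/"),
    }
```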

The RDBMS-based sails also take some optional parameters:

  • dependency-inferencing indicates whether dependency-based truth maintenance should be used (possible values are 'yes' and 'no'; the default is 'yes'). Dependency-based truth maintenance speeds up removal operations, but slows down uploads.
  • commitInterval indicates a number of triples to be added before the sail does an in-between commit during upload of large datasets. The default is '1000'. This figure can be tweaked to improve upload performance.
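The effect of commitInterval can be pictured with a small Python sketch that counts the commits for an upload of a given size. The function count_commits is hypothetical, and the final commit for a partially filled batch is an assumption about how such an upload would be completed, not a description of the sail's internals.

```python
def count_commits(triples, commit_interval=1000):
    """Count commits when committing after every commit_interval added triples."""
    commits = 0
    pending = 0
    for _ in triples:
        pending += 1
        if pending == commit_interval:
            commits += 1  # in-between commit during the upload
            pending = 0
    if pending:
        commits += 1  # assumed final commit for the remaining triples
    return commits
```

With the default of 1000, an upload of 2500 triples is committed in three steps in this model; a larger interval means fewer, larger commits.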

The memory-based sails take four optional parameters:

  • file specifies a file in which the in-memory repository stores its contents on local disk. This file is automatically saved and reloaded on (re)start of the server.
  • dataFormat specifies the format of the data in the file. Legal values are 'rdfxml' (the default), 'ntriples' and 'turtle'.
  • compressFile specifies whether the file used for storage should be compressed with gzip.
  • syncDelay specifies the time (in milliseconds) to wait after a transaction was committed before writing the changed data to file. Setting this variable to '0' (the default value) will force a file sync immediately after each commit. A negative value will deactivate file synchronization until the Sail is shut down. A positive value will postpone the synchronization for at least that number of milliseconds. If in the meantime a new transaction is started, the file synchronization will be rescheduled to wait for another syncDelay ms. This way, bursts of transaction events can be combined in one file sync, improving performance.
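The three syncDelay cases can be simulated with a short Python sketch. The function file_sync_times is hypothetical and simplifies matters: each transaction is reduced to a single commit timestamp, timestamps are assumed to be sorted in ascending order, and the sync performed at shutdown is left out.

```python
def file_sync_times(commit_times_ms, sync_delay_ms):
    """Return the times (ms) at which a file sync would happen in this model."""
    if sync_delay_ms < 0:
        return []  # no sync until the Sail is shut down
    if sync_delay_ms == 0:
        return list(commit_times_ms)  # sync immediately after each commit
    syncs = []
    scheduled = None
    for t in commit_times_ms:
        if scheduled is not None and t >= scheduled:
            syncs.append(scheduled)  # the pending sync ran before this commit
        # A new commit (re)schedules the sync for another sync_delay_ms.
        scheduled = t + sync_delay_ms
    if scheduled is not None:
        syncs.append(scheduled)
    return syncs
```

For commits at 0, 100, 200 and 5000 ms with a syncDelay of 1000, the first three commits collapse into one sync at 1200 ms, followed by a second sync at 6000 ms.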

The native sail has one required parameter:

  • dir specifies the directory that can be used by the native sail to store its files.

The native sail also has an optional triple-indexes parameter, with which one can specify the indexing strategy the native sail should take. We will explain this in more detail in the next section.

4.2. Native Sail Indexing

The native store uses B-Trees for indexing statements, where the index key consists of three fields: subject (s), predicate (p) and object (o). The order in which each of these fields is used in the key determines the usability of an index on a specific triple query pattern: searching triples with a specific subject in an index that has the subject as the first field is significantly faster than searching these same triples in an index where the subject field is second or third. In the worst case, the 'wrong' triple pattern will result in a sequential scan over the entire set of triples.

By default, the native store only uses a single index, with a subject-predicate-object key pattern. However, it is possible to define different indexes for the native store, using the triple-indexes parameter. This can be used to optimize performance for query patterns that occur frequently.

The subject, predicate and object fields are represented by the characters 's', 'p' and 'o', respectively. Indexes can be specified by creating 3-letter words from these three characters. Multiple indexes can be specified by separating these words with commas, spaces and/or tabs. For example, the string "spo, pos" specifies two indexes: a subject-predicate-object index and a predicate-object-subject index.
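A value for the triple-indexes parameter can be checked with a few lines of Python. The function parse_triple_indexes is a hypothetical helper, not part of Sesame, and the requirement that each word uses every one of 's', 'p' and 'o' exactly once is our assumption about what a valid 3-letter word is.

```python
import re


def parse_triple_indexes(spec):
    """Split an index specification such as "spo, pos" into its index words."""
    # Words may be separated by commas, spaces and/or tabs.
    words = [w for w in re.split(r"[, \t]+", spec) if w]
    for word in words:
        # Assumption: a valid word is a permutation of the letters s, p and o.
        if sorted(word) != ["o", "p", "s"]:
            raise ValueError("invalid index specification: " + repr(word))
    return words
```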

Creating multiple indexes can speed up querying, but there is a cost to take into account as well: adding and removing data becomes more expensive, because each index has to be updated. Each index also takes up additional disk space.
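When deciding which indexes are worth their cost, it helps to estimate how well an index fits a query pattern. The Python sketch below scores an index by the number of leading key fields that the pattern binds. This is one plausible heuristic, stated as an assumption; it is not a description of how the native store itself selects an index, and the function names are hypothetical.

```python
def usable_prefix(index, bound):
    """Count the leading key fields of an index that the query pattern binds."""
    count = 0
    for field in index:
        if field not in bound:
            break
        count += 1
    return count


def best_index(indexes, bound):
    """Pick the index with the longest usable key prefix (first one wins ties)."""
    return max(indexes, key=lambda index: usable_prefix(index, bound))
```

For a pattern that fixes the predicate and object, such as "all resources of a given type", usable_prefix("spo", {"p", "o"}) is 0 (a full scan), while a "pos" index offers a prefix of length 2.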

The native store automatically creates and drops indexes upon (re)initialization. The parameter can therefore be adjusted at any time: upon the first refresh of the configuration, the native store changes its indexing strategy without loss of data.

4.3. Custom inferencing

The basic set of RDFS inference rules (as defined in the RDF(S) MT semantics) can be insufficient for building custom applications. For example, some applications need to define their own transitive, symmetric or inverse properties. An infrastructure for defining such custom inference rules helps developers tune the Sesame inferencer to better suit their application.

Since Sesame release 0.95, we provide an alternative inferencer that works with the org.openrdf.sesame.sailimpl.rdbms.RdfSchemaRepository Sail. This custom inferencer can be initialized with a set of axiomatic triples and inference rules defined in an external file. The format of these definitions is simple and is explained in greater detail in the next section.

The customizable inferencer also supports inter-rule dependencies: you can state explicitly which rules are triggered when a rule infers a new statement. This information is given in an additional 'triggers_rule' tag nested within the 'rule' tag. It consists of several 'rule' tags, each with a name attribute specifying an affected rule.

4.3.1. XML syntax

The definition file is in XML and should conform to the following DTD:

<!DOCTYPE InferenceRules [
  <!ENTITY rdf 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
  <!ENTITY rdfs 'http://www.w3.org/2000/01/rdf-schema#'>
  <!ENTITY daml 'http://www.daml.org/2001/03/daml+oil#'>

  <!ELEMENT InferenceRules (axiom | rule)*>

  <!ELEMENT axiom (subject, predicate, object)>

  <!ELEMENT rule (premise+, consequent, triggers_rule?)?>
  <!ATTLIST rule
            name CDATA #REQUIRED>

  <!ELEMENT premise (subject, predicate, object)>
  <!ELEMENT consequent (subject, predicate, object)>
  <!ELEMENT triggers_rule (rule)*>

  <!ELEMENT subject EMPTY>
  <!ATTLIST subject
            var     CDATA      #IMPLIED
            uri     CDATA      #IMPLIED
            pattern CDATA      #IMPLIED
            escape  CDATA      #IMPLIED
            type    (resource) #IMPLIED>

  <!ELEMENT predicate EMPTY>
  <!ATTLIST predicate
            var     CDATA      #IMPLIED
            uri     CDATA      #IMPLIED
            pattern CDATA      #IMPLIED
            escape  CDATA      #IMPLIED
            type    (resource) #IMPLIED>

  <!ELEMENT object EMPTY>
  <!ATTLIST object
            var     CDATA      #IMPLIED
            uri     CDATA      #IMPLIED
            pattern CDATA      #IMPLIED
            escape  CDATA      #IMPLIED
            type    (resource) #IMPLIED>
]>

If a 'uri' attribute is present within the 'subject', 'predicate' or 'object' tags, its value is assumed to be the name of a resource.

If a 'var' attribute is present within one of these tags, its value is the name of a variable. This attribute cannot be used within an 'axiom' tag.

For example, here are two of the axiomatic triples, as they are defined in the RDF(S) MT semantics. They appear in the configuration file like this:

<axiom>
    <subject   uri="&rdfs;subPropertyOf"/>
    <predicate uri="&rdfs;domain"/>
    <object    uri="&rdf;Property"/>
</axiom>
<axiom>
    <subject   uri="&rdfs;subPropertyOf"/>
    <predicate uri="&rdfs;range"/>
    <object    uri="&rdf;Property"/>
</axiom>

An example of an inference rule (one stating that if a resource is used as a predicate, then it is of 'type' 'Property') looks like this:

<rule name="rdfs1">
    <premise>
        <subject   var="xxx"/>
        <predicate var="aaa"/>
        <object    var="yyy"/>
    </premise>

    <consequent>
        <subject   var="aaa"/>
        <predicate uri="&rdf;type"/>
        <object    uri="&rdf;Property"/>
    </consequent>

    <triggers_rule>
        <rule name="rdfs2" />
        <rule name="rdfs3" />
        <rule name="rdfs4a" />
        <rule name="rdfs5b" />
        <rule name="rdfs6" />
        <rule name="rdfs9" />
    </triggers_rule>
</rule>

In the above example 'xxx', 'aaa' and 'yyy' are variables and 'rdf:type' and 'rdf:Property' are exact resource URIs.

A 'pattern' attribute in conjunction with an 'escape' attribute is used to define a pattern for matching resource names. Both can appear only in a triple component denoting a variable, i.e. one with the 'var' attribute specified. Use '?' to denote any single character and '*' to match any sequence of one or more characters.

Use the character declared in the 'escape' attribute to escape '?' or '*' characters within the pattern. You need to specify the 'pattern' and 'escape' attributes for a given variable only once per rule (in the example that follows, they appear only once for the variable 'id').

An example of a rule using pattern matching:

<rule name="rdfsXI">
    <premise>
        <subject   var="xxx"/>
        <predicate var="id" pattern="&rdf;_*" escape="\"/>
        <object    var="yyy"/>
    </premise>

    <consequent>
        <subject   var="id"/>
        <predicate uri="&rdf;type"/>
        <object    uri="&rdfs;ContainerMembershipProperty"/>
    </consequent>

    <triggers_rule>
        <rule name="rdfs2" />
        <rule name="rdfs3" />
        <rule name="rdfs6" />
        <rule name="rdfs9" />
        <rule name="rdfs10" />
    </triggers_rule>
</rule>
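The matching semantics of 'pattern' and 'escape' described above can be approximated by translating a pattern into a regular expression. The Python sketch below is our own illustration, not the inferencer's implementation; the function names are hypothetical, and the RDF constant stands for the expansion of the &rdf; entity.

```python
import re

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def pattern_to_regex(pattern, escape=None):
    """Translate a rule pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape and ch == escape and i + 1 < len(pattern):
            # An escaped character is always taken literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            parts.append(".")   # any single character
        elif ch == "*":
            parts.append(".+")  # any sequence of one or more characters
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, value, escape=None):
    """Check whether the whole resource name matches the pattern."""
    return pattern_to_regex(pattern, escape).fullmatch(value) is not None
```

Under this reading, the pattern used in rule rdfsXI matches rdf:_1 and rdf:_27, but not rdf:type and not the bare prefix rdf:_ (because '*' requires at least one character).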

Note that a triple template is matched both through the values bound to the variables used in it and through the fixed resources specified as its subject, predicate or object.

4.3.2. Example

Consider a property with the URI http://somewhere.org#partOf. In our example domain, we wish to ensure that this resource is always inserted in the repository, so we add the axiomatic triple stating that it is a property:

<axiom>
    <subject   uri="http://somewhere.org#partOf"/> 
    <predicate uri="&rdf;type"/> 
    <object    uri="&rdf;Property"/>
</axiom>

We also wish to define that the property is transitive. To this end, we add a single inference rule:

<rule name="userPartOf">
    <premise>
        <subject   var="xxx"/>
        <predicate uri="http://somewhere.org#partOf"/>
        <object    var="yyy"/>
    </premise>
    <premise>
        <subject   var="yyy"/>
        <predicate uri="http://somewhere.org#partOf"/>
        <object    var="zzz"/>
    </premise>

    <consequent>
        <subject   var="xxx"/>
        <predicate uri="http://somewhere.org#partOf"/>
        <object    var="zzz"/>
    </consequent>

    <triggers_rule>
        <rule name="rdfs2" />
        <rule name="rdfs3" />
        <rule name="rdfs6" />
        <rule name="userPartOf" />
    </triggers_rule>
</rule>

Suppose the repository contains these two triples: T1 = (Finger.1, partOf, Hand.Left) and T2 = (Hand.Left, partOf, Human.1). Because the same 'yyy' variable is used in both 'premise' tags, the rule fires only if T1.object = T2.subject, which holds here. A triple corresponding to the 'consequent' tag is then added to the repository, using the current variable bindings. It has the form TInfer = (T1.subject, partOf, T2.object), i.e. TInfer = (Finger.1, partOf, Human.1).
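The effect of the userPartOf rule can be reproduced with a small forward-chaining sketch in Python. It is an illustration of the rule's logic under simplifying assumptions (triples as plain tuples, a single hard-coded rule shape), not the inferencer's actual algorithm; the function name infer_transitive is hypothetical.

```python
def infer_transitive(triples, predicate):
    """Return the triples inferred by a transitivity rule for one predicate."""
    known = set(triples)
    changed = True
    while changed:  # repeat until no new triple is inferred
        changed = False
        links = [(s, o) for (s, p, o) in known if p == predicate]
        for (x, y) in links:          # first premise:  (xxx, partOf, yyy)
            for (y2, z) in links:     # second premise: (yyy, partOf, zzz)
                if y == y2 and (x, predicate, z) not in known:
                    known.add((x, predicate, z))  # consequent
                    changed = True
    return known - set(triples)
```

Repeating the loop until nothing changes plays the role of listing userPartOf in its own 'triggers_rule' tag: newly inferred partOf triples can serve as premises again.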

4.3.3. Configuration

The inferencer used by a repository based on the org.openrdf.sesame.sailimpl.rdbms.RdfSchemaRepository sail is defined by a parameter passed to it during initialization. To start using the custom inferencer on a repository, add the following extra parameters to the configuration of that repository:

  • use-inferencer specifies the full classname of the inferencer. To use the custom inferencer, use the value org.openrdf.sesame.sailimpl.rdbms.CustomInferenceServices.
  • rule-file specifies the location of the XML file in which the inference rules for the custom inferencer are specified. Make sure that you specify the full path name.

4.3.4. Notes and Hints

An example rules file, containing the axioms and entailment rules as specified by the 23 January 2003 Working Draft of the RDF Model Theory, can be found in the Sesame source tree, specifically in src/org/openrdf/sesame/sailimpl/rdbms/entailment-rdf-mt-20030123.xml. This file is used by default by the custom inferencer if the rule-file parameter is not specified.

Changes to the rules file do not lead to automatic reapplication of the rules over the existing data in the repository. Clear the repository first to avoid inconsistency problems.

The dependency information used by the truth maintenance system (TMS) is also affected by the rules. The default inferencer uses a dependency database table that can handle cases where up to two triples lead to the inference of a new one. Since the configuration file can contain inference rules with an arbitrary number of 'premise' tags, the structure of the default dependency table cannot handle them. To avoid loss of data, the structure of that table is not altered, and the table is created only if it does not yet exist. This check is performed during the repository initialization phase. It is therefore better to apply new or modified inference rules to a completely clean data store (database).
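Before switching an existing repository to a new rules file, it can be useful to check which rules go beyond the two premises that the default dependency table supports. The Python sketch below is a hypothetical helper, not part of Sesame; it assumes the rules document can be parsed by Python's standard ElementTree parser and that its root element is 'InferenceRules', as in the DTD above.

```python
import xml.etree.ElementTree as ET


def rules_with_many_premises(rules_xml, limit=2):
    """Return the names of top-level rules with more than `limit` premises."""
    root = ET.fromstring(rules_xml)
    names = []
    # findall("rule") only visits direct children of the root, so the
    # rule references nested inside 'triggers_rule' tags are skipped.
    for rule in root.findall("rule"):
        if len(rule.findall("premise")) > limit:
            names.append(rule.get("name"))
    return names
```

Any rule reported by this check is a reason to start from a clean database rather than reuse one whose dependency table was created for the default inferencer.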

4.4. Change Tracking

[This section not yet available. See the documentation at http://www.ontotext.com/omm/ for details.]