Diversicon allows importing and merging LMF XMLs produced by different people,
preventing the clashes that may arise. Diversicon should be able to read the XML files created with UBY 0.7.0, provided you add some bookkeeping information to the files to indicate to which namespace they belong.
The LexicalResource name must be worldwide unique, so you should pick a
reasonably long and unique prefix for your organization. In the case of the Diversicon example resources, we allowed ourselves the luxury of picking a short name like
div. So, for example, the declaration of the
smartphones resource begins like this:
<LexicalResource name="div-smartphones" . . .
XML allows declaring namespaces only for tags and attributes (so you can write things like
<my-pfx:my-tag my-pfx:my-attribute="bla bla">), but we abuse them to also give a scope to tag IDs:
<tag id="my-pfx_bla">. Namespaced IDs are necessary because in UBY IDs are global, and conflicts might occur when merging multiple sources into the db.
There are a few things to keep in mind:
- an underscore _ separates the prefix (e.g. my-pfx) from the name (e.g. bla), like in my-pfx_bla
- prefixes may contain a version number, like in wn31, but we don't require version numbers in them
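The ID convention can be illustrated with a few lines of Python. This is only a sketch, not code taken from Diversicon: the function name `split_id` is hypothetical, and it assumes that a prefix never contains an underscore, so the first underscore in an ID is the separator.

```python
def split_id(namespaced_id):
    """Split a namespaced ID such as 'my-pfx_bla' into (prefix, name).

    Assumption: the prefix contains no underscore, so the FIRST
    underscore separates the prefix from the name.
    """
    prefix, sep, name = namespaced_id.partition("_")
    if not sep or not prefix or not name:
        raise ValueError("not a namespaced id: %r" % namespaced_id)
    return prefix, name


print(split_id("my-pfx_bla"))        # ('my-pfx', 'bla')
print(split_id("wn31_ss_n3086983"))  # ('wn31', 'ss_n3086983')
```

Splitting on the first underscore keeps names such as `ss_n3086983` intact even though they contain underscores themselves.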
In Diversicon LMF you can declare namespaces in the
LexicalResource tag:
<?xml version="1.0" encoding="UTF-8"?>
<LexicalResource name="div-smartphones"
    prefix="sm"
    xmlns:sm="https://github.com/diversicon-kb/diversicon-model/blob/master/src/main/resources/smartphones.xml"
    xmlns:wn31="https://github.com/diversicon-kb/diversicon-wordnet-3.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://diversicon-kb.eu/schema/1.0/diversicon.xsd">
The value of the prefix attribute is intended to be the prefix of the document. Such a prefix must also be declared in the xmlns section of the document, like:
<LexicalResource name="div-smartphones"
    prefix="sm"
    xmlns:sm="https://github.com/diversicon-kb/diversicon-model/blob/master/src/main/resources/smartphones.xml"
    xmlns:wn31="https://github.com/diversicon-kb/diversicon-wordnet-3.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://diversicon-kb.eu/schema/1.0/diversicon.xsd">
    ... ...
Note that all tag ids in the document must begin with the declared document prefix followed by an underscore, like
sm_ss_tablet in the following example. All referenced external ids must begin with a declared prefix, like wn31_ss_n3086983:
<Synset id="sm_ss_tablet">
    <SynsetRelation target="wn31_ss_n3086983" relType="taxonomic" relName="hypernym"/>
</Synset>
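The two rules on ids can be checked mechanically. The following is a minimal sketch using only the Python standard library; it is not Diversicon's own validator. The function name `check_prefixes` is hypothetical, and it assumes that only the `id` and `target` attributes carry namespaced ids.

```python
import io
import xml.etree.ElementTree as ET


def check_prefixes(xml_bytes):
    """Return a list of problems found (empty list if none)."""
    declared = set()    # prefixes declared through xmlns:... attributes
    doc_prefix = None   # value of the 'prefix' attribute on the root tag
    root_seen = False
    problems = []

    stream = io.BytesIO(xml_bytes)
    # 'start-ns' events for a tag are reported before its 'start' event.
    for event, payload in ET.iterparse(stream, events=("start-ns", "start")):
        if event == "start-ns":
            declared.add(payload[0])   # payload is a (prefix, uri) pair
            continue
        elem = payload
        if not root_seen:
            root_seen = True
            doc_prefix = elem.get("prefix")
            if doc_prefix not in declared:
                problems.append("document prefix %r is not declared" % doc_prefix)
        tag_id = elem.get("id")
        if tag_id is not None and not tag_id.startswith("%s_" % doc_prefix):
            problems.append("id %r lacks the document prefix" % tag_id)
        target = elem.get("target")
        if target is not None and target.partition("_")[0] not in declared:
            problems.append("target %r uses an undeclared prefix" % target)
    return problems


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<LexicalResource name="div-smartphones" prefix="sm"
    xmlns:sm="https://github.com/diversicon-kb/diversicon-model/blob/master/src/main/resources/smartphones.xml"
    xmlns:wn31="https://github.com/diversicon-kb/diversicon-wordnet-3.1">
  <Synset id="sm_ss_tablet">
    <SynsetRelation target="wn31_ss_n3086983" relType="taxonomic" relName="hypernym"/>
  </Synset>
</LexicalResource>
"""

print(check_prefixes(SAMPLE.encode("utf-8")))  # []
```

`iterparse` is used instead of `fromstring` because ElementTree does not expose `xmlns:` declarations as ordinary attributes; the `start-ns` events are the way to collect them.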
TODO: each Synset must be associated with at least one Sense and its respective LexicalEntry
Schema is provided as DTD and XSD at these addresses:
Canonical relations are privileged over the inverses they might have, because Diversicon algorithms only consider canonical relations, and not their inverses.
For example, since hypernymy is considered canonical, the transitive closure graph is computed only for hypernyms, not hyponyms. To avoid missing information, after an import Diversicon makes sure canonical relations are materialized in the db from their inverses, using provenance to mark them.
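To make the idea concrete, here is a small Python sketch of materializing canonical edges from their inverses and then computing a transitive closure over the canonical relation only. It is an illustration under stated assumptions, not Diversicon's implementation: edges are modelled as `(source, relName, target, provenance)` tuples, the function names and the `INVERSE_TO_CANONICAL` table are hypothetical, and the provenance value `"div"` for computed edges is an assumption.

```python
# Hypothetical table: inverse relation name -> canonical relation name.
INVERSE_TO_CANONICAL = {"hyponym": "hypernym"}


def materialize_canonical(edges, provenance="div"):
    """Add the canonical edge for every inverse edge that lacks one.

    edges: set of (source, rel_name, target, provenance) tuples.
    New edges are tagged with `provenance` (assumed value).
    """
    existing = {(s, r, t) for (s, r, t, _p) in edges}
    result = set(edges)
    for s, r, t, _p in edges:
        canonical = INVERSE_TO_CANONICAL.get(r)
        # The canonical edge runs in the opposite direction of the inverse.
        if canonical and (t, canonical, s) not in existing:
            result.add((t, canonical, s, provenance))
    return result


def transitive_closure(edges, rel_name="hypernym"):
    """Return all (source, target) pairs reachable through rel_name edges."""
    direct = {}
    for s, r, t, _p in edges:
        if r == rel_name:
            direct.setdefault(s, set()).add(t)
    closure = set()
    for start in direct:
        stack = list(direct[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue   # also protects against cycles
            seen.add(node)
            closure.add((start, node))
            stack.extend(direct.get(node, ()))
    return closure


edges = {
    ("device", "hyponym", "computer", "input"),   # only the inverse is given
    ("tablet", "hypernym", "computer", "input"),
}
normalized = materialize_canonical(edges)
print(sorted(transitive_closure(normalized)))
```

Without the materialization step, the closure would miss the link from `computer` to `device`, because it was stated only as a hyponym edge.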
Wordnet 3.1 ships with information about domains, and the UBY converter recognizes and converts such domains. Still, we needed to work a bit on the domain representation. These were our desiderata:
First we describe domains as implemented in Wordnet, then how they are converted in UBY, and finally we introduce how they are modelled in Diversicon.
In UBY it seems you can't directly state that a synset is a domain (see issue about UBY Wordnet converter). You can tell that a
Synset is a domain if
a) it has at least one associated
Sense that is in turn linked to a
SemanticLabel of type
b) OR other synsets point to it with one relation among
usage (pointer key
region (pointer key
topic (pointer key
Note also that LMF converters use the generic word
We made the following modifications:
1) introduced new
superDomain relations, plus the respective inverses
subDomain. Usage example:
(note computation of transitive closure won't consider leaves).
2) Established a new root domain synset,
div_ss_n_domain, in the DivUpper LexicalResource, and made sure existing domains point to it. If a lexical resource has topics expressed only via the UBY b) method (like Wordnet 3.1), then during the import normalization substep edges pointing to the root, marked with
div provenance, will be automatically added.
3) When importing, Diversicon will:
domain. If in the input graph there are inverses but no canonical edges, add the canonical
4) When exporting:
As usual, Diversicon will try to export a graph very similar to the input one, and avoid exporting computed domain edges.
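The root-domain step in point 2) can be sketched as follows. This is a simplified illustration, not the actual normalization code: edges are modelled as `(source, relName, target, provenance)` tuples, the function name is hypothetical, and using `superDomain` as the relation that links a domain to the root is an assumption made for the example.

```python
ROOT_DOMAIN = "div_ss_n_domain"
# Relations through which other synsets point to a domain (UBY b) method).
DOMAIN_POINTERS = {"usage", "region", "topic"}
# Assumption: domains are linked to the root through this relation.
ROOT_RELATION = "superDomain"


def add_root_domain_edges(edges, provenance="div"):
    """Link every domain synset to the root domain, if not linked yet.

    edges: set of (source, rel_name, target, provenance) tuples.
    Added edges carry the given provenance so they can be recognized later.
    """
    # A synset counts as a domain when others point to it with a domain pointer.
    domains = {t for (_s, rel, t, _p) in edges if rel in DOMAIN_POINTERS}
    already = {s for (s, rel, t, _p) in edges
               if rel == ROOT_RELATION and t == ROOT_DOMAIN}
    added = {(d, ROOT_RELATION, ROOT_DOMAIN, provenance)
             for d in domains - already if d != ROOT_DOMAIN}
    return set(edges) | added


edges = {("wn31_ss_goal", "topic", "wn31_ss_sport", "wn31")}
for edge in sorted(add_root_domain_edges(edges)):
    print(edge)
```

Because the added edges carry their own provenance, an exporter can filter them out and write back a graph close to the input one.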
When an LMF is normalized, these conditions are met:
topic edge there must also be a