Remove xsd:duration datatype from the mappings #145

Open
Lars-H opened this issue Oct 11, 2023 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@Lars-H
Lars-H commented Oct 11, 2023

Describe the bug
Thanks for providing these insightful resources. I have been using them lately and I have encountered some minor issues.
I tried to follow your description from your journal paper to materialize the KG as RDF. I have seen a couple of problems.

  • Materializing the virtual KG as RDF using rdfizer leads to non-absolute IRIs in the RDF.
  • The remaining data seems to be valid RDF. However, the datatype of the values for the properties arrivalTime and departureTime is specified as xsd:duration while the values are not valid durations (under D-entailment).
  • The constructed data seems to be quite redundant. At scale 100, there are more than 5 million different ShapePoints with the exact same latitude and longitude. (Also, there are only 960 distinct values for latitude and 1000 distinct values for longitude.)

To Reproduce

  1. Generate the datasets using the provided docker tool and scale = 100
  2. Import the data in the sql directory into a MySQL DB using the provided script
  3. Materialize the data using rdfizer and the mapping file provided in the kgc-eval repo. (See rdfizer config below)
  4. Convert ntriples to turtle using rapper

Expected behavior
The materialized RDF should be valid.

Screenshots or Video
Example of a non-absolute IRI:

<http://transport.linkeddata.es/madrid/metro/feed/0000000000000000002s> <http://xmlns.com/foaf/0.1/page> <0000000000000000002s>.
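A check like this can be sketched in a few lines of Python. This is a minimal sketch, not part of rdfizer or the benchmark: the helper name `relative_iris` is hypothetical, and the term extraction naively assumes that no `<` or `>` characters occur inside string literals.

```python
import re

# An absolute IRI starts with a scheme such as "http:" (letter first).
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")
# Naive extraction of <...> terms from one N-Triples line (assumption:
# no "<" or ">" characters occur inside string literals).
IRI_TERM = re.compile(r"<([^<>]*)>")


def relative_iris(ntriples_line):
    """Return the IRIs on the line that have no scheme (hypothetical helper)."""
    return [iri for iri in IRI_TERM.findall(ntriples_line)
            if not SCHEME.match(iri)]


line = (
    "<http://transport.linkeddata.es/madrid/metro/feed/0000000000000000002s> "
    "<http://xmlns.com/foaf/0.1/page> <0000000000000000002s>."
)
print(relative_iris(line))  # ['0000000000000000002s']
```

Running it over every line of the materialized file would list all triples affected by the problem.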

Example of an invalid duration value:

<http://vocab.gtfs.org/terms#departureTime> "000000000000000000qe"^^<http://www.w3.org/2001/XMLSchema#duration>
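To see why this value is rejected: the lexical form of xsd:duration is PnYnMnDTnHnMnS, with at least one component present. The sketch below approximates that form with a regular expression; the name `is_valid_duration` is hypothetical, and the pattern is an illustration rather than a full XSD validator.

```python
import re

# Approximation of the xsd:duration lexical form PnYnMnDTnHnMnS.
# The lookaheads require at least one component after "P" and after "T".
DURATION = re.compile(
    r"-?P(?=[0-9]|T[0-9])([0-9]+Y)?([0-9]+M)?([0-9]+D)?"
    r"(T(?=[0-9])([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)?"
)


def is_valid_duration(lexical):
    """True if the string looks like an xsd:duration (hypothetical helper)."""
    return DURATION.fullmatch(lexical) is not None


for value in ["PT8H30M", "P1DT2H", "000000000000000000qe", "08:30:00"]:
    print(value, is_valid_duration(value))
```

Both the generated value above and a plain clock time such as 08:30:00 fail this check, so neither is a well-typed xsd:duration literal.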

Repeated ShapePoint geo-location. The following query yields ?cnt = 5852988.

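PREFIX gtfs: <http://vocab.gtfs.org/terms#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (COUNT(DISTINCT *) AS ?cnt)
WHERE {
  ?x a gtfs:ShapePoint ;
     geo:lat "999.999999999999999"^^xsd:double ;
     geo:long "999.999999999999999"^^xsd:double .
}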
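The same redundancy can also be checked offline once the points are exported. The following is a minimal sketch that assumes the ShapePoints are available as (IRI, latitude, longitude) tuples; that layout, the sample values, and the function name `duplicate_coordinates` are all made up for illustration.

```python
from collections import Counter


def duplicate_coordinates(points):
    """Map each (lat, long) pair shared by several points to its count.

    `points` is an iterable of (iri, lat, long) tuples; this layout is an
    assumption made for the sketch.
    """
    counts = Counter((lat, lon) for _iri, lat, lon in points)
    return {coord: n for coord, n in counts.items() if n > 1}


# Hypothetical sample data, not taken from the benchmark.
sample = [
    ("ex:p1", 40.4, -3.7),
    ("ex:p2", 40.4, -3.7),
    ("ex:p3", 40.5, -3.6),
]
print(duplicate_coordinates(sample))  # {(40.4, -3.7): 2}
```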

Resources (please complete the following information):

Additional material/context
rdfizer config:

[default]
main_directory: /data/gtfs/datasets

[datasets]
number_of_datasets: 1
output_folder: ${default:main_directory}/graph
all_in_one_file: yes
remove_duplicate: yes
enrichment: yes
name: gtfs-rdf-100
ordered: yes
dbType: mysql

[dataset1]
name: MySQLDataset
mapping: ${default:main_directory}/sql/gtfs-rdb-rml-noselfjoin.ttl
host: localhost
port: 3306
db: gtfssql
user: root
password: XXX
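For readers unfamiliar with the `${default:main_directory}` syntax: it matches the extended interpolation of Python's configparser. The sketch below shows how such references resolve, assuming the file is read that way. It uses a shortened copy of the config above and is not taken from the rdfizer code.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Shortened copy of the config above, for illustration only.
CONFIG_TEXT = """
[default]
main_directory: /data/gtfs/datasets

[datasets]
number_of_datasets: 1
output_folder: ${default:main_directory}/graph

[dataset1]
mapping: ${default:main_directory}/sql/gtfs-rdb-rml-noselfjoin.ttl
"""

config = ConfigParser(interpolation=ExtendedInterpolation())
config.read_string(CONFIG_TEXT)

# ${section:option} references are expanded when a value is read.
print(config["datasets"]["output_folder"])
print(config["dataset1"]["mapping"])
```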

Thanks for your support.

@dachafra dachafra self-assigned this Oct 11, 2023
@dachafra dachafra added the question Further information is requested label Oct 11, 2023
@dachafra
Member

Hi @Lars-H!
Happy to see that you are using the benchmark! :-) I'll answer all your questions in detail.

Materializing the virtual KG as RDF using rdfizer leads to non-absolute IRIs in the RDF.

You're right. The problem here is that the real data (the actual GTFS feed from Madrid Metro) provides the URL correctly, but the data generator does not support this kind of data. This should already be fixed by the VIG generator with our configuration. We will look into what is happening.

The remaining data seems to be valid RDF. However, the datatype of the values for the properties arrivalTime and departureTime is specified as xsd:duration while the values are not valid durations (under D-entailment).

Yes! I thought I had removed all the duration datatypes since, again, the VIG generator does not support them. I'll clean up and fix the mappings. Please use the ones from this official GitHub repo (not the ones from kgc-eval, which may not be up to date).
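Until the mappings are fixed, a local workaround could be to strip the datatype declaration from a copy of the mapping file. The sketch below is only an illustration under stated assumptions: it assumes the mappings declare the datatype as `rr:datatype xsd:duration` (or with the full IRI), the sample snippet is hypothetical, and a regex is no substitute for editing the mappings with a proper RDF tool.

```python
import re

# Matches an `rr:datatype xsd:duration` pair, written with the prefixed
# name or the full IRI, plus an optional trailing ";".
DURATION_DATATYPE = re.compile(
    r"rr:datatype\s+"
    r"(?:xsd:duration|<http://www\.w3\.org/2001/XMLSchema#duration>)"
    r"\s*;?[ \t]*"
)


def strip_duration_datatype(mapping_text):
    """Remove xsd:duration datatype declarations (hypothetical helper)."""
    return DURATION_DATATYPE.sub("", mapping_text)


# Hypothetical mapping snippet; the real mappings may be laid out differently.
sample = '''rr:objectMap [
    rr:column "arrival_time";
    rr:datatype xsd:duration
];'''
print(strip_duration_datatype(sample))
```

Without the datatype, the values are emitted as plain literals, so the materialized RDF no longer claims they are durations.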

The constructed data seems to be quite redundant. At scale 100, there are more than 5 million different ShapePoints with the exact same latitude and longitude. (Also, there are only 960 distinct values for latitude and 1000 distinct values for longitude.)

This is again a problem with the generator that we rely on. I'll try to take a look at their code to see if it can be solved (my suspicion is that their random generator is not very random).

In any case, improving the benchmark's data generator using SHACL constraints would be a nice piece of future work.

@Lars-H
Author

Lars-H commented Oct 11, 2023

Hi @dachafra,

thanks for the quick reply and clarification 🙂 I'll try using the up-to-date mappings from this repo.
Looking forward to future improvements on the benchmark.

Best regards
Lars

@Lars-H Lars-H closed this as completed Oct 11, 2023
@dachafra
Member

Hi @Lars-H,
Would you mind opening a separate issue for each question? That way I can track and solve all of them!

@Lars-H
Author

Lars-H commented Oct 11, 2023

Sure, I can do that. I'll make sure to re-run the process with the updated mappings and see which issues remain. Which mappings file should I use to materialize the RDF from a MySQL DB using rdfizer?

@dachafra
Member

I guess it should be output automatically by the Docker tool. If not, you can use the R2RML mapping with Morph-KGC or Ontop instead of the rdfizer: https://github.com/oeg-upm/gtfs-bench/blob/master/mappings/gtfs-rdb.r2rml.ttl

@Lars-H
Author

Lars-H commented Oct 12, 2023

Ok, that worked. The only issue I am seeing now is the mentioned xsd:duration datatype. Should I report it in a separate issue?

@dachafra
Member

No worries, I'll reopen this issue and just change the title.

@dachafra dachafra reopened this Oct 12, 2023
@dachafra dachafra changed the title Issues with materialized KG Remove xsd:duration datatype from the mappings Oct 12, 2023
@dachafra dachafra added bug Something isn't working and removed question Further information is requested labels Oct 12, 2023

2 participants