Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use xmlbf for XML parsing #67

Merged
merged 44 commits into from
Jun 3, 2019
Merged

Conversation

wismill
Copy link
Contributor

@wismill wismill commented May 23, 2019

Work in progress to add xmlbf backend for RDF/XML parsing.

@robstewart57
Copy link
Owner

@wismill it'd be absolutely brilliant if you're able to complete this xmlbf implementation of an RDF/XML parser. It'd complete parsing support for the three parsers, 100% compliant with the w3c RDF parsing unit tests.

@wismill
Copy link
Contributor Author

wismill commented May 30, 2019

By the way, this PR relies on an unmerged MR for xmlbf. I hope it will be merged soon and we will have a new release of xmlbf.

Once we're done I think it would be better to just squash the commits.

@robstewart57
Copy link
Owner

@wismill

I am glad to say with this PR all but one test of for XML are passed!

Amazing! Thank you.

The remaining one should be fixed upstream in Xmlbf.

Is there a specific GitLab issue raised at https://gitlab.com/k0001/xmlbf/issues for this fix?

@wismill
Copy link
Contributor Author

wismill commented May 30, 2019

I have fixed the test suite for Turtle. Note that I have introduced a breaking change for the Turtle parser: document URI has precedence over the base URI in the initial state of the parser. It is not clear to me which base URI (as defined in https://www.ietf.org/rfc/rfc3986.txt, section 5.1) is passed to the parser.

There are now only 2 tests failing (out of 1208!!) for this package: one that seems to be linked to a possible bug in IRI resolution, the other about XML serialization. Master branch has still 65 tests failing. We are pretty close!

I have opened a new issue for xmlbf.

@wismill
Copy link
Contributor Author

wismill commented May 30, 2019

Ok, from what I can read from this page, the tests in rdf4h/data/ttl/conformance are made obsolete by the W3C test suite. In fact the only error we got from these tests was a bug. Could we drop these tests completely?

@robstewart57
Copy link
Owner

@wismill are you suggesting removing just rdf4h/data/ttl/conformance?

I'd be fine with that. Perhaps we should keep the removal of that directory separate from this PR? If so, either we can remove it after the commits here a squashed into a single commit. Or otherwise, I could remove it in master now, for you to rebase or merge in wismill:wip/xmlbf?

@robstewart57
Copy link
Owner

There are now only 2 tests failing (out of 1208!!)

This is really excellent!

I'm following your efforts on xmlbf at https://gitlab.com/k0001/xmlbf/merge_requests/5 and https://gitlab.com/k0001/xmlbf/merge_requests/9 .

Something that occurred to me is that from xmlbf, might we get XML/RDF serialisation for free from a ToXml instance in rdf4h? I.e.

instance RdfSerializer XmlSerializer where ...

in src/Text/RDF/RDF4H/XmlSerialzer.hs .

@wismill
Copy link
Contributor Author

wismill commented Jun 1, 2019

Something that occurred to me is that from xmlbf, might we get XML/RDF serialisation for free from a ToXml instance in rdf4h?

Yes, exactly! Then we could add tests that parse . serialize == id. We cannot ensure the converse, as the conversion is lossy.

@wismill
Copy link
Contributor Author

wismill commented Jun 1, 2019

@robstewart57 do you mind that there are a lot of changes unrelated to the XML parser? I got very enthusiastic ;-) I will contain myself from now.

By the way my first PR was accepted for xmlbf.

@robstewart57
Copy link
Owner

serialize . parse == id

I've moved that discussion over to another issue: #68 .

Now that https://gitlab.com/k0001/xmlbf/commit/96ac0e1930837a58282b343a06f86055e037f627 has been accepted, once xmlbf 0.6 is uploaded to hackage is this PR ready to be squashed and merged? I will then merge the xmlbf branch into the master branch, with the in-progress XML literals support over at xmlbf causing the 1 remaining failing rdf4h test. It'd make sense to merge to master even before XML canonical serialization is implemented (https://gitlab.com/k0001/xmlbf/issues/21), since 1 failing test is better than the current 124 failing tests with the HXT/arrows based RDF/XML parser.

@wismill
Copy link
Contributor Author

wismill commented Jun 2, 2019

I need to do some cleanup and check basic performance.

@wismill
Copy link
Contributor Author

wismill commented Jun 3, 2019

The new XML parser is faster that the XHT-based one, but probably still too slow.

$ cabal v2-run rdf4h-bench -- "rdf4h/parsers/xml-"

benchmarking rdf4h/parsers/xml-xmlbf
time                 1.475 s    (1.388 s .. 1.556 s)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.442 s    (1.397 s .. 1.465 s)
std dev              43.17 ms   (2.698 ms .. 52.39 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking rdf4h/parsers/xml-xht
time                 1.984 s    (1.780 s .. 2.081 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 2.053 s    (2.015 s .. 2.073 s)
std dev              36.46 ms   (2.247 ms .. 44.34 ms)
variance introduced by outliers: 19% (moderately inflated)

@wismill
Copy link
Contributor Author

wismill commented Jun 3, 2019

You may review and merge this PR. There are still some corner cases in the XML parser (indicated by TODO, FIXME) but I am confident they are unlikely to cause much trouble since the parser passes the W3C test suite.

I think we should create an issue to release a new version of rdf4h only when the new xmlbf 0.6 is released.

@wismill wismill mentioned this pull request Jun 3, 2019
@robstewart57 robstewart57 merged commit eca4dae into robstewart57:xmlbf Jun 3, 2019
@wismill
Copy link
Contributor Author

wismill commented Jun 3, 2019

@robstewart57 Once you merge it to master I will work on #68, #70, #71 and #73.

@robstewart57
Copy link
Owner

Once you merge it

This PR has been merged, thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants