This repository has been archived by the owner on Mar 25, 2024. It is now read-only.

404 on SENNA links #72

Closed
msgoff opened this issue Sep 10, 2022 · 3 comments

@msgoff

msgoff commented Sep 10, 2022

Hello

I have spent a lot of time learning how to parse LaTeX found in the arXiv corpus. I am interested in contributing if you have some basic tasks where I could be useful.

Best Regards,
Mike

@dginev
Member

dginev commented Sep 11, 2022

Hi @msgoff !

You titled the issue "404 on SENNA links", which sounds like a problem for the rust-senna wrapper. Maybe we can transfer the issue there?

Separately, the way you've described your interest, it sounds like the intersection between ar5iv and latexml - you can take a look at the ar5iv issues for known problems with our conversion to HTML, and consider contributing upgrades to latexml - if that seems like an activity you would enjoy.

The llamapun repository here is currently in maintenance mode and isn't actively developed. Its tasks start where the conversion to HTML ends -- there are utilities to map down to plain text, and some experiments using basic ~2016 NLP methods.

There is a separate preprocessing library I have been working on, but I have kept its repository private until the bits there stabilize.

@msgoff
Author

msgoff commented Sep 12, 2022

Hello @dginev
Sorry, I wasn't aware that this repository is in maintenance mode.
In the Readme for this project, the following links are no longer valid.
Maybe it would be ok to link to web.archive.org instead.
http://web.archive.org/web/20140208134927/http://ml.nec-labs.com/senna/

Tokenization - rule-based sentence segmentation, and SENNA word tokenization
Part-of-speech tagging (via SENNA),
Named Entity recognition (via SENNA),
Chunking and shallow parsing (via SENNA),

I have seen that you are working on the NLP side of things, and I had not heard of SENNA before, which is why I was interested in learning more about the project.

Thank you for the suggestions.
I will look into ar5iv and latexml issues.

@dginev
Member

dginev commented Sep 12, 2022

Thanks for clarifying, I just updated the readme file.

I hope I can make more of the post-2020 NLP work I have been doing public at some point before the end of the year, but it could be next year. You can take a look at the other open issue here (#59), which gives a taste of the data. There is an associated talk I gave a couple of years ago too. That said, there are now mainstream models one can use instead, if math syntax isn't a core interest (and if LaTeX macros are the preferred modality for math).

dginev closed this as completed Sep 12, 2022