Preprocessing custom dataset without removing punctuation #115

ninavdPipple · 2023-12-14T16:30:47Z

Hi,
I'm trying to load a custom dataset without removing the punctuation. However, even with remove_punctuation = False, all punctuation is still removed. Worse, any word attached to punctuation is dropped as well: for example, 'Good evening!' becomes just 'Good' in the corpus. How can I fix this? Ideally I want to remove all punctuation except '<' and '>', but I cannot find any configuration that leaves any punctuation in place.
Thanks in advance!
Nina
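
For reference, the behaviour asked for above (strip all punctuation except '<' and '>', without losing the attached words) can be sketched in plain Python, independent of the library. This is only a minimal illustration: the helper name strip_punctuation is hypothetical and not part of any library API, and it only handles ASCII punctuation.

```python
import string

# Characters to delete: all ASCII punctuation except '<' and '>'.
KEEP = "<>"
STRIP = "".join(ch for ch in string.punctuation if ch not in KEEP)
_TABLE = str.maketrans("", "", STRIP)


def strip_punctuation(text: str) -> str:
    """Delete punctuation characters but leave the attached words in place.

    Hypothetical helper for illustration; not a library function.
    """
    return text.translate(_TABLE)


print(strip_punctuation("Good evening! <eos>"))  # prints: Good evening <eos>
```

Running something like this over the raw documents before handing them to the preprocessing step might be one way to control exactly which characters survive.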

ninavdPipple · 2024-01-05T07:49:08Z

I figured out that this is related to the vocabulary that is built inside the preprocessing: all punctuation is automatically removed from it. Ignoring the vocabulary avoids the problem.
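
A rough sketch of how a vocabulary step like this could produce the reported symptom. This is a simplified model and not the library's actual code: the token pattern, the whitespace split, and both function names are assumptions made purely for illustration.

```python
import re

# Assumed token pattern: runs of two or more word characters, so punctuation
# never ends up in the vocabulary.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def build_vocabulary(docs):
    """Collect the punctuation-free tokens from all documents (hypothetical)."""
    vocab = set()
    for doc in docs:
        vocab.update(TOKEN_PATTERN.findall(doc))
    return vocab


def filter_by_vocabulary(doc, vocab):
    """Keep only whitespace-separated words that are in the vocabulary.

    A whitespace split leaves punctuation attached ('evening!'), so such
    words do not match any vocabulary entry and are dropped entirely.
    """
    return " ".join(word for word in doc.split() if word in vocab)


docs = ["Good evening!"]
vocab = build_vocabulary(docs)                   # {'Good', 'evening'}
filtered = filter_by_vocabulary(docs[0], vocab)  # 'Good'
print(filtered)
```

Under these assumptions, 'evening!' is not a vocabulary entry, so only 'Good' survives, which matches the behaviour described in the first comment. Skipping the vocabulary filter leaves the document text untouched.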
