Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Custom invalid characters (#57) #58

Closed

Conversation

Chaphasilor
Copy link

  • Implemented the new feature
  • Added new tests for the new feature
  • Added new TypeScript type declarations
  • updated jsdoc documentation
  • updated readme

I tried to minimize the amount of additions and to stay in line with the other code.
I hope this is what we both had in mind in #57 :)

Some notes:

  • The strings specified in options.additionalInvalids will be replaced first and every other invalid character will be replaced afterwards. This is needed because if a|b is invalid, and we have the input xa|bx, the output should be xx and not xabx, which would be the case if the default invalid characters would be replaced first.

  • Recursive replacement is supported:
    e.g. ab is invalid, and we have the input aabb, if we only replace the inner ab we will still be left with ab, which is still invalid, so we replace ab as long as it appears in the string

  • This also results in some input parameters being illegal in combination, for example:

    • specifying a replacement which is part of or has substrings that are part of options.additionalInvalids
    • supplying a replacement function that always replaces a certain character with itself, where the character is part of options.additionalInvalids
      (this will most likely never happen, but if it happens it causes an infinite loop, so I handled it)
  • I've renamed the option from invalid to additionalInvalids to make it more clear that this doesn't overwrite anything

  • I'm using the split&join method for replacing occurences of the strings specified in options.additionalInvalids. This results in worse performance (about 40% slower), but I had to do it because creating regular expressions from user input can cause issues with special regex characters.

  • I am using some 'newer' JavaScript syntax and functions:

    • ES6 arrow functions
    • block-scope variable declarations using the let-keyword
    • the <Array>.some()-function

    If you want the package to support more legacy code, I could rewrite it though ;)

- added new functionality in index.js
- added new tests and confirmed current tests are still working
- added new TypeScript declarations
- this helps to cope with certain cenarios
- more complex tests
- used some additional invalids when testing with FS, some from blns
- only check if replacement is part of invalids if it's a string and not a function
- implemented replacement of recursive patterns (this causes some restrictions with options.invalid)
- added error handling for unsupported cases
@Chaphasilor
Copy link
Author

I see that travis is struggling with older versions of node as well as Internet Explorer 11, so I guess I will have to rewrite my code to not make use of any 'new' JavaScript features :)

@parshap
Copy link
Owner

parshap commented Apr 8, 2020

This is great! Nice work @Chaphasilor. Responses to some of your notes below:

  • The strings specified in options.additionalInvalids will be replaced first and every other invalid character will be replaced afterwards. This is needed because if a|b is invalid, and we have the input xa|bx, the output should be xx and not xabx, which would be the case if the default invalid characters would be replaced first.

This makes sense. I could see there being a use case for the opposite order as well, if that ever comes up though then we could have another option added to control that.

  • Recursive replacement is supported:
    e.g. ab is invalid, and we have the input aabb, if we only replace the inner ab we will still be left with ab, which is still invalid, so we replace ab as long as it appears in the string

  • This also results in some input parameters being illegal in combination, for example:

    • specifying a replacement which is part of or has substrings that are part of options.additionalInvalids
    • supplying a replacement function that always replaces a certain character with itself, where the character is part of options.additionalInvalids
      (this will most likely never happen, but if it happens it causes an infinite loop, so I handled it)

Makes sense, good thinking.

  • I've renamed the option from invalid to additionalInvalids to make it more clear that this doesn't overwrite anything

I like the clearer name. A minor thought, but how about additionalInvalidStrings? "Invalids" as a noun reads oddly for me.

  • I'm using the split&join method for replacing occurences of the strings specified in options.additionalInvalids. This results in worse performance (about 40% slower), but I had to do it because creating regular expressions from user input can cause issues with special regex characters.

Seems fine.

  • I am using some 'newer' JavaScript syntax and functions:

    • ES6 arrow functions
    • block-scope variable declarations using the let-keyword
    • the <Array>.some()-function

    If you want the package to support more legacy code, I could rewrite it though ;)

I think it might be a good idea to keep ES5 support right now since there may be users that are depending on this module in legacy browser or node environments. I'm not opposed to using ES6 in the module overall, but if we do that then I think we should do a major version release and I'm not sure it's worth doing for this feature since ES6 syntax doesn't seem like a huge win for the implementation. Open to your opinion though.

Copy link
Owner

@parshap parshap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Only real open question for me is the ES6 syntax. Ready to approve and merge once you've responded to that. I'd also love to see if @joelmukuthu or anyone else has input on this.

index.js Outdated Show resolved Hide resolved
test.js Show resolved Hide resolved
@Chaphasilor
Copy link
Author

  • I am using some 'newer' JavaScript syntax and functions:

    • ES6 arrow functions
    • block-scope variable declarations using the let-keyword
    • the <Array>.some()-function

    If you want the package to support more legacy code, I could rewrite it though ;)

I think it might be a good idea to keep ES5 support right now since there may be users that are depending on this module in legacy browser or node environments. I'm not opposed to using ES6 in the module overall, but if we do that then I think we should do a major version release and I'm not sure it's worth doing for this feature since ES6 syntax doesn't seem like a huge win for the implementation. Open to your opinion though.

I totally agree with you and will fix that tomorrow :)
I just made use of those because that came naturally to me, but there should be no issues with converting them to ES5 compatible equivalents!

@lgtm-com
Copy link

lgtm-com bot commented Apr 9, 2020

This pull request introduces 1 alert when merging 057e3e9 into 209c39b - view on LGTM.com

new alerts:

  • 1 for Useless comparison test

@lgtm-com
Copy link

lgtm-com bot commented Apr 9, 2020

This pull request introduces 1 alert when merging 7365cdb into 209c39b - view on LGTM.com

new alerts:

  • 1 for Useless comparison test

@Chaphasilor
Copy link
Author

Okay so it wasn't as easy as initially thought and I found some issues along the way which I fixed, but everything should work now :)

@joelmukuthu joelmukuthu self-requested a review April 10, 2020 08:31
@joelmukuthu
Copy link
Collaborator

Hi @Chaphasilor, thanks for the incredible work and thanks @parshap for the heads up. I like the idea and current implementation but as an alternative, would it make sense to allow users to pass a regex instead of an array of strings? That would simplify the implementation quite a bit and preclude the edge-cases that came up.

The only issue I can think of to do with security -- we should try to protect users from passing in regex's that are susceptible to ReDOS, but I'm not sure if we can do that or even want to take that burden upon ourselves.

What do you think?

@Chaphasilor
Copy link
Author

I like the idea and current implementation but as an alternative, would it make sense to allow users to pass a regex instead of an array of strings? That would simplify the implementation quite a bit and preclude the edge-cases that came up.

This is a great idea! I'd like to keep the array of strings still, because it's a lot easier than having to deal with RegEx.

The only issue I can think of to do with security -- we should try to protect users from passing in regex's that are susceptible to ReDOS, but I'm not sure if we can do that or even want to take that burden upon ourselves.

What do you think?

That's an interesting thought. Given that the package might be used with filenames downloaded from the internet, encountering such an attack isn't that unlikely.
I think it would be better to add this feature with another pull request though, because for now it's actually not too hard for the user to do the replacing themselves. They just need to create the expression (which they need to do in any case) and use the <String>.replace() method along with their RegExp and the replacement. This is 1-2 lines of code, so I don't think this has high priority for now :)

Copy link
Owner

@parshap parshap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thank you for the contribution, @Chaphasilor. Also thanks for the input @joelmukuthu. I agree that the regex support would be nice, but it can be done as a separate contribution.

I have one question about a potential test case, but otherwise ready to merge this once I hear back.

@@ -55,8 +55,8 @@ function sanitize(input, replacement, invalids) {
throw new Error("The replacement string can't be part of options.additionalInvalidStrings or contain substrings which are part of options.additionalInvalidStrings!");
}
var sanitized = input;
var counter = 0;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix with moving the declaration outside of the while loop -- at first I thought that lgtm bot was crazy but I guess it was right.

Do you have a test that validates this working to prevent that bug from regressing in the future?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've downloaded eslint right after realizing what I've done lol. Embarrassing :D

I added the test illegal replacement function, currently in line 159 in test.js. This should check if the exception is thrown :)

Copy link
Collaborator

@joelmukuthu joelmukuthu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you both for responding. I still think implementing this feature with a regex is a better way to go but all the same I've made some comments regarding the string-array implementation that might help improve performance.

index.js Outdated Show resolved Hide resolved
for (var i = 0; i < invalids.length; i++) {
if (typeof replacement !== 'string') {
var tempReplacement = replacement(invalids[i]);
sanitized = sanitized.split(invalids[i]).join(tempReplacement);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will perform better:

Suggested change
sanitized = sanitized.split(invalids[i]).join(tempReplacement);
sanitized = sanitized.replace(invalids[i], tempReplacement);

Or what's the reason behind using split/join?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The replace function uses a RegExp to find the parts to be replaced. Passing for example | as an invalid character will cause an error, because the pipe character is a reserved character in RegEx.

I used a RegExp for this at first as well, but had to change it for this reason. Not sure if it will work when simply passing a string though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pass the strings directly to replace:

console.log('a|a'.replace('|', ''));
// => 'aa'

Copy link
Author

@Chaphasilor Chaphasilor Apr 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I'll change the code :D

Copy link
Author

@Chaphasilor Chaphasilor Apr 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From MDN:

If pattern is a string, only the first occurrence will be replaced.

I guess now we'd have to add another while loop and our performance improvement will be gone. Damn...

Actually, this isn't the problem. The test that's failing is about nested invalid strings. So we'd need to implement an inner loop to make sure all invalid strings on the same 'level' will be removed in the current iteration, or else the output will be something unexpected. Not invalid, but unexpected.

Thoughts? @parshap @joelmukuthu

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I'm not even sure I fully understood that failure but if all else fails we can document our way out of it.

But frankly, with all the issues coming up, I'm having a hard time backing the string-array implementation. Not that I'm shying away from hard work, but rather because I think the right way to implement this feature is by allowing users to pass in a regex. I've thought about it some more and I don't think we'd be introducing an extra attack vector. Calling it out in the documentation is also good enough in my opinion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we add both?
We could make the users aware of the RegEx attack problem and call the string-array part and 'experimental' feature? Then see if this causes many issues or not?
I know what you're getting at with the RegEx, but I just really want to have the simplicity of strings...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the motivation to keeping it simple, for me it just doesn't equate to the cost of maintaining it -- having both implementations would be an even greater cost.

I don't want to be a blocker for this, @parshap what do you think?

index.js Show resolved Hide resolved
index.js Show resolved Hide resolved
@joelmukuthu
Copy link
Collaborator

Btw @Chaphasilor, the last two comments I made are rather involving and might be a bit discouraging. I don't mean to overburden you while your intentions are nothing but great. If you feel that it's too much work, those updates can be done later without affecting how the new feature will be used.

@Chaphasilor
Copy link
Author

I don't mean to overburden you while your intentions are nothing but great.

No problem :)
It's important to discuss these changes properly!
Any feedback is welcome :D

- added test for wrong type of replacement function
- added tests for making sure replacement function works with additional invalids
@Chaphasilor
Copy link
Author

I guess this PR is dead? @parshap

@parshap
Copy link
Owner

parshap commented Aug 2, 2020

Hey @Chaphasilor, this PR fell off my radar for a while, I apologize. Coming back to this, I had a new thought: is there a reason why this functionality (replacing additional invalid strings) couldn't be a separate function that runs on the string before it's passed to this module's sanitize()?

Looking back at the code and discussion I realized there's a good amount of nuance (e.g., order of replacement) and complexity (e.g., recursion) and I'm wondering if it's worth having this functionality baked into this module. It might be easier to solve this in use case specific ways, or have the generic functionality as a separate module you use alongside sanitize-filename. I think the value of this module is in providing the specifics of various filesystems (e.g., what does NTFS support?), rather than generic string sanitization. Is baking this functionality in better than a separate module?

@Chaphasilor
Copy link
Author

Hey @parshap, thanks for getting back :D

After the discussions we had about this, it might be better not to include this functionality in the core module.

I still think that a module handling this would be a good idea, because it is so complex to implement properly, but it's probably time to close the PR :)

@parshap
Copy link
Owner

parshap commented Aug 2, 2020

Sounds good. Apologies for going in a circle. I agree this is hard to get right and would be valuable in a module.

@parshap parshap closed this Aug 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants