Custom invalid characters (#57) #58

Chaphasilor · 2020-04-08T22:02:21Z

Implemented the new feature
Added new tests for the new feature
Added new TypeScript type declarations
updated jsdoc documentation
updated readme

I tried to minimize the amount of additions and to stay in line with the other code.
I hope this is what we both had in mind in #57 :)

Some notes:

The strings specified in options.additionalInvalids will be replaced first and every other invalid character will be replaced afterwards. This is needed because if a|b is invalid, and we have the input xa|bx, the output should be xx and not xabx, which would be the case if the default invalid characters would be replaced first.
Recursive replacement is supported:
e.g. ab is invalid, and we have the input aabb, if we only replace the inner ab we will still be left with ab, which is still invalid, so we replace ab as long as it appears in the string
This also results in some input parameters being illegal in combination, for example:
- specifying a replacement which is part of or has substrings that are part of options.additionalInvalids
- supplying a replacement function that always replaces a certain character with itself, where the character is part of options.additionalInvalids
  (this will most likely never happen, but if it happens it causes an infinite loop, so I handled it)
I've renamed the option from invalid to additionalInvalids to make it more clear that this doesn't overwrite anything
I'm using the split&join method for replacing occurences of the strings specified in options.additionalInvalids. This results in worse performance (about 40% slower), but I had to do it because creating regular expressions from user input can cause issues with special regex characters.
I am using some 'newer' JavaScript syntax and functions:
- ES6 arrow functions
- block-scope variable declarations using the let-keyword
- the <Array>.some()-function
If you want the package to support more legacy code, I could rewrite it though ;)

- added new functionality in index.js - added new tests and confirmed current tests are still working - added new TypeScript declarations

- this helps to cope with certain cenarios

- more complex tests - used some additional invalids when testing with FS, some from blns

- only check if replacement is part of invalids if it's a string and not a function

- implemented replacement of recursive patterns (this causes some restrictions with options.invalid) - added error handling for unsupported cases

Chaphasilor · 2020-04-08T22:24:06Z

I see that travis is struggling with older versions of node as well as Internet Explorer 11, so I guess I will have to rewrite my code to not make use of any 'new' JavaScript features :)

parshap · 2020-04-08T22:30:12Z

This is great! Nice work @Chaphasilor. Responses to some of your notes below:

The strings specified in options.additionalInvalids will be replaced first and every other invalid character will be replaced afterwards. This is needed because if a|b is invalid, and we have the input xa|bx, the output should be xx and not xabx, which would be the case if the default invalid characters would be replaced first.

This makes sense. I could see there being a use case for the opposite order as well, if that ever comes up though then we could have another option added to control that.

Recursive replacement is supported:
e.g. ab is invalid, and we have the input aabb, if we only replace the inner ab we will still be left with ab, which is still invalid, so we replace ab as long as it appears in the string

This also results in some input parameters being illegal in combination, for example:

specifying a replacement which is part of or has substrings that are part of options.additionalInvalids

supplying a replacement function that always replaces a certain character with itself, where the character is part of options.additionalInvalids
(this will most likely never happen, but if it happens it causes an infinite loop, so I handled it)

Makes sense, good thinking.

I've renamed the option from invalid to additionalInvalids to make it more clear that this doesn't overwrite anything

I like the clearer name. A minor thought, but how about additionalInvalidStrings? "Invalids" as a noun reads oddly for me.

I'm using the split&join method for replacing occurences of the strings specified in options.additionalInvalids. This results in worse performance (about 40% slower), but I had to do it because creating regular expressions from user input can cause issues with special regex characters.

Seems fine.

I am using some 'newer' JavaScript syntax and functions:

ES6 arrow functions

block-scope variable declarations using the let-keyword

the <Array>.some()-function

If you want the package to support more legacy code, I could rewrite it though ;)

I think it might be a good idea to keep ES5 support right now since there may be users that are depending on this module in legacy browser or node environments. I'm not opposed to using ES6 in the module overall, but if we do that then I think we should do a major version release and I'm not sure it's worth doing for this feature since ES6 syntax doesn't seem like a huge win for the implementation. Open to your opinion though.

parshap

This looks great. Only real open question for me is the ES6 syntax. Ready to approve and merge once you've responded to that. I'd also love to see if @joelmukuthu or anyone else has input on this.

index.js

test.js

Chaphasilor · 2020-04-08T22:38:48Z

I am using some 'newer' JavaScript syntax and functions:

ES6 arrow functions

block-scope variable declarations using the let-keyword

the <Array>.some()-function

If you want the package to support more legacy code, I could rewrite it though ;)

I think it might be a good idea to keep ES5 support right now since there may be users that are depending on this module in legacy browser or node environments. I'm not opposed to using ES6 in the module overall, but if we do that then I think we should do a major version release and I'm not sure it's worth doing for this feature since ES6 syntax doesn't seem like a huge win for the implementation. Open to your opinion though.

I totally agree with you and will fix that tomorrow :)
I just made use of those because that came naturally to me, but there should be no issues with converting them to ES5 compatible equivalents!

- also removed unneeded object

lgtm-com · 2020-04-09T12:44:05Z

This pull request introduces 1 alert when merging 057e3e9 into 209c39b - view on LGTM.com

new alerts:

1 for Useless comparison test

lgtm-com · 2020-04-09T12:48:55Z

This pull request introduces 1 alert when merging 7365cdb into 209c39b - view on LGTM.com

new alerts:

1 for Useless comparison test

Chaphasilor · 2020-04-09T14:16:57Z

Okay so it wasn't as easy as initially thought and I found some issues along the way which I fixed, but everything should work now :)

joelmukuthu · 2020-04-10T09:03:02Z

Hi @Chaphasilor, thanks for the incredible work and thanks @parshap for the heads up. I like the idea and current implementation but as an alternative, would it make sense to allow users to pass a regex instead of an array of strings? That would simplify the implementation quite a bit and preclude the edge-cases that came up.

The only issue I can think of to do with security -- we should try to protect users from passing in regex's that are susceptible to ReDOS, but I'm not sure if we can do that or even want to take that burden upon ourselves.

What do you think?

Chaphasilor · 2020-04-10T16:12:57Z

I like the idea and current implementation but as an alternative, would it make sense to allow users to pass a regex instead of an array of strings? That would simplify the implementation quite a bit and preclude the edge-cases that came up.

This is a great idea! I'd like to keep the array of strings still, because it's a lot easier than having to deal with RegEx.

The only issue I can think of to do with security -- we should try to protect users from passing in regex's that are susceptible to ReDOS, but I'm not sure if we can do that or even want to take that burden upon ourselves.

What do you think?

That's an interesting thought. Given that the package might be used with filenames downloaded from the internet, encountering such an attack isn't that unlikely.
I think it would be better to add this feature with another pull request though, because for now it's actually not too hard for the user to do the replacing themselves. They just need to create the expression (which they need to do in any case) and use the <String>.replace() method along with their RegExp and the replacement. This is 1-2 lines of code, so I don't think this has high priority for now :)

parshap

This looks great! Thank you for the contribution, @Chaphasilor. Also thanks for the input @joelmukuthu. I agree that the regex support would be nice, but it can be done as a separate contribution.

I have one question about a potential test case, but otherwise ready to merge this once I hear back.

parshap · 2020-04-10T19:21:20Z

index.js

@@ -55,8 +55,8 @@ function sanitize(input, replacement, invalids) {
    throw new Error("The replacement string can't be part of options.additionalInvalidStrings or contain substrings which are part of options.additionalInvalidStrings!");
  }
  var sanitized = input;
+  var counter = 0;


Nice fix with moving the declaration outside of the while loop -- at first I thought that lgtm bot was crazy but I guess it was right.

Do you have a test that validates this working to prevent that bug from regressing in the future?

I've downloaded eslint right after realizing what I've done lol. Embarrassing :D

I added the test illegal replacement function, currently in line 159 in test.js. This should check if the exception is thrown :)

joelmukuthu

Thank you both for responding. I still think implementing this feature with a regex is a better way to go but all the same I've made some comments regarding the string-array implementation that might help improve performance.

index.js

joelmukuthu · 2020-04-10T19:53:22Z

index.js

+    for (var i = 0; i < invalids.length; i++) {
+      if (typeof replacement !== 'string') {
+        var tempReplacement = replacement(invalids[i]);
+        sanitized = sanitized.split(invalids[i]).join(tempReplacement);


I think this will perform better:

Suggested change

sanitized = sanitized.split(invalids[i]).join(tempReplacement);

sanitized = sanitized.replace(invalids[i], tempReplacement);

Or what's the reason behind using split/join?

The replace function uses a RegExp to find the parts to be replaced. Passing for example | as an invalid character will cause an error, because the pipe character is a reserved character in RegEx.

I used a RegExp for this at first as well, but had to change it for this reason. Not sure if it will work when simply passing a string though?

You can pass the strings directly to replace:

console.log('a|a'.replace('|', '')); // => 'aa'

~~Nice! I'll change the code :D~~

From MDN:

If pattern is a string, only the first occurrence will be replaced.

~~I guess now we'd have to add another while loop and our performance improvement will be gone. Damn...~~

Actually, this isn't the problem. The test that's failing is about nested invalid strings. So we'd need to implement an inner loop to make sure all invalid strings on the same 'level' will be removed in the current iteration, or else the output will be something unexpected. Not invalid, but unexpected.

Thoughts? @parshap @joelmukuthu

Haha, I'm not even sure I fully understood that failure but if all else fails we can document our way out of it.

But frankly, with all the issues coming up, I'm having a hard time backing the string-array implementation. Not that I'm shying away from hard work, but rather because I think the right way to implement this feature is by allowing users to pass in a regex. I've thought about it some more and I don't think we'd be introducing an extra attack vector. Calling it out in the documentation is also good enough in my opinion.

How about we add both?
We could make the users aware of the RegEx attack problem and call the string-array part and 'experimental' feature? Then see if this causes many issues or not?
I know what you're getting at with the RegEx, but I just really want to have the simplicity of strings...

I understand the motivation to keeping it simple, for me it just doesn't equate to the cost of maintaining it -- having both implementations would be an even greater cost.

I don't want to be a blocker for this, @parshap what do you think?

index.js

joelmukuthu · 2020-04-10T20:30:00Z

Btw @Chaphasilor, the last two comments I made are rather involving and might be a bit discouraging. I don't mean to overburden you while your intentions are nothing but great. If you feel that it's too much work, those updates can be done later without affecting how the new feature will be used.

Chaphasilor · 2020-04-10T21:10:43Z

I don't mean to overburden you while your intentions are nothing but great.

No problem :)
It's important to discuss these changes properly!
Any feedback is welcome :D

- added test for wrong type of replacement function - added tests for making sure replacement function works with additional invalids

Chaphasilor · 2020-04-18T15:11:16Z

I guess this PR is dead? @parshap

parshap · 2020-08-02T03:34:56Z

Hey @Chaphasilor, this PR fell off my radar for a while, I apologize. Coming back to this, I had a new thought: is there a reason why this functionality (replacing additional invalid strings) couldn't be a separate function that runs on the string before it's passed to this module's sanitize()?

Looking back at the code and discussion I realized there's a good amount of nuance (e.g., order of replacement) and complexity (e.g., recursion) and I'm wondering if it's worth having this functionality baked into this module. It might be easier to solve this in use case specific ways, or have the generic functionality as a separate module you use alongside sanitize-filename. I think the value of this module is in providing the specifics of various filesystems (e.g., what does NTFS support?), rather than generic string sanitization. Is baking this functionality in better than a separate module?

Chaphasilor · 2020-08-02T12:07:57Z

Hey @parshap, thanks for getting back :D

After the discussions we had about this, it might be better not to include this functionality in the core module.

I still think that a module handling this would be a good idea, because it is so complex to implement properly, but it's probably time to close the PR :)

parshap · 2020-08-02T20:59:46Z

Sounds good. Apologies for going in a circle. I agree this is hard to get right and would be valuable in a module.

Chaphasilor added 11 commits April 8, 2020 18:34

ran npm audit fix

1ca1541

added new feature: additional invalid characters

bb9016f

- added new functionality in index.js - added new tests and confirmed current tests are still working - added new TypeScript declarations

updated jsdoc

6f49e84

remove additional invalids first

6abc74c

- this helps to cope with certain cenarios

added further testing

41dc758

- more complex tests - used some additional invalids when testing with FS, some from blns

handle replacement functions

5251bfb

- only check if replacement is part of invalids if it's a string and not a function

updated readme

91a1fb5

edge cases

038971a

- implemented replacement of recursive patterns (this causes some restrictions with options.invalid) - added error handling for unsupported cases

modified tests to account for new edge cases

6f57187

rename option from 'invalid' to 'additionalInvalids' for clarity

07704fd

more renaming and fixed typos

da93854

parshap reviewed Apr 8, 2020

View reviewed changes

index.js Outdated Show resolved Hide resolved

test.js Show resolved Hide resolved

Chaphasilor added 3 commits April 9, 2020 14:31

convert to ES5

e4b609b

rename option to 'additionalInvalidStrings'

ea2bd1e

replace keyword 'const' with 'var'

057e3e9

- also removed unneeded object

replaced backticks with "

7365cdb

Chaphasilor added 5 commits April 9, 2020 15:05

remove shorthand from for loop

020a4ee

move counter declaration to prevent re-initialization

99cde11

replace String.includes() with String.indexOf

0dd8b47

properly handle replacement function

6ec3383

add test for illegal replacement function

8843c20

joelmukuthu self-requested a review April 10, 2020 08:31

parshap approved these changes Apr 10, 2020

View reviewed changes

joelmukuthu reviewed Apr 10, 2020

View reviewed changes

check type of replacement

05a7f43

added more tests

4ba9696

- added test for wrong type of replacement function - added tests for making sure replacement function works with additional invalids

nmalenovic mentioned this pull request Jul 24, 2020

new feature: options.truncate length or cb fn #60

Open

5 tasks

parshap closed this Aug 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Custom invalid characters (#57) #58

Custom invalid characters (#57) #58

Chaphasilor commented Apr 8, 2020

Chaphasilor commented Apr 8, 2020

parshap commented Apr 8, 2020

parshap left a comment

Chaphasilor commented Apr 8, 2020

lgtm-com bot commented Apr 9, 2020

lgtm-com bot commented Apr 9, 2020

Chaphasilor commented Apr 9, 2020

joelmukuthu commented Apr 10, 2020

Chaphasilor commented Apr 10, 2020

parshap left a comment

parshap Apr 10, 2020

Chaphasilor Apr 10, 2020

joelmukuthu left a comment •

edited

Loading

joelmukuthu Apr 10, 2020

Chaphasilor Apr 10, 2020

joelmukuthu Apr 10, 2020

Chaphasilor Apr 10, 2020 •

edited

Loading

Chaphasilor Apr 10, 2020 •

edited

Loading

joelmukuthu Apr 10, 2020

Chaphasilor Apr 10, 2020

joelmukuthu Apr 12, 2020

joelmukuthu commented Apr 10, 2020

Chaphasilor commented Apr 10, 2020

Chaphasilor commented Apr 18, 2020

parshap commented Aug 2, 2020

Chaphasilor commented Aug 2, 2020

parshap commented Aug 2, 2020

	sanitized = sanitized.split(invalids[i]).join(tempReplacement);
	sanitized = sanitized.replace(invalids[i], tempReplacement);

Custom invalid characters (#57) #58

Custom invalid characters (#57) #58

Conversation

Chaphasilor commented Apr 8, 2020

Chaphasilor commented Apr 8, 2020

parshap commented Apr 8, 2020

parshap left a comment

Choose a reason for hiding this comment

Chaphasilor commented Apr 8, 2020

lgtm-com bot commented Apr 9, 2020

lgtm-com bot commented Apr 9, 2020

Chaphasilor commented Apr 9, 2020

joelmukuthu commented Apr 10, 2020

Chaphasilor commented Apr 10, 2020

parshap left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joelmukuthu left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Chaphasilor Apr 10, 2020 • edited Loading

Choose a reason for hiding this comment

Chaphasilor Apr 10, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joelmukuthu commented Apr 10, 2020

Chaphasilor commented Apr 10, 2020

Chaphasilor commented Apr 18, 2020

parshap commented Aug 2, 2020

Chaphasilor commented Aug 2, 2020

parshap commented Aug 2, 2020

joelmukuthu left a comment •

edited

Loading

Chaphasilor Apr 10, 2020 •

edited

Loading

Chaphasilor Apr 10, 2020 •

edited

Loading