Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated fix for #1652: autodetect line breaks type in editor #49

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

pfalcon
Copy link
Contributor

@pfalcon pfalcon commented Dec 6, 2014

This is rebased and updated branch https://github.com/MidnightCommander/mc/tree/1652_autodetect_lb , addressing bugreport http://www.midnight-commander.org/ticket/1652 .

It consists of rebasing to the latest codebase original patches (editor functionality by myself, editor options by Ilia Maslakov, line break detection unit tests by Slava Zanko), and additional changes to make linebreak type detection much more conservative and robust (plus some refactoring of unittests).

This patcheset adheres to following requirements:

  1. By default, linebreak type autodetection is off, and editor behavior is fully as before (including binary-safety, etc.)
  2. When autodetection is on, it is done in conservative manner: bigger sampling buffer size is used to increase confidence; a non-default linebreak type is chosen only if it there was no conflicts in detection; any suspicion that file is binary disables detection. If any of the above conditions are not met, file is loaded with default "as is" line break type (which is the original, binary-safe mode).

@ossilator
Copy link
Contributor

i would choose an entirely different approach: instead of converting upon loading/saving, deal with line endings "on the fly". this has the advantage that even the worst misdetection cannot corrupt the file, you can always make a best effort to display it, and you can switch the mode dynamically when you find it wrong.
do you think this make sense and is feasible?

@pfalcon
Copy link
Contributor Author

pfalcon commented May 30, 2015

@ossilator:

i would choose an entirely different approach

Feel free to choose an entirely different approach and provide patches. For 6 years the original issue (http://www.midnight-commander.org/ticket/1652) exists, this is the only patch which surfaced, on which 3+ people worked over last 2 years. If you think that these people didn't consider other approaches, then sorry, it's a bit naive.

@ossilator
Copy link
Contributor

if you did consider different approaches, then document your conclusions for others to verify.

@pfalcon
Copy link
Contributor Author

pfalcon commented May 30, 2015

Bother to read the linked ticket.

@ossilator
Copy link
Contributor

yes, sorry.
anyway, i think your conclusions are wrong.

on the technical side, your "fear" of adding the necessary abstraction seems entirely overblown. there is already an abstraction for multi-byte chars (for the utf-8 support). adding the necessary abstraction for multi-byte line endings isn't exactly rocket science. and quite frankly, for somebody who volunteers for maintainership, you have a funny reluctance to getting to know the code.

on the user side, the concerns about the potential detriments of the patch have led you to choosing an approach that is so conservative that many people won't even notice that the feature is there. this is entirely counterproductive. the right solution would be always on, and yet without risk of accidentally destroying a file (fwiw, at least for me, this isn't about editing actual binaries - it's about "funny" text files, text interspersed with binary, etc.).

@pfalcon
Copy link
Contributor Author

pfalcon commented May 30, 2015

anyway, i think your conclusions are wrong.

Alternative patches welcome.

and quite frankly, for somebody who volunteers for maintainership, you have a funny reluctance to getting to know the code.

Better maintainer welcome.

on the user side, the concerns about the potential detriments of the patch has led you

Note: not my concerns, but concerns of the overly conservative maintainer. It did not "led" me, I just proposed that as a compromise, to get some progress on this issue.

Currently, we have funny situation that there're concerns of binary safety of text editor, while the text editor has problem working with real-world text files.

And finally: seeing you were a contributor, I'm not interested in fingerpointing between you and me. As you're not bad maintainer, I'm not interested in having quarrels with you. And as you're not a maintainer, I'm not interested spending time on arguments which then are empty.

@ossilator
Copy link
Contributor

Currently, we have funny situation that there're concerns of binary safety of text editor, while the text editor has problem working with real-world text files.

i don't find this strange at all. seriously, if you want a real text editor, by all means, use one. i use mcedit only for quick edits of more or less random files - and i know i can trust it to behave consistently irrespective of file type and not mess up the file. i really wouldn't want to lose that, even if it goes at the cost of other features.

@pfalcon
Copy link
Contributor Author

pfalcon commented May 30, 2015

i don't find this strange at all. seriously, if you want a real text editor, by all means, use one. i use mcedit only for quick edits of more or less random files - and i know i can trust it to behave consistently irrespective of file type and not mess up the file.

And I use mc as a super-IDE for all my development work, thru the years, leaving behind vims, emacses, ideas, eclipses, and other "exciting-new" stuff popping up and down thru the time. So, as a casual mc user (feel free to skip arguing on this definition, you said what you said with your "quick edits") you will keep the behavior you're used to. And core users, like myself, will be able to work with real-world text files adequately. So, proposed compromise looks very viable in practice.

@ossilator
Copy link
Contributor

i've been using mcedit like that myself a decade ago, and the slavaz team clearly did so as well judging by the enhancements they made.
and while i can totally understand why somebody would deem vim, emacs, and whatever other pseudo-IDE no sufficient improvement over mcedit to bother, it's beyond me how one can seriously disregard a modern IDE which is targeted towards (and understands) the programming language one uses - the amount of useless monkey work this has cut for me is outstanding.
but anyway ...

how is the EOL feature relevant for your use as an IDE? you're supposed to use the SCM system's EOL conversions.

while you're right that i could disable the feature, at that point it just becomes dead code from my perspective. why would i be supposed to approve that?

there is an additional factor: i could (very occasionally) actually make good use of the feature, so i don't want it to be so cumbersome to use. but i don't want to enable it by default because of the risk (or the inconsistent behavior when it needs to fall back to an entirely different mode when it encounters a file it cannot deal with). so it's not just that i dislike the kinda hacky proposed solution - i'd also really like to see the feature done properly, which is why i'm trying to convince you to give it another shot.

@pfalcon
Copy link
Contributor Author

pfalcon commented May 30, 2015

i've been using mcedit like that myself a decade ago, and the slavaz team clearly did so as well judging by the enhancements they made.

I figure, visible tabs, visible trailing spaces are pretty advanced features, and one of the reasons why I think elaborating mcedit further makes good sense.

it's beyond me how one can seriously disregard a modern IDE which is targeted towards (and understands) the programming language one uses

That's mostly offtopic, but the passage above gives the answer - there're too many such IDEs, and they're usually language-specific. I'm an open-source hacker, and need to quickly "scroll" thru many projects in short time, and I don't know what next language I make hack on next, and for how long, and don't want to spend extended time setting up an infrastructure for it. Instead, I want to have a universal IDE to work with any kind of source, and mc and mcedit serve that pretty well, modulo few genric last-mile imrovements like this (well, I my also imagine few more adhoc improvements, but I don't even want to think about them now too much, as I value mc's simplicity and generic nature).

how is the EOL feature relevant for your use as an IDE? you're supposed to use the SCM system's EOL conversions.

Yes, and I use that. But as I mentioned, I'm an open-source hacker deal with steady stream of new projects regularly (say, at least dozen each month, usually more). Not all of them set up right, a lot have line-endings mix up. In the end, I figured that I spend too much effort just to report such issues upstream (which are hardly appreciated), or fork and clean up somebody's stuff, etc. Going Internet-style, i.e. being conservative regarding what I produce, but liberal regarding what can be dealt with without much trouble is what revived my interest in this patch, so I tried to clean it up based on collected feedback, and get into mainline after all, instead of maintaining in my fork.

while you're right that i could disable the feature, at that point it just becomes dead code from my perspective. why would i be supposed to approve that?

mc open-source project, and I think it's important that it would cater itself for working well with other open-source projects (including quirks as described above).

so it's not just that i dislike the kinda hacky proposed solution - i'd also really like to see the feature done properly, which is why i'm trying to convince you to give it another shot.

mc itself is kinda hacky coded thing. mc is buggy, including hard bugs like segfaults, hangs, history of data corruption. We live in real world, and this patch provides real-world solution to a problem. As I mentioned, my study of the problem showed that alternative to this way is either doing myriad small patches throughout the editor codebase, which while small, not a few chars, but make control structure more complicated; or rewrite editor pretty much. Both of these choices would be irresponsible - both on my side to propose and on maintainers' to accept - that's what can easily lead to omission/overlook which may lead to segfaults/data corruption. Done like it is, the patch does solve the problem and is contained - you can look at it, and raise concerns, unlike some spaghetti patch, whose effect you cannot predict. Once raise, concerns can be addressed, even if it won't be a complete solution, but compromise. But that's what art of real-world software engineering is.

@ossilator
Copy link
Contributor

mc itself is kinda hacky coded thing.

how could one argue with that ...

ok, let's assume a maintainer who wants to merge this pops up.

the series obviously needs a rebase, and i think it should be squashed (slavaz' remaining parts are rather small, so i think a simple done-with: slavaz footer will do for attribution).

i really don't like the option being there. why not make the thing so conservative that it cannot possibly fail - check the whole file? i don't really understand why you are using mc_read in the first place - the file is read in its entirety anyway, so why not insert the detection and adjustment between the regular read and the insertion into the edit buffer?

where is the inverse conversion on file writeout?

    * On opening file, detect line-endings used by sampling some initial content.
    * If it happen to be CR or CRLF, skip fast load path, and in edit_insert_file()
      convert such line endings to '\n'.
    * Save detected line ending type for editor.

Signed-off-by: Paul Sokolovsky <[email protected]>
Signed-off-by: Slava Zanko <[email protected]>
@pfalcon pfalcon force-pushed the 1652_autodetect_lb-revive branch from 501be0a to 93c77de Compare May 31, 2015 20:08
@pfalcon
Copy link
Contributor Author

pfalcon commented May 31, 2015

Patchset rebased onto latest master. Will respond to other questions a bit later.

ilia-maslakov and others added 4 commits June 1, 2015 00:40
Allows to enable or disable line break type detection in the editor. Default
value is off, based on concerns expressed regarding binary-safety of the
editor. With this setting conservatively set to off, editor behavior stays
exactly the same as before, unless user explicitly enables this option.

Signed-off-by: Ilia Maslakov <[email protected]>
Signed-off-by: Paul Sokolovsky <[email protected]>
Detection now requires consistent line breaks being present within sampling
buffer. If breaks of different types are found, or there's a slightest
suspicion that it may be binary content, it reverts to binary-safe "as-is"
mode.

Also, refactor detection routines to facilitate unit testing.

Signed-off-by: Paul Sokolovsky <[email protected]>
Also, test only line break detection algorithm, not anything else (no
file operations, etc.).

Signed-off-by: Paul Sokolovsky <[email protected]>
@pfalcon pfalcon force-pushed the 1652_autodetect_lb-revive branch from 93c77de to 6d496be Compare May 31, 2015 21:45
@pfalcon
Copy link
Contributor Author

pfalcon commented May 31, 2015

and i think it should be squashed (slavaz' remaining parts are rather small, so i think a simple done-with: slavaz footer will do for attribution).

I keep patches separate and with original authorship as an argument that this is not my patch - 2 official maintainers worked on it too. The patch can be squashed easily anytime, once it's confirmed as ready for merging.

@pfalcon
Copy link
Contributor Author

pfalcon commented May 31, 2015

why not make the thing so conservative that it cannot possibly fail - check the whole file?

Which whole file - 1GB, 4GB, 1TB one? It's one of the known issues that mistakenly pressing F4 on a wrong large file is a good way to DoS current mc session and possibly whole system. So, scanning thru entire file is irresponsible, you need to make that operation be O(1) of file size, not O(n). Then you just select suitable window size and few Ks already offer 99% guarantee of detection.

This implementation also nicely decouples from the rest of editor, to not make it more of spaghetti.

@pfalcon
Copy link
Contributor Author

pfalcon commented May 31, 2015

where is the inverse conversion on file writeout?

That's handled by the existing "Save As..." code, which allows to save a file with different line-breaks. So, the detection process just sets another default value for Save/Save As to use (standard default is "don't change").

@pfalcon
Copy link
Contributor Author

pfalcon commented May 31, 2015

i really don't like the option being there.

Again, that option was added as a compromise to get going with that feature. I was pointed out that this patch changes behavior of editor, and I made changes so it wasn't, unless user opts in. Yes, I myself think that this feature should be on by default, as apparently more users use text editor to edit text files than binary ones. But it's not me to decide.

(For perspective, I also think that "Esc key mode: single press" should be default, as realistically nobody uses Esc to manually enter escape sequences nowadays, or use 300 or so baud rates. If I were a maintainer, I would raise a question to make it a default, perhaps with a command-line switch to force-disable - I'm sure we lose bunch of users because of the current default).

@ossilator
Copy link
Contributor

Again, that option was added as a compromise [...]

my point is that the compromise should not be necessary:

It's one of the known issues that mistakenly pressing F4 on a wrong large file is a good way to DoS current mc session and possibly whole system

this is entirely true. but this problem exists regardless of your patch. given that the line ending scan is at least an order of magnitude faster than the code to insert the file into the buffer (char by char!), injecting it between the existing mc_read and insertion loop would not noticeably worsen the situation.

and when a fix is implemented (probably a more or less arbitrary file size limitation inducing a confirmation dialog), the detection code would be automatically covered by it.

This implementation also nicely decouples from the rest of editor, to not make it more of spaghetti.

i totally don't buy that. in the worst case you need an out parameter which the load function sets as a "side effect" (i haven't looked at the code in detail, but it isn't that bad).

@pfalcon
Copy link
Contributor Author

pfalcon commented Jun 10, 2015

(sorry for delay with response)

but this problem exists regardless of your patch.

Sorry, I don't understand what you hint at - if a design is compromised, then everyone should feel free to compromise it even further? Sorry, I don't think that's sustainable or responsible approach.

given that the line ending scan is at least an order of magnitude faster than the code to insert the file into the buffer (char by char!), injecting it between the existing mc_read and insertion loop would not noticeably worsen the situation.

So, if you mean detecting line-break type in parallel with loading file contents itself, then in general that wouldn't be possible, because before loading each line, one already needs to know line break type of a file. And doing it in 2 passes over file has exactly DoS potential I mentioned above: on a system with low free physical memory, doing that will cause swapping. If system also has a lot of other I/O activity, that can lead to true system-wide I/O due to bugs in contemporary OSes: https://bugzilla.kernel.org/show_bug.cgi?id=12309 (these bugs happen because people do not think about issues we discuss here). Such issues may affect stability and performance of a modern desktop, but people (e.g. myself) run mc on small systems (like home routers/NASes/etc.) where 32MB of RAM or less is norm. It's huge difference whether unneeded DoS-like operation will scan something once and completes in 10mins or twice and completes in 20mins.

i totally don't buy that.

And yet that's generally accepted software engineering practice - consider that the most "efficient" way to write something is to write tightly-couples spaghetti code in assembler, nobody any longer does that. Vice-versa, there're design patterns how to decouple things to achieve other kinds of efficiencies than the one implied by spaghetti assembler code. You miss the fact that this is just a first version of the implementation of a fix for this issue (trac:1652). If got into mainline, but deemed as requiring more improvements, it can be further elaborated. You also miss the fact that many people contributed to even this solution, and if you see immediate need for further tweaks, please contribute them. Otherwise, the way this patch is done, it addresses the original issue, while being conservative enough to not affect users which want to be unaffected, and has other important traits, like being easy to maintain on top of the mainline sources. And that's how it is maintained so far. If official maintainers see some (non DoS imposing) changes as a pre-requisite to it being merged, I'm happy to address them.

@ossilator
Copy link
Contributor

I don't understand what you hint at - if a design is compromised, then everyone should feel free to compromise it even further?

no, that's actually your argument:

mc itself is kinda hacky coded thing.

my argument is that not duplicating work is good, and comes with the added bonus of getting fixed for free when the code path that exists anyway is fixed.

but you're right that my idea won't work as-is, because the files are read in chunks, so a whole-sale detection and conversion can't happen before the file has been loaded in its entirety. that sounds like a somewhat bigger refactoring exercise.

however, your argumentation still makes no sense. the existing code path is sufficient to DoS the system. adding a second full-file reader would double the problem, but at that point this isn't very relevant any more. so the rather obvious conclusion is not avoiding it, but adding an adequate protection mechanism to the new code (which would automatically cover the old code as well if it was always enabled).

@seyko2
Copy link
Contributor

seyko2 commented Nov 6, 2015

Wow! It work's! Nice. Thanks.

@seyko2
Copy link
Contributor

seyko2 commented Nov 8, 2015

Do you look at the vi editor from the openwatcom-v2? Looks like mcedit, mouse support, multiwindow, crlf autodetection....

@seyko2
Copy link
Contributor

seyko2 commented Nov 8, 2015

Suggestion: always show LB type. Curretly LF type is not displayed

@Tomas-M
Copy link

Tomas-M commented Sep 25, 2016

Is this going to be merged? I would really appreciate it. Thanks!

@zyv zyv force-pushed the master branch 2 times, most recently from a48cc07 to 5583c3d Compare January 2, 2023 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants