Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add StringView / substring by reference to core library. #56850

Open
hydro63 opened this issue Oct 4, 2024 · 3 comments
Open

Add StringView / substring by reference to core library. #56850

hydro63 opened this issue Oct 4, 2024 · 3 comments
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-core type-enhancement A request for a change that isn't a bug

Comments

@hydro63
Copy link

hydro63 commented Oct 4, 2024

DISCLAIMER > I was sent here from the Dart language repo

I think it would be nice having a way to do substring without copying the entire string, thereby decreasing a both memory and performence overhead.

Problem to solve

The problem that it solves is the fact that currently, the substring method copies the whole buffer of the string. This copying can potentially take a lot of time and allocate a lot of unnecessary resources, that have to be then collected. It can slow down the program significantly.

Definition

The StringView would have all the same definitions of String, but instead of copying the whole buffer, it would just reference to the original / base string with some offset, with adjusted length. Strings are already immutable by default, meaning there is no problem with sideeffects that would otherwise cause bugs.

class StringView extends String{
  String _base;
  int _length;
  int _start = 0;
  StringView(this._base) : _length = _base.length;
  StringView.from(this._base, this._start, this._end) : _length = _end - _start;

  @override
  String substring(int start, [int end]){
    if(end == null) end = _length;
    return StringView.from(_base, _start + start, _start + end);
  }

  @override
  bool operator==(String right){
    // ...
  }

  @override
  String operator[](int index){
    if(index >= _length || index < 0) throw RangeError();
    return _base[index];
  }
  
  // other things implemented;
}

var view = StringView(input);
view = view.substring(1); // start = 1
view = view.substring(1); // start = 2

Motivation

The reason i brought it up was because i was doing some parsing, and i wanted to tokenize a string by continuously finding tokens and eating them from the start with str.substring(token.end).

// if we use substring, the performence is horrible
// with StringView, there is no copying of the buffer, hence no significant slowdown
List<Token> tokenize(String input){
  var buffer = StringView(input);
  var tokens = <Token>[];
  Token? tokenIdent();
  Token? tokenString();
  Token? tokenWhite();
  Token? tokenNumber();
  // ...
  
  while(buffer.length > 0){
    Token? foundToken;
    if(tokenIdent() case var tok?) foundToken = tok;
    else if(tokenString() case var tok?) foundToken = tok;
    else if(tokenWhite() case var tok?) foundToken = tok;
    else if(tokenNumber() case var tok?) foundToken = tok;
    else throw "Unrecognised character > ${input[0]}";
    input = buffer.substring(foundToken.end);
    tokens.add(foundToken);
  }

  return tokens;
}

It could also be used in conjuction with RegExp like ^\s+, so that the ^ character refers to the start of the view, not start of the string. Currently:

var reg = RegExp("^hello");
reg.allMatches("hellohello", 0).firstOrNull // -> RegExpMatch
reg.allMatches("hellohello", 2).firstOrNull // -> null
reg.allMatches("hellohello", 5).firstOrNull // -> null <= I expected RegExpMatch, since it starts matching from index 5;

Other

The reason why i'm asking for a feature, and not implementing it myself, is because it is impossible to implement it as a user. I need to be able to use the StringView as a plain old string, for example in regex. At first i wanted to extend the String class, but since String is final, i can't do it. I also can't do it with extensions, since i need to be able to use it as a normal string, and extensions can't override defined behaviour, that, for example RegExp depends on.

TLDR > since i can't implement it as a user, it needs to be implemented in the core library where String is defined.

Edit

I am not generally a sucker for speed, but in this case using substring can have a drastic performence issues, that are IMO not all that difiicult to solve (at least not looking at the big picture). I'm also willing to trade off some speed for ease of writing, as long as the performence degradation is not really visible. That's why i don't mind using regex, despite it being a "slow" abstraction. I'm using regex because i really hate writing NFA from scratch (that's a really big hassle) which i would have to write either way and i can also optimize the regex expression to get some speed back.

@dart-github-bot
Copy link
Collaborator

Summary: The user proposes adding a StringView class to the Dart core library. This class would represent a substring without copying the underlying data, improving performance and memory usage for operations like tokenization and regular expression matching.

@dart-github-bot dart-github-bot added area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. triage-automation See https://github.com/dart-lang/ecosystem/tree/main/pkgs/sdk_triage_bot. type-enhancement A request for a change that isn't a bug labels Oct 4, 2024
@mraleph
Copy link
Member

mraleph commented Oct 4, 2024

@hydro63 for this specific purpose you should use RegExp.matchAsPrefix which allows you to specify the starting position.

String views have some nice properties and are occasionally useful, but they also cause accidental memory leaks - because people forget that substring can be retaining a large string underneath.

I am also of the opinion that most parsers should work directly with utf8 bytes rather than Dart strings if performance is important.

@lrhn
Copy link
Member

lrhn commented Oct 4, 2024

I would like to have a way to give Pattern an end point too, not just a start point. But JS RegExps don't have that, so we don't have a good way to achieve that.

Memory retention is one classical argument against having built-in slices.
Polymorphism is another. Today there is a known number of implementations of String that all share a common layout (I think), which makes operations on the fairly efficient. Simple operations like string.codeUnitAt(i) can be inlined and optimized.
If every such place needs to account for yet-another subtype of String, that gets harder.
(But maybe it's just like an external string, with a pointer into the middle of the original one. Probably not.)

If allocating many (not-so-small) substrings is an issue, then the native VM could consider allocating string slices behind the curtain, maybe with "copy on GC" semantics. (I believe that is what V8 does, or have done at some point.)
But it has to be worth it.

A string already contains a length, maybe a hash code, so a slice would contain a pointer and a start position on top of that. That's likely 16 bytes, so strings up to 16 bytes are more efficeint to just create directly, instead of creating a slice.

(And yes, work on Uint8List or Uint16List directly if you're looking for speed. Can't use RegExp on those, though, but if you were looking for speed, RegExp might not be the way to go.)

@lrhn lrhn added library-core and removed triage-automation See https://github.com/dart-lang/ecosystem/tree/main/pkgs/sdk_triage_bot. labels Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-core-library SDK core library issues (core, async, ...); use area-vm or area-web for platform specific libraries. library-core type-enhancement A request for a change that isn't a bug
Projects
None yet
Development

No branches or pull requests

4 participants