Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO Abstraction: Introducing PathId & FileSystem (Phase 1) #3219

Merged
merged 2 commits into from
Sep 9, 2022

Conversation

hs-apotell
Copy link
Collaborator

IO Abstraction: Introducing PathId & FileSystem (Phase 1)

NOTE: This is only the first phase of the large goal of IO abstraction.

Big picture goal is to abstract IO operations out of Surelog and into a class implementation that can be overriden by client application. This facilitates implementation of other more extensive features (like support for UNC paths, portale cache, multi-source root directories, cache archives, etc). Adding those directly to Surelog is not ideal since in many cases this features pull in other thrid-party dependecnies that might not have the same licensing terms as Surelog (and so leave it to the client/application to decide).

Introducing two new classes - PathId & FileSystem

PathId - New class parallel to SymbolId that represent path-like objects for IO operations. Each instance can represent a directory or a file but the internal representation doesn't necessarily have to be a path. For instance, in case of compressed archive of sources, it can be a string representation of the archive + the offset. The only enforcement is that the representation needs to be printable i.e. should be convertible to a string (since thee representation is stored in SymbolTable). For the advanced users. this limitation can very easily be overcome and is left as an exercise.

FileSystem - This reroutes access to native file system or in special advanced cases can restrict access to the file system. All operations of PathId are meant to be routed thru' this class. This commit contains only the bare-bones of the implementation (to keep the rest of the change review-able). Follow up commit will add the real meat to this implementation.

Most of this change is about switching from SymbolId to use of PathId where applicable. Some specific ones that may be of interest to the reviewer.

  • Restrict use of std::filesystem to minimal scope (functions where possible). Do NOT pass std::filesystem::path across function boundaries. Use PathId instead.
  • Do NOT cache unpinned strings, especially the ones that represent paths. Add them to the SymbolTable and resolve it on restore.
  • PreprocessFile::m_fileId used to represent both a file path (when processing a file) or a macro name (when processing a macro). This was a problem since now the path needs to be represented using PathId but the macro name is represented using SymbolId. So, added two independent variables PreprocessFile::m_macroId & PreprocessFile::m_fileId.

NOTE: This is only the first phase of the large goal of IO abstraction.

Big picture goal is to abstract IO operations out of Surelog and into a
class implementation that can be overriden by client application. This
facilitates implementation of other more extensive features (like
support for UNC paths, portale cache, multi-source root directories,
cache archives, etc). Adding those directly to Surelog is not ideal
since in many cases this features pull in other thrid-party dependecnies
that might not have the same licensing terms as Surelog (and so leave it
to the client/application to decide).

Introducing two new classes - PathId & FileSystem

PathId - New class parallel to SymbolId that represent path-like objects
for IO operations. Each instance can represent a directory or a file but
the internal representation doesn't necessarily have to be a _path_. For
instance, in case of compressed archive of sources, it can be a string
representation of the archive + the offset. The only enforcement is that
the representation needs to be *printable* i.e. should be convertible to
a string (since thee representation is stored in SymbolTable). For the
advanced users. this limitation can very easily be overcome and is left
as an exercise.

FileSystem - This reroutes access to native file system or in special
advanced cases can restrict access to the file system. All operations of
PathId are meant to be routed thru' this class. This commit contains
only the bare-bones of the implementation (to keep the rest of the
change review-able). Follow up commit will add the real meat to this
implementation.

Most of this change is about switching from SymbolId to use of PathId
where applicable. Some specific ones that may be of interest to the
reviewer.

* Restrict use of std::filesystem to minimal scope (functions where
  possible). Do NOT pass std::filesystem::path across function boundaries.
  Use PathId instead.
* Do NOT cache unpinned strings, especially the ones that represent
  paths. Add them to the SymbolTable and resolve it on restore.
* PreprocessFile::m_fileId used to represent both a file path (when
  processing a file) or a macro name (when processing a macro). This was a
  problem since now the path needs to be represented using PathId but the
  macro name is represented using SymbolId. So, added two independent
  variables PreprocessFile::m_macroId & PreprocessFile::m_fileId.
@hs-apotell
Copy link
Collaborator Author

@alaindargelas This PR pairs with chipsalliance/UHDM#793

@hs-apotell
Copy link
Collaborator Author

@alaindargelas @hzeller Since this is only the first phase of the bigger change, the attached files should help provide larger picture of the end goal.

FileSystem.h.txt
FileSystem.cpp.txt

A follow up phase 2 will introduce these files and all relevant changes to go with it across the project.

@alaindargelas alaindargelas merged commit 40766f4 into chipsalliance:master Sep 9, 2022
@hzeller
Copy link
Collaborator

hzeller commented Sep 9, 2022

There are absolutely no unit-tests that verify the new path-ID behavior, nor are there any extension to existing unit tests to verify corner cases.
Please for the next itereration, add unit also provide unit tests.

Copy link
Collaborator

@hzeller hzeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first round of review comments.

Unfortunately, this one is already merged, but before any other pull request can be added in this series, we first need to work out all the details here.

In general, this is too much and not reviewable, you really need to break independent changes to what they are

  • DesignComponent: change return value of name() from const std::string& to std::string can be a separate PR. In particular it would be then easy to discuss why this is not std::string_view and work towards what it is
  • PathId should start out as a no-op: have a typedef SymboldID to PathId, then change everything that needs to be a path from SymbolId to Pathid. Huge text change, but easy to review, no actual change.
  • All new classes need a description of what they are. Also naming: FileSystem sounds like a poor name for what it does, or it is not explained enough.
  • All new classes need unit tests.
  • Why does something that is called SometingId have a reference to the thing that supplies these IDs ? This sounds like a convenience hack.

Here are a first round of comments that we should address before the next phase of the filesystem series.

Whenever you work on changes, please make sure to break them up in simple steps

  • big changes only in a way that they are No-ops first, e.g. with the typedef of SymboldId -> NodeId.
  • introducing new classes one at a time with proper unit tests
  • Wire new classes up as needed in separate changes.

include/Surelog/Common/PathId.h Show resolved Hide resolved
return *this;
}

const SymbolTable *getSymbolTable() const { return m_symbolTable; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear why a path-ID needs to know about the SymbolTable. It sounds like a leaky abstraction or 'convenience' member.
Ideally, this should be removed. Anything called 'ID' should not not carry more context around. Or it needs to be clearly explained in the PathId class description.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's intentional. It makes the implementation of FileSystem API lot easier and elegant. Please check the full implementation of FileSystem in files attached here

I did bring up this design decision in #3054 and we had some back and forth before it went silent. For SymbolId this might not be as necessary since the use of it is restricted to Surelog code. But for PathId, it is meant to be accessed and used in application code. Putting the overhead of managing the owner SymbolTable on client application seemed unnecessary.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaking abstraction often makes implementation easier, but it doesn't mean it is the right thing.

I'll have a look at the CC file, but it currently smells like a convenience hack.

include/Surelog/Common/SymbolId.h Show resolved Hide resolved
include/Surelog/Common/FileSystem.h Show resolved Hide resolved
virtual const std::filesystem::path &getCwd();
virtual PathId getCwd(SymbolTable *const symbolTable);

virtual PathId copy(PathId id, SymbolTable *const toSymbolTable);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a class called FileSystem, a function called copy() copies files around, right ?

If not, this needs a different name, and also, it needs a description.

Also, don't make the parameter const, it makes absolutely no difference (as you const the value, not the point location), and thus is just confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a class called FileSystem, a function called copy() copies files around, right ?

Good point. I didn't think along those lines. Do you have a suggestion of what this function should be renamed to. The purpose is to copy a PathId from one SymbolTable into another (giving the FileSystem a chance for any bookkeeping it might need to do).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which contexts is this used ? Maybe duplicatePath() could be a good name ?
What are the scenarios where this will be used ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Location takes in a PathId object but the owner of this object could be whoever. When an error is recorded with the ErrorContainer, it copies the contained PathId into its own SymbolTable.

In some ways, the name isn't so bad. FileSystem is responsible for everything PathId and so copy with PathId as parameter can be a potential overloaded function.

PathId copy(PathId srcId, SymbolTable *destinationSourceTable); // Copies srcId into destinationSymbolTable
bool copy(PathId srcId, PathId dstId); // Copies files from location srcId to dstId

include/Surelog/Common/FileSystem.h Show resolved Hide resolved
include/Surelog/Common/NodeId.h Show resolved Hide resolved
include/Surelog/Common/PathId.h Show resolved Hide resolved
include/Surelog/Common/PathId.h Show resolved Hide resolved
include/Surelog/Design/DesignComponent.h Show resolved Hide resolved
@hs-apotell
Copy link
Collaborator Author

There are absolutely no unit-tests that verify the new path-ID behavior, nor are there any extension to existing unit tests to verify corner cases. Please for the next itereration, add unit also provide unit tests.

PathId is a data container class not intended to have any logic. There are few operator functions that might be worth some test cases that I can add.

As for the FileSystem itself, yes, I do have uni tests covering most file operations and example to illustrate In-memory file system, portable cache and multiple source directories. These will be part of the follow up Phase-3 PR.

@hs-apotell
Copy link
Collaborator Author

To be honest, this isn't one night effort. This was a group effort from over three months. We did a number of design iterations for deciding how we wanted to implement some of the requested features in our tool. The change was organic but the effort paid off. I am cherry-picking from a much larger change in an effort to get this back into public repository.

Frankly, I thought that's what this change was - a big text change and doesn't change anything in terms of logic. No test case broke and no test log really change in any significant meaningful manner. All this change was doing is introducing PathId class in the codebase and fixing all the fallouts from it.

PathId should start out as a no-op: have a typedef SymboldID to PathId, then change everything that needs to be a path from SymbolId to Pathid. Huge text change, but easy to review, no actual change.

That approach wouldn't have caught all the places that needed attention. I did do a few PR before this one for the changes that I could easily pull out from the larger effort. But breaking this into any smaller would have meant redoing (and verifying) a lot of work that was already completed.

DesignComponent: change return value of name() from const std::string& to std::string can be a separate PR. In particular it would be then easy to discuss why this is not std::string_view and work towards what it is

And I wouldn't have known that such a change would be needed unless I had the PathId change in place.

All new classes need a description of what they are. Also naming: FileSystem sounds like a poor name for what it does, or it is not explained enough.

Sure. Will do so.

All new classes need unit tests.

PathId meant to be a data class and so didn't see much value in adding tests. There are few operator functions (comparison operators) that might demand such and so will add those in a follow up PR.

Why does something that is called SometingId have a reference to the thing that supplies these IDs ? This sounds like a convenience hack.

I have already replied back to your comment about this on the review. We can discuss this further there. IMO, PathId containing the owner SymbolTable makes sense to keep the API simpler. May be PathId shouldn't be called PathId.

Whenever you work on changes, please make sure to break them up in simple steps
big changes only in a way that they are No-ops first, e.g. with the typedef of SymboldId -> NodeId.
introducing new classes one at a time with proper unit tests
Wire new classes up as needed in separate changes.

This is good rule book to follow when you have a concrete design in front of you to implement. Much of our effort is organic and so I try my best to contribute changes back. It's not always that easy.

All that said, I am happy to iterate on specific PR if you feel that needs more work and needs to be broken up. Unfortunately. given time constraints, it might not always be possible but will make best effort.

hs-apotell added a commit to hs-apotell/Surelog that referenced this pull request Sep 9, 2022
…rnings

* Fixed build warnings and also updated the workflow to match local
  settings so the warnings should now be reported on CI as well.
* Add unit tests for PathId
* Some documentation & decorative changes
@hs-apotell
Copy link
Collaborator Author

Follow up PR #3223

@hs-apotell hs-apotell deleted the pathid branch September 9, 2022 19:28
hs-apotell added a commit to hs-apotell/Surelog that referenced this pull request Sep 9, 2022
…rnings

* Fixed build warnings and also updated the workflow to match local
  settings so the warnings should now be reported on CI as well.
* Add unit tests for PathId
* Some documentation & decorative changes
hs-apotell added a commit to hs-apotell/Surelog that referenced this pull request Sep 9, 2022
…rnings

* Fixed build warnings and also updated the workflow to match local
  settings so the warnings should now be reported on CI as well.
* Add unit tests for PathId
* Some documentation & decorative changes
@hzeller
Copy link
Collaborator

hzeller commented Sep 9, 2022

To be honest, this isn't one night effort. This was a group effort from over three months. We did a number of design iterations for deciding how we wanted to implement some of the requested features in our tool. The change was organic but the effort paid off. I am cherry-picking from a much larger change in an effort to get this back into public repository.

Frankly, I thought that's what this change was - a big text change and doesn't change anything in terms of logic. No test case broke and no test log really change in any significant meaningful manner. All this change was doing is introducing PathId class in the codebase and fixing all the fallouts from it.

PathId should start out as a no-op: have a typedef SymboldID to PathId, then change everything that needs to be a path from SymbolId to Pathid. Huge text change, but easy to review, no actual change.

That approach wouldn't have caught all the places that needed attention. I did do a few PR before this one for the changes that I could easily pull out from the larger effort. But breaking this into any smaller would have meant redoing (and verifying) a lot of work that was already completed.

But that is how code reviews and pull requests work.

Sure, you might not know beforehand what changes, that is why you work on a full change first until you're happy with it. The work then to make it reviewable is to break it up into the parts.
So now that you know what parts need to change to PathId, you then can retroactively do a change where everything is text-changed but made a typedef. etc.

Find all the changes independent from the rest, but that you also need for everything coming together, that can have their own review. Such as the const std::string& to std::string return value.

Yes, it requires a little bit more work to prepare a pull request, but that way it is not just a code-dump, but a well thought-out sequence of design decisions that the reviewers can follow.
I usually start out with the big local change (let's call it the creative phase), but then for review I re-do the exact same changes but break them out into separate commits.
That process often discovers that there are things that I don't have to change anymore.
Yes, it might take an hour of work, but it also introduces clarity and it is needed to convey ideas to fellow engineers.

This code review contains 116 changed files! With a mix of functionality addition, text changes, return value changes...
That many files should only really be with no-op starting with a typedef (such as SymbolId -> FileId where needed), or changing a single return value type throughout and have that an easy-to-review thing. So, large changes should only be text changes. Sometimes it requires a little 'dance': do a bulk change of something, then do some functionality change, do a bulk-change of some things turned back. And that is ok.

Any functionality change should be as minimal as possible (a few files and all the corresponding unit tests). Followed by a separate PR or at least commit that wires it up.

DesignComponent: change return value of name() from const std::string& to std::string can be a separate PR. In particular it would be then easy to discuss why this is not std::string_view and work towards what it is

And I wouldn't have known that such a change would be needed unless I had the PathId change in place.

That is the point: you first develop things and find out what all is needed. That is for sure.
But that is not the pull request. It is your responsibility to then break these up into independent, coherent changes before sending them as review. It also gives you another chance to reflect on why each change is needed (sometimes some change needed earlier in a larger effort turns out to not be needed anymore).

All that said, I am happy to iterate on specific PR if you feel that needs more work and needs to be broken up. Unfortunately. given time constraints, it might not always be possible but will make best effort.

Please do. I am not saying that it is simple. And often it requires to re-do the change manually into multiple commits.
The burden you put on Alain, me or others who have to review it is much higher; so you're transferring your time constraint to multiple times that effort to the reviewers. You see that Alain often just gives up on the review and merges, hoping for the best.

alaindargelas added a commit that referenced this pull request Sep 10, 2022
Issue #3221, #3219: Follow up and fixing warnings
@hs-apotell
Copy link
Collaborator Author

Taking the PathId conversation to discussion board #3228

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants