Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/content extension #62

Merged
merged 2 commits into from
Mar 31, 2014

Conversation

linzhp
Copy link
Contributor

@linzhp linzhp commented Feb 13, 2014

@andygrunwald

This is the refreshed pull request for the content extension.
If #60 is accepted, I will update this pull request to call fr.get_path(repo, path or repo.get_uri()) instead.

Please review.

@andygrunwald
Copy link
Contributor

Hey @linzhp,

this PR is (sadly) not runnable or depends on another PR.
The output of my testrun:

Invalid extension Content (cannot import name SIZE)

@linzhp
Copy link
Contributor Author

linzhp commented Feb 23, 2014

Sorry, I failed to mention this PR depends on MetricsGrimoire/RepositoryHandler#4

@@ -394,7 +395,7 @@ def create_tables (self, cursor):
")")
cursor.execute ("CREATE TABLE actions (" +
"id integer primary key," +
"type varchar(1)," +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which action (Added, Modified, Deleted, etc.) got two character?
I dont see a change / select / insert on the actions table in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In merge commits, the actions often have two characters.

@@ -64,7 +64,7 @@ def set_tail(self, tail):
patterns['committer'] = re.compile("^Commit:[ \t]+(.*)[ \t]+<(.*)>$")
patterns['date'] = re.compile(
"^CommitDate: (.* [0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9][0-9][0-9][0-9]) ([+-][0-9][0-9][0-9][0-9])$")
patterns['file'] = re.compile("^([MAD])[ \t]+(.*)$")
patterns['file'] = re.compile("^([MAD]+)[ \t]+(.*)$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why this is needed. I wouldn't have to change the database schema without a good explanation because this can cause collateral effects.

@sduenas
Copy link
Member

sduenas commented Mar 25, 2014

I've tested your code with a CVSAnalY repo and it works perfectly but before merging it, I would like to know why it is needed to change the database schema. Are there any action with more than a letter to categorize it?

@linzhp
Copy link
Contributor Author

linzhp commented Mar 25, 2014

Run this command on the master branch of CVSAnalY and you will see actions with two letters:

git show --name-status a3ea57e2096bfeb15c429206a7bcc7d21eed72f5

@sduenas
Copy link
Member

sduenas commented Mar 26, 2014

Thanks for the clarification @linzhp.

From my point of view, we should keep the table as is. The idea of 'type' field was to homogenize or to establish a map for all the kind of actions that we can find in SCM systems. I know in MininGit you didn't need that because it only analyzes git repositories and the change in the structure of the table was because of that. But, I think the same behavior is not valid for CVSAnalY.

If there are actions with two letters we should try to map into an existing type, if possible. If not, find another letter. If this is not possible, then we will change the type field.

My personal reason here is I have a huge number of databases that follow the current schema. A change on the schema makes that I couldn't use the latest version of CVSAnalY anymore (unless I make the change by myself in all the databases). Other users of CVSAnalY will have the same problem. But this doesn't mean that we cannot change the schema. What I am saying is we have to find strong arguments to do it.

By the way, I think the code that changes the schema should go out of this patch. Reviewing the code of the extension I didn't find any place where it is needed.

Anyway, @linzhp thanks for your contribution and your patience with us.

@linzhp
Copy link
Contributor Author

linzhp commented Mar 26, 2014

@sduenas Here is how this extension needs the schema change: this extension intends to store all revisions of all files in source repositories. However, taking a snapshot of all files at all commits will blow up the database. So it only stores the content of a file when it gets a new revision, i.e., it's changed in the commit. In a merge commit, a file can get a new revision different from the revisions before the merge, indicated by actions such as 'MM'. If you are reluctant to modify the database schema and the regular expression, those actions will be ignored by CVSAnalY when it is populating the actions table. As Content extension relies on actions table to decide when to store the content of files, and some revisions of files will be missing.

Is the argument strong enough?

@jgbarah
Copy link
Member

jgbarah commented Mar 26, 2014

I'm jumping into this discussion, but I'm not sure I have all the elements to make up my mind, so please ignore it if you feel like it is just random noise... Anyway, if I understand the patch and the comments, it seems that:

  • For capturing all the information, @linzhp suggests a change in the database structure.
  • @sduenas is a bit reluctant to it, because that would break compatibility.

Could we explore the solutilon that @sduenas proposes, about using some other one-character-long actions for MM? If that were the only problem, we could easily use eg "2" or "X", for example, for those "change in revision number but no change in content" in actions.

This said, another possibility which maybe would break less the way CVSAnalY currently works would be creating another table, say actions_git, with all info in actions plus these "changes". The extension could either recreate it with the current code, or just copy actions into actions_git, and then populate the added information.

I find it very valuable to have this extension in CVSAnalY, since it solves a recurrent problem, so let's explore options...

@sduenas
Copy link
Member

sduenas commented Mar 26, 2014

@linzhp What I am saying it something similar to what @jgbarah proposes to do. These actions must be parsed and stored in the database but mapped to M action. AFAIK, MM means that a file was modified during a merge, so it should be mapped to M(odified) action.

@linzhp
Copy link
Contributor Author

linzhp commented Mar 26, 2014

Alright, now the database schema is preserved. As for merge actions, I am only interested in files modified for now, I convert all merge actions with 'M' into a single letter 'M'. You guys can decide what to do with actions such as 'AD', 'AA', 'DD' in future.

Please take another look

@sduenas
Copy link
Member

sduenas commented Mar 27, 2014

@linzhp Thanks for your work.
@andygrunwald What do you think? I think we should merge it.

@andygrunwald
Copy link
Contributor

Hey,

i love to see such discussions and i want to jump in for days. But you know. The time runs and runs. Thanks for mentioned @sduenas!
I`ve read the whole discussion and understand all points from you.

In my point of view we should change the database scheme and support more than one action char (e.g. 2 in this change). Why?

  • Loss of information: I don`t like to lose information. For some analysis / questions it might be useful to know the merge modified status (MM). Depends on the question
  • Used letters: I do not propose to map MM to another character, because we support more than one SCM (CVS, SVN, Git, Bz, ...). Do you know if the chosen mapped letter (e.g. X for "MM") is not used as a valid action in the supported SCMs? E.g. X is the letter for "external sources" in subversion. This is not a valid action in a commit (afaik), but in an update / switch of a repositoriy. See http://svnbook.red-bean.com/en/1.7/svn.advanced.externals.html. This is just a small example i know. Or what is if one of the SCMs added a new feature which are used the letter X? Then we are mapping one letter to another letter. This leads to my third point
  • Native understanding: If you analyze a repository and you have a look at the action tables and try to create some queries / source code you will recognize various letters which we used for mapping n letters (e.g. "MM") to one (e.g. "Y"). Then you will ask "What does "Y" mean?" and you must have a look at the documentation and build a mapping in your source code / queries as well. This (might be) lead to more complex code on all sides (cvsanaly, analysis code, etc.).

I understand the point from @sduenas that he got a huge number of installations which has to be updated as well, if he switch to the new versio of CVSAnaly. And of course this point is valid as hell :D
But i think we should not fear a database scheme change, because this will happen in the feature and were happen in the past. I think we should create a solution / find a way how to apply a database scheme change in a good manner.
Here are two proposals:

  • Versioning + manual migration guide: CVSAnaly is a tool with a version scheme (like every open source tool). "Just" tag a new version and release it. The version number can be chosen with semver (http://semver.org/). This means we dd a new feature or a breaking change and raise minor or major version. With every new release we can create a text file which is our migration guide how to update. This is a quite regular process for every tool. In this guide we add ALTER TABLE queries to update your instance.
  • Database migration scripts: Additional to the first poposal (version + migration guide) we can create a new cvsanaly command "migrate:databasescheme" or something similar. This command executes python code which is encapsulated in classes. The code executes ALTER TABLE statements full automated for every released cvsanaly version with possible rollback features if something goes wrong.

Both proposals can be combined.

Another solution to support multiple actions letters can be to create a second column in the action table. This is quite similiar to the authors date commit (364f67f). This change is valid for git but not for subversion.

We should not fear database changes. This kind of changes are necessary for to development of such a tool like cvsanaly. And of course this problem is the same for Bicho and so on, too.

Now i`d like to hear your feeback on this proposals :)

@linzhp
Copy link
Contributor Author

linzhp commented Mar 28, 2014

I prefer the schema change too--that's why I implemented that way initially--but I would like to see this PR merged as soon. As this extension could run without schema change, you could merge this PR and at the same time create another issue to revert changes in af146b5 regarding to action types, and do whatever it takes to change the schema.

sduenas added a commit that referenced this pull request Mar 31, 2014
@sduenas sduenas merged commit e2d2630 into MetricsGrimoire:master Mar 31, 2014
@andygrunwald
Copy link
Contributor

Thanks to all people who were involved!
I created an issue about the database scheme changes / migrations. This is a really important topic and i do not want to forget it.

See #73 for further details.
Thanks @linzhp for your contribution!

@linzhp linzhp deleted the feature/content_extension branch April 6, 2014 03:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants