Feature/content extension #62

linzhp · 2014-02-13T01:45:13Z

This is the refreshed pull request for the content extension.
If #60 is accepted, I will update this pull request to call fr.get_path(repo, path or repo.get_uri()) instead.

Please review.

andygrunwald · 2014-02-23T10:35:40Z

Hey @linzhp,

this PR is (sadly) not runnable or depends on another PR.
The output of my testrun:

Invalid extension Content (cannot import name SIZE)

linzhp · 2014-02-23T17:30:32Z

Sorry, I failed to mention this PR depends on MetricsGrimoire/RepositoryHandler#4

andygrunwald · 2014-03-23T17:28:35Z

pycvsanaly2/Database.py

@@ -394,7 +395,7 @@ def create_tables (self, cursor):
                            ")")
            cursor.execute ("CREATE TABLE actions (" +
                            "id integer primary key," +
-                            "type varchar(1)," +


Which action (Added, Modified, Deleted, etc.) got two character?
I dont see a change / select / insert on the actions table in this PR.

In merge commits, the actions often have two characters.

sduenas · 2014-03-25T01:13:10Z

pycvsanaly2/GitParser.py

@@ -64,7 +64,7 @@ def set_tail(self, tail):
    patterns['committer'] = re.compile("^Commit:[ \t]+(.*)[ \t]+<(.*)>$")
    patterns['date'] = re.compile(
        "^CommitDate: (.* [0-9]+ [0-9]+:[0-9]+:[0-9]+ [0-9][0-9][0-9][0-9]) ([+-][0-9][0-9][0-9][0-9])$")
-    patterns['file'] = re.compile("^([MAD])[ \t]+(.*)$")
+    patterns['file'] = re.compile("^([MAD]+)[ \t]+(.*)$")


I don't get why this is needed. I wouldn't have to change the database schema without a good explanation because this can cause collateral effects.

sduenas · 2014-03-25T01:17:13Z

I've tested your code with a CVSAnalY repo and it works perfectly but before merging it, I would like to know why it is needed to change the database schema. Are there any action with more than a letter to categorize it?

linzhp · 2014-03-25T16:41:16Z

Run this command on the master branch of CVSAnalY and you will see actions with two letters:

git show --name-status a3ea57e2096bfeb15c429206a7bcc7d21eed72f5

sduenas · 2014-03-26T00:49:12Z

Thanks for the clarification @linzhp.

From my point of view, we should keep the table as is. The idea of 'type' field was to homogenize or to establish a map for all the kind of actions that we can find in SCM systems. I know in MininGit you didn't need that because it only analyzes git repositories and the change in the structure of the table was because of that. But, I think the same behavior is not valid for CVSAnalY.

If there are actions with two letters we should try to map into an existing type, if possible. If not, find another letter. If this is not possible, then we will change the type field.

My personal reason here is I have a huge number of databases that follow the current schema. A change on the schema makes that I couldn't use the latest version of CVSAnalY anymore (unless I make the change by myself in all the databases). Other users of CVSAnalY will have the same problem. But this doesn't mean that we cannot change the schema. What I am saying is we have to find strong arguments to do it.

By the way, I think the code that changes the schema should go out of this patch. Reviewing the code of the extension I didn't find any place where it is needed.

Anyway, @linzhp thanks for your contribution and your patience with us.

linzhp · 2014-03-26T03:57:55Z

@sduenas Here is how this extension needs the schema change: this extension intends to store all revisions of all files in source repositories. However, taking a snapshot of all files at all commits will blow up the database. So it only stores the content of a file when it gets a new revision, i.e., it's changed in the commit. In a merge commit, a file can get a new revision different from the revisions before the merge, indicated by actions such as 'MM'. If you are reluctant to modify the database schema and the regular expression, those actions will be ignored by CVSAnalY when it is populating the actions table. As Content extension relies on actions table to decide when to store the content of files, and some revisions of files will be missing.

Is the argument strong enough?

jgbarah · 2014-03-26T08:41:40Z

I'm jumping into this discussion, but I'm not sure I have all the elements to make up my mind, so please ignore it if you feel like it is just random noise... Anyway, if I understand the patch and the comments, it seems that:

For capturing all the information, @linzhp suggests a change in the database structure.
@sduenas is a bit reluctant to it, because that would break compatibility.

Could we explore the solutilon that @sduenas proposes, about using some other one-character-long actions for MM? If that were the only problem, we could easily use eg "2" or "X", for example, for those "change in revision number but no change in content" in actions.

This said, another possibility which maybe would break less the way CVSAnalY currently works would be creating another table, say actions_git, with all info in actions plus these "changes". The extension could either recreate it with the current code, or just copy actions into actions_git, and then populate the added information.

I find it very valuable to have this extension in CVSAnalY, since it solves a recurrent problem, so let's explore options...

sduenas · 2014-03-26T11:57:03Z

@linzhp What I am saying it something similar to what @jgbarah proposes to do. These actions must be parsed and stored in the database but mapped to M action. AFAIK, MM means that a file was modified during a merge, so it should be mapped to M(odified) action.

linzhp · 2014-03-26T17:11:24Z

Alright, now the database schema is preserved. As for merge actions, I am only interested in files modified for now, I convert all merge actions with 'M' into a single letter 'M'. You guys can decide what to do with actions such as 'AD', 'AA', 'DD' in future.

Please take another look

sduenas · 2014-03-27T00:30:07Z

@linzhp Thanks for your work.
@andygrunwald What do you think? I think we should merge it.

andygrunwald · 2014-03-27T10:33:49Z

Hey,

i love to see such discussions and i want to jump in for days. But you know. The time runs and runs. Thanks for mentioned @sduenas!
I`ve read the whole discussion and understand all points from you.

In my point of view we should change the database scheme and support more than one action char (e.g. 2 in this change). Why?

Loss of information: I don`t like to lose information. For some analysis / questions it might be useful to know the merge modified status (MM). Depends on the question
Used letters: I do not propose to map MM to another character, because we support more than one SCM (CVS, SVN, Git, Bz, ...). Do you know if the chosen mapped letter (e.g. X for "MM") is not used as a valid action in the supported SCMs? E.g. X is the letter for "external sources" in subversion. This is not a valid action in a commit (afaik), but in an update / switch of a repositoriy. See http://svnbook.red-bean.com/en/1.7/svn.advanced.externals.html. This is just a small example i know. Or what is if one of the SCMs added a new feature which are used the letter X? Then we are mapping one letter to another letter. This leads to my third point
Native understanding: If you analyze a repository and you have a look at the action tables and try to create some queries / source code you will recognize various letters which we used for mapping n letters (e.g. "MM") to one (e.g. "Y"). Then you will ask "What does "Y" mean?" and you must have a look at the documentation and build a mapping in your source code / queries as well. This (might be) lead to more complex code on all sides (cvsanaly, analysis code, etc.).

I understand the point from @sduenas that he got a huge number of installations which has to be updated as well, if he switch to the new versio of CVSAnaly. And of course this point is valid as hell :D
But i think we should not fear a database scheme change, because this will happen in the feature and were happen in the past. I think we should create a solution / find a way how to apply a database scheme change in a good manner.
Here are two proposals:

Versioning + manual migration guide: CVSAnaly is a tool with a version scheme (like every open source tool). "Just" tag a new version and release it. The version number can be chosen with semver (http://semver.org/). This means we dd a new feature or a breaking change and raise minor or major version. With every new release we can create a text file which is our migration guide how to update. This is a quite regular process for every tool. In this guide we add ALTER TABLE queries to update your instance.
Database migration scripts: Additional to the first poposal (version + migration guide) we can create a new cvsanaly command "migrate:databasescheme" or something similar. This command executes python code which is encapsulated in classes. The code executes ALTER TABLE statements full automated for every released cvsanaly version with possible rollback features if something goes wrong.

Both proposals can be combined.

Another solution to support multiple actions letters can be to create a second column in the action table. This is quite similiar to the authors date commit (364f67f). This change is valid for git but not for subversion.

We should not fear database changes. This kind of changes are necessary for to development of such a tool like cvsanaly. And of course this problem is the same for Bicho and so on, too.

Now i`d like to hear your feeback on this proposals :)

linzhp · 2014-03-28T04:10:32Z

I prefer the schema change too--that's why I implemented that way initially--but I would like to see this PR merged as soon. As this extension could run without schema change, you could merge this PR and at the same time create another issue to revert changes in af146b5 regarding to action types, and do whatever it takes to change the schema.

Feature/content extension

andygrunwald · 2014-03-31T10:45:44Z

Thanks to all people who were involved!
I created an issue about the database scheme changes / migrations. This is a really important topic and i do not want to forget it.

See #73 for further details.
Thanks @linzhp for your contribution!

andygrunwald added enhancement and removed enhancement labels Feb 22, 2014

andygrunwald reviewed Mar 23, 2014
View reviewed changes

Add Content extension

6b45c6d

sduenas reviewed Mar 25, 2014
View reviewed changes

Convert merge actions with 'M'

af146b5

sduenas added a commit that referenced this pull request Mar 31, 2014

Merge pull request #62 from linzhp/feature/content_extension

e2d2630

Feature/content extension

sduenas merged commit e2d2630 into MetricsGrimoire:master Mar 31, 2014

andygrunwald mentioned this pull request Mar 31, 2014

Take care of database scheme changes / migrations #73

Open

linzhp deleted the feature/content_extension branch April 6, 2014 03:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/content extension #62

Feature/content extension #62

linzhp commented Feb 13, 2014

andygrunwald commented Feb 23, 2014

linzhp commented Feb 23, 2014

andygrunwald Mar 23, 2014

linzhp Mar 23, 2014

sduenas Mar 25, 2014

sduenas commented Mar 25, 2014

linzhp commented Mar 25, 2014

sduenas commented Mar 26, 2014

linzhp commented Mar 26, 2014

jgbarah commented Mar 26, 2014

sduenas commented Mar 26, 2014

linzhp commented Mar 26, 2014

sduenas commented Mar 27, 2014

andygrunwald commented Mar 27, 2014

linzhp commented Mar 28, 2014

andygrunwald commented Mar 31, 2014

Feature/content extension #62

Feature/content extension #62

Conversation

linzhp commented Feb 13, 2014

andygrunwald commented Feb 23, 2014

linzhp commented Feb 23, 2014

andygrunwald Mar 23, 2014

Choose a reason for hiding this comment

linzhp Mar 23, 2014

Choose a reason for hiding this comment

sduenas Mar 25, 2014

Choose a reason for hiding this comment

sduenas commented Mar 25, 2014

linzhp commented Mar 25, 2014

sduenas commented Mar 26, 2014

linzhp commented Mar 26, 2014

jgbarah commented Mar 26, 2014

sduenas commented Mar 26, 2014

linzhp commented Mar 26, 2014

sduenas commented Mar 27, 2014

andygrunwald commented Mar 27, 2014

linzhp commented Mar 28, 2014

andygrunwald commented Mar 31, 2014