
Implement fit_partial() for ImplicitALSWrapperModel and LightFMWrapperModel #179

Closed
wants to merge 12 commits into from

Conversation

@chezou (Contributor) commented Aug 13, 2024

Description

Introduce fit_partial for ImplicitALSWrapperModel and LightFMWrapperModel.

Expected usage:

    dataset = ...
    new_interactions = ...

    new_dataset = dataset.rebuild_with_new_data(interactions_df=new_interactions)
    model.fit_partial(new_dataset)

Due to the complexity of explicit user/item feature fitting in ALS, fit_partial does not support ALS with features for now.

Closes #176

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Optimization

How Has This Been Tested?

Before submitting a PR, please check yourself against the following list. It would save us quite a lot of time.

  • Have you read the contribution guide?
    • Yes
  • Have you updated the relevant docstrings? We're using Numpy format, please double-check yourself
    • Yes
  • Does your change require any new tests?
    • Yes
  • Have you updated the changelog file?
    • Yes

@chezou chezou changed the title Implement fit_partial Implement fit_partial() for ImplicitALSWrapperModel and LightFMWrapperModel Aug 13, 2024
@feldlime (Collaborator) left a comment

Thanks a lot for such a great feature!

I added some comments; if you don't agree with them, let's discuss. I'm not 100% sure I understand the use case, so maybe some of the comments are wrong. But if you agree with them, then it's fine.

rectools/dataset/dataset.py (resolved)
@@ -245,3 +258,72 @@ def get_raw_interactions(self, include_weight: bool = True, include_datetime: bo
pd.DataFrame
"""
return self.interactions.to_external(self.user_id_map, self.item_id_map, include_weight, include_datetime)

def construct_new_datasets(
Collaborator:
Probably construct_new_dataset would be more reasonable since only one dataset is returned

@blondered (Collaborator) commented Aug 14, 2024

I'd call it build_new_dataset_on_id_map or something like it. But we need to think about the concept first.

Right now it doesn't work properly within the concepts we use. We have a rule that all hot users (users that have interactions) always go before warm users (users that have only features), and we rely on this assumption in some of our important internal logic.

The new method can break this rule because new users with interactions will get ids after warm users from the previous dataset state. So we really can't make this a public method in the library.

Options are:

  • Rewrite the logic. We need to keep only the hot user ids from the previous state, then add new hot user ids, then add new warm user ids. I suppose we drop all previous warm users because we are totally replacing the actual data. (The same applies to items.) In this case I'd call it rebuild_with_new_data, construct_with_new_data, or replace_data.
  • Just don't make it a public method. Move it to the tests if we need it there. But then we still need to write a correct public method for the required logic, and tests will be needed for it anyway.
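The first option could be sketched as follows (a minimal, hypothetical helper, not the RecTools API; names are illustrative):

```python
# Hypothetical sketch of the "rewrite logic" option: keep old hot ids first,
# then new hot ids, then new warm ids, so hot always precedes warm.
def rebuild_id_map(old_hot, new_hot, new_warm):
    """Build an external->internal mapping that keeps hot ids before warm ids."""
    mapping = {}
    for ext in old_hot:               # 1. old hot users keep their internal ids,
        mapping[ext] = len(mapping)   #    so model embeddings stay aligned
    for ext in new_hot:               # 2. new hot users get the next ids
        if ext not in mapping:
            mapping[ext] = len(mapping)
    for ext in new_warm:              # 3. warm (features-only) users come last
        if ext not in mapping:
            mapping[ext] = len(mapping)
    return mapping

id_map = rebuild_id_map(["u1", "u2"], ["u2", "u3"], ["u4"])
# hot ids 0..2 precede the warm id 3
```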

@chezou (Contributor, author):

Thanks for pointing it out. I wasn't aware of the warm-user assumption. And yeah, I expected this to be a controversial change, which is why I haven't written the test yet 😅

Originally, I expected RecTools' Dataset to behave similarly to LightFM's Dataset.fit_partial(), but I found that the RecTools Dataset is frozen.

I'm okay with making it private, or I can rebuild IdMap by concatenating new and old users/items, something like:

old_hot_user_id_map = IdMap.from_dict(
    {e: i for e, i in zip(ds.user_id_map.convert_to_external(ds.get_hot_users()), ds.get_hot_users())}
)
old_hot_item_id_map = ...

new_user_id = old_hot_user_id_map.add_ids(...

I'm not confident it's okay to drop the previous warm users yet, though.

@chezou (Contributor, author):

Hmm, I attempted to drop the old warm ids, but I hit UnknownIdError:

raise UnknownIdError("All ids in `df` must be present in `id_map`")

Maybe storing hot/warm marks in IdMap could be an option, e.g. a hot_ids array like np.array([1, 0, 1]) where 1 represents a hot id and 0 a warm id. Then we could fetch hot or warm users/items explicitly.
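The proposed mask could look like this (illustrative only; no such field exists in IdMap today):

```python
import numpy as np

# Hypothetical hot/warm mask on an id map: 1 marks a hot id, 0 a warm id.
hot_mask = np.array([1, 0, 1], dtype=bool)
internal_ids = np.arange(hot_mask.size)

hot_ids = internal_ids[hot_mask]    # ids with interactions
warm_ids = internal_ids[~hot_mask]  # ids with features only
```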

@blondered (Collaborator) commented Aug 15, 2024

The overall concept of ids in the RecTools Dataset is the following:

  1. User ids are row indexes in the user-interactions matrix, the user features matrix, and the vector model embeddings matrix.
  2. Because we don't want empty rows in the user-interactions matrix, we can't have warm user ids before hot ones. Empty rows are bad because some models would include them in loss calculation, which is conceptually wrong.
  3. So we always need all hot user ids first, then all warm user ids. They have the same order in the user features matrix.
  4. That's why we don't need a hot_ids flag. But we always need to make sure we keep the assumption of hot/warm id order.

As for keeping warm user ids in the dataset, it is very simple. If those users don't have features in the user features matrix of the final dataset, they need to be dropped from user_id_map, because the only purpose of warm ids is to fetch features for them (during the recommend method).
So if you are dropping feature values for some users, you should drop their ids from IdMap.
Models do not store any embeddings for warm users, so nothing will break.
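A minimal plain-Python illustration of these rules (assumed behaviour, not actual RecTools internals):

```python
# Hot ids come first, and warm ids survive only if they still have features.
hot_ids = ["u1", "u2"]                 # users with interactions
warm_ids = ["u3", "u4"]                # users with features only
users_with_features = {"u1", "u2", "u3"}

# Warm users without feature rows have no purpose and are dropped from the map.
kept_warm = [u for u in warm_ids if u in users_with_features]
id_map = {ext: i for i, ext in enumerate(hot_ids + kept_warm)}
# {"u1": 0, "u2": 1, "u3": 2}
```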

Collaborator:

As for the error, one possible workaround is to keep all (old) hot user features inside the dataset.
For sparse features we can drop the values for memory efficiency.
For dense features I don't really see what we can do with the current implementations. Need to think about it.

Collaborator:

Actually, since we are not dropping "old hot" user embeddings from the models, why should we drop their features from the dataset? For the recommend method of an incrementally trained model we still need features in the dataset for all users that were ever hot.
So I think that no-longer-hot users that are not present in the new user_features_df should keep all their values in the user features matrix.
But all users (both warm and hot) that are present in the new user_features_df should have all their values updated (the real service scenario).

Now about corner cases. I think that for incremental training we need to require exactly the same number of features, feature names, and feature types as were present in the original dataset, at least in this PR; otherwise we'll dive too deep. So all these checks for correct feature data are needed. And cat_user_features and make_dense_user_features should not be passed to rebuild_with_new_data.

Items follow the same logic as users.

Wow. That's a tough story :) What do you think?

@chezou (Contributor, author):

Okay, let me revive my dropping code and see what happens on CI. I'll share it later.

@chezou (Contributor, author):

Fixed to drop old warm ids in e8929c5

@chezou (Contributor, author):

@blondered I think I completed implementing the "Rewrite logic" option, but let me know if you find anything else.

rectools/models/lightfm.py (resolved)
rectools/models/lightfm.py (resolved)
rectools/models/lightfm.py (resolved)
rectools/models/implicit_als.py (resolved)
tests/models/test_implicit_als.py (resolved)
@@ -11,6 +11,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added
- `Debias` mechanism for classification, ranking and auc metrics. New parameter `is_debiased` to `calc_from_confusion_df`, `calc_per_user_from_confusion_df` methods of classification metrics, `calc_from_fitted`, `calc_per_user_from_fitted` methods of auc and rankning (`MAP`) metrics, `calc_from_merged`, `calc_per_user_from_merged` methods of ranking (`NDCG`, `MRR`) metrics. ([#152](https://github.com/MobileTeleSystems/RecTools/pull/152))
- `nbformat >= 4.2.0` dependency to `[visuals]` extra ([#169](https://github.com/MobileTeleSystems/RecTools/pull/169))
- Implement `fit_partial()` for `ImplicitALSWrapperModel` and `LightFMWrapperModel` ([#179](https://github.com/MobileTeleSystems/RecTools/pull/179))
Collaborator:

I'm very excited about the new feature. Thank you for the PR!

Let's discuss what we're trying to achieve with fit_partial.
Implicit's ALS has partial_fit_users and partial_fit_items methods that allow updating weights only for specific subsets of users/items, and they also support feeding new users and items to an already fitted model.
The LightFM model has different logic. It has a fit_partial method. From the docs: "Unlike fit, repeated calls to this method will cause training to resume from the current model state". This method does not support adding new users/items, and it is not meant to update only parts of the embeddings.

What exactly do we want to achieve with fit_partial in RecTools? I would say the first step would be to follow the LightFM logic. We are missing the ability to resume training; our fit always refits models from the beginning, just like LightFM's fit.

As for partial_fit_users and partial_fit_items: these are very model-specific ALS methods. I wouldn't call them fit_partial. I would move them to a separate issue and PR.

@chezou @feldlime what do you think?
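To make the distinction concrete, here is a toy pure-NumPy illustration (not implicit's actual code) of what a partial_fit_users-style update does: re-solve the regularized least-squares step for a chosen subset of users only, holding item factors fixed:

```python
import numpy as np

# Toy ALS-style partial user update; all sizes and values are illustrative.
rng = np.random.default_rng(0)
n_users, n_items, k, reg = 4, 5, 3, 0.1
R = (rng.random((n_users, n_items)) > 0.5).astype(float)  # toy interaction matrix
U = rng.normal(size=(n_users, k))                         # user factors
V = rng.normal(size=(n_items, k))                         # item factors

def refit_users(user_ids):
    # Exact update for each given user u: U[u] = (V^T V + reg*I)^-1 V^T R[u]
    A = V.T @ V + reg * np.eye(k)
    for u in user_ids:
        U[u] = np.linalg.solve(A, V.T @ R[u])

U_before = U.copy()
refit_users([0, 2])  # only rows 0 and 2 of U change
```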

@chezou (Contributor, author):

Good point.

My motivation for adding fit_partial is to enable incremental training: for example, weekly batches for full training and daily incremental training to pick up new and existing user transactions.

I don't much care about preserving old users/items; rather, shortening training time is what matters.

Actually, LightFM has a known workaround for adding a new user/item: lyst/lightfm#347 (comment)
I included that workaround in the patch: ea33c05
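The workaround referenced above can be sketched roughly like this (the helper and its initialization scale are assumptions; the array names only mirror LightFM's *_embeddings / *_biases attributes, this is not LightFM code):

```python
import numpy as np

# Pad fitted embedding/bias arrays with freshly initialized rows for new
# users/items, then training can be resumed with fit_partial.
def resize_embeddings(embeddings, biases, n_new, scale=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    new_rows = rng.normal(scale=scale, size=(n_new, embeddings.shape[1]))
    padded_emb = np.vstack([embeddings, new_rows.astype(embeddings.dtype)])
    padded_bias = np.concatenate([biases, np.zeros(n_new, dtype=biases.dtype)])
    return padded_emb, padded_bias

emb, bias = resize_embeddings(
    np.zeros((3, 4), dtype=np.float32), np.zeros(3, dtype=np.float32), n_new=2
)
# emb now has 5 rows, bias has 5 entries
```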

Collaborator:

Great feature for LightFM, thank you!

So there are two major use cases for fit_partial:

  1. Incremental training, with support for adding new items and features. This is the case when the dataset changes between calls.
  2. Epochal training. This is the case when the dataset stays the same between calls; it just resumes training from the previous state, like LightFM.fit_partial.

Right now you implemented support for both cases in LightFM, but only one of them (incremental) in ALS. Could you please add support for ALS epochal training (with NotImplementedError if features are present)? Otherwise users would be very confused by the different behaviour of the same method. And could you please add an optional epochs argument to fit_partial of ALS?

We would also need tests for both models: fit a model for 10 epochs with fit, and fit another model 10 times with fit_partial for 1 epoch each. The final embeddings should be equal.

How do you feel about the whole story?
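The epochal-equivalence property described here can be demonstrated with a toy model (plain Python, no RecTools; the class is purely illustrative): N epochs in one fit call must equal N one-epoch fit_partial calls, because fit_partial resumes from the current state while fit restarts.

```python
class ToyModel:
    def __init__(self):
        self.weight = None

    def fit(self, data, epochs):
        self.weight = 0.0            # fit always restarts from scratch
        return self.fit_partial(data, epochs)

    def fit_partial(self, data, epochs):
        if self.weight is None:      # first call behaves like fit
            self.weight = 0.0
        for _ in range(epochs):
            for x in data:           # one deterministic gradient-like step
                self.weight += 0.1 * (x - self.weight)
        return self

data = [1.0, 2.0, 3.0]
full = ToyModel().fit(data, epochs=10)
inc = ToyModel()
for _ in range(10):
    inc.fit_partial(data, epochs=1)
# identical operation order, so the weights match exactly
```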

@chezou (Contributor, author):

Sounds good. Let me add tests for re-training with the same dataset.

@chezou (Contributor, author):

Added the tests in 117c5c7

@chezou (Contributor, author):

@blondered I think 117c5c7 shows how it works for the use case 2.

Please let me know if you'd like to add anything else.

Collaborator:

Thanks! Sorry for the delay. It's summer and it was vacation time :)

- Add known-user test case

LightFM
- Add _resize() function to resize the user and item embeddings
- Use deepcopy only if the model is not fitted
- Allow passing epochs to fit_partial()
- Handle sample_weight for k-OS WARP loss

ImplicitALS
- Raise NotFittedError if fit_partial() is called before fit()

codecov bot commented Aug 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (9b3992e) to head (dd84f10).
Report is 67 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##              main      #179     +/-   ##
===========================================
  Coverage   100.00%   100.00%             
===========================================
  Files           45        56     +11     
  Lines         2242      3520   +1278     
===========================================
+ Hits          2242      3520   +1278     


@chezou (Contributor, author) commented Aug 30, 2024

@blondered @feldlime I think all comments are addressed. Please have a look and let me know if you need anything.


@@ -309,3 +330,78 @@ def _handle_features(
item_features=item_features_new,
)
return filtered_dataset

def rebuild_with_new_data(
Collaborator:

A lot of things could go wrong with features because we don't do any checks, and the user would not know about the wrong behaviour.
Let's add an on_unconsistent_features argument with options "raise" and "ignore". If "raise" is specified, then after features creation there needs to be a check that the old dataset feature names are exactly the same as the new feature names (and in the same order). If something is not right, a ValueError is raised.
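A possible shape for that check (the argument name is taken from the comment above; the helper itself is hypothetical, not RecTools code):

```python
def check_feature_consistency(old_names, new_names, on_unconsistent_features="raise"):
    """Require identical feature names in identical order between dataset states."""
    if list(old_names) == list(new_names):
        return
    if on_unconsistent_features == "raise":
        raise ValueError(
            f"Feature names changed between dataset states: {old_names!r} -> {new_names!r}"
        )
    # "ignore": proceed silently at the user's own risk

check_feature_consistency(["age", "sex"], ["age", "sex"])   # consistent: no error
```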

@@ -90,6 +90,26 @@ def _fit(self, dataset: Dataset) -> None: # type: ignore
self.verbose,
)

def _fit_partial(self, dataset: Dataset) -> None:
Collaborator:

We've discussed it a lot. Let's not add ALS support for fit_partial for now.
The main reasons are:

  1. The ALS wrapper has different behaviour than the LightFM wrapper in the current implementation: ALS cannot do fit_partial on a model that has not yet been fitted. We can't leave it as it is.
  2. Features support should be there for ALS. So we would have to rewrite some of our feature-support code in fit to add support for fit_partial and also for the epochs parameter. And it's a long story; it should go in another PR.

actual,
)

def test_fit_partial_with_same_dataset(self, dataset: Dataset) -> None:
Collaborator:

Please make this a test where factors are compared. Identical recommendations on small toy data are not really a signal that fit_partial and fit follow exactly the same logic. Please fit a model for 10 epochs, then call fit_partial 10 times for one epoch each on another copy of the model, and then compare that the factors are the same.


@chezou (Contributor, author) commented Sep 3, 2024

Thanks for dedicating your time and effort to my PR. Your support is invaluable and greatly appreciated.

I decided to close this PR because:

  • There may still be substantial unwritten assumptions about user/item features that are unknown to me as an outsider, and that gap is becoming a huge blocker to understanding the intent of the comments.
  • Implicit ALS's user/item feature handling and the Dataset structure may need drastic changes to adopt warm-feature handling.

My assumptions for user/item features were:

  • There should be user/item tables along with the interaction table.
  • Those tables' columns are additive during the incremental training phase. If a column is removed, we can do full training.
  • Looking at LightFM's model resize workaround, it looks like features are additive, not dropped.

In our case, we mainly handle interactions only, with Implicit. I'm glad to know some code in this PR works without features for ALS.

I know I might be better off joining your Telegram channel, but installing another chat tool is not what I want.

@chezou chezou closed this Sep 3, 2024
@blondered (Collaborator):
Thank you so much for your code and your time!

We provide users with a very simple interface for all of the models and datasets, but it requires us to do a lot of work under the hood. The story of features support in ALS is huge on its own, and support for an incremental dataset there is also huge.
Another problem was the mapping of features between the old dataset and the new one. Our methods were not written to support incremental features, so they could completely change the position of a specific feature in the dataset features table and in the model feature embeddings table. The user would not know about this, but model performance would suffer a lot, because old embeddings would be mapped to completely different features from the new dataset.

If you would like to contribute in the future and you don't want to communicate in Telegram, we can hold discussions in GitHub issues, no problem.

Successfully merging this pull request may close these issues.

fit_partial function for ImplicitALSWrapperModel and LightFMWrapperModel