
Commit

Release
cerlymarco committed Mar 13, 2022
1 parent e07c2d4 commit 8d5beca
Showing 8 changed files with 313 additions and 214 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -9,7 +9,7 @@ Like in tree-based algorithms, the data are split according to simple decision r

**Linear Forests** generalize the well-known Random Forests by combining Linear Models with Random Forests. The key idea is to use the strength of Linear Models to improve the nonparametric learning ability of tree-based algorithms. Firstly, a Linear Model is fitted on the whole dataset; then a Random Forest is trained on the same dataset, using the residuals of the previous step as the target. The final predictions are the sum of the raw linear predictions and the residuals modeled by the Random Forest.
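To make the two-stage idea concrete, here is a minimal sketch built from plain scikit-learn pieces rather than this library's `LinearForestRegressor`; the dataset and hyperparameters are illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=300, n_features=6, random_state=0)

# Stage 1: fit a linear model on the whole dataset
linear = LinearRegression().fit(X, y)
residuals = y - linear.predict(X)

# Stage 2: fit a random forest on the residuals left by the linear model
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, residuals)

# Final prediction: raw linear prediction plus the forest-modeled residuals
y_pred = linear.predict(X) + forest.predict(X)
```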

**Linear Boosting** is a two stage learning process. Firstly, a linear model is trained on the initial dataset to obtains predictions. Secondly, the residuals of the previous step are modeled with a decision tree using all the available features. The tree identifies the path leading to highest error (i.e. the worst leaf). The leaf contributing to the error the most is used to generate a new binary feature to be used in the first stage. The iterations continue until a certain stopping criterion is met.
**Linear Boosting** is a two-stage learning process. Firstly, a linear model is trained on the initial dataset to obtain predictions. Secondly, the residuals of the previous step are modeled with a decision tree using all the available features. The tree identifies the path leading to the highest error (i.e. the worst leaf). The leaf contributing the most to the error is used to generate a new binary feature to be used in the first stage. The iterations continue until a certain stopping criterion is met.
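A rough sketch of a single Linear Boosting round, again with generic scikit-learn components rather than the library's `LinearBoostRegressor`; the worst-leaf selection rule below (largest total absolute residual) is an assumption made for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, random_state=0)

# First stage: linear model on the current feature set
linear = LinearRegression().fit(X, y)
residuals = y - linear.predict(X)

# Second stage: a decision tree models the residuals
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
leaves = tree.apply(X)

# Identify the "worst" leaf, i.e. the one contributing most to the error
worst_leaf = max(np.unique(leaves),
                 key=lambda leaf: np.abs(residuals[leaves == leaf]).sum())

# Turn membership in that leaf into a new binary feature and refit the linear stage;
# a further iteration would repeat this until a stopping criterion is met
X = np.column_stack([X, (leaves == worst_leaf).astype(float)])
linear = LinearRegression().fit(X, y)
```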

**linear-tree is developed to be fully integrable with scikit-learn**. ```LinearTreeRegressor``` and ```LinearTreeClassifier``` are provided as scikit-learn _BaseEstimator_ implementations that build a decision tree using linear estimators. ```LinearForestRegressor``` and ```LinearForestClassifier``` use the _RandomForest_ from sklearn to model residuals. ```LinearBoostRegressor``` and ```LinearBoostClassifier``` are also available as _TransformerMixin_ so they can be integrated into any pipeline, including for automated feature engineering. All the models available in [sklearn.linear_model](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) can be used as base learners.
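A basic usage sketch following the patterns documented in this README (the `base_estimator` parameter name is taken from the project's examples; any `sklearn.linear_model` estimator should work as base learner):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from lineartree import LinearBoostRegressor, LinearForestRegressor, LinearTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)

tree = LinearTreeRegressor(base_estimator=LinearRegression()).fit(X, y)
forest = LinearForestRegressor(base_estimator=Ridge()).fit(X, y)
boost = LinearBoostRegressor(base_estimator=Ridge()).fit(X, y)
```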

49 changes: 31 additions & 18 deletions lineartree/_classes.py
@@ -11,8 +11,6 @@

from sklearn.base import is_regressor
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.utils import check_array
from sklearn.utils.validation import has_fit_parameter, check_is_fitted

from ._criterion import SCORING
@@ -123,7 +121,6 @@ def _parallel_binning_fit(split_feat, _self, X, y,
model_right = DummyClassifier(strategy="most_frequent")

if weights is None:

model_left.fit(X[left_mesh], y[~mask])
loss_left = feval(model_left, X[left_mesh], y[~mask],
**largs_left)
@@ -135,17 +132,14 @@
wloss_right = loss_right * (n_right / n_sample)

else:

if support_sample_weight:

model_left.fit(X[left_mesh], y[~mask],
sample_weight=weights[~mask])

model_right.fit(X[right_mesh], y[mask],
sample_weight=weights[mask])

else:

model_left.fit(X[left_mesh], y[~mask])

model_right.fit(X[right_mesh], y[mask])
@@ -400,9 +394,7 @@ def _grow(self, X, y, weights=None):
self._leaves[queue[-1]] = self._nodes[queue[-1]]
del self._nodes[queue[-1]]
queue.pop()

else:

model_left, loss_left, wloss_left, n_left, class_left = \
left_node
model_right, loss_right, wloss_right, n_right, class_right = \
@@ -700,10 +692,16 @@ def apply(self, X):
"""
check_is_fitted(self, attributes='_nodes')

X = check_array(
X, accept_sparse=False, dtype=None,
force_all_finite=False)
self._check_n_features(X, reset=False)
X = self._validate_data(
X,
reset=False,
accept_sparse=False,
dtype='float32',
force_all_finite=True,
ensure_2d=True,
allow_nd=False,
ensure_min_features=self.n_features_in_
)

X_leaves = np.zeros(X.shape[0], dtype='int64')

@@ -733,10 +731,16 @@ def decision_path(self, X):
"""
check_is_fitted(self, attributes='_nodes')

X = check_array(
X, accept_sparse=False, dtype=None,
force_all_finite=False)
self._check_n_features(X, reset=False)
X = self._validate_data(
X,
reset=False,
accept_sparse=False,
dtype='float32',
force_all_finite=True,
ensure_2d=True,
allow_nd=False,
ensure_min_features=self.n_features_in_
)

indicator = np.zeros((X.shape[0], self.node_count), dtype='int64')

@@ -976,8 +980,17 @@ def transform(self, X):
`n_out` is equal to `n_features` + `n_estimators`
"""
check_is_fitted(self, attributes='base_estimator_')
X = check_array(X, dtype=np.float32, accept_sparse=False)
self._check_n_features(X, reset=False)

X = self._validate_data(
X,
reset=False,
accept_sparse=False,
dtype='float32',
force_all_finite=True,
ensure_2d=True,
allow_nd=False,
ensure_min_features=self.n_features_in_
)

for tree, leaf in zip(self._trees, self._leaves):
pred_tree = np.abs(tree.predict(X, check_input=False))
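For context on the `transform` method touched in the hunk above: since `LinearBoostRegressor` is also a `TransformerMixin`, `transform` returns the input augmented with the generated binary features. A hedged usage sketch follows; the `n_estimators` parameter and the `n_features + n_estimators` output width are assumptions based on the docstring fragment shown above and the project's documentation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from lineartree import LinearBoostRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

booster = LinearBoostRegressor(base_estimator=LinearRegression(), n_estimators=3)
Xt = booster.fit(X, y).transform(X)

print(Xt.shape)  # expected: (200, 5 + 3), i.e. n_features + n_estimators columns
```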
