Boosting method implementation (XGBoost) #1209

RomanKharkovskoy · 2023-11-27T15:54:11Z

Plan

How it works

The fit/predict interface is implemented in the parent class FedotXGBoostImplementation.

Code

class FedotXGBoostImplementation(ModelImplementation):
    __operation_params = ['n_jobs', 'use_eval_set']

    def __init__(self, params: Optional[OperationParameters] = None):
        super().__init__(params)

        self.model_params = {k: v for k, v in self.params.to_dict().items() if k not in self.__operation_params}
        self.model = None

    def fit(self, input_data: InputData):
        input_data = input_data.get_not_encoded_data()

        if self.params.get('use_eval_set'):
            train_input, eval_input = train_test_data_setup(input_data)

            train_input = self.convert_to_dataframe(train_input)
            eval_input = self.convert_to_dataframe(eval_input)

            train_x, train_y = train_input.drop(columns=['target']), train_input['target']
            eval_x, eval_y = eval_input.drop(columns=['target']), eval_input['target']

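            # Pick the evaluation metric from the task type:
            # no classes -> regression, two classes -> binary, otherwise multiclass
            if self.classes_ is None:
                eval_metric = 'rmse'
            elif len(self.classes_) < 3:
                eval_metric = 'auc'
            else:
                eval_metric = 'mlogloss'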

            self.model.fit(X=train_x, y=train_y,
                           eval_set=[(eval_x, eval_y)], eval_metric=eval_metric)

        else:

            train_data = self.convert_to_dataframe(input_data)
            train_x, train_y = train_data.drop(columns=['target']), train_data['target']
            self.model.fit(X=train_x, y=train_y)

        return self.model

    def predict(self, input_data: InputData):
        input_data = self.convert_to_dataframe(input_data.get_not_encoded_data())
        # convert_to_dataframe appends the target column; drop it before predicting
        features = input_data.drop(columns=['target'])
        prediction = self.model.predict(features)

        return prediction
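Two pieces of logic in the class above can be looked at in isolation: separating wrapper-only parameters from the ones forwarded to the estimator, and choosing the evaluation metric from the known classes. The following is a minimal standalone sketch of both. The helper names `split_params` and `choose_eval_metric` are hypothetical and do not exist in FEDOT; the parameter keys and metric names are taken from the snippet above.

```python
from typing import Any, Dict, Optional, Sequence, Tuple

# Keys consumed by the wrapper itself rather than forwarded to the estimator
# (same two keys as __operation_params in the class above).
OPERATION_PARAMS = ('n_jobs', 'use_eval_set')


def split_params(params: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: split a flat dict into (model params, wrapper params)."""
    model_params = {k: v for k, v in params.items() if k not in OPERATION_PARAMS}
    wrapper_params = {k: v for k, v in params.items() if k in OPERATION_PARAMS}
    return model_params, wrapper_params


def choose_eval_metric(classes: Optional[Sequence[Any]]) -> str:
    """Hypothetical helper mirroring the branch in fit()."""
    if classes is None:
        return 'rmse'      # no class labels: treated as regression
    if len(classes) < 3:
        return 'auc'       # binary classification
    return 'mlogloss'      # multiclass classification


model_params, wrapper_params = split_params(
    {'n_jobs': 4, 'use_eval_set': True, 'max_depth': 6}
)
print(model_params, wrapper_params)
print(choose_eval_metric(None), choose_eval_metric([0, 1]), choose_eval_metric([0, 1, 2]))
```

Keeping the metric choice in one small function makes it easy to cover with a unit test independently of model training.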

The fit/predict interface does not support xgboost's internal data type, xgboost.DMatrix, so a workaround was needed. In this case pandas.DataFrame is used instead.

Inside the interface, InputData is converted to a pandas.DataFrame (columns at categorical_idx become category, and columns at numerical_idx become float).

Code

@staticmethod
def convert_to_dataframe(data: Optional[InputData]):
    dataframe = pd.DataFrame(data=data.features, columns=data.features_names)
    dataframe['target'] = data.target

    if data.categorical_idx is not None:
        for col in dataframe.columns[data.categorical_idx]:
            dataframe[col] = dataframe[col].astype('category')

    if data.numerical_idx is not None:
        for col in dataframe.columns[data.numerical_idx]:
            dataframe[col] = dataframe[col].astype('float')

    return dataframe
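To see what the conversion produces, here is a small self-contained sketch that runs the same logic on toy data. `TabularData` is a hypothetical stand-in for FEDOT's InputData that carries only the fields the function reads, and `to_dataframe` is an illustrative copy of the method above, not the FEDOT API.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import pandas as pd


@dataclass
class TabularData:
    """Hypothetical stand-in for InputData with only the fields used here."""
    features: np.ndarray
    target: np.ndarray
    features_names: List[str]
    categorical_idx: Optional[np.ndarray] = None
    numerical_idx: Optional[np.ndarray] = None


def to_dataframe(data: TabularData) -> pd.DataFrame:
    """Build a DataFrame with typed feature columns plus a 'target' column."""
    dataframe = pd.DataFrame(data=data.features, columns=data.features_names)
    dataframe['target'] = data.target

    # Categorical features get the pandas 'category' dtype
    if data.categorical_idx is not None:
        for col in dataframe.columns[data.categorical_idx]:
            dataframe[col] = dataframe[col].astype('category')

    # Numerical features are cast to float
    if data.numerical_idx is not None:
        for col in dataframe.columns[data.numerical_idx]:
            dataframe[col] = dataframe[col].astype('float')

    return dataframe


sample = TabularData(
    features=np.array([['red', 1.0], ['blue', 2.5], ['red', 3.0]], dtype=object),
    target=np.array([0, 1, 0]),
    features_names=['color', 'size'],
    categorical_idx=np.array([0]),
    numerical_idx=np.array([1]),
)
frame = to_dataframe(sample)
print(frame.dtypes)
```

The category dtype matters because XGBoost's scikit-learn estimators can use it to recognize categorical columns when created with enable_categorical=True; whether this PR sets that flag is not shown in the snippet.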

Comparison of metrics from FEDOT and from a run without AutoML (imputer + categorical encoding)

Dataset	Metric	Run without AutoML	FEDOT
Internet-Advertisements	ROC-AUC	0.97622	0.97157
adult	ROC-AUC	0.92792	0.88797
Amazon_employee_access	ROC-AUC	0.81353	0.82456
credit-g	ROC-AUC	0.76364	0.73071
blood-transfusion-service-center	ROC-AUC	0.68082	0.70955
kc1	ROC-AUC	0.77625	0.80227
bank-marketing	ROC-AUC	0.93192	0.92422
qsar-biodeg	ROC-AUC	0.92399	0.91268
electricity	ROC-AUC	0.96874	0.96539
ozone-level-8hr	ROC-AUC	0.91429	0.89595
car	LogLoss	0.04482	0.69721
vehicle	LogLoss	0.60539	0.55317
mfeat-factors	LogLoss	0.12889	0.16635
pendigits	LogLoss	0.03452	0.04912
cardiotocography	LogLoss	0.00291	0.00478
page-blocks	LogLoss	0.10433	0.11678
nursery	LogLoss	0.00611	0.18016
mfeat-karhunen	LogLoss	0.19698	0.21434
anneal	LogLoss	0.41601	0.42384
satimage	LogLoss	0.25953	0.24396

The out-of-the-box metrics turned out better than FEDOT's on most datasets (15 of 20), which may be due to the internal preprocessing inside FEDOT.
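The comparison can be tallied directly from the table, keeping in mind that ROC-AUC is better when higher and LogLoss is better when lower. The sketch below copies the table values; the function and variable names are illustrative only.

```python
# (dataset, metric, run without AutoML, FEDOT), copied from the table above
ROWS = [
    ("Internet-Advertisements", "ROC-AUC", 0.97622, 0.97157),
    ("adult", "ROC-AUC", 0.92792, 0.88797),
    ("Amazon_employee_access", "ROC-AUC", 0.81353, 0.82456),
    ("credit-g", "ROC-AUC", 0.76364, 0.73071),
    ("blood-transfusion-service-center", "ROC-AUC", 0.68082, 0.70955),
    ("kc1", "ROC-AUC", 0.77625, 0.80227),
    ("bank-marketing", "ROC-AUC", 0.93192, 0.92422),
    ("qsar-biodeg", "ROC-AUC", 0.92399, 0.91268),
    ("electricity", "ROC-AUC", 0.96874, 0.96539),
    ("ozone-level-8hr", "ROC-AUC", 0.91429, 0.89595),
    ("car", "LogLoss", 0.04482, 0.69721),
    ("vehicle", "LogLoss", 0.60539, 0.55317),
    ("mfeat-factors", "LogLoss", 0.12889, 0.16635),
    ("pendigits", "LogLoss", 0.03452, 0.04912),
    ("cardiotocography", "LogLoss", 0.00291, 0.00478),
    ("page-blocks", "LogLoss", 0.10433, 0.11678),
    ("nursery", "LogLoss", 0.00611, 0.18016),
    ("mfeat-karhunen", "LogLoss", 0.19698, 0.21434),
    ("anneal", "LogLoss", 0.41601, 0.42384),
    ("satimage", "LogLoss", 0.25953, 0.24396),
]


def baseline_is_better(metric: str, baseline: float, fedot: float) -> bool:
    """ROC-AUC: higher is better. LogLoss: lower is better."""
    if metric == "ROC-AUC":
        return baseline > fedot
    return baseline < fedot


baseline_wins = [name for name, metric, base, fedot in ROWS
                 if baseline_is_better(metric, base, fedot)]
fedot_wins = [name for name, metric, base, fedot in ROWS
              if not baseline_is_better(metric, base, fedot)]

print(f"Run without AutoML is better on {len(baseline_wins)} of {len(ROWS)} datasets")
print("FEDOT is better on:", ", ".join(fedot_wins))
```

The largest gaps are on car and nursery, where FEDOT's LogLoss is many times higher than the baseline's.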

pep8speaks · 2023-11-27T15:54:18Z

Hello @RomanKharkovskoy! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-07-24 17:08:37 UTC

nicl-nno · 2023-11-27T18:47:46Z

Submitting PRs for review without a description and with PEP 8 defects is bad practice.

The changes are not covered by tests (these are needed if the behavior differs in any way from the previous xgb implementation).

codecov · 2023-12-05T13:39:39Z

Codecov Report

Attention: Patch coverage is 77.08333% with 22 lines in your changes missing coverage. Please review.

Project coverage is 80.10%. Comparing base (a7e4243) to head (f20d98e).
Report is 1 commits behind head on master.

Files	Patch %	Lines
...mplementations/models/boostings_implementations.py	76.84%	22 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1209      +/-   ##
==========================================
+ Coverage   79.96%   80.10%   +0.14%     
==========================================
  Files         146      146              
  Lines       10100    10190      +90     
==========================================
+ Hits         8076     8163      +87     
- Misses       2024     2027       +3


github-actions · 2024-01-10T13:01:13Z

All PEP8 errors has been fixed, thanks ❤️

Comment last updated at

andreygetmanov

I don't quite understand how the comparison with FEDOT is done.
Is it out-of-the-box xgboost versus FEDOT with a single model (xgboost)?
If so, before implementing it would be worth taking more datasets (at least 10), running each at least 5 times, and averaging the results.
@nicl-nno, perhaps you know better numbers or can suggest datasets?
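The repeated-run protocol suggested here can be sketched as a small helper that runs an experiment once per seed and reports the mean and standard deviation. The names `average_metric` and `fake_run` are hypothetical; `fake_run` only simulates a score and stands in for a real train-and-evaluate call.

```python
import random
import statistics
from typing import Callable, Iterable, Tuple


def average_metric(run_once: Callable[[int], float],
                   seeds: Iterable[int]) -> Tuple[float, float]:
    """Run an experiment once per seed and return (mean, sample std) of its metric."""
    scores = [run_once(seed) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread


def fake_run(seed: int) -> float:
    # Hypothetical stand-in: a real run would fit the model with this seed
    # and return its metric on held-out data.
    rng = random.Random(seed)
    return 0.90 + rng.uniform(-0.01, 0.01)


mean_score, std_score = average_metric(fake_run, seeds=range(5))
print(f"mean={mean_score:.4f}, std={std_score:.4f}")
```

Reporting the spread alongside the mean shows whether a gap between two setups is larger than the run-to-run noise.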

fedot/core/operations/evaluation/operation_implementations/models/boostings_implementations.py

fedot/core/pipelines/tuning/search_space.py

fedot/core/repository/data/default_operation_params.json

RomanKharkovskoy · 2024-02-19T19:02:17Z

I don't quite understand how the comparison with FEDOT is done. Is it out-of-the-box xgboost versus FEDOT with a single model (xgboost)? If so, before implementing it would be worth taking more datasets (at least 10), running each at least 5 times, and averaging the results. @nicl-nno, perhaps you know better numbers or can suggest datasets?

I took 10 datasets each for the classification tasks (and replaced the table in the PR description). XGBoost is used out of the box, with missing values filled by sklearn's SimpleImputer and categorical features encoded by sklearn's OneHotEncoder. For the comparison, FEDOT is run with predefined_model='xgboost'.
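The baseline preprocessing described in this comment might look roughly like the sketch below. It covers only the imputation and one-hot encoding steps with scikit-learn. The column names, the toy data, the imputation strategies and the `build_preprocessor` helper are assumptions made for illustration; the comment does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_preprocessor(numeric_cols, categorical_cols) -> ColumnTransformer:
    """Hypothetical helper: impute missing values, then one-hot encode categoricals."""
    numeric = SimpleImputer(strategy='mean')  # assumed strategy
    categorical = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),  # assumed strategy
        ('encode', OneHotEncoder(handle_unknown='ignore')),
    ])
    return ColumnTransformer([
        ('num', numeric, numeric_cols),
        ('cat', categorical, categorical_cols),
    ])


# Toy data with one missing value in each column
frame = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'city': pd.Series(['A', 'B', np.nan, 'A'], dtype=object),
})

preprocessor = build_preprocessor(['age'], ['city'])
encoded = preprocessor.fit_transform(frame)
print(encoded.shape)
```

In a full baseline, an XGBoost estimator would follow the preprocessor as the final step of a Pipeline, while the FEDOT side of the comparison uses predefined_model='xgboost' as stated in the comment.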

andreygetmanov · 2024-02-20T14:01:22Z

I don't quite understand how the comparison with FEDOT is done. Is it out-of-the-box xgboost versus FEDOT with a single model (xgboost)? If so, before implementing it would be worth taking more datasets (at least 10), running each at least 5 times, and averaging the results. @nicl-nno, perhaps you know better numbers or can suggest datasets?

I took 10 datasets each for the classification tasks (and replaced the table in the PR description). XGBoost is used out of the box, with missing values filled by sklearn's SimpleImputer and categorical features encoded by sklearn's OneHotEncoder. For the comparison, FEDOT is run with predefined_model='xgboost'.

And how many runs are you averaging over?

test/unit/api/test_assumption_builder.py

aPovidlo · 2024-07-24T17:07:27Z

/fix-pep8

aPovidlo

-- wip --

RomanKharkovskoy requested review from aPovidlo and nicl-nno November 27, 2023 15:54

RomanKharkovskoy force-pushed the xgb_impl branch from 11a5395 to a9bd634 Compare December 26, 2023 14:17

RomanKharkovskoy force-pushed the xgb_impl branch from 37d7326 to f0c4098 Compare January 31, 2024 13:05

aPovidlo requested a review from andreygetmanov February 5, 2024 14:10

andreygetmanov requested changes Feb 7, 2024

RomanKharkovskoy force-pushed the xgb_impl branch from e770fc1 to dacd08e Compare February 19, 2024 12:50

RomanKharkovskoy requested a review from andreygetmanov February 19, 2024 12:51

andreygetmanov requested changes Feb 20, 2024

RomanKharkovskoy added 15 commits July 23, 2024 16:44

first iteration

c6a2de5

added booster hyperparametr for xgboost in search_space.py, changed m…

1175797

…eta for xgboost and xgbreg in model_repository.json

Added XGBoost implementation without DMatrix

c1ac9e5

Edited default params for XGBoost

fd44563

Added L1 and L2 reg

ed27ff6

added 2 parametrs for tuning

e89e7f7

added tree_method

f67dd4d

changed xgbreg to xgboostreg

516fafd

pep8 fix

a673b38

convert to dataframe implemantation

55cba3a

regression params

cc887a9

changed fit/predict for regression

2c3ed2c

added eval metrics

49b27e8

added use_eval_set

f07b9c5

added xgboost to default operations

7e085b6

RomanKharkovskoy and others added 5 commits July 23, 2024 16:59

xgboost included in composing

71ef29f

pep8 fix

5a9ffce

changed test where xgboost was excluede

83499fb

pep8 fix

c5f4b72

Update after reabse

604b747

aPovidlo force-pushed the xgb_impl branch from dacd08e to 604b747 Compare July 23, 2024 14:11

aPovidlo added 8 commits July 23, 2024 17:57

Update feature importance

74545d0

Fixing bug & adding early stopping param

a0df6ce

Fixing unit test and integration tests

1640891

Separate setting eval_metric, fixes with gblinear booster strategy, f…

9225097

…ix unit test

fix unit test

743cb2c

bug fix

12ab217

fix feature importance

4590147

fix feature importance for catboost

5a5d0ea

Automated autopep8 fixes

f20d98e

aPovidlo requested a review from andreygetmanov July 24, 2024 17:10

aPovidlo reviewed Jul 24, 2024

aPovidlo self-assigned this Jul 24, 2024

andreygetmanov approved these changes Jul 26, 2024

aPovidlo merged commit 80eba8e into master Jul 26, 2024
7 checks passed

aPovidlo mentioned this pull request Jul 31, 2024

enh: Post-Improving Boosting Models #1315

Open
