Maybe an incorrect number in the paper. #3
Comments
Hi, thanks for your interest. I may have written the wrong number, but I cannot remember precisely. It is also possible that we obtained different thresholds under different pre-processing strategies. Generally, both 10 and 20 are good enough to filter out noise, so I think 10 is fine in your setting. If this number really matters, you could check the minimum occurrence of features in my processed Avazu dataset, linked in the README, where the low-frequency categories have already been dropped. If the number is verified to be 10, I will update the paper on arXiv. Thanks!
I'm sorry for the late reply. Here are the code and output from using your data interface (https://github.com/Atomu2014/Ads-RecSys-Datasets):

import numpy as np
import pandas as pd
from datasets import Avazu

ava = Avazu()
ava.load_data('train')
ava.load_data('test')
# Stack the train and test features into one frame, one column per field.
df_avazu = pd.DataFrame(np.vstack([ava.X_train, ava.X_test]), columns=ava.feat_names)
for field in ava.feat_names:
    field_cnt = field + '_cnt'
    # Count how often each category of this field occurs.
    gbdf = df_avazu.groupby(field).size().reset_index().rename(columns={0: field_cnt})
    # The smallest count is the minimum category frequency for this field.
    min_freq = gbdf.sort_values(field_cnt)[field_cnt].values[0]
    print(f"{field}'s minimum feature frequency is {min_freq}")

The output is as follows (ignoring the data-loading messages):
Thanks a lot! I will fix it later!
'Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data' (TOIS'17)
Section 5.1.1 (Datasets) of this paper says: "We randomly split the public dataset into training and test sets at 4:1, and remove categories appearing less than 20 times to reduce dimensionality."
However, when I preprocessed the raw Avazu dataset myself, I found that reaching #categories = 6*10^5 requires a threshold of 10, not 20.
With a threshold of 20, I get #categories < 4*10^5.
Is the threshold in section 5.1.1 incorrect?
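For reference, the comparison described above can be sketched as follows. This is only a minimal illustration on toy data, not the actual Avazu preprocessing: the helper `count_categories` and the toy column values are hypothetical, and it assumes "appearing less than N times" means categories with a count below N are dropped.

```python
import pandas as pd


def count_categories(df: pd.DataFrame, threshold: int) -> int:
    """Count (field, value) pairs that occur at least `threshold` times."""
    total = 0
    for field in df.columns:
        counts = df[field].value_counts()
        # Keep only categories that meet the frequency threshold.
        total += int((counts >= threshold).sum())
    return total


# Toy data standing in for the raw categorical fields (hypothetical values).
toy = pd.DataFrame({
    "site_id": ["a"] * 25 + ["b"] * 12 + ["c"] * 3,
    "app_id": ["x"] * 20 + ["y"] * 15 + ["z"] * 5,
})

for threshold in (10, 20):
    print(threshold, count_categories(toy, threshold))
```

On the toy data a threshold of 10 keeps 4 categories and a threshold of 20 keeps 2, which mirrors how the total category count shrinks as the threshold grows.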