Maybe an incorrect number in the paper. #3

Open
mOmUcf opened this issue Sep 10, 2019 · 3 comments

Comments

@mOmUcf

mOmUcf commented Sep 10, 2019

'Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data' (TOIS'17)

In Section 5.1.1 (Datasets) of this paper, it says: "We randomly split the public dataset into training and test sets at 4:1, and remove categories appearing less than 20 times to reduce dimensionality."
But when I preprocessed the raw Avazu dataset myself, I found that to get #categories = 6*10^5 on Avazu, the threshold needs to be 10, not 20.
When I use a threshold of 20, I get #categories < 4*10^5.

Is the threshold in Section 5.1.1 incorrect?
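
For readers who want to reproduce this kind of check, here is a minimal sketch of counting how many categories survive a given frequency threshold. It runs on a made-up toy table, not on Avazu. The helper name `count_kept_categories` and the column values are hypothetical, and it assumes that dropped categories are simply not counted, which may differ from the paper's actual preprocessing.

```python
import pandas as pd


def count_kept_categories(df: pd.DataFrame, threshold: int) -> int:
    """Count categories, summed over all fields, that appear at least `threshold` times."""
    total = 0
    for field in df.columns:
        counts = df[field].value_counts()  # occurrences of each category in this field
        total += int((counts >= threshold).sum())
    return total


# Toy data, invented for illustration only (not the Avazu dataset).
toy = pd.DataFrame({
    "site_id": ["a"] * 25 + ["b"] * 12 + ["c"] * 3,
    "app_id": ["x"] * 20 + ["y"] * 10 + ["z"] * 10,
})

for threshold in (10, 20):
    print(threshold, count_kept_categories(toy, threshold))
```

Running the same function on the raw data with both thresholds and comparing the totals against the #categories reported in the paper is one way to see which threshold matches.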

@Atomu2014
Owner

Hi, thanks for your interest. I may have written the wrong number, but I cannot remember it precisely. It is also possible that we obtained different thresholds under different pre-processing strategies. Generally, both 10 and 20 are good enough to filter out noise, so I think 10 is OK in your setting.

If this number really matters, you can check the minimum occurrence of features in my processed Avazu dataset, which is linked in the README and in which the low-frequency categories have already been dropped.

If the number is verified to be 10, I will update the paper on arXiv. Thanks!

@mOmUcf
Author

mOmUcf commented Oct 5, 2019

I'm sorry I did not reply in time. Here are the code and output from using your data interface: https://github.com/Atomu2014/Ads-RecSys-Datasets

import numpy as np
import pandas as pd
from datasets import Avazu

# Load the processed train and test splits through the repository's data interface.
ava = Avazu()
ava.load_data('train')
ava.load_data('test')

# Stack both splits so that frequencies are counted over the whole processed dataset.
df_avazu = pd.DataFrame(np.vstack([ava.X_train, ava.X_test]), columns=ava.feat_names)

# For each field, report the occurrence count of its rarest category.
for field in ava.feat_names:
    min_freq = df_avazu.groupby(field).size().min()
    print(f"{field}'s minimum feature frequence is {min_freq}")

The output is as follows (ignoring the data-loading information):

C1's minimum feature frequence is 5787
banner_pos's minimum feature frequence is 2035
site_id's minimum feature frequence is 10
site_domain's minimum feature frequence is 10
site_category's minimum feature frequence is 10
app_id's minimum feature frequence is 10
app_domain's minimum feature frequence is 10
app_category's minimum feature frequence is 16
device_id's minimum feature frequence is 10
device_ip's minimum feature frequence is 10
device_model's minimum feature frequence is 10
device_type's minimum feature frequence is 31
device_conn_type's minimum feature frequence is 42890
C14's minimum feature frequence is 10
C15's minimum feature frequence is 1621
C16's minimum feature frequence is 1621
C17's minimum feature frequence is 12
C18's minimum feature frequence is 2719623
C19's minimum feature frequence is 10
C20's minimum feature frequence is 23
C21's minimum feature frequence is 497
mday's minimum feature frequence is 3225010
hour's minimum feature frequence is 818771
wday's minimum feature frequence is 3225010
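
As a rough way to summarize the listing, the smallest per-field minimum can be taken as the lowest frequency that survived filtering. The sketch below copies only a subset of the values printed above into a hypothetical dictionary, `min_freq_by_field`. It assumes the counts over train plus test reflect the filtering that was actually applied.

```python
# Per-field minimums copied from the printed output (a subset, for brevity).
min_freq_by_field = {
    "C1": 5787,
    "site_id": 10,
    "app_category": 16,
    "device_type": 31,
    "C17": 12,
    "C20": 23,
}

# The smallest surviving frequency across the listed fields.
global_min = min(min_freq_by_field.values())
print(global_min)
```

Categories with only 10 occurrences remain in the processed data, which is consistent with a threshold of 10 rather than 20.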

@Atomu2014
Owner

Thanks a lot! I will fix it later!
