Maybe an incorrect number in the paper. #3

mOmUcf · 2019-09-10T06:22:10Z

'Product-based Neural Networks for User Response Prediction over Multi-field Categorical Data' (TOIS'17)

Section 5.1.1 (Datasets) of this paper says: "We randomly split the public dataset into training and test sets at 4:1, and remove categories appearing less than 20 times to reduce dimensionality."
But when I preprocessed the raw Avazu dataset myself, I found that getting #categories = 6*10^5 requires a threshold of 10, not 20.
With a threshold of 20, #categories < 4*10^5.

Is the threshold stated in section 5.1.1 incorrect?
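For anyone who wants to reproduce the comparison, here is a minimal sketch of how the category count can be checked for a given threshold. It assumes the raw data is already loaded into a pandas DataFrame with one column per field. The helper name `count_kept_categories` and the toy data are made up for illustration and are not part of the paper's code or the dataset repository.

```python
import pandas as pd


def count_kept_categories(df: pd.DataFrame, threshold: int) -> int:
    """Count categories, summed over all fields, that occur at least `threshold` times.

    Categories appearing fewer than `threshold` times are treated as removed,
    following the wording quoted from section 5.1.1.
    """
    kept = 0
    for field in df.columns:
        counts = df[field].value_counts()  # occurrences of each category in this field
        kept += int((counts >= threshold).sum())
    return kept


# Toy data (hypothetical): two fields with known category frequencies.
toy = pd.DataFrame({
    "site_id": ["a"] * 25 + ["b"] * 12 + ["c"] * 3,
    "app_id": ["x"] * 20 + ["y"] * 10 + ["z"] * 10,
})

for threshold in (10, 20):
    print(threshold, count_kept_categories(toy, threshold))
```

On the real dataset, the same loop would be run once with threshold 10 and once with 20, and the totals compared against the #categories reported in the paper.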

Atomu2014 · 2019-09-10T17:22:29Z

Hi, first of all, thanks for your interest. I may have written the wrong number, but I cannot remember it precisely. It is also possible that we obtained different thresholds under different pre-processing strategies. Generally, both 10 and 20 are good enough to filter out noise, so I think 10 is OK in your setting.

If this number really matters, you can check the minimum occurrence of features in my processed Avazu dataset, which is linked in the README and has the low-frequency categories already dropped.

If this number is verified to be 10, I will update the paper on arXiv. Thanks!

mOmUcf · 2019-10-05T13:59:25Z

Sorry for the late reply. Here are the code and output from using your data interface: https://github.com/Atomu2014/Ads-RecSys-Datasets

import numpy as np
import pandas as pd
from datasets import Avazu

# Load the preprocessed Avazu train and test splits
ava = Avazu()
ava.load_data('train')
ava.load_data('test')

# Stack both splits into one frame, one column per field
df_avazu = pd.DataFrame(np.vstack([ava.X_train, ava.X_test]), columns=ava.feat_names)

# For each field, report the count of its least frequent category
for field in ava.feat_names:
    min_freq = df_avazu[field].value_counts().min()
    print(f"{field}'s minimum feature frequence is {min_freq}")

The output is as follows (omitting the data-loading messages):

C1's minimum feature frequence is 5787
banner_pos's minimum feature frequence is 2035
site_id's minimum feature frequence is 10
site_domain's minimum feature frequence is 10
site_category's minimum feature frequence is 10
app_id's minimum feature frequence is 10
app_domain's minimum feature frequence is 10
app_category's minimum feature frequence is 16
device_id's minimum feature frequence is 10
device_ip's minimum feature frequence is 10
device_model's minimum feature frequence is 10
device_type's minimum feature frequence is 31
device_conn_type's minimum feature frequence is 42890
C14's minimum feature frequence is 10
C15's minimum feature frequence is 1621
C16's minimum feature frequence is 1621
C17's minimum feature frequence is 12
C18's minimum feature frequence is 2719623
C19's minimum feature frequence is 10
C20's minimum feature frequence is 23
C21's minimum feature frequence is 497
mday's minimum feature frequence is 3225010
hour's minimum feature frequence is 818771
wday's minimum feature frequence is 3225010

Atomu2014 · 2019-10-05T17:49:59Z

Thanks a lot! I will fix it later!
