Skip to content

A huge rec dataset with raw text/audio/image/videos provided (Talk Invited at DeepMind).

Notifications You must be signed in to change notification settings

ShawnChenn/MicroLens

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

A Content-Driven Micro-Video Recommendation Dataset at Scale

Invited Talk by DeepMind (Slides)

Dataset

Download link: https://recsys.westlake.edu.cn/MicroLens-Dataset/

We temporarily release 50,000 users as well as their associated multi-modal data (with raw text/image/audio/video provided) through the given link. To acquire the complete MicroLens dataset, kindly reach out to the corresponding author via email. If you have an innovative idea for building a foundational recommendation model but require a large dataset and computational resources, consider joining our lab as an intern. We can provide access to 100 NVIDIA 80G A100 GPUs and a billion-level dataset of user-video/image/text interactions.

We also release a subset of our MicroLens for multimodal fairness recommendation, which can be downloaded from https://recsys.westlake.edu.cn/MicroLens-Fairness-Dataset/

Citation

If you use our dataset, code or find MicroLens useful in your work, please cite our paper as:

@article{ni2023content,
  title={A Content-Driven Micro-Video Recommendation Dataset at Scale},
  author={Ni, Yongxin and Cheng, Yu and Liu, Xiangyan and Fu, Junchen and Li, Youhua and He, Xiangnan and Zhang, Yongfeng and Yuan, Fajie},
  journal={arXiv preprint arXiv:2309.15379},
  year={2023}
}

Kind Reminder

Note that you should not perform secondary processing on the dataset provided in this paper and offer a new download. If you need to do so, we encourage you to open source only the code for data processing and cite our work for data downloading and local processing.

Code

We have released the codes for all algorithms, including VideoRec (which implements all 15 video models in this project), IDRec, and VIDRec. For more details, please refer to the following paths: "Code/VideoRec", "Code/IDRec", and "Code/VIDRec". Each folder contains multiple subfolders, with each subfolder representing the code for a baseline.

Special instructions on VideoRec

In VideoRec, if you wish to switch to a different training mode, please execute the following Python scripts: 'run_id.py', 'run_text.py', 'run_image.py', and 'run_video.py'. For testing, you can use 'run_id_test.py', 'run_text_test.py', 'run_image_test.py', and 'run_video_test.py', respectively. Please see the path "Code/VideoRec/SASRec" for more details.

Before running the training script, please make sure to modify the dataset path, item encoder, pretrained model path, GPU devices, GPU numbers, and hyperparameters. Additionally, remember to specify the best validation checkpoint (e.g., 'epoch-30.pt') before running the test script.

Note that you will need to prepare an LMDB file and specify it in the scripts before running image-based or video-based VideoRec. To assist with this, we have provided a Python script for LMDB generation. Please refer to 'Data Generation/generate_cover_frames_lmdb.py' for more details.

Special instructions on IDRec and VIDRec

In IDRec, see IDRec\process_data.ipynb to process the interaction data. Execute the following Python scripts: 'main.py' under each folder to run the corresponding baselines. The data path, model parameters can be modified by changing the yaml file under each folder.

Environments

python==3.8.12
Pytorch==1.8.0
cudatoolkit==11.1
torchvision==0.9.0
transformers==4.23.1

Baseline Evaluation

Video Understanding Meets Recommender Systems

News

The laboratory is hiring research assistants, interns, doctoral students, and postdoctoral researchers. Please contact the corresponding author for details.

实验室招聘科研助理,实习生,博士生和博士后,请联系通讯作者。

About

A huge rec dataset with raw text/audio/image/videos provided (Talk Invited at DeepMind).

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 96.4%
  • Jupyter Notebook 2.8%
  • JavaScript 0.6%
  • CSS 0.2%
  • Batchfile 0.0%
  • Makefile 0.0%