# Ensemble Modeling for Multimodal Visual Action Recognition

Official code repo for **Ensemble Modeling for Multimodal Visual Action Recognition** [ICIAP-W 2023 ${\color{red}Competition~Winner}$]. Links: Project | arXiv

## Installation

```
conda create -n mm python=3.11.4
conda activate mm
conda install pytorch=2.0.1 torchvision=0.15.2 torchaudio=2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge tqdm
pip install opencv-python
pip install fvcore
pip install timm
pip install mmcv==1.3.11
pip install einops
pip install scikit-learn
pip install focal-loss-torch
pip install pandas
pip install seaborn
```
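
Optionally, a quick sanity check (not part of the official setup) to confirm the pinned PyTorch stack installed correctly:

```python
# Verify the pinned versions and CUDA availability after installation.
import torch
import torchvision

print(torch.__version__)          # expected: 2.0.1
print(torchvision.__version__)    # expected: 0.15.2
print(torch.cuda.is_available())  # should be True on a CUDA 11.8 machine
```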

## Dataset preparation

Download the following components of the Meccano dataset from the official website:
- RGB frames
- Depth frames
- Action annotations

Update `config.py` (`data_dir`) to reflect the dataset location.
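
For reference, a minimal sketch of the relevant `config.py` entry; the path below is a placeholder, and only `data_dir` is named in the instructions above:

```python
# config.py -- dataset location (sketch; the path is a placeholder)
data_dir = '/path/to/Meccano'  # root folder holding the RGB frames, Depth frames, and action annotations
```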

## Training

We train the RGB and Depth modalities individually.

Update `config.py` (`train_run_id`, `train_modality`, `train_weights_dir`, `train_ss_wt_file`) to reflect the relevant details.

Run:

```
python -u train.py
```
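
A minimal sketch of the training-related `config.py` fields; the field names come from the step above, while the values are illustrative assumptions:

```python
# config.py -- training fields (values are illustrative assumptions, not repo defaults)
train_run_id = 'rgb_run_01'              # identifier used to tag this training run
train_modality = 'RGB'                   # 'RGB' or 'Depth'; each modality is trained separately
train_weights_dir = './weights'          # directory where checkpoints are saved
train_ss_wt_file = './swin3db_ssv2.pth'  # Something-Something v2 pre-trained Swin3D-B weights
```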

## Testing (individual modalities or ensemble)

1. Test individual modalities (RGB or Depth).

   Update `config.py` (`test_wt_file`, `test_modality`) to reflect the relevant details.

   Run:

   ```
   python -u test.py
   ```

2. Obtain class probabilities averaged from the RGB and Depth pathways (${\color{red}Competition~Result}$); see the sketch after this list.

   Update `config.py` (`test_wt_file_1`, `test_wt_file_2`) to reflect the relevant details.

   Run:

   ```
   python -u test_mm.py
   ```
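
Conceptually, the ensemble step averages the per-class probabilities from the two pathways. A minimal sketch of that idea, assuming each pathway yields a `(batch, num_classes)` logits tensor (function and variable names are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(rgb_logits: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """Average class probabilities from the RGB and Depth pathways,
    then pick the most likely action class per clip."""
    rgb_probs = F.softmax(rgb_logits, dim=-1)
    depth_probs = F.softmax(depth_logits, dim=-1)
    avg_probs = 0.5 * (rgb_probs + depth_probs)
    return avg_probs.argmax(dim=-1)
```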

## Pre-trained weights

We use the Swin3D-B backbone, pre-trained on the Something-Something v2 dataset.
- Swin3D-B with Something-Something v2 pre-training: Google Drive

The RGB frames and Depth maps are passed through two independently trained Swin3D-B encoders, and the class probabilities from the two pathways are averaged to yield the final action classes.
- Ours (RGB) with Something-Something v2 pre-training: Google Drive
- Ours (Depth) with Something-Something v2 pre-training: Google Drive
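
The training and test scripts pick up these checkpoints via `config.py`; if you need to load one manually, a generic PyTorch pattern like the sketch below usually works (the `'state_dict'` key and the use of `strict=False` are assumptions about the checkpoint layout):

```python
import torch
from torch import nn

def load_pretrained(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Load released weights into a Swin3D-B instance (layout assumed:
    either a raw state_dict, or a dict wrapping it under 'state_dict')."""
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    state_dict = checkpoint.get('state_dict', checkpoint)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
    return model
```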

## Credits

Thanks to https://github.com/SwinTransformer/Video-Swin-Transformer for the Swin3D-B implementation.

## Citation

```bibtex
@article{kini2023ensemble,
  title={Ensemble Modeling for Multimodal Visual Action Recognition},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2308.05430},
  year={2023}
}

@article{kini2023egocentric,
  title={Egocentric RGB+Depth Action Recognition in Industry-Like Settings},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2309.13962},
  year={2023}
}
```

## Contact

If you have any inquiries or require assistance, please reach out to Jyoti Kini ([email protected]).
