# Ensemble Modeling for Multimodal Visual Action Recognition

Official code repo for **Ensemble Modeling for Multimodal Visual Action Recognition** [ICIAP-W 2023 ${\color{red}Competition~Winner}$]. Links: Project | arXiv

## Installation

```
conda create -n mm python=3.11.4
conda activate mm
conda install pytorch=2.0.1 torchvision=0.15.2 torchaudio=2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge tqdm
pip install opencv-python
pip install fvcore
pip install timm
pip install mmcv==1.3.11
pip install einops
pip install scikit-learn
pip install focal-loss-torch
pip install pandas
pip install seaborn
```
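
Optionally, a quick sanity check (not part of the official setup) to confirm the pinned PyTorch stack installed correctly:

```python
# Verify the pinned versions and CUDA availability after installation.
import torch
import torchvision

print(torch.__version__)          # expected: 2.0.1
print(torchvision.__version__)    # expected: 0.15.2
print(torch.cuda.is_available())  # should be True on a CUDA 11.8 machine
```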

## Dataset preparation

Download the following components of the Meccano dataset from the official website:
- RGB frames
- Depth frames
- Action annotations

Update `config.py` (`data_dir`) to reflect the dataset location.
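
For reference, a minimal sketch of the relevant `config.py` entry; the path below is a placeholder, and only `data_dir` is named in the instructions above:

```python
# config.py -- dataset location (sketch; the path is a placeholder)
data_dir = '/path/to/Meccano'  # root folder holding the RGB frames, Depth frames, and action annotations
```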

## Training

We train the RGB and Depth modalities individually.

Update `config.py` (`train_run_id`, `train_modality`, `train_weights_dir`, `train_ss_wt_file`) to reflect the relevant details.

Run:

```
python -u train.py
```
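
A minimal sketch of the training-related `config.py` fields; the field names come from the step above, while the values are illustrative assumptions:

```python
# config.py -- training fields (values are illustrative assumptions, not repo defaults)
train_run_id = 'rgb_run_01'              # identifier used to tag this training run
train_modality = 'RGB'                   # 'RGB' or 'Depth'; each modality is trained separately
train_weights_dir = './weights'          # directory where checkpoints are saved
train_ss_wt_file = './swin3db_ssv2.pth'  # Something-Something v2 pre-trained Swin3D-B weights
```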

## Testing (individual modalities or ensemble)

1. Test individual modalities (RGB or Depth).

   Update `config.py` (`test_wt_file`, `test_modality`) to reflect the relevant details.

   Run:

   ```
   python -u test.py
   ```

2. Obtain class probabilities averaged from the RGB and Depth pathways (${\color{red}Competition~Result}$); see the sketch after this list.

   Update `config.py` (`test_wt_file_1`, `test_wt_file_2`) to reflect the relevant details.

   Run:

   ```
   python -u test_mm.py
   ```
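
Conceptually, the ensemble step averages the per-class probabilities from the two pathways. A minimal sketch of that idea, assuming each pathway yields a `(batch, num_classes)` logits tensor (function and variable names are illustrative, not from the repo):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(rgb_logits: torch.Tensor, depth_logits: torch.Tensor) -> torch.Tensor:
    """Average class probabilities from the RGB and Depth pathways,
    then pick the most likely action class per clip."""
    rgb_probs = F.softmax(rgb_logits, dim=-1)
    depth_probs = F.softmax(depth_logits, dim=-1)
    avg_probs = 0.5 * (rgb_probs + depth_probs)
    return avg_probs.argmax(dim=-1)
```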

## Pre-trained weights

We use the Swin3D-B backbone, pre-trained on the Something-Something v2 dataset.
- Swin3D-B with Something-Something v2 pre-training: Google Drive

The RGB frames and Depth maps are passed through two independently trained Swin3D-B encoders, and the class probabilities from the two pathways are averaged to yield the final action classes.
- Ours (RGB) with Something-Something v2 pre-training: Google Drive
- Ours (Depth) with Something-Something v2 pre-training: Google Drive
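
The training and test scripts pick up these checkpoints via `config.py`; if you need to load one manually, a generic PyTorch pattern like the sketch below usually works (the `'state_dict'` key and the use of `strict=False` are assumptions about the checkpoint layout):

```python
import torch
from torch import nn

def load_pretrained(model: nn.Module, ckpt_path: str) -> nn.Module:
    """Load released weights into a Swin3D-B instance (layout assumed:
    either a raw state_dict, or a dict wrapping it under 'state_dict')."""
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    state_dict = checkpoint.get('state_dict', checkpoint)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')
    return model
```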

## Credits

Thanks to https://github.com/SwinTransformer/Video-Swin-Transformer for the Swin3D-B implementation.

## Citation

```bibtex
@article{kini2023ensemble,
  title={Ensemble Modeling for Multimodal Visual Action Recognition},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2308.05430},
  year={2023}
}

@article{kini2023egocentric,
  title={Egocentric RGB+Depth Action Recognition in Industry-Like Settings},
  author={Kini, Jyoti and Fleischer, Sarah and Dave, Ishan and Shah, Mubarak},
  journal={arXiv preprint arXiv:2309.13962},
  year={2023}
}
```

## Contact

If you have any inquiries or require assistance, please reach out to Jyoti Kini ([email protected]).
