WarCov dataset

Dataset from popular social media platform containg posts associated with the subject of COVID-19 pandemic and war in the Ukraine in Polish language. Apart from the texts it also contains images attached to posts and proceduraly generated labels.

Download and usage

Dataset is available to download on Kaggle.

Additional representations

Upon request on the project's GitHub page, we declare that we will extract features from the original dataset using any indicated method if possible to run with the computational resources of our University. The obtained embeddings will be added to a publicly available data repository in such a case. We also encourage researchers to fork the solution for their projects.

Tasks

Dataset is dedicated to use in classification task in three main domains:

text classification (Natural Language Processing),
image classification,
multimodal classification with both representations.

More information can be found in the preprint article (submitted to the Thirty-eighth Annual Conference on Neural Information Processing Systems - NeurIPS 2024). Available on arXiv.

Repository content

Repository contains preprocessing flow and complete classification research. In main directory experiments and scripts for analysis purposes are stored.

utils.py - additional tools used in experiments
analyze_class_distribution.py - visualizations of the labels distribution for hashtagged texts and images
analyze_multimodal.py - calculating metric scores for Late Fusion approach to multimodal classification and visualization of the results
dataset_pca.py – Performs PCA with maximum number of components in order to preprare the dataset for publication.
exp_1.py - classification experiment on all texts in the dataset with usage of two multilabel and five basic classifiers
scores_1.py - based on predictions from exp_1.py metrics scores are calculated
analyze_1.py - visualizations of the first experiments results
exp_2.py - classification experiment on texts embeddings for posts which contained images with the classifier which performed the best in first experiment
exp_3.py - classification experiment on preprocessed images using ResNet18
analyze_3.py - visualizations of the third experiments results
exp_4.py - classifiying extracted IMG embeddings using GNB paierd with MultioutputClassifier and ClassifierChain
scores_4.py - metrics calculation based on predictions from exp_4.py
analyze_4.py - visualizations of the fourth experiments results

Sentence transformer flow

Directory sentence_transformer_flow contains preprocessing scripts for posts texts and label creation.

1_train_model.py - loads file hashtags.npy which stores NumPy array with all hashtags with their datastamps in the form ['COVID', '2022-01-01T00:58:43.000Z']; then it uses Sentence Transformer model to create embeddings of hashtags only (without dates); at the end embeddings are normalized and saved as data/st_embeddings_hashtags.npy; it additionaly creates file data/st_dates.npy for storing dates only.
2_pca.py - uses data/st_embeddings_hashtags.npy for analyzing of the explained variance depending on features number and creates representation limited to 80% of components saved as data/st_hashtags_pca.npy.
3_concatenate.py - uses data/st_dates.npy to get month, day, hour of post publication and minute; then adds them to data/st_hashtags_pca.npy; saves new representation as data/st_concatenated.npy .
4_clustering.py - performs KMeans clustering on representation from data/st_concatenated.npy to accumulate hashtags into 50 classification labels; predictions are saved as data/st_cluster_preds.npy.
5_clusters_to_labels - file data/st_labels_row.csv containing all hashtags (one row corresponds to one post) is used to map posts with new labels obtained from clustering; then labels are encoded to the binarized form with MultiLabelBinarizer and saved as data/txt_y.npy
6_posts_embeddings.py - loads data/extracted_texts.npy containing NumPy array with all posts (one element is one text) and creates embeddings with Sentence Transformer model; they are saved as data/texts_embeddings.npy.
7_extract_embeddings_img.py - loads data/multimodal_texts.npy containing texts of all posts which had also images and creates embeddings with Sentence Transformer model; they are saved as data/texts_embeddings_imgs.npyfor multimodal representation purposes.

Image preprocessing flow

Directory image_preprocessing_flow contains preprocessing scripts for image data.

1_img_to_npy.py - loads raw image data, pairing each image with its corresponding text and labels. Images are preprocessed using modified ResNet-18 transforms.
2_extract_img_embeddings_noft.py - extracts embeddings from preprocessed image data using ResNet-18 pretrained on ImageNet.
3_extract_img_embeddings_ft.py - extracts embeddings from 80% of preprocessed image data using ResNet-18 pretrained on ImageNet and additionally finetuned using remaining 20% of images. Data was splitted using stratified sampling.

License information and citing

Dataset on license Attribution-NonCommercial-ShareAlike 4.0 International. Additional information in LICENSE.txt file.

If you use dataset in a publication, we would appreciate citations to the following paper:

    @misc{borekmarciniec2024warcov,
        title={WarCov -- Large multilabel and multimodal dataset from social platform}, 
        author={Weronika Borek-Marciniec and Pawel Zyblewski and Jakub Klikowski and Pawel Ksieniewicz},
        year={2024},
        eprint={2406.10255},
        archivePrefix={arXiv}
    }

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WarCov dataset

Download and usage

Additional representations

Tasks

Repository content

Sentence transformer flow

Image preprocessing flow

License information and citing

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
figures		figures
image_preprocessing_flow		image_preprocessing_flow
preds		preds
scores		scores
sentence_transformer_flow		sentence_transformer_flow
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
analyze_1.py		analyze_1.py
analyze_3.py		analyze_3.py
analyze_4.py		analyze_4.py
analyze_class_distribution.py		analyze_class_distribution.py
analyze_multimodal.py		analyze_multimodal.py
dataset_pca.py		dataset_pca.py
exp_1.py		exp_1.py
exp_2.py		exp_2.py
exp_3.py		exp_3.py
exp_4.py		exp_4.py
scores_1.py		scores_1.py
scores_4.py		scores_4.py
utils.py		utils.py

License

w4k2/warcow

Folders and files

Latest commit

History

Repository files navigation

WarCov dataset

Download and usage

Additional representations

Tasks

Repository content

Sentence transformer flow

Image preprocessing flow

License information and citing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages