## CZI Software Mentions

[Link](https://github.com/chanzuckerberg/software-mentions)

Source:

- biomedical papers
  - 2.4 million papers from NIH PubMed Central commercial subset
  - 1.4 million papers from NIH PubMed Central non-commercial subset
  - 4 million papers from CZI Publishers' collection
    - biomedical, but also some other areas
    - includes 1.7 million PubMed Central papers

Method:

- SciBERT extracts plain-text software mentions
- disambiguation of mentions using DBSCAN
- linking by exact-match query in Pip, SciPy, GitHub
  - GitHub links have high error / unclear rate in evaluation

Content:

- software mentions
- context (2-3 lines)
- disambiguated software name
- link (if any)

## PLOS Open Science Indicators

[Link](https://plos.figshare.com/articles/dataset/PLOS_Open_Science_Indicators/21687686)

Source:

- 61,000 research articles published in PLOS (2019-2022)
- 6,500 articles in non-PLOS journals (for comparison)

Method:

- analyse XML of published research articles
- detect 3 open science practices
  - sharing of research data (NLP, DataSeer)
  - sharing of code (NLP, DataSeer)
  - posting of preprints (CrossRef, DataCite)

Content:

- data and code generation and sharing rate
- location of shared data and code

## Softcite dataset

[Link](https://github.com/howisonlab/softcite-dataset)

Source:

- 5,000 open access research publications in life sciences and social sciences

Method:

- manually annotated (I think)

Content:

- XML publications with encoded annotations (used GROBID)
- later used to train ML models
  - implemented in GROBID module for software mention recognition

## SoftwareKG

[Link](https://data.gesis.org/softwarekg/)

Source:

- SoftwareKG_Social: 51,000 articles from social sciences
- SoftwareKG_PubMed: 3M PubMed Central articles

Contents:

- knowledge graph
  - name of software mention
  - accessibility, URL, license, disambiguation

## French Open Science Monitor

[Link](https://barometredelascienceouverte.esr.gouv.fr/software/general), [Methodology Paper](https://github.com/Barometre-de-la-Science-Ouverte/bso3-techdoc/blob/master/methodology/bso3.pdf)

- select relevant papers using affiliation detector for CrossRef
  - identifies papers with at least one French author
  - uses controlled list of institutions for France
- run GROBID, then software mention detection + dataset mention detection on the collection

## SoMeSci

[Paper](https://peerj.com/articles/cs-835/), [data](https://data.gesis.org/somesci/)

Source:

- 1367 PubMed Central papers

Contents:

- Knowledge graph
  - version, developer, URL, citations
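
As a rough illustration of the exact-match linking step noted under CZI Software Mentions, the sketch below looks up a normalized mention name in a few package indexes. It is not the CZI implementation: the `INDEXES` snapshots, the `normalize` rule, and the `link_mention` helper are all hypothetical, and a real pipeline would query each registry rather than a hard-coded set.

```python
from typing import Dict, Set

# Hypothetical index snapshots, hard-coded for illustration only.
# A real pipeline would query each registry instead.
INDEXES: Dict[str, Set[str]] = {
    "pip": {"numpy", "scikit-learn"},
    "github": {"numpy", "grobid"},
}


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def link_mention(name: str, indexes: Dict[str, Set[str]] = INDEXES) -> Dict[str, bool]:
    """Report, per index, whether the normalized name matches an entry exactly."""
    key = normalize(name)
    return {source: key in names for source, names in indexes.items()}


if __name__ == "__main__":
    for mention in ["NumPy", "  GROBID ", "SPSS"]:
        print(mention.strip(), link_mention(mention))
```

Matching on the name alone cannot distinguish unrelated projects that share a name, which may be one reason the notes above record a high error / unclear rate for GitHub links.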