Open sources provide additional avenues for TDM, often with fewer barriers because their content is openly accessible. Popular sources include arXiv, CrossRef, PubMed Central, and OpenAlex.
arXiv is an open access repository of preprints, primarily in physics, mathematics, computer science, and related fields. Each paper carries a license chosen by its authors: some use Creative Commons licenses, which allow TDM as long as users respect the license terms, while others use arXiv's own non-exclusive distribution license, which grants fewer reuse rights. arXiv's mission to disseminate scientific results implicitly supports TDM, but users should check the license of each individual paper for specific restrictions.
Terms of Use: https://info.arxiv.org/help/policies/submission_agreement.html
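The per-paper license check described above can be sketched as a small filter. This is a minimal illustration, not arXiv's own tooling: the record structure, the `id` values, and the helper names `is_creative_commons` and `split_by_license` are hypothetical, and the sketch assumes each record's license URL has already been obtained from arXiv metadata.

```python
from urllib.parse import urlparse

# Hypothetical records: field names and ids are illustrative placeholders,
# not output copied from an arXiv service.
records = [
    {"id": "example-1", "license": "http://creativecommons.org/licenses/by/4.0/"},
    {"id": "example-2", "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/"},
    {"id": "example-3", "license": ""},
]


def is_creative_commons(license_url):
    """Return True if the license URL is hosted on creativecommons.org."""
    host = urlparse(license_url).hostname or ""
    return host == "creativecommons.org" or host.endswith(".creativecommons.org")


def split_by_license(items):
    """Separate ids with a Creative Commons license from ids needing manual review."""
    cleared, review = [], []
    for item in items:
        target = cleared if is_creative_commons(item.get("license", "")) else review
        target.append(item["id"])
    return cleared, review


cleared, review = split_by_license(records)
print("Creative Commons:", cleared)
print("Check terms manually:", review)
```

A Creative Commons license still comes with conditions (attribution, and for some variants non-commercial or no-derivatives clauses), so the "cleared" list marks papers whose terms are standardized, not papers free of restrictions.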
CrossRef’s TDM service allows researchers to easily access full-text documents from participating members through a standardized API, supporting both open access and subscription-based content. By using DOIs and metadata, it simplifies the process of harvesting large datasets for analysis.
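As a rough sketch of how DOI-based metadata can drive harvesting, the snippet below builds a lookup URL for the public Crossref REST API and pulls out the full-text links that a publisher has flagged for text mining. The sample response is a hand-written placeholder that only mirrors the fields assumed here (`link`, `intended-application`, `license`); the DOI and the example.org URLs are hypothetical, and the helper names are not part of any Crossref library.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def works_url(doi):
    """Build the REST API URL for one DOI (slashes in the DOI are kept)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def text_mining_links(work):
    """Return (content_type, url) pairs whose intended application is text mining."""
    message = work.get("message", work)
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in message.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


def license_urls(work):
    """Return the license URLs recorded in the metadata, if any."""
    message = work.get("message", work)
    return [lic["URL"] for lic in message.get("license", []) if "URL" in lic]


# Hand-written placeholder shaped like a single-work response (assumption).
sample = {
    "message": {
        "DOI": "10.5555/12345678",
        "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
        "link": [
            {"URL": "https://example.org/fulltext.xml",
             "content-type": "application/xml",
             "intended-application": "text-mining"},
            {"URL": "https://example.org/fulltext.pdf",
             "content-type": "application/pdf",
             "intended-application": "similarity-checking"},
        ],
    }
}

print(works_url("10.5555/12345678"))
print(text_mining_links(sample))
print(license_urls(sample))
```

Whether a returned link can actually be downloaded still depends on the publisher: subscription content generally requires that the requesting institution has access.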
PubMed Central (PMC) provides several datasets for text mining, including the PMC Open Access Subset and the PMC Author Manuscript Dataset, which are accessible via cloud services, API, or FTP. However, not all articles in PMC are available for text mining, and users must adhere to the license terms of each article, which vary and may include Creative Commons licenses.
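Because license terms vary per article, a harvesting script would typically look up each article's license before downloading it. The sketch below assumes the PMC OA web service endpoint and an XML response in which each `record` element carries `id` and `license` attributes plus `link` children; the endpoint URL, the sample XML, and the PMCID are assumptions written for illustration and should be checked against the current PMC documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint of the PMC OA web service; verify before use.
OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"


def oa_lookup_url(pmcid):
    """Build a lookup URL for one PMCID."""
    return OA_SERVICE + "?" + urlencode({"id": pmcid})


def parse_oa_records(xml_text):
    """Extract PMCID, license label, and download links from a response."""
    root = ET.fromstring(xml_text)
    parsed = []
    for rec in root.iter("record"):
        parsed.append({
            "pmcid": rec.get("id"),
            "license": rec.get("license"),
            "links": {ln.get("format"): ln.get("href") for ln in rec.findall("link")},
        })
    return parsed


# Hand-written placeholder response (hypothetical PMCID and host).
SAMPLE = """<OA>
  <records returned-count="1" total-count="1">
    <record id="PMC0000000" citation="Example J. 2020" license="CC BY" retracted="no">
      <link format="tgz" href="ftp://ftp.example.org/oa_package/PMC0000000.tar.gz"/>
    </record>
  </records>
</OA>"""

oa_records = parse_oa_records(SAMPLE)
print(oa_lookup_url("PMC0000000"))
print(oa_records)
```

An empty result from `parse_oa_records` would indicate that the response contained no record for the requested article, in which case the article should not be treated as available for text mining.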
OpenAlex is an open-source bibliographic database, offering extensive metadata on academic publications, including authors, institutions, and citations. Currently, it indexes over 250 million scholarly works from 250k sources. Its data is freely available for download and use, including TDM, under open licenses, making it a valuable resource for researchers. Note that OpenAlex focuses on metadata rather than full text.
Users can access the data through the OpenAlex API or by direct download from the platform.
More about OpenAlex: https://help.openalex.org/hc/en-us
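For API access, a minimal sketch of paging through OpenAlex works with cursor pagination might look like the following. It assumes the `https://api.openalex.org/works` endpoint with `filter`, `per-page`, and `cursor` parameters and a `meta.next_cursor` field in each response; the helper names, the `max_pages` cap, and the example filter are choices made for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OPENALEX_WORKS = "https://api.openalex.org/works"


def works_page_url(filter_expr, cursor="*", per_page=200):
    """Build the URL for one page of results; cursor "*" requests the first page."""
    params = {"filter": filter_expr, "per-page": per_page, "cursor": cursor}
    return OPENALEX_WORKS + "?" + urlencode(params)


def fetch_json(url):
    """Download one page and decode it as JSON."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_works(filter_expr, fetch=fetch_json, max_pages=5):
    """Yield work records page by page until the cursor runs out or the cap is hit."""
    cursor = "*"
    for _ in range(max_pages):
        page = fetch(works_page_url(filter_expr, cursor))
        yield from page.get("results", [])
        cursor = page.get("meta", {}).get("next_cursor")
        if not cursor:
            break


# Build the first-page URL without touching the network.
print(works_page_url("publication_year:2023"))
```

Calling `iter_works("publication_year:2023")` would then stream metadata records over the network. The `fetch` parameter is injectable so the paging logic can be exercised offline with a stub, and for very large harvests the direct download is likely the better fit than paging through the API.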