Text and Data Mining (TDM) from HKUST Licensed Material

This guide helps HKUST users learn which publishers permit text and data mining under the Library's regular subscriptions.

Text and Data Mining - HKUST Subscriptions

Text and Data Mining (TDM) refers to the process of using automated tools and techniques to extract, analyze, and derive insights from large sets of text and data. 

The majority of publishers that support TDM offer the service free of charge. However, there are often rules and requirements that must be followed, and the methods for data mining and delivery can also be quite different.

Commonly Seen Terms and Conditions:

  • Use is restricted to non-commercial research purposes. 
  • Only subscribed and open access content can be text-mined. 
  • Follow the download limit, e.g. 3 requests per second.
  • Sharing the data with third parties is prohibited.
  • Delete the data once the project is completed.
  • Use APIs to extract data rather than crawling databases with web robots, spiders, etc. 
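
A download limit such as "3 requests per second" can be respected by pausing between calls. The following is a minimal Python sketch, not any publisher's official client; the class name `Throttle` and its parameters are illustrative assumptions, and the actual limit should be taken from each publisher's terms.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls, e.g. 3 requests per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

To use it, create one `Throttle(max_per_second=3)` per publisher and call `wait()` immediately before each API request.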

In autumn 2020, HKUST Library's Research Support Services conducted a small study on text and data mining (TDM) of Library-subscribed resources. The findings appeared in a Research Bridge article, Text and Data Mining: Full-text Databases.

Cambridge University Press

Machine Analysis (Text and Data Mining)

Cambridge allows users with lawful access to its content to perform text and data mining (TDM) for non-commercial purposes. Users can download, extract, store, and analyze content, provided a link to the original content on Cambridge's site is included. Any locally stored copies must be deleted once the research project ends. While TDM results can be shared publicly for research purposes, the use of Cambridge content or results for commercial purposes is strictly prohibited unless allowed by applicable law.

Content is provided "as is," and Cambridge does not guarantee its suitability for machine analysis or provide API access. Usage is monitored, and restrictions may be applied, including technical protection measures. For large-scale downloading, specific formats, or other inquiries, users are encouraged to contact openresearch@cambridge.org.

Read more: https://www.cambridge.org/core/legal-notices/terms

Elsevier

Elsevier allows a certain amount of TDM on subscribed content.

  • Non-Commercial Users (Researchers in Academic & Public Sector Institutions, Charities & Charitable Foundations):
    Most APIs (except SciVal and Embase APIs) are available for no charge, for non-commercial use, subject to Elsevier's policies and limits on usage.
  • Commercial Users (Researchers in Private Sector & Commercial Institutions):
    APIs are available for commercial use with an API license and subscription; contact Elsevier to discuss your request.

Key rules:  

  • API key required: Obtain an API key via Elsevier’s Developer Portal.
  • Access scope: HKUST researchers can access subscribed content + Open Access materials.
  • Open Access downloads: Users can download OA content without registering, though registration is recommended for accessing text-mining-friendly formats and receiving technical support.
  • Image mining supported: Use the Object Retrieval API to mine images.
  • Rate-limited downloads: There are no hard limits on the amount of content users can download, but reasonable rate limits ensure fair access, and abusive usage may lead to deactivation of the API key. 
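
As an illustration of the API key rule above, the sketch below prepares (but does not send) a full-text request for a single article. The endpoint URL and the `X-ELS-APIKey` header name are assumptions that should be confirmed in the Developer Portal documentation; the function name and the DOI in the usage note are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed base URL of Elsevier's article retrieval endpoint; confirm it
# in the Developer Portal documentation before use.
ARTICLE_ENDPOINT = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key, accept="text/xml"):
    """Return an unsent Request for one article's full text by DOI."""
    url = ARTICLE_ENDPOINT + urllib.parse.quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": accept,         # requested response format
    }
    return urllib.request.Request(url, headers=headers)
```

Calling `build_article_request("10.1016/j.example.2020.01.001", "YOUR-KEY")` and passing the result to `urllib.request.urlopen` would send the request. Keep the key out of shared code, and pause between requests so that usage stays within Elsevier's rate limits.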

EIU Viewpoint

Text and Data Mining

EIU makes its data available for download and analysis with its “EIU Viewpoint add-in for Excel”. EIU provides a user guide to the Excel Add-in.

The terms of use allow you to do text and data mining, but any use in either university assignments or for publishing in academic articles must cite EIU as the data source. You must only use the Excel Add-in or API to access data for use in text and data mining.

Using AI Tools with EIU data

Generative AI and other AI models or products can be developed using EIU content only for non-public research and teaching, and only when specific content protection measures are taken.

  • If you are using a third-party platform as a base for an AI model (e.g. OpenAI GPT models), your models must be trained in a secure and “ring-fenced manner”. “Ringfenced” means information that is self-contained to each organization/AI program and isn’t commingled with the rest of the world/internet.
  • HKUST’s enterprise version of ChatGPT (via Azure), which is fenced off so that user data is not used to train the underlying OpenAI model, is OK to use.
  • Public/non-paying versions of ChatGPT and other generative AI tools may NOT be used by HKUST users with the data from EIU.

EIU also offers additional and bespoke licences for AI use. If you would like to discuss these, please email them at licensing@eiu.com.

Factiva

Text & Data Mining with Factiva requires a separate license.

The Library can provide a contact person for researchers who wish to request a quote.

Gale - Cengage

Gale Digital Scholar Lab

  • Is a platform for text analysis, data mining, and data visualization.
  • Users can create and analyze content sets using our licensed content in the “Gale Primary Sources” collections.

Content includes:  

Gale (Cengage): Data Mining FAQs

A few collections are excluded because the copyright holders have not granted the necessary rights, including Financial Times Historical Archive, 1888-2021 and National Geographic Virtual Library.

IEEE

"IEEE permits non-commercial text and data mining of articles published open access with either the Open Access Publishing Agreement (OAPA) or the Creative Commons license (CC BY). No permission is required for non-commercial mining of open access articles.

Mining for commercial purposes or mining of non-open access content requires permission from IEEE. Contact pubs-permissions@ieee.org for further information."

Source: https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/post-publication-policies/#tdm

JSTOR

Constellate

Provides text and data mining tools and teaching resources for JSTOR, Portico, and other ITHAKA collections.

JSTOR Dataset Services
Anyone can request a dataset through either of the two services below.

  • Self-service: limited to 25,000 documents; does not cover full text.
  • Large/full-text request: by special request and requires an agreement about the use of the data.

Nexis Uni

HKUST's Nexis Uni subscription supports student research, but it does not permit crawling or downloading large volumes of data.

Lexis-Nexis has a section on their website where you can ask about using or purchasing their "Data as a Service" for larger datasets.

They also offer a personal consultation service on mining with the LexisNexis Bulk Content API.

Project Muse

Project Muse supports TDM with prior approval. Check with library staff to obtain a publisher contact for more details.

Here is the relevant section from their standard journal license:

"...subject to prior notification and approval by Project MUSE, [you may] engage in text processing, which is any kind of analysis of natural language text. MUSE will make appropriate arrangements prior to the start of this activity to account for usage data and ensure continued access for the user. This may include but not be limited to a process by which information may be derived from text by identifying patterns and trends within natural language through text categorization, statistical pattern recognition, concept or sentiment extraction, and the association of natural language with indexing terms…" - https://about.muse.jhu.edu/librarians/license-review/

ProQuest

ProQuest TDM Studio is a text and data mining solution designed to facilitate research across various disciplines by enabling users to analyze large sets of licensed content, including newspapers, scholarly articles, dissertations, and government databases. The platform provides two primary dashboards tailored to different user needs and skill levels: Visualizations and Workbench.

Researchers can pay extra to text and data mine ProQuest content that HKUST Library already owns or subscribes to via the ProQuest TDM Studio.

SAGE

"Downloading articles from SAGE Journals for the purposes of text and data mining is expressly permitted in our standard licence agreements and our terms of use for no extra fee. You do not need to ask permission to systematically download articles provided that:

  • You only use the articles for non-commercial text and data mining.
  • You only download articles to which you have legitimate access, for example if they are open access or part of your institution's subscription. If you cannot view an article on SAGE Journals, you will not be able to download it.
  • You respect the following limits when downloading SJ content:
    • 1 request every 6 seconds – Monday to Friday between Midnight and Noon in the "America/Los_Angeles" timezone;
    • 1 request every 2 seconds - Monday to Friday between Noon and Midnight in the "America/Los_Angeles" timezone, and all day Saturday and Sunday."

Source: https://journals.sagepub.com/page/policies/text-and-data-mining
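
The time-dependent limits quoted above can be encoded so that a download script picks the right delay automatically. This is a sketch under one reading of the policy: it treats exactly noon as the start of the faster afternoon window, and the function name is an illustrative assumption.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def sage_min_interval_seconds(when):
    """Return the minimum gap between requests, in seconds, at time `when`.

    `when` must be a timezone-aware datetime; it is converted to
    America/Los_Angeles time before the rule is applied.
    """
    local = when.astimezone(PACIFIC)
    is_weekday = local.weekday() < 5   # Monday=0 ... Friday=4
    is_morning = local.hour < 12       # midnight up to, not including, noon
    return 6 if (is_weekday and is_morning) else 2
```

For example, `sage_min_interval_seconds(datetime.now(ZoneInfo("Asia/Hong_Kong")))` gives the delay that applies right now for a script running in Hong Kong. Because Hong Kong is 15 or 16 hours ahead of Los Angeles, the slower weekday-morning window falls in the Hong Kong afternoon and night.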

Springer Nature

Text and Data Mining at Springer Nature

  • Non-commercial use: Researchers affiliated with institutions that subscribe to Springer Nature's journals or books may perform TDM for non-commercial purposes. The use of Springer Nature’s TDM API incurs additional costs.
  • Commercial use: Commercial TDM projects are supported under standard TDM terms or via the TDM API, both of which are available for a fee.

Rules:

  • Access scope: Content, including both subscribed and open-access material, can be downloaded directly from Springer Nature's platforms.
  • Download limit: Researchers should restrict download requests to 1 request per second when using existing search tools (e.g., PubMed, Web of Science) or Springer Nature’s Metadata API.
  • Get API key: An API key can be requested to access the Springer Nature TDM APIs, which support more advanced querying and allow a higher bandwidth of up to 150 requests per minute.
  • Image mining not available: Springer Nature does not currently provide an API for image mining.
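
A per-minute allowance such as 150 requests per minute permits short bursts, so a sliding-window counter fits it better than a fixed pause between calls. The sketch below is one possible client-side approach, not Springer Nature's own tooling; the class name and parameters are illustrative assumptions.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` requests in any `period`-second window."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._stamps = deque()   # times of requests still inside the window

    def acquire(self):
        """Block until a request is allowed, then record it."""
        while True:
            now = self._clock()
            # Drop timestamps that have aged out of the window.
            while self._stamps and now - self._stamps[0] >= self.period:
                self._stamps.popleft()
            if len(self._stamps) < self.max_calls:
                self._stamps.append(now)
                return
            # Window is full: sleep until the oldest request expires.
            self._sleep(self.period - (now - self._stamps[0]))
```

Under the limits listed above, `SlidingWindowLimiter(150, 60)` would suit the TDM APIs and `SlidingWindowLimiter(1, 1)` the 1 request per second rule; call `acquire()` before each request.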

Read more: https://www.springernature.com/gp/researchers/text-and-data-mining 

Web of Science

Web of Science provides APIs to access publication and citation data for integration with internal systems. HKUST's subscription enables institutional access for advanced API usage. 

  1. Register an account at Clarivate's Developer Portal
  2. Get API key through https://developer.clarivate.com/help/api-access#key_access

Available APIs: 

  • Web of Science Starter API
    • Scope: Bibliographic metadata (e.g., DOI, author, source title) from Web of Science Core Collection.
    • Limits: 5 requests per second, 5,000 requests per day (institutional users)
  • Web of Science API Expanded
    • Scope: Full item-level metadata, including times cited, contributor affiliations, funding data, and citation networks.
    • Access: Requires a paid license.
  • Full-text metadata is not accessible via these APIs, but DOIs in search results can facilitate text mining through CrossRef.
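
The Starter API limits above determine how long a metadata harvest will take. The arithmetic can be sketched as follows; the page size of 50 records per request used in the usage note is an assumption to check against Clarivate's documentation, and the function and field names are illustrative.

```python
import math
from typing import NamedTuple


class HarvestPlan(NamedTuple):
    requests_needed: int         # API calls required in total
    days_needed: int             # calendar days under the daily cap
    seconds_at_full_rate: float  # request time if sent at the per-second cap


def plan_harvest(total_records, page_size, per_second=5, per_day=5000):
    """Estimate the cost of fetching `total_records` in pages of `page_size`."""
    requests_needed = math.ceil(total_records / page_size)
    days_needed = math.ceil(requests_needed / per_day)
    seconds_at_full_rate = requests_needed / per_second
    return HarvestPlan(requests_needed, days_needed, seconds_at_full_rate)
```

With an assumed page size of 50, `plan_harvest(1_000_000, 50)` gives 20,000 requests spread over at least 4 days, so the daily cap rather than the per-second cap is the binding constraint for large harvests.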


Wiley

Wiley Text and Data Mining

Academic subscribers can perform TDM under license (or in accordance with statutory rights under applicable legislation) on subscribed content for non-commercial purposes at no extra cost.

  • Access via API: TDM must be conducted through Wiley’s approved API system—other methods (e.g. web scraping) are not allowed. 
  • Wiley recommends using the Crossref Text and Data Mining Service for TDM projects.
  • API Token Requirement: To obtain an API token, users must accept a click-through license. First consult your institution’s library to determine whether a separate agreement is already in place.

Read more: https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining 
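
Crossref metadata records can list full-text links that publishers have flagged for text mining. The sketch below filters those links from a record that has already been fetched and parsed; it assumes the record follows the Crossref REST API layout (a `link` list whose items carry `URL`, `content-type`, and `intended-application` fields). The sample record is made up for illustration, and retrieving a Wiley link still requires the API token described above.

```python
def text_mining_links(work):
    """Return (URL, content type) pairs that a Crossref record flags for TDM.

    `work` is the "message" object of a Crossref works response, already
    parsed from JSON into a dict.
    """
    links = work.get("link", [])
    return [
        (item["URL"], item.get("content-type", "unspecified"))
        for item in links
        if item.get("intended-application") == "text-mining" and "URL" in item
    ]


# Made-up record for illustration only; the DOI and URLs are not real.
sample_work = {
    "DOI": "10.1002/example.12345",
    "link": [
        {"URL": "https://example.org/fulltext.pdf",
         "content-type": "application/pdf",
         "intended-application": "text-mining"},
        {"URL": "https://example.org/landing",
         "content-type": "unspecified",
         "intended-application": "similarity-checking"},
    ],
}

print(text_mining_links(sample_work))
```

Only the first link in the sample is returned, because the second is marked for similarity checking rather than text mining.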

WisersOne = 慧眼輿情

Text & Data Mining with WisersOne = 慧眼輿情 requires a separate license.

The Library can provide a contact person for researchers who wish to request a quote.

© HKUST Library, The Hong Kong University of Science and Technology. All Rights Reserved.