
Web Scraping 101: Preparation

This guide is designed to facilitate the teaching of the library workshop "Web Scraping 101".

Prep for the workshop

Web scraping is a useful technique for researchers: it lets them extract data from websites quickly and automatically for analysis, making the research process more efficient and repeatable. This workshop is designed to help non-technical researchers choose the right tool for scraping content from the web. Module 1 introduces the no-code approach to scraping, while Module 2 demonstrates how to use Python code to scrape content.

Learning Goals

Module 1
  • Understand the basic concepts and principles of web scraping
  • Identify available web scraping tools and their appropriate uses
  • Use Power Query and Web Scraper to extract content of interest from the web
  • Make wise decisions when collecting data from the web

Module 2*
  • Understand when coding is necessary for web scraping
  • Use Python code to scrape content from the web
  • Use ChatGPT (Poe) as a learning aid to read, draft, and debug code

* Note: Module 2 is recommended for those who have completed Module 1 or have a basic understanding of web scraping. Prior knowledge of Python can be helpful, but it is not required. 
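To give a rough sense of what "using Python code to scrape content" can look like, here is a minimal sketch that pulls link text and addresses out of a small piece of HTML. It is an illustration only, not the workshop's own material: the sample page, the `LinkCollector` class, and the `extract_links` function are hypothetical names made up for this example, and it relies only on Python's built-in `html.parser` module so that it runs without installing anything.

```python
from html.parser import HTMLParser

# Hypothetical sample page (an assumption for illustration).
# In a real scrape, this text would be downloaded from a website.
SAMPLE_HTML = """
<html><body>
  <h1>Library Workshops</h1>
  <ul>
    <li><a href="/workshops/web-scraping-101">Web Scraping 101</a></li>
    <li><a href="/workshops/text-analysis">Text Analysis Basics</a></li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (link text, href) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None   # href of the <a> tag we are inside, if any
        self._current_text = []     # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._current_text).strip()
            self.links.append((text, self._current_href))
            self._current_href = None


def extract_links(html):
    """Return a list of (link text, href) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


for text, href in extract_links(SAMPLE_HTML):
    print(f"{text}: {href}")
```

The same pattern (get the page, find the elements of interest, save them in a structured form) is what no-code tools do behind the scenes; writing it in code mainly gives finer control over which elements are collected.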

Prep for workshop
  • We suggest using a Windows PC or laptop for Power Query, as the "Get data from Web" function is not available on Mac. You can use a Library PC in the classroom.
  • Register an account on Constellate (for sharing and practicing coding); see the detailed instructions below.
  • Register a free account on Poe.com (to access ChatGPT)

Register a new account & log in to Constellate

Constellate is a tool to help you learn coding and text analysis. Its coding platform is essentially a Jupyter notebook. We will use it for in-class coding exercises because it allows you to install additional Python libraries without admin rights, which suits the Library PC environment. If you are using your own device and are comfortable with another IDE, feel free to use whichever you prefer.
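If you want to check whether a Python library is already available in your notebook before the class, a small sketch like the one below may help. The function name `missing_packages` is made up for this example, and the package names `requests` and `bs4` are assumptions chosen for illustration; this guide does not say which libraries the workshop uses. The `%pip install` line mentioned in the output is a standard Jupyter/IPython command, but whether it behaves the same way inside Constellate is also an assumption.

```python
import importlib.util


def missing_packages(import_names):
    """Return the names from import_names that cannot be found in this environment."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Assumed example packages (not named in this guide).
wanted = ["requests", "bs4"]

missing = missing_packages(wanted)
if missing:
    print("Not installed:", ", ".join(missing))
    # In a Jupyter notebook cell, a package can usually be installed with:
    #   %pip install <package name>
    print("In a notebook cell, try: %pip install <package name>")
else:
    print("All packages are available.")
```

Note that the name used to import a library can differ from the name used to install it, so check the library's own documentation if an install command fails.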

 

Register & Login:

1. Access Constellate from the Library: https://lbdiscover.hkust.edu.hk/bib/991013145259903412
 

 

2. Click "Constellate" under "View it" to access the Constellate platform. If you are off campus, log in with your HKUST credentials.

On the Constellate platform, click on "Log in" in the upper right corner.

3. Click on "register one" and create a new JSTOR account using your HKUST email (DO NOT register through Google). 

You will then receive a confirmation email. Click "Get started" to get into the platform.

4. Log in with your username and password. Your first login must be made through the campus network.


Troubleshooting for "Pair with your institution":

If you are on campus:

Try clearing your browser cache and logging in again. Your HKUST identity should be recognized automatically from your IP address.

If you are off campus:

Make sure that your first login to Constellate is made through the campus network.


5. Click “Open my lab” to get into Jupyter Notebook in Constellate.
© HKUST Library, The Hong Kong University of Science and Technology. All Rights Reserved.