Data Scraping & Processing Documentation

Guide for scraping Discourse posts (without API access) and a course content website, and saving them as structured files.

Scraping Overview
Process for extracting and saving structured content
  • Discourse Forum: Scrape posts from https://discourse.onlinedegree.iitm.ac.in/c/courses/tds-kb/34 (01 Jan – 14 Apr 2025)
  • HTML Source: Extract and convert https://tds.s-anand.net/#/2025-01/ content into Markdown
  • Storage Format: Save Discourse data in .json or .xml; HTML in .md
  • Data Cleaning: Remove PII and unnecessary tags before saving
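The cleaning and storage steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the input field names (`id`, `created_at`, `cooked`, `username`), the `[email]` and `[user]` placeholders, and the XML element names are all assumptions chosen for the example, and the regular expressions are deliberately simple.

```python
import html
import json
import re
import xml.etree.ElementTree as ET

# Simple patterns for illustration; real PII detection may need more care.
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MENTION_RE = re.compile(r"(?<![\w@])@\w+(?:[.-]\w+)*")


def clean_text(raw_html: str) -> str:
    """Strip HTML tags, decode entities, redact emails and @mentions."""
    text = TAG_RE.sub(" ", raw_html)
    text = html.unescape(text)
    text = EMAIL_RE.sub("[email]", text)
    text = MENTION_RE.sub("[user]", text)
    return " ".join(text.split())  # collapse runs of whitespace


def clean_post(post: dict) -> dict:
    """Keep only non-identifying fields (field names are hypothetical)."""
    return {
        "id": post["id"],
        "created_at": post["created_at"],
        "text": clean_text(post["cooked"]),
    }


def to_json(posts: list[dict]) -> str:
    """Serialize cleaned posts as a JSON array."""
    return json.dumps(posts, indent=2, ensure_ascii=False)


def to_xml(posts: list[dict]) -> str:
    """Serialize cleaned posts as <posts><post id=...>...</post></posts>."""
    root = ET.Element("posts")
    for post in posts:
        node = ET.SubElement(root, "post", id=str(post["id"]))
        ET.SubElement(node, "created_at").text = post["created_at"]
        ET.SubElement(node, "text").text = post["text"]
    return ET.tostring(root, encoding="unicode")
```

Dropping every field that is not explicitly copied in `clean_post` (an allow-list) is one way to avoid leaking identifiers that the scraper did not anticipate.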

Implementation Notes

  • Date Range: Only scrape posts dated between 01 Jan 2025 and 14 Apr 2025
  • PII Removal: Strip usernames, email addresses, and other user identifiers where required
  • Error Handling: Retry on failed requests with exponential backoff
  • Storage: JSON or XML for Discourse; Markdown for HTML
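The date filter and the retry behaviour in these notes might look something like the sketch below. Several details are assumptions made for illustration: both end dates are treated as inclusive, timestamps are assumed to be ISO 8601 strings, and the helper names (`in_range`, `with_backoff`, `fetch`), the retry count, and the base delay are hypothetical.

```python
import time
import urllib.request
from datetime import date, datetime

START = date(2025, 1, 1)   # assumed inclusive
END = date(2025, 4, 14)    # assumed inclusive


def in_range(created_at: str) -> bool:
    """True if an ISO 8601 timestamp falls on a day within START..END."""
    # Older Python versions do not accept a trailing "Z", so normalize it.
    day = datetime.fromisoformat(created_at.replace("Z", "+00:00")).date()
    return START <= day <= END


def with_backoff(func, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return func()
        except OSError:  # includes urllib.error.URLError and socket timeouts
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one page, retrying failed requests with backoff."""
    def attempt():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    return with_backoff(attempt)
```

With the defaults, the waits between attempts are 1, 2, and 4 seconds. This sketch retries every request error, including HTTP errors such as 404; a fuller version would probably retry only transient failures.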