Data Scraping & Processing Documentation

A guide to scraping Discourse posts (without API access) and a course content website, then saving the results as structured files.

Not Foolproof

These steps are not an exact recipe to follow. Treat them as a general guide to the process and as a starting point.

Scraping Overview

The process for extracting and saving structured content:

Discourse Forum: Scrape posts from https://discourse.onlinedegree.iitm.ac.in/c/courses/tds-kb/34 (01 Jan – 14 Apr 2025)
HTML Source: Extract and convert https://tds.s-anand.net/#/2025-01/ content into Markdown
Storage Format: Save Discourse data as .json or .xml, and the converted course content as .md
Data Cleaning: Remove PII and unnecessary tags before saving
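
The cleaning and storage steps above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure. It assumes each scraped post is a dict with `id`, `created_at` (ISO 8601), `username`, and `cooked` (the rendered HTML body); those field names, the helper names, and the email pattern are assumptions made for the example. It keeps only posts inside the date window, strips tags, redacts email addresses, drops the username, and serializes the result to JSON.

```python
import json
import re
from datetime import datetime, timezone
from html.parser import HTMLParser

# Date window from the overview (inclusive); UTC is an assumption.
START = datetime(2025, 1, 1, tzinfo=timezone.utc)
END = datetime(2025, 4, 14, 23, 59, 59, tzinfo=timezone.utc)

# Deliberately simple email pattern; real PII removal needs more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are joined with spaces, so inline tags inside a word
    # would split it; acceptable for a sketch.
    return " ".join(" ".join(parser.parts).split())


def clean_post(post):
    """Return a cleaned copy of the post, or None if it is outside the window."""
    created = datetime.fromisoformat(post["created_at"].replace("Z", "+00:00"))
    if not (START <= created <= END):
        return None
    text = EMAIL_RE.sub("[email removed]", strip_tags(post["cooked"]))
    # The username is intentionally left out of the saved record.
    return {"id": post["id"], "created_at": post["created_at"], "text": text}


def posts_to_json(posts):
    """Clean all posts and serialize the ones that remain."""
    cleaned = [c for c in (clean_post(p) for p in posts) if c is not None]
    return json.dumps(cleaned, ensure_ascii=False, indent=2)
```

To write the output to disk, pass the string returned by `posts_to_json` to a file opened with `encoding="utf-8"`. Names and phone numbers inside post bodies are not handled here and would need their own rules.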

Implementation Notes