This webpage contains all materials for the Methodology and Statistics master course Processing Complex Data (PCD). The materials on this website are CC-BY-4.0 licensed.

Lecturer
Javier Garcia-Bernardo
Assistant Professor of Social Data Science
Department of Methodology & Statistics
Utrecht University

Messy Web Text Project: Green Claims and Climate Communication

Canonical course conventions live in project_guidelines.md. That file is the source of truth for the four required workflow files (week1_explore.qmd, week2_operationalize_clean.qmd, week3_model.qmd, week4_storytelling.qmd), the data/model_data.rds -> data/model_results.rds pipeline, the raw-data policy, quality-check requirements, decision logs, and contribution tracking. Read it before starting and treat anything below as project-specific guidance on top of those conventions.

Tutorial framing

Web text is complex because the data arrive wrapped in markup, navigation, scripts, boilerplate, duplicated page elements, and inconsistent page structure rather than as analysis-ready documents.

Students should learn three main things about these data:

  1. How web text is produced and represented through HTML, DOM trees, URLs, HTTP requests and responses, CSS, JavaScript, metadata, and page templates.
  2. How to turn raw pages into a clean corpus or analysis table by choosing a unit of analysis, extracting meaningful text, removing boilerplate, preserving source metadata, and documenting text-cleaning choices.
  3. How extraction, tokenization, repeated page elements, and publisher purpose affect linguistic features, models, visualizations, and the claims that can be made from a small web corpus.
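
As a rough illustration of the second point, the sketch below pulls visible text out of a small HTML string while skipping elements that usually hold boilerplate. It uses only Python's standard library. The `SKIP_TAGS` set, the class name `MainTextExtractor`, and the sample page are assumptions made for this example, not course requirements; real pages typically need page-specific extraction rules.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch.
# This set is an assumption; adjust it after inspecting your own pages.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the kept text chunks joined by single spaces."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only for illustration
sample_html = """
<html><head><style>p { color: green; }</style></head>
<body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<main><h1>Our climate pledge</h1>
<p>We aim to cut emissions by 2030.</p></main>
<footer>Cookie settings</footer>
<script>trackVisit();</script>
</body></html>
"""

print(extract_text(sample_html))
```

Changing `SKIP_TAGS` changes the resulting corpus, which is one reason to record such choices in the decision log.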

Peer-teaching checklist

| Dimension | This project teaches |
|---|---|
| Data structure | HTML documents, DOM trees, text corpus, nested page metadata, and document-feature or document-term representations (which are sparse matrices, the same data structure used for network adjacency). |
| Storage system | Raw downloaded HTML files, with NoSQL/document-store storage such as MongoDB discussed as an optional comparison rather than a required implementation. |
| File formats | HTML, JSON metadata or exports, TXT, and CSV/RDS-style clean analysis outputs. |
| Encoding | UTF-8 text, HTML markup, and JSON serialization for metadata or document-style records. |
| Model | Group comparison, logistic or linear regression, clustering, or another small interpretable model using transparent text features. |
| Key aspects to explain | DOM structure, HTTP status codes, robots.txt, scraping legality and ethics, boilerplate removal, tokenization, document-term or TF-IDF features, and sensitivity to extraction choices. |
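
To make the document-term and TF-IDF entries concrete, here is a minimal Python sketch that stores term counts sparsely (only nonzero entries) and computes one common TF-IDF variant: term frequency divided by document length, times the natural log of the inverse document frequency. The two toy documents, the `tokenize` rule, and the function names are assumptions for illustration; text-analysis libraries often use different weighting or smoothing, so their numbers may not match.

```python
import math
import re
from collections import Counter

# Hypothetical two-document corpus
docs = {
    "page_a": "Green energy and green jobs",
    "page_b": "Energy prices are rising",
}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only (a simplifying choice)."""
    return re.findall(r"[a-z]+", text.lower())


# Sparse document-term counts: terms absent from a document are simply not stored
dtm = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

n_docs = len(dtm)
# Number of documents in which each term appears at least once
doc_freq = Counter(term for counts in dtm.values() for term in counts)


def tf_idf(doc_id, term):
    """Length-normalized term frequency times unsmoothed log inverse document frequency."""
    tf = dtm[doc_id][term] / sum(dtm[doc_id].values())
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf


for term in ("green", "energy"):
    print(term, round(tf_idf("page_a", term), 4))
```

In this toy corpus "energy" appears in both documents, so its inverse document frequency is log(1) = 0 and it carries no weight, while "green" is specific to one page and keeps a positive weight.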

Resources

Data sources

Knowledge sources

Week-by-week

Week 1

Inspect raw HTML, explain who published it and why, and identify the DOM structure and markup noise that matter for extraction.
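
One possible way to start this inspection is to profile a page's markup before extracting anything: which tags occur, how often, and how deeply they nest. The Python sketch below does this with the standard library. The `DomProfiler` class and the sample page are hypothetical, and the depth count assumes reasonably well-formed HTML (unclosed `<p>` or `<li>` tags would inflate it).

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DomProfiler(HTMLParser):
    """Count start tags and track the maximum nesting depth of the page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


# Hypothetical page used only for illustration
sample_page = """
<html><body>
<div class="menu"><a href="/">Home</a><a href="/news">News</a></div>
<div class="article"><h1>Title</h1><p>Text<br>more text</p></div>
<script src="app.js"></script>
</body></html>
"""

profiler = DomProfiler()
profiler.feed(sample_page)
profiler.close()

print(profiler.tag_counts.most_common())
print("max depth:", profiler.max_depth)
```

A high share of `a`, `script`, or `div` tags relative to `p` tags is a quick hint about how much markup noise surrounds the text of interest.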

Prepare for roundtable in week 2:

Week 2

Operationalize the question by turning raw pages into one analysis table with transparent text-cleaning choices.
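
A minimal sketch of such a table, assuming the page is the unit of analysis, could look like the Python example below. The column names, the two records, and the `clean_text` rule are all hypothetical, and the table is written as CSV to an in-memory buffer purely for illustration; the file locations and formats required for the course are those defined in project_guidelines.md.

```python
import csv
import io
import re

# Hypothetical records; in a real project the text comes from the extraction step
pages = [
    {"doc_id": "a01", "source_url": "https://example.org/pledge",
     "publisher_type": "company", "text": "We aim  to cut emissions\nby 2030."},
    {"doc_id": "b01", "source_url": "https://example.org/report",
     "publisher_type": "ngo", "text": "Emissions rose again last year."},
]


def clean_text(text):
    """Collapse whitespace and lowercase; each step is a choice worth documenting."""
    return re.sub(r"\s+", " ", text).strip().lower()


# One row per page, keeping source metadata next to the cleaned text
rows = []
for page in pages:
    cleaned = clean_text(page["text"])
    rows.append({
        "doc_id": page["doc_id"],
        "source_url": page["source_url"],
        "publisher_type": page["publisher_type"],
        "n_tokens": len(cleaned.split()),  # whitespace tokens, a deliberate simplification
        "text_clean": cleaned,
    })

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping `source_url` and `publisher_type` in every row means later comparisons can always be traced back to the raw page.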

Prepare for roundtable in week 3:

Week 3

Fit a small interpretable text model using hand-built features or simple document representations, and show sensitivity to extraction or tokenization decisions.
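
The Python sketch below illustrates what sensitivity to tokenization can mean in practice: the same hand-built feature (the share of tokens equal to "green") is computed under two tokenizers and gives different values. The sentence, the two tokenizers, and the `green_share` feature are assumptions chosen for this example rather than prescribed methods.

```python
import re

# Hypothetical sentence with punctuation, casing, and hyphenation
text = "Green, greener, GREENEST: our eco-friendly, carbon-neutral plan is green."


def whitespace_tokens(text):
    """Split on whitespace only; punctuation and casing stay attached."""
    return text.split()


def word_tokens(text):
    """Lowercase and keep runs of ASCII letters; hyphenated words are split."""
    return re.findall(r"[a-z]+", text.lower())


def green_share(tokens):
    """Share of tokens that are exactly the string 'green'."""
    hits = sum(1 for token in tokens if token == "green")
    return hits / len(tokens)


for name, tokenizer in [("whitespace", whitespace_tokens), ("word", word_tokens)]:
    tokens = tokenizer(text)
    print(name, len(tokens), round(green_share(tokens), 3))
```

With whitespace splitting, "Green," and "green." keep their punctuation and never match, so the feature is 0; with the regex tokenizer, 2 of 11 tokens match. Reporting results under both settings shows how much a conclusion depends on this choice.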

Prepare for roundtable in week 4:

Week 4

Visualize and tell a story about the contrast between sources, while making the limits of the corpus and preprocessing choices explicit.