
This webpage contains all materials for the Methodology and Statistics master course Processing Complex Data (PCD). The materials on this website are CC-BY-4.0 licensed.

Lecturer
Javier Garcia-Bernardo
Assistant Professor of Social Data Science
Department of Methodology & Statistics
Utrecht University

API Data Project: ClinicalTrials.gov and Scientific API Workflows

Canonical course conventions live in project_guidelines.md. That file is the source of truth for the four required workflow files (week1_explore.qmd, week2_operationalize_clean.qmd, week3_model.qmd, week4_storytelling.qmd), the data/model_data.rds -> data/model_results.rds pipeline, the raw-data policy, quality-check requirements, decision logs, and contribution tracking. Read it before starting and treat anything below as project-specific guidance on top of those conventions.

Tutorial framing

ClinicalTrials.gov API data are complex because study records arrive through paginated endpoints as nested, institutionally defined JSON rather than as one complete analysis-ready table.

Students should learn three main things about these data:

  1. How API data are represented through endpoints, query parameters, OpenAPI specifications, nested JSON, repeated fields, pagination tokens, status codes, and response metadata.
  2. How to collect API data responsibly by limiting requests, waiting between calls, caching raw responses, handling 429 Too Many Requests, and documenting exactly which query produced the dataset.
  3. How to turn nested scientific records into one analysis-ready table by choosing fields, flattening modules, handling repeated values, preserving provenance, and explaining what the API schema makes visible or invisible.

Peer-teaching checklist

| Dimension | This project teaches |
| --- | --- |
| Data structure | Nested key-value records, arrays of repeated fields, response metadata, pagination tokens, and tabular data after extraction. |
| Storage system | Remote API endpoint, local cache of raw JSON responses, and a small MongoDB (NoSQL/document-store) instance loaded from the cached responses for the relational-vs-document comparison. |
| File formats | JSON responses, OpenAPI YAML specification, and clean table outputs such as CSV/RDS. |
| Encoding | JSON over HTTP, URL query parameters, UTF-8 text, and structured API response metadata. |
| Model | Logistic regression, linear regression, grouped comparison, or simple classifier using the flattened study-level table. |
| Key aspects to explain | Endpoints, query parameters, pagination, rate limits, caching, status codes, schema-defined fields, repeated-field flattening, responsible API use, and when a document store (NoSQL/MongoDB) is preferable to a relational database for nested JSON. |
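The relational-vs-document comparison in the checklist can be illustrated without a running database server. The sketch below keeps one made-up study as a single JSON document and, separately, splits it into two relational tables using Python's built-in `sqlite3` module. The table names, column names, and the placeholder study are assumptions for illustration. SQLite stands in for "a relational database" here, and the JSON string stands in for what a document store such as MongoDB would hold.

```python
import json
import sqlite3

# Made-up study record; the ID is a placeholder, not a real registration.
study = {
    "nctId": "NCT00000000",
    "overallStatus": "COMPLETED",
    "conditions": ["Depression", "Anxiety"],
}

# Document view: the whole nested record stays together as one document.
document = json.dumps(study)

# Relational view: the repeated field has to move into a child table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE studies (nct_id TEXT PRIMARY KEY, overall_status TEXT)")
con.execute("CREATE TABLE study_conditions (nct_id TEXT, condition_name TEXT)")
con.execute(
    "INSERT INTO studies VALUES (?, ?)",
    (study["nctId"], study["overallStatus"]),
)
con.executemany(
    "INSERT INTO study_conditions VALUES (?, ?)",
    [(study["nctId"], name) for name in study["conditions"]],
)

# Reassembling the study requires a join, and yields one row per condition.
rows = con.execute(
    "SELECT s.nct_id, s.overall_status, c.condition_name "
    "FROM studies s JOIN study_conditions c ON s.nct_id = c.nct_id "
    "ORDER BY c.condition_name"
).fetchall()
print(rows)
```

The trade-off to discuss is visible in the last step: the relational form needs a schema and a join for every repeated field, while the document form keeps the record intact but leaves the flattening work for analysis time.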

Resources

Data sources

Main raw source for this project: JSON responses from the ClinicalTrials.gov v2 studies endpoint. Students should use a query that returns enough records to require pagination, for example a condition such as depression, diabetes, cancer, or Alzheimer disease.
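As a rough illustration of how such a query maps onto URL query parameters, the Python sketch below builds a request URL for one condition. The base URL and the parameter names `query.cond`, `pageSize`, and `pageToken` are assumptions to verify against the OpenAPI specification before relying on them. `build_studies_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed location of the v2 studies endpoint; confirm in the API documentation.
BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_studies_url(condition, page_size=100, page_token=None):
    """Return a request URL for one page of studies matching a condition."""
    params = {"query.cond": condition, "pageSize": page_size}
    if page_token is not None:
        # Later pages are requested by passing back the token from the previous page.
        params["pageToken"] = page_token
    return f"{BASE_URL}?{urlencode(params)}"


first_url = build_studies_url("Alzheimer disease", page_size=50)
print(first_url)
# https://clinicaltrials.gov/api/v2/studies?query.cond=Alzheimer+disease&pageSize=50
```

Building the URL from a parameter dictionary, rather than pasting strings together, keeps the encoding of spaces and special characters consistent and makes the exact query easy to log.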

Alternative API source if ClinicalTrials.gov becomes impractical: Wikimedia APIs. A possible research question is: Do Wikipedia pages about climate change, fossil fuels, and renewable energy differ in pageviews, edit activity, or revision patterns over time?

Knowledge sources

Week-by-week

Week 1

Start from the ClinicalTrials.gov endpoint, inspect the API documentation and raw JSON, and download a cached sample using a polite paginated request workflow.
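One possible shape for a polite paginated request workflow is sketched below in Python. It is a sketch under stated assumptions, not the required implementation: `fetch_page` is a hypothetical callable that students would back with a real HTTP client, the response is assumed to carry a `nextPageToken` field, and the waiting times are arbitrary. The demo at the bottom uses a fake two-page API so the code runs without network access.

```python
import json
import tempfile
import time
from pathlib import Path


def download_pages(fetch_page, cache_dir, max_pages=5, wait_seconds=1.0,
                   max_retries=3, sleep=time.sleep):
    """Fetch pages one at a time, caching each raw response before parsing.

    fetch_page(page_token) is a hypothetical callable returning
    (status_code, body_text).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    page_token, saved = None, []
    for page_number in range(max_pages):  # hard cap on the number of requests
        for attempt in range(max_retries + 1):
            status, body = fetch_page(page_token)
            if status != 429:
                break
            # 429 Too Many Requests: wait longer after each failed attempt.
            sleep(wait_seconds * 2 ** (attempt + 1))
        if status != 200:
            raise RuntimeError(f"page {page_number}: HTTP status {status}")
        path = cache_dir / f"page_{page_number:04d}.json"
        path.write_text(body, encoding="utf-8")  # raw response, untouched
        saved.append(path)
        page_token = json.loads(body).get("nextPageToken")  # assumed field name
        if not page_token:
            break  # no token means this was the last page
        sleep(wait_seconds)  # pause between successive requests
    return saved


# Demo with a fake API that rate-limits the first call, then serves two pages.
responses = iter([
    (429, ""),
    (200, json.dumps({"studies": [{"id": 1}], "nextPageToken": "abc"})),
    (200, json.dumps({"studies": [{"id": 2}]})),
])
waits = []  # record requested waits instead of actually sleeping
files = download_pages(lambda token: next(responses), tempfile.mkdtemp(),
                       sleep=waits.append)
print(len(files), waits)  # 2 [2.0, 1.0]
```

Writing each response to disk before parsing it means a later bug in the extraction code never forces a second round of requests to the API.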

Prepare for roundtable in week 2:

Week 2

Operationalize the research question by flattening the nested API records into one study-level analysis table.
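A flattening step of this kind might look like the Python sketch below. The nested paths (`protocolSection`, `identificationModule`, `conditionsModule`, and so on) are assumptions about the response schema that should be checked against the raw JSON. The helper names and the choice to store repeated values as a count plus a joined string are illustrative, and the example record is made up.

```python
def get_path(record, path, default=None):
    """Walk a nested dict by keys, returning default when a level is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def flatten_study(study):
    """Return one row per study; repeated fields become a count and a joined string."""
    protocol = study.get("protocolSection", {})
    conditions = get_path(protocol, ["conditionsModule", "conditions"], [])
    phases = get_path(protocol, ["designModule", "phases"], [])
    return {
        "nct_id": get_path(protocol, ["identificationModule", "nctId"]),
        "overall_status": get_path(protocol, ["statusModule", "overallStatus"]),
        "enrollment": get_path(protocol, ["designModule", "enrollmentInfo", "count"]),
        "n_conditions": len(conditions),
        "conditions": "; ".join(conditions),
        "phases": "; ".join(phases),
    }


# Made-up record with a placeholder ID and no design module at all.
example = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "statusModule": {"overallStatus": "RECRUITING"},
        "conditionsModule": {"conditions": ["Depression", "Insomnia"]},
    }
}
row = flatten_study(example)
print(row)
```

Missing modules yield `None` or an empty string rather than an error, so the table keeps one row per study and the missingness itself stays visible for the quality checks.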

Prepare for roundtable in week 3:

Week 3

Fit a small interpretable model using the saved Week 2 table, evaluate it, and show one sensitivity check to an API extraction or flattening choice.
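The sensitivity check can be very small. The Python sketch below uses a grouped comparison (the share of completed studies per phase) rather than a fitted regression, and reruns it under two flattening rules for trials that list more than one phase. The rows are made up for illustration, and both rules (`first listed phase` and `last listed phase`) are hypothetical choices, not course requirements.

```python
from collections import defaultdict

# Made-up study-level rows; a real analysis would load the saved Week 2 table.
rows = [
    {"phases": ["PHASE1", "PHASE2"], "completed": True},
    {"phases": ["PHASE2"], "completed": False},
    {"phases": ["PHASE2", "PHASE3"], "completed": True},
    {"phases": ["PHASE3"], "completed": False},
]


def completion_share_by_phase(rows, pick_phase):
    """Share of completed studies per phase, given a rule for multi-phase trials."""
    totals, completed = defaultdict(int), defaultdict(int)
    for row in rows:
        phase = pick_phase(row["phases"])
        totals[phase] += 1
        completed[phase] += int(row["completed"])
    return {phase: completed[phase] / totals[phase] for phase in sorted(totals)}


# Rule A: assign a multi-phase trial to its first listed phase.
first_rule = completion_share_by_phase(rows, lambda phases: phases[0])
# Rule B: assign it to its last listed phase instead.
last_rule = completion_share_by_phase(rows, lambda phases: phases[-1])

print(first_rule)  # {'PHASE1': 1.0, 'PHASE2': 0.5, 'PHASE3': 0.0}
print(last_rule)   # {'PHASE2': 0.5, 'PHASE3': 0.5}
```

In this toy example the apparent difference between phases disappears under the second rule, which is exactly the kind of dependence on a flattening choice that the check is meant to surface.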

Prepare for roundtable in week 4:

Week 4

Visualize and tell a story about the clinical-trials pattern while making the API workflow, waiting strategy, and flattening choices explicit.