Logo

This webpage contains all materials for the Methodology and Statistics master course Processing Complex Data (PCD). The materials on this website are CC-BY-4.0 licensed. Lecturer
Javier Garcia-Bernardo
Assistant Professor of Social Data Science
Department of Methodology & Statistics
Utrecht University

Relational Database Project: Formula 1 Race History

Canonical course conventions live in project_guidelines.md. That file is the source of truth for the four required workflow files (week1_explore.qmd, week2_operationalize_clean.qmd, week3_model.qmd, week4_storytelling.qmd), the data/model_data.rds -> data/model_results.rds pipeline, the raw-data policy, quality-check requirements, decision logs, and contribution tracking. Read it before starting and treat anything below as project-specific guidance on top of those conventions.

Tutorial framing

Formula 1 relational data are complex because race results are distributed across normalized tables for races, drivers, constructors, circuits, seasons, standings, and result views rather than arriving as one analysis-ready rectangular file.

Students should learn three main things about these data:

  1. How relational data are represented through tables, schemas, primary keys, foreign keys, one-to-many relationships, many-to-many relationships, SQL queries, views, indices, and database files or dumps.
  2. How to turn normalized racing tables into one analysis-ready object by defining the analytic unit, writing joins, filtering races, aggregating related rows, and documenting the assumptions built into the join pipeline.
  3. How schema design, views, missing keys, join type, aggregation level, era restrictions, sprint-era rules, transaction-oriented storage, and analytical columnar storage affect modeling, interpretation, and the claims that can be made from race-history data.

Peer-teaching checklist

Dimension This project teaches
Data structure Relational schema, normalized tables, primary and foreign keys, one-to-many and many-to-many relationships, and joined analysis tables.
Storage system PostgreSQL database from an SQL dump, with comparison to SQLite, CSV/Excel/RDS exports, and Parquet as analytical columnar storage.
File formats PostgreSQL SQL dump, SQLite database file where relevant, CSV, JSON, RDS, Excel, and Parquet.
Encoding SQL text dump, binary database storage, text delimited files, R-specific serialization, and columnar Parquet encoding.
Model Linear, logistic, count, or grouped comparison model on a constructed driver-race, constructor-race, or season-level table.
Key aspects to explain Schemas, joins, keys, views, indices (including hash indexes and hash joins as the under-the-hood mechanism for equality lookups), ACID, row duplication or loss, unit of analysis, row-oriented database storage vs. columnar Parquet, and why SQL is declarative.

Resources

Data sources

Alternative data sources only if F1DB becomes impractical: Lahman Baseball Database, CTU Financial, CTU Credit, Chinook, and Northwind.

Knowledge sources

Week-by-week

Week 1

Start from the raw PostgreSQL SQL dump, identify the schema, and explain why the data are stored relationally rather than as one flat file.

Prepare for roundtable in week 2:

Week 2

Operationalize the research question by writing one join pipeline that creates an analysis-ready object from the raw tables.

Prepare for roundtable in week 3:

Week 3

Fit a small model on the joined Week 2 data, evaluate it, and show one sensitivity check to a join, filter, or aggregation choice.

Prepare for roundtable in week 4:

Week 4

Visualize and tell a story about the joined data and model while making the schema and preprocessing choices explicit.