This webpage contains all materials for the Methodology and Statistics master course Processing Complex Data (PCD). The materials on this website are CC-BY-4.0 licensed.

Lecturer: Javier Garcia-Bernardo, Assistant Professor of Social Data Science, Department of Methodology & Statistics, Utrecht University.

Processing Complex Data

About the course

Contrary to what most introductory data science and statistics courses suggest, real-world scientific data come in an enormous variety of formats, sizes, structures, and procedures: from simple tables to spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, and domain-specific scientific standards. This course gives students hands-on experience with handling, processing, and modelling six families of complex data in a hackathon-style format, where each group goes deep on one data type and teaches the rest of the class.

The narrative spine of the course runs from raw traces to defensible claims. Each group works through a single pipeline: raw source → operationalized clean object → baseline model with one sensitivity check → presentation.
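The four pipeline stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not course material: the toy records, their field names, and the functions `clean`, `fit_slope`, and `leave_one_out` are all hypothetical, and an ordinary least-squares slope with a leave-one-out refit merely stands in for whatever baseline model and sensitivity check a group chooses.

```python
from statistics import mean

# Hypothetical raw source: nested, string-typed records with one missing value.
RAW = [
    {"id": "a", "vals": {"x": "1", "y": "2.1"}},
    {"id": "b", "vals": {"x": "2", "y": "3.9"}},
    {"id": "c", "vals": {"x": "3", "y": None}},
    {"id": "d", "vals": {"x": "4", "y": "8.2"}},
    {"id": "e", "vals": {"x": "5", "y": "9.8"}},
]

def clean(raw):
    """Raw source -> clean object: flat (x, y) float pairs, incomplete rows dropped."""
    rows = []
    for rec in raw:
        x, y = rec["vals"].get("x"), rec["vals"].get("y")
        if x is None or y is None:
            continue
        rows.append((float(x), float(y)))
    return rows

def fit_slope(rows):
    """Baseline model (assumed for illustration): OLS slope of y on x."""
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return sxy / sxx

def leave_one_out(rows):
    """Sensitivity check: refit the slope with each observation removed in turn."""
    return [fit_slope(rows[:i] + rows[i + 1:]) for i in range(len(rows))]

rows = clean(RAW)
slope = fit_slope(rows)
loo = leave_one_out(rows)

# Presentation: report the estimate together with its range under the check.
print(f"slope = {slope:.2f}, leave-one-out range = [{min(loo):.2f}, {max(loo):.2f}]")
```

Reporting the leave-one-out range next to the point estimate shows how much the claim depends on any single observation, which is the kind of evidence a defensible claim needs.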

Course materials

Lectures

| Week | Title | Lecture |
|------|-------|---------|
| 1 | What Makes Data Complex? | TBD |
| 2 | From Complex Data to Clean Data | TBD |
| 3 | Scaling Up Modeling | TBD |
| 4 | Communicating Research | TBD |
| 5 | Presentations | |
| 6 | Short Report | |

Group projects

| Variant | Data family | Example research question |
|---------|-------------|---------------------------|
| Geospatial | projects/geospatial.md | What is the relation between municipal land use and population composition? |
| Networks | projects/networks.md | What is the relationship between gender and cross-program relations in high school? |
| Messy web text | projects/messy_web_text.md | Do company sustainability pages differ linguistically from public-interest climate information pages? |
| Relational database | projects/relational_database.md | Which driver, constructor, grid, circuit, and season characteristics are associated with F1 finishing points? |
| Time series | projects/time_series.md | How does an fMRI signal change across NSD scan sessions? |
| API data | projects/api_data.md | Which study attributes are associated with completed versus ongoing clinical trials? |