Lecturer: Javier Garcia-Bernardo, Assistant Professor of Social Data Science, Department of Methodology & Statistics, Utrecht University.
Contrary to what most introductory data science and statistics courses suggest, real-world scientific data come in an enormous variety of formats, sizes, structures, and collection procedures: simple tables, spatiotemporal arrays, normalized relational schemas, nested API responses, raw scraped web pages, networks, and domain-specific scientific standards. This course gives students hands-on experience handling, processing, and modeling six families of complex data. It runs in a hackathon-style format: each group goes deep on one data type and teaches it to the rest of the class.
The narrative spine of the course is "from raw traces to defensible claims." Each group works through a single pipeline: raw source → operationalized clean object → baseline model with one sensitivity check → presentation.
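
The four pipeline stages can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not course material: the toy records in `RAW`, the function names (`clean`, `ols_slope`, `run_pipeline`, `present`), and the `y_cap` threshold are all hypothetical, and a least-squares slope stands in for whatever baseline model a given project actually uses.

```python
from statistics import mean

# Hypothetical raw source: string-typed fields, stray whitespace, a missing value.
RAW = [
    {"id": "a", "x": "1", "y": "2.1"},
    {"id": "b", "x": "2", "y": "3.9"},
    {"id": "c", "x": "3", "y": ""},      # missing outcome
    {"id": "d", "x": " 4 ", "y": "8.2"},
    {"id": "e", "x": "5", "y": "30.0"},  # suspiciously large outcome
]


def clean(raw):
    """Raw source -> clean object: a list of (x, y) float pairs."""
    rows = []
    for rec in raw:
        try:
            rows.append((float(rec["x"]), float(rec["y"])))
        except ValueError:
            continue  # drop records whose fields cannot be parsed as numbers
    return rows


def ols_slope(rows):
    """Baseline model: slope b of the least-squares line y = a + b * x."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    return sxy / sxx


def run_pipeline(raw, y_cap=10.0):
    """Clean, fit the baseline, then refit once as a sensitivity check."""
    rows = clean(raw)
    # Sensitivity check (assumed choice): drop outcomes above y_cap and refit.
    trimmed = [(x, y) for x, y in rows if y <= y_cap]
    return {
        "n": len(rows),
        "baseline": ols_slope(rows),
        "n_trimmed": len(trimmed),
        "sensitivity": ols_slope(trimmed),
    }


def present(result):
    """Presentation: a one-line summary of both estimates."""
    return (
        f"baseline slope {result['baseline']:.2f} (n={result['n']}); "
        f"without large outcomes {result['sensitivity']:.2f} "
        f"(n={result['n_trimmed']})"
    )


result = run_pipeline(RAW)
print(present(result))
```

Comparing the two slopes shows how strongly a single suspicious record can drive the estimate, which is the kind of evidence a sensitivity check is meant to put in front of the audience.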

Weekly schedule:

| Week | Title | Lecture |
|---|---|---|
| 1 | What Makes Data Complex? | TBD |
| 2 | From Complex Data to Clean Data | TBD |
| 3 | Scaling Up Modeling | TBD |
| 4 | Communicating Research | TBD |
| 5 | Presentations | — |
| 6 | Short Report | — |

Each group takes one of six project variants, one per data family:

| Data family | Project brief | Example research question |
|---|---|---|
| Geospatial | projects/geospatial.md | What is the relationship between municipal land use and population composition? |
| Networks | projects/networks.md | What is the relationship between gender and cross-program relations in high school? |
| Messy web text | projects/messy_web_text.md | Do company sustainability pages differ linguistically from public-interest climate information pages? |
| Relational database | projects/relational_database.md | Which driver, constructor, grid, circuit, and season characteristics are associated with F1 finishing points? |
| Time series | projects/time_series.md | How does an fMRI signal change across NSD scan sessions? |
| API data | projects/api_data.md | Which study attributes are associated with completed versus ongoing clinical trials? |