(Figure: A taxonomy of experiment runs. See Section “Runs, Sweeps, Groups, Tags & Types”.)

Many ML research projects follow a structure of:

Create dataset → Finetune model → Evaluate on tasks → Make changes → Repeat.
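
As a rough illustration, here’s a minimal Python sketch of that loop. Every function in it is a hypothetical placeholder standing in for a project’s real data, training, and evaluation code; none of it comes from the demo repo.

```python
# A minimal sketch of the dataset -> finetune -> evaluate loop.
# All functions are hypothetical placeholders for your project's
# actual data, training, and evaluation code.

def build_dataset(spec: str) -> list[str]:
    """Placeholder: load or generate training examples."""
    return [f"{spec}-example-{i}" for i in range(3)]

def finetune(base_model: str, dataset: list[str]) -> str:
    """Placeholder: return an identifier for the finetuned checkpoint."""
    return f"{base_model}-ft-{len(dataset)}ex"

def evaluate(checkpoint: str, tasks: list[str]) -> dict[str, float]:
    """Placeholder: score the checkpoint on each eval task."""
    return {task: 0.0 for task in tasks}

# One pass through the loop; in practice you inspect the results,
# make changes (to the data, model, or hyperparameters), and repeat.
dataset = build_dataset("my-dataset-v1")
checkpoint = finetune("base-model", dataset)
results = evaluate(checkpoint, ["task-a", "task-b"])
print(results)
```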

This isn’t every project, but it’s common enough that it seems worthwhile to consolidate best practices for this sort of empirical research. I’m thinking of questions like: How should I structure my project files? How should I organize and keep track of experiment runs? How do I make my results easy to reproduce?

Best practices for these questions often evolve independently and spread via word of mouth or collaboration. Existing tools and guides address each question individually, but rarely offer a complete view.

I’d love to see an overview of a baseline workflow, the stumbling blocks to be wary of, and the useful tricks otherwise discovered only after lots of fumbling around. Here, I’ll start by sharing what I’ve found works for me, though it is by no means a perfect workflow (see the “Pain point” boxes). By sharing it, I’m hoping we can trade notes and discover solutions that make everybody’s lives easier.

Quick note: My workflow is designed for the scale of a typical academic research paper, with small teams (<5 people) and several months of work. For a weekend project you may prefer less overhead; bigger projects may benefit from more heavy-duty tooling.


Demo

To help connect the dots, I’ve prepared a demo of some of the core ideas in this post. I’ll point to relevant parts as we go along.

https://github.com/JunShern/ml_workflow

File structure