Code

Your group's code lives in your group GitHub repository. There is no separate code submission and no individual code grade — we download the repository, run it, and confirm it reproduces the work described in your Final Report.

How code is graded

We check that your code matches the work described in your Final Report and that all of the expected steps were performed. No individual grade is associated with it. If the code is severely lacking, however, we will apply a deduction to your group's Final Report grade (see the Required Models and Predictors-per-Model penalties on the Final Report Guide).

What to upload

Your repository should clearly show:

Data sources being joined. The code that merges/joins your datasets from their different sources, so we can see how your combined dataset was assembled.
Any cleaning. All preprocessing — missing-value handling, type conversions, feature engineering, etc. — so we can reproduce your exact analysis-ready data.
A notebook for each model. One well-documented Jupyter notebook (.ipynb) per model, three in total, neatly formatted with Markdown narration. We will run these.
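As an illustration of what a visible, checkable join might look like, here is a minimal sketch using pandas. The two tables, their column names, and the key `county_id` are hypothetical stand-ins; your own code would read your real sources instead. The `validate` and `indicator` arguments are one way to make the join's behavior explicit, not a requirement.

```python
import pandas as pd

# Hypothetical stand-ins for two data sources (invented for illustration).
# Real code would load these from files or APIs, e.g. with pd.read_csv.
counties = pd.DataFrame(
    {"county_id": [1, 2, 3], "population": [1200, 560, 9800]}
)
health = pd.DataFrame(
    {"county_id": [1, 2, 4], "clinics": [3, 1, 7]}
)

# Left join on the shared key.
# validate="one_to_one" raises an error if the key is duplicated on either side.
# indicator=True adds a "_merge" column showing where each row came from.
merged = counties.merge(
    health,
    on="county_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Report how many rows matched, so unmatched keys are visible to a reader.
print(merged["_merge"].value_counts())

# Drop the bookkeeping column before moving on to cleaning.
analysis = merged.drop(columns="_merge")
print(analysis)
```

Printing the match counts right after the join gives a reader a quick way to see how many rows came from each source and where missing values were introduced.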

What each model's notebook must include

Each model's notebook should walk through the full analysis, with all of these steps:

Loading, cleaning, and merging the data it uses
Exploratory Data Analysis (EDA)
Outlier screening and handling
Splitting the data into training, validation, and test sets
Variable selection (using the training and validation data)
Goodness-of-fit testing and model assumption checks
Model training, then final evaluation on the held-out test set
Statistical analysis and hypothesis testing

Models & predictors

A "model" is a specific instance of an analysis — not just a model family. Your 3 models could be MLR + Poisson + Logistic, or three flavors of MLR with different predictors and/or outcomes. Either is fine.

We expect 10+ predictors per model (a categorical variable counts as 1, regardless of its number of levels). The point is to see meaningful variable selection applied: start broad, then let the data guide you toward a parsimonious final model.
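To make the counting rule concrete, here is a small sketch assuming pandas and a hypothetical toy dataset (all column names are invented). It shows that a categorical variable adds several columns to the design matrix once it is dummy-encoded, yet still counts as a single predictor.

```python
import pandas as pd

# Hypothetical toy data: two numeric predictors and one 4-level categorical.
df = pd.DataFrame(
    {
        "income": [41.0, 52.5, 38.2, 60.1],
        "age": [34, 45, 29, 51],
        "region": ["N", "S", "E", "W"],
    }
)

# Predictors are counted from the original variables, not the encoded columns.
predictors = ["income", "age", "region"]
n_predictors = len(predictors)

# Dummy-encode the categorical variable, dropping one reference level.
design = pd.get_dummies(df[predictors], columns=["region"], drop_first=True)

print("Predictors counted:", n_predictors)
print("Design-matrix columns:", design.shape[1])
print(list(design.columns))
```

Here `region` contributes three dummy columns to the design matrix but only one predictor toward the expected count.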

Train / validation / test split

Every model must use a training, validation, and test split. Fit on the training data, use the validation set for model selection (variable selection, comparing specifications, and threshold choice for logistic models), and report final metrics on the held-out test set. This holds for MLR, Poisson, and Logistic alike — only the metrics you report differ by family.
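One possible way to build the three sets is sketched below with NumPy. The 60/20/20 proportions, the seed value, and the function name `three_way_split` are assumptions made for illustration; this guide does not prescribe specific proportions or a specific splitting tool.

```python
import numpy as np


def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Return disjoint row-index arrays for the train, validation, and test sets.

    The fractions and the seed are illustrative defaults, not course requirements.
    """
    rng = np.random.default_rng(seed)    # seeded generator: same split every run
    shuffled = rng.permutation(n_rows)   # random ordering of 0..n_rows-1

    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))

    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas DataFrame `df`, the subsets would then be `df.iloc[train_idx]`, `df.iloc[val_idx]`, and `df.iloc[test_idx]`. Splitting once, before any variable selection, keeps the test rows untouched until the final evaluation.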

Make it runnable

Set random seeds wherever your code involves randomness (for example, when splitting the data) so we can reproduce your results, and list the packages your notebooks depend on. We should be able to open each notebook, run it top to bottom, and arrive at the results you report.
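A first notebook cell along the following lines is one way to cover both points. This is a sketch only: the seed value, the helper name `report_versions`, and the package list are assumptions, so adjust them to what your notebooks actually use.

```python
import random
import sys
from importlib import metadata

import numpy as np

SEED = 2024  # arbitrary illustrative value; any fixed integer works

# Seed the generators this notebook relies on.
random.seed(SEED)
np.random.seed(SEED)               # legacy global NumPy generator
rng = np.random.default_rng(SEED)  # preferred: pass this generator around explicitly


def report_versions(packages):
    """Return a dict of installed versions so graders can match your environment."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Hypothetical package list; replace with your actual dependencies.
print(report_versions(["numpy", "pandas", "statsmodels"]))
```

Placing this cell at the top means a top-to-bottom run starts from the same random state and records the environment it ran in.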