ISYE 6414
ISYE 6414 Final Project

Datasets

Key Definitions

Dataset: A collection of data, typically a single file or table (e.g., a CSV file with unemployment rates by state).
Data Source: The organization or entity that aggregates, assembles, and publishes data (e.g., the Bureau of Labor Statistics).

Requirements

  • Your group needs at least 3 datasets from at least 3 different data sources, joined together
    Multiple datasets from the same organization count as only 1 source.
  • At least 1 dataset must have 10,000+ rows before filtering
    • This is the "core" dataset for your analysis.
    • Filter and clean it thoughtfully, but retain at least a few thousand rows afterward — you need enough data to split into training, validation, and test sets and still model meaningfully.
    • Datasets you join don't need similar cardinality — joining a smaller reference table (e.g., country-level median income with <1,000 rows) to your core dataset is fine.
    • You may aggregate granular data (e.g., school district → state) for your analysis.
  • At least 10 predictors available per model
    Your combined data must support 10+ predictors per model (a categorical variable counts as 1 regardless of its number of levels) so you can perform meaningful variable selection.
  • Don't worry if your predictors turn out to be weak
    You're graded on your analysis quality, not on finding strong correlations.
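
The checklist above can be sanity-checked with a few lines of code before a group commits to its data. The following is only a minimal sketch using the Python standard library: the `DatasetInfo` class, the `check_requirements` function, the predictor names, and the row counts for the two smaller datasets are all hypothetical and not part of any course-provided tooling.

```python
from dataclasses import dataclass

# Thresholds taken from the Requirements list above.
MIN_SOURCES = 3
MIN_CORE_ROWS = 10_000
MIN_PREDICTORS = 10


@dataclass
class DatasetInfo:
    """Summary of one dataset (hypothetical schema, for illustration only)."""
    name: str
    source: str                 # publishing organization
    rows_before_filtering: int


def check_requirements(datasets, predictors):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Multiple datasets from the same organization count as one source.
    sources = {d.source.strip().lower() for d in datasets}
    if len(sources) < MIN_SOURCES:
        problems.append(f"{len(sources)} distinct source(s), need {MIN_SOURCES}")

    # At least one dataset (the core) must have 10,000+ rows before filtering.
    largest = max((d.rows_before_filtering for d in datasets), default=0)
    if largest < MIN_CORE_ROWS:
        problems.append(f"largest dataset has {largest} rows, need {MIN_CORE_ROWS}")

    # One name per predictor: a categorical variable is listed once,
    # regardless of how many levels it has.
    if len(set(predictors)) < MIN_PREDICTORS:
        problems.append(f"{len(set(predictors))} predictor(s), need {MIN_PREDICTORS}")

    return problems


# Made-up example; only the 50,000-row figure mirrors the example on this page.
datasets = [
    DatasetInfo("Housing prices", "Zillow Research", 50_000),
    DatasetInfo("Unemployment rates", "Bureau of Labor Statistics", 2_400),
    DatasetInfo("Population data", "US Census Bureau", 900),
]
predictors = [
    "unemployment_rate", "population", "region", "month", "bedrooms",
    "bathrooms", "square_feet", "year_built", "property_type", "lot_size",
]
print(check_requirements(datasets, predictors))  # prints []
```

Passing these checks says nothing about analysis quality; it only confirms that the counts line up with the stated minimums.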

Rules of Thumb

  1. The problem shouldn't be trivial — your core dataset should be large enough with a decent number of predictors such that you're performing variable selection.
  2. The datasets you join to your core dataset should add meaningful information (but even just 1 additional predictor is fine).
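
The Requirements ask that enough rows survive filtering to split into training, validation, and test sets. As a rough illustration of what that split looks like, here is a minimal sketch using only the standard library. The 60/20/20 proportions, the fixed seed, and the `split_indices` name are assumptions for the example, not course requirements.

```python
import random


def split_indices(n_rows, train_pct=60, val_pct=20, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists.

    Percentages are integers so the cut points are computed exactly;
    whatever remains after train and validation goes to the test set.
    """
    if train_pct <= 0 or val_pct <= 0 or train_pct + val_pct >= 100:
        raise ValueError("train_pct and val_pct must be positive and sum to < 100")

    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded for reproducibility

    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# With 4,000 rows left after cleaning, each held-out set still has 800 rows.
train, val, test = split_indices(4_000)
print(len(train), len(val), len(test))  # prints 2400 800 800
```

If filtering leaves only a few hundred rows, the validation and test sets shrink to the point where model comparisons become noisy, which is the motivation for keeping a few thousand rows.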

Visual: The 3 Data Sources Requirement

flowchart TB
    subgraph Source1["<b>Data Source 1</b>"]
        D1["Dataset A<br/>(Employment Data)<br/><i>e.g., Bureau of Labor Statistics</i>"]
    end
    subgraph Source2["<b>Data Source 2</b>"]
        D2["Dataset B<br/>(Demographics)<br/><i>e.g., Census Bureau</i>"]
    end
    subgraph Source3["<b>Data Source 3</b>"]
        D3["Dataset C<br/>(Education Stats)<br/><i>e.g., Dept. of Education</i>"]
    end
    D1 --> JOIN["Combined Dataset<br/>(merged on shared keys)"]
    D2 --> JOIN
    D3 --> JOIN
    JOIN --> ANALYSIS["Your Individual Analysis"]
    style Source1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Source2 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style Source3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style JOIN fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    style ANALYSIS fill:#fce4ec,stroke:#c2185b,stroke-width:2px

Example

A Group's Combined Dataset

Your group assembles one combined dataset by joining at least 3 datasets drawn from at least 3 different data sources. For example:

Dataset                         Data Source                   Role
Housing prices (50,000+ rows)   Zillow Research               Core dataset
Unemployment rates              Bureau of Labor Statistics    Joined by region + date
Population data                 US Census Bureau              Joined by region
✓ Valid: 3 datasets from 3 distinct sources, joined, with a 10,000+ row core dataset
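
To make the two joins in the table concrete, here is a minimal sketch of a left join onto the core dataset using only the Python standard library. The column names, region codes, values, and the `index_by` and `left_join` helpers are all hypothetical; real files will need their own key cleaning, and a dataframe library would normally do this work.

```python
# Toy rows standing in for the three datasets; every value here is made up.
housing = [
    {"region": "R1", "date": "2023-01", "price": 320_000},
    {"region": "R1", "date": "2023-02", "price": 322_500},
    {"region": "R2", "date": "2023-01", "price": 298_000},
]
unemployment = [
    {"region": "R1", "date": "2023-01", "unemployment_rate": 3.1},
    {"region": "R1", "date": "2023-02", "unemployment_rate": 3.0},
    {"region": "R2", "date": "2023-01", "unemployment_rate": 4.2},
]
population = [
    {"region": "R1", "population": 1_000_000},
    {"region": "R2", "population": 2_500_000},
]


def index_by(rows, key_fields):
    """Map each key tuple to its row, rejecting duplicate keys."""
    lookup = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key in lookup:
            raise ValueError(f"duplicate key {key} in reference table")
        lookup[key] = row
    return lookup


def left_join(core, other, key_fields):
    """Attach the non-key columns of `other` to every core row.

    Core rows with no match keep their place and get None for the new
    columns, so the core row count never changes.
    """
    lookup = index_by(other, key_fields)
    extra = [f for f in other[0] if f not in key_fields] if other else []
    joined = []
    for row in core:
        match = lookup.get(tuple(row[f] for f in key_fields))
        merged = dict(row)
        for field in extra:
            merged[field] = match[field] if match else None
        joined.append(merged)
    return joined


combined = left_join(housing, unemployment, ["region", "date"])
combined = left_join(combined, population, ["region"])
print(len(combined))  # prints 3
print(combined[0])
```

Rejecting duplicate keys in the reference table is a deliberate choice in this sketch: a duplicated key would silently multiply or overwrite core rows, which is a common way for a join to distort the row count of the core dataset.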