Train/Validation/Test Split Calculator
Calculate exact sample counts for each ML data split
Before training a machine learning model, you need to divide your dataset into subsets: a training set to fit the model, a validation set to tune hyperparameters and monitor overfitting, and a test set to provide an unbiased final performance estimate. This calculator shows exactly how many samples go into each split based on your dataset size and chosen percentages.
How to Use This Calculator
- Total samples — enter the number of rows or examples in your full dataset.
- Train % — the percentage reserved for training (commonly 60–80%).
- Validation % — the percentage for hyperparameter tuning (commonly 10–20%).
- Test % — the percentage for final evaluation (commonly 10–20%).
- The three percentages must sum to exactly 100. The calculator shows the sample count for each split, plus how many rows remain after integer rounding.
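The arithmetic behind the calculator can be sketched in a few lines. This is an illustrative sketch, not the calculator's actual source: it assumes each split is floored to a whole number of samples, with any remainder reported separately (the exact rounding rule used by the calculator is an assumption here).

```python
def split_counts(total, train_pct, val_pct, test_pct):
    """Compute per-split sample counts, flooring each split and tracking the leftover rows."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to exactly 100")
    n_train = total * train_pct // 100  # floor division, so counts are whole samples
    n_val = total * val_pct // 100
    n_test = total * test_pct // 100
    leftover = total - (n_train + n_val + n_test)  # rows lost to integer rounding
    return n_train, n_val, n_test, leftover

print(split_counts(10_001, 70, 15, 15))  # (7000, 1500, 1500, 1)
```

Note that with an awkward total (10,001 samples at 70/15/15), one row is left over; tools typically assign such leftovers to the training set.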
Common Split Ratios
70/15/15 — balanced split, suitable for medium datasets (10,000–500,000 samples).
80/10/10 — more training data, useful when data is limited (1,000–10,000 samples).
60/20/20 — larger validation and test sets, suitable when reliable metrics matter more than training data volume.
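In practice a three-way split is often produced by shuffling the data once and slicing it at the computed boundaries (libraries such as scikit-learn split two ways at a time, so a 70/15/15 split takes two calls). A minimal stdlib sketch, assuming a fixed seed for reproducibility:

```python
import random

def three_way_split(n_samples, train_pct=70, val_pct=15, seed=0):
    """Shuffle sample indices, then carve out train / validation / test contiguously."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # seeded shuffle so the split is reproducible
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = three_way_split(1000)
print(len(train), len(val), len(test))  # 700 150 150
```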
Why Three Data Splits?
Using only two splits (train/test) creates a subtle problem: if you tune hyperparameters based on test set performance, the test set becomes part of the training process and can no longer provide an unbiased performance estimate. A dedicated validation set absorbs all hyperparameter tuning decisions, keeping the test set pristine until the very end.
When to Use Cross-Validation Instead
For small datasets (under ~5,000 samples), a single validation split may be too noisy. k-fold cross-validation partitions the training data into k equal folds, training k models and averaging the validation scores. The test set is still held out. This uses data more efficiently but requires more computation.
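The fold mechanics can be sketched directly: each sample lands in exactly one validation fold, and the remaining samples form that fold's training set. This is a minimal sketch over ordered indices (real tools, e.g. scikit-learn's `KFold`, also offer shuffling):

```python
def kfold_indices(n_samples, k=5):
    """Yield k (train_idx, val_idx) pairs covering all samples exactly once as validation."""
    # Earlier folds absorb the remainder when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, k=5))
print([len(val) for _, val in folds])  # [2, 2, 2, 2, 2]
```

Each of the k models trains on roughly (k-1)/k of the data, which is why this uses small datasets more efficiently than a single fixed validation split.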
Stratified Splitting
If your dataset has class imbalance (e.g., 95% negative, 5% positive), a random split may by chance leave the validation or test set with few or no rare-class examples, making their metrics meaningless. Stratified splitting preserves the class ratio in each split. Always stratify for classification tasks with imbalanced labels.
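The idea can be sketched by splitting each class separately and recombining, which is what `stratify=` in scikit-learn's `train_test_split` does for you. A stdlib sketch (function name and seed are illustrative choices, not a library API):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_pct=20, seed=0):
    """Split indices so each class appears in the test set at (roughly) its overall ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)  # group sample indices by class label
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = len(idx) * test_pct // 100  # take test_pct of *this class*
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

labels = [0] * 95 + [1] * 5  # 95% negative, 5% positive
train_idx, test_idx = stratified_split(labels, test_pct=20)
print(sum(labels[i] for i in test_idx))  # 1 positive in the test set (5% of 20)
```

With a plain random 20% split, there is a real chance the single-digit positive class never reaches the test set; the per-class split guarantees it does.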