Train/Validation/Test Split Calculator

Calculate exact sample counts for each ML data split

Before training a machine learning model, you need to divide your dataset into subsets: a training set to fit the model, a validation set to tune hyperparameters and monitor overfitting, and a test set to provide an unbiased final performance estimate. This calculator shows exactly how many samples go into each split based on your dataset size and chosen percentages.

How to Use This Calculator

  1. Total samples — enter the number of rows or examples in your full dataset.
  2. Train % — the percentage reserved for training (commonly 60–80%).
  3. Validation % — the percentage for hyperparameter tuning (commonly 10–20%).
  4. Test % — the percentage for final evaluation (commonly 10–20%).
  5. The three percentages must sum to exactly 100. The calculator shows the sample count for each split, plus how many rows remain after integer rounding.
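The steps above can be sketched in a few lines. This is an assumed reconstruction of the calculator's logic (flooring each split to a whole number and reporting the leftover rows as a remainder), not its actual source code:

```python
def split_counts(total, train_pct, val_pct, test_pct):
    """Whole-sample counts per split; leftover rows after
    integer flooring are returned as the remainder."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to exactly 100")
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total * test_pct // 100
    remainder = total - (train + val + test)
    return train, val, test, remainder

# 10,000 samples at 70/15/15
print(split_counts(10_000, 70, 15, 15))  # (7000, 1500, 1500, 0)
```

With a total that does not divide evenly, e.g. 1,001 samples at 70/15/15, the result is (700, 150, 150) with 1 remainder row.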

Common Split Ratios

70/15/15 — balanced split, suitable for medium datasets (10,000–500,000 samples).
80/10/10 — more training data, useful when data is limited (1,000–10,000 samples).
60/20/20 — larger validation and test sets, suitable when reliable metrics matter more than training data volume.

Why Three Data Splits?

Using only two splits (train/test) creates a subtle problem: if you tune hyperparameters based on test set performance, the test set becomes part of the training process and can no longer serve as an unbiased estimator. A dedicated validation set absorbs all hyperparameter tuning decisions, keeping the test set pristine until the very end.

When to Use Cross-Validation Instead

For small datasets (under ~5,000 samples), a single validation split may be too noisy. k-fold cross-validation partitions the training data into k equal folds, training k models and averaging the validation scores. The test set is still held out. This uses data more efficiently but requires more computation.
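A minimal pure-Python sketch of how k-fold index generation works (the helper name `kfold_indices` is illustrative; in practice you would use a library implementation):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs: each of the k folds
    serves once as the validation set."""
    # Distribute any remainder across the first n_samples % k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, 5))
# 5 folds; e.g. the first uses samples 0-1 for validation, 2-9 for training
```

Each sample appears in exactly one validation fold, so all of the training data contributes to validation scores.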

Stratified Splitting

If your dataset has class imbalance (e.g., 95% negative, 5% positive), a purely random split can leave few or even no rare-class examples in the validation and test sets. Stratified splitting preserves the class ratio in each split. Always stratify for classification tasks with imbalanced labels.
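The idea can be sketched in pure Python for a two-way split (extend analogously for train/validation/test); the function name and rounding rule here are illustrative assumptions:

```python
import random

def stratified_split(labels, test_frac, seed=0):
    """Return (train_idx, test_idx) preserving per-class ratios:
    each class is shuffled and split by test_frac independently."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = [0] * 95 + [1] * 5      # 95% negative, 5% positive
train_idx, test_idx = stratified_split(labels, 0.2)
# the 20-sample test set contains exactly one positive (5% of 20)
```

Because each class is split independently, the rare class cannot vanish from a split by chance.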

Frequently Asked Questions

Why do I need three data splits?
Training data fits the model. Validation data tunes hyperparameters — if you make model decisions based on it, you have implicitly "trained" on it. The test set evaluates the final model with no prior influence. Using two splits (train/test) without a validation set leads to overfitting to the test set through hyperparameter tuning.
What is the difference between the validation set and the test set?
The validation set is used during model development: you try different architectures, learning rates, and regularization methods and pick the best based on validation performance. The test set is used exactly once — after all decisions are made — to report the final unbiased evaluation metric.
What split ratio should I use?
For large datasets (100k+ samples), 80/10/10 or 90/5/5 is fine — 5% still gives thousands of examples for validation and test. For medium datasets, 70/15/15 or 80/10/10 balances training signal and evaluation reliability. For small datasets (<5k), consider k-fold cross-validation instead of a fixed split.
Does the order of splitting matter?
For i.i.d. data, no — a random shuffle before splitting is fine. For time-series data, always split chronologically (earliest data for training, most recent for testing) to avoid data leakage. Never random-shuffle time series before splitting.
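For the time-series case, a chronological split is just slicing with no shuffle. A minimal sketch, assuming the rows are already sorted by timestamp:

```python
def chronological_split(rows, train_pct, val_pct):
    """Earliest rows for training, most recent for testing;
    no shuffling, to avoid temporal leakage."""
    n = len(rows)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = chronological_split(list(range(10)), 70, 15)
# train = [0..6], val = [7], test = [8, 9]
```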
What does "remainder rows" mean in the output?
Since samples are whole numbers, integer rounding means train + validation + test may not add up to the total exactly. The remainder (typically 0–2 rows) can be added to the training set.
Should I normalize data before or after splitting?
After splitting. Compute normalization parameters (mean, std, min, max) on the training set only, then apply them to validation and test sets. Fitting the scaler on all data before splitting leaks test statistics into the training process.
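The fit-on-train-only pattern looks like this (a sketch using stdlib standardization; library scalers such as scikit-learn's follow the same fit/transform split):

```python
import statistics

def fit_scaler(train_col):
    """Compute mean and std on the training column ONLY."""
    mu = statistics.fmean(train_col)
    sigma = statistics.pstdev(train_col) or 1.0  # guard constant columns
    return mu, sigma

def transform(col, mu, sigma):
    """Apply the training-set parameters to any split."""
    return [(x - mu) / sigma for x in col]

train = [1.0, 2.0, 3.0, 4.0]
mu, sigma = fit_scaler(train)              # fit on train only
test_scaled = transform([5.0], mu, sigma)  # reuse the same mu/sigma
```

Fitting before splitting would fold the test set's mean and std into `mu` and `sigma`, quietly leaking information the model should never see.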
What is stratified splitting?
Stratified splitting preserves the proportion of each class label in every split. If 5% of your data is class A, stratified splitting ensures 5% of train, validation, and test sets are also class A. This is critical for imbalanced classification tasks.
Is a validation set always needed?
Not always. If you use k-fold cross-validation, separate validation splits are built into the procedure. If you use a fixed algorithm with no hyperparameters (e.g., linear regression with OLS), a validation set adds less value and a simple train/test split may suffice.