Train/Validation/Test Split Calculator

Calculate exact sample counts for each ML data split

Before training a machine learning model, you need to divide your dataset into subsets: a training set to fit the model, a validation set to tune hyperparameters and monitor overfitting, and a test set to provide an unbiased final performance estimate. This calculator shows exactly how many samples go into each split based on your dataset size and chosen percentages.

How to Use This Calculator

  1. Total samples — enter the number of rows or examples in your full dataset.
  2. Train % — the percentage reserved for training (commonly 60–80%).
  3. Validation % — the percentage for hyperparameter tuning (commonly 10–20%).
  4. Test % — the percentage for final evaluation (commonly 10–20%).
  5. The three percentages must sum to exactly 100. The calculator shows the sample count for each split, plus how many rows remain after integer rounding.
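The steps above can be sketched in a few lines. This is an assumed reconstruction of the calculator's logic (flooring each split to a whole number and reporting the leftover rows as a remainder), not its actual source code:

```python
def split_counts(total, train_pct, val_pct, test_pct):
    """Whole-sample counts per split; leftover rows after
    integer flooring are returned as the remainder."""
    if train_pct + val_pct + test_pct != 100:
        raise ValueError("percentages must sum to exactly 100")
    train = total * train_pct // 100
    val = total * val_pct // 100
    test = total * test_pct // 100
    remainder = total - (train + val + test)
    return train, val, test, remainder

# 10,000 samples at 70/15/15
print(split_counts(10_000, 70, 15, 15))  # (7000, 1500, 1500, 0)
```

With a total that does not divide evenly, e.g. 1,001 samples at 70/15/15, the result is (700, 150, 150) with 1 remainder row.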

Common Split Ratios

70/15/15 — balanced split, suitable for medium datasets (10,000–500,000 samples).
80/10/10 — more training data, useful when data is limited (1,000–10,000 samples).
60/20/20 — larger validation and test sets, suitable when reliable metrics matter more than training data volume.

Why Three Data Splits?

Using only two splits (train/test) creates a subtle problem: if you tune hyperparameters based on test set performance, the test set becomes part of the training process and can no longer serve as an unbiased estimator. A dedicated validation set absorbs all hyperparameter tuning decisions, keeping the test set pristine until the very end.

When to Use Cross-Validation Instead

For small datasets (under ~5,000 samples), a single validation split may be too noisy. k-fold cross-validation partitions the training data into k equal folds, training k models and averaging the validation scores. The test set is still held out. This uses data more efficiently but requires more computation.
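A minimal pure-Python sketch of how k-fold index generation works (the helper name `kfold_indices` is illustrative; in practice you would use a library implementation):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs: each of the k folds
    serves once as the validation set."""
    # Distribute any remainder across the first n_samples % k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10, 5))
# 5 folds; e.g. the first uses samples 0-1 for validation, 2-9 for training
```

Each sample appears in exactly one validation fold, so all of the training data contributes to validation scores.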

Stratified Splitting

If your dataset has class imbalance (e.g., 95% negative, 5% positive), a purely random split can leave few or even no rare-class examples in the validation and test sets. Stratified splitting preserves the class ratio in each split. Always stratify for classification tasks with imbalanced labels.
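The idea can be sketched in pure Python for a two-way split (extend analogously for train/validation/test); the function name and rounding rule here are illustrative assumptions:

```python
import random

def stratified_split(labels, test_frac, seed=0):
    """Return (train_idx, test_idx) preserving per-class ratios:
    each class is shuffled and split by test_frac independently."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

labels = [0] * 95 + [1] * 5      # 95% negative, 5% positive
train_idx, test_idx = stratified_split(labels, 0.2)
# the 20-sample test set contains exactly one positive (5% of 20)
```

Because each class is split independently, the rare class cannot vanish from a split by chance.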

Frequently Asked Questions

Why do I need three data splits?
Training data fits the model. Validation data tunes hyperparameters — if you make model decisions based on it, you have implicitly "trained" on it. The test set evaluates the final model with no prior influence. Using two splits (train/test) without a validation set leads to overfitting to the test set through hyperparameter tuning.
What is the difference between the validation set and the test set?
The validation set is used during model development: you try different architectures, learning rates, and regularization methods and pick the best based on validation performance. The test set is used exactly once — after all decisions are made — to report the final unbiased evaluation metric.
What split ratio should I use?
For large datasets (100k+ samples), 80/10/10 or 90/5/5 is fine — 5% still gives thousands of examples for validation and test. For medium datasets, 70/15/15 or 80/10/10 balances training signal and evaluation reliability. For small datasets (<5k), consider k-fold cross-validation instead of a fixed split.
Does the order of splitting matter?
For i.i.d. data, no — a random shuffle before splitting is fine. For time-series data, always split chronologically (earliest data for training, most recent for testing) to avoid data leakage. Never random-shuffle time series before splitting.
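For the time-series case, a chronological split is just slicing with no shuffle. A minimal sketch, assuming the rows are already sorted by timestamp:

```python
def chronological_split(rows, train_pct, val_pct):
    """Earliest rows for training, most recent for testing;
    no shuffling, to avoid temporal leakage."""
    n = len(rows)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

train, val, test = chronological_split(list(range(10)), 70, 15)
# train = [0..6], val = [7], test = [8, 9]
```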
What does "remainder rows" mean in the output?
Since samples are whole numbers, integer rounding means train + validation + test may not add up to the total exactly. The remainder (typically 0–2 rows) can be added to the training set.
Should I normalize data before or after splitting?
After splitting. Compute normalization parameters (mean, std, min, max) on the training set only, then apply them to validation and test sets. Fitting the scaler on all data before splitting leaks test statistics into the training process.
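The fit-on-train-only pattern looks like this (a sketch using stdlib standardization; library scalers such as scikit-learn's follow the same fit/transform split):

```python
import statistics

def fit_scaler(train_col):
    """Compute mean and std on the training column ONLY."""
    mu = statistics.fmean(train_col)
    sigma = statistics.pstdev(train_col) or 1.0  # guard constant columns
    return mu, sigma

def transform(col, mu, sigma):
    """Apply the training-set parameters to any split."""
    return [(x - mu) / sigma for x in col]

train = [1.0, 2.0, 3.0, 4.0]
mu, sigma = fit_scaler(train)              # fit on train only
test_scaled = transform([5.0], mu, sigma)  # reuse the same mu/sigma
```

Fitting before splitting would fold the test set's mean and std into `mu` and `sigma`, quietly leaking information the model should never see.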
What is stratified splitting?
Stratified splitting preserves the proportion of each class label in every split. If 5% of your data is class A, stratified splitting ensures 5% of train, validation, and test sets are also class A. This is critical for imbalanced classification tasks.
Is a validation set always needed?
Not always. If you use k-fold cross-validation, separate validation splits are built into the procedure. If you use a fixed algorithm with no hyperparameters (e.g., linear regression with OLS), a validation set adds less value and a simple train/test split may suffice.