We’re creating a free, easy-to-use tool to help researchers calculate how much data they need to build accurate and reliable prediction models using machine learning and long-term health records. Developed in partnership with patients, carers, and charities, the tool will support safer, fairer, and more effective use of data in healthcare.

Prediction models are already helping the NHS personalise care, plan treatments, and prevent illness before it starts. But many are built on too little data, leading to inaccurate results—especially for underrepresented groups. Our project addresses this problem by providing practical guidance on the data needed to develop models that are both powerful and fair.

This project is led by a team at King’s College London and funded by the National Institute for Health and Care Research as part of the Research for Patient Benefit Programme. To learn more, see below.

What is the ‘pmsims’ project about?

  • NHS services use prediction tools to help make decisions about patient care. These tools rely on data to estimate what might happen in the future, like someone’s risk of developing a health condition.

  • However, if the tools are built using too little data, their predictions can be inaccurate or unfair, particularly for people from underrepresented communities.

  • We’re building a free, easy-to-use tool to help researchers estimate how much data they need to build accurate and fair prediction models—including newer approaches that use complex methods like machine learning.

  • We’re working with patients and the public to understand their views, ensure accessibility, and raise awareness of the risks associated with using too little information. Our goal is to improve the fairness and accuracy of prediction tools used in healthcare.

How do prediction models work?

  • Prediction modelling uses statistics or computer algorithms to estimate what might happen in the future. More and more of these models are being created as larger datasets and better tools become available.

  • For example, the QRISK tool estimates a person’s risk of developing heart disease in the next 10 years. Doctors use it to decide who should be offered treatment to help prevent heart problems. A small, purely illustrative sketch of this kind of model follows this list.
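
For readers curious about what such a model looks like in practice, here is a minimal sketch in Python using scikit-learn. Everything in it is invented for illustration: the data are simulated, and the predictors and their effects have no connection to the real QRISK equations.

```python
# Toy illustration of a risk prediction model. The data, predictors and
# effect sizes below are entirely made up; this is NOT how QRISK works.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5_000

# Simulate artificial "patient records": age, systolic blood pressure, smoking.
age = rng.uniform(30, 80, n)
systolic_bp = rng.normal(130, 15, n)
smoker = rng.integers(0, 2, n)

# An invented relationship used to generate "heart disease within 10 years".
log_odds = -9.0 + 0.07 * age + 0.02 * systolic_bp + 0.7 * smoker
event_within_10y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

# Fit a simple logistic regression to the simulated records.
X = np.column_stack([age, systolic_bp, smoker])
model = LogisticRegression(max_iter=1000).fit(X, event_within_10y)

# Predicted 10-year risk for a hypothetical 60-year-old smoker with BP 140.
new_patient = np.array([[60.0, 140.0, 1.0]])
print(f"Estimated 10-year risk: {model.predict_proba(new_patient)[0, 1]:.1%}")
```

The model learns how age, blood pressure and smoking relate to the outcome in the training data, and then turns a new person’s details into an estimated probability.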

Why sample size matters in prediction

  • Many prediction models are developed using too little data. When this happens, the model can become too focused on the specific details of the sample it was trained on, rather than learning general patterns that apply more widely. This is known as overfitting. It means the model might appear accurate during development but make unreliable or unfair predictions in real-world settings (see the sketch after this list).

  • Small samples are especially problematic when making predictions for minoritised groups, as limited representation can lead to inaccurate and unfair predictions, potentially resulting in incorrect or even harmful treatment recommendations.
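
The short simulation below, mentioned in the list above, demonstrates the problem in miniature. It uses scikit-learn and artificial data, and is a generic illustration rather than the project’s own analysis: a flexible model is trained on samples of different sizes, and its apparent accuracy on its own training data is compared with its accuracy on new, unseen records.

```python
# Illustration of overfitting: a flexible model trained on too few records
# looks excellent on its own training data but generalises poorly.
# All data here are simulated and the settings are arbitrary.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate(n):
    """Simulate n records with 20 predictors and a binary outcome."""
    X = rng.normal(size=(n, 20))
    # Only the first three predictors genuinely matter.
    log_odds = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2]
    y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
    return X, y

X_new, y_new = simulate(20_000)  # large set of "new patients"

for n_train in (50, 500, 5_000):
    X_train, y_train = simulate(n_train)
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_new = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
    print(f"trained on {n_train:5d} records: "
          f"AUC on training data = {auc_train:.2f}, on new patients = {auc_new:.2f}")
```

In runs of this kind, accuracy on the training data looks near-perfect at every size, while accuracy on new records improves only as the sample grows. That gap between apparent and real-world performance is exactly what catches out models built on too little data.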

How will this project help?

An essential first step in building a model is estimating how much data is needed.

  • We’re creating a simple, user-friendly tool that estimates the minimum amount of data needed for different types of prediction models. It does this by generating synthetic (artificial) data of different sizes and testing how well models perform with each one, as sketched below.

  • This will help researchers decide how complex their model should be and help prevent models from being developed without enough data, which can lead to inaccurate or unfair predictions and wasted research effort.
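
To give a rough sense of what such a tool does internally, the sketch below runs the kind of simulation described above: it generates artificial datasets of increasing size, fits a candidate model to each, and checks performance against a chosen target on a large held-out synthetic dataset. The predictors, the candidate model and the performance target are all assumptions made for illustration; this shows the general idea, not the pmsims software itself.

```python
# Sketch of simulation-based sample size planning: try datasets of
# increasing size and see when the fitted model reaches a chosen target.
# All settings below are illustrative assumptions, not the pmsims tool.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def synthetic_data(n, n_predictors=10):
    """Generate an artificial dataset loosely mimicking the target setting."""
    X = rng.normal(size=(n, n_predictors))
    log_odds = -1.0 + X[:, :3] @ np.array([0.8, 0.5, -0.5])
    y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))
    return X, y

X_val, y_val = synthetic_data(50_000)  # large synthetic validation set
target_auc = 0.70                      # assumed performance target

results = {}
for n in (100, 250, 500, 1_000, 2_500, 5_000):
    # Repeat each simulation to smooth out random variation.
    aucs = []
    for _ in range(20):
        X_dev, y_dev = synthetic_data(n)
        model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
        aucs.append(roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))
    results[n] = float(np.mean(aucs))
    print(f"n = {n:5d}: mean validation AUC = {results[n]:.3f}")

adequate = [n for n, auc in results.items() if auc >= target_auc]
if adequate:
    print(f"Smallest simulated size meeting the target: {min(adequate)} records")
else:
    print("No simulated size met the target; larger samples would be needed.")
```

In practice, the performance target, the candidate models and the way the synthetic data are generated would all be tailored to the researcher’s own setting.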

Get in touch

We’re always keen to hear from you. Whether you have questions about the project, want to learn more, or are interested in collaborating, please feel free to get in touch.

Join our mailing list

If you’re interested in how prediction models work, why data size matters, or how this tool could support better, fairer research, you can sign up for occasional updates.

You’ll receive no more than four emails per year, and you can unsubscribe at any time with a single click.