Developing the optimal model for identifying colorectal cancer stage in claims

A simple model may be superior when it comes to interpretability and performance. 


Domain(s): Oncology, quality of care


Summary

Background

Cancer stage is an important factor in research and clinical care because it impacts treatment choices and survival. However, administrative claims lack explicit codes on cancer stage. The aim of this study was to generate models to identify patients with stage IV (metastatic) colorectal cancer (CRC) using claims data.
 

Methods

The study population included U.S. adults with commercial health insurance from the Healthcare Integrated Research Database (HIRD®) who were first diagnosed with CRC between July 2016 and April 2023. As part of a cancer care quality program (CCQP), individuals' oncologists recorded cancer stage, providing a reliable "gold standard" for the study.

Researchers developed seven statistical models to predict whether a patient had stage IV CRC (versus stages I-III). One was based on metastatic cancer diagnosis codes alone; the others added patient age and clinical variables such as other health conditions, symptoms, tests done, other diagnoses, doctor visits, and treatments received. These analyses estimated the importance of each factor in predicting a patient's cancer stage, using both simpler statistical methods such as logistic regression and more advanced machine learning models (elastic net, random forest, gradient boosted, and SuperLearner ensemble).

Models were developed using data from a random sample of 75% of patients (a "training" dataset) and then tested on the remaining 25%. Researchers assessed model performance using sensitivity (the proportion of true cases identified by the model), specificity (the proportion of non-cases excluded by the model), and the receiver operating characteristic (ROC) curve, a graph that visualizes model accuracy. The area under the ROC curve (AUC), a score from 0 to 1, was then calculated, with higher scores indicating better performance in distinguishing stage IV cases from stages I-III.
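The evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the study's code: the function names, the 0.5 decision threshold, the fixed random seed, and the toy labels and scores are all assumptions made for the example.

```python
import random

def split_train_test(records, train_fraction=0.75, seed=0):
    """Randomly assign records to a training set and a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def sensitivity_specificity(labels, scores, threshold=0.5):
    """labels: 1 = stage IV, 0 = stages I-III; scores: predicted probabilities."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    """Chance that a random stage IV case outscores a random non-case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data, not study results.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

train, test = split_train_test(range(100))
sens, spec = sensitivity_specificity(labels, scores)
print(len(train), len(test), sens, spec, auc(labels, scores))
```

The pairwise definition of AUC used here is equivalent to the area under the ROC curve and keeps the sketch dependency-free; in practice a statistics library would normally compute it.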

Results

The cohort included 6,408 HIRD patients with CRC who had data in the CCQP (32% of the study's source population), among whom 3,498 (55%) were in stages I-III and 2,910 (45%) were in stage IV.

Within the claims data, the most important predictor of stage IV cancer was diagnosis of metastatic cancer at distant sites, with an AUC score of 0.87.
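A single yes/no claims flag, such as the presence of a metastatic diagnosis code, produces an ROC curve with only one interior point, so its AUC reduces to the average of sensitivity and specificity. The sketch below works through that identity. The function name and the sensitivity and specificity values are hypothetical, chosen only to make the arithmetic easy to follow; the study summary does not report them.

```python
def auc_binary_flag(sensitivity, specificity):
    """AUC for one binary predictor: area under the ROC curve
    through (0, 0), (1 - specificity, sensitivity), and (1, 1)."""
    fpr = 1.0 - specificity
    # Triangle from (0, 0) up to the single operating point.
    left = 0.5 * fpr * sensitivity
    # Trapezoid from the operating point across to (1, 1).
    right = 0.5 * (1.0 - fpr) * (sensitivity + 1.0)
    return left + right

# Hypothetical values, not study results.
example = auc_binary_flag(0.80, 0.90)
print(round(example, 3))
```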

Adding information on symptoms, diagnostic tests, treatments, and survival improved the ability to distinguish stage IV from other cancer stages. However, a multivariable logistic regression model performed as well as the more complex machine learning models, achieving the highest AUC score across all models (0.96).


Key Takeaways
  • Using administrative claims data with CCQP oncologist-recorded cancer stage as the gold standard, a diagnosis of metastatic cancer at distant sites was the strongest predictor of stage IV CRC (AUC score: 0.87).
  • While models that incorporated clinical variables improved the AUC by 0.09 (from 0.87 to 0.96), the simpler logistic regression model was the most easily interpretable and performed as well as the more complex machine learning models.
  • This model development process can be applied to other health outcomes to identify the most accurate and interpretable models for use in managing costs and improving care.

Publications
  • Parlett LE. Developing a Cancer Stage Model in Patients with Incident Colorectal Cancer Using Data from a Cancer Care Quality Program and Administrative Claims. Presented at ISPE 2025 in Washington, D.C.

Carelon Research project team: 
Valerie Haley, Maria I. Van Rompay, Joseph L. Smith*, Shiva Chaudhary, Kevin Schott, Michael Mack, Shiva K. Vojjala, Lauren E. Parlett
*Carelon Research associates at the time the study was conducted. 


For more information on a specific study or to connect with the Actionable Insights Committee,
contact us at [email protected].


Sponsor: Carelon Research, Inc., a subsidiary of Elevance Health.

Dissemination and sharing of the Newsletter is limited to Elevance Health and its subsidiaries, and included findings and implications are for Elevance Health and its affiliates’ internal use only.
