5 Characterization of the differences between M0 and M1

Author

Giacomo Biganzoli

I proceed with the second task. The new files I loaded are the following: L:\\GBW-0080_BC_Lab\\Data\\FAT-ILC\\Giacomo\\ILCdatabaseM0_cleaned0312.xls and L:\\GBW-0080_BC_Lab\\Data\\FAT-ILC\\Giacomo\\ILCdatabaseM1 20251209.xlsx.

I select the variables considered in the preview reports. The variable for M0 are patient_ID, method_of_detection, age_at_diagnosis, age_category, BMI, BMI_category, menopausal_status, smoking, alcohol_abuse, hypertension, hyperlipidemia, diabetes, oral_anticonceptive_use, pregnancy_P, hormone_replacement, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, neo_adjuvant_therapy, surgery_type_breast, surgery_type_axilla, TNM_pT_resection_specimen, TNM_pN_resection_specimen, diameter_pathology_resection_specimen, tumor_grade_resection_specimen, resection_margin_resection_specimen, ER_Interpretation, PR_Interpretation, HER2_Interpretation, presence_DCIS_resection_specimen, presence_LCIS_resection_specimen, radiotherapy, adjuvant_chemotherapy, adjuvant_HER2, adjuvant_endocrinetherapy, germline_mutation_testing_performed, germline_mutation_testing_result,germline_mutation_testing_year_most_recent_test, multifocality1

whereas for M1 are patient_ID, method_of_detection, age_at_diagnosis (y), age_category, BMI, BMI_category, menopausal_status, oral_anticonceptive_use,hormone_replacement, smoking, alcohol_abuse, hypertension, hyperlipidemia, diabetes, pregnancy_P, germline_mutation_testing_performed, germline_mutation_testing_result, germline_mutation_testing_year, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, tumor_grade_biopsy_breast, ER_Interpretation_biopsy_breast, PR_Interpretation_biopsy_breast, HER2_Interpreation_biopsy_breast,radiotherapy_primary, radiotherapy_1st_line_metastatic, meta_brain_nonleptomeningeal_first_metastases, meta_leptemeningeal_first_metastases, meta_bones_first_metastases, meta_skin_first_metastases, meta_lungs_first_metastases, meta_liver_first_metastases, meta_abdomen_extrahepatic_first_metastases, meta_reproductive_organs_first_metastases, meta_lymph_nodes_first_metastases, meta_other_first_metastases.

Load M0

Load M1

[1] "positive positive" "negative negative" "positive negative"

[1] <NA>     negative positive
Levels: negative positive


    HER2+ HR+/HER2-      TNBC 
       13       148        14

The variables not shared between the two files are age_at_diagnosis, neo_adjuvant_therapy, surgery_type_breast, surgery_type_axilla, TNM_pT_resection_specimen, TNM_pN_resection_specimen, diameter_pathology_resection_specimen, tumor_grade_resection_specimen, resection_margin_resection_specimen, ER_Interpretation, PR_Interpretation, HER2_Interpretation, presence_DCIS_resection_specimen, presence_LCIS_resection_specimen, radiotherapy, adjuvant_chemotherapy, adjuvant_HER2, adjuvant_endocrinetherapy, germline_mutation_testing_year_most_recent_test, multifocality.

The following variables are absent for M1 patients: neo_adjuvant_therapy, surgery_type_breast, surgery_type_axilla, TNM_pT_resection_specimen, TNM_pN_resection_specimen, diameter_pathology_resection_specimen, tumor_grade_resection_specimen, resection_margin_resection_specimen, presence_DCIS_resection_specimen, presence_LCIS_resection_specimen, adjuvant_chemotherapy, adjuvant_HER2, adjuvant_endocrinetherapy, multifocality. I homogenized the BMI categories in Underweight, Normalweight, Overweight, Obese. I created categories for pregnancy_P (0, 1, >1), and also categories for Diabetes (Ty2D, NoTy2D, Missing). There is one patient in M1 that has Diabetes but her type is not specified. for the moment she results with a missing value for that. It follows a brief description between M0 and M1 for the variables patient_ID, method_of_detection, age_at_diagnosis, age_category, BMI, BMI_category, menopausal_status, smoking, alcohol_abuse, hypertension, hyperlipidemia, diabetes, oral_anticonceptive_use, pregnancy_P, hormone_replacement, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, radiotherapy, germline_mutation_testing_performed, germline_mutation_testing_result, germline_mutation_testing_year_most_recent_test, HR, subty

	M0 (N=1367)	M1 (N=180)	Overall (N=1547)
method_of_detection
radiologically detected	548 (40.1%)	19 (10.6%)	567 (36.7%)
symptoms	774 (56.6%)	159 (88.3%)	933 (60.3%)
Missing	45 (3.3%)	2 (1.1%)	47 (3.0%)
age_at_diagnosis
Mean (SD)	61.5 (11.8)	66.3 (12.4)	62.0 (12.0)
Median [Min, Max]	61.0 [32.0, 95.0]	67.0 [33.0, 92.0]	62.0 [32.0, 95.0]
age_category
< 40	23 (1.7%)	4 (2.2%)	27 (1.7%)
40 - 49	210 (15.4%)	12 (6.7%)	222 (14.4%)
50 - 59	387 (28.3%)	39 (21.7%)	426 (27.5%)
60 - 69	397 (29.0%)	50 (27.8%)	447 (28.9%)
70 - 79	244 (17.8%)	44 (24.4%)	288 (18.6%)
≥ 80	106 (7.8%)	31 (17.2%)	137 (8.9%)
BMI
Mean (SD)	25.6 (4.85)	26.4 (5.22)	25.7 (4.89)
Median [Min, Max]	24.8 [14.9, 47.7]	25.7 [18.4, 41.6]	24.9 [14.9, 47.7]
Missing	17 (1.2%)	17 (9.4%)	34 (2.2%)
BMI_category
Underweight	29 (2.1%)	3 (1.7%)	32 (2.1%)
Normalweight	675 (49.4%)	68 (37.8%)	743 (48.0%)
Overweight	424 (31.0%)	56 (31.1%)	480 (31.0%)
Obese	222 (16.2%)	36 (20.0%)	258 (16.7%)
Missing	17 (1.2%)	17 (9.4%)	34 (2.2%)
menopausal_status
Postmenopausal	981 (71.8%)	149 (82.8%)	1130 (73.0%)
pre- and perimenopausal	344 (25.2%)	31 (17.2%)	375 (24.2%)
Missing	42 (3.1%)	0 (0%)	42 (2.7%)
smoking
active	188 (13.8%)	30 (16.7%)	218 (14.1%)
former	266 (19.5%)	20 (11.1%)	286 (18.5%)
no	912 (66.7%)	117 (65.0%)	1029 (66.5%)
Missing	1 (0.1%)	13 (7.2%)	14 (0.9%)
alcohol_abuse
no	1159 (84.8%)	144 (80.0%)	1303 (84.2%)
yes	205 (15.0%)	17 (9.4%)	222 (14.4%)
Missing	3 (0.2%)	19 (10.6%)	22 (1.4%)
hypertension
no	844 (61.7%)	96 (53.3%)	940 (60.8%)
yes	522 (38.2%)	84 (46.7%)	606 (39.2%)
Missing	1 (0.1%)	0 (0%)	1 (0.1%)
hyperlipidemia
no	1058 (77.4%)	124 (68.9%)	1182 (76.4%)
yes	309 (22.6%)	56 (31.1%)	365 (23.6%)
Ty2D
No	1283 (93.9%)	156 (86.7%)	1439 (93.0%)
Yes	80 (5.9%)	24 (13.3%)	104 (6.7%)
Missing	4 (0.3%)	0 (0%)	4 (0.3%)
oral_anticonceptive_use
active	181 (13.2%)	16 (8.9%)	197 (12.7%)
former	670 (49.0%)	61 (33.9%)	731 (47.3%)
no	427 (31.2%)	43 (23.9%)	470 (30.4%)
Missing	89 (6.5%)	60 (33.3%)	149 (9.6%)
pregnancy_P
>1	883 (64.6%)	98 (54.4%)	981 (63.4%)
0	188 (13.8%)	23 (12.8%)	211 (13.6%)
1	293 (21.4%)	50 (27.8%)	343 (22.2%)
Missing	3 (0.2%)	9 (5.0%)	12 (0.8%)
hormone_replacement
active	204 (14.9%)	16 (8.9%)	220 (14.2%)
former	167 (12.2%)	24 (13.3%)	191 (12.3%)
no	940 (68.8%)	127 (70.6%)	1067 (69.0%)
Missing	56 (4.1%)	13 (7.2%)	69 (4.5%)
familial_history_breast_ovary
no	825 (60.4%)	110 (61.1%)	935 (60.4%)
yes	533 (39.0%)	55 (30.6%)	588 (38.0%)
Missing	9 (0.7%)	15 (8.3%)	24 (1.6%)
visible_on_mammogram
no	123 (9.0%)	17 (9.4%)	140 (9.0%)
yes	1220 (89.2%)	133 (73.9%)	1353 (87.5%)
Missing	24 (1.8%)	30 (16.7%)	54 (3.5%)
TNM_cT_at_diagnosis
T1a	13 (1.0%)	1 (0.6%)	14 (0.9%)
T1b	140 (10.2%)	4 (2.2%)	144 (9.3%)
T1c	358 (26.2%)	19 (10.6%)	377 (24.4%)
T1mi	1 (0.1%)	0 (0%)	1 (0.1%)
T2	594 (43.5%)	47 (26.1%)	641 (41.4%)
T3	193 (14.1%)	43 (23.9%)	236 (15.3%)
T4a	3 (0.2%)	1 (0.6%)	4 (0.3%)
T4b	37 (2.7%)	26 (14.4%)	63 (4.1%)
T4c	2 (0.1%)	6 (3.3%)	8 (0.5%)
T4d	10 (0.7%)	22 (12.2%)	32 (2.1%)
Tis	10 (0.7%)	0 (0%)	10 (0.6%)
Tx	0 (0%)	7 (3.9%)	7 (0.5%)
Missing	6 (0.4%)	4 (2.2%)	10 (0.6%)
TNM_cN_at_diagnosis
N0	1124 (82.2%)	40 (22.2%)	1164 (75.2%)
N1	196 (14.3%)	57 (31.7%)	253 (16.4%)
N2	11 (0.8%)	19 (10.6%)	30 (1.9%)
N3a	18 (1.3%)	15 (8.3%)	33 (2.1%)
N3b	3 (0.2%)	11 (6.1%)	14 (0.9%)
N3c	6 (0.4%)	31 (17.2%)	37 (2.4%)
N3	0 (0%)	1 (0.6%)	1 (0.1%)
x	0 (0%)	1 (0.6%)	1 (0.1%)
Missing	9 (0.7%)	5 (2.8%)	14 (0.9%)
subty
HER2+	59 (4.3%)	13 (7.2%)	72 (4.7%)
HR+/HER2-	1281 (93.7%)	148 (82.2%)	1429 (92.4%)
TNBC	14 (1.0%)	14 (7.8%)	28 (1.8%)
Missing	13 (1.0%)	5 (2.8%)	18 (1.2%)

In the following representations, relative frequency of the categories of the categorical variable between M0 and M1 patients are reported. For the continuous covariates, empirical cumulative distribution function stratified by M0 and M1 are reported.

5.0.1 MCA

To perform the multivariate analysis we need complete information for every patient. This means that we have to analyse which variable is the most problematic in terms of missing values. In particular, we need to save as much as possible M1 patients since their limited number. The variables included in the matrix should be the following: age_category, BMI_category, smoking, alcohol_abuse, hypertension, hyperlipidemia, oral_anticonceptive_use, pregnancy_P, hormone_replacement, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, subty, M, Ty2D. I would exclude oral_anticonceptive_use and visible_on_mammogram. For M1 patients, alchohol_abuse is also frequently missing, but for the moment we leave it.

In the end we are left with 133 M1 patients. In the multiple correspondence analysis, actual inertia (variance explained) was recalculated considering the Benzecrie correction. The first five dimension explained 57.34, 21.03, 8.89, 4.35, 2.82, 1.58, 1.29, 0.84, 0.71, 0.54, 0.26, 0.23, 0.1, 0.03, 0.01, 0 of the variance respectively.

The following representation shows how the variables are correlated with the dimensions obtained. The first dimension is mainly represented by age, BMI, hypertension, hyperlipidemia, Ty2D, T and N. However, this latter are clearly separated in the second dimension, along with ER interpretation. Hormone replacement is clearly separated in the third dimension. However it is worth noting that the third dimension explains less than 10% of the variance.

This interactive plot represents the patients in the first three dimensions. The patients are colored depending on their M status. The cloud of M1 patients is well separated by the second and third dimension, less by the first.

In this table you can have a look of the characteristics of the patients:

Let’s have a look of patients 1389, 1267, 1280 and 1367

These are the representations of the planes identified by the MCA.

5.0.2 Models

The analysis would proceed as follow. Since the structure of correlation identified by the MCA, I would fit a model with the following structure: M~ age_category + BMI_category + TNM_cN_at_diagnosis + alcohol_abuse+pregnancy_P+TNM_cT_at_diagnosis + hypertension + hyperlipidemia + ER_Interpretation + HER2_Interpretation + smoking +Ty2D + hormone_replacement . For the moment I would leave familial_history_breast_ovary apart, since it was not separated on any of the first five dimensions. I will use grouped Lasso to see whether some coefficients of the model considered are shrunk to zero. Groupd lasso is slightly different to classical lasso because it applies a mixed form of penalization: L2 within the variable coefficients and L1 between the variable coefficients, so that if a variable, as a whole, does not contribute to classification, it is shrunk as a whole. I applied 10-fold cross validation, stratified so maintain the same ratio of cases vs non-cases in each fold, to obtain the classification error as a function of the L1 and L2 penalty to apply.

[1] "Running Stratified Cross-Validation..."

[1] "Optimal Lambda (min error): 0.000410618867831329"

[1] "Optimal Lambda (1-SE rule): 0.00472111915731369"