2 DataCheckM0
2.1 Data-check
I loaded the data from the directory L:\\GBW-0080_BC_Lab\\Data\\FAT-ILC\\Giacomo.
Table 2.1 reports the number of unknown values for each variable.
Table 2.2 summarises the available information. skim_type, skim_variable, n_missing and complete_rate give, for each variable, its type, its name, the number of missing values and the proportion of complete values. Date.min, Date.max, Date.median and Date.n_unique give the minimum, the maximum, the median and the number of unique values of the date variables. factor.n_unique and factor.top_counts give the number of unique values and the most frequent values of the categorical variables. numeric.p0, numeric.p25, numeric.p50, numeric.p75 and numeric.p100 give the percentiles of the numerical variables in the database.
As we will see later, several variables in the database have a very low complete rate. The following variables have a complete rate of 0.
[1] "ER_H_score_biopsy, PR_H_score_biopsy, HER2_FISH_biopsy, HER2_ratio_biopsy, Ki67_biopsy, ER_H_score_biopsy_2nd_lesion, PR_H_score_biopsy_2nd_lesion, HER2_FISH_biopsy_2nd_lesion, HER2_ratio_biopsy_2nd_lesion, Ki67_biopsy_2nd_lesion, neo_adjuvant_chemotherapy_scheme, neo_adjuvant_chemotherapy_BSA_used, neo_adjuvant_chemotherapy_BSA_capping, neo_adjuvant_chemotherapy_completion, neo_adjuvant_HER2_therapy_scheme, neo_adjuvant_endocrinetherapy_scheme, neo_adjuvant_endocrinetherapy_duration, neo_adjuvant_other, residual_tumorbed, lobular_subtype, ER_H_score_resection_specimen_2nd_lesion, PR_H_score_resection_specimen_2nd_lesion, HER2_FISH_resection_specimen_2nd_lesion, HER2_ratio_resection_specimen_2nd_lesion, Ki67_resection_specimen_2nd_lesion, Antibody_E.cadherin, B.catenin, p120_catenin, GEP_outcome, adjuvant_HER2_scheme, chemotherapy_1st_line_metastatic, HER2_1st_line_metastatic, endocrinetherapy_1st_line_metastatic, treatment_1st_line_other_metastatic, treatment_reduction_1st_line_metastatic, clinical_response_1st_line_metastatic, chemotherapy_2nd_line_metastatic, HER2_2nd_line_metastatic, endocrinetherapy_2nd_line_metastatic, treatment_2nd_line_other_metastatic, treatment_reduction_2nd_line_metastatic, clinical_response_2nd_line_metastatic, second_progression_distant_disease_metastatic, radiotherapy_all_metastatic, chemotherapy_number_lines_all_metastatic, HER2_number_lines_all_metastatic, endocrinetherapy_number_lines_all_metastatic, treatment_other_all_metastatic, comments"
The following variables have a complete rate above 0% but below 5%.
[1] "comorbidities, age_menarche, oral_anticonceptive_duration, fertility_treatment, age_last_pregnancy, breast_feeding, breast_feeding_duration, age_menopause, Number_of_adenopathies_expected_on_other_imaging, tumor_grade_biopsy, ER_Allred_biopsy, PR_Allred_biopsy, HER2_IHC_score_biopsy, ER_Allred_biopsy_2nd_lesion, PR_Allred_biopsy_2nd_lesion, HER2_IHC_score_biopsy_2nd_lesion, number_of_suspected_foci, Number_of_adenopathies_expected_on_imaging, ER_Allred_resection_specimen_2nd_lesion, PR_Allred_resection_specimen_2nd_lesion, HER2_IHC_score_resection_specimen_2nd_lesion, E.cadherin, lymphatic_invasion_resection_specimen, GEP_type, adjuvant_chemotherapy_scheme, adjuvant_endocrinetherapy_scheme1, adjuvant_endocrinetherapy_scheme1_duration, adjuvant_endocrinetherapy_scheme2, adjuvant_endocrinetherapy_scheme2_duration, Total_duration_endocrine_treatment, adjuvant_other"
In contrast, ?tbl-skimsf reports the variables that have a complete rate of at least 75%.
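The grouping of variables by complete rate used above can be sketched as follows. This is a minimal Python illustration, not the code that produced the report: the function names, the bucket labels and the toy values are assumptions, and only the column names are borrowed from the lists above. Missing values are represented by `None`.

```python
from typing import Dict, List, Optional


def complete_rate(values: List[Optional[object]]) -> float:
    """Proportion of non-missing entries (None marks a missing value)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def bucket_by_complete_rate(columns: Dict[str, list]) -> Dict[str, List[str]]:
    """Group column names using the thresholds quoted in the text."""
    buckets: Dict[str, List[str]] = {
        "empty": [],                # complete rate of 0
        "below_5_percent": [],      # above 0% but below 5%
        "at_least_75_percent": [],  # 75% or more
        "other": [],                # everything in between
    }
    for name, values in columns.items():
        rate = complete_rate(values)
        if rate == 0:
            buckets["empty"].append(name)
        elif rate < 0.05:
            buckets["below_5_percent"].append(name)
        elif rate >= 0.75:
            buckets["at_least_75_percent"].append(name)
        else:
            buckets["other"].append(name)
    return buckets


# Toy data with made-up values; 40 hypothetical patients per column.
toy_columns = {
    "Ki67_biopsy": [None] * 40,                      # 0 of 40 complete
    "age_menarche": [13] + [None] * 39,              # 1 of 40 complete (2.5%)
    "BMI": [24.8] * 39 + [None],                     # 39 of 40 complete (97.5%)
    "body_surface_area": [1.7] * 20 + [None] * 20,   # 20 of 40 complete (50%)
}
buckets = bucket_by_complete_rate(toy_columns)
print(buckets)
```

Variables that fall between 5% and 75% are collected in a separate `other` bucket here, because the text does not say how they are reported.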
2.1.1 Missing values
Figure 2.1 displays, in decreasing order, the number of missing values for each patient who has at least one missing value. For the sake of simplicity, patients are displayed in separate panels depending on their number of missing values. The same was done for the variables, as displayed in Figure 2.2.
Table 2.3 reports the number of missing values for each patient.
Table 2.4 reports the number of missing values for each variable.
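As an illustration of how such counts can be obtained, the sketch below tallies missing values per patient and per variable and sorts them in decreasing order, keeping only entries with at least one missing value. It is a hedged Python example on made-up records: the function names and the toy patients are assumptions, and `None` stands for a missing value.

```python
from typing import Dict, List, Tuple

ID_KEY = "patient_ID"


def missing_per_patient(rows: List[Dict]) -> List[Tuple[str, int]]:
    """Missing-value count per patient, decreasing, patients with none dropped."""
    counts = {
        row[ID_KEY]: sum(v is None for k, v in row.items() if k != ID_KEY)
        for row in rows
    }
    kept = [(pid, n) for pid, n in counts.items() if n > 0]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


def missing_per_variable(rows: List[Dict]) -> List[Tuple[str, int]]:
    """Missing-value count per variable, decreasing, complete variables dropped."""
    counts: Dict[str, int] = {}
    for row in rows:
        for key, value in row.items():
            if key == ID_KEY:
                continue
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    kept = [(name, n) for name, n in counts.items() if n > 0]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Three hypothetical patients; the values are invented for the example.
toy_rows = [
    {"patient_ID": "P1", "BMI": 24.8, "smoking": "no", "diabetes": None},
    {"patient_ID": "P2", "BMI": None, "smoking": None, "diabetes": None},
    {"patient_ID": "P3", "BMI": 30.1, "smoking": "former", "diabetes": "no"},
]
by_patient = missing_per_patient(toy_rows)
by_variable = missing_per_variable(toy_rows)
print(by_patient)
print(by_variable)
```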
2.1.2 Event history check
We need to establish the expected order of events. The order is the following: Date of birth → Date of diagnosis → Date of start of NAT → Date of end of NAT → Surgery Date → Date of recurrences → Date first progression metastatic → Date second progression metastatic → Date of death / Date of last FU / Date of last FU in own center.
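A check against this order can be sketched as follows. The Python snippet is only an illustration under stated assumptions: apart from `date_first_progression_metastatic`, the field names are hypothetical stand-ins for the database columns, the three end-of-follow-up dates are collapsed into a single hypothetical field, and the toy record is invented. Missing dates are skipped, and equal dates are not counted as violations.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# Expected chronological order; mostly hypothetical field names.
EVENT_ORDER = [
    "date_of_birth",
    "date_of_diagnosis",
    "date_start_NAT",
    "date_end_NAT",
    "date_surgery",
    "date_recurrence",
    "date_first_progression_metastatic",
    "date_second_progression_metastatic",
    "date_end_of_follow_up",  # death or last follow-up, collapsed for the sketch
]


def order_violations(record: Dict[str, Optional[date]]) -> List[Tuple[str, str]]:
    """Return (earlier_event, later_event) pairs whose dates are reversed."""
    present = [
        (name, record[name])
        for name in EVENT_ORDER
        if record.get(name) is not None
    ]
    violations = []
    for i, (early_name, early_date) in enumerate(present):
        for late_name, late_date in present[i + 1:]:
            if late_date < early_date:  # strictly earlier: ties are tolerated
                violations.append((early_name, late_name))
    return violations


# Invented record in which the recurrence precedes the surgery.
toy_record = {
    "date_of_birth": date(1950, 1, 1),
    "date_of_diagnosis": date(2005, 3, 1),
    "date_surgery": date(2005, 4, 15),
    "date_recurrence": date(2005, 4, 1),
    "date_end_of_follow_up": date(2010, 1, 1),
}
found = order_violations(toy_record)
print(found)
```

Comparing every pair of available dates, rather than only neighbouring events, keeps the check meaningful when intermediate dates such as the NAT dates are missing.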
Generally, the Date of last follow-up is equal to the Date of death if death occurred. In some instances, however, the Date of last follow-up is later than the Date of death. Why? Does the Date of last follow-up refer to the day on which it became known that the patient had died on the recorded Date of death?
In the database, the Date of last follow-up in Leuven usually precedes the Date of last follow-up at own center. Sometimes this is not the case, and the Date of last follow-up in Leuven is later than the Date of last follow-up at own center. Is there a particular reason?
In addition, some patients have a Date of last follow-up in Leuven that precedes other events. This can be explained by the fact that these patients were subsequently followed in their own centers.
There is one patient that has a Date of recurrence before the Date of Surgery.
Some patients do not have a date of surgery.
Some patients have a date of diagnosis equal to the date of surgery. How is this possible?
Some patients have multiple recurrences at different times. We will need to choose the most relevant type of recurrence. Is distant recurrence the most relevant?
Patient 84323500 has a date of distant recurrence equal to a date of first progression.
# A tibble: 1 × 5
patient_ID date_distant_recurrence date_first_progression_metastatic
<chr> <date> <date>
1 84323500 2006-11-09 2006-11-09
# ℹ 2 more variables: date_recurrence_contralateral_breast <date>,
# date_locoregional_recurrence <date>
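Ties of this kind, including a date of diagnosis equal to the date of surgery, can be listed with a simple comparison of two date columns. The following Python sketch is an assumption-laden illustration: the function name and the toy patients are invented, while the two column names are taken from the output above.

```python
from datetime import date
from typing import Dict, List


def same_day_patients(records: List[Dict], first: str, second: str) -> List[str]:
    """IDs of patients whose two dates are both present and identical."""
    return [
        rec["patient_ID"]
        for rec in records
        if rec.get(first) is not None and rec.get(first) == rec.get(second)
    ]


# Hypothetical patients with invented dates.
toy_records = [
    {
        "patient_ID": "A",
        "date_distant_recurrence": date(2010, 5, 3),
        "date_first_progression_metastatic": date(2010, 5, 3),
    },
    {
        "patient_ID": "B",
        "date_distant_recurrence": date(2011, 2, 1),
        "date_first_progression_metastatic": date(2012, 7, 9),
    },
    {
        "patient_ID": "C",
        "date_distant_recurrence": None,
        "date_first_progression_metastatic": None,
    },
]
ties = same_day_patients(
    toy_records, "date_distant_recurrence", "date_first_progression_metastatic"
)
print(ties)
```

Patients for whom both dates are missing are deliberately not reported as ties.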
2.1.3 Subset of variables: baseline characteristics
We now limit the analysis to the variables of interest. For the moment I will extract the following variables: method_of_detection, age_at_diagnosis, age_category, BMI, BMI_category, menopausal_status, body_surface_area, smoking, alcohol_abuse, hypertension, hyperlipidemia, diabetes, oral_anticonceptive_use, pregnancy_A, pregnancy_P, pregnancy_G, Age.FFTP, Interval.1st.FTP, hormone_replacement, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, TNM_cM_at_diagnosis, neo_adjuvant_therapy, surgery_type_breast, surgery_type_axilla, TNM_pT_resection_specimen, TNM_pN_resection_specimen, diameter_pathology_resection_specimen, tumor_grade_resection_specimen, resection_margin_resection_specimen, ER_Interpretation, PR_Interpretation, HER2_Interpretation, presence_DCIS_resection_specimen, presence_LCIS_resection_specimen, positive_ALN, radiotherapy, adjuvant_chemotherapy, adjuvant_HER2, adjuvant_endocrinetherapy.
Table 2.5 reports the first description of the variables included in the analysis.
| | Overall (N=1367) |
|---|---|
| method_of_detection | |
| screening | 548 (40.1%) |
| symptoms | 774 (56.6%) |
| Missing | 45 (3.3%) |
| age_at_diagnosis | |
| Mean (SD) | 61.5 (11.8) |
| Median [Min, Max] | 61.0 [32.0, 95.0] |
| age_category | |
| < 40 | 23 (1.7%) |
| ≥ 80 | 104 (7.6%) |
| 40 - 49 | 210 (15.4%) |
| 50 - 59 | 387 (28.3%) |
| 60 - 69 | 397 (29.0%) |
| 70 - 79 | 244 (17.8%) |
| 80 - 89 | 2 (0.1%) |
| BMI | |
| Mean (SD) | 25.6 (4.85) |
| Median [Min, Max] | 24.8 [14.9, 47.7] |
| Missing | 17 (1.2%) |
| BMI_category | |
| < 25 | 705 (51.6%) |
| > 30 | 222 (16.2%) |
| 25 - 30 | 423 (30.9%) |
| Missing | 17 (1.2%) |
| menopausal_status | |
| Postmenopausal | 981 (71.8%) |
| pre- and perimenopausal | 344 (25.2%) |
| Missing | 42 (3.1%) |
| body_surface_area | |
| Mean (SD) | 1.74 (0.164) |
| Median [Min, Max] | 1.72 [1.17, 2.47] |
| Missing | 158 (11.6%) |
| smoking | |
| active | 188 (13.8%) |
| former | 266 (19.5%) |
| no | 912 (66.7%) |
| Missing | 1 (0.1%) |
| alcohol_abuse | |
| no | 1159 (84.8%) |
| yes | 205 (15.0%) |
| Missing | 3 (0.2%) |
| hypertension | |
| no | 844 (61.7%) |
| yes | 522 (38.2%) |
| Missing | 1 (0.1%) |
| hyperlipidemia | |
| no | 1058 (77.4%) |
| yes | 309 (22.6%) |
| diabetes | |
| MODY | 1 (0.1%) |
| no | 1282 (93.8%) |
| type 1 | 3 (0.2%) |
| type 2 | 80 (5.9%) |
| Missing | 1 (0.1%) |
| oral_anticonceptive_use | |
| active | 181 (13.2%) |
| former | 670 (49.0%) |
| no | 427 (31.2%) |
| Missing | 89 (6.5%) |
| pregnancy_A | |
| 0 | 1053 (77.0%) |
| 1 | 204 (14.9%) |
| 10 | 1 (0.1%) |
| 2 | 64 (4.7%) |
| 3 | 18 (1.3%) |
| 4 | 8 (0.6%) |
| 5 | 2 (0.1%) |
| 6 | 1 (0.1%) |
| Missing | 16 (1.2%) |
| pregnancy_P | |
| 0 | 188 (13.8%) |
| 1 | 293 (21.4%) |
| 10 | 2 (0.1%) |
| 12 | 1 (0.1%) |
| 2 | 530 (38.8%) |
| 3 | 231 (16.9%) |
| 4 | 83 (6.1%) |
| 5 | 25 (1.8%) |
| 6 | 8 (0.6%) |
| 9 | 3 (0.2%) |
| Missing | 3 (0.2%) |
| pregnancy_G | |
| 0 | 168 (12.3%) |
| 1 | 266 (19.5%) |
| 11 | 2 (0.1%) |
| 2 | 440 (32.2%) |
| 3 | 284 (20.8%) |
| 4 | 112 (8.2%) |
| 5 | 54 (4.0%) |
| 6 | 17 (1.2%) |
| 7 | 10 (0.7%) |
| 8 | 1 (0.1%) |
| 9 | 4 (0.3%) |
| Missing | 9 (0.7%) |
| Age.FFTP | |
| 16 | 2 (0.1%) |
| 17 | 7 (0.5%) |
| 18 | 20 (1.5%) |
| 19 | 32 (2.3%) |
| 20 | 51 (3.7%) |
| 21 | 61 (4.5%) |
| 22 | 79 (5.8%) |
| 23 | 76 (5.6%) |
| 24 | 120 (8.8%) |
| 25 | 121 (8.9%) |
| 26 | 85 (6.2%) |
| 27 | 104 (7.6%) |
| 28 | 65 (4.8%) |
| 29 | 61 (4.5%) |
| 30 | 51 (3.7%) |
| 31 | 160 (11.7%) |
| 32 | 6 (0.4%) |
| 33 | 6 (0.4%) |
| 34 | 6 (0.4%) |
| 35 | 2 (0.1%) |
| 36 | 6 (0.4%) |
| 37 | 2 (0.1%) |
| 38 | 2 (0.1%) |
| 45 | 1 (0.1%) |
| nulliparous | 188 (13.8%) |
| Missing | 53 (3.9%) |
| Interval.1st.FTP | |
| Mean (SD) | 35.5 (13.0) |
| Median [Min, Max] | 35.0 [5.00, 69.0] |
| Missing | 241 (17.6%) |
| hormone_replacement | |
| active | 205 (15.0%) |
| former | 170 (12.4%) |
| no | 936 (68.5%) |
| Missing | 56 (4.1%) |
| familial_history_breast_ovary | |
| no | 825 (60.4%) |
| yes | 533 (39.0%) |
| Missing | 9 (0.7%) |
| visible_on_mammogram | |
| no | 122 (8.9%) |
| yes | 1220 (89.2%) |
| Missing | 25 (1.8%) |
| TNM_cT_at_diagnosis | |
| T1a | 13 (1.0%) |
| T1b | 135 (9.9%) |
| T1c | 352 (25.7%) |
| T1mi | 1 (0.1%) |
| T2 | 587 (42.9%) |
| T3 | 188 (13.8%) |
| T4a | 3 (0.2%) |
| T4b | 18 (1.3%) |
| T4c | 2 (0.1%) |
| T4d | 32 (2.3%) |
| Tis | 9 (0.7%) |
| Missing | 27 (2.0%) |
| TNM_cN_at_diagnosis | |
| n0 | 1 (0.1%) |
| N0 | 1117 (81.7%) |
| N1 | 195 (14.3%) |
| N2 | 11 (0.8%) |
| N3 | 27 (2.0%) |
| Missing | 16 (1.2%) |
| TNM_cM_at_diagnosis | |
| M0 | 1367 (100%) |
| neo_adjuvant_therapy | |
| no | 1256 (91.9%) |
| yes | 111 (8.1%) |
| surgery_type_breast | |
| Mastectomy | 813 (59.5%) |
| Tumorectomy | 542 (39.6%) |
| Missing | 12 (0.9%) |
| surgery_type_axilla | |
| ALN | 523 (38.3%) |
| SLN | 679 (49.7%) |
| SLN + ALN | 148 (10.8%) |
| Missing | 17 (1.2%) |
| TNM_pT_resection_specimen | |
| T0 | 6 (0.4%) |
| T1a | 23 (1.7%) |
| T1b | 98 (7.2%) |
| T1c | 327 (23.9%) |
| T1mi | 1 (0.1%) |
| T2 | 571 (41.8%) |
| T3 | 318 (23.3%) |
| T4b | 7 (0.5%) |
| Tis | 4 (0.3%) |
| Missing | 12 (0.9%) |
| TNM_pN_resection_specimen | |
| N0(i-) | 727 (53.2%) |
| N0(i+) | 80 (5.9%) |
| N1a | 271 (19.8%) |
| N1mi | 77 (5.6%) |
| N2a | 88 (6.4%) |
| N3a | 102 (7.5%) |
| N3b | 1 (0.1%) |
| pN1a | 1 (0.1%) |
| Missing | 20 (1.5%) |
| diameter_pathology_resection_specimen | |
| Mean (SD) | 38.5 (29.7) |
| Median [Min, Max] | 30.0 [0, 220] |
| Missing | 15 (1.1%) |
| tumor_grade_resection_specimen | |
| 1 | 16 (1.2%) |
| 2 | 1212 (88.7%) |
| 3 | 134 (9.8%) |
| Missing | 5 (0.4%) |
| resection_margin_resection_specimen | |
| dubious (< 1 mm) | 145 (10.6%) |
| negative | 1133 (82.9%) |
| positive | 74 (5.4%) |
| Missing | 15 (1.1%) |
| ER_Interpretation | |
| negative | 25 (1.8%) |
| positive | 1183 (86.5%) |
| Positive | 157 (11.5%) |
| Missing | 2 (0.1%) |
| PR_Interpretation | |
| negative | 147 (10.8%) |
| Negative | 22 (1.6%) |
| positive | 1025 (75.0%) |
| Positive | 130 (9.5%) |
| Missing | 43 (3.1%) |
| HER2_Interpretation | |
| negative | 1150 (84.1%) |
| Negative | 145 (10.6%) |
| positive | 51 (3.7%) |
| Positive | 8 (0.6%) |
| Missing | 13 (1.0%) |
| presence_DCIS_resection_specimen | |
| no | 1199 (87.7%) |
| yes | 154 (11.3%) |
| Missing | 14 (1.0%) |
| presence_LCIS_resection_specimen | |
| no | 227 (16.6%) |
| yes, classical LCIS | 771 (56.4%) |
| yes, non classical LCIS | 355 (26.0%) |
| Missing | 14 (1.0%) |
| positive_ALN | |
| Mean (SD) | 2.09 (5.01) |
| Median [Min, Max] | 0 [0, 42.0] |
| Missing | 19 (1.4%) |
| radiotherapy | |
| no | 238 (17.4%) |
| yes | 1117 (81.7%) |
| Missing | 12 (0.9%) |
| adjuvant_chemotherapy | |
| no | 1005 (73.5%) |
| yes | 350 (25.6%) |
| Missing | 12 (0.9%) |
| adjuvant_HER2 | |
| no | 1316 (96.3%) |
| yes | 39 (2.9%) |
| Missing | 12 (0.9%) |
| adjuvant_endocrinetherapy | |
| no | 46 (3.4%) |
| yes | 1309 (95.8%) |
| Missing | 12 (0.9%) |
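Table 2.5 lists some levels that differ only in capitalisation, for example `positive` and `Positive` in PR_Interpretation, or `n0` and `N0` in TNM_cN_at_diagnosis. A possible way to flag such pairs before recoding is sketched below in Python. The function name is an assumption and the approach is only one option; it does not catch other inconsistencies such as `pN1a` versus `N1a`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def case_variants(levels: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels that become identical after trimming and case-folding."""
    groups = defaultdict(set)
    for level in levels:
        groups[level.strip().casefold()].add(level)
    # Keep only the groups that contain more than one spelling.
    return {key: sorted(spellings) for key, spellings in groups.items()
            if len(spellings) > 1}


# Levels copied from Table 2.5.
pr_levels = ["negative", "Negative", "positive", "Positive"]
cn_levels = ["n0", "N0", "N1", "N2", "N3"]

pr_variants = case_variants(pr_levels)
cn_variants = case_variants(cn_levels)
print(pr_variants)
print(cn_variants)
```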
2.1.4 Number of events
For the moment, I am excluding the patients whose date of surgery is equal to the date of diagnosis, as well as all other patients whose dates are uncertain.
Figure 2.3 describes the event history of the patients. To read it, pick a starting state on the ‘from’ axis and a destination state on the ‘to’ axis; the corresponding cell gives the absolute frequency of that transition.
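The counts behind a from/to grid of this kind can be sketched as follows. This Python example is only an illustration: the state names and the three patient paths are invented, and it assumes that each patient's events have already been sorted chronologically.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def transition_counts(paths: Iterable[List[str]]) -> Counter:
    """Count each (from_state, to_state) pair over consecutive states."""
    counts: Counter = Counter()
    for path in paths:
        for origin, destination in zip(path, path[1:]):
            counts[(origin, destination)] += 1
    return counts


# Hypothetical, chronologically ordered state sequences for three patients.
toy_paths = [
    ["surgery", "recurrence", "death"],
    ["surgery", "last_follow_up"],
    ["surgery", "recurrence", "last_follow_up"],
]
transitions = transition_counts(toy_paths)
for (origin, destination), n in sorted(transitions.items()):
    print(f"{origin} -> {destination}: {n}")
```

Each key of `transitions` corresponds to one cell of the grid, with the first element read on the ‘from’ axis and the second on the ‘to’ axis.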
Check the patients with a date of lost to follow-up before date of death
Look at the distribution of the variables through the years