2  DataCheckM0

Author

Giacomo Biganzoli

2.1 Data-check

I loaded the data from the directory L:\GBW-0080_BC_Lab\Data\FAT-ILC\Giacomo.

Table 2.1 reports the number of unknown values for each variable.

Table 2.1

Table 2.2 summarises the available information. The columns skim_type, skim_variable, n_missing and complete_rate give, for each variable, its type, its name, the number of missing values and the proportion of non-missing values. Date.min, Date.max, Date.median and Date.n_unique give the minimum, maximum, median and number of unique values of the date variables. factor.n_unique and factor.top_counts give the number of unique values and the most frequent values of the categorical variables. numeric.p0, numeric.p25, numeric.p50, numeric.p75 and numeric.p100 give the minimum, first quartile, median, third quartile and maximum of the numeric variables.

Table 2.2
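
To make the summary fields described above concrete, here is a minimal Python sketch that computes n_missing, complete_rate and the five percentiles for one numeric column. The function name summarise_numeric and the BMI values are hypothetical, and the quantile method is an assumption; this is an illustration, not the code that produced Table 2.2.

```python
import statistics

def summarise_numeric(values):
    """Return skim-style summary fields for one numeric column (None = missing)."""
    observed = [v for v in values if v is not None]
    summary = {
        "n_missing": len(values) - len(observed),
        "complete_rate": len(observed) / len(values) if values else 0.0,
    }
    if len(observed) >= 2:
        # Assumption: linear interpolation between observed points ("inclusive")
        p25, p50, p75 = statistics.quantiles(observed, n=4, method="inclusive")
        summary.update(p0=min(observed), p25=p25, p50=p50, p75=p75, p100=max(observed))
    return summary

# Hypothetical BMI values, not taken from the database
bmi = [22.0, 24.0, None, 26.0, 30.0]
bmi_summary = summarise_numeric(bmi)
print(bmi_summary)
```

A complete_rate of 0.8 here means that four of the five entries are observed.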

As we will see later, several variables in the database have a very low complete rate. The following variables have a complete rate of 0, that is, they are missing for every patient.

[1] "ER_H_score_biopsy, PR_H_score_biopsy, HER2_FISH_biopsy, HER2_ratio_biopsy, Ki67_biopsy, ER_H_score_biopsy_2nd_lesion, PR_H_score_biopsy_2nd_lesion, HER2_FISH_biopsy_2nd_lesion, HER2_ratio_biopsy_2nd_lesion, Ki67_biopsy_2nd_lesion, neo_adjuvant_chemotherapy_scheme, neo_adjuvant_chemotherapy_BSA_used, neo_adjuvant_chemotherapy_BSA_capping, neo_adjuvant_chemotherapy_completion, neo_adjuvant_HER2_therapy_scheme, neo_adjuvant_endocrinetherapy_scheme, neo_adjuvant_endocrinetherapy_duration, neo_adjuvant_other, residual_tumorbed, lobular_subtype, ER_H_score_resection_specimen_2nd_lesion, PR_H_score_resection_specimen_2nd_lesion, HER2_FISH_resection_specimen_2nd_lesion, HER2_ratio_resection_specimen_2nd_lesion, Ki67_resection_specimen_2nd_lesion, Antibody_E.cadherin, B.catenin, p120_catenin, GEP_outcome, adjuvant_HER2_scheme, chemotherapy_1st_line_metastatic, HER2_1st_line_metastatic, endocrinetherapy_1st_line_metastatic, treatment_1st_line_other_metastatic, treatment_reduction_1st_line_metastatic, clinical_response_1st_line_metastatic, chemotherapy_2nd_line_metastatic, HER2_2nd_line_metastatic, endocrinetherapy_2nd_line_metastatic, treatment_2nd_line_other_metastatic, treatment_reduction_2nd_line_metastatic, clinical_response_2nd_line_metastatic, second_progression_distant_disease_metastatic, radiotherapy_all_metastatic, chemotherapy_number_lines_all_metastatic, HER2_number_lines_all_metastatic, endocrinetherapy_number_lines_all_metastatic, treatment_other_all_metastatic, comments"

The following variables have a complete rate above 0% but below 5%.

[1] "comorbidities, age_menarche, oral_anticonceptive_duration, fertility_treatment, age_last_pregnancy, breast_feeding, breast_feeding_duration, age_menopause, Number_of_adenopathies_expected_on_other_imaging, tumor_grade_biopsy, ER_Allred_biopsy, PR_Allred_biopsy, HER2_IHC_score_biopsy, ER_Allred_biopsy_2nd_lesion, PR_Allred_biopsy_2nd_lesion, HER2_IHC_score_biopsy_2nd_lesion, number_of_suspected_foci, Number_of_adenopathies_expected_on_imaging, ER_Allred_resection_specimen_2nd_lesion, PR_Allred_resection_specimen_2nd_lesion, HER2_IHC_score_resection_specimen_2nd_lesion, E.cadherin, lymphatic_invasion_resection_specimen, GEP_type, adjuvant_chemotherapy_scheme, adjuvant_endocrinetherapy_scheme1, adjuvant_endocrinetherapy_scheme1_duration, adjuvant_endocrinetherapy_scheme2, adjuvant_endocrinetherapy_scheme2_duration, Total_duration_endocrine_treatment, adjuvant_other"

The variables with a complete rate of at least 75% are instead reported in ?tbl-skimsf.
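
The three completeness bands used above (0, above 0% but below 5%, at least 75%) can be sketched as follows. This is a minimal illustration in Python; the function names and the toy table are hypothetical, and the values are invented rather than taken from the database.

```python
def complete_rate(column):
    """Proportion of non-missing entries (None marks a missing value)."""
    return sum(v is not None for v in column) / len(column)

def bucket_by_completeness(table):
    """Group column names into the completeness bands used in the text."""
    bands = {"zero": [], "below_5pct": [], "at_least_75pct": [], "other": []}
    for name, column in table.items():
        rate = complete_rate(column)
        if rate == 0:
            bands["zero"].append(name)
        elif rate < 0.05:
            bands["below_5pct"].append(name)
        elif rate >= 0.75:
            bands["at_least_75pct"].append(name)
        else:
            bands["other"].append(name)
    return bands

# Hypothetical 40-patient table; the values are invented
toy = {
    "comments": [None] * 40,
    "age_menarche": [12] + [None] * 39,
    "smoking": ["no"] * 39 + [None],
    "body_surface_area": [1.7] * 20 + [None] * 20,
}
bands = bucket_by_completeness(toy)
print(bands)
```

Variables between 5% and 75% fall into the "other" band, which the text does not list separately.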

2.1.1 Missing values

Figure 2.1 displays, in decreasing order, the number of missing values for each patient with at least one missing value. For the sake of readability, patients are displayed separately according to their number of missing values. The same was done for the variables, as displayed in Figure 2.2.

Figure 2.1

Table 2.3 reports the number of missing values for each patient.

Table 2.3

Figure 2.2

Table 2.4 reports the number of missing values for each variable.

Table 2.4
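
The per-patient and per-variable counts of missing values can be obtained in a single pass over the records. The sketch below is a hedged Python illustration: missing_counts and the three toy records are hypothetical, and only the column name patient_ID is taken from the database.

```python
from collections import Counter

def missing_counts(records, id_key="patient_ID"):
    """Count missing values (None) per patient and per variable.

    Both results are sorted in decreasing order and contain only patients or
    variables with at least one missing value.
    """
    per_patient = Counter()
    per_variable = Counter()
    for record in records:
        for name, value in record.items():
            if name != id_key and value is None:
                per_patient[record[id_key]] += 1
                per_variable[name] += 1
    return per_patient.most_common(), per_variable.most_common()

# Hypothetical records; identifiers and values are invented
records = [
    {"patient_ID": "A", "BMI": None, "smoking": None},
    {"patient_ID": "B", "BMI": 24.8, "smoking": None},
    {"patient_ID": "C", "BMI": 25.6, "smoking": "no"},
]
by_patient, by_variable = missing_counts(records)
print(by_patient)
print(by_variable)
```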

2.1.2 Event history check

We need to define the expected order of events. The order is the following: Date of birth -> Date of diagnosis -> Date of start of NAT -> Date of end of NAT -> Date of surgery -> Date of recurrences -> Date of first progression (metastatic) -> Date of second progression (metastatic) -> Date of death / Date of last FU / Date of last FU in own center.
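
A check of this ordering can be sketched as below. This is a minimal Python illustration under stated assumptions: apart from date_first_progression_metastatic, the column names are hypothetical, and the three possible end dates (death, last FU, last FU in own center) are collapsed into a single hypothetical date_last_event.

```python
from datetime import date

# Assumed column names, listed in the expected chronological order
EVENT_ORDER = [
    "date_of_birth", "date_of_diagnosis", "date_start_NAT", "date_end_NAT",
    "date_of_surgery", "date_recurrence",
    "date_first_progression_metastatic", "date_second_progression_metastatic",
    "date_last_event",
]

def order_violations(record, order=EVENT_ORDER):
    """Return (earlier_event, later_event) pairs whose dates are out of order.

    Missing dates (None or absent) are skipped, so each event is compared with
    the previous non-missing one. Equal dates are not flagged.
    """
    violations = []
    previous = None
    for name in order:
        current = record.get(name)
        if current is None:
            continue
        if previous is not None and current < previous[1]:
            violations.append((previous[0], name))
        previous = (name, current)
    return violations

# Hypothetical patient with a recurrence dated before surgery
patient = {
    "date_of_birth": date(1950, 1, 1),
    "date_of_diagnosis": date(2010, 4, 1),
    "date_of_surgery": date(2010, 5, 1),
    "date_recurrence": date(2010, 3, 1),
    "date_last_event": date(2015, 1, 1),
}
violations = order_violations(patient)
print(violations)
```

Because equal dates are not flagged, cases such as a diagnosis and a surgery on the same day would need a separate check.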

In general, the Date of last follow-up equals the Date of death when death occurred. In some instances, however, the Date of last follow-up is later than the Date of death. Why? Does the Date of last follow-up then refer to the day on which it became known that the patient had died?

In the database, the Date of last follow-up in Leuven usually precedes the Date of last follow-up at own center. Sometimes this is not the case, and the Date of last follow-up in Leuven is later than the Date of last follow-up at own center. Is there a particular reason?

Some patients also have a Date of last follow-up in Leuven that precedes other events. This can be explained by the fact that these patients were subsequently followed in their own centers.

One patient has a Date of recurrence before the Date of surgery.

Some patients do not have a date of surgery.

Some patients have a date of diagnosis equal to the date of surgery. How is this possible?

We have patients with multiple recurrences at different times. We will need to choose the most relevant type of recurrence. Is distant recurrence the most relevant?

Patient 84323500 has a date of distant recurrence equal to a date of first progression.

# A tibble: 1 × 5
  patient_ID date_distant_recurrence date_first_progression_metastatic
  <chr>      <date>                  <date>                           
1 84323500   2006-11-09              2006-11-09                       
# ℹ 2 more variables: date_recurrence_contralateral_breast <date>,
#   date_locoregional_recurrence <date>
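
While the choice of the most relevant recurrence type is still open, one way to inspect patients with several recurrence dates is to list the earliest date together with every recurrence type that shares it. The Python sketch below is only an illustration: first_recurrence is a hypothetical helper, the three column names are those shown in the output above, and the dates in the example are invented.

```python
from datetime import date

RECURRENCE_COLUMNS = [
    "date_locoregional_recurrence",
    "date_recurrence_contralateral_breast",
    "date_distant_recurrence",
]

def first_recurrence(record):
    """Return (earliest_date, [columns sharing that date]), or None if no recurrence."""
    dated = {c: record[c] for c in RECURRENCE_COLUMNS if record.get(c) is not None}
    if not dated:
        return None
    earliest = min(dated.values())
    return earliest, [c for c, d in dated.items() if d == earliest]

# Hypothetical patient; dates are invented
patient = {
    "date_locoregional_recurrence": date(2012, 6, 1),
    "date_recurrence_contralateral_breast": None,
    "date_distant_recurrence": date(2014, 2, 15),
}
earliest = first_recurrence(patient)
print(earliest)
```

Returning all types that share the earliest date keeps ties visible instead of resolving them silently.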

2.1.3 Subset of variables: baseline characteristics

We now limit the analysis to the variables of interest. For the moment I will extract the following variables: method_of_detection, age_at_diagnosis, age_category, BMI, BMI_category, menopausal_status, body_surface_area, smoking, alcohol_abuse, hypertension, hyperlipidemia, diabetes, oral_anticonceptive_use, pregnancy_A, pregnancy_P, pregnancy_G, Age.FFTP, Interval.1st.FTP, hormone_replacement, familial_history_breast_ovary, visible_on_mammogram, TNM_cT_at_diagnosis, TNM_cN_at_diagnosis, TNM_cM_at_diagnosis, neo_adjuvant_therapy, surgery_type_breast, surgery_type_axilla, TNM_pT_resection_specimen, TNM_pN_resection_specimen, diameter_pathology_resection_specimen, tumor_grade_resection_specimen, resection_margin_resection_specimen, ER_Interpretation, PR_Interpretation, HER2_Interpretation, presence_DCIS_resection_specimen, presence_LCIS_resection_specimen, positive_ALN, radiotherapy, adjuvant_chemotherapy, adjuvant_HER2, adjuvant_endocrinetherapy.

Table 2.5 gives a first description of the variables included in the analysis.

Table 2.5

Overall (N=1367)

method_of_detection
  screening: 548 (40.1%)
  symptoms: 774 (56.6%)
  Missing: 45 (3.3%)
age_at_diagnosis
  Mean (SD): 61.5 (11.8)
  Median [Min, Max]: 61.0 [32.0, 95.0]
age_category
  < 40: 23 (1.7%)
  ≥ 80: 104 (7.6%)
  40 - 49: 210 (15.4%)
  50 - 59: 387 (28.3%)
  60 - 69: 397 (29.0%)
  70 - 79: 244 (17.8%)
  80 - 89: 2 (0.1%)
BMI
  Mean (SD): 25.6 (4.85)
  Median [Min, Max]: 24.8 [14.9, 47.7]
  Missing: 17 (1.2%)
BMI_category
  < 25: 705 (51.6%)
  > 30: 222 (16.2%)
  25 - 30: 423 (30.9%)
  Missing: 17 (1.2%)
menopausal_status
  Postmenopausal: 981 (71.8%)
  pre- and perimenopausal: 344 (25.2%)
  Missing: 42 (3.1%)
body_surface_area
  Mean (SD): 1.74 (0.164)
  Median [Min, Max]: 1.72 [1.17, 2.47]
  Missing: 158 (11.6%)
smoking
  active: 188 (13.8%)
  former: 266 (19.5%)
  no: 912 (66.7%)
  Missing: 1 (0.1%)
alcohol_abuse
  no: 1159 (84.8%)
  yes: 205 (15.0%)
  Missing: 3 (0.2%)
hypertension
  no: 844 (61.7%)
  yes: 522 (38.2%)
  Missing: 1 (0.1%)
hyperlipidemia
  no: 1058 (77.4%)
  yes: 309 (22.6%)
diabetes
  MODY: 1 (0.1%)
  no: 1282 (93.8%)
  type 1: 3 (0.2%)
  type 2: 80 (5.9%)
  Missing: 1 (0.1%)
oral_anticonceptive_use
  active: 181 (13.2%)
  former: 670 (49.0%)
  no: 427 (31.2%)
  Missing: 89 (6.5%)
pregnancy_A
  0: 1053 (77.0%)
  1: 204 (14.9%)
  10: 1 (0.1%)
  2: 64 (4.7%)
  3: 18 (1.3%)
  4: 8 (0.6%)
  5: 2 (0.1%)
  6: 1 (0.1%)
  Missing: 16 (1.2%)
pregnancy_P
  0: 188 (13.8%)
  1: 293 (21.4%)
  10: 2 (0.1%)
  12: 1 (0.1%)
  2: 530 (38.8%)
  3: 231 (16.9%)
  4: 83 (6.1%)
  5: 25 (1.8%)
  6: 8 (0.6%)
  9: 3 (0.2%)
  Missing: 3 (0.2%)
pregnancy_G
  0: 168 (12.3%)
  1: 266 (19.5%)
  11: 2 (0.1%)
  2: 440 (32.2%)
  3: 284 (20.8%)
  4: 112 (8.2%)
  5: 54 (4.0%)
  6: 17 (1.2%)
  7: 10 (0.7%)
  8: 1 (0.1%)
  9: 4 (0.3%)
  Missing: 9 (0.7%)
Age.FFTP
  16: 2 (0.1%)
  17: 7 (0.5%)
  18: 20 (1.5%)
  19: 32 (2.3%)
  20: 51 (3.7%)
  21: 61 (4.5%)
  22: 79 (5.8%)
  23: 76 (5.6%)
  24: 120 (8.8%)
  25: 121 (8.9%)
  26: 85 (6.2%)
  27: 104 (7.6%)
  28: 65 (4.8%)
  29: 61 (4.5%)
  30: 51 (3.7%)
  31: 160 (11.7%)
  32: 6 (0.4%)
  33: 6 (0.4%)
  34: 6 (0.4%)
  35: 2 (0.1%)
  36: 6 (0.4%)
  37: 2 (0.1%)
  38: 2 (0.1%)
  45: 1 (0.1%)
  nulliparous: 188 (13.8%)
  Missing: 53 (3.9%)
Interval.1st.FTP
  Mean (SD): 35.5 (13.0)
  Median [Min, Max]: 35.0 [5.00, 69.0]
  Missing: 241 (17.6%)
hormone_replacement
  active: 205 (15.0%)
  former: 170 (12.4%)
  no: 936 (68.5%)
  Missing: 56 (4.1%)
familial_history_breast_ovary
  no: 825 (60.4%)
  yes: 533 (39.0%)
  Missing: 9 (0.7%)
visible_on_mammogram
  no: 122 (8.9%)
  yes: 1220 (89.2%)
  Missing: 25 (1.8%)
TNM_cT_at_diagnosis
  T1a: 13 (1.0%)
  T1b: 135 (9.9%)
  T1c: 352 (25.7%)
  T1mi: 1 (0.1%)
  T2: 587 (42.9%)
  T3: 188 (13.8%)
  T4a: 3 (0.2%)
  T4b: 18 (1.3%)
  T4c: 2 (0.1%)
  T4d: 32 (2.3%)
  Tis: 9 (0.7%)
  Missing: 27 (2.0%)
TNM_cN_at_diagnosis
  n0: 1 (0.1%)
  N0: 1117 (81.7%)
  N1: 195 (14.3%)
  N2: 11 (0.8%)
  N3: 27 (2.0%)
  Missing: 16 (1.2%)
TNM_cM_at_diagnosis
  M0: 1367 (100%)
neo_adjuvant_therapy
  no: 1256 (91.9%)
  yes: 111 (8.1%)
surgery_type_breast
  Mastectomy: 813 (59.5%)
  Tumorectomy: 542 (39.6%)
  Missing: 12 (0.9%)
surgery_type_axilla
  ALN: 523 (38.3%)
  SLN: 679 (49.7%)
  SLN + ALN: 148 (10.8%)
  Missing: 17 (1.2%)
TNM_pT_resection_specimen
  T0: 6 (0.4%)
  T1a: 23 (1.7%)
  T1b: 98 (7.2%)
  T1c: 327 (23.9%)
  T1mi: 1 (0.1%)
  T2: 571 (41.8%)
  T3: 318 (23.3%)
  T4b: 7 (0.5%)
  Tis: 4 (0.3%)
  Missing: 12 (0.9%)
TNM_pN_resection_specimen
  N0(i-): 727 (53.2%)
  N0(i+): 80 (5.9%)
  N1a: 271 (19.8%)
  N1mi: 77 (5.6%)
  N2a: 88 (6.4%)
  N3a: 102 (7.5%)
  N3b: 1 (0.1%)
  pN1a: 1 (0.1%)
  Missing: 20 (1.5%)
diameter_pathology_resection_specimen
  Mean (SD): 38.5 (29.7)
  Median [Min, Max]: 30.0 [0, 220]
  Missing: 15 (1.1%)
tumor_grade_resection_specimen
  1: 16 (1.2%)
  2: 1212 (88.7%)
  3: 134 (9.8%)
  Missing: 5 (0.4%)
resection_margin_resection_specimen
  dubious (< 1 mm): 145 (10.6%)
  negative: 1133 (82.9%)
  positive: 74 (5.4%)
  Missing: 15 (1.1%)
ER_Interpretation
  negative: 25 (1.8%)
  positive: 1183 (86.5%)
  Positive: 157 (11.5%)
  Missing: 2 (0.1%)
PR_Interpretation
  negative: 147 (10.8%)
  Negative: 22 (1.6%)
  positive: 1025 (75.0%)
  Positive: 130 (9.5%)
  Missing: 43 (3.1%)
HER2_Interpretation
  negative: 1150 (84.1%)
  Negative: 145 (10.6%)
  positive: 51 (3.7%)
  Positive: 8 (0.6%)
  Missing: 13 (1.0%)
presence_DCIS_resection_specimen
  no: 1199 (87.7%)
  yes: 154 (11.3%)
  Missing: 14 (1.0%)
presence_LCIS_resection_specimen
  no: 227 (16.6%)
  yes, classical LCIS: 771 (56.4%)
  yes, non classical LCIS: 355 (26.0%)
  Missing: 14 (1.0%)
positive_ALN
  Mean (SD): 2.09 (5.01)
  Median [Min, Max]: 0 [0, 42.0]
  Missing: 19 (1.4%)
radiotherapy
  no: 238 (17.4%)
  yes: 1117 (81.7%)
  Missing: 12 (0.9%)
adjuvant_chemotherapy
  no: 1005 (73.5%)
  yes: 350 (25.6%)
  Missing: 12 (0.9%)
adjuvant_HER2
  no: 1316 (96.3%)
  yes: 39 (2.9%)
  Missing: 12 (0.9%)
adjuvant_endocrinetherapy
  no: 46 (3.4%)
  yes: 1309 (95.8%)
  Missing: 12 (0.9%)
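
Table 2.5 lists some categories twice because of differences in letter case, for example "positive" and "Positive" in ER_Interpretation. A possible way to merge such variants before tabulating is sketched below in Python. The helper harmonise_case is hypothetical, and the decision to lower-case the labels is an assumption; the counts are copied from the ER_Interpretation rows of the table.

```python
from collections import Counter

def harmonise_case(labels):
    """Lower-case and strip labels so that case variants are counted together."""
    return Counter(label.strip().lower() for label in labels if label is not None)

# Counts copied from the ER_Interpretation rows of Table 2.5 (None = missing)
er = ["negative"] * 25 + ["positive"] * 1183 + ["Positive"] * 157 + [None] * 2
er_counts = harmonise_case(er)
print(er_counts)
```

Lower-casing would not be appropriate for every variable; TNM stages such as N0 are conventionally written in upper case and would need their own rule.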

2.1.4 Number of events

For the moment, I am excluding patients whose date of surgery equals the date of diagnosis, as well as all other patients whose dates are uncertain.

Figure 2.3 describes the event history of the patients. To read it, pick a starting state on the ‘from’ axis and a destination state on the ‘to’ axis; the corresponding cell gives the absolute frequency of that transition.

Figure 2.3
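
The from/to frequencies shown in such a figure amount to counting consecutive pairs of states in each patient's history. The Python sketch below illustrates the idea; transition_counts and the state names are hypothetical, and the histories are invented rather than derived from the database.

```python
from collections import Counter

def transition_counts(histories):
    """Count consecutive (from_state, to_state) pairs over all patient histories."""
    counts = Counter()
    for states in histories:
        for origin, destination in zip(states, states[1:]):
            counts[(origin, destination)] += 1
    return counts

# Hypothetical state sequences; state names are illustrative only
histories = [
    ["surgery", "distant recurrence", "death"],
    ["surgery", "death"],
    ["surgery", "distant recurrence"],
]
transitions = transition_counts(histories)
for (origin, destination), n in sorted(transitions.items()):
    print(f"{origin} -> {destination}: {n}")
```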

Check the patients with a date of loss to follow-up before the date of death.

Look at the distribution of the variables over the years.