Machine learning-selected variables associated with CD4 T cell recovery under antiretroviral therapy in very advanced HIV infection

A considerable portion of people living with HIV are on antiretroviral therapy (ART), and many of them received a late diagnosis. Patients starting ART at a very advanced stage of HIV disease attain a low recovery of CD4 T cells. Factors associated with poor recovery are incompletely described. This study aimed to find variables associated with CD4 T cell recovery in late-presenting HIV patients. We studied a cohort of HIV+ patients initiating ART with very low basal CD4 T cell counts. We defined immune recovery as the net increase in circulating CD4 T cell counts after one year on ART. We analyzed diverse routine laboratory determinations at different times using Least Absolute Shrinkage and Selection Operator (LASSO), adaptive LASSO and Conditional Inference Random Forest. CD4/CD8 ratio, % CD4 T cells and CD8 T cell counts at different times were the main recovery correlates, validated by all approaches. Unexpectedly, basal hematocrit was a consistent predictor. Additionally, week 24 creatinine had a high LASSO coefficient, and alkaline phosphatase had a high conditional inference random forest importance coefficient, although neither was verified by other tests. CD4 T cell proportions are associated with CD4 T cell recovery, independently of cell counts. Inflammation-related variables could also affect reconstitution. These accessible variables may reflect underlying mechanisms and could improve the follow-up of patients starting ART with advanced HIV infection.


Introduction
Currently, about 21 million HIV+ patients are accessing antiretroviral therapy (ART) [1] and face important health challenges. Even if they reach an undetectable blood viral load (VL) with ART, some of them attain a poor recovery of CD4 T cells in blood [2,3]. These patients remain mostly free of opportunistic infections, but they have an increased incidence of non-AIDS-defining illnesses [4]. CD4 T cell count at ART initiation is currently the most commonly used predictor of the magnitude of CD4 T cell recovery under ART [5,6]. It is thus likely that a person with HIV initiating ART at a late phase of infection will recover CD4 T cells poorly. This problem is aggravated by the high rate of late diagnosis in different countries and regions [7][8][9][10].
With the aim of finding variables, other than basal CD4 T cell count, associated with immune reconstitution, we analyzed data from a cohort of 67 patients who initiated antiretroviral therapy with very low CD4 T cell counts, using the machine learning models Least Absolute Shrinkage and Selection Operator (LASSO), adaptive LASSO and Conditional Inference Random Forest, the latter being less biased by correlations among independent variables.

Patients and methods
We used data from a previous study cohort [11,12] of patients who started antiretroviral therapy (ART) and were followed up for one year. The study was approved by our Institutional Research and Ethics Board (code C04-08), and all patients signed informed consent complying with the Helsinki Declaration.
CD4 T cell and CD8 T cell counts (TriTest kit, FACS-Calibur system, Becton Dickinson, San Jose, CA, USA), and plasma VL (Cobas Amplicor-PCR-Monitor HIV-1 system, Roche Diagnostics, Indianapolis, IN, USA) were measured at the Center for Research in Infectious Diseases of our Institute. We included blood biometry, blood chemistry, and liver function tests, performed at the Clinical Laboratory of our Institute [11].
We used Least Absolute Shrinkage and Selection Operator (LASSO) as implemented in the R package "glmnet" with default parameters [13,14] to find the main linear independent predictors of CD4 T cell recovery, which was defined as the net increment of CD4 T cells/mm³ blood (delta CD4 T cells) in the first year of treatment (week 52 CD4 T cell counts minus basal counts). This measure was chosen because a large portion of patients did not achieve counts above 200 or 350 CD4 T cells/mm³ blood at week 52, which are thresholds used to define satisfactory reconstitution.
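In symbols, and assuming that "glmnet" was run with its default mixing parameter (alpha = 1, a pure LASSO penalty) and a Gaussian response, the outcome and the minimized objective can be written as below. The notation (i for patient, x_i for the vector of p predictors, beta for coefficients, lambda for the penalty weight) is introduced here only for illustration and is not taken from the original scripts.

```latex
\begin{align*}
\Delta\mathrm{CD4}_i &= \mathrm{CD4}_{i,\,\text{week 52}} - \mathrm{CD4}_{i,\,\text{basal}} \\
(\hat{\beta}_0, \hat{\beta}) &= \operatorname*{arg\,min}_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\bigl(\Delta\mathrm{CD4}_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
\end{align*}
```

Larger values of lambda shrink more coefficients exactly to zero, which is how the method performs variable selection while fitting.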
LASSO achieves model fitting and feature selection simultaneously [15]. It was implemented using leave-one-out cross-validation (LOOCV), in which the algorithm is trained on N-1 samples (N = total number of samples) and the fit is evaluated on the one sample withheld from the model. The "glmnet" implementation standardizes the data by default and rescales the returned coefficients to the original scale. Scaling is required only by this model. LASSO analysis allows no missing values, so raw data from 99 patients (Supplementary File 1_Raw Data) were processed to obtain a data set with the maximal number of linear variables having no missing values. Data processing yielded 67 patients with 53 linear variables used as independent variables in the model (Supplementary File 2_Clean Data, and Supplementary File 3_Variable identification). All chosen patients had reached undetectable VL at week 52, and had available CD4 T cell counts at weeks 0 (basal), 8, 12, 24, 39, and 52 after treatment initiation. We included two patients with detectable low viral loads at week 52, because these corresponded to viral blips, according to accepted definitions [16]. Both had reached undetectable viral loads at week 39, just before week 52, and had undetectable viral loads after 104 weeks on treatment.
The 53 variables comprised mainly standard routine laboratory tests, in addition to HIV treatment follow-up variables determined at fixed time points (weeks 0, 8, 24, 39, and 52). Basal age and body-mass index were also included. We excluded week 52 and basal CD4 T cell counts because delta CD4 T cells (our dependent variable) is computed arithmetically from them. Therefore, our study yields predictors additional to nadir CD4 T cells.
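The standardization and rescaling step described above can be illustrated with a minimal coordinate-descent LASSO in plain Python. This is a sketch under stated assumptions, not the "glmnet" implementation: the function names (standardize, soft_threshold, lasso_fit), the fixed number of sweeps, the use of the population standard deviation, and the toy data are all hypothetical choices made for illustration.

```python
def standardize(columns):
    """Center and scale each predictor column (population SD, illustrative choice).

    Assumes no column is constant (a zero SD would divide by zero).
    """
    scaled, means, sds = [], [], []
    for col in columns:
        n = len(col)
        mean = sum(col) / n
        sd = (sum((v - mean) ** 2 for v in col) / n) ** 0.5
        scaled.append([(v - mean) / sd for v in col])
        means.append(mean)
        sds.append(sd)
    return scaled, means, sds


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(columns, y, lam, n_sweeps=200):
    """Minimize (1/2N) * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    Predictors are standardized internally; the returned intercept and
    coefficients are rescaled back to the original units of the predictors.
    """
    n = len(y)
    scaled, means, sds = standardize(columns)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]          # residuals with all betas at zero
    beta = [0.0] * len(columns)
    for _ in range(n_sweeps):
        for j, xj in enumerate(scaled):
            # Correlation of predictor j with the partial residual.
            rho = sum(x * r for x, r in zip(xj, resid)) / n + beta[j]
            new_beta = soft_threshold(rho, lam)
            if new_beta != beta[j]:
                step = new_beta - beta[j]
                resid = [r - step * x for r, x in zip(resid, xj)]
                beta[j] = new_beta
    coefs = [b / s for b, s in zip(beta, sds)]
    intercept = y_mean - sum(c * m for c, m in zip(coefs, means))
    return intercept, coefs


# Hypothetical toy data: the outcome depends on the first predictor only.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1, -1, 1, -1, 1, -1]
y = [3 + 2 * a for a in x1]
intercept, coefs = lasso_fit([x1, x2], y, lam=0.5)
```

With a moderate penalty the coefficient of the uninformative predictor is set exactly to zero while the informative one is shrunk, which mirrors the simultaneous fitting and selection described in the text.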
A linear regression was fitted using the 20 variables with the greatest LASSO coefficients (the top 20 contributors to the model), as in previous studies [17,18], to assess statistical significance and provide a simple output that is easy to interpret. The linear model used all samples at once (N = 67), whereas LASSO performed N iterations, each trained on a different set of N-1 samples (N-1 = 66). This provided additional validation of the LASSO results. To evaluate possible over-fitting, the linear regression was also run with leave-one-out cross-validation (given the small sample size). As with LASSO, we performed N linear regressions, each trained on N-1 samples, and plotted the prediction for the sample that was unknown to the model.
Among correlated variables, LASSO tends to select one arbitrarily and shrink the others to zero, which may lead to over-fitting. To address this limitation, we generated a model using partial correlations with adaptive LASSO using "adalasso.net", which mitigates over-fitting. We also generated a Conditional Inference Random Forest with the "cforest" function of the R package "party", which is suitable for models involving multicollinearity. Combined with the "varimp" function, it can measure variable importance, and it avoids over-fitting. While LASSO assumes a linear relation, conditional inference random forest can model non-linear relationships between variables.
Data analyses were performed using R 3.6.1 [14]. The scripts are available (https://github.com/caramirezal/vihCohort/blob/master/scripts/article_figures.R). Additionally, we performed univariate regressions of week 24 CD4/CD8 ratio with percentages of subpopulations of CD4 and CD8 T cells. This analysis was done using SPSS software.
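The leave-one-out procedure used for the linear regression can be sketched as follows. This is a simplified illustration in plain Python, assuming a single predictor rather than the 20 LASSO-selected variables of the study; the function names (ols_fit, loocv_predictions, rmse) and the example values are hypothetical.

```python
def ols_fit(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def loocv_predictions(x, y):
    """Fit N models, each on N-1 samples, and predict the withheld sample."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        intercept, slope = ols_fit(x_train, y_train)
        predictions.append(intercept + slope * x[i])
    return predictions


def rmse(observed, predicted):
    """Root mean squared error between observed and held-out predictions."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5


# Hypothetical example: a predictor (e.g. a ratio) and an outcome (e.g. a CD4 gain).
predictor = [0.10, 0.15, 0.22, 0.30, 0.41, 0.55]
outcome = [60.0, 95.0, 120.0, 170.0, 210.0, 290.0]
held_out = loocv_predictions(predictor, outcome)
cv_error = rmse(outcome, held_out)
```

Plotting the held-out predictions against the observed values, as described above, shows how well the model generalizes to a patient it has not seen. A cross-validated error much larger than the in-sample error would suggest over-fitting.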

Results
Basal features of patients are displayed in Table 1.
All patients reached undetectable VL at week 52 after initiating treatment (Fig. 1 A). Sixty-four patients were male and three were female. Patients' basal CD4 T cell counts ranged from 3 to 275 CD4 T cells/mm³ blood (Table 1). Only 17 patients had basal counts above 100, and only 4 above 200 (Fig. 1 B).
Twelve of the major LASSO-selected variables were also selected by adaptive LASSO. They comprised week 52 and week 39 CD4/CD8 ratio, week 24 creatinine, week 8 CD4/CD8 ratio, basal % CD4 T cells, week 52 % CD4 T cells, week 52 % CD8 T cells, basal hematocrit, week 24 CD4/CD8 ratio, basal leucocyte count, week 52 viral load and week 52 CD8 T cell counts. The coefficients of all variables are listed in Supplementary File 6.
Conditional inference random forest analysis yielded 26 variables with positive importance coefficients (Fig. 4). Among them, eight corresponded to major variables selected by both LASSO and adaptive LASSO: CD4/CD8 ratio at weeks 24, 39 and 52, % CD4 T cells at weeks 0 and 52, week 52 CD8 T cell count and percentage, as well as basal hematocrit. These eight variables are thus independently validated. Conditional inference random forest also assigned a positive coefficient to the occurrence of immune reconstitution inflammatory syndrome (IRIS) (importance coefficient = 81.06), which had also been selected by adaptive LASSO.
CD4/CD8 ratio values remained below 1 (known as an inverted ratio [19]), even though they increased consistently during the whole follow-up (Fig. 5).
In univariate correlations, week 24 CD4/CD8 ratio correlated negatively with the basal (week 0) percentage of CD28− cells among effector memory CD4 T cells, basal blood CD8 T cell count, week 24 percentage of activated naive CD8 T cells, and week 24 percentage of CD28− naive CD4 T cells (Table 2). Week 24 CD4/CD8 ratio correlated positively with week 24 percentage of memory CD8 T cells (Table 2).

Discussion
In many studies, reconstitution is indicated by an end-point count of CD4 T cells after initiation of antiretroviral treatment (ART) [20][21][22]. Other studies classify patients as good or poor responders according to a fixed arbitrary increase in CD4 T cell counts [23]. It is well established that initial counts predict end-point counts [5,6]. In contrast to end-point counts, our work studies the net increment in CD4 T cell counts (delta CD4 T cells), which also determines reconstitution. In the current late-presenting cohort, we found that initial CD4 T cell counts did not predict the net gain of CD4 T cells. By using a set of LASSO-selected variables, we built a model with a significant fit to this reconstitution variable. These variables are important because they may reflect mechanisms underlying CD4 T cell recovery. Among them, those measured at ART initiation or early thereafter may be further evaluated as potential predictors of CD4 T cell gain.
Our findings extend the broad context of CD4/CD8 ratio studies. In very elderly people (> 80 years) without HIV, an inverted CD4/CD8 ratio (lower than 1) was associated with increased mortality [24], increased activation of CD8 T cells, a shrunken T cell repertoire [25], low CD4 T cell percentages, decreased T cell response to mitogens, and a greater frequency of cytomegalovirus and Epstein-Barr virus infections [24,26]. In the setting of untreated HIV infection, CD4/CD8 inversion was a predictor of disease progression [19]. In HIV+ patients starting ART, CD4/CD8 ratio was a predictor of AIDS clinical events and death [27]. A higher overall morbidity would be expected in the present cohort, which started antiretroviral treatment with a great loss of CD4 T cells [28].
CD4/CD8 ratio is associated with non-AIDS events in people with controlled viremia under ART [29,30]. In these patients, CD4/CD8 ratio is also correlated with activation and senescence of T cells [29,31]. In our cohort, week 24 CD4/CD8 ratio (the earliest predictive variable validated by the three methods) correlated negatively with the contemporaneous percent of activated naive CD8 T cells and the frequency of differentiated (CD28−) naive CD4 T cells. It also correlated negatively with the basal percentages of CD28− (senescent) effector memory CD4 T cells. The positive correlation with frequencies of less differentiated subpopulations and the negative correlation with T cell activation concur with previous findings [29,32]. Our findings extend the predictive capacity of CD4/CD8 ratio to net CD4 T cell recovery under ART. There are two previous studies addressing this variable. A study of 30 patients with relatively preserved CD4 T cell counts found a correlation of CD4/CD8 ratio with CD4 T cell increment two years after ART initiation [33]. Another study did not find any association six months after ART initiation [34]. In our study of patients starting ART with profound CD4 T cell loss, CD4/CD8 ratios at weeks 24, 39 and 52 were associated with absolute CD4 T cell gain, according to three machine learning methods.
Table 2 (partial). Univariate correlations of week 24 CD4/CD8 ratio with T cell subpopulations (variable; correlation coefficient; p; n).
Week 24 %CD38+HLADR+ of naive CD8 T cells; −0.397; 0.04963967; 25
Week 24 %memory of total CD8 T cells; 0.408; 0.04281341; 25
Week 24 %CD28− of naive CD4 T cells; −0.428; 0.04679031; 22
Basal naive CD8 T cells/mm³ blood; −0.487; 0.0183949
It is noteworthy that in our advanced-stage cohort, proportions of CD4 T cells were major correlates of CD4 T cell gain. Week 52 % CD8 T cells and % CD4 T cells, as well as basal % CD4 T cells, were consistently associated with CD4 T cell recovery by all models. Moreover, week 52 CD8 T cell counts, which determine CD4/CD8 ratio, were among these variables. In this regard, CD8 T cells are expanded in HIV disease [35][36][37]. These findings suggest a possible independence of CD4 T cell proportions as correlates of immune recovery. In line with this possible independence, CD4/CD8 ratio correlated with CD4 T cell increment after two years on ART, even though basal CD4 T cell counts did not correlate with CD4 T cell increment [33]. Interestingly, the patients in that study had basal CD4 T cell counts significantly higher than those of our cohort (median 380, IQR 310-500, Mann Whitney U = 81, p < 0.001), suggesting a broad predictive capacity of CD4/CD8 ratio. Our findings extend the known disadvantage of a low CD4/CD8 ratio, which already predicts the incidence of non-AIDS-defining illnesses, even when CD4 T cell counts are normal [38], and even when the analysis is adjusted for CD4 T cell counts [29].
The occurrence of immune reconstitution inflammatory syndrome (IRIS) was selected by adaptive LASSO and by conditional inference random forest (importance coefficient = 81.1). Also indicating inflammation, week 24 alkaline phosphatase was selected by conditional inference random forest, but was not selected by LASSO models. In univariate regression, it correlated negatively with delta CD4 T cells (Spearman's Rho = −0.25, p = 0.04). A deleterious effect of inflammation on immune reconstitution under ART has been described [39,40], and could be related to CD4/CD8 ratio [31].
A limitation of our study is the lack of additional measures of model performance, such as the out-of-bag (OOB) error for Conditional Inference Random Forest. Nevertheless, the consistency between different algorithms supports the importance of our variables.
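For reference, Spearman's Rho as reported above is the Pearson correlation computed on ranks, with tied values sharing their average rank. A minimal sketch in plain Python follows. The function names (average_ranks, pearson, spearman_rho) and the example values are hypothetical, and the study's own coefficient came from its statistical software, not from this code.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: an enzyme level and a CD4 gain for six patients.
enzyme_level = [80, 95, 110, 120, 150, 200]
cd4_gain = [260, 180, 210, 150, 170, 90]
rho = spearman_rho(enzyme_level, cd4_gain)
```

Because it depends only on ranks, the coefficient captures monotonic associations and is insensitive to outlying laboratory values, which suits a small cohort.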

Conclusion
Among patients initiating ART at a very advanced stage of HIV infection, proportions of CD4 T cells are associated with the immunological response to antiretroviral therapy, possibly reflecting reconstitution determinants other than initial CD4 T cell counts. Together with IRIS and alkaline phosphatase, they comprise a set of accessible variables that could be used to predict the response to antiretroviral therapy.