Spinal Cord Injury AIS Predictions Using Machine Learning

Introduction
Spinal cord injury (SCI) profoundly changes a patient's life. Effects range from impaired motor function, up to and including paralysis of the limbs, to mental health effects such as depression or suicide. Patient outcomes are highly sensitive to the location and extent of the injury on the spinal cord. In general, an injury closer to the brainstem has a greater impact. Complications include paralysis, loss of sensation, increased risk of developing pressure ulcers, bladder dysfunction, neurogenic bowel, muscle atrophy, autonomic dysreflexia, and impaired sexual function (Sezer et al., 2015).
The American Spinal Injury Association Impairment Scale (AIS) classifies the motor-sensory abilities of a patient with an SCI (Ho et al., 2007). There are five letter-grade categories, A through E: AIS grade A is a complete injury with no retention of motor control or sensory function below the point of injury, and AIS grade E is an injury with minimal impact on the patient. Clinicians use the AIS to classify SCI and to quantify SCI recovery, for example, an improvement from grade B to C or a deterioration from grade D to C.
One key question among SCI patients is how the severity of their injury, as measured by the AIS, will improve or deteriorate during the course of SCI recovery. This varies by the type of injury and demographic differences between patients.
In a traditional clinical setting, the main factors identified for prognostication of SCI recovery include patient age, patient gender, length of inpatient stay, type of inpatient discharge, type of SCI, time to procedure, procedure type, and comorbidities (Chay and Kirshblum, 2020). SCI prognosis is primarily conducted by either standard of care diagnostics, including heuristic bedside evaluation and magnetic resonance imaging (MRI), or traditional clinical analysis, such as an odds ratio statistic (Burns et al., 2012). Thus, there is an opportunity to support SCI recovery by adding a machine learning-based framework of SCI prognosis using big data and precision medicine as one of the clinician's tools for improving SCI patient outlook.
Researchers have developed many tools for SCI treatment; however, a literature review revealed a wide gap in the use of machine learning algorithms to predict SCI recovery in a contemporary precision medicine context, especially with regard to feature importance and the use of a very large dataset (Snoek et al., 2004; Munce et al., 2014). One study attempted to predict discharge location using an ensemble model, with area under the curve as the evaluation metric (Fan et al., 2021); another applied convolutional neural networks (CNNs) to MRI images and achieved an accuracy of 71.4% (Okimatsu et al., 2022); and two others restricted their scope to specific AIS grades, namely A (Buri et al., 2022) and D/E (Inoue et al., 2020). Chou et al. (2022) conducted a study similar to the one presented here, but their sample consisted of 74 patients.
The present study is based on the National Spinal Cord Injury Statistical Center (NSCISC) database, which includes details from patients across the United States (Chen et al., 2016). Several different machine learning models were used to predict AIS level at patient discharge for data recorded between 1972 and 2016, and the best model was further examined to extract feature importance information. The ground truth AIS scores at discharge were supplied as part of the dataset.
The analysis of feature importance serves two purposes. One is to verify the importance of AIS classification at the time of hospital admission as a critical feature, using a data-driven approach. The other purpose is to identify demographic features that also play a crucial role in determining recovery.

Materials and Methods
Computational implementation was conducted using Python version 3.8, shap version 0.40.0, and scikit-learn version 1.0.1.

Data preparation
The NSCISC database comprises >29,000 traumatic SCIs recorded since 1973 for patients who were treated at any regional model SCI system within the first year of injury and who signed a consent form for inclusion (DeVivo et al., 2002). The patient details within the database have also been stripped of all identifiers defined by the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The NSCISC dataset was loaded from the published CSV format and had 417 raw features. A custom data mapper was used to translate the raw data headers and values into more recognizable features. For example, the label AWghtRhb was translated to "Weight at Admission." Any nonrecorded values in the dataset were assigned a value of "Unknown." The dataset contained far more features than were relevant to the machine learning models designed, so the data used for training were limited to patient information known at or before hospital admission. Exploratory data analysis showed that some features had >90% missingness, and these were excluded from consideration as model inputs. Other variables were excluded because they were specific to a certain area of the body or spine, such as the sensory level of the left side at hospital admission. Including all of these would have greatly increased the sparsity of the eventual feature vector, incurring the risks associated with the curse of dimensionality. Univariate analysis was also performed on candidate features to examine maximum/minimum values, counts, and outliers. Table 1 shows the final features chosen as well as their mapped versions for input into machine learning model construction, along with the imputations performed for missing values.
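The header mapping, the "Unknown" labeling, and the missingness filter described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' data mapper: the function names, the `RareField` column, and the toy records are assumptions, and only the `AWghtRhb` mapping is taken from the text.

```python
from typing import Dict, List, Optional

# Hypothetical raw-to-readable header map; only AWghtRhb is named in the text.
HEADER_MAP = {"AWghtRhb": "Weight at Admission"}

Row = Dict[str, Optional[str]]


def rename_and_fill(rows: List[Row]) -> List[Dict[str, str]]:
    """Translate raw headers and label nonrecorded values as 'Unknown'."""
    return [
        {
            HEADER_MAP.get(key, key): ("Unknown" if value in (None, "") else value)
            for key, value in row.items()
        }
        for row in rows
    ]


def drop_sparse_features(
    rows: List[Dict[str, str]], threshold: float = 0.90
) -> List[Dict[str, str]]:
    """Remove features whose share of 'Unknown' values exceeds the threshold."""
    if not rows:
        return []
    n = len(rows)
    keep = [
        feature
        for feature in rows[0]
        if sum(r[feature] == "Unknown" for r in rows) / n <= threshold
    ]
    return [{feature: r[feature] for feature in keep} for r in rows]


# Toy records invented for illustration (not NSCISC data):
# RareField is missing in 19 of 20 rows (95%), AWghtRhb in 1 of 20 (5%).
raw: List[Row] = [{"AWghtRhb": "70", "RareField": None} for _ in range(19)]
raw.append({"AWghtRhb": None, "RareField": "x"})

prepared = drop_sparse_features(rename_and_fill(raw))
```

In this toy input, `RareField` exceeds the 90% missingness threshold and is dropped, while the weight column is kept with its single missing entry labeled "Unknown".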
Generally, a missing value was imputed with the feature's mode, unless "Unknown" was already among the feature's recorded values, in which case the "Unknown" label was assigned instead. The mode was preferred over creating a new Unknown label for two primary reasons. The first was to avoid creating a value that very few rows have, which could have the side effect of incorrectly flagging these patients as disproportionately important during training. The second was to avoid inflating feature dimensionality for features such as education level at injury, because the introduction of an Unknown value would require a transition from an ordinal feature to a one-hot encoded feature. Sex and age at injury were the only features that were fully populated, whereas AIS score and level of injury at hospital admission were the only features where a missing value resulted in a dropped row.
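The imputation rule can be illustrated with a short sketch. The function name and the example category values are hypothetical, and the sketch assumes each feature has at least one recorded value.

```python
from collections import Counter
from typing import List, Optional


def impute_feature(values: List[Optional[str]]) -> List[str]:
    """Fill missing entries with 'Unknown' if that label already exists
    among the recorded values, otherwise with the mode of the recorded values.

    Assumes at least one value is recorded for the feature.
    """
    recorded = [v for v in values if v is not None]
    if "Unknown" in recorded:
        fill = "Unknown"
    else:
        fill = Counter(recorded).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical category values for illustration only.
marital = impute_feature(["Never married", "Married", None, "Never married"])
insurance = impute_feature(["Private", "Unknown", None, "Private"])
```

Here the missing marital status is filled with the mode ("Never married"), whereas the missing insurance entry takes the existing "Unknown" label, so no new category is created in either case.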
From analysis, the typical profile of an SCI patient was found to be a male, between 19 and 29 years of age, white, and never married, although the dataset showed plenty of variation from this modal profile.
Of the dataset features, three variables were determined to be the most suitable for measuring SCI recovery through the patient's course of treatment: AIS score, neurologic disposition at hospital discharge, and patient's level of injury. These three features were suitable because they capture patient physical improvements throughout the entire body. The AIS score at discharge was ultimately chosen because of its widespread use in the literature (Roberts et al., 2017; Inoue et al., 2020; Buri et al., 2022; Chou et al., 2022; Okimatsu et al., 2022). Using this target variable meant designing a five-class classification model, where each class is one of A, B, C, D, or E. With this definition, some patient predictions could also involve a worsening of the AIS score. In the dataset, only 329 such cases were found in total, 275 in the eventual training set and 54 in the eventual testing set, for a combined ~1.6% of all samples. As a result, the vast majority of predictions were focused on recovery.
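As a minimal sketch of how improvement and deterioration can be identified from the admission and discharge grades, the AIS letters can be mapped to an ordinal scale. The helper names and the example records below are hypothetical and are not taken from the authors' codebase.

```python
from typing import List, Tuple

# Ordinal encoding of AIS grades: A is the most severe, E the least.
AIS_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}


def ais_change(admission: str, discharge: str) -> int:
    """Positive = improvement, negative = deterioration, zero = no change."""
    return AIS_ORDER[discharge] - AIS_ORDER[admission]


def drop_deteriorated(records: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (admission, discharge) pairs that did not worsen."""
    return [r for r in records if ais_change(r[0], r[1]) >= 0]


# Hypothetical (admission, discharge) pairs for illustration.
records = [("B", "C"), ("D", "C"), ("A", "A")]
kept = drop_deteriorated(records)
```

A filter of this kind corresponds to the subset analysis in which deteriorating patients are excluded before training and evaluation.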
The finalized dataset after analysis that was used for model training is described in Table 2. The 417 raw NSCISC features were reduced to 18 and later mapped to 53 model-ready features. The train/test split of ~90:10 was reached after trialing different ratios and evaluating test accuracy.

Modeling and feature importance
Eight different machine learning models were tuned after the dataset was prepared, and test set prediction accuracy was the deciding criterion for the best model. The process of data preparation through to model selection is outlined in Figure 1.
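The selection procedure, in which several classifiers are fit on the training split and ranked by test accuracy, can be sketched with scikit-learn as follows. This is an assumption-laden illustration rather than the authors' pipeline: the data are synthetic (generated with `make_classification`), only three linear models are shown with default hyperparameters instead of the eight tuned models, and the stratified split is a choice made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic five-class stand-in with 53 features (not NSCISC data).
X, y = make_classification(
    n_samples=2000,
    n_features=53,
    n_informative=10,
    n_classes=5,
    random_state=0,
)

# Roughly 90:10 train/test split; stratification is an assumption of this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0
)

# Untuned candidates used only to show the comparison loop.
candidates = {
    "Ridge Classifier": RidgeClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(dual=False),
}

test_accuracy = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# The model with the highest test accuracy is carried forward.
best_model = max(test_accuracy, key=test_accuracy.get)
```

Accuracies obtained on this synthetic data say nothing about the NSCISC results; the sketch only shows the shape of the comparison.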
To extract feature importance from the top model, Shapley values were used (Parsa et al., 2020). These provide a quantitative measurement of how strongly a certain feature contributes to predicting a specific class output. The magnitude of the Shapley value was used so that feature strength for or against a specific class is captured. These values were computed per sample, and taking the mean over all samples provides an average importance. The strength of each of the 53 features with respect to a given class is thus given as follows:

$$S_{f,c} = \frac{1}{N} \sum_{i=1}^{N} \left| \phi_{f,c}^{(i)} \right| \qquad (1)$$

where f, c, and N correspond to feature, class, and number of samples, respectively, and $\phi_{f,c}^{(i)}$ denotes the Shapley value of feature f for class c on sample i. Feature strength can be summed across all classes to give an overall importance metric with respect to the model as a whole, as seen in the following:

$$S_{f,m} = \sum_{c \in C} S_{f,c} \qquad (2)$$

where m and C are the model and the set of all classes, respectively.
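The two aggregation steps described above (the mean absolute Shapley value over samples for each class, then the sum over classes) can be sketched with NumPy. The array below is hypothetical and hand-made. In practice, the output of a Shapley library would first need to be arranged into the assumed (classes, samples, features) layout.

```python
import numpy as np

# Hypothetical Shapley values with shape (classes, samples, features) = (2, 2, 3).
phi = np.array(
    [
        [[1.0, -2.0, 0.0], [-3.0, 2.0, 0.0]],  # class 0, two samples
        [[0.5, 0.0, -1.0], [-0.5, 0.0, 1.0]],  # class 1, two samples
    ]
)

# Per-class strength: mean of the absolute values over the sample axis.
per_class_strength = np.abs(phi).mean(axis=1)  # shape (classes, features)

# Overall strength: sum of the per-class strengths over all classes.
overall_strength = per_class_strength.sum(axis=0)  # shape (features,)

# Feature indices ordered from most to least important.
ranking = np.argsort(-overall_strength)
```

Taking absolute values before averaging matters: for feature 0 in class 0, the signed values 1 and -3 would average to -1, whereas the magnitudes average to 2, which reflects how strongly the feature moves predictions in either direction.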

Results
The results after training all tuned models are described in Table 3. With a multiclass test accuracy of 73.6%, the best-performing model on the NSCISC dataset for SCI recovery prognostication was the Ridge Classifier. SVM, Elastic Net, and Logistic Regression closely followed with 73.5%, 73.2%, and 73.2%, respectively. These results are very promising, given that they rely only on information discovered or provided at an initial assessment. Reapplying the Ridge Classifier to the dataset after removing the 329 patients whose AIS scores were worse at hospital discharge than at admission gave a higher multiclass test accuracy of 75.3%.
To determine which inputs are most crucial to predicting recovery, Shapley values were applied, as per Equation 1 and Equation 2, to the Ridge Classifier model. The top 20 most important features are visualized in Figure 2 and are recorded in Table 4. It has been previously shown that the best indicators of AIS score at hospital discharge are generally the AIS score at hospital admission and the neurologic category at admission (Chay and Kirshblum, 2020), and the results confirm this.
Demographic details were less impactful. From previous research, it was expected that age would be among the most important features (Seel et al., 2001; Wilson et al., 2012, 2014). While it was, a number of other features, such as sex and race, had around the same level of absolute mean importance. Interestingly, age was disproportionately more important for AIS grade C predictions. Unexpectedly, marital status proved to be even more important than age.

Discussion
The model showed promising results in predicting AIS improvement. The 73.6% test accuracy can be considered a benchmark for future improvement. Given the large number of data samples, more complex models, such as deep neural networks, could also be trialed. More engineered features could add insight while reducing dimensionality. For example, the height and weight features could be replaced with a body mass index measure. Finally, while the SCI recovery predictions are the result of one machine learning model in the pipeline, additional improvements in model performance could be gained by creating a model of models, in which multiple submodels optimize for predicting features of importance to SCI recovery and then feed into an overall model for prognosticating the patient.
One attempt to increase performance was to drop the 1.6% of patients whose conditions deteriorated in terms of AIS score. This raised test accuracy by 1.7 percentage points, to 75.3%. This was likely because cases in which patient scores deteriorated were difficult for the model to fit appropriately. This variant could be used as a secondary model if patients have shown signs of AIS recovery before hospital discharge. Otherwise, the original model is more appropriate since it makes no assumptions about progression. Nonetheless, the increase in test accuracy exceeded expectations, considering that the difference in this metric between the Ridge Classifier and the next three best-performing models on the original dataset was, at most, 0.4 percentage points.
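As an illustration of the engineered feature mentioned above, body mass index is conventionally computed as weight in kilograms divided by the square of height in meters. The function below is a hypothetical sketch. The units in which the NSCISC height and weight fields are recorded are not stated here, so a unit conversion may be needed before applying it.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight (kg) / height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m**2


# Example with made-up values: 70 kg and 1.75 m.
bmi = body_mass_index(70.0, 1.75)
```

Replacing two inputs with one in this way reduces the feature count by one while retaining a clinically familiar summary of body composition.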

Table 4. Top 20 most important features
Rank  Machine learning feature
1     AIS-admission_A
2     AIS-admission_D
3     AIS-admission_B
4     Neurologic category-admission_complete paraplegic
5     Neurologic category-admission_incomplete tetraplegic
6     AIS-admission_C
7     Neurologic category-admission_complete tetraplegic
8     Neurologic category-admission_incomplete paraplegic
9     Level of Injury-admission
10    Marital status-injury_never married
11    Marital status-injury_married
12    Occupation status-injury_working
13    Primary insurance_unknown
14    Occupation status-injury_student or infant
15    Age-injury
16    Marital status-injury_divorced
17    Race_white
18    Sex
19    Occupation status-injury_retired
20    Education-injury

Feature strength gave a better understanding of which areas most affected recovery, both in validating the importance of the hospital admission AIS score and in understanding the role that demographics play. The surprising result of marital status exceeding the importance of age was an important outcome of the study. The presence of a support system for a patient may be a critical component of recovery success, though this would need to be examined more closely in future work. However, an important note for the Shapley analysis is that the values are computed from real data on SCI recovery, and there may be socioeconomic factors at play that act as social determinants of health among disadvantaged and underserved groups. As a result, whether these demographic details are highlighted as strong or weak features may be partially or completely attributable to societal complexities.
Compared with the study by Okimatsu et al. (2022), the test accuracy here of 73.6% is an improvement over the MRI CNN accuracy of 71.4%. Furthermore, the Ridge Classifier is much easier to interpret and is more time-efficient to train. Including MRI imagery as a set of features is a possible route for future research that could further bolster the Ridge Classifier. In contrast to the studies by Inoue et al. (2020), Fan et al. (2021), Buri et al. (2022), Chou et al. (2022), and Okimatsu et al. (2022), taken as a whole, the study conducted here uses a patient base one to two times larger and includes a comprehensive review of feature importance. In addition, the results are, overall, comparable or better while considering all AIS classes and using a very lightweight model that can be deployed much more easily.
To extend the machine learning research into SCI recovery outlined in this article, the codebase found at https://github.com/kapoor1992/spinal_cord_injury_recovery, which is also available as Extended Data 1, can be augmented by including more models or altering input features before metrics are recomputed.
There are a few limitations with the dataset that could be addressed in future research. First, the features did not include information past hospital admission time. Including new variables recorded during an inpatient stay may be important for evaluating how much these initial features vary in importance. For example, the amount of weekly physical therapy, the amount of counseling that patients receive, and other factors could drastically alter results. The scope of the current article only captured a snapshot of prognosis at initial intake and evaluation of the newly injured patient. Also, in terms of model performance and tuning, as more data are extracted from the NSCISC dataset, there is a growing opportunity to measure accuracy or other metrics in a longitudinal study. This can help to address the limitation that the underlying NSCISC data are a static data source, which does not have the capability for either batch or streaming updates to the data. Therefore, the quality of the machine learning model predictions may decay over time as the SCI patient population experiences underlying demographic shifts.