SURC 2025 Student Presentations
SUNY Undergraduate Research Conference Student Presentations

Data cleaning and preprocessing from EHR during COVID-19 using R

Authors: Patrick Zhang, Haimonti Dutta

SUNY Campus: SUNY Buffalo

Presentation Type: Art Exhibition

Location: Old Union Hall

Presentation #: 21

Timeslot: Session D 3:00-4:00 PM

Abstract: Electronic Health Records (EHRs) have been widely used during public health crises, particularly COVID-19. Accurate data records are essential to data-driven medical decision-making. This study focuses on preprocessing EHR data collected in Canadian hospitals during COVID-19, and in particular on three challenges: unstructured free-text records of other comorbidities, variability in smoking history, and the incorporation of medication dosage and frequency. All preprocessing was done in R. Because the records of other comorbidities consisted of free text, we applied extensive data cleaning, using a custom-defined medical dictionary together with existing English dictionaries to remove all non-year numbers and non-standard terms. This allowed us to extract the year in which each medical event occurred, which serves as an important marker in the patient's medical history. Smoking history was systematically encoded into categorical and numerical formats to quantify the time since smoking cessation (up to 2025). These preprocessing steps generated multiple data reports, which were eventually merged into one large one-hot encoded matrix. The processed data will be used for regression analysis. By converting complex free-text clinical records into analyzable data sets, this work can provide a reference for improving clinical resource management and formulating strategies.
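The preprocessing steps described in the abstract can be illustrated with a minimal base R sketch. It is not the authors' code: the toy records, the column names, the three-word `medical_terms` dictionary, the 1900-2099 year pattern, and the smoking-status keywords are all hypothetical assumptions chosen for illustration, and medication dosage and frequency are left out.

```r
# Toy records; all values and column names are illustrative assumptions,
# not data from the study.
records <- data.frame(
  patient_id  = c(1, 2, 3),
  comorbidity = c("asthma dx 2011, 2 puffs daily",
                  "HTN since 1998; appendectomy 2005",
                  "none reported"),
  smoking     = c("never", "former, quit 2015", "current"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the custom medical dictionary.
medical_terms <- c("asthma", "htn", "appendectomy")

# Keep only dictionary terms and four-digit years (assumed range 1900-2099);
# other numbers and non-dictionary words are dropped.
clean_tokens <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  keep <- tokens %in% medical_terms | grepl("^(19|20)[0-9]{2}$", tokens)
  paste(tokens[keep], collapse = " ")
}
records$comorbidity_clean <- vapply(records$comorbidity, clean_tokens,
                                    character(1), USE.NAMES = FALSE)

# Return the four-digit years found in each string as integer vectors.
extract_years <- function(text) {
  hits <- regmatches(text, gregexpr("\\b(19|20)[0-9]{2}\\b", text, perl = TRUE))
  lapply(hits, as.integer)
}

# Earliest year mentioned per patient; NA when no year is present.
records$first_event_year <- vapply(
  extract_years(records$comorbidity_clean),
  function(y) if (length(y)) min(y) else NA_integer_,
  integer(1)
)

# Categorical smoking status plus years since cessation, relative to 2025.
reference_year <- 2025L
records$smoking_status <- factor(
  ifelse(grepl("never", records$smoking), "never",
         ifelse(grepl("former|quit", records$smoking), "former", "current")),
  levels = c("never", "former", "current")
)
quit_year <- vapply(
  extract_years(records$smoking),
  function(y) if (length(y)) max(y) else NA_integer_,
  integer(1)
)
records$years_since_quit <- ifelse(records$smoking_status == "former",
                                   reference_year - quit_year, NA_integer_)

# One-hot encode smoking status and merge with the numeric features.
one_hot <- model.matrix(~ smoking_status - 1, data = records)
encoded <- cbind(
  records[c("patient_id", "first_event_year", "years_since_quit")],
  one_hot
)
print(encoded)
```

In this sketch the one-hot columns come from `model.matrix()` with the intercept removed, so each status level gets its own indicator column. A real pipeline would need a far larger dictionary and a decision on how to handle missing values before regression.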