Title: | Creates a Harmonized Dataset Based on a Set of Instructions |
---|---|
Description: | Functions which facilitate harmonization of data from multiple different datasets. Data harmonization involves taking data sources with differing values, creating coding instructions to create a harmonized set of values, then making those data modifications. 'psHarmonize' will assist with data modification once the harmonization instructions are written. Coding instructions are written by the user to create a "harmonization sheet". This sheet catalogs variable names, domains (e.g. clinical, behavioral, outcomes), provides R code instructions for mapping or conversion of data, specifies the variable name in the harmonized data set, and tracks notes. The package will then harmonize the source datasets according to the harmonization sheet to create a harmonized dataset. Once harmonization is finished, the package also has functions that will create descriptive statistics using 'RMarkdown'. Data Harmonization guidelines have been described by Fortier I, Raina P, Van den Heuvel ER, et al. (2017) <doi:10.1093/ije/dyw075>. Additional details of our R package have been described by Stephen JJ, Carolan P, Krefman AE, et al. (2024) <doi:10.1016/j.patter.2024.101003>. |
Authors: | John Stephen [aut, cre] |
Maintainer: | John Stephen <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.5 |
Built: | 2025-02-15 05:10:52 UTC |
Source: | https://github.com/nudacc/psharmonize |
Code modify function. To be called by the harmonization function.
code_modify_func( data = temp_dataset, instruction = code_instruct, old_var = source_item_long )
data | Dataframe to be modified |
instruction | Coding instruction from harmonization sheet |
old_var | Name of original variable |
Vector of length equal to old_var
# Allows the user to define a function with `instruction`
# The `old_var` in `data` will be used in place of `x` in `instruction`.
code_modify_func(data = cohort_a,
                 instruction = 'x + 5',
                 old_var = 'age')
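The substitution of `old_var` for `x` can be pictured with a small base R sketch. This is only an illustration of the idea, not the package's implementation; `apply_instruction` and `demo` are hypothetical names introduced here.

```r
# Hypothetical helper (not part of psHarmonize): evaluate an instruction
# string with `x` bound to one column of a data frame.
apply_instruction <- function(data, instruction, old_var) {
  expr <- parse(text = instruction)[[1]]
  # Only `x` and base R functions are visible to the instruction.
  eval(expr, envir = list(x = data[[old_var]]), enclos = baseenv())
}

demo <- data.frame(age = c(30, 41, 58))
new_age <- apply_instruction(demo, "x + 5", "age")
new_age
# 35 46 63
```

The result has one element per row of the input column, matching the "vector of length equal to old_var" return value described above.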
Code modify function (multiple variables). To be called by the harmonization function.
code_modify_func_multi( data = temp_dataset, instruction = code_instruct, old_var = source_item_long, user_args = source_item, sourcedataset = source_dataset, subdomain = subdomain, visit = visit, cohort = cohort )
data | Dataframe to be modified |
instruction | Coding instruction from harmonization sheet |
old_var | Name of original variable |
user_args | Character vector of input variables |
sourcedataset | Dataframe created so far |
subdomain | Category of variable |
visit | Visit number |
cohort | Cohort name |
Vector of length equal to old_var
# Example calculating BMI in cohort a for visit 1
code_modify_func_multi(data = cohort_a,
                       instruction = '(x1 / 2.205)/((x2 / 39.37)**2)',
                       old_var = 'weight_1; height_1',
                       user_args = c('weight_1','height_1'),
                       sourcedataset = 'cohort_a',
                       subdomain = 'clinical',
                       visit = 1,
                       cohort = 'cohort_a')
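As a rough illustration of how several input variables could map onto `x1`, `x2`, and so on, the base R sketch below binds each named column to a numbered placeholder and evaluates the instruction. It is an assumption about the general approach, not the package's code; `apply_instruction_multi` and `demo` are hypothetical. The sketch writes the exponent as `^`; R's parser treats `**` as the same operator.

```r
# Hypothetical helper (not part of psHarmonize): bind several columns to
# x1, x2, ... and evaluate an instruction string against them.
apply_instruction_multi <- function(data, instruction, vars) {
  inputs <- lapply(vars, function(v) data[[v]])
  names(inputs) <- paste0("x", seq_along(vars))
  eval(parse(text = instruction)[[1]], envir = inputs, enclos = baseenv())
}

# Weight in pounds and height in inches (made-up values).
demo <- data.frame(weight_1 = c(150, 200), height_1 = c(65, 70))

# Pounds to kilograms, inches to metres, then kg / m^2.
bmi <- apply_instruction_multi(
  demo,
  "(x1 / 2.205) / ((x2 / 39.37)^2)",
  c("weight_1", "height_1")
)
round(bmi, 1)
```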
Code modify recode. To be called by harmonization function.
code_modify_recode( data = temp_dataset, instruction = code_instruct, old_var = source_item_long, new_var = item, na_string = NULL )
data | Dataframe to be modified |
instruction | Coding instruction from harmonization sheet |
old_var | Name of original variable |
new_var | Name of new variable |
na_string | Character string of final recode value to be set to NA. |
Returns vector of new variable after recoding as needed.
test_data <- data.frame(val = c('a','b','c','d'))

code_modify_recode(data = test_data,
                   instruction = 'a = apple; c = carrot',
                   old_var = 'val',
                   new_var = 'new')
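The `old = new; old = new` instruction format can be read as a lookup table. The sketch below parses such a string in base R and applies it; `recode_sketch` is a hypothetical helper, and leaving unmatched values unchanged is an assumption made for this illustration rather than documented package behaviour.

```r
# Hypothetical helper (not part of psHarmonize): parse "old = new" pairs
# separated by ";" and apply them to a vector.
recode_sketch <- function(values, instruction) {
  rules <- trimws(strsplit(instruction, ";", fixed = TRUE)[[1]])
  pairs <- strsplit(rules, "=", fixed = TRUE)
  old <- trimws(vapply(pairs, `[`, character(1), 1))
  new <- trimws(vapply(pairs, `[`, character(1), 2))

  out <- as.character(values)
  hit <- match(out, old)
  # Assumption: values without a matching rule are left as they are.
  out[!is.na(hit)] <- new[hit[!is.na(hit)]]
  out
}

recoded <- recode_sketch(c("a", "b", "c", "d"), "a = apple; c = carrot")
recoded
# "apple" "b" "carrot" "d"
```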
Example dataset of cohort data
cohort_a
A data frame with 10,000 rows and 9 columns
Simulated data
Example dataset of cohort data
cohort_b
A data frame with 5,000 rows and 5 columns
Simulated data
Example dataset of cohort data
cohort_c
A data frame with 7,000 rows and 5 columns
Simulated data
Creates a dataframe "shell" of IDs, study/cohort name, and visit. Harmonized variables will be joined onto this dataset.
cohort_shell_func(sheet)
sheet | Harmonization sheet |
Data.frame with IDs, study/cohort name, and visit.
# Using example harmonization sheet
cohort_shell_func(harmonization_sheet_example)
Continuous or categorical
cont_or_cat(data, var)
data | Data frame |
var | Variable |
Returns "continuous" or "categorical"
# Function can help determine which kind of output is
# most appropriate
cont_or_cat(data = cohort_a, var = 'height_1')
cont_or_cat(data = cohort_a, var = 'education')
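The rule `cont_or_cat()` uses is not described here. One plausible heuristic, shown purely as an assumption, is to treat numeric columns as continuous and everything else as categorical; `guess_type` and `demo` are hypothetical names.

```r
# Hypothetical heuristic (an assumption, not the package's rule):
# numeric columns are "continuous", all others "categorical".
guess_type <- function(data, var) {
  if (is.numeric(data[[var]])) "continuous" else "categorical"
}

demo <- data.frame(
  height_1 = c(62.5, 70.1, 66.0),
  education = c("College", "High school", "College")
)

guess_type(demo, "height_1")
# "continuous"
guess_type(demo, "education")
# "categorical"
```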
This function will create an RMarkdown error log. It takes the harmonization object as the input, and will knit an RMarkdown html file to the path specified.
Note: The error log will only be able to detect "processing" errors, and not "content" errors. For example, if the user enters coding instructions that are nonsensical or incorrect, but are still able to be executed, this function will not be able to detect it.
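The difference between "processing" and "content" errors can be seen in a small base R sketch. It assumes instructions are evaluated inside `tryCatch()`, which is a guess at the general mechanism; `run_instruction` and the status strings are hypothetical.

```r
# Hypothetical helper (not part of psHarmonize): run an instruction and
# record whether it executed, without judging whether it makes sense.
run_instruction <- function(x, instruction) {
  tryCatch(
    list(
      value = eval(parse(text = instruction)[[1]], list(x = x), baseenv()),
      status = "Completed"
    ),
    error = function(e) {
      list(value = NULL, status = paste("Error:", conditionMessage(e)))
    }
  )
}

height_in <- c(60, 65, 70)

# Processing error: the instruction refers to an object that does not exist.
bad_name <- run_instruction(height_in, "x * undefined_factor")
bad_name$status

# Content error: multiplying inches by 100 is not a valid conversion to
# centimetres, but the code runs, so nothing is flagged.
wrong_math <- run_instruction(height_in, "x * 100")
wrong_math$status
# "Completed"
```

Only the first case would surface in a log built this way, which is why coding instructions still need a manual review.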
create_error_log_report( harmonization_object, path = "./", file = "error_log_report.html" )
harmonization_object | Harmonization object |
path | Path of output R Markdown report |
file | Filename of output R Markdown report |
Does not return an object, but instead knits html RMarkdown report to specified path and file name.
# Examples not run

# Creating harmonized object using harmonization sheet with errors.
# harmonized_obj <- harmonization(harmonization_sheet = error_harmonization_sheet_example)

# Knitting error log report
# create_error_log_report(harmonization_object = harmonized_obj,
#   path = './',
#   file = 'example_output.html')
This function is usually not called by the user. Instead, it is usually called by the harmonization() function.
create_long_dataset( vars_interest, subdomain, previous_dataset, error_log, na_string, verbose = TRUE )
vars_interest | Variable currently being harmonized |
subdomain | Category of variable |
previous_dataset | Dataframe created so far |
error_log | Error log |
na_string | Character string of final recode value to be set to NA. |
verbose | (TRUE/FALSE) Should the function print the current progress to the console? |
The function takes the harmonization sheet and an input dataframe, and creates a dataframe with the harmonized variable.
Returns a list with the harmonized long dataset, and error log.
# Example sheet
test_sheet <- harmonization_sheet_example[harmonization_sheet_example$study == 'Cohort A',]

# Example dataset
test_data <- cohort_a

# create error log
test_error_log <- test_sheet[,c('item','study','visit','possible_range')]
test_error_log$completed_status <- NA_character_
test_error_log$completed_reason <- NA_character_
test_error_log$range_set_to_na <- NA_integer_
test_error_log$range_out_of_range_warning <- NA

long_dataset <- create_long_dataset(vars_interest = test_sheet,
                                    subdomain = 'age',
                                    previous_dataset = test_data,
                                    error_log = test_error_log,
                                    na_string = 'NA',
                                    verbose = TRUE)
Summary report creation
create_summary_report( harmonization_object, path = "./", file = "summary_report.html", compare = FALSE )
harmonization_object | Harmonization object |
path | Path of output R Markdown report |
file | Filename of output R Markdown report |
compare | Creates summary report with comparison of raw values with modified values |
Does not return an object, but instead knits html RMarkdown report to specified path and file name.
# Examples not run

# Creating harmonized object
# harmonized_obj <- harmonization(harmonization_sheet = harmonization_sheet_example)

# Knitting summary report
# create_summary_report(harmonization_object = harmonized_obj,
#   path = './',
#   file = 'example_output.html')

# Use `compare` option to create comparison summary report.
# create_summary_report(harmonization_object = harmonized_obj,
#   path = './',
#   file = 'example_output.html',
#   compare = TRUE)
Example of harmonization sheet. This harmonization sheet has a few typos present (incorrect name of variable and/or dataset). This can be used to demonstrate how errors are presented in the error log and/or the summary method.
error_harmonization_sheet_example
A data frame with 16 rows and 12 columns
Created data
This is the main function in the psHarmonize package. Takes a harmonization sheet as input, and returns a harmonization object (list with S3 class of 'psHarmonize'). Requires source data.frames to be in the global environment.
harmonization( harmonization_sheet, long_dataset = TRUE, wide_dataset = TRUE, error_log = TRUE, source_variables = TRUE, na_string = "NA", verbose = TRUE )
harmonization_sheet | Harmonization sheet input. Set of coding instructions |
long_dataset | (TRUE/FALSE) Should the function return a long dataset? |
wide_dataset | (TRUE/FALSE) Should the function return a wide dataset? |
error_log | (TRUE/FALSE) Should the function return an error log? |
source_variables | (TRUE/FALSE) Should the output datasets contain the original non modified values? |
na_string | Character string of final recode value to be set to missing. Default is 'NA'. For example, if you use |
verbose | (TRUE/FALSE) Should the function print the current progress to the console? |
Note: psHarmonize evaluates and runs code entered in the harmonization sheet. Make sure to only use harmonization sheets from authors you trust.
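Because instructions are executed as R code, it can help to look at which functions an instruction calls before running a sheet from an unfamiliar source. The base R sketch below parses an instruction without evaluating it and lists the function names it uses. `functions_called` is a hypothetical helper; it is a review aid under these assumptions, not a sandbox or a feature of the package.

```r
# Hypothetical review aid (not part of psHarmonize): list the functions
# an instruction string would call. parse() does not execute the code.
functions_called <- function(instruction) {
  exprs <- parse(text = instruction)
  # all.names() returns every symbol; all.vars() only the variables.
  setdiff(all.names(exprs), all.vars(exprs))
}

functions_called("x + 5")
# "+"
functions_called("system('echo hi')")
# "system"
```

An instruction that calls something like `system` or `file.remove` is a signal to stop and read the sheet carefully before passing it to `harmonization()`.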
List of return objects with S3 class of 'psHarmonize'. Can be used as input for the report functions create_summary_report() and create_error_log_report().
# Running harmonization function with example harmonization sheet
harmonization_obj <- harmonization(harmonization_sheet = harmonization_sheet_example)

# Extracting harmonized long dataset (each row is a visit)
long_dataset <- harmonization_obj$long_dataset

# Extracting harmonized wide dataset (each row is a person)
# Visits are expressed in multiple columns
wide_dataset <- harmonization_obj$wide_dataset
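The long and wide layouts differ only in shape. The sketch below uses base R's `reshape()` on made-up data to show one row per visit becoming one row per person. The column names (`id`, `visit`, `age`) and the resulting `age.1`, `age.2` names are specific to this illustration; the package's own wide dataset may name its columns differently.

```r
# Made-up long data: one row per person per visit.
long <- data.frame(
  id = c(1, 1, 2, 2),
  visit = c(1, 2, 1, 2),
  age = c(40, 42, 55, 57)
)

# One row per person; each visit becomes its own column.
wide <- reshape(long, idvar = "id", timevar = "visit", direction = "wide")

nrow(long)
# 4
nrow(wide)
# 2
names(wide)
# "id" "age.1" "age.2"
```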
Example of a harmonization sheet. This serves as the input file for the harmonization function.
harmonization_sheet_example
A data frame with 16 rows and 12 columns
Created data
Range function. To be called by harmonization function.
range_function( data = temp_dataset, min_max_range = possible_range, new_var = item )
data | Data to be modified |
min_max_range | Range of allowed values |
new_var | New variable |
Returns a list with the new vector (values outside of range set to NA), and the number of values set to NA.
test_data <- data.frame(val = 1:10)

range_function(data = test_data, min_max_range = '[2,8]', new_var = 'val')
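The range check can be sketched in base R as follows. This assumes the `'[min,max]'` string denotes an inclusive range; `range_sketch` and the names of the returned list elements are hypothetical and may not match what `range_function()` returns.

```r
# Hypothetical helper (not part of psHarmonize): set values outside an
# inclusive "[min,max]" range to NA and count how many were changed.
range_sketch <- function(values, min_max_range) {
  cleaned <- gsub("\\[|\\]|\\s", "", min_max_range)
  bounds <- as.numeric(strsplit(cleaned, ",", fixed = TRUE)[[1]])

  out_of_range <- !is.na(values) & (values < bounds[1] | values > bounds[2])
  values[out_of_range] <- NA

  list(new_value = values, n_set_to_na = sum(out_of_range))
}

checked <- range_sketch(1:10, "[2,8]")
checked$new_value
# NA 2 3 4 5 6 7 8 NA NA
checked$n_set_to_na
# 3
```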
Possible values for categorical variables. To be called by harmonization function.
range_function_cat( data = temp_dataset, possible_vals_cat = possible_vals, new_var = item )
data | Data to be modified |
possible_vals_cat | Vector of possible values |
new_var | New variable |
Returns a list with the new vector (values outside of the possible values set to NA), and the number of values set to NA.
test_data <- data.frame(val = c('a','b','j','k','c','d'))

range_function_cat(data = test_data,
                   possible_vals_cat = c('a','b','c','d'),
                   new_var = 'val')
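The categorical check amounts to a membership test. A minimal base R sketch, with `allowed_sketch` and the list element names being hypothetical:

```r
# Hypothetical helper (not part of psHarmonize): set values that are not
# in the allowed set to NA and count how many were changed.
allowed_sketch <- function(values, possible_vals) {
  not_allowed <- !is.na(values) & !(values %in% possible_vals)
  values[not_allowed] <- NA
  list(new_value = values, n_set_to_na = sum(not_allowed))
}

checked_cat <- allowed_sketch(
  c("a", "b", "j", "k", "c", "d"),
  c("a", "b", "c", "d")
)
checked_cat$new_value
# "a" "b" NA NA "c" "d"
checked_cat$n_set_to_na
# 2
```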
Reorder factors
reorder_factors(data, sheet)
data | Harmonization object, or harmonized data.frame. |
sheet | Factor reorder sheet. |
Returns harmonization object, or harmonized data.frame.
# Running harmonization function with example harmonization sheet
harmonization_obj <- harmonization(harmonization_sheet = harmonization_sheet_example)

long_dataset <- harmonization_obj$long_dataset

table(long_dataset$education)

# College
# 5643
#
# Graduate/Professional
# 1287
#
# High school
# 7562
#
# No education/grade school
# 7508

# Creating factor reorder sheet
edu_order <- data.frame(
  variable = 'education',
  values = c('No education/grade school', 'High school', 'College', 'Graduate/Professional'),
  order = 1:4
)

# Reorder factors
harmonization_obj <- reorder_factors(data = harmonization_obj, sheet = edu_order)

long_dataset <- harmonization_obj$long_dataset

table(long_dataset$education)

# No education/grade school
# 7508
#
# High school
# 7562
#
# College
# 5643
#
# Graduate/Professional
# 1287
Reorder factors data.frame
reorder_factors_df(data, sheet)
data | Harmonized data.frame |
sheet | Factor reorder sheet |
Returns harmonized data.frame.
# Creating example dataframe of variables, the order, and the values
# The function will reorder the factor using these values in the order
# provided.

# This would typically be created in an excel or CSV file outside of R,
# and then imported into R.
test_sheet <- data.frame(
  variable = c(rep('Education',4),rep('Class',3)),
  order = c(1,2,3,4,1,2,3),
  values = c('None','Grade','HS','College','A','B','C')
)

# I'm creating some test data to demonstrate
set.seed(1234)
test_data <- data.frame(
  ID = 1:20,
  Education = sample(c('None','Grade','HS','College'), size = 20, replace = TRUE),
  Class = sample(c('A','B','C'), size = 20, replace = TRUE)
)

# Creating factors in the test data
test_data$Education <- factor(test_data$Education)
test_data$Class <- factor(test_data$Class)

table(test_data$Education, useNA = 'ifany')
table(test_data$Class, useNA = 'ifany')

# Now reordering factors based on the sheet
test_data_mod <- reorder_factors_df(data = test_data, sheet = test_sheet)

table(test_data_mod$Education, useNA = 'ifany')
table(test_data_mod$Class, useNA = 'ifany')
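For a single variable, the reordering can be reproduced with base R's `factor()` by sorting the sheet's `values` by its `order` column and using the result as the level order. This is a sketch of the idea only; the object names are made up and the package may implement it differently.

```r
# A one-variable reorder sheet, deliberately listed out of order.
sheet <- data.frame(
  variable = "Education",
  order = c(3, 1, 2, 4),
  values = c("HS", "None", "Grade", "College"),
  stringsAsFactors = FALSE
)

education <- factor(c("HS", "None", "College", "Grade", "HS"))
levels(education)
# Default levels are sorted alphabetically.

# Sort the sheet's values by `order` and use them as the new level order.
ordered_levels <- sheet$values[order(sheet$order)]
education <- factor(education, levels = ordered_levels)
levels(education)
# "None" "Grade" "HS" "College"
```

The underlying values are unchanged; only the order in which tables and plots present the levels differs.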
psHarmonize summary method
## S3 method for class 'psHarmonize'
summary(object, ..., verbose = FALSE)
object | psHarmonize object |
... | Can provide additional arguments |
verbose | T/F. When TRUE, will list variables for each section. |
Doesn't return an object. Prints the status of the harmonization (number of harmonized variables, etc.)
harmonization_obj <- harmonization(harmonization_sheet_example)

summary(harmonization_obj)

# Use verbose option to see more details
summary(harmonization_obj, verbose = TRUE)
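For readers unfamiliar with S3 dispatch, the sketch below defines a `summary()` method for a made-up class that prints a status line and returns nothing visible. The class name `demoHarmonize`, the object layout, and the status values are all hypothetical; only the `completed_status` column name is borrowed from the error log example earlier in this manual.

```r
# A made-up object with an error log, tagged with a hypothetical class.
obj <- structure(
  list(
    error_log = data.frame(
      item = c("age", "bmi", "education"),
      completed_status = c("Completed", "Completed", "Not completed"),
      stringsAsFactors = FALSE
    )
  ),
  class = "demoHarmonize"
)

# summary() dispatches to this method for objects of class demoHarmonize.
summary.demoHarmonize <- function(object, ..., verbose = FALSE) {
  status <- object$error_log$completed_status
  cat("Harmonized variables:", sum(status == "Completed"), "of", length(status), "\n")
  if (verbose) {
    cat(paste0("  ", object$error_log$item, ": ", status), sep = "\n")
  }
  invisible(NULL)
}

summary(obj)
summary(obj, verbose = TRUE)
```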