Introduction

The Mother of all Spreadsheets (MOAS) data is a large data.frame/spreadsheet combining all information from participants in most of LCBC’s studies. The spreadsheet is over 2000 columns wide, and 4000 rows long, and growing constantly. Navigating this file is tedious at times, and requires a lot of programming skills to handle.

The MOAS R-package is created witht he intent to share common operations/functions people may use on the MOAS, in order to get the data into the shape they need. This vignette is intended to showcase som of the functions currently available in the package that people may wish to use.

How to use the package

Basic usage

After the package is installed, the package needs to be called in every script or R instance to make it’s functions available.

library(MOAS)

The package functions mostly requires the actual MOAS data to work, or data created by subsetting the MOAS. The MOAS may be found in the MOAS data-folder in the Lagringshotell: LCBC/Projects/Cross_projects/MOAS/Data/MOAS.RData

Each project contained in the MOAS also has it’s own MOAS-derived files at the top level of their data folder, in this type of format: LCBC/Projects/11_MemP/Data/11_MemP_Imaging_Long.RData These types of files are compatible with all functions in the package.

Loading the MOAS or MOAS-derived files

To work with the file, you must first get it into R’s memory. For loading .RData files, which is an R-native file type, the simplest way is to use the load() function. Remember to use the path to the file from your current working directory. Because .RData files are saved with the object name they had when saved, .RData can have very different names than expected when loading them. To handle this, we assign the loading of the file to an object nm, which will capture the name of the object. We will then use the get() function to grab the data with name nm and assign it to the object data, so we have the data in an object with the name of our chosing.

You now have MOAS-data in your environment stored in the object data.

Function documentation

Like all R-functions, the MOAS package functions have documentation that may be accessed by typing a question mark in the console, and the function you want to know more about.

The launch_LCBCshiny() function

This function will launch the LCBC shiny explorer in your default browser. It works “out-of-the-box” and will immediately launch the app in your browser without any extra inputs. You can then use the file browser in the application to upload the file you want to work on.

If you have already loaded in the MOAS or a MOAS generated file in your R instance, you can supply it directly through R. We already loaded in the MOAS data into the object data in the previous section. We provide this information to the shiny launcher, and the data will automatically be loaded into the application.

The widen() function

This function will create a wider data.frame of the data you have, given some specification. This is often necessary if you want to use the data in SPSS, or for running ANOVAs etc. To view the manual page of this package, simply type:

widen() has two mandatory inputs, the data to widen, and the column name to widen by.

There are four options to widen the data by:
1. “Subject_Timepoint”
1. “Project_Wave”
1. “Site_Name”
1. “Site_Number”

To widen data, simply type:

widened_data = widen(data, by="Site_Name")

Using by="Subject_Timepoint" or by="Project_Wave" the data must have only one line per participant and timepoint. For double and triple scanned, this is not the case. In order to widen by these two options, you need to either run widen() with by="Site_Name" or by="Site_Number" first, run the filter_site() function, or provide widen() with the keep argument which will feed it to filter_site for you.

widened_data1 = widen(data, by="Site_Name")
widened_data2 = widen(widened_data1, by="Project_Wave")

widened_data1 = widen(data, by="Site_Name", keep="ousAvanto")

The filter_site() function

The filter_site() function is intended for use when you wish to reduce double and triple scanned data to a single line from one of the scanners. This is necessary for all analyses not using the site as a variable of interest. To run the function, you must choose which scanner you wish to keep data from.

There are four options:
1. “long” - finds which scanner there is most data from, and keeps those (default) 2. “ousAvanto” - keeps Avanto data
3. “ousSkyra” - keeps Skyra data
4. “ourPrisma” - keeps Prisma data

If you choose the ‘long’ option, you also may specify the tie option, in case there is a tie for how many scans of each a participant has. By default the tie is set to “interval” where is will look for the scanner that has been used for the longest period of time. Again, here you may specify any one scanner and that will be picked.

Lastly, if there is a tie for scanner also when taking into account the interval, site_order takes a string vector specifying the order of priority for scanners. By default it is c(“ousPrisma”, “ousSkyra”, “ousAvanto”), meaninggiven a tie still, it will firstly pick a Prisma scan if its there, if no Prisma it will choose Skyra, and lastly Avanto.

Note: the operation only affects double and triple scan timepoints. All participants retain the same amount of time points, but number of rows per double/triple scans is reduced to one.

simple_data = filter_site(data, "long", tie = "interval", site_order = c("ousPrisma", "ousSkyra", "ousAvanto"))
simple_data = filter_site(data, "ousAvanto")
simple_data = filter_site(data, "ourPrisma")

The fs_lmm() function

This function will return a data.frame containing formatted data for use in Freesurfers LME models. There are specific requirements to how the data should look, and also there are operations done on the data which help our data engineer to set up analyses.

The function will run even without much input, but the output will not necessarily make muc sense. You must supply the function with the data you want to use. Secondly, you must decide which numeric covariates and categorical (grouping) variables you want your model include.

grouping.var - a vector of strings with the names of the columns for your categorical groups. Usually, “Sex” and “Site_Name” are added here, like so: grouping.var = c("Sex","Site_Name").numeric.var- a vector of strings with the names of the columns for you numeric covariates. Here we often add cognitive scores like:numeric.var = "CVLT_A_Total`.

Example:

fs_lmm(data, grouping.var = c("Sex","Site_Name"), numeric.var="CVLT_A_Total")

There are several other options to help you do different things. the keep option is passed to the filter_site() function to handle double and triple scanned data. By default is uses the value “long”, which will choose the scanner from a subject with the most data (in case of a tie, it picks the Skyra).

Please see the documentation of this function for more options:

Pipe ( %>% ) compatibility

All MOAS functions are pipe compatible, and should thus easily be incorporated into tidyverse-type syntax.

simple_data = data %>% 
  filter(Project_Name %in% c("MemC","MemP")) %>% 
  na_col_rm() %>% # see section on Utility functions
  filter_site("ousPrisma") %>% 
  widen("Project_Wave")

Built in data

There are certain variables easily accessible after loading the MOAS package, to help you navigate the data.

##   Site_Number    Site_Name Site_Tesla
## 1          11    ousAvanto        1.5
## 2          12     ousSkyra        3.0
## 3          13    ousPrisma        3.0
## 4          20   ntnuAvanto        1.5
## 5          21 curatoAvanto        1.5
##   Project_Number     Project_Name Project_Experimental
## 1             10             NDev                 <NA>
## 2             11             MemP                 <NA>
## 3             12              NCP      Memory training
## 4             13             MoBa                 <NA>
## 5             14             Loci      Memory training
## 6             15             MemC                 <NA>
## 7             16             ACon                 <NA>
## 8             17              S2C Memory training (VR)
## 9             90 Novel_biomarkers                 <NA>

The most useful is likely the variables data.frame is the most useful, which gives you a table of all the non-imaging variables in the MOAS, with some information about their content and types. The Values column in this data.frame denotes the allowed values in the column. For numeric data it denotes the min and max values, while for factors it should provide the factor levels and labels.

##                MOAS                             Label     Type   Class
## 1   CrossProject_ID                Subject identifier Constant integer
## 2        Birth_Date                     Date of Birth Constant    Date
## 3               Sex                    Biological sex Constant  factor
## 4 Subject_Timepoint Sequential participant time point Variable integer
## 5      Project_Wave             Project – Wave number Variable integer
##              Values
## 1 1000000 ; 9999999
## 2              <NA>
## 3 1=Female ; 2=Male
## 4            1 ; 10
## 5            1 ; 10

Utility functions

Some functions in the MOAS package are what we call utility functions. They do not exclusively work on MOAS data, they will work well on other data.frames or vectors. These funtions perform operations to help clean up data or perform simple and useful operations.

The na_col_rm() function

This function locates any column in a data.frame that has no observations (only contains NA or NaN), and removes it from the data.frame. This particularly convenient if you have subsetted the rows of data to specific observations, and want to quickly remove any columns that no longer are of consequence for the data you have.

data2 = data %>% filter(Project_Name %in% "NCP") %>% na_col_rm()