Welcome to The Carpentries Etherpad! This pad is synchronized as you type, so that everyone viewing this page sees the same text. This allows you to collaborate seamlessly on documents. Use of this service is restricted to members of The Carpentries community; this is not for general purpose use (for that, try https://etherpad.wikimedia.org). Users are expected to follow our code of conduct: https://docs.carpentries.org/topic_folders/policies/code-of-conduct.html All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/ ---------------------------------------------------------------------------- * COMPUTER SETUP * If possible, use a computer/laptop on which you have administrator rights. Otherwise, you might run into problems when installing the software. * Make sure you have the necessary software installed and up to date on your laptop. We will code in RStudio, which requires both R and RStudio installed. For installation, follow the instructions here: https://datacarpentry.org/r-socialsci/#setup-instructions . If you have those already installed, make sure you have R version 4.0 or newer (and, if necessary, update to the latest version). * If you have any trouble, send us an email by Monday (26 Feb) morning, so we can help you sort this out before the course starts on Tuesday. If need be we can set up a call to help you sort out any issues. * The lesson also requires the “BiocManager" package installed, and all necessary data downloaded and saved on your computer. You can copy and execute the code in R/RStudio (https://carpentries-incubator.github.io/high-dimensional-stats-r/setup.html) before the workshop. However, as we still work on ways to best deliver this workshop, we will also provide the setup code a day before the workshop starts in our collab document https://pad.carpentries.org/2024-02-27_ed-dash_high-dim-stats. Additionally, the first workshop hour will deal with the setup and possible issues stemming from it. * ZOOM METING DETAILS * We will be using Zoom for this workshop. If you do not have it already installed, please download Zoom from here: https://zoom.us/download. * Zoom meeting link: https://ed-ac-uk.zoom.us/j/85716381279 * Meeting ID: 857 1638 1279 * Time: 10:00 AM - 1:00 PM (UK TIME) * Passcode: in the email *R CODE NEEDED TO INSTALL ALL PACKAGES AND DOWNLOAD DATA # last minute change in the first download link used, instead of gh-pages (as in the website) we use main *install.packages("BiocManager") download.file( * "https://raw.githubusercontent.com/carpentries-incubator/high-dimensional-stats-r/main/dependencies.csv", * destfile = 'dependencies.csv') table <- read.table('dependencies.csv') BiocManager::install(table[[1]]) *dir.create("data", showWarnings = FALSE) data_files <- c( * "cancer_expression.rds", * "coefHorvath.rds", * "methylation.rds", * "scrnaseq.rds", * "prostate.rds") for (file in data_files) { * download.file( * url = file.path( * "https://raw.githubusercontent.com/carpentries-incubator/high-dimensional-stats-r/gh-pages/data", * file * ), * destfile = file.path("data", file) * )} # additonally, we will be using some elements of tidymodels that may not be in the original libraries list install.packages("tidymodels") *ALTERNATIVE INSALLATION METHOD 1. Create a new project in RStudio 2. Install renv with install.packages or the RStudio GUI 3. Download the DESCRIPTION file https://drive.google.com/file/d/1UVEiiHe4wVWXGuclyKEA6nCYN7dMbXp8/view?usp=drive_link and drop it the top level of your R project directory 4. Run renv::init(bioconductor = TRUE) 5. Select option 1 to install from the description file 6. run the code to get remaining data packages: *dir.create("data", showWarnings = FALSE)data_files <- c( * "cancer_expression.rds", * "coefHorvath.rds", * "methylation.rds", * "scrnaseq.rds", * "prostate.rds") for (file in data_files) { * download.file( * url = file.path( * "https://raw.githubusercontent.com/carpentries-incubator/high-dimensional-stats-r/gh-pages/data", * file * ), * destfile = file.path("data", file) * )} *WEBSITES WORKSHOP ORGANIZATIONAL WEBSITE: https://edcarp.github.io/2024-02-27_ed-dash_high-dim-stats/ THE LESSON: https://carpentries-incubator.github.io/high-dimensional-stats-r/ EDCARP PAD: https://pad.carpentries.org/2024-02-27_ed-dash_high-dim-stats *Day 4 (1st March) *ATTENDANCE: Andrzej Romaniuk (Host) Aleksandra Chybowska (helper) Alan O'Callaghan (helper) Karim Abu Nahia Nneka Nnadi Siva kumar Josie Robertson Manal Alharbi Juan Zeng Shikha Vashisht Maciej Migdał *LIVE CODE *#How to perform PCA #first, get the data prostate <- readRDS(here("data/prostate.rds")) *#the data rows represent unique patients head(prostate) *#creating a substed of data we are interested in pros2 <- prostate[, c("lcavol", "lweight", "lbph", "lcp", "lpsa")] head(pros2) #Are there are differences in variances betwen data? apply(pros2, 2, var) par(mfrow = c(1, 2)) hist(pros2$lweight, breaks = "FD") hist(pros2$lbph, breaks = "FD") #When perofming pcy, we state the data source (pros2), with scaling = TRUE due to values having differences in variances, centering on 0 *pca.pros <- prcomp(pros2, scale = TRUE, center = TRUE) pca.pros # how many axes we need? What percent of variability each one explains? *varExp <- (pca.pros$sdev^2) / sum(pca.pros$sdev^2) * 100 # calculate percentage variance explained using output from the PCA varDF <- data.frame(Dimensions = 1:length(varExp), varExp = varExp) *plot(varDF) #Cteating a biplot *stats::biplot(pca.pros, xlim = c(-0.3, 0.3)) # *library("PCAtools") *library("SummarizedExperiment") cancer <- readRDS(here::here("data/cancer_expression.rds")) mat <- assay(cancer) metadata <- colData(cancer) *View(mat) *View(metadata) #Whether columns match *all(colnames(mat) == rownames(metadata)) #CHALLENGE 3 pc <- pca(mat, metadata = metadata) #Many PCs explain a very small amount of the total variance in the data #Remove the lower 20% of PCs with lower variance pc <- pca(mat, metadata = metadata, removeVar = 0.2) #Explore other arguments provided in pca pc$rotated[1:5, 1:5] * #Exploring loadings pc$loadings[1:5, 1:5] *which.max(pc$loadings[, 1]) *pc$loadings[49, ] *pc$loadings[27, ] #Challenge 4 *screeplot(pc, axisLabSize = 5, titleLabSize = 8) *DAY 3 (29 Feb) *ATTENDANCE: Andrzej Romaniuk (Host) Aleksandra Chybowska (helper) Alan O'Callaghan (helper) Karim Abu Nahia Nneka Nnadi siva kumar Maciej Migdał Shikha Vashisht Manal Juan Zeng Josie Robertson *LIVE CODE # best to start from loading libraries we previously used *library("here") library("minfi") #Setup for the Regularised Regression *coef_horvath <- readRDS(here::here("data/coefHorvath.rds")) methylation <- readRDS(here::here("data/methylation.rds")) *library("SummarizedExperiment") age <- methylation$Age methyl_mat <- t(assay(methylation)) *coef_horvath <- coef_horvath[1:20, ] features <- coef_horvath$CpGmarker horvath_mat <- methyl_mat[, features] *#Generate an index to split the data set.seed(42)train_ind <- sample(nrow(methyl_mat), 25) #CHALLENGE 3 #first, we split data *train_mat <- horvath_mat[train_ind, ] train_age <- age[train_ind] test_mat <- horvath_mat[-train_ind, ] test_age <- age[-train_ind] *# as.data.frame() converts train_mat into a data.frame fit_horvath <- lm(train_age ~ ., data = as.data.frame(train_mat)) *mean(residuals(fit_horvath)^2) #We use glmnet to show how things change as we restrict/increase pentalty *library("glmnet") *## glmnet() performs scaling by default, supply un-scaled data: horvath_mat <- methyl_mat[, features] # select the first 20 sites as before train_mat <- horvath_mat[train_ind, ] # use the same individuals as selected before test_mat <- horvath_mat[-train_ind, ] * * *ridge_fit <- glmnet(x = train_mat, y = train_age, alpha = 0) plot(ridge_fit, xvar = "lambda") abline(h = 0, lty = "dashed") #Challenge 4 *#ridge regression works better on unseen data min_err_ridge *err_lm #ridge shows lesser spread *all <- c(pred_lm, test_age, pred_min_ridge) lims <- range(all)par(mfrow = 1:2) plot(test_age, pred_lm, * xlim = lims, ylim = lims, * pch = 19) abline(coef = 0:1, lty = "dashed") plot(test_age, pred_min_ridge, * xlim = lims, ylim = lims, * pch = 19) abline(coef = 0:1, lty = "dashed") #Challenge 5 *fit_lasso <- glmnet(x = methyl_mat, y = age, alpha = 1) #Challenge 6 *#fitting an elastic model is like fitting a lasso model elastic <- glmnet(methyl_mat[, -1], age, alpha = 0.5) plot(elastic) #model selection similar *elastic_cv <- cv.glmnet(methyl_mat[, -1], age, alpha = 0.5) plot(elastic_cv) * # coefficients similar with lasso, but more conservative for elastic *coefe <- coef(elastic_cv, elastic_cv$lambda.1se) sum(coefe[, 1] == 0) *sum(coefl[, 1] == 0) *plot( * coefl[, 1], coefe[, 1], * pch = 16, * xlab = "LASSO coefficients", * ylab = "Elastic net coefficients")abline(0:1, lty = "dashed") *DAY 2 (28 Feb) *ATTENDANCE: Andrzej Romaniuk (Host) Aleksandra Chybowska (helper) Alan O'Callaghan (helper) Karim Abu Nahia Siva Kumar Nneka Nnadi Maciej Migdał Manal Shikha Vashisht Juan Zeng Josie Robertson *LIVE CODE # tidy() helps extracting info on coef and assoc hypoth. details # remember to run code from previous day if you do not have an object lm_age_methyl1 in your envrionment library("broom") tidy(lm_age_methyl1) # we are interested in coefficients coef_age_methyl1 <- tidy(lm_age_methyl1)[2, ] coef_age_methyl1 # however, obtained p-values are qite high ## HOW DOES HYPOTHESIS TESTING FOR A SUCH A MoODEL WORK? # saving all the tidy data as a separate object table_age_methyl1 <- tidy(lm_age_methyl1) # creating t-statistic & estimating standard error using all.equal() tvals <- table_age_methyl1$estimate / table_age_methyl1$std.error all.equal(tvals, table_age_methyl1$statistic) # same, but for p-values pvals <- 2 * pt(abs(tvals), df = lm_age_methyl1$df, lower.tail = FALSE) all.equal(table_age_methyl1$p.value, pvals) ## SHARING INFORMATION ACROSS OUTCOME VARIABLES # we will use a package called limma library("limma") design_age <- model.matrix(~age) dim(design_age) head(design_age) # how data() function works, on cars dataset data(cars) head(cars) # fit regression model using speed as a predictor (from cars) mod1 <- lm(dist ~ speed, data=cars) # the model matrix contains two columns: intercept and speed head(model.matrix(mod1)) fit_age <- lmFit(methyl_mat, design = design_age) fit_age <- eBayes(fit_age) toptab_age <- topTable(fit_age, coef = 2, number = nrow(fit_age)) # order by effect size (absolute log-fold change) orderEffSize <- rev(order(abs(toptab_age$logFC))) head(toptab_age[orderEffSize, ]) plot(toptab_age$logFC, -log10(toptab_age$P.Value), xlab = "Effect size", ylab = bquote(-log[10](p-value)), pch = 19 ) #CHALLENGE 4 #First, we extract a matrix from methylation data design_smoke <- model.matrix(~methylation$smoker) #Fitting data fit_smoke <- lmFit(methyl_mat, design = design_smoke) fit_smoke <- eBayes(fit_smoke) toptab_smoke <- topTable(fit_smoke, coef = 2, number = nrow(fit_smoke)) #finally, we create a plot plot(toptab_smoke$logFC, -log10(toptab_smoke$P.Value), xlab = "Effect size", ylab = bquote(-log[10](p)), pch = 19 ) #THE PROBLEM OF MULTIPLE OUTCOMES *age_perm <- age[sample(ncol(methyl_mat), ncol(methyl_mat))]design_age_perm <- model.matrix(~age_perm) *fit_age_perm <- lmFit(methyl_mat, design = design_age_perm)fit_age_perm <- eBayes(fit_age_perm)toptab_age_perm <- topTable(fit_age_perm, coef = 2, number = nrow(fit_age_perm)) *plot(toptab_age_perm$logFC, -log10(toptab_age_perm$P.Value), * xlab = "Effect size", ylab = bquote(-log[10](p)), * pch = 19)abline(h = -log10(0.05), lty = "dashed", col = "red") #BONFERRONI CORRECTION #most conservative way of adjusting multiple tests to take into account false positvies and negatives *p_raw <- toptab_age$P.Valuep_fwer <- p.adjust(p_raw, method = "bonferroni") #we will usse ggplot for visualisation library("ggplot2") ggplot() + * aes(p_raw, p_fwer) + * geom_point() + * scale_x_log10() + scale_y_log10() + * geom_abline(slope = 1, linetype = "dashed") + * geom_hline(yintercept = 0.05, linetype = "dashed", col = "red") + * geom_vline(xintercept = 0.05, linetype = "dashed", col = "red") + * labs(x = "Raw p-value", y = "Bonferroni p-value") # CHALLENGE 6 *#adjusting p values p_fdr <- p.adjust(p_raw, method = "BH")ggplot() + * aes(p_raw, p_fdr) + * geom_point() + * scale_x_log10() + scale_y_log10() + * geom_abline(slope = 1, linetype = "dashed") + * geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") + * geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") + * labs(x = "Raw p-value", y = "Benjamini-Hochberg p-value") #plotting the results from previous p-adjusting *ggplot() + * aes(p_fdr, p_fwer) + * geom_point() + * scale_x_log10() + scale_y_log10() + * geom_abline(slope = 1, linetype = "dashed") + * geom_hline(yintercept = 0.05, linetype = "dashed", color = "red") + * geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") + * labs(x = "Benjamini-Hochberg p-value", y = "Bonferroni p-value") *DAY 1 (27 Feb) *ATTENDANCE: Andrzej Romaniuk (Host) Aleksandra Chybowska (helper) Alan O'Callaghan (helper) Nneka Nnadi Siva Kumar Josie Robertson Maciej Migdał Karim Abu Nahia Juan Zeng Shikha Vashisht Manal *LIVE CODE # CHALLENGE 2 # loading library "here" library("here") # reading a previously saved rds file Prostate <- readRDS(here("data/prostate.rds")) dim(Prostate) # print the number of rows and columns names(Prostate) # examine the variable names head(Prostate) # print the first 6 rows names(Prostate) # examine column names pairs(Prostate) # plot each pair of variables against each other #CHALLENGE 3 # use cor() to examine correlations cor(Prostate) # rounding helps to visualise the correlations round(cor(Prostate), 2) # create a simple (univariate) regression model exploring the impact of # highly correlated variables (here: gleason and pgg45 are 0.75) # lm() to create a model model1 <- lm(age ~ gleason, data = Prostate) model2 <- lm(age ~ pgg45, data = Prostate) # summary() to explore the results summary(model1) summary(model2) # create a mutivariate regression model using both variables (gleason + pgg45) model3 <- lm(age ~ gleason + pgg45, data = Prostate) # check the results (no actual impact if using both as covariates) summary(model3) #USING BIOCONDUCTOR # example package 'minfi', specifically for analysing # Illumina Infinium DNA methylation arrays library("minfi") # we can explore this package using browseVignettes() browseVignettes("minfi") # example setup to explore methylation dataset library("here") library("ComplexHeatmap") methylation <- readRDS(here("data_input/methylation.rds")) # Oryginal dataset is complex, and not realy a table, so to view it we will use colData() and head() head(colData(methylation)) # assay() function creates a matrix-like (table) object where rows represent # probes for genes and columns represent samples. methyl_mat <- t(assay(methylation)) # example usage # calculate correlations between cells in matrix cor_mat <- cor(methyl_mat) # print the top-left corner of the correlation matrix cor_mat[1:10, 1:10] #DNA METHYLATION DATA # starting from methylation data, see previous day # methylation is a GenomicRatioSet object, a specific type of object methylation dim(methylation) # 5000 fatures x 37 columns # assay() for extracting this data methyl_mat <- assay(methylation) # distribution of m-values hist(methyl_mat, breaks = "FD", xlab = "M-value") # exploring data, using knitr:kable knitr::kable(head(colData(methylation)), row.names = FALSE) # similar to head(colData(methylation)) , but can be better in some environments # In this episode, we will focus on the association between # age and methylation # extracting data age <- methylation$Age # example of using a Heatmap() function, to show how many possible relationships are there library("ComplexHeatmap") order <- order(age) age_ord <- age[order] methyl_mat_ord <- methyl_mat[, order] Heatmap(methyl_mat_ord, name = "M-value", cluster_columns = FALSE, show_row_names = FALSE, show_column_names = FALSE, row_title = "Feature", column_title = "Sample", top_annotation = columnAnnotation(age = age_ord)) #REGRESSION WITH MANY OUTCOMES # getting age data age <- methylation$Age # fitting model # methyl_mat[1, ] indicates that the 1st CpG will be used as outcome variable lm_age_methyl1 <- lm(methyl_mat[1, ] ~ age) lm_age_methyl1 # plotting the results with the fitting line plot(age, methyl_mat[1, ], xlab = "Age", ylab = "Methylation level", pch = 16) abline(lm_age_methyl1) *THOUGHTS ABOUT THE COURSE: WHAT YOU LIKED: I enjoy the teaching approach of the main instrctor, and the explanation is quite straightforward. Clear explanations, helpful to have pre-prepared data to use WHAT COULD BE IMPROVED we can spend more time on breakout room. This will give us more time to discuss We could spend less time on the breakout rooms, imo there is no improvement to the amount of discussion than what we would have on an open forum plus our group is quite small anyway. It also feels like the breakout rooms limit our chances to interact with the main instructor. I would also like to see a biger challange to solve at the end of each episode. it is better if we spend more time with the main instructor to discuss the challenges rather than using the breakout rooms Overall too much time spent on the first two topics Would have been good to focus more on the later sections & cover more of these