RStudio Server Tutorial - KENET HPC Cluster
Overview
RStudio Server is a web-based integrated development environment for R, providing a complete R programming workspace accessible through your browser. It combines a powerful code editor, console, graphics viewer, and package management tools into a single interface, making it ideal for statistical computing and data analysis on the KENET HPC cluster.
Use Cases: RStudio Server is particularly well-suited for statistical analysis and hypothesis testing, data visualization with ggplot2 and interactive plots, bioinformatics workflows with Bioconductor packages, reproducible research with RMarkdown documents, machine learning with tidymodels and caret, spatial data analysis with sf and terra packages, and collaborative research projects with version control integration.
Access: RStudio Server is available through the KENET Open OnDemand web portal at https://ondemand.vlab.ac.ke
Code Examples: All code examples for this tutorial are available in our GitHub repository at https://github.com/Materials-Modelling-Group/training-examples
Prerequisites
Before using RStudio Server, you should have an active KENET HPC cluster account with access to the Open OnDemand portal. Basic knowledge of R programming will be helpful, though beginners can also use this tutorial to get started. Your data files should be stored in your home directory (/home/username) or, for large files, in the scratch directory at /scratch/username/. Familiarity with statistical concepts and data analysis workflows will enhance your experience.
Launching RStudio Server
Step 1: Access Interactive Apps
Begin by logging into Open OnDemand at https://ondemand.vlab.ac.ke using your KENET credentials. Once logged in, click the Interactive Apps menu in the top navigation bar, then select RStudio Server from the dropdown list of available applications.
Step 2: Configure Job Parameters
The job submission form allows you to specify the computational resources needed for your RStudio session. The requirements vary depending on your analysis complexity and dataset size.
| Parameter | Description | Recommended Value |
|---|---|---|
| Partition | Queue for job execution | normal for CPU tasks, gpu for parallel computing |
| Walltime | Maximum runtime in hours | 4 hours for interactive analysis, up to 24 for long computations |
| CPU Cores | Number of processor cores | 4-8 cores for typical work, more for parallel operations |
| Memory | RAM allocation | 16 GB for small datasets, 32-64 GB for large data |
| Working Directory | Starting directory | /home/username or your project folder |
Tip: For genomics or large-scale statistical analyses, request more memory and cores. RStudio can take advantage of parallel processing for many operations when multiple cores are available.
Step 3: Submit and Wait
After configuring your job parameters, click the Launch button to submit your job to the cluster scheduler. The job will initially show a "Queued" status while waiting for resources to become available. Once resources are allocated (typically within 30-60 seconds), the status will change to "Running" and a Connect to RStudio Server button will appear. Click this button to open your RStudio session in a new browser tab.
Quick Start Guide
Understanding the RStudio Interface
When RStudio opens, you will see a workspace divided into four main panes. The Source pane in the upper left displays your R scripts and RMarkdown documents. The Console pane in the lower left is where R commands are executed and output appears. The Environment/History pane in the upper right shows your workspace variables and command history. The Files/Plots/Packages pane in the lower right provides file navigation, plot viewing, package management, and help documentation.
Your First R Script
To create a new R script, click File → New File → R Script or press Ctrl+Shift+N. The rstudio/examples/01_hello_cluster.R file in our GitHub repository provides a simple introduction. Type your R code in the script editor and execute it by placing your cursor on a line and pressing Ctrl+Enter to run that line, or select multiple lines and press Ctrl+Enter to run the selection.
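A first script along these lines might look like the following. This is an illustrative sketch, not the exact contents of the repository file; it simply confirms which R build is running and how many cores the session can see.

```r
# Minimal first script: report the R version and the cores
# visible to this session (sketch; see 01_hello_cluster.R in
# the repository for the full version)
message("Hello from the KENET HPC cluster!")

r_version <- R.version.string          # e.g. "R version 4.x.y ..."
n_cores   <- parallel::detectCores()   # cores visible to this session

cat("Running:", r_version, "\n")
cat("Cores visible to R:", n_cores, "\n")
```

Note that `detectCores` reports the cores visible to the process, which on a shared cluster may differ from the cores actually allocated to your job.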
Working with the Console
The R Console at the bottom left is an interactive environment where you can type commands directly and see immediate results. This is useful for quick calculations, testing code snippets, or exploring data interactively. The console maintains your R session state, so all variables and loaded packages persist throughout your session.
Installing R Packages
R packages extend R's functionality with additional functions and datasets. To install packages, use the Packages pane on the lower right and click Install, or use the console. The proper syntax is shown in the rstudio/examples/02_package_installation.R file in our GitHub repository. Packages are installed in your personal library, so they persist between sessions.
Important: Always install packages to your user library, not the system library. RStudio handles this automatically when you use the Packages pane or the install.packages function.
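A quick way to see where packages will land, and to check for a package before installing it, is sketched below (ggplot2 is used purely as an example package name):

```r
# The first writable entry in .libPaths() is normally your
# personal library under $HOME on a shared cluster.
lib_paths <- .libPaths()
print(lib_paths)

# Check whether a package is already available before installing
have_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (!have_ggplot2) {
  # install.packages() falls back to the personal library when
  # the system library is read-only, as on most clusters
  message("ggplot2 missing; run: install.packages(\"ggplot2\")")
}
```

This pattern avoids re-installing packages that are already present and makes scripts safer to share with collaborators.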
Common Tasks
Task 1: Loading and Exploring Data
Loading data in R is straightforward using built-in functions or packages like readr for improved performance. You can read CSV files, Excel spreadsheets, and many other formats. The rstudio/examples/03_loading_data.R example demonstrates reading data from various sources including your home directory and the faster scratch storage, along with basic data exploration techniques.
After loading data, explore its structure using functions like str, summary, and head. The tidyverse packages provide powerful tools for data manipulation and visualization that integrate well with the RStudio workflow.
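A self-contained sketch of this load-then-explore pattern is shown below; it writes a tiny CSV to a temporary directory so the example runs anywhere, but in practice you would point the path at your own data file.

```r
# Write a small CSV so the example is self-contained, then read
# it back and explore it (file name and columns are hypothetical)
csv_path <- file.path(tempdir(), "measurements.csv")
write.csv(data.frame(group = c("a", "b", "a", "b"),
                     value = c(3.2, 4.1, 3.5, 4.4)),
          csv_path, row.names = FALSE)

dat <- read.csv(csv_path)   # readr::read_csv() is a faster drop-in

str(dat)             # structure: column names, types, first values
print(summary(dat))  # per-column summary statistics
print(head(dat, 2))  # first rows of the data frame
```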
Task 2: Data Visualization with ggplot2
The ggplot2 package is the standard for creating publication-quality graphics in R. It uses a grammar of graphics approach that makes it easy to create complex, multi-layered plots. See the rstudio/examples/04_ggplot2_visualization.R example for creating various plot types including scatter plots, line graphs, histograms, and box plots.
Plots appear in the Plots pane in the lower right corner of RStudio. You can zoom, export to various formats, and navigate through your plot history using the arrows in the Plots pane toolbar.
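A minimal ggplot2 scatter plot, using the built-in mtcars dataset, looks like the sketch below. The example is guarded with `requireNamespace` so the script still runs on a session where ggplot2 is not yet installed.

```r
# Scatter plot with ggplot2, guarded for portability
have_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (have_ggplot2) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
    geom_point(size = 2) +
    labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
         colour = "Cylinders") +
    theme_minimal()
  print(p)   # explicit print() sends the plot to the Plots pane
}
```

The explicit `print(p)` matters inside scripts, functions, and loops, where ggplot objects are not auto-printed; this is also a common cause of the "plots not appearing" problem discussed in the troubleshooting section.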
Task 3: Statistical Analysis
R excels at statistical analysis with comprehensive built-in functions and specialized packages. The rstudio/examples/05_statistical_analysis.R example demonstrates common statistical tests including t-tests for comparing means, ANOVA for multiple group comparisons, linear regression for modeling relationships, and correlation analysis.
Results are displayed in the console and can be extracted for further use or reporting. The broom package helps convert statistical output into tidy data frames for easier manipulation.
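The base-R sketch below runs a t-test, a linear regression, and a correlation on the built-in mtcars dataset; `broom::tidy(fit)` would convert any of these results into a tidy data frame.

```r
# Common statistical tests with base R (mtcars is built in)

# t-test: do 4- and 6-cylinder cars differ in fuel economy?
tt <- t.test(mpg ~ cyl, data = subset(mtcars, cyl %in% c(4, 6)))
print(tt$p.value)

# Linear regression: mpg as a function of weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))         # intercept and slope estimates

# Correlation between weight and fuel economy
r <- cor(mtcars$wt, mtcars$mpg)
cat("cor(wt, mpg) =", round(r, 3), "\n")
```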
Task 4: Creating RMarkdown Reports
RMarkdown combines R code, output, and narrative text into dynamic documents that update automatically when data or code changes. This is ideal for reproducible research and sharing analyses with collaborators. Create a new RMarkdown document with File → New File → R Markdown.
The rstudio/examples/06_rmarkdown_template.Rmd file provides a template showing how to combine text, code chunks, and output. Click the Knit button to render your document to HTML, PDF, or Word format. The document includes your code, results, and any plots you generate.
Task 5: Parallel Computing
When working with large datasets or computationally intensive tasks, you can leverage multiple CPU cores for parallel processing. R provides several packages for parallelization including parallel, foreach, and future. The rstudio/examples/07_parallel_computing.R example shows how to detect available cores and parallelize common operations like bootstrapping, cross-validation, and simulations.
When you request multiple cores in your Open OnDemand job submission, those cores become available for parallel processing within your RStudio session.
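A portable sketch using the base `parallel` package is shown below; it bootstraps the mean of a built-in vector across a small cluster of worker processes. `parLapply` with `makeCluster` works on all platforms, unlike `mclapply`, which relies on forking and is unavailable on Windows.

```r
library(parallel)

n_cores <- detectCores()                     # cores visible to R
workers <- max(1L, min(4L, n_cores - 1L))    # leave one core free

cl <- makeCluster(workers)          # spawn worker processes
clusterSetRNGStream(cl, 42)         # reproducible parallel RNG

# Bootstrap the mean of mpg across the workers
boot_means <- parLapply(cl, 1:100, function(i) {
  mean(sample(mtcars$mpg, replace = TRUE))
})
stopCluster(cl)                     # always release the workers

boot_means <- unlist(boot_means)
cat("bootstrap SE of the mean:", round(sd(boot_means), 3), "\n")
```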
Tips & Best Practices
Performance Optimization
When working with large datasets, use data.table or the tidyverse for efficient data manipulation. Request adequate memory when launching RStudio to avoid running out of RAM during analysis. Store intermediate results to disk using saveRDS and readRDS rather than keeping everything in memory. Use the pryr package to monitor memory usage of objects in your workspace.
For very large datasets that do not fit in memory, consider using database connections with DBI and dbplyr, or chunked processing with the chunked package. The scratch directory at /scratch/username/ provides faster I/O than your home directory for large file operations.
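The save-to-disk pattern mentioned above can be sketched as follows; the example writes to a temporary directory, but on the cluster you would typically use your scratch directory for better I/O.

```r
# Save an intermediate result to disk and free the in-memory copy
intermediate <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

rds_path <- file.path(tempdir(), "intermediate.rds")
saveRDS(intermediate, rds_path)   # compact, single-object binary file

rm(intermediate)   # drop the in-memory copy
invisible(gc())    # return freed memory where possible

# Reload later, exactly as saved
restored <- readRDS(rds_path)
print(restored)
```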
Project Organization
Use RStudio Projects to organize your work. Create a new project with File → New Project which sets up a dedicated workspace with its own working directory and settings. Keep your data, scripts, and outputs organized in subdirectories. The rstudio/workflows/01_research_project/ directory in our GitHub repository demonstrates a well-organized project structure.
Use descriptive names for variables and functions. Comment your code liberally to explain what each section does. This helps both collaborators and your future self understand the analysis.
Reproducibility
Use the renv package to manage project-specific package dependencies, ensuring your analysis works consistently across different systems and over time. Document your R version and package versions at the beginning of scripts or RMarkdown documents using sessionInfo. Save your workspace and history periodically, though for true reproducibility, your scripts should be self-contained and able to recreate your analysis from raw data.
Store random seeds with set.seed before any operation involving randomness to ensure reproducible results in simulations, resampling, or machine learning workflows.
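Seeding can be verified in two lines: setting the same seed before each draw makes the random numbers identical.

```r
# Identical seeds give identical random draws
set.seed(2024)
draw_a <- rnorm(5)

set.seed(2024)
draw_b <- rnorm(5)

identical(draw_a, draw_b)   # TRUE: the simulation is reproducible
```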
Session Management
Your RStudio session persists for the duration of your requested walltime. The workspace including all variables, loaded packages, and environment state remains active even if you close the browser tab. You can reconnect by clicking the Connect to RStudio Server button again.
Before your session ends, save important objects with saveRDS or save, and save your scripts. Consider using File → Save Workspace to preserve your entire environment, though relying on scripts rather than saved workspaces is better for reproducibility.
Example Workflows
Example 1: Exploratory Data Analysis
Objective: Load a dataset, perform initial exploration, create visualizations, and generate summary statistics.
Begin by launching RStudio Server with 8 cores and 16 GB memory for a typical dataset. Follow the complete workflow in rstudio/workflows/02_exploratory_analysis.R from our GitHub repository.
The workflow demonstrates loading data using readr, checking for missing values and data types, creating summary statistics with dplyr, generating exploratory plots with ggplot2 including distributions, correlations, and relationships between variables, identifying outliers and unusual patterns, and documenting findings with inline comments.
This type of exploratory analysis is typically the first step in any data analysis project, helping you understand your data before proceeding to more complex modeling.
Example 2: Statistical Modeling and Reporting
Objective: Perform regression analysis and create an automated report with RMarkdown.
Launch RStudio and create a new RMarkdown document. Follow the template and workflow shown in rstudio/workflows/03_statistical_modeling.Rmd which demonstrates loading and preparing data, fitting multiple regression models, comparing model performance with metrics like R-squared and AIC, performing diagnostic checks including residual plots, creating publication-quality tables with knitr and kableExtra, and generating a complete HTML or PDF report.
The RMarkdown document combines narrative explanation, code execution, statistical output, and visualizations into a single reproducible report that can be shared with collaborators or included in publications.
Example 3: Bioinformatics Analysis
Objective: Analyze genomic data using Bioconductor packages.
Launch RStudio on the GPU partition if performing computationally intensive operations. The rstudio/workflows/04_bioinformatics_analysis.R workflow demonstrates installing and loading Bioconductor packages, reading genomic data formats like FASTQ and BAM files, performing differential expression analysis, creating heatmaps and volcano plots for visualization, and exporting results for further analysis.
Bioconductor provides a comprehensive ecosystem of packages for genomic data analysis. The workflow shows how to integrate multiple packages to create a complete analysis pipeline.
Example 4: Machine Learning with Tidymodels
Objective: Build and evaluate machine learning models using the tidymodels framework.
The rstudio/workflows/05_machine_learning.R example demonstrates splitting data into training and testing sets, preprocessing data with recipes including normalization and feature engineering, specifying and fitting multiple model types such as random forest, gradient boosting, and elastic net, tuning hyperparameters with cross-validation, comparing model performance, and making predictions on new data.
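The first step of that workflow, splitting data into training and testing sets, can be sketched as below. The example uses `rsample` (the tidymodels splitting package) when it is available and falls back to a plain base-R split otherwise, so it runs on any session.

```r
# 80/20 train/test split, with a base-R fallback
have_rsample <- requireNamespace("rsample", quietly = TRUE)

if (have_rsample) {
  set.seed(123)
  split    <- rsample::initial_split(mtcars, prop = 0.8)
  train_df <- rsample::training(split)
  test_df  <- rsample::testing(split)
} else {
  # Fallback: sample row indices for the training set
  set.seed(123)
  idx      <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
  train_df <- mtcars[idx, ]
  test_df  <- mtcars[-idx, ]
}

cat("training rows:", nrow(train_df),
    " testing rows:", nrow(test_df), "\n")
```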
The tidymodels framework provides a unified interface for machine learning in R, making it easier to try different models and compare their performance systematically.
Troubleshooting
Problem: Session won't start or stays in "Queued" state
This typically occurs when no computational resources are available or the queue is full. Try reducing the number of requested cores or memory in your job parameters. Switch to a different partition such as debug for quick testing with limited resources. Check the cluster status on the dashboard to see current load. If the problem persists after trying these solutions, contact KENET support for assistance.
Problem: R session crashes or becomes unresponsive
An unresponsive R session usually indicates an out-of-memory condition or an infinite loop in your code. Check memory usage in the Environment pane to see how much memory your objects are consuming. Use rm to remove large objects you no longer need, and call gc to run garbage collection. If the session is completely frozen, you may need to terminate it from the Open OnDemand session card and start a new one with more memory allocated.
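From the console, you can inspect an object's footprint and reclaim its memory along these lines:

```r
# Inspect and reclaim memory from within a running session
big_matrix <- matrix(rnorm(1e6), nrow = 1000)   # ~8 MB of doubles

size_before <- object.size(big_matrix)          # per-object footprint
print(format(size_before, units = "MB"))

rm(big_matrix)      # remove the object from the workspace
mem <- gc()         # garbage-collect; returns a usage summary
print(mem)
```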
Problem: Package installation fails
Package installation failures can occur due to missing system dependencies or compilation errors. Check the error message carefully for clues. Some packages require system libraries that may not be available on the cluster. Contact KENET support if you need specific system libraries installed. For packages that require compilation, ensure you have the necessary development tools loaded. Installing binary packages is generally faster and avoids compilation issues when available.
Problem: Cannot read or write files
File access problems usually indicate permission issues or incorrect paths. Check the current working directory with getwd and change it with setwd if needed. Verify file paths are correct using file.exists. Check file permissions with file.info. Remember that you cannot access files in other users' home directories without explicit permission. For large files, use the scratch directory at /scratch/username/ which provides better performance.
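A step-by-step diagnostic using those functions might look like this; the example writes a probe file to a temporary directory so it is safe to run anywhere.

```r
# Diagnose file-access problems step by step
cat("working directory:", getwd(), "\n")

probe <- file.path(tempdir(), "probe.txt")   # a path we can write to
writeLines("hello", probe)

cat("exists?", file.exists(probe), "\n")     # TRUE once written

info <- file.info(probe)                     # size, mode, timestamps
print(info[, c("size", "mode")])

# A missing file simply reports FALSE rather than raising an error
cat("missing file exists?",
    file.exists(file.path(tempdir(), "no-such-file.txt")), "\n")
```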
Problem: Plots not appearing
If plots are not showing in the Plots pane, check that you are using the correct graphics device. The default device should work, but you can explicitly call dev.new if needed. If using ggplot2, ensure you are printing the plot object either implicitly by having it as the last line of a code chunk or explicitly with print. For RMarkdown documents, make sure code chunks have the appropriate chunk options set.
Problem: RMarkdown won't knit
RMarkdown knitting failures are often due to errors in code chunks or missing dependencies. Check that all required packages are installed and loaded. Ensure your code runs without errors in the console before knitting. Check the R Markdown pane for specific error messages. If knitting to PDF, ensure LaTeX is available on the system. HTML output is generally more reliable and does not require additional dependencies.
Additional Resources
The official RStudio documentation is available at https://docs.posit.co/ide/user/ and provides comprehensive information about all RStudio features. For learning R, the R for Data Science book at https://r4ds.had.co.nz/ is an excellent free resource covering data manipulation, visualization, and modeling. KENET HPC usage guidelines are documented at https://training.kenet.or.ke/index.php/HPC_Usage.
For statistical analysis, the Quick-R website at https://www.statmethods.net/ provides practical examples of common statistical tasks. The Bioconductor project at https://www.bioconductor.org/ offers extensive documentation for genomic data analysis. For machine learning, the tidymodels website at https://www.tidymodels.org/ provides tutorials and reference documentation.
Code Examples Repository: All code examples referenced in this tutorial are available at https://github.com/Materials-Modelling-Group/training-examples
For support, contact KENET at support@kenet.or.ke, consult the documentation at https://training.kenet.or.ke/index.php/HPC_Usage, or access the Open OnDemand portal at https://ondemand.vlab.ac.ke.
Feedback
If you encounter issues or have suggestions for improving this tutorial, please contact KENET support or submit feedback through the Open OnDemand interface using the feedback button.