A variety of widely used tools for bioinformatics analysis are available on SeaWulf.
Some tools have their own module that can be loaded (e.g., blast+/2.10.0). However, because many bioinformatics tools are intended to be used together, we have created several Anaconda environments, each providing multiple tools for a different type of analysis. For convenience, we have also created modules that activate each of these environments.
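For example, to use a tool that has its own module, load the module and then call the program directly (the version shown here is illustrative; run module avail to see what is currently installed):
module load blast+/2.10.0
blastn -version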
The following table lists and describes each module:
| Module name | Module description | Example software |
|---|---|---|
| bioconvert/1.0.0 | Format conversion tool for life science data | bioconvert, biopython, sambamba |
| diffexp/1.0 | RNA-Seq & differential gene expression | Salmon, DESeq2, Trinity |
| genome_assembly/1.0 | Genome assembly & evaluation | Flye, SPAdes, BUSCO |
| genome_annotation/1.0 | Functional annotation of genome assemblies | RepeatMasker, Augustus, Maker |
| GWAS/1.0 | Genome-wide association software | Plink, GEMMA |
| hts/1.0 | Standard high-throughput sequencing software | Samtools, bwa, bowtie2 |
| metagenomics/1.0 | Metagenomic classification, assembly, and analysis | Kraken2, Megahit |
| popgen/1.0 | Variant calling & population genomics | GATK, Plink, admixture |
| phylo/1.0 | Multiple sequence alignment & phylogenetic inference | MAFFT, IQ-TREE2, raxml-ng |
| singlecell/1.0 | Single-cell sequencing analysis | Seurat, scanpy, scvi-tools |
| structural-variant/1.0 | Structural variant calling and genotyping | TIDDIT, lumpy, manta |
Once one of the above modules is loaded, the executables for the installed programs will be available on your PATH.
For example, the following is how to access the samtools program in the hts/1.0 module:
module load hts/1.0
samtools --help
To see a full list of software installed under each conda module, do the following after loading one of the modules:
conda list
For further instructions on how to run individual programs, please consult the relevant online documentation.
When running a program, you will need to execute it as part of a SLURM batch script. An example SLURM script for running a Trinity RNA-seq assembly can be found at:
/gpfs/projects/samples/Trinity
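As a rough sketch, a batch script for a job that uses one of the conda modules might look like the following (the partition name, resource requests, and file names are illustrative placeholders, not SeaWulf-specific requirements):

```
#!/bin/bash
#SBATCH --job-name=samtools_sort
#SBATCH --partition=short-40core   # illustrative partition; choose one appropriate for your job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --time=02:00:00
#SBATCH --output=samtools_sort.%j.out

# Activate the conda environment that provides samtools.
module load hts/1.0

# Illustrative step: coordinate-sort an alignment using 8 threads.
samtools sort -@ 8 -o sample.sorted.bam sample.bam
```

Submit the script with sbatch, e.g., sbatch sort_job.slurm.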
If there are missing or outdated packages in any of the above modules, or if you have suggestions for a new conda bioinformatics module, please submit a ticket.
Bioinformatics Workflow Managers
Users may find it convenient (or even necessary) to use a workflow management system to handle analyses involving anything more than a small number of inputs (e.g., multiple samples to be processed in parallel). SeaWulf currently has two bioinformatics workflow systems available: Snakemake and Nextflow.
Snakemake
Snakemake is a Python-based workflow manager with flexible scripting options. Typically, users create an input file called a "Snakefile" that defines the input files the bioinformatics pipeline will run on, along with one or more "rules" that define the steps the workflow will take. Snakemake also comes with a large number of wrappers that simplify management of standard bioinformatics software and make it easy to call these programs in your Snakemake rules.
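As a minimal sketch (the sample names, directory layout, and samtools step are illustrative, not part of any particular SeaWulf pipeline), a Snakefile might look like this:

```
# Example Snakefile: coordinate-sort a BAM file for each sample.
SAMPLES = ["sampleA", "sampleB"]

# The "all" rule lists the final targets of the workflow.
rule all:
    input:
        expand("sorted/{sample}.sorted.bam", sample=SAMPLES)

# One rule per pipeline step; wildcards connect inputs to outputs.
rule sort_bam:
    input:
        "raw/{sample}.bam"
    output:
        "sorted/{sample}.sorted.bam"
    shell:
        "samtools sort -o {output} {input}"
```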
Once the pipeline has been defined in the Snakefile, it can be run with the following:
snakemake -s Snakefile <... additional flags ...>
Please see the Snakemake documentation for more information on setting up and executing the pipeline.
A recent version of Snakemake is installed in each of the above bioinformatics conda environments, so no additional modules need to be loaded to use Snakemake.
Nextflow
Nextflow is a workflow manager written in a language called Groovy. It requires a relatively recent version of Java, so please load the following modules to access Nextflow:
module load openjdk/latest
module load nextflow/latest
These modules will be updated periodically to ensure that the Java and Nextflow versions available are current.
While you are welcome to build your own Nextflow pipelines, you may wish to instead use one of the many community-curated pipelines available via the nf-core project. nf-core provides reproducible, best-practice pipelines with detailed reporting for a variety of typical bioinformatics use cases. To see a list of available nf-core pipelines, please visit the nf-core website.
Nextflow and nf-core handle software dependencies using Conda, Docker (not available on SeaWulf), or Singularity. We recommend Singularity, as it is available without loading any modules and greatly simplifies dependency installation. When using Singularity for software management, nf-core will download several container images, which can consume a large amount of storage. To avoid running out of space in your home directory, we recommend setting the following environment variables before running nf-core pipelines:
export SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
export NXF_SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
This will force Singularity to save container images in your scratch directory, where you should have plenty of space.
The SeaWulf HPC team has created a custom nf-core configuration file for SeaWulf that allows Nextflow to automatically submit jobs to the 40-core and 96-core partitions and uses Singularity for software management. Because of this, nf-core pipeline jobs can be launched on the login node, and Nextflow will handle submitting jobs for each step in the analysis. However, for this to work successfully, the Nextflow process must keep running for the duration of the pipeline.
The following are a set of recommended steps for running an nf-core pipeline with Nextflow:
1. ssh to a milan login node:
ssh <netid>@milan.seawulf.stonybrook.edu
2. Start a tmux session to keep the Nextflow process running even after logging out of SeaWulf (see tmux documentation here). You can detach from the session with Ctrl-b d and reattach later with tmux attach:
tmux
3. Load the openjdk and nextflow modules:
module load openjdk/latest
module load nextflow/latest
4. Set the Singularity cache directory environment variables:
export SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
export NXF_SINGULARITY_CACHEDIR=/gpfs/scratch/$USER/singularity
5. Run the nf-core/rnaseq workflow using the seawulf and test profiles (the test profile just runs a short job with a small amount of test data):
nextflow run nf-core/rnaseq -profile seawulf,test --outdir test_out
Nextflow will submit a series of jobs to the 40-core or 96-core partitions, and results will be saved in the "test_out" directory that was specified in the command above.
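While the pipeline is running, you can monitor the jobs Nextflow submits on your behalf with standard SLURM commands, for example:
squeue -u $USER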
For more information, please see the Nextflow and nf-core documentation.