High Performance and Cloud Computing Research Facility

Overview

The research community’s combination of fast pace, cutting-edge technologies, massive resource consumption, varied access levels and technical sophistication led Information Management and Services (IMS), the Glenn Biggs Institute for Alzheimer’s and Neurodegenerative Diseases and the Genome Sequencing Facility to join forces to address these research needs.

IMS brings innovative technology and expertise in configuration management and cloud computing, while the Biggs Institute and the Genome Sequencing Facility provide the research data management and analysis needed by the scientific community.

This unique partnership provides the UT Health San Antonio research community with computational resources to enable scientific research.

Benefits

A centralized high performance and cloud cluster provides scalable computational power and a flexible platform that can grow and be upgraded.

Advantages of HPC over small, individually managed machines include:

  • Elimination of duplicate spending: The HPC will provide a computational infrastructure for the whole campus, eliminating repetitive spending by departments and institutes on similar facilities.
  • Scalable and flexible HPC resources: A centralized HPC can meet a wide range of computational needs. The planned new university-wide HPC will have 600 cores and will be equipped with GPUs. In addition, high-memory nodes will be constructed for memory-demanding jobs such as genome assembly.
  • Avoiding the small, outdated HPC silo trap: HPC equipment can easily be updated and expanded. Instead of PIs buying their own machines, they can request funds in grant applications to add new nodes to the centralized HPC. Our management model prioritizes contributing PIs for node access, and in return, the facility grows.
  • Pathway to secure additional grant funding: PIs and research finance managers will have the foundation needed to apply for infrastructure or data management grants to secure additional funding.
  • IT knowledge and support: IMS will be the primary group maintaining and supporting the HPC.
  • Reputation and vision: Having a state-of-the-art centralized HPC facility will:
    Allow investigators to process large-scale data more expeditiously and at lower cost
    Help lead to new research grant funding and enhance the university’s reputation, attracting new faculty recruits who are computational biologists or whose research programs rely on our computational strengths
    Enhance the appeal of our graduate program to prospective students who wish to be trained in computational approaches

The Genetics and Multiomics Core has facilities at UT Health San Antonio and The University of Texas Rio Grande Valley. Facilities include the UT Health San Antonio Genome Sequencing Facility and several high-memory servers managed by IMS.

The core has dedicated staff and faculty to support the educational and research program conducted at the STAC.

Faculty, students and staff have access to state-of-the-art facilities, including faculty-owned workstations, a newly installed HPC and external HPCs such as those at the Texas Advanced Computing Center.

The core also has a collection of bioinformatics tools such as Ingenuity Pathway Analysis, Cytoscape, TRANSFAC database and other functional data analysis tools.

Genetics and Multiomics Core faculty include bioinformaticians, computational biologists and genetic epidemiologists.

The Genome Sequencing Facility occupies ~1,500 sq ft of controlled-access research space conveniently located at the Greehey Children’s Cancer Research Institute, within walking distance of UT Health San Antonio’s other research facilities. It contains six bench stations and standard equipment for a fully operational molecular lab.

The Genome Sequencing Facility at UT Health San Antonio is currently equipped with:

  • 1 Illumina HiSeq 3000 system: suitable for whole-genome DNA-seq, exome-seq, RNA-seq, scRNA-seq and ChIP-seq
  • 1 NextSeq 500: used for RNA-seq, ChIP-seq, small RNA-seq, scRNA-seq and exome-seq analyses for individual projects with fast turnaround times
  • 1 MiSeq: commonly used for amplicon sequencing and 16S metagenomics
  • All necessary peripheral instruments used for sample processing and sequencing to meet the needs of various high-throughput sequencing projects

The bioinformatics component of the facility is supported by an Illumina Compute system and several large Linux servers.

The Genome Sequencing Facility provides equipment and expertise to support a variety of functional genomics studies using advanced technologies such as next-generation sequencing, microarrays, flow cytometry and imaging systems.

The equipment available to fulfill the missions of the STAC includes:

  • Illumina HiSeq 3000 Sequencing System
  • Illumina NextSeq 500 Sequencing System
  • Illumina MiSeq Sequencing System
  • Illumina cBot Cluster Generation Station
  • 10X Genomics Chromium System for single-cell sequencing library preparation
  • Covaris S220 Ultra Sonicator
  • Agilent 2100 Bioanalyzer
  • Advanced Analytical Fragment Analyzer
  • ProteinSimple FluorChem E
  • Eppendorf Realplex Quantitative PCR
  • Eppendorf epMotion 5075t Liquid Handling Workstations

Currently, the NGS assays used by facility users cover a diverse array of protocols.
Main areas include:

  1. Genomic sequencing (for genome structure variation and chromatin three-dimensional architecture capture), which encompasses whole-genome DNA-seq (WGS), whole-exome DNA-seq (WES), Hi-C and targeted gene re-sequencing.
  2. Transcriptome sequencing (for whole transcriptome gene regulation analysis, miRNA expression in different brain tissues, and pathway analysis), which includes total RNA-seq, mRNA-seq, small RNA-seq and targeted gene expression.
  3. Epigenomic sequencing (for Alzheimer’s disease DNA methylation, histone modification, DNA-protein interactions, RNA-protein interactions, DNA-RNA interactions, RNA modification, and methylation), which includes methyl-CpG binding domain-based capture-seq (MBDCap-seq), ChIP-seq, Ribo-seq, RIP-seq, CLIP-seq and GRO-seq analyses.
  4. Amplicon sequencing, which includes CRISPR-Cas9 insert screening, targeted gene seq, ITS amplicon seq and 16S rRNA seq.

Computational biology and bioinformatics work require rapid testing and optimization of biological software. Faculty at the STAC own powerful workstations (each with 80 cores, 256 GB RAM and 16 TB storage) and have access to IMS-managed servers (each with 104 cores, 512 GB RAM and 60 TB local storage) and data storage (a 1 PB research archival data storage system based on the open-source Ceph storage platform).

  • Network connectivity: The university provides an enterprise network with gigabit connectivity from the data center to the entire campus.
  • Storage backup and service: The center manages and services computer servers and data storage for the STAC, including backups (weekly incremental and monthly full, illustrated in the sketch below) and the handling of additional data storage requests.
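
As a minimal illustration of the backup rotation described above (weekly incrementals plus a monthly full backup), the Python sketch below decides which type of backup a given weekly run should perform. The scheduling rule (runs on Sundays, with the first Sunday of each month promoted to a full backup) and the function name are assumptions for illustration only; the actual tooling and schedule used by IMS are not specified here.

    # Minimal sketch of a weekly-incremental / monthly-full backup rotation.
    # Assumption: backups run on Sundays, and the first Sunday of each month
    # is promoted to a full backup. Illustrative only.
    from datetime import date

    def backup_type(run_day: date) -> str:
        """Return 'full' or 'incremental' for a given weekly backup run."""
        if run_day.weekday() == 6 and run_day.day <= 7:  # first Sunday of the month
            return "full"
        return "incremental"

    if __name__ == "__main__":
        for d in (date(2020, 7, 5), date(2020, 7, 12), date(2020, 7, 26)):
            print(d.isoformat(), backup_type(d))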

In 2019, all computing equipment was consolidated into the Advanced Data Center, including the newly installed GENIE cluster system (see below). The center is managed by Information Management and Services (IMS), which provides operating system and operations support for computer systems residing in the center and in two legacy data centers at the McDermott and South Texas Research Facility buildings. Both legacy data centers will be retired within a few years, once consolidation and the Advanced Data Center’s duplicate site are complete.

As part of the data center consolidation and modernization effort, the newly established UT Health San Antonio Advanced Data Center is an enterprise data center located at the Greehey campus, approximately one mile from the main campus. The new modular facility design provides IT floor space with several state-of-the-art capabilities and additional power capacity to accommodate future growth. The selection of new technology reflects green initiatives, both through more power-efficient processors and through more intelligent automation software that makes more efficient use of that processing power. Optimal security, performance and availability are achieved through a combination of operational tasks performed by data center operators and automated tools implemented and managed by the systems programming staff. The consolidation of our databases and data storage to the center, and their management by IMS, also assures that they meet regulatory and compliance requirements such as HIPAA and the HITECH Act.

The University IT Disaster Recovery plan is designed to provide institutional recovery from the destruction or failure of the enterprise data center. Business continuity planning aims to build a tested and validated plan to maintain key business functions and operational continuity before, during and after a disruptive event, whether a natural disaster, human error, power failure or technical system maintenance. To protect UT Health San Antonio from such threats, a Business Continuity and Disaster Recovery Data Center has been implemented at the UT Austin Shared Data Center in Austin, Texas. This site can also be used to run a small subset of workloads during systems/application maintenance and during the data center migration to the new enterprise data center. The two data centers are interconnected over a dedicated, redundant 10 Gb point-to-point DWDM circuit, safeguarding the data generated at the STAC.

In July 2020, a new HPC cluster, collaboratively funded by the Biggs Institute and the Greehey Children’s Cancer Research Institute, was established. GENIE, the computational infrastructure for genomics, epigenomics, network, imaging and education, consists of two head nodes, two login nodes, one storage node (300 TB), eighteen compute nodes (20 cores each), two large-memory nodes (768 GB each) and 10 GPU nodes. In total, there are 600 Xeon CPU cores, 9 NVIDIA T4 and 9 NVIDIA V100S GPU cards, 12.3 TB of aggregate RAM and 57 TB of mirrored SSD scratch space, and all nodes are connected to an InfiniBand switch that provides up to 200 Gb/s of full bidirectional bandwidth per port (2-4 ports per node). GENIE combines cutting-edge technologies, massive computing resources and the technical sophistication required for advanced biomedical research: it handles enormous volumes of genomic data and medical images from thousands of patients, processing them under demanding turnaround requirements with sophisticated deep learning and artificial intelligence applications for medical image processing and genomic data interpretation. The system will significantly shorten our next-generation sequencing data analysis time, from the current 2-3 human whole-genome sequencing (WGS) samples per day to ~100 WGS per day, outpacing the current NovaSeq 6000 data generation capability (48 WGS every 3 days, i.e., ~16 WGS per day).
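
To put the throughput figures above in perspective, the short Python sketch below reproduces the back-of-the-envelope arithmetic using only the numbers quoted in this section (~100 WGS analyzed per day on GENIE, 2-3 WGS per day previously, and 48 WGS generated every 3 days by the NovaSeq 6000). It is a rough capacity comparison, not a benchmark of the actual pipelines.

    # Back-of-the-envelope throughput comparison using the figures quoted above.
    # These are nominal capacities, not measured benchmarks.
    genie_analysis_per_day = 100          # ~100 WGS analyzed per day on GENIE
    previous_analysis_per_day = 2.5       # previously 2-3 WGS per day (midpoint)
    novaseq_generation_per_day = 48 / 3   # NovaSeq 6000: 48 WGS every 3 days

    print(f"Sequencer output: {novaseq_generation_per_day:.0f} WGS/day")
    print(f"Analysis speedup over the previous setup: ~{genie_analysis_per_day / previous_analysis_per_day:.0f}x")
    print(f"Analysis headroom over data generation: ~{genie_analysis_per_day / novaseq_generation_per_day:.1f}x")

With these figures, analysis capacity exceeds data generation by roughly a factor of six, so sequencing throughput rather than computation remains the limiting step.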

Faculty of the Genetics and Multiomics Core (GMC) have privileged access to the Texas Advanced Computing Center’s HPCs and dedicated technical support.

These HPCs include:

  • Stampede: One of the top 10 supercomputers in the world, with 10 petaFLOPS dedicated to scientific research. This Dell PowerEdge cluster is equipped with 6,400 nodes; each node has two 8-core Xeon E5 processors, one 61-core Xeon Phi co-processor and 32 GB of RAM.
  • Lonestar: The Lonestar cluster has 1,252 Cray XC40 compute nodes, each with two 12-core Intel® Xeon® processors, for a total of 30,048 compute cores and 45 TB of aggregate RAM.
  • Maverick: The Maverick cluster has 132 nodes, each with two Intel Xeon E5-2680 v2 (Ivy Bridge) sockets (10 CPU cores per socket) and 250 GB of RAM, dedicated to memory-intensive computation.
  • Ranch: Ranch is a Sun Microsystems StorageTek Mass Storage Facility. It has 2 petabytes of online storage for data transfer and capacity for 160 petabytes of offline tape storage.

Additional server computers for bioinformatics tasks at UT Health San Antonio:

  • 7x Linux servers (96, 64, 40 and 16 cores; >200 computation cores in total), each with 512 GB RAM and up to 30 TB of storage per system, for complex computation
  • 1x Linux server with two quad-core 3 GHz Xeons and 32 GB RAM, serving as a dedicated MySQL database server
  • 1x Linux server (two dual-core Xeons) dedicated to the Illumina pipeline (demultiplexing, data transfer, etc.)
  • 1x 1 PB shared data storage system, covered by a paid annual license agreement with the Advanced Data Center
  • 1x 6 TB disk storage for general research activity, shared within the Greehey Children’s Cancer Research Institute
  • 1x 10 Gb network connection to the university’s central network and storage support, as well as to University of Texas System-wide resources, including The University of Texas at San Antonio’s CBI facility