Cancer RNA-Seq Nexus Tutorial
Cancer RNA-Seq Nexus: a comprehensive database of phenotype-specific transcriptome profiling in cancer cells
The Framework of the database construction in the CRN database.
The main interface of the database
The Web Function in the right Panel
Tab of DE coding transcripts (differentially expressed protein-coding transcripts)
Tab of DE lncRNAs (differentially expressed lncRNAs)
Tab of mRNA-lncRNA coexpression network
Tab of search (users can input coding transcript or lncRNA)
RNA sequencing (RNA-seq), a rapidly developing application of next-generation sequencing (NGS) technology, has advanced genetic research and has been applied in many cancer studies as a revolutionary tool for studying alternative splicing and quantifying gene and isoform expression levels. In our previous RNA-seq research, we constructed a database of isoform-isoform interactions using 19 RNA-seq datasets (published in BMC Genomics, 2015) and performed high-resolution functional annotation of the human transcriptome using 29 RNA-seq datasets (published in Nucleic Acids Research, 2014). However, those studies did not use long non-coding RNA (lncRNA) expression profiles. A large-scale, comprehensive RNA-seq database covering both coding transcripts and lncRNAs is needed to further improve isoform-isoform interaction studies and isoform function prediction. Here, we present the Cancer RNA-Seq Nexus (CRN) database, the first public database providing phenotype-specific coding-transcript/lncRNA expression profiles and lncRNA regulatory networks in cancer cells. CRN is freely available at http://syslab4.nchu.edu.tw/CRN.
For the CRN database, we systematically collected RNA-seq datasets from The Cancer Genome Atlas (TCGA), the Sequence Read Archive (SRA) and the NCBI Gene Expression Omnibus (GEO). This yielded 89 cancer RNA-seq datasets comprising 325 subsets and 12,167 samples. Each dataset has several phenotype-specific subsets, and each subset contains a group of RNA-seq samples with specific phenotypic traits or cancer conditions, e.g. disease state, cell line, cell type, tissue or genotype. We manually created the subsets and then assigned samples to them according to the descriptions of the datasets and samples. To identify phenotype-specific differentially expressed transcripts (DETs) in each dataset, we selected the subsets with at least 3 samples and then performed a t-test between each pair of subsets that share no samples. This resulted in 973 subset pairs: 822 “cancer vs. cancer” pairs and 151 “cancer vs. normal” pairs. To obtain expression profiles for both coding transcripts and lncRNAs, we aligned the RNA-seq reads to the human transcriptome (GENCODE release 21), which includes 93,139 protein-coding and 26,414 lncRNA transcript sequences.
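The subset-pair selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `eligible_subset_pairs`, the dictionary layout and the sample IDs are hypothetical and are not taken from the CRN pipeline.

```python
from itertools import combinations

MIN_SAMPLES = 3  # minimum subset size stated in the text


def eligible_subset_pairs(subsets, min_samples=MIN_SAMPLES):
    """Return pairs of subset names from one dataset that can be compared.

    `subsets` maps a subset name to the collection of sample IDs assigned to it
    (a hypothetical layout chosen for this sketch).
    """
    # Keep only subsets that have enough samples for a t-test.
    large_enough = {
        name: set(samples)
        for name, samples in subsets.items()
        if len(set(samples)) >= min_samples
    }
    pairs = []
    for a, b in combinations(sorted(large_enough), 2):
        # Skip pairs that share any sample.
        if large_enough[a].isdisjoint(large_enough[b]):
            pairs.append((a, b))
    return pairs


# Made-up dataset with four manually created subsets.
example = {
    "ER+ breast cancer": {"s1", "s2", "s3", "s4"},
    "Her2+ breast cancer": {"s3", "s5", "s6"},
    "normal breast": {"s7", "s8", "s9"},
    "stage II": {"s1", "s2"},  # too small, dropped
}
for pair in eligible_subset_pairs(example):
    print(pair)
```

In this made-up example the two cancer subsets share sample `s3`, so only the two "cancer vs. normal" pairs remain.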
The CRN database covers 40 cancers (e.g. lung cancer, colon cancer and breast cancer) and 325 phenotype-specific subsets. Each subset contains a group of RNA-seq samples with a specific phenotype or genotype, e.g. breast cancer stage II, ER+ breast cancer and Her2+ breast cancer. The CRN database can therefore facilitate personalized medicine. For example, triple-negative breast cancer, which is characterized by negative expression of ER, PR and Her2/Neu, is not responsive to current targeted therapeutics.
The expression of the two major TP63 isoform groups, TAp63 and ΔNp63, in lung squamous cell carcinoma and lung adenocarcinoma.
(A) and (B) show TP63 isoform expression profiles in lung squamous cell carcinoma and normal lung squamous cells, respectively. (C) and (D) show TP63 isoform expression profiles in lung adenocarcinoma and normal lung tissue cells, respectively. ΔNp63 isoforms are significantly overexpressed in lung squamous cell carcinoma compared with normal lung cells, whereas they are not differentially expressed between the cancer and normal subsets in lung adenocarcinoma. In contrast, TAp63 expression is extremely low in all four subsets.
The cancer RNA-seq datasets were collected from NCBI GEO, SRA and TCGA, and all samples were then classified into phenotype-specific subsets. For GEO datasets, Bowtie2 and eXpress were used to calculate isoform expression with the GENCODE v21 reference. For TCGA datasets, we converted the expression values (i.e. tau values) of the TCGA Level 3 RNA-seq version 2 datasets to TPM (transcripts per million) values. To identify phenotype-specific differentially expressed protein-coding transcripts and lncRNAs in each dataset, we performed a t-test between two subsets that come from the same dataset and have no overlapping samples. Given a subset pair, we selected the differentially expressed coding transcripts and lncRNAs, and then performed a correlation analysis of the expression profiles of the selected coding transcripts and lncRNAs to construct an mRNA-lncRNA coexpression network.
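The tau-to-TPM conversion mentioned above can be illustrated with a minimal sketch. It assumes that the tau value is an RSEM-style estimate of the fraction of transcripts contributed by each isoform, so that TPM = tau × 10^6; the function name `tau_to_tpm` and the transcript IDs are hypothetical.

```python
def tau_to_tpm(tau_values):
    """Convert tau values to TPM.

    Assumes tau is the estimated fraction of transcripts for each isoform
    (values summing to about 1 per sample), so TPM = tau * 1e6.
    `tau_values` maps a transcript ID to its tau value.
    """
    return {tid: tau * 1_000_000 for tid, tau in tau_values.items()}


# Made-up sample with three transcripts (placeholder IDs).
sample_tau = {"tx_1": 0.5, "tx_2": 0.25, "tx_3": 0.25}
sample_tpm = tau_to_tpm(sample_tau)
print(sample_tpm)
print(sum(sample_tpm.values()))  # TPM values of a full sample sum to one million
```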
The CRN database provides a tree structure in the disease panel, which facilitates searching and browsing related cancer subsets.
Step 1. Users select a cancer name or a cancer subset; the associated subset pairs are then listed in the subset-pair panel.
Step 2. Users select a subset pair.
Step 3. The web server then shows the detailed description of the dataset and subsets.
The CRN web interface provides three major panels as follows:
(1) Disease-dataset panel (upper-left panel): the hierarchical structure is constructed from cancer diseases and cancer subsets. A subset represents a group of RNA-seq samples associated with a specific phenotype or genotype, e.g. breast cancer stage II, ER+ breast cancer and Her2+ breast cancer.
(2) Subset-pair panel (bottom-left panel): there are two types of subset pairs: “Cancer vs. Cancer” and “Cancer vs. Normal”. When users click a cancer disease or a cancer subset, CRN shows the associated subset pairs in the subset-pair panel.
(3) Profile panel (right panel): when users click a subset pair in the subset-pair panel, CRN displays the detailed description of the dataset and subsets.
When users select a subset pair, the web server shows the detailed description of the dataset and subsets.
In the right panel, there are four tab panels as follows:
(i) DE coding transcripts: it visualizes the expression profiles of differentially expressed (DE) protein-coding transcripts.
(ii) DE lncRNAs: it visualizes the expression profiles of DE lncRNAs.
(iii) mRNA-lncRNA coexpression network: when users search for a gene and an lncRNA in this tab panel, it visualizes the coexpression network built from the most significant negative and positive correlations between coding transcripts and lncRNAs.
(iv) Search of coding transcript and lncRNA: users can search gene/transcript names and transcript IDs in this panel.
Given a subset pair, CRN visualizes the expression profiles of differentially expressed (DE) protein-coding transcripts sorted by the significance level (P value) between two subsets.
If the mouse cursor is placed over the DE coding transcripts tab, the full name of the tab is displayed.
Users can filter the DE transcripts with two options: P value threshold and up/down regulation. The rank, P value, expression values in the two subsets, transcript ID and gene symbol are shown for each transcript.
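The filtering and ranking behaviour described above can be sketched as follows. This is an illustration only, not the CRN server code: the record fields, the function name `select_de` and the example values are hypothetical, and "up" is assumed here to mean higher expression in the first subset of the pair.

```python
from dataclasses import dataclass


@dataclass
class DERecord:
    transcript_id: str
    gene_symbol: str
    p_value: float
    expr_a: float  # expression value in the first subset
    expr_b: float  # expression value in the second subset


def select_de(records, p_threshold=0.01, direction="up"):
    """Filter by P value and regulation direction, then rank by P value."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")

    def matches(r):
        if r.p_value > p_threshold:
            return False
        # Assumption: "up" means higher in the first subset.
        return r.expr_a > r.expr_b if direction == "up" else r.expr_a < r.expr_b

    kept = sorted((r for r in records if matches(r)), key=lambda r: r.p_value)
    # Rank 1 is the most significant transcript.
    return list(enumerate(kept, start=1))


# Made-up records with placeholder IDs and gene symbols.
records = [
    DERecord("T1", "GENE_A", 0.0004, 35.2, 4.1),
    DERecord("T2", "GENE_B", 0.0300, 12.0, 3.0),
    DERecord("T3", "GENE_C", 0.0001, 2.5, 40.0),
    DERecord("T4", "GENE_D", 0.0020, 18.0, 6.0),
]
for rank, r in select_de(records, p_threshold=0.01, direction="up"):
    print(rank, r.p_value, r.expr_a, r.expr_b, r.transcript_id, r.gene_symbol)
```

The same function applies unchanged to lncRNA records, since the DE lncRNAs tab offers the same two options.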
Given a subset pair, CRN visualizes the expression profiles of DE lncRNAs sorted by the significance level between two subsets.
If the mouse cursor is placed over the DE lncRNAs tab, the full name of the tab is displayed.
Users can filter the DE lncRNAs with two options: P value threshold and up/down regulation. The rank, P value, expression values in the two subsets, lncRNA ID and lncRNA name are shown for each lncRNA.
Given a subset pair, CRN visualizes an mRNA-lncRNA coexpression network built from the DE coding transcripts and DE lncRNAs whose expression profiles are significantly correlated.
Users can input a gene symbol and an lncRNA name, and then select a correlation threshold. To restrict the network to the most significant connections between coding transcripts and lncRNAs, users can set the top N option (N = 10, 15, 20 or 25); the web server then selects the N most significant connections for the given coding transcript and lncRNA.
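A minimal sketch of the thresholding and top-N selection is given below. The text does not state which correlation measure CRN uses, so Pearson correlation is an assumption here, as are the function names, the default threshold and the example profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def top_connections(coding, lncrna, threshold=0.8, top_n=10):
    """Return up to `top_n` edges with |r| >= threshold, strongest first.

    `coding` and `lncrna` map an ID to its expression values across the
    samples of the selected subset pair (same sample order in both).
    """
    edges = []
    for c_id, c_expr in coding.items():
        for l_id, l_expr in lncrna.items():
            r = pearson(c_expr, l_expr)
            # Keep both strong positive and strong negative correlations.
            if abs(r) >= threshold:
                edges.append((c_id, l_id, r))
    edges.sort(key=lambda e: abs(e[2]), reverse=True)
    return edges[:top_n]


# Made-up profiles with placeholder IDs.
coding = {"mRNA_A": [1.0, 2.0, 3.0, 4.0]}
lncrna = {
    "lnc_X": [2.0, 4.0, 6.0, 8.0],  # rises with mRNA_A
    "lnc_Y": [4.0, 3.0, 2.0, 1.0],  # falls as mRNA_A rises
    "lnc_Z": [1.0, 3.0, 2.0, 2.0],  # weakly related
}
for edge in top_connections(coding, lncrna, threshold=0.8, top_n=10):
    print(edge)
```

Ranking by absolute correlation keeps both positive and negative edges, in line with the description of the network as using the most significant negative and positive correlations.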
Given a gene symbol, CRN provides a search function that shows the expression profiles of all transcripts associated with that gene, so that users can investigate the differential expression of the various isoforms of the same gene.
Users can enter only part of a gene symbol or transcript ID, and the auto-complete function of CRN then lists the partially matched terms for quick selection. Entering more characters narrows the list to better matches. The auto-complete function thus helps users both to search efficiently and to filter quickly. After a gene or a transcript is selected, the panel shows the expression profiles of that gene or transcript.
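The narrowing behaviour of the auto-complete function can be sketched as below. How CRN matches terms is not specified in the text, so the case-insensitive substring matching and the prefix-first ordering are assumptions, as is the function name `autocomplete`.

```python
def autocomplete(query, terms, limit=10):
    """Return up to `limit` terms that contain `query`, prefix matches first."""
    q = query.strip().lower()
    if not q:
        return []
    matches = [t for t in terms if q in t.lower()]
    # Terms starting with the query come first, then alphabetical order.
    matches.sort(key=lambda t: (not t.lower().startswith(q), t.lower()))
    return matches[:limit]


# A small illustrative vocabulary of gene symbols.
symbols = ["TP63", "TP53", "TP73", "ATP6V1A", "BRCA1"]
for typed in ("tp", "tp6", "tp63"):
    print(typed, autocomplete(typed, symbols))
```

Each additional character shortens the candidate list, which mirrors the filtering described above.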