scReadSim for 10x MultiOme ATAC + Gene Expression Dataset
This tutorial demonstrates the application of scReadSim generating synthetic reads for single-cell multiomics with 10x Single Cell Multiome ATAC + Gene Expression Dataset. The current version of scReadSim multiomic module requires the input of two modalities, scATAC-seq and scRNA-seq, to have matchinging single cells.
This tutorial’s main steps and corresponding estimated time usage are as follows (tested on a server with the 256x Intel Xeon Phi CPU 7210 at 1.30 GHz):
Step 1: Import packages and data files: < 1 min
Step 2: Generate features: < 1 min
Step 3: Generate real count matrices: ~ 3 mins
Step 4: Simulate synthetic count matrix: ~ 6 mins
Step 5: Output synthetic read: ~ 8 mins
By default, this tutorial uses Python (Python >= 3.8). However, we also include code chunks using bash commands to preprocess necessary files. To avoid users’ confusion, bash commands start with a symbol $. We also indicate when a following code chunk is using bash commands.
Required softwares for scReadSim
scReadSim requires users to pre-install the following softwares:
Depending on users’ choices, the following softwares are optional:
Pre-process BAM file before scReadSim
Note: This tutorial does not need this pre-process step since the processed BAM file is provided by the scReadSim package (see below Step 1: Import packages and data files).
Input BAM file for scReadSim needs pre-processing to add the cell barcode in front of the read name. For example, in 10x sequencing data, cell barcode TGGACCGGTTCACCCA-1
is stored in the field CB:Z:TGGACCGGTTCACCCA-1
.
The following code chunk (bash commands) outputs a read record from the original BAM file.
$ samtools view unprocess.bam | head -n 1
A00836:472:HTNW5DMXX:1:1372:16260:18129 83 chr1 4194410 60 50M = 4193976 -484 TGCCTTGCTACAGCAGCTCAGGAAATGTCTTTGTGCCCACAGTCTGTGGT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:50 AS:i:50 XS:i:0 CR:Z:TCCGGGACAGCTAACA CY:Z:FFFFFFFFFFFFFFF: CB:Z:TGGACCGGTTCACCCA-1 BC:Z:AAACTCAT QT:Z::FFFFFFF RG:Z:e18_mouse_brain_fresh_5k:MissingLibrary:1:HTNW5DMXX:1
The following code chunk (bash commands) adds the cell barcodes in front of the read names.
$ # extract the header file
$ mkdir tmp
$ samtools view unprocess.bam -H > tmp/unprocess.header.sam
$ # create a bam file with the barcode embedded into the read name
$ time(cat <( cat tmp/unprocess.header.sam ) \
<( samtools view unprocess.bam | awk '{for (i=12; i<=NF; ++i) { if ($i ~ "^CB:Z:"){ td[substr($i,1,2)] = substr($i,6,length($i)-5); } }; printf "%s:%s\n", td["CB"], $0 }' ) \
| samtools view -bS - > processed.bam)
$ rm -dr tmp
$ samtools view processed.bam | head -n 1
TGGACCGGTTCACCCA-1:A00836:472:HTNW5DMXX:1:1372:16260:18129 83 chr1 4194410 60 50M = 4193976 -484 TGCCTTGCTACAGCAGCTCAGGAAATGTCTTTGTGCCCACAGTCTGTGGT :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NM:i:0 MD:Z:50 AS:i:50 XS:i:0 CR:Z:TCCGGGACAGCTAACA CY:Z:FFFFFFFFFFFFFFF: CB:Z:TGGACCGGTTCACCCA-1 BC:Z:AAACTCAT QT:Z::FFFFFFF RG:Z:e18_mouse_brain_fresh_5k:MissingLibrary:1:HTNW5DMXX:1
Download reference genome for test example
The example deploys scReadSim on the 10x Single Cell Multiome ATAC + Gene Expression Dataset. For user convienience, we prepared the indexed reference genome files (by bowtie2), which can be downloaded using the following bash commands:
GENCODE reference genome FASTA file and index file(indexed by bowtie2): reference.genome.chr1.tar.gz
GENCODE genome annotation gtf file: gencode.vM10.annotation.gtf
Note: users may need to edit the code by using their own path. The following code chunk is using bash commands.
$ mkdir /home/users/example/refgenome_dir # may use users' own path
$ cd /home/users/example/refgenome_dir
$ wget http://compbio10data.stat.ucla.edu/repository/gayan/Projects/scReadSim/reference.genome.chr1.tar.gz # 292 MB
$ wget http://compbio10data.stat.ucla.edu/repository/gayan/Projects/scReadSim/gencode.vM10.annotation.gtf # 765 MB
$ tar -xf reference.genome.chr1.tar.gz
Step 1: Import packages and data files
Import modules.
import sys, os
import scReadSim.Utility as Utility
import scReadSim.scRNA_GenerateBAM as scRNA_GenerateBAM
import scReadSim.scATAC_GenerateBAM as scATAC_GenerateBAM
import scReadSim.GenerateSyntheticCount_MultiOmics as GenerateSyntheticCount_MultiOmics
import pkg_resources
The real BAM file and other input files are listed and can be accessed by simply loading the code chunk below:
BAM file (RNA modality): 10X_RNA_chr1_3073253_4526737.bam
BAM file (ATAC modality): 10X_ATAC_chr1_4194444_4399104.bam
cell barcode file: barcodes.tsv
chromosome size file: mm10.chrom.sizes
# Load cell barcode and chrom size files
INPUT_cells_barcode_file = pkg_resources.resource_filename("scReadSim", 'data/barcodes.tsv')
INPUT_genome_size_file = pkg_resources.resource_filename("scReadSim", 'data/mm10.chrom.sizes')
# Load RNA bam file
filename_RNA = "10X_RNA_chr1_3073253_4526737"
INPUT_RNA_bamfile = pkg_resources.resource_filename("scReadSim", 'data/%s.bam' % filename_RNA)
# Load ATAC bam file
filename_ATAC = "10X_ATAC_chr1_4194444_4399104"
INPUT_ATAC_bamfile = pkg_resources.resource_filename("scReadSim", 'data/%s.bam' % filename_ATAC)
Step 2: Generate features
Specify output directory
Note: users may need to edit the code by using their own path.
outdirectory = "/home/users/example/outputs" # may use user's own path
os.mkdir(outdirectory)
Specify pre-installed software paths
Note: users may need to edit the code by using their own path.
samtools_directory="/home/users/Tools/samtools"
macs3_directory="/home/users/Tools/MACS3/bin"
bedtools_directory="/home/users/Tools/bedtools/bedtools2/bin"
seqtk_directory="/home/users/Tools/seqtk"
fgbio_jarfile="/home/users/Tools/fgbio/target/scala-2.13/fgbio-2.0.1-e884860-SNAPSHOT.jar"
Prepare Features for RNA modality
Given the input BAM file (RNA modality) and gene annotation file, scReadSim prepares the bed files for features using function scRNA_CreateFeatureSets
. This function needs user to specify
INPUT_bamfile
: Input BAM file for anlaysis.samtools_directory
: Path to software samtools.bedtools_directory
: Path to software bedtools.outdirectory
: Specify the output directory of the features files.genome_annotation
: Genome annotation file for the reference genome that the input BAM aligned on or the synthetic BAM should align on.genome_size_file
: Genome sizes file. The file should be a tab delimited text file with two columns: first column for the chromosome name, second column indicating the size.
This function will generate the following two bed files into directory outdirectory
for following analysis:
gene bed file: scReadSim.Gene.bed
inter-gene bed file: scReadSim.InterGene.bed
Note: users may need to edit the code by using their own path.
# Specify the absolute path to gene annotation file
INPUT_genome_annotation = "/home/users/example/refgenome_dir/gencode.vM10.annotation.gtf" # may use user's own path
# Generate features
Utility.scRNA_CreateFeatureSets(INPUT_bamfile=INPUT_RNA_bamfile, samtools_directory=samtools_directory, bedtools_directory=bedtools_directory, outdirectory=outdirectory, genome_annotation=INPUT_genome_annotation, genome_size_file=INPUT_genome_size_file)
Prepare Features for ATAC modality
To pre-process real scATAC-seq data for training, scReadSim segments the reference genome into trustworthy peaks, trustworthy non-peaks and gray ares. First scReadSim prepares the trustworthy peaks and non-peaks for the input BAM file. Then scReadSim defines gray areas as the genomic regions complementary to the trustworthy peaks and non-peaks. Three bed files recording peaks, non-peaks and gray areas will be prepared by scReadSim for following analysis.
To prepare features for the following analysis, scReadSim utilizes function Utility.scATAC_CreateFeatureSets
with following arguments
INPUT_bamfile
: Input BAM file for anlaysis.samtools_directory
: Directory of software samtools.bedtools_directory
: Directory of software bedtools.outdirectory
: Output directory of the prepared features.genome_size_file
: Directory of Genome sizes file. The file should be a tab delimited text file with two columns: first column for the chromosome name, second column indicates the size.peak_mode
: (Optional, default: “macs3”) Specify mode for trustworthy peak and non-peak generation, must be one of the following: “macs3”, “user”, and “superset”.macs3_directory
: (Optional, default: None) Path to software MACS3. Must be specified ifINPUT_peakfile
andINPUT_nonpeakfile
are None.INPUT_peakfile
: (Optional, default: None) Directory of user-specified input peak file.INPUT_nonpeakfile
: (Optional, default: None) Directory of user-specified input non-peak file.superset_peakfile
: (Optional, default: None) Directory of a superset of potential chromatin open regions, including sources such as ENCODE cCRE (Candidate Cis-Regulatory Elements) collection. Must be specified under peak_mode “superset”.OUTPUT_peakfile
: (Optional, default: None) Directory of user-specified output peak file. Synthetic scATAC-seq reads will be generated takingOUTPUT_peakfile
as ground truth peaks. Note thatOUTPUT_peakfile
does not name the generated feature files by functionscATAC_CreateFeatureSets
.
Three modes are supported by scReadSim to prepare features: “macs3” (default), “user” and “superset”.
Note: This tutorial provides an example with the default peak_mode
“macs3”. Thus the following two code chunks with peak_mode
set to “user” or “superset” do not need to be implemented.
Under default mode “macs3” (by setting argument peak_mode
as the default values “macs3”), scReadSim uses MACS3 with the stringent criteria to call trustworthy peaks (q-value 0.01
) and non-peaks (q-value 0.1
) from the input BAM file. This function will generate the following three bed files into directory outdirectory
for following analysis:
peak bed file: scReadSim.MACS3.peak.bed
non-peak bed file: scReadSim.MACS3.nonpeak.bed
gray area bed file: scReadSim.grayareas.bed
Utility.scATAC_CreateFeatureSets(peak_mode="macs3", INPUT_bamfile=INPUT_ATAC_bamfile, samtools_directory=samtools_directory, bedtools_directory=bedtools_directory, outdirectory=outdirectory, genome_size_file=INPUT_genome_size_file, macs3_directory=macs3_directory, INPUT_peakfile=None, INPUT_nonpeakfile=None)
Step 3: Generate real count matrices
RNA modality
Based on the feature sets output in Step 2, scReasSim constructs the UMI count matrices for genes and inter-genes through function Utility.scRNA_bam2countmat_paral
. This function needs user to specify
cells_barcode_file
: Cell barcode file corresponding to the input BAM file.bed_file
: Features bed file to generate the count matrix.INPUT_bamfile
: Input BAM file for anlaysis.outdirectory
: Specify the output directory of the count matrix file.count_mat_filename
: Specify the base name of output read (or UMI) count matrix.UMI_modeling
: (Optional, default: True) Specify whether scReadSim should model UMI count of the input BAM file.UMI_tag
: (Optional, default: ‘UB:Z’) If UMI_modeling is set to True, specify the UMI tag of input BAM file, default value ‘UB:Z’ is the UMI tag for 10x scRNA-seq.n_cores
: (Optional, default: ‘1’) Specify the number of cores for parallel computing when generating count matrix.
For the user specified count_mat_filename
, scReadSim will generate a count matrix named count_mat_filename
.txt to directory outdirectory
.
# Specify the path to bed files generated by Utility.scRNA_CreateFeatureSets
gene_bedfile = outdirectory + "/" + "scReadSim.Gene.bed"
intergene_bedfile = outdirectory + "/" + "scReadSim.InterGene.bed"
# Specify the output count matrices' prenames
UMI_gene_count_mat_filename = "%s.gene.countmatrix" % filename_RNA
UMI_intergene_count_mat_filename = "%s.intergene.countmatrix" % filename_RNA
## Construct count matrix for genes
Utility.scRNA_bam2countmat_paral(cells_barcode_file=INPUT_cells_barcode_file, bed_file=gene_bedfile, INPUT_bamfile=INPUT_RNA_bamfile, outdirectory=outdirectory, count_mat_filename=UMI_gene_count_mat_filename, UMI_modeling=True, UMI_tag = "UB:Z", n_cores=8)
## Construct count matrix for inter-genes
Utility.scRNA_bam2countmat_paral(cells_barcode_file=INPUT_cells_barcode_file, bed_file=intergene_bedfile, INPUT_bamfile=INPUT_RNA_bamfile, outdirectory=outdirectory, count_mat_filename=UMI_intergene_count_mat_filename, UMI_modeling=True, UMI_tag = "UB:Z", n_cores=8)
ATAC modality
Based on the feature sets output in Step 2, scReasSim constructs the count matrices for both foreground feautures and background features through function Utility.scATAC_bam2countmat_paral
. This function needs user to specify
cells_barcode_file
: Cell barcode file corresponding to the input BAM file.bed_file
: Features’ bed file to generate the count matrix (Generated by functionUtility.scATAC_CreateFeatureSets
).INPUT_bamfile
: Input BAM file for anlaysis.outdirectory
: Specify the output directory of the count matrix file.count_mat_filename
: Specify the base name of output count matrix.n_cores
: (Optional, default: ‘1’) Specify the number of cores for parallel computing when generating count matrix.
For the user specified count_mat_filename
, scReadSim will generate a count matrix named count_mat_filename
.txt to directory outdirectory
.
# Specify the path to bed files generated by Utility.scATAC_CreateFeatureSets
peak_bedfile = outdirectory + "/" + "scReadSim.MACS3.peak.bed"
nonpeak_bedfile = outdirectory + "/" + "scReadSim.MACS3.nonpeak.bed"
# Specify the output count matrices' prenames
count_mat_peak_filename = "%s.peak.countmatrix" % filename_ATAC
count_mat_nonpeak_filename = "%s.nonpeak.countmatrix" % filename_ATAC
# Construct count matrix for peaks
Utility.scATAC_bam2countmat_paral(cells_barcode_file=INPUT_cells_barcode_file, bed_file=peak_bedfile, INPUT_bamfile=INPUT_ATAC_bamfile, outdirectory=outdirectory, count_mat_filename=count_mat_peak_filename, n_cores=1)
# Construct count matrix for non-peaks
Utility.scATAC_bam2countmat_paral(cells_barcode_file=INPUT_cells_barcode_file, bed_file=nonpeak_bedfile, INPUT_bamfile=INPUT_ATAC_bamfile, outdirectory=outdirectory, count_mat_filename=count_mat_nonpeak_filename, n_cores=1)
Step 4: Simulate synthetic count matrix
In this tutorial, scReadSim implements scDesign3 to simulate single-cell multiomics count matrix based on the real count matrices obtained from the input BAM files. Use function GenerateSyntheticCount_MultiOmics.scMultiOmics_GenerateSyntheticCount
to generate synthetic count matrices with following paramters
RNA_count_mat_filename
: Base name of the count matrix output by functionscRNA_bam2countmat_paral()
.ATAC_count_mat_filename
: Base name of the count matrix output by functionscATAC_bam2countmat_paral()
.directory
: Path to the count matrix.outdirectory
: Specify the output directory of the synthetic count matrix file.doub_classification_label_file
: (Optional, default: ‘None’) Specify the absolute path to the doublet classification resultdoublet_classification.Rdata
generated by functionDoubletDetection.detectDoublet
.n_cell_new
: (Optional, default: ‘None’) Number of synthetic cells. If not specified, scReadSim uses the number of real cells.celllabel_file
: (Optional, default: ‘None’) Specify the one-column text file containing the predefined cell labels. Make sure that the order of cell labels correspond to the cell barcode file. If no cell labels are specified, scReadSim performs a Louvain clustering before implementing scDesign3.n_cores
: (Optional, default: 1): Number of cores for parallel computing.
Given the input RNA count matrix RNA_count_mat_filename
.txt and ATAC count matrix ATAC_count_mat_filename
.txt, scReadSim generates the syntheitic count matrices file to outdirectory
for following analysis:
Synthetic count matrix (RNA modality):
RNA_count_mat_filename
.scMultiOmics.scDesign3Simulated.ATAC.txtSynthetic count matrix (ATAC modality):
ATAC_count_mat_filename
.scMultiOmics.scDesign3Simulated.ATAC.txtSynthetic cell cluster/type labels: scMultiOmics.scDesign3Simulated.CellTypeLabel.txt
Additionaly, if no celllabel_file
is specified, scReadSim automatically performs Louvain clustering (from Seurat) based on the RNA modality count matrix and outputs clustering labels to outdirectory
:
Real cells’ Louvain clustering labels:
RNA_count_mat_filename
.LouvainClusterResults.txt
# Generate multiomic count matrices for genes and peaks
GenerateSyntheticCount_MultiOmics.scMultiOmics_GenerateSyntheticCount(RNA_count_mat_filename=UMI_gene_count_mat_filename, ATAC_count_mat_filename=count_mat_peak_filename, directory=outdirectory, outdirectory=outdirectory, n_cores=10)
# Specify clustering labels obtained from gene-by-cell matrix
celllabel_file = outdirectory + "/" + "10X_RNA_chr1_3073253_4526737.gene.countmatrix.LouvainClusterResults.txt"
# Generate multiomic count matrices for intergenes and non-peaks
GenerateSyntheticCount_MultiOmics.scMultiOmics_GenerateSyntheticCount(RNA_count_mat_filename=UMI_intergene_count_mat_filename, ATAC_count_mat_filename=count_mat_nonpeak_filename, directory=outdirectory, outdirectory=outdirectory, celllabel_file=celllabel_file ,n_cores=10)
Step 5: Output synthetic read
Generate synthetic reads in BED format (RNA modality)
Based on the synthetic count matrix, scReadSim generates synthetic reads by randomly sampling from the real BAM file input by users. First use function scRNA_GenerateBAMCoord
to create the synthetic reads and output in BED file storing the coordinates information. Function scRNA_GenerateBAMCoord
takes following input arguments:
bed_file
: Features’ bed file to generate the synthetic reads (Generated by functionUtility.scRNA_CreateFeatureSets
).UMI_count_mat_file
: The path to the synthetic UMI count matrix generated byGenerateSyntheticCount.scRNA_GenerateSyntheticCount
.synthetic_cell_label_file
: Synthetic cell label file generated byscRNA_GenerateSyntheticCount
.read_bedfile_prename
: Specify the base name of output bed file.INPUT_bamfile
: Input BAM file for anlaysis.outdirectory
: Specify the output directory for synthetic reads bed file.OUTPUT_cells_barcode_file
: Specify the file name storing the synthetic cell barcodes.jitter_size
: (Optional, default: ‘5’) Specify the range of random shift to avoid replicate synthetic reads. Default value is 5 bp.read_len
: (Optional, default: ‘50’) Specify the length of synthetic reads. Default value is 50 bp.
This function will output a bed file read_bedfile_prename
.read.bed storing the coordinates information of synthetic reads and its cell barcode file OUTPUT_cells_barcode_file
in directory outdirectory
.
After generation of synthetic reads for genes and inter-genes, combine the two bed files using function scRNA_GenerateBAM.scRNA_CombineBED
, which takes following input arguments:
outdirectory
: Directory ofgene_read_bedfile_prename
.txt andintergene_read_bedfile_prename
.txt.gene_read_bedfile_prename
: File prename of foreground synthetic reads bed file.intergene_read_bedfile_prename
: File prename of background synthetic reads bed file.BED_filename_combined_pre
: Specify the combined syntehtic reads bed file prename. The combined bed file will be output tooutdirectory
.
# Specify the names of synthetic count matrices (generated by GenerateSyntheticCount.scRNA_GenerateSyntheticCount)
synthetic_cell_label_file = "scMultiOmics.scDesign3Simulated.CellTypeLabel.txt"
synthetic_countmat_gene_file = UMI_gene_count_mat_filename + ".scMultiOmics.scDesign3Simulated.RNA.txt"
synthetic_countmat_intergene_file = UMI_intergene_count_mat_filename + ".scMultiOmics.scDesign3Simulated.RNA.txt"
# Specify the base name of bed files containing synthetic reads
gene_read_bedfile_prename = "%s.scMultiOmics.scReadSim.RNA.gene" % filename_RNA
intergene_read_bedfile_prename = "%s.scMultiOmics.scReadSim.RNA.intergene" % filename_RNA
RNA_read_bedfile_prename = "%s.scMultiOmics.scReadSim.RNA.combined" % filename_RNA
# Specify the file name of synthetic cell barcodes
OUTPUT_cells_barcode_file = "synthetic_cell_barcode.txt"
# Create synthetic read coordinates for genes
scRNA_GenerateBAM.scRNA_GenerateBAMCoord(
bed_file=gene_bedfile, UMI_count_mat_file=outdirectory + "/" + synthetic_countmat_gene_file, synthetic_cell_label_file=outdirectory + "/" + synthetic_cell_label_file, read_bedfile_prename=gene_read_bedfile_prename, INPUT_bamfile=INPUT_RNA_bamfile, outdirectory=outdirectory, OUTPUT_cells_barcode_file=OUTPUT_cells_barcode_file, jitter_size=5, read_len=90)
# Create synthetic read coordinates for intergenes
scRNA_GenerateBAM.scRNA_GenerateBAMCoord(
bed_file=intergene_bedfile, UMI_count_mat_file=outdirectory + "/" + synthetic_countmat_intergene_file, synthetic_cell_label_file=outdirectory + "/" + synthetic_cell_label_file, read_bedfile_prename=intergene_read_bedfile_prename, INPUT_bamfile=INPUT_RNA_bamfile, outdirectory=outdirectory, OUTPUT_cells_barcode_file=OUTPUT_cells_barcode_file, jitter_size=5, read_len=90)
# Combine synthetic read bed files
scRNA_GenerateBAM.scRNA_CombineBED(outdirectory=outdirectory, gene_read_bedfile_prename=gene_read_bedfile_prename, intergene_read_bedfile_prename=intergene_read_bedfile_prename, BED_filename_combined_pre=RNA_read_bedfile_prename)
Generate synthetic reads in BED format (ATAC modality)
Based on the synthetic count matrix, scReadSim generates synthetic reads by randomly sampling from the real BAM file input by users. First use function scATAC_GenerateBAM.scATAC_GenerateBAMCoord
to create the synthetic reads and output in BED file storing the coordinates information. Function scATAC_GenerateBAM.scATAC_GenerateBAMCoord
takes following input arguments:
bed_file
: Features’ bed file to generate the synthetic reads (Generated by functionUtility.scATAC_CreateFeatureSets
).count_mat_file
: The path to the synthetic count matrix generated byGenerateSyntheticCount.scATAC_GenerateSyntheticCount
.synthetic_cell_label_file
: Synthetic cell label file generated byscATAC_GenerateSyntheticCount
.read_bedfile_prename
: Specify the base name of output bed file.INPUT_bamfile
: Input BAM file for anlaysis.outdirectory
: Specify the output directory for synthetic reads bed file.OUTPUT_cells_barcode_file
: Specify the file name storing the synthetic cell barcodes.jitter_size
: (Optional, default: ‘5’) Specify the range of random shift to avoid replicate synthetic reads. Default value is 5 bp.read_len
: (Optional, default: ‘50’) Specify the length of synthetic reads. Default value is 50 bp.random_noise_mode
: (Optional, default: ‘False’) Specify whether to use a uniform distribution of reads.GrayAreaModeling
: (Optional, default: ‘False’) Specify whether to generate synthetic reads for Gray Areas when generaing reads for non-peaks. Do not specify ‘True’ when generating reads for peaks.
This function will output two bed files read_bedfile_prename
.read1.bed and read_bedfile_prename
.read2.bed storing the coordinates information of synthetic reads and its cell barcode file OUTPUT_cells_barcode_file
in directory outdirectory
.
After generation of synthetic reads for both peaks and non-peaks, combine the their bed files using function scATAC_GenerateBAM.scATAC_CombineBED
, which takes following input arguments:
outdirectory
: Directory ofpeak_read_bedfile_prename
.txt andnonpeak_read_bedfile_prename
.txt.peak_read_bedfile_prename
: Base name of the bed file containig synthetic reads for peaks (generated by functionscATAC_GenerateBAM.scATAC_GenerateBAMCoord
).nonpeak_read_bedfile_prename
: Base name of the bed file containig synthetic reads for non-peaks (generated by functionscATAC_GenerateBAM.scATAC_GenerateBAMCoord
).BED_filename_combined_pre
: Specify the base name for the combined syntehtic reads bed file. The combined bed file will be output tooutdirectory
.
# Specify the names of synthetic count matrices (generated by GenerateSyntheticCount.scATAC_GenerateSyntheticCount)
synthetic_countmat_peak_file = count_mat_peak_filename + ".scMultiOmics.scDesign3Simulated.ATAC.txt"
synthetic_countmat_nonpeak_file = count_mat_nonpeak_filename + ".scMultiOmics.scDesign3Simulated.ATAC.txt"
# Specify the base name of bed files containing synthetic reads
peak_read_bedfile_prename = "%s.scMultiOmics.scReadSim.ATAC.peak" % filename_ATAC
nonpeak_read_bedfile_prename = "%s.scMultiOmics.scReadSim.ATAC.nonpeak" % filename_ATAC
ATAC_read_bedfile_prename = "%s.scMultiOmics.scReadSim.ATAC.combined" % filename_ATAC
# Create synthetic read bed file for peaks
scATAC_GenerateBAM.scATAC_GenerateBAMCoord(bed_file=peak_bedfile, count_mat_file=outdirectory + "/" + synthetic_countmat_peak_file, synthetic_cell_label_file=outdirectory + "/" + synthetic_cell_label_file, read_bedfile_prename=peak_read_bedfile_prename, INPUT_bamfile=INPUT_ATAC_bamfile, outdirectory=outdirectory, OUTPUT_cells_barcode_file=OUTPUT_cells_barcode_file, jitter_size=5, read_len=50)
# Ceate synthetic read bed file for non-peaks
scATAC_GenerateBAM.scATAC_GenerateBAMCoord(bed_file=nonpeak_bedfile, count_mat_file=outdirectory + "/" + synthetic_countmat_nonpeak_file, synthetic_cell_label_file=outdirectory + "/" + synthetic_cell_label_file, read_bedfile_prename=nonpeak_read_bedfile_prename, INPUT_bamfile=INPUT_ATAC_bamfile, outdirectory=outdirectory, OUTPUT_cells_barcode_file=OUTPUT_cells_barcode_file, jitter_size=5, read_len=50, GrayAreaModeling=True)
# Combine bed files
scATAC_GenerateBAM.scATAC_CombineBED(outdirectory=outdirectory, peak_read_bedfile_prename=peak_read_bedfile_prename, nonpeak_read_bedfile_prename=nonpeak_read_bedfile_prename, BED_filename_combined_pre=ATAC_read_bedfile_prename)
Convert BED files to FASTQ files
Use functions scRNA_BED2FASTQ
and scATAC_BED2FASTQ
to convert BED file to FASTQ file. These functions take the following arguments:
bedtools_directory
: Path to software bedtools.seqtk_directory
: Path to software seqtk.referenceGenome_file
: Reference genome FASTA file that the synthteic reads should align.outdirectory
: Output directory of the synthteic bed file and its corresponding cell barcodes file.BED_filename_combined
: Base name of the combined bed file output by functionscRNA_CombineBED
orscATAC_CombineBED
.synthetic_fastq_prename
: Specify the base name of the output FASTQ files.
This function will output paired-end reads in FASTQ files named as synthetic_fastq_prename
.read1.bed2fa.sorted.fq, synthetic_fastq_prename
.read2.bed2fa.sorted.fq to directory outdirectory
.
Note: users may need to edit the code by using their own path.
referenceGenome_name = "chr1"
referenceGenome_dir = "/home/users/example/refgenome_dir" # may use users' own path
referenceGenome_file = "%s/%s.fa" % (referenceGenome_dir, referenceGenome_name)
# RNA modality
RNA_synthetic_fastq_prename = RNA_read_bedfile_prename
# Convert combined bed file into FASTQ files
scRNA_GenerateBAM.scRNA_BED2FASTQ(bedtools_directory=bedtools_directory, seqtk_directory=seqtk_directory, referenceGenome_file=referenceGenome_file, outdirectory=outdirectory, BED_filename_combined=RNA_read_bedfile_prename, synthetic_fastq_prename=RNA_synthetic_fastq_prename)
# ATAC modality
ATAC_synthetic_fastq_prename = ATAC_read_bedfile_prename
# Convert combined bed file into FASTQ files
scATAC_GenerateBAM.scATAC_BED2FASTQ(bedtools_directory=bedtools_directory, seqtk_directory=seqtk_directory, referenceGenome_file=referenceGenome_file, outdirectory=outdirectory, BED_filename_combined=ATAC_read_bedfile_prename, synthetic_fastq_prename=ATAC_synthetic_fastq_prename)
Introduce Error to synthetic data
Use function scRNA_GenerateBAM
and scATAC_ErrorBase
to introduce random error to synthetic reads.
Build reference genome dictionary (optional)
Note that before using function scRNA_GenerateBAM
and scATAC_ErrorBase
, please create the reference dictionary for the reference genome with function CreateSequenceDictionary
using software Picard and make sure that the output .dict files are within the same directory to referenceGenome_name
.fa.
Note: For this tutorial, no dictionary building is needed since we have built for chr1.fa in reference.genome.chr1.tar.gz. The following code chunk is using bash commands.
$ cd /home/users/example/refgenome_dir # may use users' own path
$ java -jar /home/users/picard/build/libs/picard.jar CreateSequenceDictionary \
$ -R chr1.fa \
$ -O chr1.fa.dict
Introduce errors to synthetic reads
Functions scRNA_GenerateBAM
and scATAC_ErrorBase
take the following arguments:
fgbio_jarfile
: Path to software fgbio jar script.INPUT_bamfile
: Input BAM file for anlaysis.referenceGenome_file
: Reference genome FASTA file that the synthteic reads should align.outdirectory
: Specify the output directory of the synthteic FASTQ file with random errors.synthetic_fastq_prename
: Base name of the synthetic FASTQ files output by functionscRNA_BED2FASTQ
orscATAC_BED2FASTQ
.
This function will output synthetic reads with random errors in FASTQ files named as synthetic_fastq_prename
.ErrorIncluded.read1.bed2fa.fq, synthetic_fastq_prename
.ErrorIncluded.read2.bed2fa.fq to directory outdirectory
.
# RNA modality
scRNA_GenerateBAM.scRNA_ErrorBase(fgbio_jarfile=fgbio_jarfile, INPUT_bamfile=INPUT_RNA_bamfile, referenceGenome_file=referenceGenome_file, outdirectory=outdirectory, synthetic_fastq_prename=RNA_synthetic_fastq_prename)
# ATAC modality
scATAC_GenerateBAM.scATAC_ErrorBase(fgbio_jarfile=fgbio_jarfile, INPUT_bamfile=INPUT_ATAC_bamfile, referenceGenome_file=referenceGenome_file, outdirectory=outdirectory, synthetic_fastq_prename=ATAC_synthetic_fastq_prename)