Tutorials for predicting human and mouse data modalities using EPCOT v2
========================================================================

Introduction
------------
EPCOT v2 predicts diverse molecular modalities across the entire genome. Each prediction covers the central 500kb region of a 600kb segment.

Human Data Prediction
---------------------
(1) **Input file preparation.** The initial BAM file is pre-processed to become the input for EPCOT v2. The generated .bigWig file is named '{cell_line_name}_atac.bigWig'. **Note:** The location (chrom id, start pos, end pos) of each 600kb region is stored in the file "input_region_600kb.bed", which can be found on GitHub. There are 11,123 600kb regions across the whole human genome.

.. code-block:: bash

    ## Generate the .bigWig file
    bamCoverage --bam input.bam -o ${cell_line}_atac.bigWig --outFileFormat bigwig --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 --Offset 1 --binSize 10 --numberOfProcessors 12 \
    --blackListFileName ../data/black_list.bed
    ## Run atac_process.py to generate the .pickle file which is the required input format of EPCOT v2.
    python atac_process.py ${cell_line}_atac.bigWig
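
The region file can also be used to recover the coordinates of each prediction window. The following is a minimal sketch, assuming that "input_region_600kb.bed" is a tab-separated file with three columns (chrom id, start pos, end pos) and that the central 500kb is obtained by trimming 50kb from each side of a region. The helper names "read_regions" and "central_window" are hypothetical and are not part of EPCOT v2.

.. code-block:: python

    import csv

    FLANK = 50_000  # assumption: (600kb - 500kb) / 2 is trimmed from each side


    def read_regions(bed_path):
        """Read (chrom, start, end) tuples from a 3-column, tab-separated BED file."""
        regions = []
        with open(bed_path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if not row or row[0].startswith("#"):
                    continue  # skip blank lines and comment lines
                regions.append((row[0], int(row[1]), int(row[2])))
        return regions


    def central_window(region, flank=FLANK):
        """Return the region with `flank` base pairs trimmed from each side."""
        chrom, start, end = region
        return chrom, start + flank, end - flank

For example, under these assumptions "central_window(("chr1", 0, 600000))" gives ("chr1", 50000, 550000), and "read_regions" applied to the human region file should return 11,123 entries.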

(2) **Prediction.** Before running the model, check the paths of the input ${cell_line}_atac.pickle file and the output ${cell_line}.pickle file. To choose which modalities to predict, edit the variable "pred_modalities" in the file "tt_pred_gw.py".

.. code-block:: bash

    ## Run the model execution file tt_pred_gw.py to predict the desired molecular modalities.
    python tt_pred_gw.py
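
The path check mentioned in this step can be automated before launching the model. The following is only a sketch: the function "check_prediction_paths" and its directory arguments are hypothetical, and how "tt_pred_gw.py" itself stores these paths is not specified here.

.. code-block:: python

    from pathlib import Path


    def check_prediction_paths(cell_line, input_dir, output_dir):
        """Return (input_path, output_path) after checking that both are usable."""
        # File names follow the convention used in this tutorial.
        input_path = Path(input_dir) / f"{cell_line}_atac.pickle"
        output_path = Path(output_dir) / f"{cell_line}.pickle"
        if not input_path.is_file():
            raise FileNotFoundError(f"Missing input file: {input_path}")
        if not output_path.parent.is_dir():
            raise NotADirectoryError(f"Missing output directory: {output_path.parent}")
        return input_path, output_path

Calling it with the cell line name and the two directories raises an error early if the input file or the output directory is missing, rather than after the model has been loaded.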

(3) **Extraction of each predicted modality.** Assume the output file is called 'spleen.pickle'. Each predicted sample covers the central 500kb of a 600kb segment and is composed of 500 1kb genomic bins.

.. code-block:: python

    import pickle
    with open('spleen.pickle', 'rb') as file:
        spleen_pickle = pickle.load(file)
    ## All predicted modalities
    spleen_pickle['epi']          # epigenomic features (see "epigenomes.txt")
    spleen_pickle['rna']          # expected shape: (11123, 500, 3)
    spleen_pickle['bru']          # expected shape: (11123, 500, 3)
    spleen_pickle['microc']       # expected shape: (11123, 500, 500, 2)
    spleen_pickle['hic']          # expected shape: (11123, 100, 100, 3); Hi-C is predicted at 5kb resolution
    spleen_pickle['intacthic']    # expected shape: (11123, 500, 2)
    spleen_pickle['rna_strand']   # expected shape: (11123, 500, 2)
    spleen_pickle['external_tf']  # additional TFs (see "unseeen_tf.txt")
    spleen_pickle['tt']           # expected shape: (11123, 500, 2)
    spleen_pickle['groseq']       # expected shape: (11123, 500, 2)
    spleen_pickle['grocap']       # expected shape: (11123, 500, 4)
    spleen_pickle['proseq']       # expected shape: (11123, 500, 3)
    spleen_pickle['netcage']      # expected shape: (11123, 500, 2)
    spleen_pickle['starr']        # expected shape: (11123, 500, 1)
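
To compare a loaded file against the expected shapes, the shapes of all entries can be collected in one pass. This is a minimal sketch, assuming the loaded object is a dictionary whose values are NumPy arrays (or other objects exposing a "shape" attribute); the function name "summarize_shapes" is hypothetical, and a small toy dictionary stands in for the real predictions.

.. code-block:: python

    import numpy as np


    def summarize_shapes(predictions):
        """Map each modality key to its array shape (None if it has no shape)."""
        return {key: getattr(value, "shape", None) for key, value in predictions.items()}


    # Toy stand-in for the loaded pickle, with 4 samples instead of 11,123.
    toy = {"rna": np.zeros((4, 500, 3)), "hic": np.zeros((4, 100, 100, 3))}
    print(summarize_shapes(toy))
    # {'rna': (4, 500, 3), 'hic': (4, 100, 100, 3)}

Replacing "toy" with the dictionary loaded from 'spleen.pickle' prints the shape of every predicted modality, including the entries for which no expected shape is listed.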

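The modalities that can be predicted are explained below, in the same order as the keys above. The number of tracks listed in each entry matches the size of the last axis of the corresponding array.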

(1) Epigenomic features. The list of epigenomic features can be found on GitHub in a file named "epigenomes.txt".

(2) RNA-seq. 1. CAGE-seq 2. Total RNA-seq 3. PolyA+ RNA-seq

(3) Bru-seq. 1. Bru-seq 2. BruUV-seq 3. BruChase-seq

(4) Micro-C. 1. O/E normalized Micro-C 2. KR normalized Micro-C

(5) Hi-C. 1. CTCF ChIA-PET 2. RNApol2 ChIA-PET 3. Hi-C

(6) Intact Hi-C. 1. O/E normalized intact Hi-C 2. KR normalized intact Hi-C

(7) RNA Strand. 1. Total RNA-seq (forward) 2. Total RNA-seq (reverse)

(8) Additional TFs. The list of additional TFs can be found on GitHub in a file named "unseeen_tf.txt".

(9) TT-seq. 1. TT-seq (forward) 2. TT-seq (reverse)

(10) GRO-seq. 1. GRO-seq (forward) 2. GRO-seq (reverse)

(11) GRO-cap. 1. GRO-cap (forward) 2. GRO-cap (reverse) 3. GRO-cap_wTAP (forward) 4. GRO-cap_wTAP (reverse)

(12) PRO-seq. 1. PRO-seq (forward) 2. PRO-seq (reverse) 3. PRO-cap

(13) NET-CAGE. 1. NET-CAGE (forward) 2. NET-CAGE (reverse)

(14) STARR-seq. 1. STARR-seq
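
A single track can be pulled out of a predicted array by name, and a bin index can be converted back to genomic coordinates. The sketch below rests on two assumptions that are not confirmed here: the channel order along the last axis follows the order in which the tracks are listed above, and bin 0 starts at the first base of the central 500kb window. The names "TRACKS", "select_track" and "bin_interval" are hypothetical, and only three modalities are spelled out.

.. code-block:: python

    import numpy as np

    # Assumption: channels are stored in the order the tracks are listed above.
    TRACKS = {
        "rna": ["CAGE-seq", "Total RNA-seq", "PolyA+ RNA-seq"],
        "bru": ["Bru-seq", "BruUV-seq", "BruChase-seq"],
        "proseq": ["PRO-seq (forward)", "PRO-seq (reverse)", "PRO-cap"],
    }


    def select_track(predictions, modality, track):
        """Return one named track of a modality, dropping the channel axis."""
        channel = TRACKS[modality].index(track)
        return predictions[modality][..., channel]


    def bin_interval(window_start, bin_index, bin_size=1_000):
        """Return the (start, end) coordinates of a bin in the central window."""
        start = window_start + bin_index * bin_size
        return start, start + bin_size


    # Toy stand-in for the loaded pickle, with 2 samples instead of 11,123.
    toy = {"rna": np.arange(2 * 500 * 3).reshape(2, 500, 3)}
    print(select_track(toy, "rna", "CAGE-seq").shape)  # (2, 500)
    print(bin_interval(50_000, 0))                     # (50000, 51000)
    print(bin_interval(50_000, 99, bin_size=5_000))    # (545000, 550000)

The last call uses a 5kb bin size, which corresponds to the resolution stated for the Hi-C predictions.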