You can install the latest release of rs3 from PyPI using

```
pip install rs3
```

Mac users may also have to install the OpenMP library with Homebrew

```
brew install libomp
```

or install lightgbm without OpenMP

```
pip install lightgbm --install-option=--nomp
```

See the LightGBM documentation for more information.
## Documentation
You can see the complete documentation for Rule Set 3 here.
To calculate Rule Set 3 (sequence) scores, import the `predict_seq` function from the `seq` module.

```python
from rs3.seq import predict_seq
```

You can store the 30mer context sequences you want to predict as a list.

```python
context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA', 'AGAAAACACTAGCATCCCCACCCGCGGACT']
predict_seq(context_seqs, sequence_tracr='Hsu2013')
```
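Before scoring, it can help to sanity-check the inputs. A minimal helper (not part of rs3, just plain Python for illustration) confirms each context sequence is a 30-nt string over A/C/G/T:

```python
# Simple validation helper (not part of rs3): each context sequence
# should be a 30-nt string containing only A, C, G, and T.
VALID_BASES = set('ACGT')

def check_context_seqs(context_seqs):
    for seq in context_seqs:
        if len(seq) != 30:
            raise ValueError(f'Expected a 30mer, got {len(seq)} nt: {seq}')
        if not set(seq) <= VALID_BASES:
            raise ValueError(f'Non-ACGT character in: {seq}')
    return True

context_seqs = ['GACGAAAGCGACAACGCGTTCATCCGGGCA',
                'AGAAAACACTAGCATCCCCACCCGCGGACT']
print(check_context_seqs(context_seqs))  # True
```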
## Target based model
To get target scores, which use features at the endogenous target site to make predictions, you must build or load feature matrices for the amino acid sequences, conservation scores, and protein domains.
As an example, we'll calculate target scores for 250 sgRNAs in the GeckoV2 library.
```python
import pandas as pd

from rs3.predicttarg import predict_target
from rs3.targetfeat import (add_target_columns,
                            get_aa_subseq_df,
                            get_protein_domain_features,
                            get_conservation_features)

design_df = pd.read_table('test_data/sgrna-designs.txt')
design_df.head()
```
Throughout the analysis we will be using a core set of ID columns to merge the feature matrices. These ID columns should uniquely identify an sgRNA and its target site.
```python
id_cols = ['sgRNA Context Sequence', 'Target Cut Length', 'Target Transcript', 'Orientation']
```
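Because these ID columns serve as merge keys, it is worth verifying that they really are unique per sgRNA; a duplicated key would fan out rows during the merges below. A minimal check in plain Python (operating on rows as dicts, e.g. from `df.to_dict('records')`):

```python
from collections import Counter

id_cols = ['sgRNA Context Sequence', 'Target Cut Length',
           'Target Transcript', 'Orientation']

def find_duplicate_keys(rows, id_cols):
    """Return the ID-column tuples that appear more than once."""
    counts = Counter(tuple(row[col] for col in id_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Two illustrative design rows (hypothetical values).
rows = [
    {'sgRNA Context Sequence': 'A' * 30, 'Target Cut Length': 100,
     'Target Transcript': 'ENST0001.1', 'Orientation': 'sense'},
    {'sgRNA Context Sequence': 'C' * 30, 'Target Cut Length': 200,
     'Target Transcript': 'ENST0001.1', 'Orientation': 'sense'},
]
print(find_duplicate_keys(rows, id_cols))  # []
```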
## Amino acid sequence input
To calculate the amino acid sequence matrix, you must first load the complete sequences from Ensembl using the `build_transcript_aa_seq_df` function. See the documentation for the `predicttarg` module for an example of how to use this function.
In this example we will use amino acid sequences that were precalculated using the `write_transcript_data` function in the `targetdata` module. Check out the documentation for that module for more information on how to use this function.
We use pyarrow to read the written transcript data. The stored transcripts are indexed by their Ensembl ID without the version number identifier. To get this shortened form of the Ensembl ID, use the `add_target_columns` function from the `targetfeat` module. This function adds a 'Transcript Base' column as well as a column with the amino acid index of the cut site ('AA Index'), which will be used for merging with the amino acid translations.
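The two derived columns can be sketched in plain Python. This is a simplified stand-in for `add_target_columns`, not its actual implementation: the transcript base strips the version suffix from the Ensembl ID, and the amino acid index is taken here as the 1-based codon containing the cut-site nucleotide (an assumption for illustration; see `rs3.targetfeat` for the exact rule):

```python
def transcript_base(ensembl_id):
    # Drop the version suffix: 'ENST00000380152.7' -> 'ENST00000380152'
    return ensembl_id.split('.')[0]

def aa_index(target_cut_length):
    # 1-based codon containing the nucleotide at 'Target Cut Length'
    # (illustrative assumption, not rs3's verified formula)
    return (target_cut_length - 1) // 3 + 1

print(transcript_base('ENST00000380152.7'))  # ENST00000380152
print(aa_index(100))  # 34
```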
```python
design_targ_df = add_target_columns(design_df)
design_targ_df.head()

transcript_bases = design_targ_df['Transcript Base'].unique()
transcript_bases[0:5]
```
```python
aa_seq_df = pd.read_parquet('test_data/target_data/aa_seqs.pq', engine='pyarrow',
                            filters=[[('Transcript Base', 'in', transcript_bases)]])
aa_seq_df.head()
```
From the complete transcript translations, we extract an amino acid subsequence as input to our model. The subsequence is centered around the amino acid encoded by the nucleotide preceding the cut site in the direction of transcription. This is the nucleotide that corresponds to the 'Target Cut Length' in a CRISPick design file. We take 16 amino acids on either side of the cut site for a total sequence length of 33.
The `get_aa_subseq_df` function from the `targetfeat` module will calculate these subsequences from the complete amino acid sequences.
```python
aa_subseq_df = get_aa_subseq_df(sg_designs=design_targ_df, aa_seq_df=aa_seq_df, width=16,
                                id_cols=id_cols)
aa_subseq_df.head()
```
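The windowing described above can be sketched in plain Python. This is a simplified stand-in for `get_aa_subseq_df`; the `'-'` padding for windows that run past the ends of the protein is an assumption for illustration:

```python
def aa_subseq(aa_seq, aa_index, width=16, pad='-'):
    """Take `width` residues on either side of the 1-based cut-site codon
    (33 total for width=16), padding when the window overruns the protein."""
    padded = pad * width + aa_seq + pad * width
    center = aa_index - 1 + width  # position of the cut-site codon in padded coords
    return padded[center - width:center + width + 1]

seq = 'MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ'  # a 33-aa example protein
sub = aa_subseq(seq, aa_index=17)
print(len(sub))  # 33
print(sub)       # the window exactly covers this 33-aa example
```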
## Lite Scores
You now have all the information you need to calculate "lite" Target Scores, which are less data intensive than complete target scores, using the `predict_target` function from the `predicttarg` module.
```python
lite_predictions = predict_target(design_df=design_df,
                                  aa_subseq_df=aa_subseq_df)
design_df['Target Score Lite'] = lite_predictions
design_df.head()
```
If you would like to calculate full target scores then follow the sections below.
## Protein domain input
To calculate full target scores you will also need inputs for protein domains and conservation.
The protein domain input should have 16 binary columns, one for each of 16 protein domain sources, in addition to the `id_cols`. The protein domain sources are 'Pfam', 'PANTHER', 'HAMAP', 'SuperFamily', 'TIGRfam', 'ncoils', 'Gene3D', 'Prosite_patterns', 'Seg', 'SignalP', 'TMHMM', 'MobiDBLite', 'PIRSF', 'PRINTS', 'Smart', 'Prosite_profiles'. These columns must be kept in this order when input for scoring.
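The wide-form layout can be illustrated in plain Python (hypothetical data; `get_protein_domain_features` does the real work): each sgRNA row gets a 0/1 flag per source indicating whether a domain from that source overlaps the cut site.

```python
# The 16 protein domain sources, in the fixed column order the model expects.
DOMAIN_SOURCES = ['Pfam', 'PANTHER', 'HAMAP', 'SuperFamily', 'TIGRfam',
                  'ncoils', 'Gene3D', 'Prosite_patterns', 'Seg', 'SignalP',
                  'TMHMM', 'MobiDBLite', 'PIRSF', 'PRINTS', 'Smart',
                  'Prosite_profiles']

def domain_indicator_row(hit_sources):
    """Map the set of sources with a domain hit at the cut site to an
    ordered 0/1 feature vector (dicts preserve insertion order)."""
    return {source: int(source in hit_sources) for source in DOMAIN_SOURCES}

# Hypothetical sgRNA whose cut site overlaps a Pfam and a Gene3D domain.
row = domain_indicator_row({'Pfam', 'Gene3D'})
print(sum(row.values()))  # 2
```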
In this example we will load the protein domain information from a parquet file, which was written using the `write_transcript_data` function in the `targetdata` module. You can also query transcript data on the fly using the `build_translation_overlap_df` function. See the documentation for the `predicttarg` module for more information on how to do this.
```python
domain_df = pd.read_parquet('test_data/target_data/protein_domains.pq', engine='pyarrow',
                            filters=[[('Transcript Base', 'in', transcript_bases)]])
domain_df.head()
```
To transform `domain_df` into a wide form for model input, we use the `get_protein_domain_features` function from the `targetfeat` module.
```python
domain_feature_df = get_protein_domain_features(design_targ_df, domain_df, id_cols=id_cols)
domain_feature_df.head()
```
For input into the `predict_target` function, `domain_feature_df` should have the `id_cols` as well as a column for each of the 16 protein domain features.
## Conservation input
Finally, for the full target model you need to calculate conservation features. The conservation features represent conservation across evolutionary time at the sgRNA cut site and are quantified using PhyloP scores. These scores are available for download from the UCSC Genome Browser for hg38 (phyloP100way) and mm39 (phyloP35way). Within this package we query conservation scores using the UCSC Genome Browser's REST API.
To get conservation scores, you can use the `build_conservation_df` function from the `targetdata` module. Here we load conservation scores that were written to parquet using the `write_conservation_data` function from the `targetdata` module.
```python
conservation_df = pd.read_parquet('test_data/target_data/conservation.pq', engine='pyarrow',
                                  filters=[[('Transcript Base', 'in', transcript_bases)]])
conservation_df.head()
```
We normalize conservation scores to a within-gene percent rank, in the 'ranked_conservation' column, in order to make scores comparable across genes and genomes. Note that a rank of 0 indicates the least conserved nucleotide and a rank of 1 indicates the most conserved.
To featurize the conservation scores, we average across a window of 4 and 32 nucleotides centered around the nucleotide preceding the cut site in the direction of transcription. Note that this nucleotide is the 2nd nucleotide in the window of 4 and the 16th nucleotide in the window of 32.
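Both steps can be sketched in plain Python. This is a simplified stand-in for the rs3 implementation (tie handling, edge behavior, and the exact rank formula should be taken from `targetfeat`; note that `get_conservation_features` below takes half-widths of 2 and 16, corresponding to the full windows of 4 and 32 used here):

```python
def percent_rank(scores):
    """Min-max scaled rank: 0 for the least conserved position, 1 for the
    most conserved (ties broken by position order in this sketch)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, i in enumerate(order):
        ranks[i] = rank / (len(scores) - 1)
    return ranks

def window_mean(values, pos, width):
    """Average `width` values, placing the 0-based position `pos` as the
    (width // 2)-th element of the window, i.e. 2nd of 4 or 16th of 32."""
    start = pos - (width // 2 - 1)
    return sum(values[start:start + width]) / width

cons = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]   # hypothetical PhyloP scores
ranked = percent_rank(cons)
print(window_mean(ranked, pos=2, width=4))
```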
We use the `get_conservation_features` function from the `targetfeat` module to get these features from `conservation_df`. For the `predict_target` function, we need the `id_cols` and the columns 'cons_4' and 'cons_32' in `conservation_feature_df`.
```python
conservation_feature_df = get_conservation_features(design_targ_df, conservation_df,
                                                    small_width=2, large_width=16,
                                                    conservation_column='ranked_conservation',
                                                    id_cols=id_cols)
conservation_feature_df
```
## Full Target Scores
To calculate full Target Scores, input the feature matrices and `design_df` to the `predict_target` function from the `predicttarg` module.
```python
target_predictions = predict_target(design_df=design_df,
                                    aa_subseq_df=aa_subseq_df,
                                    domain_feature_df=domain_feature_df,
                                    conservation_feature_df=conservation_feature_df,
                                    id_cols=id_cols)
design_df['Target Score'] = target_predictions
design_df.head()
```
```python
from rs3.predict import predict
import matplotlib.pyplot as plt
import gpplot
import seaborn as sns
```
## Preloaded data
In this first example with the `predict` function, we calculate predictions for GeckoV2 sgRNAs.
In this example the amino acid sequences, protein domains, and conservation scores were prequeried using the `write_transcript_data` and `write_conservation_data` functions from the `targetdata` module. Pre-querying these data can be helpful for large-scale design runs. You can also use the `predict` function without pre-querying and calculate scores on the fly; you can see an example of this in the next section.
The `predict` function allows for parallel computation when querying databases (`n_jobs_min`) and featurizing sgRNAs (`n_jobs_max`). We recommend keeping `n_jobs_min` set to 1 or 2, as the APIs limit the number of queries per hour.
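This recommendation can be captured in a small helper (not part of rs3; the names and the cap of 2 are illustrative assumptions): keep database queries conservative while using all CPUs for featurization.

```python
import multiprocessing

def choose_n_jobs(api_cap=2):
    """Pick conservative parallelism: cap database-query workers at
    `api_cap` (remote APIs rate-limit), use all CPUs for featurization."""
    n_cpus = multiprocessing.cpu_count()
    return {'n_jobs_min': min(api_cap, n_cpus),
            'n_jobs_max': n_cpus}

print(choose_n_jobs())
```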
```python
design_df = pd.read_table('test_data/sgrna-designs.txt')

import multiprocessing
max_n_jobs = multiprocessing.cpu_count()
```
```python
scored_designs = predict(design_df, tracr=['Hsu2013', 'Chen2013'], target=True,
                         n_jobs_min=2, n_jobs_max=max_n_jobs,
                         aa_seq_file='./test_data/target_data/aa_seqs.pq',
                         domain_file='./test_data/target_data/protein_domains.pq',
                         conservatin_file='./test_data/target_data/conservation.pq',
                         lite=False)
scored_designs.head()
```
Here are the details for the keyword arguments of the above function:

- `tracr` - tracr to calculate scores for. If a list is supplied instead of a string, scores will be calculated for both tracrs
- `target` - boolean indicating whether to calculate target scores
- `n_jobs_min`, `n_jobs_max` - number of CPUs to use for parallel computation
- `aa_seq_file`, `domain_file`, `conservatin_file` - precalculated parquet files. These are optional inputs, as the features can also be calculated on the fly
- `lite` - boolean indicating whether to calculate lite target scores
By listing both tracrRNAs (`tracr=['Hsu2013', 'Chen2013']`) and setting `target=True`, we calculate 5 unique scores: one sequence score for each tracr, the target score, and each sequence score plus the target score.
We can compare these predictions against the observed activity from GeckoV2.

```python
gecko_activity = pd.read_csv('test_data/Aguirre2016_activity.csv')
gecko_activity.head()
```
```python
gecko_activity_scores = (gecko_activity.merge(scored_designs,
                                              how='inner',
                                              on=['sgRNA Sequence', 'sgRNA Context Sequence',
                                                  'Target Gene Symbol', 'Target Cut %']))
gecko_activity_scores.head()
```
Since GeckoV2 was screened with the tracrRNA from Hsu et al. 2013, we'll use the Hsu2013 sequence scores as part of our final prediction.
```python
plt.subplots(figsize=(4, 4))
gpplot.point_densityplot(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                         x='RS3 Sequence (Hsu2013 tracr) + Target Score')
gpplot.add_correlation(gecko_activity_scores, y='avg_mean_centered_neg_lfc',
                       x='RS3 Sequence (Hsu2013 tracr) + Target Score')
sns.despine()
```
```python
design_df = pd.read_table('test_data/sgrna-designs_BCL2L1_MCL1_EEF2.txt')
scored_designs = predict(design_df,
                         tracr=['Hsu2013', 'Chen2013'], target=True,
                         n_jobs_min=2, n_jobs_max=8,
                         lite=False)
scored_designs
```
We see that the `predict` function queries the target data in addition to making predictions.