Get underlying data to calculate Rule Set 3 target scores
design_df = pd.read_table('test_data/sgrna-designs.txt')
max_n_jobs = multiprocessing.cpu_count()

Get amino acid sequences

ensembl_post[source]

ensembl_post(ext, data, headers=None, params=None)

Generic wrapper for using POST requests to the ensembl rest API

:param ext: str, url extension :param data: dict, query data :param headers: dict or None, meta-information for query :param params: dict or None, parameters for query :return: Response object

chunks[source]

chunks(lst, n)

Yield successive n-sized chunks from lst.

lst: list n: int

returns: generator of list chunks

post_transcript_sequence_chunk[source]

post_transcript_sequence_chunk(ids, params, headers)

Helper function for post_transcript_sequence

:param ids: list :param params: dict :param headers: dict :return: dict

post_transcript_sequence[source]

post_transcript_sequence(ensembl_ids, seq_type='protein', max_queries=50, n_jobs=1, **kwargs)

Request multiple types of sequence by stable identifier. Supports feature masking and expand options. Uses https://rest.ensembl.org/documentation/info/sequence_id_post

:param ensembl_ids: list of str :param seq_type: str, one of [genomic, cds, cdna, protein] :param max_queries: int, maximum number of queries for post :param n_jobs: int, number of jobs to run in parallel :param kwargs: additional parameter arguments :return: list, dict of sequences 5' to 3' in the same orientation as the input transcript

assert(post_transcript_sequence(["ENSG00000157764", "ENSG00000248378"],
                                seq_type='genomic')[0]['seq'][:4] == 'GTCT')
100%|██████████| 1/1 [00:01<00:00,  1.69s/it]

build_transcript_aa_seq_df[source]

build_transcript_aa_seq_df(design_df, transcript_id_col='Target Transcript', transcript_len_col='Target Total Length', n_jobs=1)

Get amino acid sequence for transcripts of interest

:param design_df: DataFrame :param transcript_id_col: str, column with ensembl transcript id :param transcript_len_col: str, column with length of transcript :param n_jobs: int, number of jobs to use to query transcripts :return: DataFrame

transcript_aa_seq_df = build_transcript_aa_seq_df(design_df, n_jobs=2)
transcript_aa_seq_df
Getting amino acid sequences
100%|██████████| 4/4 [00:02<00:00,  1.54it/s]
/var/folders/n9/c397fj3n5f16khr7tp76tbpc0000gn/T/ipykernel_37465/1237391668.py:23: UserWarning: Unable to find translations for the following transcripts: ENST00000629399, ENST00000284049
  warnings.warn('Unable to find translations for the following transcripts: ' + ', '.join(missing_seqs))
Target Transcript Target Total Length Transcript Base version desc molecule id seq AA len
0 ENST00000259457.8 834 ENST00000259457 3 None protein ENSP00000259457 MAAVSVYAPPVGGFSFDNCRRNAVLEADFAKRGYKLPKVRKTGTTI... 277
1 ENST00000394249.8 1863 ENST00000394249 3 None protein ENSP00000377793 MRRSEVLAEESIVCLQKALNHLREIWELIGIPEDQRLQRTEVVKKH... 620
2 ENST00000361337.3 2298 ENST00000361337 2 None protein ENSP00000354522 MSGDHLHNDSQIEADFRLNDSHKHKDKHKDREHRHKEHKKEKDREK... 765
3 ENST00000368328.5 267 ENST00000368328 4 None protein ENSP00000357311 MALSTIVSQRKQIKRKAPRGFLKRVFKRKKPQLRLEKSGDLLVHLN... 88
4 ENST00000610426.5 783 ENST00000610426 1 None protein ENSP00000483484 MPQNEYIELHRKRYGYRLDYHEKKRKKESREAHERSKKAKKMIGLK... 260
... ... ... ... ... ... ... ... ... ...
193 ENST00000339374.11 1371 ENST00000339374 6 None protein ENSP00000343041 MAAALKCLLTLGRWCPGLGVAPQARALAALVPGVTQVDNKSGFLQK... 456
194 ENST00000393047.8 882 ENST00000393047 3 None protein ENSP00000376767 MSRIPLGKVLLRNVIRHTDAHNKIQEESDMWKIRELEKQMEDAYRG... 293
195 ENST00000454402.7 1023 ENST00000454402 2 None protein ENSP00000408295 METSALKQQEQPAATKIRNLPWVEKYRPQTLNDLISHQDILSTIQK... 340
196 ENST00000254998.3 423 ENST00000254998 2 None protein ENSP00000254998 MASVDFKTYVDQACRAAEEFVNVYYTTMDKRRRLLSRLYMGTATLV... 140
197 ENST00000381685.10 2067 ENST00000381685 5 None protein ENSP00000371101 MQVSSLNEVKIYSLSCGKSLPEWLSDRKKRALQKKDVDVRRRIELI... 688

198 rows × 9 columns

Get protein domains

ensembl_get[source]

ensembl_get(ext, query=None, headers=None, params=None)

Generic wrapper for using GET requests to the ensembl rest API

ext: str, url extension | query: str or None, end of url extension specifying species, taxon, esnembl_id etc | headers: dict or None, meta-information for query | params: dict or None, parameters for query |

returns: Response object

get_translation_overlap[source]

get_translation_overlap(ensembl_id)

Get features that overlap with translation, such as protein domains

:param ensembl_id: str :return: DataFrame

brca1_overlap = get_translation_overlap('ENSP00000350283')
assert 'BRCT domain' in pd.DataFrame(brca1_overlap)['description'].to_list()

build_translation_overlap_df[source]

build_translation_overlap_df(protein_ids, n_jobs=1)

Get protein domain information

:param protein_ids: list of str, ensemble protein IDs :param n_jobs: int :return: DataFrame

translation_overlap_df = build_translation_overlap_df(transcript_aa_seq_df['id'],
                                                      n_jobs=2)
translation_overlap_df
Getting protein domains
100%|██████████| 198/198 [00:49<00:00,  4.00it/s]
cigar_string align_type hit_start start description hit_end hseqname seq_region_name interpro type feature_type Transcript Base translation_id end id
0 None 3 3 Proteasome B-type subunit 234 PTHR32194 ENSP00000259457 IPR023333 PANTHER protein_feature ENST00000259457 1109135 237 PTHR32194
1 None 1 44 Via SIFTS (2025/02/17) UniProt protein Q99436 ... 220 4r3o.I ENSP00000259457 sifts protein_feature ENST00000259457 1109135 263 4r3o.I
2 None 1 44 Via SIFTS (2025/02/17) UniProt protein Q99436 ... 220 4r3o.W ENSP00000259457 sifts protein_feature ENST00000259457 1109135 263 4r3o.W
3 None 1 44 Via SIFTS (2025/02/17) UniProt protein Q99436 ... 220 4r67.I ENSP00000259457 sifts protein_feature ENST00000259457 1109135 263 4r67.I
4 None 1 44 Via SIFTS (2025/02/17) UniProt protein Q99436 ... 220 4r67.W ENSP00000259457 sifts protein_feature ENST00000259457 1109135 263 4r67.W
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7675 None 0 558 Coil 0 Coil ENSP00000371101 ncoils protein_feature ENST00000381685 1303115 587 Coil
7676 None 0 516 Coil 0 Coil ENSP00000371101 ncoils protein_feature ENST00000381685 1303115 537 Coil
7677 None 0 426 Coil 0 Coil ENSP00000371101 ncoils protein_feature ENST00000381685 1303115 446 Coil
7678 None 1 1 Nucleolar protein 10/Enp2 655 PTHR14927 ENSP00000371101 IPR040382 PANTHER protein_feature ENST00000381685 1303115 684 PTHR14927
7679 None 0 42 WD40-repeat-containing domain superfamily 0 SSF50978 ENSP00000371101 IPR036322 SuperFamily protein_feature ENST00000381685 1303115 339 SSF50978

7680 rows × 15 columns

Store data using parquet for later use

write_transcript_data[source]

write_transcript_data(design_df, transcript_id_col='Target Transcript', transcript_len_col='Target Total Length', n_jobs=1, overwrite=True, filepath='./data/target_data/', aa_seq_name='aa_seqs.pq', protein_domain_name='protein_domains.pq')

Write amino acid sequences and protein domain information to parquet files

:param design_df: DataFrame :param transcript_id_col: str :param transcript_len_col: str :param n_jobs: int :param overwrite: bool, whether to overwrite existing file :param filepath: str, directory for output sequences :param aa_seq_name: str, name of amino acid sequence file :param protein_domain_name: str, name of protein domain file

write_transcript_data(design_df, n_jobs=2)
assert os.path.isfile('./data/target_data/' + 'aa_seqs.pq')
Getting amino acid sequences
100%|██████████| 4/4 [00:01<00:00,  2.11it/s]
/var/folders/n9/c397fj3n5f16khr7tp76tbpc0000gn/T/ipykernel_37465/1237391668.py:23: UserWarning: Unable to find translations for the following transcripts: ENST00000629399, ENST00000284049
  warnings.warn('Unable to find translations for the following transcripts: ' + ', '.join(missing_seqs))
Getting protein domains
100%|██████████| 198/198 [00:50<00:00,  3.95it/s]

Get Conservation Scores

Get PhyloP conservation scores

get_transcript_info[source]

get_transcript_info(base_transcript)

Using an ensembl transcript ID, get

:param base_transcript: str :return: (exon_df, trans_sr, chr) exon_df: DataFrame, with global exon start and end position trans_sr: Series, with global translation start and stop positions for CDS and translation length chr: str

exon_df, trans_sr, chr = get_transcript_info('ENST00000259457')
assert chr == '9'
assert trans_sr['length'] == 277 # number of aas
assert exon_df.shape[0] == 8 # eight exons

get_conservation[source]

get_conservation(chr, start, end, genome)

Get conservation scores for a given region of a genome

:param chr: str, chromosome number :param start: int :param end: int :param genome: str :return: DataFrame

rps20_exon_conservation = get_conservation('8', 56074060, 56074159, 'hg38')
cd274_exon_conservation = get_conservation('9', 5457079, 5457420, 'hg38')
assert rps20_exon_conservation['conservation'].mean() > cd274_exon_conservation['conservation'].mean()

get_exon_conservation[source]

get_exon_conservation(exon_df, chr, genome)

Get conservation scores for each exon

:param exon_df: DataFrame :param chr: str :param genome: str :return: DataFrame

get_transcript_conservation[source]

get_transcript_conservation(transcript_id, target_strand, genome)

Get conservation scores for a transcript

:param transcript_id: str :param target_strand: str, '+' or '-' :param genome: str :return: DataFrame

rps20_conservation_df = get_transcript_conservation('ENST00000009589', '-', 'hg38')
cd274_conservation_df = get_transcript_conservation('ENST00000381577', '+', 'hg38')
assert rps20_conservation_df['conservation'].mean() > cd274_conservation_df['conservation'].mean()

get_transcript_conservation_safe[source]

get_transcript_conservation_safe(transcript_id, target_strand, genome)

Helper function for parrallel query. Return None when conservation dataframe cannot be assembled

build_conservation_df[source]

build_conservation_df(design_df, n_jobs=1)

conservation_df = build_conservation_df(design_df, n_jobs=max_n_jobs)
assert (conservation_df.groupby('Target Transcript')
        .apply(lambda df: stats.spearmanr(df['conservation'], df['ranked_conservation'])[0])
        > 0.99).all()
assert (conservation_df.loc[(conservation_df['Transcript Base'] == 'ENST00000361337') &
                            (conservation_df['target position'].between(432*3, 663*3)),
                            'ranked_conservation'].mean() >  # pfam
        conservation_df.loc[(conservation_df['Transcript Base'] == 'ENST00000361337') &
                            (conservation_df['target position'].between(36*3, 199*3)),  # mobidblit
                            'ranked_conservation'].mean())
Getting conservation

 84%|████████▎ | 167/200 [05:05<01:00,  1.83s/it]
100%|██████████| 200/200 [05:20<00:00,  1.60s/it]
/var/folders/n9/c397fj3n5f16khr7tp76tbpc0000gn/T/ipykernel_37465/986071231.py:35: UserWarning: Failed to get conservation scores for 2 transcriptsENST00000629399, ENST00000284049
  warnings.warn('Failed to get conservation scores for ' + str(len(failed_list)) +
/var/folders/n9/c397fj3n5f16khr7tp76tbpc0000gn/T/ipykernel_37465/3931529888.py:2: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  assert (conservation_df.groupby('Target Transcript')

write_conservation_data[source]

write_conservation_data(design_df, n_jobs=1, overwrite=True, filepath='./data/target_data/', cons_file_name='conservation.pq')

Write conservation scores to parquet files

:param design_df: DataFrame :param n_jobs: int :param overwrite: bool, whether to overwrite existing file :param filepath: str, directory for output sequences :param cons_file_name: str, name of conservation file

write_conservation_data(design_df, n_jobs=max_n_jobs)
assert os.path.isfile('./data/target_data/' + 'conservation.pq')
Getting conservation
100%|██████████| 200/200 [06:05<00:00,  1.83s/it]
/var/folders/n9/c397fj3n5f16khr7tp76tbpc0000gn/T/ipykernel_37465/986071231.py:35: UserWarning: Failed to get conservation scores for 2 transcriptsENST00000629399, ENST00000284049
  warnings.warn('Failed to get conservation scores for ' + str(len(failed_list)) +