Get underlying data to calculate Rule Set 3 target scores
design_df = pd.read_table('test_data/sgrna-designs.txt')
max_n_jobs = multiprocessing.cpu_count()

Get amino acid sequences

ensembl_post[source]

ensembl_post(ext, data, headers=None, params=None)

Generic wrapper for using POST requests to the ensembl rest API

:param ext: str, url extension :param data: dict, query data :param headers: dict or None, meta-information for query :param params: dict or None, parameters for query :return: Response object

chunks[source]

chunks(lst, n)

Yield successive n-sized chunks from lst.

lst: list n: int

returns: generator of list chunks

post_transcript_sequence_chunk[source]

post_transcript_sequence_chunk(ids, params, headers)

Helper function for post_transcript_sequence

:param ids: list :param params: dict :param headers: dict :return: dict

post_transcript_sequence[source]

post_transcript_sequence(ensembl_ids, seq_type='protein', max_queries=50, n_jobs=1, **kwargs)

Request multiple types of sequence by stable identifier. Supports feature masking and expand options. Uses https://rest.ensembl.org/documentation/info/sequence_id_post

:param ensembl_ids: list of str :param seq_type: str, one of [genomic, cds, cdna, protein] :param max_queries: int, maximum number of queries for post :param n_jobs: int, number of jobs to run in parallel :param kwargs: additional parameter arguments :return: list, dict of sequences 5' to 3' in the same orientation as the input transcript

assert(post_transcript_sequence(["ENSG00000157764", "ENSG00000248378"],
                                seq_type='genomic')[0]['seq'][:4] == 'CTTC')
100%|██████████| 1/1 [00:01<00:00,  1.72s/it]

build_transcript_aa_seq_df[source]

build_transcript_aa_seq_df(design_df, transcript_id_col='Target Transcript', transcript_len_col='Target Total Length', n_jobs=1)

Get amino acid sequence for transcripts of interest

:param design_df: DataFrame :param transcript_id_col: str, column with ensembl transcript id :param transcript_len_col: str, column with length of transcript :param n_jobs: int, number of jobs to use to query transcripts :return: DataFrame

transcript_aa_seq_df = build_transcript_aa_seq_df(design_df, n_jobs=2)
transcript_aa_seq_df
Getting amino acid sequences
100%|██████████| 4/4 [00:02<00:00,  1.47it/s]
Target Transcript Target Total Length Transcript Base version desc molecule seq id AA len
0 ENST00000259457.8 834 ENST00000259457 3 None protein MAAVSVYAPPVGGFSFDNCRRNAVLEADFAKRGYKLPKVRKTGTTI... ENSP00000259457 277
1 ENST00000394249.8 1863 ENST00000394249 3 None protein MRRSEVLAEESIVCLQKALNHLREIWELIGIPEDQRLQRTEVVKKH... ENSP00000377793 620
2 ENST00000361337.3 2298 ENST00000361337 2 None protein MSGDHLHNDSQIEADFRLNDSHKHKDKHKDREHRHKEHKKEKDREK... ENSP00000354522 765
3 ENST00000368328.5 267 ENST00000368328 4 None protein MALSTIVSQRKQIKRKAPRGFLKRVFKRKKPQLRLEKSGDLLVHLN... ENSP00000357311 88
4 ENST00000610426.5 783 ENST00000610426 1 None protein MPQNEYIELHRKRYGYRLDYHEKKRKKESREAHERSKKAKKMIGLK... ENSP00000483484 260
... ... ... ... ... ... ... ... ... ...
195 ENST00000339374.11 1371 ENST00000339374 6 None protein MAAALKCLLTLGRWCPGLGVAPQARALAALVPGVTQVDNKSGFLQK... ENSP00000343041 456
196 ENST00000393047.8 882 ENST00000393047 3 None protein MSRIPLGKVLLRNVIRHTDAHNKIQEESDMWKIRELEKQMEDAYRG... ENSP00000376767 293
197 ENST00000454402.7 1023 ENST00000454402 2 None protein METSALKQQEQPAATKIRNLPWVEKYRPQTLNDLISHQDILSTIQK... ENSP00000408295 340
198 ENST00000254998.3 423 ENST00000254998 2 None protein MASVDFKTYVDQACRAAEEFVNVYYTTMDKRRRLLSRLYMGTATLV... ENSP00000254998 140
199 ENST00000381685.10 2067 ENST00000381685 5 None protein MQVSSLNEVKIYSLSCGKSLPEWLSDRKKRALQKKDVDVRRRIELI... ENSP00000371101 688

200 rows × 9 columns

Get protein domains

ensembl_get[source]

ensembl_get(ext, query=None, headers=None, params=None)

Generic wrapper for using GET requests to the ensembl rest API

ext: str, url extension | query: str or None, end of url extension specifying species, taxon, esnembl_id etc | headers: dict or None, meta-information for query | params: dict or None, parameters for query |

returns: Response object

get_translation_overlap[source]

get_translation_overlap(ensembl_id)

Get features that overlap with translation, such as protein domains

:param ensembl_id: str :return: DataFrame

brca1_overlap = get_translation_overlap('ENSP00000350283')
assert 'BRCT domain' in pd.DataFrame(brca1_overlap)['description'].to_list()

build_translation_overlap_df[source]

build_translation_overlap_df(protein_ids, n_jobs=1)

Get protein domain information

:param protein_ids: list of str, ensemble protein IDs :param n_jobs: int :return: DataFrame

translation_overlap_df = build_translation_overlap_df(transcript_aa_seq_df['id'],
                                                      n_jobs=2)
translation_overlap_df
Getting protein domains
100%|██████████| 200/200 [00:48<00:00,  4.09it/s]
hseqname translation_id interpro start Transcript Base align_type feature_type cigar_string id type hit_end end hit_start seq_region_name description
0 3.60.20.10 976188 IPR029055 44 ENST00000259457 None protein_feature 3.60.20.10 Gene3D 233 277 1 ENSP00000259457 Nucleophile aminohydrolases, N-terminal
1 PF12465 976188 IPR024689 235 ENST00000259457 None protein_feature PF12465 Pfam 36 271 1 ENSP00000259457 Proteasome beta subunit, C-terminal
2 PF00227 976188 IPR001353 41 ENST00000259457 None protein_feature PF00227 Pfam 190 221 2 ENSP00000259457 Proteasome, subunit alpha/beta
3 PR00141 976188 IPR000243 51 ENST00000259457 None protein_feature PR00141 PRINTS 0 66 0 ENSP00000259457 Peptidase T1A, proteasome beta-subunit
4 PR00141 976188 IPR000243 171 ENST00000259457 None protein_feature PR00141 PRINTS 0 182 0 ENSP00000259457 Peptidase T1A, proteasome beta-subunit
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5275 Coil 975385 516 ENST00000381685 None protein_feature Coil ncoils 0 537 0 ENSP00000371101
5276 Coil 975385 558 ENST00000381685 None protein_feature Coil ncoils 0 587 0 ENSP00000371101
5277 Coil 975385 640 ENST00000381685 None protein_feature Coil ncoils 0 669 0 ENSP00000371101
5278 PTHR14927 975385 IPR040382 1 ENST00000381685 None protein_feature PTHR14927 PANTHER 676 682 5 ENSP00000371101 Nucleolar protein 10/Enp2
5279 SSF50978 975385 IPR036322 42 ENST00000381685 None protein_feature SSF50978 SuperFamily 0 339 0 ENSP00000371101 WD40-repeat-containing domain superfamily

5280 rows × 15 columns

Store data using parquet for later use

write_transcript_data[source]

write_transcript_data(design_df, transcript_id_col='Target Transcript', transcript_len_col='Target Total Length', n_jobs=1, overwrite=True, filepath='./data/target_data/', aa_seq_name='aa_seqs.pq', protein_domain_name='protein_domains.pq')

Write amino acid sequences and protein domain information to parquet files

:param design_df: DataFrame :param transcript_id_col: str :param transcript_len_col: str :param n_jobs: int :param overwrite: bool, whether to overwrite existing file :param filepath: str, directory for output sequences :param aa_seq_name: str, name of amino acid sequence file :param protein_domain_name: str, name of protein domain file

write_transcript_data(design_df, n_jobs=2)
assert os.path.isfile('./data/target_data/' + 'aa_seqs.pq')
Getting amino acid sequences
100%|██████████| 4/4 [00:02<00:00,  1.42it/s]
Getting protein domains
100%|██████████| 200/200 [00:48<00:00,  4.13it/s]

Get Conservation Scores

Get PhyloP conservation scores

get_transcript_info[source]

get_transcript_info(base_transcript)

Using an ensembl transcript ID, get

:param base_transcript: str :return: (exon_df, trans_sr, chr) exon_df: DataFrame, with global exon start and end position trans_sr: Series, with global translation start and stop positions for CDS and translation length chr: str

exon_df, trans_sr, chr = get_transcript_info('ENST00000259457')
assert chr == '9'
assert trans_sr['length'] == 277 # number of aas
assert exon_df.shape[0] == 8 # eight exons

get_conservation[source]

get_conservation(chr, start, end, genome)

Get conservation scores for a given region of a genome

:param chr: str, chromosome number :param start: int :param end: int :param genome: str :return: DataFrame

rps20_exon_conservation = get_conservation('8', 56074060, 56074159, 'hg38')
cd274_exon_conservation = get_conservation('9', 5457079, 5457420, 'hg38')
assert rps20_exon_conservation['conservation'].mean() > cd274_exon_conservation['conservation'].mean()

get_exon_conservation[source]

get_exon_conservation(exon_df, chr, genome)

Get conservation scores for each exon

:param exon_df: DataFrame :param chr: str :param genome: str :return: DataFrame

get_transcript_conservation[source]

get_transcript_conservation(transcript_id, target_strand, genome)

Get conservation scores for a transcript

:param transcript_id: str :param target_strand: str, '+' or '-' :param genome: str :return: DataFrame

rps20_conservation_df = get_transcript_conservation('ENST00000009589', '-', 'hg38')
cd274_conservation_df = get_transcript_conservation('ENST00000381577', '+', 'hg38')
assert rps20_conservation_df['conservation'].mean() > cd274_conservation_df['conservation'].mean()

get_transcript_conservation_safe[source]

get_transcript_conservation_safe(transcript_id, target_strand, genome)

Helper function for parrallel query. Return None when conservation dataframe cannot be assembled

build_conservation_df[source]

build_conservation_df(design_df, n_jobs=1)

conservation_df = build_conservation_df(design_df, n_jobs=max_n_jobs)
assert (conservation_df.groupby('Target Transcript')
        .apply(lambda df: stats.spearmanr(df['conservation'], df['ranked_conservation'])[0])
        > 0.99).all()
assert (conservation_df.loc[(conservation_df['Transcript Base'] == 'ENST00000361337') &
                            (conservation_df['target position'].between(432*3, 663*3)),
                            'ranked_conservation'].mean() >  # pfam
        conservation_df.loc[(conservation_df['Transcript Base'] == 'ENST00000361337') &
                            (conservation_df['target position'].between(36*3, 199*3)),  # mobidblit
                            'ranked_conservation'].mean())
Getting conservation
100%|██████████| 200/200 [03:33<00:00,  1.07s/it]

write_conservation_data[source]

write_conservation_data(design_df, n_jobs=1, overwrite=True, filepath='./data/target_data/', cons_file_name='conservation.pq')

Write conservation scores to parquet files

:param design_df: DataFrame :param n_jobs: int :param overwrite: bool, whether to overwrite existing file :param filepath: str, directory for output sequences :param cons_file_name: str, name of conservation file

write_conservation_data(design_df, n_jobs=max_n_jobs)
assert os.path.isfile('./data/target_data/' + 'conservation.pq')
Getting conservation
100%|██████████| 200/200 [05:16<00:00,  1.58s/it]