Core functions for analyzing CRISPR screens
import seaborn as sns
import gpplot
import matplotlib.pyplot as plt

Log-normalizing Data

lognorm[source]

lognorm(reads)

Standardize read counts by calculating reads per million, adding a pseudo-count of one, and taking the log2

reads: numpy or pandas array | returns: numpy or pandas array

lognorm_columns[source]

lognorm_columns(reads_df, columns)

Calculate lognorms for specified columns

reads_df: dataframe | columns: list | returns: lognorm dataframe

To demonstrate the functionality of this module we'll use read counts from Sanson et al. 2018.

supp_reads = 'https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-07901-8/MediaObjects/41467_2018_7901_MOESM4_ESM.xlsx'
read_counts = pd.read_excel(supp_reads,
                            sheet_name = 'A375_orig_tracr raw reads', 
                            header = None,
                            skiprows = 3, 
                            names = ['sgRNA Sequence', 'pDNA', 'A375_RepA', 'A375_RepB'], 
                            engine='openpyxl')

read_counts
sgRNA Sequence pDNA A375_RepA A375_RepB
0 AAAAAAAATCCGGACAATGG 522 729 774
1 AAAAAAAGGATGGTGATCAA 511 1484 1393
2 AAAAAAATGACATTACTGCA 467 375 603
3 AAAAAAATGTCAGTCGAGTG 200 737 506
4 AAAAAACACAAGCAAGACCG 286 672 352
... ... ... ... ...
77436 TTTGTTTGGATTGCTTCAAG 514 2878 303
77437 TTTTACCTTGTTCACATGGA 828 3896 3367
77438 TTTTGACTCTAATCACCGGT 588 2725 2498
77439 TTTTTAATACAAGGTAATCT 503 2209 1785
77440 TTTTTCTCACCCGATGAATC 661 3198 3189

77441 rows × 4 columns

Note that the read counts have a long right tail

sns.kdeplot(read_counts['pDNA'])
<AxesSubplot:xlabel='pDNA', ylabel='Density'>

To mitigate this tail and account for sequencing biases we log-normalize read counts by taking $$\log_2 \bigg ( \frac{\text{reads}}{\text{total reads in condition}} * 10^6 + 1 \bigg )$$

lognorms = lognorm_columns(reads_df=read_counts, columns=['pDNA', 'A375_RepA', 'A375_RepB'])
sns.kdeplot(lognorms['pDNA'])
<AxesSubplot:xlabel='pDNA', ylabel='Density'>

Filter pDNA

filter_pdna[source]

filter_pdna(lognorm_df, pdna_cols, z_low=-3, z_high=None)

Filter the lognorm dataframe based on the z_scored lognorms of pDNA

lognorm_df: dataframe | pdna_cols: list | z_low: int or None, lower range of z-scores to filter | z_high: int or None, upper range of z-scores to filter | returns: filtered dataframe

Note that there's a small number of sgRNAs with low pDNA abundance. We'll filter out any sgRNAs with lognormed pDNA less than the mean minus 3 standard deviations.

filtered_lognorms = filter_pdna(lognorm_df=lognorms, pdna_cols=['pDNA'])
lognorms.shape[0] - filtered_lognorms.shape[0]
576
sns.kdeplot(filtered_lognorms['pDNA'])
<AxesSubplot:xlabel='pDNA', ylabel='Density'>
assert ((filtered_lognorms['pDNA'] - lognorms['pDNA'].mean())/lognorms['pDNA'].std()).min() > -3

Calculate log-fold changes

calculate_lfcs[source]

calculate_lfcs(lognorm_df, ref_col=None, target_cols=None, ref_map=None)

Calculate log-fold changes between reference column and target columns

lognorm_df: dataframe | ref_col: str or None, if str then target_col must also be present | target_col: list or None | ref_map: dict or None, key-value pairs correspond to target-reference conditions | returns: dataframe with log-fold changes

We'll calculate log-fold changes from the pDNA reference. Since we only have one reference condition we can supply the calculate_lfcs function with the ref_col and target_cols arguments.

lfc_df = calculate_lfcs(lognorm_df=filtered_lognorms, ref_col='pDNA', target_cols=['A375_RepA', 'A375_RepB'])

Alternatively we can supply a ref_map, which is useful when we have more than one reference condition

lfc_df2 = calculate_lfcs(filtered_lognorms, ref_map = {'A375_RepA': 'pDNA', 'A375_RepB': 'pDNA'})
assert lfc_df.equals(lfc_df2)

We can use seaborn's clustermap function to plot a clustermap of replicate correlations

sns.clustermap(lfc_df.corr(), vmin= 0, cmap= 'viridis', annot= True, figsize= (4,4))
<seaborn.matrix.ClusterGrid at 0x7ffc9ba0a780>

Since we only have two conditions it's also easy to visualize replicates as a point densityplot using gpplot

plt.subplots(figsize=(4,4))
gpplot.point_densityplot(data=lfc_df, x='A375_RepA', y='A375_RepB')
gpplot.add_correlation(data=lfc_df, x='A375_RepA', y='A375_RepB')
sns.despine()

Average log-fold changes across replicates

get_condition[source]

get_condition(condition_name, sep, condition_indices)

Split replicate name from condition name

average_replicate_lfcs[source]

average_replicate_lfcs(lfcs, guide_col, condition_indices=None, sep=None, condition_map=None, lfc_cols=None, condition_name='condition', lfc_name='avg_lfc')

Average log-fold changes of sgRNAs across replicates

lfcs: dataframe | guide_col: str, sgrna column name | condition_indices: list of int or None, specifies which elements to use for conditions after separating column names with sep | sep: str or None, separator in column names | condition_map: dict or None, alternative to supplying condition_indices and sep. Keys are column names and values are condition names | lfc_cols: list or None, lfc column(s) to melt. If None use all columns that are not guide_col | condition_name: str, name of condition columns | lfc_name: str, name of new column with log-fold changes | returns: dataframe of average lfcs

Since we see a strong correlation, we'll average the log-fold change of each sgRNA across replicates

avg_replicate_lfc_df = average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence', condition_indices=[0], 
                                              sep='_')
assert np.abs(avg_replicate_lfc_df.shape[0] - read_counts.shape[0]) < 1000

We can alternatively combine replicates by supplying a condition_map

avg_replicate_lfc_df_condition_map = average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence',
                                                            condition_map={'A375_RepA': 'A375',
                                                                           'A375_RepB': 'A375'})
pd.testing.assert_frame_equal(avg_replicate_lfc_df_condition_map, avg_replicate_lfc_df)

If we don't want to join replicates, we don't have to supply these arguments

distinct_replicate_lfc_df = average_replicate_lfcs(lfcs=lfc_df, guide_col='sgRNA Sequence')

assert distinct_replicate_lfc_df.shape[0] == avg_replicate_lfc_df.shape[0]*2

Group control genes

group_pseudogenes[source]

group_pseudogenes(annotations, pseudogene_size, gene_col, control_regex, seed=7)

Remap annotations dataframe such that control genes are grouped into pseudo-genes

annotations: dataframe | pseudogene_size: int | gene_col: str | control_regex: list of str, regular expressions to identify groups of pseudogenes | seed: int, random seed for reproducible outputs | returns: dataframe of annotations with controls grouped into pseudogenes |

Before combining sgRNAs at the gene level, it's sometimes helpful to group controls into pseudo-genes so they're easier to compare with target genes

guide_annotations = pd.read_excel(supp_reads,
                                  sheet_name='sgRNA annotations', 
                                  engine='openpyxl')
guide_annotations
sgRNA Sequence Annotated Gene Symbol Annotated Gene ID
0 AAAAAAAATCCGGACAATGG SLC25A24 29957
1 AAAAAAAGGATGGTGATCAA FASTKD3 79072
2 AAAAAAATGACATTACTGCA BCAS2 10286
3 AAAAAAATGTCAGTCGAGTG GPR18 2841
4 AAAAAACACAAGCAAGACCG ZNF470 388566
... ... ... ...
77436 TTTGTTTGGATTGCTTCAAG NEPRO 25871
77437 TTTTACCTTGTTCACATGGA NO_CURRENT_997 NO_CURRENT_997
77438 TTTTGACTCTAATCACCGGT NO_CURRENT_998 NO_CURRENT_998
77439 TTTTTAATACAAGGTAATCT NO_CURRENT_999 NO_CURRENT_999
77440 TTTTTCTCACCCGATGAATC NO_CURRENT_1000 NO_CURRENT_1000

77441 rows × 3 columns

remapped_annotations = group_pseudogenes(annotations=guide_annotations, pseudogene_size=4, 
                                         gene_col='Annotated Gene Symbol', 
                                         control_regex=['NO_CURRENT'])
control_genes = remapped_annotations.loc[remapped_annotations['Annotated Gene Symbol'].str.contains('_'), 'Annotated Gene Symbol']
assert control_genes.value_counts().max() == 4
assert control_genes.value_counts().min() == 4
assert len(control_genes) == 1000
remapped_annotations
sgRNA Sequence Annotated Gene Symbol Annotated Gene ID
0 AAAAAAAATCCGGACAATGG SLC25A24 29957
1 AAAAAAAGGATGGTGATCAA FASTKD3 79072
2 AAAAAAATGACATTACTGCA BCAS2 10286
3 AAAAAAATGTCAGTCGAGTG GPR18 2841
4 AAAAAACACAAGCAAGACCG ZNF470 388566
... ... ... ...
77436 TTTGTTTGGATTGCTTCAAG NEPRO 25871
77437 TTTTACCTTGTTCACATGGA NO_CURRENT_181 NO_CURRENT_997
77438 TTTTGACTCTAATCACCGGT NO_CURRENT_63 NO_CURRENT_998
77439 TTTTTAATACAAGGTAATCT NO_CURRENT_154 NO_CURRENT_999
77440 TTTTTCTCACCCGATGAATC NO_CURRENT_112 NO_CURRENT_1000

77441 rows × 3 columns

Scale guide-level scores

We provide two methods for scaling log-fold change values to controls:

  1. Z-score from a set of negative controls
  2. Scale scores between a set of negative and positive controls

For both scoring methods, you can input either a regex or a list of genes to define control sets

get_annotated_lfcs[source]

get_annotated_lfcs(lfcs, annotations, merge_on=None, lfc_merge_on=None, annotation_merge_on=None)

Merge lfcs and annotatsion

lfcs: dataframe, log-fold changes annotations: dataframe, sgRNA annotations merge_on: str or None, name of sgRNA column to merge on. Must be present in both lfc and annotation dataframes if supplied, otherwise supply unique merge column to each | lfc_merge_on: str or None, name of sgRNA column to merge on | annotation_merge_on: str or None, name of sgRNA column to merge on

get_control_lfcs[source]

get_control_lfcs(annotated_lfcs, controls, gene_col)

Get lfcs from control populations

annotated_lfcs: dataframe, merged lfcs and annotations | controls: str or list of str or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | gene_col: str, column which uniquely identifies a gene |

get_neg_ctl_z_score[source]

get_neg_ctl_z_score(lfcs, negative_controls, gene_col, condition_col, lfc_col)

Z-score lfcs by negative control population

lfcs: dataframe, merged lfcs and annotations | negative_controls: str or list of str or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | gene_col: str, column which uniquely identifies a gene | condition_col: str, column which uniquely identifies experimental conditions | lfc_col: str, name of value to average | returns: dataframe, gene lfcs with column for z-score

scale_lfcs[source]

scale_lfcs(lfcs, negative_controls, positive_controls, gene_col, condition_col, lfc_col, pos_control_direction)

Scale lfcs between negative and positive control populations

lfcs: dataframe, merged lfcs and annotations | negative_controls: str or list of str or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | positive_controls: str or list of str of None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | gene_col: str, column which uniquely identifies a gene | condition_col: str, column which uniquely identifies experimental conditions | lfc_col: str, name of value to average | pos_control_direction: int, direction in which positive controls score | returns: dataframe, gene lfcs with column for z-score and z-score p-values

nonessential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/non-essential-genes-Hart2014.txt', 
                                    names=['gene'])
                      .gene)
essential_genes = (pd.read_table('https://raw.githubusercontent.com/gpp-rnd/genesets/master/human/essential-genes-Hart2015.txt', 
                                 names=['gene'])
                   .gene)

annotated_lfcs = get_annotated_lfcs(avg_replicate_lfc_df, remapped_annotations, merge_on='sgRNA Sequence')
assert annotated_lfcs.shape[0] == avg_replicate_lfc_df.shape[0]

no_current_controls = get_control_lfcs(annotated_lfcs, 'NO_CURRENT', 'Annotated Gene Symbol')
assert abs(no_current_controls.shape[0] - 1000) < 20
nonessential_controls = get_control_lfcs(annotated_lfcs, nonessential_genes, 'Annotated Gene Symbol')
assert abs(nonessential_controls.shape[0] - len(nonessential_genes) * 4) < 200

z_scored_lfcs = get_neg_ctl_z_score(annotated_lfcs, nonessential_genes, 'Annotated Gene Symbol', 'condition', 'avg_lfc')
control_z_scores = z_scored_lfcs.loc[z_scored_lfcs['Annotated Gene Symbol'].isin(nonessential_genes), 'z_scored_avg_lfc']
assert np.abs(control_z_scores.mean()) < 0.001
assert np.abs(control_z_scores.std() - 1) < 0.001

scaled_lfcs = scale_lfcs(annotated_lfcs, nonessential_genes, essential_genes, 'Annotated Gene Symbol', 'condition', 'avg_lfc', -1)
pos_scaled_lfcs = scaled_lfcs[scaled_lfcs['Annotated Gene Symbol'].isin(essential_genes)]
assert pos_scaled_lfcs['control_scaled_avg_lfc'].median() == -1
neg_scaled_lfcs = scaled_lfcs[scaled_lfcs['Annotated Gene Symbol'].isin(nonessential_genes)]
assert neg_scaled_lfcs['control_scaled_avg_lfc'].median() == 0
assert pos_scaled_lfcs['probability_positive_control'].mean() > neg_scaled_lfcs['probability_positive_control'].mean()

All of these functions are wrapped together in annotate_guide_lfcs

annotate_guide_lfcs[source]

annotate_guide_lfcs(lfcs, annotations, gene_col, condition_col='condition', lfc_col='avg_lfc', merge_on=None, lfc_merge_on=None, annotation_merge_on=None, z_score_neg_ctls=False, z_score_neg_ctl_genes=None, scale_ctls=False, scale_neg_ctl_genes=None, scale_pos_ctl_genes=None, pos_control_direction=-1)

Join guide log-fold changes with annotations, and optionally calculate z-scores and scaled scores

lfcs: dataframe | annotations: dataframe, mapping between sgRNAs and genes | gene_col: str, column which uniquely identifies a gene | condition_col: str, column which uniquely identifies experimental conditions | lfc_col: str, name of value to average | merge_on: str or None, name of sgRNA column to merge on. Must be present in both lfc and annotation dataframes if supplied, otherwise supply unique merge column to each | lfc_merge_on: str or None, name of sgRNA column to merge on | annotation_merge_on: str or None, name of sgRNA column to merge on. Present in annotation dataframe if supplied | z_score_neg_ctls: bool, z-score log-fold changes relative to negative controls | z_score_neg_ctl_genes: str, list of str, or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col. If None and z_score_neg_ctls is True, then uses all sgRNAs as controls | scale_ctls: bool, scale-lfcs by positive and negative controls | scale_neg_ctl_genes: str or list of str or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | scale_pos_ctl_genes: str or list of str or None, if string then interpreted as a regex otherwise interpreted as a list of gene names found in gene_col | pos_control_direction: int, direction in which positive controls score | returns: dataframe, lfcs and (optionally) z-scores averaged by gene

annot_guide_lfcs = annotate_guide_lfcs(avg_replicate_lfc_df, remapped_annotations, 'Annotated Gene Symbol', 
                                                     merge_on='sgRNA Sequence',
                                                     z_score_neg_ctls=True, z_score_neg_ctl_genes=nonessential_genes, 
                                                     scale_ctls=True, scale_neg_ctl_genes=nonessential_genes, 
                                                     scale_pos_ctl_genes=essential_genes, pos_control_direction=-1)
annot_guide_lfcs.sort_values('z_scored_avg_lfc')
sgRNA Sequence condition avg_lfc n_obs Annotated Gene Symbol Annotated Gene ID z_scored_avg_lfc control_scaled_avg_lfc probability_positive_control
6837 ACCAGATCCACATCAAACCG A375 -4.094542 2 FBL 2091 -8.416034 -2.472776 0.999965
49866 GGAATCTCACTCTGTCACCC A375 -3.903209 2 GVQW2 100507462 -8.029763 -2.361037 0.999940
31346 CGATAGACGCGTGCACCACG A375 -3.897174 2 EIF6 3692 -8.017578 -2.357512 0.999939
5407 ACACAGCATCTCCACTACCC A375 -3.818617 2 SPDYE6 729597 -7.858985 -2.311635 0.999925
5566 ACACGGAGTAGCTCTCCTTG A375 -3.815685 2 HIST1H2BN 8341 -7.853065 -2.309923 0.999924
... ... ... ... ... ... ... ... ... ...
13820 AGGCTGTTGGTAAGTATACA A375 1.879858 2 ZDHHC13 54503 3.645351 1.016289 0.002021
17810 ATGATGGCAAAGTAGATGTG A375 2.068461 2 TAOK1 57551 4.026111 1.126433 0.001203
7842 ACGGCCACCTCACAAAGTAA A375 2.093036 2 MAP4K4 9448 4.075724 1.140785 0.001125
65173 TCCTGGTCCGAAGCCGAGTG A375 2.214092 2 MED19 219541 4.320117 1.211482 0.000806
51228 GGATATTCTCTGATGCTCAG A375 2.463255 2 SUPT20H 55578 4.823139 1.356994 0.000406

76865 rows × 9 columns

plot_lfcs = annot_guide_lfcs.copy()
plot_lfcs['neg_control'] = plot_lfcs['Annotated Gene Symbol'].isin(nonessential_genes)
plot_lfcs['nlog10_pos_ctrl'] = -np.log10(plot_lfcs['probability_positive_control'])
plot_lfcs = plot_lfcs.sort_values('neg_control')
plt.subplots(figsize=(4,4))
sns.scatterplot(data=plot_lfcs, x='z_scored_avg_lfc', 
                y='control_scaled_avg_lfc', hue='neg_control', alpha=0.3)
sns.despine()

Aggreagate log-fold changes at the gene level

To aggregate scores at the gene level, we specify columns to average, and columns to z-score

aggregate_gene_lfcs[source]

aggregate_gene_lfcs(lfcs, gene_col, condition_col='condition', average_cols=None, zscore_cols=None)

Aggregate log-fold changes at the gene level

lfcs: dataframe, scores with annotations | gene_col: str, column which uniquely identifies a gene | condition_col: str, column which uniquely identifies experimental conditions | average_cols: list of str, columns to average zscore_cols: list of str, columns to z-score --> will also calculate p-value and FDR for each column returns: dataframe, lfcs aggregated at the gene level

get_hypergeometric_pval[source]

get_hypergeometric_pval(df, name)

Calculates the p-value for a gene using the hypergeometric test

df: dataframe, scores with annotations per gene returns: str of ascending constuct ranks, str of descending construct ranks and -log10(avg. p-value) for gene

apply_parallel[source]

apply_parallel(df, func)

Parallelizes calculation of hypergeometric p-values

df: dataframe, df grouped by condition, gene symbol func: name of function to call return: tuple, str of ascending constuct ranks, str of descending construct ranks and -log10(avg. p-value) for gene

aggregate_gene_lfcs_hypergeometric[source]

aggregate_gene_lfcs_hypergeometric(lfcs, gene_col, condition_col='condition', average_cols=None)

Aggregate log-fold changes at the gene level and calculate p-values using the hypergeometric test

lfcs: dataframe, scores with annotations | gene_col: str, column which uniquely identifies a gene | condition_col: str, column which uniquely identifies experimental conditions | average_cols: list of str, columns to average returns: dataframe, lfcs aggregated at the gene level

gene_lfcs = aggregate_gene_lfcs(annot_guide_lfcs, 'Annotated Gene Symbol', 
                                average_cols=['avg_lfc', 'control_scaled_avg_lfc', 'probability_positive_control'], 
                                zscore_cols=['z_scored_avg_lfc'])
assert np.abs(gene_lfcs.shape[0] - 20000) < 1000
# Do we score about the expect number of essential genes?
assert np.abs((gene_lfcs['probability_positive_control'] > 0.5).sum()/20000 - 0.2) < 0.1
assert gene_lfcs.sort_values('probability_positive_control', ascending=False).head(5)['Annotated Gene Symbol'].isin(essential_genes).sum() > 2
assert gene_lfcs[gene_lfcs.z_scored_avg_lfc < 0].sort_values('z_scored_avg_lfc_fdr', ascending=True).head(5)['Annotated Gene Symbol'].isin(essential_genes).sum() > 2
# Do the two scoring mechanisms agree
gene_lfcs_copy = gene_lfcs.copy()
gene_lfcs_copy['z_essential'] = (gene_lfcs_copy['z_scored_avg_lfc_fdr'] < 0.1) & (gene_lfcs_copy['z_scored_avg_lfc'] < 0)
gene_lfcs_copy['p_essential'] = (gene_lfcs_copy['probability_positive_control'] > 0.5)
scoring_overlap = (gene_lfcs_copy[['z_essential', 'p_essential']].value_counts()
                   .reset_index()
                   .rename({0: 'n'}, axis=1))
agree = scoring_overlap.loc[(scoring_overlap['z_essential'] == scoring_overlap['p_essential']), 'n'].sum()
disagree = scoring_overlap.loc[(scoring_overlap['z_essential'] != scoring_overlap['p_essential']), 'n'].sum()
accuracy = agree/(agree + disagree)
assert accuracy > 0.95
# Are controls centered around 0
control_z_scores = gene_lfcs.loc[gene_lfcs['Annotated Gene Symbol'].isin(nonessential_genes), 'z_scored_avg_lfc']
assert np.abs(control_z_scores.mean()) < 0.1
assert np.abs(control_z_scores.std() - 1) < 0.3
gene_lfcs.sort_values('avg_lfc')
condition Annotated Gene Symbol n_guides avg_lfc control_scaled_avg_lfc probability_positive_control z_scored_avg_lfc z_scored_avg_lfc_p_value z_scored_avg_lfc_fdr
13317 A375 PSMG3 4 -3.165679 -1.930318 0.999157 -13.081609 4.194682e-39 8.122163e-35
7375 A375 HSPA5 4 -3.021085 -1.845874 0.999113 -12.497781 7.676386e-36 7.431893e-32
4927 A375 EIF6 4 -3.009939 -1.839365 0.998650 -12.452779 1.350592e-35 8.717173e-32
14219 A375 RPL19 4 -2.993390 -1.829701 0.998627 -12.385961 3.113338e-35 1.507089e-31
12774 A375 POLR2L 4 -2.970395 -1.816271 0.998484 -12.293110 9.863629e-35 3.819789e-31
... ... ... ... ... ... ... ... ... ...
16365 A375 TAF5L 4 1.259290 0.653875 0.014899 4.785041 1.709528e-06 1.784453e-05
16386 A375 TAOK1 4 1.286787 0.669934 0.047783 4.896065 9.777464e-07 1.043091e-05
16367 A375 TAF6L 4 1.322374 0.690717 0.010756 5.039754 4.661311e-07 5.096384e-06
16219 A375 SUPT20H 4 1.417919 0.746515 0.031551 5.425536 5.778076e-08 6.838684e-07
9313 A375 MAP4K4 4 1.527641 0.810593 0.013910 5.868560 4.395965e-09 5.850108e-08

19363 rows × 9 columns

We see that our negative control population, which are the nonessential genes in this case, are grouped around 0

plot_lfcs = gene_lfcs.copy()
plot_lfcs['nlog10_fdr'] = -np.log10(plot_lfcs['z_scored_avg_lfc_fdr'])
plot_lfcs['neg_control'] = plot_lfcs['Annotated Gene Symbol'].isin(nonessential_genes)
plot_lfcs = plot_lfcs.sort_values('neg_control')
plt.subplots(figsize=(4,4))
sns.scatterplot(data=plot_lfcs, x='z_scored_avg_lfc', 
                y='nlog10_fdr', hue='neg_control', alpha=0.3)
sns.despine()

We can also calculate the gene level p-values using the hypergeometric test as follows:

gene_lfcs_hyp = aggregate_gene_lfcs_hypergeometric(annot_guide_lfcs, 'Annotated Gene Symbol',
                                average_cols=['avg_lfc'])
gene_lfcs_hyp
condition Annotated Gene Symbol n_guides avg_lfc avg_lfc_ascending_construct_ranks avg_lfc_descending_construct_ranks avg_lfc_hypergeometric_test_avg_pval
0 A375 A1BG 3 -0.084621 23319;35845;46354 30512;41021;53547 0.490678
1 A375 A1CF 4 0.500314 51763;62213;71912;74278 2588;4954;14653;25103 1.542492
2 A375 A2M 4 0.288629 42669;47983;54255;73458 3408;22611;28883;34197 0.920590
3 A375 A2ML1 4 -0.481786 10645;17323;22150;24754 52112;54716;59543;66221 1.080913
4 A375 A3GALT2 4 0.059362 12609;55370;57476;61651 15215;19390;21496;64257 0.642799
... ... ... ... ... ... ... ...
19358 A375 ZYG11A 4 -0.280137 18076;24707;24879;32720 44146;51987;52159;58790 0.860380
19359 A375 ZYG11B 4 0.215013 25652;40027;61085;73812 3054;15781;36839;51214 0.748303
19360 A375 ZYX 4 -0.182864 14031;29692;40133;41865 35001;36733;47174;62835 0.623025
19361 A375 ZZEF1 4 0.104225 25206;36473;53298;67325 9541;23568;40393;51660 0.573432
19362 A375 ZZZ3 4 -0.743157 3628;11128;19396;47666 29200;57470;65738;73238 0.992930

19363 rows × 7 columns

plot_lfcs_hyp = gene_lfcs_hyp.copy()
plot_lfcs_hyp = plot_lfcs_hyp[(plot_lfcs_hyp.n_guides>=3)&(plot_lfcs_hyp.n_guides<=5)]
hyp_ctrls = plot_lfcs_hyp[plot_lfcs_hyp['Annotated Gene Symbol'].isin(nonessential_genes)]
plt.subplots(figsize=(4,4))
sns.scatterplot(data=plot_lfcs_hyp, x='avg_lfc',
                y='avg_lfc_hypergeometric_test_avg_pval')
sns.scatterplot(data=hyp_ctrls, x='avg_lfc',
                y='avg_lfc_hypergeometric_test_avg_pval')
sns.despine()

Calculate ROC-AUC between essential and non-essential genes

get_roc_aucs[source]

get_roc_aucs(lfcs, tp_genes, fp_genes, gene_col, score_col=None, condition_col=None, condition_list=None, pos_control_direction=-1)

Calculate the ROC-AUC between true positive and false positive gene sets. Must specificy score_col (and group_col optionally) or conditions

lfcs: dataframe | tp_genes: list-like or str, true positive genes | fp_genes: list-like or str, false positive genes | gene_col: str | score_col: str, column for ranking genes from smallest to largest. condition_col: str or list, columns to use for grouping genes. condition_list: list, columns for which to calculate ROC-AUCs pos_control_direction: int, direction in which positive controls score | returns: dataframe of roc_aucs

Finally, to evaluate the quality each screen, we'll calculate the ROC-AUC between essential and nonessential genes for each condition

We can calculate ROC-AUCs three different ways:

1) We can specify a condition_col and score_col

roc_aucs, roc_df = get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol', 
                                condition_col='condition', score_col='avg_lfc')
assert roc_aucs['ROC-AUC'][0] > 0.9
roc_aucs
condition ROC-AUC
0 A375 0.976175
plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', hue='condition', ci=None)
sns.despine()

We can also specify tp_genes and fp_genes using regular expressions. This is useful for distinguishing between different control types defined by a regex.

roc_aucs, _ = get_roc_aucs(lfcs=gene_lfcs, tp_genes='RPS', fp_genes='CD', gene_col='Annotated Gene Symbol',
                                condition_col='condition', score_col='avg_lfc')
assert roc_aucs['ROC-AUC'][0] > 0.8

2) Since we only have one condition we can also just supply the score_col

roc_aucs, roc_df = get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol', 
                        score_col='avg_lfc')
roc_aucs
0.976174694978823
plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', ci=None)
sns.despine()

Finally, we can specify condition_list as a list of columns we want to calculate AUCs for

roc_aucs, roc_df = get_roc_aucs(lfcs=gene_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                condition_list=['avg_lfc'])
roc_aucs
condition ROC-AUC
0 avg_lfc 0.976175

This is a useful approach when we want to quantify ROC-AUC at the guide level say

annotated_guide_lfcs = lfc_df.merge(guide_annotations, how='inner', on='sgRNA Sequence')
roc_aucs, roc_df = get_roc_aucs(lfcs=annotated_guide_lfcs, tp_genes=essential_genes, fp_genes=nonessential_genes, gene_col='Annotated Gene Symbol',
                                condition_list=['A375_RepA', 'A375_RepB'])
roc_aucs
condition ROC-AUC
0 A375_RepA 0.918505
1 A375_RepB 0.917600
plt.subplots(figsize=(4,4))
sns.lineplot(data=roc_df, x='fpr', y='tpr', hue='condition', ci=None)
sns.despine()

We can see this visually as well that essentials deplete the most followed by nonessentials and control genes

plt.subplots(figsize=(4,4))
sns.kdeplot(gene_lfcs.loc[gene_lfcs['Annotated Gene Symbol'].isin(essential_genes), 'z_scored_avg_lfc'], label='essential')
sns.kdeplot(gene_lfcs.loc[gene_lfcs['Annotated Gene Symbol'].isin(nonessential_genes), 'z_scored_avg_lfc'], label='non-essential')
sns.kdeplot(gene_lfcs.loc[gene_lfcs['Annotated Gene Symbol'].str.contains('_'), 'z_scored_avg_lfc'], label='control')
sns.despine()
plt.legend()
<matplotlib.legend.Legend at 0x7ffc9da5c9b0>