Designing CRISPR sgRNAs in Python

Dan E. Webster
Published in Quilt
5 min read, Oct 20, 2017

Selecting the most effective single guide RNA (sgRNA) is critical for any CRISPR experiment. While there are many tools to design sgRNAs from scratch, why not leverage pre-designed, optimized, genome-wide libraries from leaders in the field of functional genomics?

In this article we’ll demonstrate how to target almost any human gene or regulatory element.

We’ve curated a dozen of the latest human sgRNA libraries from Addgene into an open data package, danWebster/sgRNAs, that anyone can install and use. In addition to the 12 genic libraries, we’ve included a custom library that targets regulatory elements identified by the ENCODE project. Each of these libraries is uniformly formatted and searchable by gene name or by genomic location.

Our goal in curating this data is to make it easier to conduct CRISPR experiments. With comprehensive, curated sgRNA data available directly in Python, you’ll spend less time editing source code, and more time editing the genome.

In the past you may have cross-checked multiple genome-wide libraries to find the sgRNA closest to your amino acid or regulatory element. Now all of the libraries are in one place.

Figure 1 — The documentation page for the danWebster/sgRNAs data package.

Technical Prerequisites

This article assumes basic familiarity with Python. We’ll teach you a bit of pandas as we go. We recommend, but do not require, Jupyter as an easy and interactive method to run your Python code.

Search for sgRNAs by gene name

To access the sgRNA libraries you’ll need the Quilt package manager.

$ pip install quilt

You can find the full source code from this article on GitHub. (If you’re a Jupyter user and would like to start with a pre-populated Notebook, you can git clone our git repository.)

We’re now ready to fire up Python and install danWebster/sgRNAs, which contains our data.

import pandas as pd
import numpy as np
import quilt
quilt.install("danWebster/sgRNAs/libraries", force=True)

A concrete example

Suppose that you’d like to design sgRNAs for the hypothetical treatment of sickle cell anemia. Sickle cell anemia arises from a single base pair change in the gene that encodes hemoglobin beta (HBB). To repair this mutation we’ll select an sgRNA that guides the CRISPR machinery to HBB in the human genome. Here’s how to search across a dozen genic libraries for the sgRNAs that target HBB:

from quilt.data.danWebster import sgRNAs
# Concatenate all libraries into one DataFrame with ()
all_labs = sgRNAs.libraries()
# Enter gene names in a list
search_terms = ['HBB']
# Convert list to a DataFrame with one column
search = pd.DataFrame({'term': pd.Series(search_terms)})
# Merge search and all_labs DataFrames to see results
result = search.merge(all_labs, left_on='term', right_on='targetGene')
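To see what the merge above does without downloading the libraries, here is a minimal sketch on a toy table. Only the targetGene column name comes from the data package; the library and sgRNA columns, and every row value, are placeholders invented for illustration.

```python
import pandas as pd

# Toy stand-in for all_labs. The 'library' and 'sgRNA' columns and all
# values are invented placeholders, not real library contents.
toy_labs = pd.DataFrame({
    "targetGene": ["HBB", "HBB", "BCL11A", "TP53"],
    "library": ["libA", "libB", "libA", "libB"],
    "sgRNA": ["A" * 20, "C" * 20, "G" * 20, "T" * 20],
})

# Several gene names can be searched at once
search_terms = ["HBB", "BCL11A"]
search = pd.DataFrame({"term": search_terms})

# The default inner merge keeps only library rows whose targetGene
# matches one of the search terms
result = search.merge(toy_labs, left_on="term", right_on="targetGene")

# Count how many sgRNAs were found per gene
hits_per_gene = result.groupby("term").size()
print(hits_per_gene)
```

Because the merge is an inner join, a search term with no matching sgRNAs simply drops out of the result rather than raising an error.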

Our search returns more than 30 sgRNAs that target HBB. For our hypothetical treatment of sickle cell anemia, we don’t want to cut haphazardly in the HBB gene and destroy its function. Instead we wish to edit or repair the gene and cut as close as possible to the affected base pair. We’ll therefore search by genomic coordinates to find the sgRNA that’s closest to the point mutation that causes sickle cell anemia.

Search for sgRNAs by genomic coordinates

The single nucleotide polymorphism (SNP) responsible for sickle cell anemia changes the 6th amino acid in HBB from glutamic acid (E) to valine (V), hence the nomenclature “HBB E6V”. (HBB E6V is equivalent to the dbSNP identifier “rs334”, shown below in the UCSC Genome Browser.) We want to get as close as possible to that base pair in the genome, chr11:5248232–5248232 (hg19). We’ll extend the search range both upstream and downstream, because a CRISPR cut doesn’t have to land on exactly the same base pair. We’ll use the extended range shown below in the UCSC Genome Browser: chr11:5,248,103–5,248,345.

Here’s how we search based on genomic location:

# Enter the search ranges into a DataFrame
searchloc = pd.DataFrame([{'chr': 'chr11', 'start': 5248103, 'stop': 5248345}])
# To improve query speed, partition the sgRNA libraries by chromosome
crispr_chrs = all_labs.groupby('chr_hg19')
# Match search locations to groups on the same chromosome
matches = []
for idx, a_row in searchloc.iterrows():
    # Skip chromosomes absent from the libraries (get_group would raise KeyError)
    if a_row['chr'] not in crispr_chrs.groups:
        continue
    # Get all sgRNAs on this chromosome
    chr_grp = crispr_chrs.get_group(a_row['chr'])
    # Keep sgRNAs whose interval intersects the search interval
    a_match = chr_grp.loc[(chr_grp['start_hg19'] < a_row['stop'])
                          & (a_row['start'] <= chr_grp['stop_hg19'])]
    if len(a_match.index) > 0:
        matches.append(a_match)
# Combine the results into one DataFrame (empty if nothing matched)
allmatches = pd.concat(matches) if matches else pd.DataFrame()
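Since the goal is the sgRNA closest to the point mutation, one possible next step is to rank the matches by their distance to the SNP. The sketch below is an illustration on invented coordinates, not part of the data package. It assumes start_hg19 is never greater than stop_hg19, and it measures distance to the sgRNA interval as a whole rather than to the exact cut site (Cas9 typically cuts about 3 bp upstream of the PAM).

```python
import pandas as pd

SNP_POS = 5248232  # rs334 position (hg19) quoted in the text above

# Toy stand-in for allmatches; these coordinates are invented.
toy_matches = pd.DataFrame({
    "start_hg19": [5248110, 5248220, 5248300],
    "stop_hg19":  [5248130, 5248240, 5248320],
})

def distance_to_position(df, pos):
    """Distance in bp from pos to each interval; 0 if pos falls inside it."""
    gap_before = df["start_hg19"] - pos   # positive when the interval starts after pos
    gap_after = pos - df["stop_hg19"]     # positive when the interval ends before pos
    # At most one gap is positive; both are negative when pos is inside
    return pd.concat([gap_before, gap_after], axis=1).max(axis=1).clip(lower=0)

ranked = toy_matches.assign(dist_to_snp=distance_to_position(toy_matches, SNP_POS))
ranked = ranked.sort_values("dist_to_snp").reset_index(drop=True)
print(ranked)
```

The first row of ranked is the candidate whose interval lies nearest to (or contains) the SNP.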

sgRNAs between the genes

Lastly, some inside baseball for fans of epigenomics. If you wish to use CRISPR to activate or disrupt the epigenome at regulatory elements “between the genes”, the following code will search a custom library of sgRNAs targeting all of the DNaseI Hypersensitive Sites (DHS) from the ENCODE project. For example, sickle cell anemia phenotypes have been alleviated by disrupting the BCL11A enhancer. (For details on the construction of this epigenomic CRISPR library, see CRISPR Between the Genes.) Here’s how to perform a search within the BCL11A regulatory element region:

# Grab the (large) dhs sub-package
quilt.install("danWebster/sgRNAs/dhs", force=True)
# Pull the custom library into a DataFrame
dhs = sgRNAs.dhs.encode()
# Enter the search terms into a DataFrame
searchloc = pd.DataFrame([{'chr': 'chr2', 'start': 60721359, 'stop': 60723401}])
# Group by chromosome for performance
dhs_chrs = dhs.groupby('chr_hg19')
# Gather range matches: try the genic libraries first (crispr_chrs, defined
# in the previous search), then fall back to the DHS library
matches = []
for idx, a_row in searchloc.iterrows():
    chr_grp = crispr_chrs.get_group(a_row['chr'])
    a_match = chr_grp.loc[(chr_grp['start_hg19'] < a_row['stop'])
                          & (a_row['start'] <= chr_grp['stop_hg19'])]
    if len(a_match.index) > 0:
        matches.append(a_match)
    else:
        dhs_grp = dhs_chrs.get_group(a_row['chr'])
        dhs_match = dhs_grp.loc[(dhs_grp['start_hg19'] < a_row['stop'])
                                & (a_row['start'] <= dhs_grp['stop_hg19'])]
        if len(dhs_match.index) > 0:
            matches.append(dhs_match)
# Combine the results into one DataFrame (empty if nothing matched)
allmatches = pd.concat(matches) if matches else pd.DataFrame()
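The genic and DHS searches repeat the same interval test, so it can be convenient to wrap it in a helper. The function below, find_overlaps, is a hypothetical name and is not part of the data package. It is a sketch that reuses the comparisons from the loops above and assumes the library DataFrame has the chr_hg19, start_hg19, and stop_hg19 columns. The toy coordinates are invented.

```python
import pandas as pd

def find_overlaps(library, chrom, start, stop):
    """Return rows of `library` on `chrom` that intersect the search range.

    Hypothetical helper; uses the same comparisons as the loops above.
    """
    on_chrom = library["chr_hg19"] == chrom
    intersects = (library["start_hg19"] < stop) & (start <= library["stop_hg19"])
    return library.loc[on_chrom & intersects]

# Toy stand-in for a library; all coordinates are invented.
toy_library = pd.DataFrame({
    "chr_hg19":   ["chr2", "chr2", "chr11"],
    "start_hg19": [60721400, 60800000, 5248200],
    "stop_hg19":  [60721420, 60800020, 5248220],
})

# Search the BCL11A enhancer range used in the text
bcl11a_hits = find_overlaps(toy_library, "chr2", 60721359, 60723401)

# A chromosome that is absent from the table yields an empty DataFrame
no_hits = find_overlaps(toy_library, "chrX", 1, 1000)
print(bcl11a_hits)
```

A boolean mask over the whole table is simpler than grouping by chromosome and never raises a KeyError for a missing chromosome. For many queries against a large library, the groupby approach shown in the article may still be faster.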

Choosing the predicted best sgRNA

We often find multiple sgRNAs from a single range-based search. One way to choose the “best” sgRNA is to apply a scoring algorithm that predicts CRISPR cutting efficiency. The leading algorithm was developed by David Root, John Doench, and colleagues. A web-based implementation of that algorithm can be found here. The minimal input to this algorithm is a 30mer sequence: the 20mer sgRNA plus flanking sequence on either side. We’ve therefore included the flanking 30mer in the data package.
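Before submitting 30mers for scoring, it can help to sanity-check them. The sketch below assumes the layout commonly described for this kind of on-target scoring: 4 nt upstream, the 20-nt sgRNA target, a 3-nt PAM, then 3 nt downstream. That layout is an assumption here and should be checked against the data package’s own 30mer column. The function names and the example sequence are invented for illustration.

```python
def split_30mer(context):
    """Split a 30-nt context into (upstream, protospacer, pam, downstream).

    Assumed layout: 4 nt + 20 nt + 3 nt + 3 nt. Verify against your data.
    """
    if len(context) != 30:
        raise ValueError("expected a 30-nt sequence, got %d nt" % len(context))
    context = context.upper()
    return context[:4], context[4:24], context[24:27], context[27:]

def has_ngg_pam(context):
    """True if the assumed PAM position holds an NGG motif (SpCas9)."""
    pam = split_30mer(context)[2]
    return pam[1:] == "GG"

# Invented example sequence, not taken from any library
example = "TTTT" + "ACGTACGTACGTACGTACGT" + "AGG" + "CCC"
upstream, protospacer, pam, downstream = split_30mer(example)
print(protospacer, pam, has_ngg_pam(example))
```

A 30mer that fails the PAM check under this layout may be stored on the opposite strand or use a different flank convention, which is worth resolving before trusting any score.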

Conclusion

Hopefully this article gives you a running start to design your own CRISPR experiments in Python. We welcome your comments and questions. Thanks for reading.
