1 Weill Cornell Medicine in Qatar, Qatar Foundation-Education City, Doha, Qatar
Corresponding author details:
Dietrich Büsselberg
Weill Cornell Medicine in Qatar
Qatar Foundation-Education City
Doha, Qatar
Copyright:
© 2018 Büsselberg D, et al. This
is an open-access article distributed under the
terms of the Creative Commons Attribution 4.0
international License, which permits unrestricted
use, distribution, and reproduction in any
medium, provided the original author and source
are credited.
Most cellular processes are influenced or directly mediated by protein-protein interactions (PPIs). Studying PPIs is therefore essential for understanding normal and pathological physiology within a cell. Understanding PPIs helps us to understand disease processes such as cancer and their mechanisms. Ideally, we would like to understand and study the entire network of protein interactions, which is referred to as the “interactome”. An interactome defines the full network of PPIs that take place within a cell [1].
Diverse methods have been used to identify interactomes, including proteomics methods [2]. However, the chemistry and technology of protein quantitation still pose substantial challenges. In contrast, DNA-sequencing technologies have long been robust, and in the past decade “next-generation” sequencing has revolutionized our ability to sequence vast amounts of genetic material quickly and accurately. Various methods have used DNA readouts to study PPIs. Of these, the two-hybrid system is probably the most frequently used [3]. The original system used yeast genes encoding a DNA-binding domain and an activation domain. One protein of interest (“X”) is N-terminally fused to the binding domain, while another protein (“Y”) is similarly attached to the activation domain. The binding domain binds to a specific DNA sequence, and the activation domain recruits transcription factors, which are necessary to initiate the transcription of a reporter gene. Only if the “X” and “Y” proteins interact, bringing the binding domain and the activation domain into proximity, will the reporter gene be transcribed (Figure 1).
The two-hybrid system has been used in the intervening decades, and it has two major variations, which are the “yeast” and the “bacterial” two-hybrid systems.
In 1997, Fromont-Racine and co-workers used the yeast two-hybrid system (Y2H) to identify protein interactions in yeast [4]. Thereafter, the Y2H model was used to screen for PPIs on a large scale in species such as Saccharomyces cerevisiae, Helicobacter pylori, Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens [5-11]. Three years after the first screen using the Y2H system, Joung and colleagues were the first to use a bacterial two-hybrid (B2H) system with a large library (~10^8 in size) [12]. The B2H system has two major advantages over the Y2H system: bacteria grow faster and have a higher transformation efficiency [13].
Initial experiments with the two-hybrid systems generally employed full-length genes.
The use of fragments rather than full-length libraries is perhaps counterintuitive: many protein structures and interactions are impossible with fragments, so fragment libraries might be expected to produce a high rate of false negatives. Surprisingly, researchers have found that gene fragment libraries yield fewer false-negative interactions than full-length genes, and that thorough screening reduces false-positive interactions [7,14]. When Boxem et al. used fragments rather than full-length genes, they recovered more physiological interactions and, in their limited set, found no false-positive interactions [14]. Presumably, this is because fragments of a protein may avoid the folding or translocation problems encountered by full-length proteins. Any false-positive interactions can also be eliminated with more thorough library screening [7].
The use of fragments has the added benefit of permitting more rapid screening, as one does not first have to build a library of all protein-coding genes with specific primers before testing pairs. Random fragmentation quickly converts cDNA into testable fragments. Finally, the use of fragments allows interactions to be localized to specific regions of proteins. Rather than knowing only that two proteins interact, we can define their interacting regions as well. With overlapping fragments, we can even identify the minimal interacting region.
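The idea of narrowing an interaction site with overlapping fragments can be sketched as an interval intersection. The following is a minimal illustration, not a method taken from the cited studies: the function name, the coordinates, and the assumption that every interacting fragment contains one shared contiguous site are all hypothetical.

```python
def minimal_interacting_region(fragments):
    """Return the (start, end) region shared by all interacting fragments.

    `fragments` is a list of (start, end) positions on the full-length
    protein (half-open intervals; hypothetical coordinates).
    Returns None if the fragments share no common region.
    """
    if not fragments:
        return None
    start = max(s for s, _ in fragments)  # latest start among fragments
    end = min(e for _, e in fragments)    # earliest end among fragments
    return (start, end) if start < end else None


# Three hypothetical fragments that all scored positive in a screen
hits = [(10, 220), (150, 400), (120, 260)]
print(minimal_interacting_region(hits))
```

If a protein carries several independent interaction sites, a single intersection is too strict, and the fragments would have to be grouped before intersecting.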
For a fragment to potentially demonstrate a physiological interaction, it must be an open reading frame (ORF) (Figure 2). An ORF is a DNA sequence without a stop codon, which therefore has the potential to encode a protein. If DNA is sheared into random fragments and these are inserted into a vector, the majority of gene fragments (83%) are out of frame: they do not correspond to the reading frame of any functional gene and therefore encode non-physiological proteins. Of the 6 possible frames (Figure 3), only 1 corresponds to the gene frame, which encodes the physiologically relevant protein. These fragments therefore need to be filtered in order to discard those that are non-physiological. This process of removing non-ORF sequences is termed “ORF filtering.”
All methods currently in use for ORF filtering rely on the underrepresentation of stop codons in truly coding sequences. A truly random DNA sequence will, on average, encode a stop codon every 21 triplets. Indeed, with only 63 codons, there is a 95% probability of having at least one stop codon if the sequence is random, and with fragments encoding 100 amino acids (300 bp), there is a 99% chance that a random sequence will have a stop codon. By contrast, a coding sequence of DNA will of course avoid stop codons until the end of the sequence has been reached. If we take fragments of cDNA just 300 bp long, but in random frames, then 5/6 (83.3%) will be in the wrong frame. Yet 99% of those will have a stop codon; if we can selectively eliminate fragments with stop codons, the in-frame percentage of the library will go from 16.7% up to 96%. This has the disadvantage of selecting against in-frame sequences that include the physiological stop codon, and thus the C-termini of proteins are expected to be underrepresented.
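The figures quoted above can be reproduced with a short calculation. This sketch assumes that 3 of the 64 codons are stop codons, that out-of-frame sequence behaves like random sequence with equal base frequencies, and that in-frame fragments never contain a stop codon; the function names are illustrative only.

```python
P_STOP = 3 / 64  # TAA, TAG and TGA out of 64 possible codons


def p_at_least_one_stop(n_codons):
    """Probability that a random sequence of n_codons has at least one stop."""
    return 1 - (1 - P_STOP) ** n_codons


def in_frame_fraction_after_filtering(n_codons, n_frames=6):
    """Fraction of stop-free fragments that are in the gene frame.

    Assumes 1 of n_frames is the gene frame (never rejected) and the
    remaining frames behave like random sequence.
    """
    in_frame = 1 / n_frames
    wrong_frame_survivors = (
        (n_frames - 1) / n_frames * (1 - p_at_least_one_stop(n_codons))
    )
    return in_frame / (in_frame + wrong_frame_survivors)


print(f"mean spacing between stops: {1 / P_STOP:.1f} codons")
print(f"P(stop) in 63 codons:  {p_at_least_one_stop(63):.3f}")
print(f"P(stop) in 100 codons: {p_at_least_one_stop(100):.3f}")
print(f"in-frame after filtering (300 bp): "
      f"{in_frame_fraction_after_filtering(100):.3f}")
```

Under these assumptions the enrichment depends strongly on fragment length: shorter fragments let more wrong-frame inserts escape the filter, because they have fewer codons in which a stop can occur.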
In order to filter and express ORFs, a sheared DNA fragment is cloned upstream of a selectable marker. If the fragment is an ORF, the selectable marker will be translated and expressed in those cells. The vast majority of DNA fragments that are out of frame will encode a premature stop codon, and consequently the selectable marker is not translated, resulting in a difference in selectability between ORF-containing cells and those without ORF fragments. For instance, if an antibiotic resistance gene is the selectable marker, then only the cells with ORF fragments will survive in the presence of the antibiotic (Figure 4) [15]. This method was adopted by Weinstock, G. M., and co-workers (1983), who inserted random fragments into a vector between the outer membrane protein (Omp) gene and the beta-galactosidase (LacZ) gene (which can be used for blue-white colony screening). The vector carries LacZ(-), the nonfunctional form of the gene [16]. However, inserting random fragments that realign the Omp and LacZ genes makes LacZ functional, LacZ(+), so that it is expressed in the colonies, which can then be selected by blue-white screening. Furthermore, libraries of random fragments were enriched from 54% to 100% ORFs by antibiotic selection [17,18]. Such selection therefore improves the quality of the library [19]. Moreover, this method is also capable of localizing the sites of interaction within the sequence [20,21].
Several organisms can be used for ORF filtering, including bacterial strains, viruses such as phages, and yeast. All these methods are based on a series of experimental steps: 1) isolating a gene, 2) shearing it into fragments, 3) amplifying by polymerase chain reaction (PCR), 4) ligating into a vector, 5) transfecting into cells, 6) applying selective pressure to the library, and 7) sequencing the targeted DNA.
a)
1 ATG ATT GAA CAA GAT GGA TTG CAC GCA GGT TCT CCG GCC GCT TGG GTG GAG AGG CTA TTC
61 GGC TAT GAC TGG GCA CAA CAG ACA ATC GGC TGC TCT GAT GCC GCC GTG TTC CGG CTG TCA
121 GCG CAG GGG CGC CCG GTT CTT TTT GTC AAG ACC GAC CTG TCC GGT GCC CTG AAT GAA CTG
181 CAA GAC GAG GCA GCG CGG CTA TCG TGG CTG GCC ACG ACG GGC GTT CCT TGC GCA GCT GTG
241 CTC GAC GTT GTC ACT GAA GCG GGA AGG GAC TGG CTG CTA TTG GGC GAA GTG CCG GGG CAG
301 GAT CTC CTG TCA TCT CAC CTT GCT CCT GCC GAG AAA GTA TCC ATC ATG GCT GAT GCA ATG
361 CGG CGG CTG CAT ACG CTT GAT CCG GCT ACC TGC CCA TTC GAC CAC CAA GCG AAA CAT CGC
421 ATC GAG CGA GCA CGT ACT CGG ATG GAA GCC GGT CTT GTC GAT CAG GAT GAT CTG GAC GAA
481 GAG CAT CAG GGG CTC GCG CCA GCC GAA CTG TTC GCC AGG CTC AAG GCG AGC ATG CCC GAC
541 GGC GAG GAT CTC GTC GTG ACC CAT GGC GAT GCC TGC TTG CCG AAT ATC ATG GTG GAA AAT
601 GGC CGC TTT TCT GGA TTC ATC GAC TGT GGC CGG CTG GGT GTG GCG GAC CGC TAT CAG GAC
661 ATA GCG TTG GCT ACC CGT GT ATT GCT GAA GAG CTT GGC GGC GAA TGG GCT GAC CGC TTC
721 CTC GTG CTT TAC GGT ATC GCC GCT CCC GAT TCG CAG CGC ATC GCC TTC TAT CGC CTT CTT
781 GAC GAG TTC TTC TGA
b)
Frame 1 GCT ATG ATT GAA…………TTC TTC TGA
Frame 2 TGC TAT GAT TGA……ATT CTT CTG A
Frame 3 CTG CTA TGA TTG…AAT TCT TCT GA
Frame -1 GCT AGT CTT CTT…………AAG TTA GTA
Frame -2 TGC TAG TCT TCT……TAA GTT AGT A
Frame -3 CTG CTA GTC TTC…TTA AGT TAG TA
Figure 3: a) Example of the kanamycin gene sequence written as triplets (using www.bioinformatics.org/sms2/group_dna.html), with the start codon ATG labeled red and the stop codon TGA labeled blue. b) How the frame can change when random gene fragments are inserted. In frame 1, the sequence expresses the physiologically relevant protein. In frame 2, expression is shifted by 1 base pair, leading to a non-physiological protein. The same problem arises in frame 3 (shifted by 2 base pairs). Each of these three frames can also be inserted in the wrong orientation, giving frames -1, -2, and -3, in which the gene is read in the opposite direction (i.e. from the end to the start).
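The six reading frames in Figure 3 can be enumerated programmatically. The sketch below makes two assumptions that differ slightly from the figure: frames are produced by offsetting into the insert rather than by adding vector bases in front of it, and the reverse frames are taken from the reverse complement, as a vector would read an insert ligated in the opposite orientation. The toy insert and all function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, offset):
    """Split seq into complete triplets, starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Map frame number (1, 2, 3, -1, -2, -3) to its list of codons."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = codons(seq, offset)
        frames[-(offset + 1)] = codons(rc, offset)
    return frames


def is_open(codon_list):
    """True if the frame contains no stop codon."""
    return not any(codon in STOP_CODONS for codon in codon_list)


# Toy insert: GCT followed by the first codons of the gene in Figure 3a
insert = "GCTATGATTGAACAAGATGGATTG"
for frame, triplets in sorted(six_frames(insert).items()):
    status = "open" if is_open(triplets) else "stop"
    print(f"frame {frame:+d}: {' '.join(triplets)}  [{status}]")
```

With an insert this short, a wrong frame can still be free of stop codons by chance, which is one reason fragment length matters for the stringency of ORF filtering.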
To identify ORFs in bacteria, a marker such as AmpR (a marker for ampicillin resistance) is used to test for the presence of ORFs. DNA fragments are inserted upstream of the AmpR gene and downstream of its leader sequence. The leader sequence allows export of the AmpR gene product to the periplasm (Figure 4), its site of action. Different antibiotics (e.g. chloramphenicol, kanamycin, spectinomycin, tetracycline) can be used as selectable markers [22]. Furthermore, some methods use more complex cloning that inserts sequences such as the Lox sequence, which is cleaved by the Cre recombinase [18,23]. This allows recombination of the ORF and formation of a fused DNA product with a tag gene. By flanking the ORFs with recombination elements, we can facilitate the isolation of ORFs for further studies and validation of ORF interactions [24,25].
The frame concept was first used to generate the MH3000 E. coli strain, which carries an out-of-frame β-galactosidase (LacZ) gene [16]. These researchers inserted fragments downstream of the OmpF gene and upstream of the out-of-frame LacZ gene. Inserts that changed the frame to generate a functional LacZ gene produced blue colonies in the presence of X-Gal. These blue colonies could be verified to contain functional proteins.
Davis and Benzer showed that ORF frequencies after selection depend on the antibiotic concentration, the library size, and the bacterial strain [26]. They found that 8% of the clones were in frame before selection, and that this fraction increased to 70% after selection. Furthermore, selection frequency varied with ORF library size. Both strains, XL1-Blue and DH10B, are capable of cloning larger fragments; however, XL1-Blue gave a higher transformation efficiency than DH10B. They concluded that for a smaller library a higher concentration of antibiotic gave better selection, while the opposite was true for large libraries. By chance, an ORF fragment can be inserted in the wrong orientation. To overcome this issue, Davis and co-workers used directional cloning with two different restriction enzymes to clone the ORFs into the expression vector [26]. Moreover, they used PCR primers to modify the kanamycin gene to introduce the sequence ATGA, which carries a stop codon (TGA) in its second reading frame. This ensured that clones would not survive if the reading frame started from the second nucleotide.
Four test genes were used to confirm the “theory of frame” by frame-shifting two of the genes through the addition of one or two bases [22]. The vector carried a chloramphenicol resistance gene and an ampicillin resistance gene. The four test genes were inserted upstream of the AmpR gene, transformed, and plated on two different plates (Chlor and Amp). Colonies with frame-shifted genes did not survive on ampicillin plates, since they lack a functional AmpR gene. In contrast, colonies carrying any of the four test genes were able to grow on chloramphenicol plates.
To filter for genic ORFs, i.e. those that correspond to a ‘real gene’ and result in a physiological protein, researchers used a vector carrying chloramphenicol resistance to grow colonies on plates [27]. Thereafter, they harvested the cells and grew them in selective media supplemented with chloramphenicol and with different concentrations of ampicillin (as the selective marker). Sequencing of the selected ORFs showed that 96% corresponded to real genes. Statistical analysis showed that the activity of beta-lactamase rises with increasing concentrations of ampicillin, which suggests that higher expression of beta-lactamase is essential for colonies to grow at high concentrations of ampicillin.
The phage display method fuses the gene encoding the protein of interest to a phage coat protein gene, so that the protein is displayed on the surface of the phage. Thereafter, the phage delivers the gene into bacteria (as a host), where it is expressed; this phage-mediated transfer is called transduction. The primary bacteriophages used for phage display are T7 and M13, both of which can use Escherichia coli as a host.
In one of the first reports of ORF filtering for phage display, researchers modified a vector by eliminating the original multiple cloning site (MCS) and inserting a new site through inverse PCR [19]. This step allowed them to design ORFs that remain in frame even when subcloned into the derivative vectors they constructed. After ligation of the fragments upstream of AmpR, the vector was transformed into an XL1-Blue strain and ampicillin selection was applied. In order to display the ORFs on the phage surface, the insert was cloned into a derivative vector and transformed into a bacterial strain (ER2738) grown under kanamycin selection. Phagemids were rescued by helper phage, and sequencing analysis of those samples determined that 97% were ORFs.
Zacchi et al. fused fragments upstream of the beta-lactamase gene to filter ORFs and flanked the insert with lox recombination sequences [18]. Following ampicillin selection, they excised the beta-lactamase gene from the vector. To facilitate the purification of those ORFs using phage display, the constructed vector carried the fd phage “gene 3” as a tag. Their results confirmed that a high concentration of antibiotic eliminates out-of-frame sequences (with 12 μM ampicillin they reported 100% ORFs and 0.2% out of frame, and with 25 μM ampicillin 85% ORFs and none out of frame). From the 100% filtered library, 80% was detected using dot blot for protein detection, and 50% of the mapped ORFs were genic ORFs.
In 2010, Di Niro et al. applied the antibiotic screening technique to prepare ORFs for expression in phage display vectors in a more seamless way. Genes were fragmented into pieces of 100-600 bp, and the fragments were cloned in the right orientation using restriction enzyme sites. The ORF fragments were inserted upstream of the AmpR gene, with lox sequences flanking the AmpR sequence. Downstream of the AmpR and lox sequences is the g3p gene, which encodes a phage coat protein used to display ORFs on the phage surface. Once ligated, the vectors were transformed into bacteria and grown on ampicillin-containing plates for ORF selection. Positive clones were then transformed into a bacterial strain with a constitutively active Cre recombinase. This cleaves the lox sequences and recombines the ORF with the g3p gene, eliminating the AmpR gene. In this way, the ORF needs to be cloned into the vector only once, while still permitting expression of the ORF/g3p fusion without AmpR. To validate interactions, ORFs were displayed on phage for the enzyme transglutaminase 2 (TG 2). This method allowed the selection of 99% ORFs, of which 85% corresponded to the correct frame of the gene, and it localized the interacting regions.
By contrast, Caberoy et al. used phage display itself to select ORFs [28]. They used the T7 phage, inserting their ORF at the C-terminal end of the capsid 10B protein. Crucially, however, they added a 3C protease cleavage site and then a biotinylation site further downstream. Consequently, only virus particles encoding an ORF will be biotinylated. The cDNA library was selected using streptavidin to isolate biotinylated ORFs, and these were cleaved from the resin with the 3C protease. The recovered library was re-amplified in bacteria to generate an ORF-selected library suitable for use in selection experiments. Using phage display, they selected 17 ORFs, of which 13 encoded different proteins. Phage display of a cDNA library fused to a C-terminal biotinylation tag confirmed that, following selection, the clones were enriched to 90% ORF inserts [29].
In another study, the gene was sheared and fragments of 100-300 bp were gel purified [30]. Ampicillin selection was performed to filter ORFs, after which the vector was transformed into a strain expressing a constitutively active Cre gene in order to remove the ampicillin resistance gene from the ORF constructs. Phage display of the ORFs, achieved by infecting the bacteria with helper phage M13K07, yielded a library of 94% ORFs.
In a further study, the gene was fragmented into pieces of 200-800 bp by sonication, and the fragments were cloned into a vector [31]. The vector was then transformed into a bacterial strain, with chloramphenicol and ampicillin resistance as selective markers. To confirm that the target sequences had been obtained, the samples were sequenced, and crystallization was performed to determine the structure of the enzyme. For the purification and crystallization of proteins, a His-tag was attached at the N-terminus to obtain the protein in soluble form. Following this approach, the authors were able to identify two domains in the gene; their library covered 739 genes from chromosome 1 and 540 genes from chromosome 2, for a total of 1279 ORFs.
Open reading frame percentage (ORFs%) is the percentage of isolated sequences that do not contain a stop codon. Genic ORF percentage (ORFs genic%) is the percentage of those ORF sequences that, when aligned to the reference gene, align to the correct gene frame. The selective marker is the marker used in the ORF filtering vector to select for ORF sequences and filter out those with a stop codon, based on antibiotic selective pressure for bacteria and yeast and on tag presence for phage. Also listed are the fragment size of the sheared DNA library in base pairs (bp), the application (such as phage display), the further tests applied to validate the ORFs (validation methods), the organism used for selection (bacteria, phage, or yeast), and the corresponding authors' names. Authors highlighted in yellow used bacteria for ORF selection, those in green used yeast, and those in blue used phage.
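The two percentages defined above can be illustrated with a small calculation on sequenced clones. This is a simplified sketch and not the procedure of any cited study: it assumes each clone is read in the vector's frame from its first base, replaces alignment with an exact substring search against the reference coding sequence, and reports the genic percentage relative to the stop-free clones. The clone sequences are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_stop(seq):
    """True if seq, read in frame from its first base, contains a stop codon."""
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def library_metrics(clones, reference):
    """Return (ORF %, genic ORF %) for a list of clone insert sequences.

    A clone counts as an ORF if it has no in-frame stop codon, and as a
    genic ORF if it also matches the reference at a position that is a
    multiple of 3, i.e. in the gene's own reading frame.
    """
    orfs = [c for c in clones if not has_stop(c)]
    genic = [
        c for c in orfs
        if reference.find(c) >= 0 and reference.find(c) % 3 == 0
    ]
    orf_pct = 100 * len(orfs) / len(clones) if clones else 0.0
    genic_pct = 100 * len(genic) / len(orfs) if orfs else 0.0
    return orf_pct, genic_pct


# First 10 codons of the gene shown in Figure 3a, used as a toy reference
reference = "ATGATTGAACAAGATGGATTGCACGCAGGT"
clones = [
    "ATTGAACAAGAT",  # in gene frame, no stop
    "TTGAACAAGATG",  # shifted by one base, but happens to have no stop
    "TGATTGAACAAG",  # shifted, begins with the stop codon TGA
    "GAACAAGATGGA",  # in gene frame, no stop
]
orf_pct, genic_pct = library_metrics(clones, reference)
print(f"ORFs%: {orf_pct:.1f}, ORFs genic%: {genic_pct:.1f}")
```

The second clone shows why the two measures differ: a sequence can be free of stop codons without being in the gene's own frame.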
ORFs have also been filtered in yeast. The desired vector was first transformed into a bacterial strain and selected for ampicillin resistance by plating on Amp plates [32]. Following selection of the desired vector, the plasmid was transformed into yeast, and ORFs tagged with a histidine gene were selected on histidine induction medium. Whereas other researchers used ampicillin as the selective marker, Holz and colleagues used the histidine gene as the selective marker in yeast. With this experiment they were able to cover 60% ORFs.
A higher ORF percentage is achieved with smaller insert fragments. In bacteria, ORF-filtered libraries achieved the highest ORF percentages with insert sizes ranging from 100 to 800 bp (Table 1). However, fragments of 100-500 bp are preferable, in the sense that an out-of-frame fragment of 300 bp already has a 99% chance of containing a stop codon. The great advantage of a fragmented library is that it allows the interaction site to be localized [20,21]. ORF filtering eliminates fragments with stop codons, providing good selection for genic ORFs (Figure 5). When fragmented libraries were used with ORF filtering, more physiological interactions were recovered [14].
ORF filtering is a powerful tool for isolating functional sequences within a gene. It offers a robust path toward the discovery of drug targets and the treatment of cancer and of infections, especially those resistant to antibiotics.
Copyright © 2020 Boffin Access Limited.