Current fluorescence-based in situ protocols that use third-strand probes require multiple binding sites for the target to be visible, since each probe carries only a single fluorophore. The minimum usable copy number with such probes is probably 15-20 (MD Johnson, personal communication). The original literature reported that each human histone gene has a copy number of 30-40 (Wilson et al 1977). It was later reported that the entire family of histone genes is clustered (Carozzi et al 1984) in the short arm of chromosome 7 (Chandler et al 1979). This information suggested that the human histone genes were potential targets for extending third-strand in situ hybridization (TISH) from α-satellite sequences to non-centromeric targets.
The human genome codes for five main families of histone proteins (H1, H2A, H2B, H3, and H4). Each family has a number of variants. Depending on the family, homology between members can range from 100% (H4) to 68% (H1). A sixth histone, the poorly characterized H5, is often listed as a linker protein and is not always included with the other five members.
Because of this family relationship and their common roles in eukaryotes, the histone genes, human and otherwise, seemed particularly well suited to third-strand binding studies. Their arrangement and structure in the genome, as noted below, match most of the current TISH requirements well.
Previous database searches (Grasso 1994; Niederstrasser 1997) have established a methodology for finding and analyzing appropriate genomic sequences for third-strand binding targets. Online search engines and the GCG software suite (University of Wisconsin, © Genetics Computer Group, Inc.) are invaluable tools for retrieving and analyzing nucleic acid sequences.
Human histone sequences were first sought in the GenBank database by keyword search in the nucleotide Entrez browser on the National Center for Biotechnology Information (NCBI) website. The retrieved sequences were then rapidly examined by eye for appropriate third-strand binding sites.
A subsequent search of the literature found a project at the National Human Genome Research Institute/NIH that had brought together and organized all known sequences of the human histone genes and their variants (Baxevanis and Landsman 1997). Data from this "HISTONE Project" was used to locate other sequences suitable for third-strand binding in the published open reading frames and intergenic regions of the histone genes.
Sequences obtained from GenBank and the EMBL database were screened for third-strand binding targets. The GCG program FINDPATTERNS was used to search the retrieved histone database for homopurine runs of sixteen or more residues, allowing at most one inverted base pair (i.e., a pyrimidine residue). Final targets were screened visually to exclude sequences containing the triplet GGG (as such sequences would require three adjacent positively charged C residues in a third-strand using the pyrimidine parallel motif) or long non-random stretches.
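The screening criteria above can be illustrated with a minimal Python sketch. This is not the FINDPATTERNS program and does not reproduce its pattern syntax; the function name `find_homopurine_runs` and its parameters are hypothetical, chosen only to mirror the stated criteria (runs of at least sixteen residues, at most one pyrimidine interruption, GGG flagged for later visual exclusion). Interruptions at either end of a run are trimmed, on the assumption that a terminal mispair adds nothing to the target.

```python
PURINES = frozenset("AG")

def find_homopurine_runs(seq, min_len=16, max_interruptions=1):
    """Return (start, end, target, has_ggg) for each maximal purine-rich run.

    start and end are 0-based, end-exclusive offsets into seq. Any base that
    is not A or G counts as an interruption.
    """
    seq = seq.upper()
    n = len(seq)
    spans = set()
    left = 0
    interruptions = 0
    for right, base in enumerate(seq):
        if base not in PURINES:
            interruptions += 1
        # Shrink from the left until the window is within the allowance.
        while interruptions > max_interruptions:
            if seq[left] not in PURINES:
                interruptions -= 1
            left += 1
        # Only report windows that cannot be extended to the right.
        at_end = right == n - 1
        blocked = (not at_end and seq[right + 1] not in PURINES
                   and interruptions == max_interruptions)
        if not (at_end or blocked):
            continue
        # Trim interruptions sitting at either end of the window.
        start, end = left, right + 1
        while start < end and seq[start] not in PURINES:
            start += 1
        while end > start and seq[end - 1] not in PURINES:
            end -= 1
        if end - start >= min_len:
            spans.add((start, end))
    # Trimming can leave a span nested inside a longer one; keep the longer.
    maximal = sorted(
        s for s in spans
        if not any(o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans)
    )
    return [(s, e, seq[s:e], "GGG" in seq[s:e]) for s, e in maximal]

# A 16-residue run with one interruption, flanked by pyrimidines.
print(find_homopurine_runs("ttcc" + "gaagaaagcgaagaag" + "ctct"))
# [(4, 20, 'GAAGAAAGCGAAGAAG', False)]
```

The `has_ggg` flag only marks candidates; in this sketch the decision to discard them is left to the reader, as the visual screen was in the original procedure.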
As an internal control, similar searches were made for "alpha satellite" and "alphoid" in the GeneMBL database, mimicking M. Grasso's GCG search of 1994. Limited to human entries, the two queries yielded 260 and 317 entries, respectively. No attempt was made to eliminate entries returned by both queries. Even if the two lists overlapped completely, however, they would still contain at least 317 distinct entries, comparable to the 336 separate α-satellite sequence entries found by Grasso.
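The bound on the control search is simple set arithmetic, sketched here in Python. The function name `distinct_entry_bounds` is hypothetical; the only inputs are the two hit counts reported above.

```python
def distinct_entry_bounds(hits_a, hits_b):
    """Bounds on the number of distinct entries in two possibly overlapping lists."""
    lower = max(hits_a, hits_b)  # one list entirely contained in the other
    upper = hits_a + hits_b      # no entries shared between the lists
    return lower, upper

print(distinct_entry_bounds(260, 317))  # (317, 577)
```

Grasso's figure of 336 falls inside this range, which is all the control was meant to establish.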
The database and literature search described in the Methods for this section returned approximately 40 published nucleotide sequences of human histone genes. Several non-human sequences were also acquired by this process. These latter sequences were discarded without further analysis. Of the remaining human sequences, some redundant entries were eliminated as done by Baxevanis and Landsman (1997). The final data set of possible targets is shown in Table 6. All five main gene families and most variants are represented.
Table 7 lists the sequences that were selected by the FINDPATTERNS screen. Two of the five histone gene families are listed, as well as HMG-17, a non-histone chromosomal protein. The genomic structure of HMG-17 is unknown. Of the retrieved histone database, 8.8% (3 of 34 sequences) contained possible third-strand binding targets. One of these has a terminal cytosine and is therefore unlikely to be useful.

Family | Locus | Accession No. | Definition |
---|---|---|---|
H1 | HUMH1T | M60094 | Human testicular H1 histone (H1) gene |
H1 | HUMHISAB | M60747 | Human histone H1 (H1F3) gene |
H1 | HUMHISAC | M60748 | Human histone H1 (H1F4) gene |
H1 | HUMHISH1T | M97755 | Human histone H1T gene |
H1 | HSHIS10G | X03473 | Human gene for histone H1(0) |
H1 | HSH11 | X57129 | H.sapiens H1.1 gene for histone H1 |
H1 | HSH12 | X57130 | H.sapiens H1.2 gene for histone H1 |
H2A | HUMH2A1B | L19778 | Homo sapiens histone H2A.1b |
H2A | HUMHIS2AZ | M37583 | Human histone H2A.Z |
H2A | HUMHISAG | M60752 | Human histone H2A.1 gene |
H2A | HSHISH2A | X00089 | Human histone H2a |
H2A | HSH2AX | X14850 | Human histone H2A.X |
H2A | HSH2AZ | X52317, X06885 | Human histone H2A.Z |
H2B | HUMHISAE | M60750 | Human histone H2B.1 gene |
H2B | HUMHISAF | M60751 | Human histone H2B.1 gene |
H2B | HSHISH2B | X00088 | Human histone H2b gene |
H2B | HSH2B1 | X57127 | H.sapiens H2B.1 histone |
H2B | HSH2B2H2 | X57138 | Human H2B.2 and H2A.1 Histone |
H3 | HUMHISH3C | M11353 | Human H3.3 histone class |
H3 | HUMHISH3B | M11354 | Human H3.3 histone, class B |
H3 | HUMHIS3PRM | M26150 | Human histone H3 gene |
H3 | HUMHISAA | M60746 | Human histone H3.1 (H1F3) gene |
H3 | HSHISH3 | X00090 | Human histone H3 gene |
H3 | HSH31 | X57128 | H.sapiens H3.1 histone |
H4 | HUMHIS4 | M16707 | Human histone H4 gene, clone FO108 |
H4 | HUMHISAD | M60749 | Human histone H4 (H4) gene |
H4 | HSHIH4 | X00038 | Human H4 histone gene |
H4 | HSH4AHIS | X60481 | H.sapiens H4/a histone |
H4 | HSH4BHIS | X60482 | H.sapiens H4/b histone |
H4 | HSH4DHIS | X60483 | H.sapiens H4/d histone |
H4 | HSH4EHIS | X60484 | H.sapiens H4/e histone |
H4 | HSH4GHIS | X60486 | H.sapiens H4/g histone |
H4 | HSH4HHIS | X60487 | H.sapiens H4/h histone |
H4 | HSH4HIST | X67081 | |

Table 6. The human histone gene families and their variations. Primary GenBank identification codes and description are listed for those genes of the five human histone families that were found as described in Methods.

"HISTONE, HUMAN, REPETITIVE" Query

THIRD-STRAND (5'→3') | binding strand (5'→3') | template strand (3'→5') | Target Size and Mispairs | Locus | Accession No. | NID | Notes |
---|---|---|---|---|---|---|---|
CTTCTTTCGCTTCTTC | gaagaaagcgaagaag | cttctttcgcttcttc | 16.1 | N92492 | N92492 | g1264801 | cDNA clone 301887 sw:H1D_HUMAN P16403 H1D |
TCCGTCTTCCTTTTTTCC | aggcagaaggaaaaaagg | tccgtcttccttttttcc | 18.1 | AA188780 | AA188780 | g1775871 | cDNA clone 626210 gb:L19779 H2A.1 |
TCCTTCACTTTACTTTTACC | aggaagtgaaatgaaaatgg | tccttcactttacttttacc | 21.3 | N91162 | N91162 | g1444489 | cDNA clone 301811 gb:X13546 rna1 HMG-17 |
CCCTTCTTTCCTGTT | gggaagaaaggacaa | cccttctttcctgtt | 15.1 | W73650 | W73650 | g1383864 | cDNA clone 344100 gb:M37583 H2A.Z |
CTTCTACTTTCCTCTTC | gaagatgaaaggagaag | cttctactttcctcttc | 17.1 | AA203446 | AA203446 | g1799157 | cDNA clone 446556 gb:X13546 rna1 HMG-17 |
TTTCTTCCCTCTCTTCC | aaagaagggagagaagg | tttcttccctctcttcc | 17.0 | AA203446 | AA203446 | g1799157 | cDNA clone 446556 gb:X13546 rna1 HMG-17 |
TTCCCTCTATTTCGTTTCC | aagggagataaagcaaagg | ttccctctatttcgtttcc | 19.2 | AA203446 | AA203446 | g1799157 | cDNA clone 446556 gb:X13546 rna1 HMG-17 |
TTTCCCTTTTTCCCTTTTC | aaagggaaaaagggaaaag | tttccctttttcccttttc | 19.0 | AA203446 | AA203446 | g1799157 | cDNA clone 446556 gb:X13546 rna1 HMG-17 |

All sequences are from Hillier et al (1995).

Table 7. Results of database query for human histone sequences. Full sequences were extracted from GenBank using the keywords "HUMAN", "HISTONE", and "REPETITIVE". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand binding targets. Only the putative binding regions of those genes with possible third-strand targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.

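The third-strand sequences and the size-and-mispair values in Table 7 (for example, 16.1 for a 16-residue target with one interruption) can be derived from the binding strand alone. The Python sketch below is an illustration under stated assumptions: the name `describe_target` is hypothetical, and the base mapping simply mirrors the convention visible in the table, where T is written opposite A, C opposite G, and the Watson-Crick complement opposite each interrupting pyrimidine.

```python
# Base written in the third strand opposite each base of the binding strand,
# following the convention used in the table (an assumption of this sketch).
THIRD_STRAND_BASE = {"A": "T", "G": "C", "C": "G", "T": "A"}

def describe_target(binding_strand):
    """Return (third_strand, 'size.mispairs', has_cytosine_triplet).

    binding_strand is the purine-rich strand written 5' to 3'. The third
    strand is returned 5' to 3', parallel to it. Characters other than
    A, C, G, and T raise KeyError.
    """
    target = binding_strand.upper()
    third = "".join(THIRD_STRAND_BASE[base] for base in target)
    # Each pyrimidine in the binding strand interrupts the homopurine run.
    mispairs = sum(base in "CT" for base in target)
    return third, f"{len(target)}.{mispairs}", "CCC" in third

print(describe_target("gaagaaagcgaagaag"))
# ('CTTCTTTCGCTTCTTC', '16.1', False)
```

The final flag marks third strands that would need three adjacent cytosines, the case excluded by the GGG screen.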
Fewer than 10% of the histone sequences contained appropriate targets for third-strand binding, and even these targets would probably not be suitable for the purposes of this project because they are not closely packed into a repetitive arrangement. More recent analysis has shown that although the various histone genes are clustered, they are spread almost randomly over a region of 260 kb (Albig et al 1997). Furthermore, the copy number of each gene in this region varies from 4 to 8, too low for adequate signal production in fluorescence microscopy. This information is at variance with the earlier work of Wilson et al (1977).
The goal of analyzing the human histone genes in order to find suitable targets for TISH was therefore not successful. While several targets were found, their chromosomal arrangement is not suitable for third-strand in situ hybridization. Nevertheless, the methodology employed here improved upon the previously used database search and is a step in the right direction. Further refinements in the process are still required, as only 43 sequences were found here, while Baxevanis and Landsman list 373 non-redundant human histone sequences in the major protein databases.
There are several possible reasons for this discrepancy. To begin with, their database was generated from SWISS-PROT, PIR, the Protein Data Bank (PDB), and CDS translations from GenBank, all of which are protein databases. The search presented here was based solely on the GenBank and EMBL nucleotide databases. It is possible that certain published protein sequences have never been submitted to a nucleic acid database. In addition, the search presented here was initially aimed at finding repetitive sequences, so the keyword "repetitive" was added to all searches. Since genomic analysis is often not done on cloned sequences, database entries do not always include this information, and many histone sequences were therefore not retrieved. Searching only for "histone" and "human", on the other hand, is believed to be too broad a query, one that yields inconclusive results with a low signal-to-noise ratio.
A later Entrez search of the databases for H1 histones returned 92 entries. At least 30 of these belonged to an EST project (Hillier et al 1995) and are redundant. Another 30 were sequenced cDNA clones listed only as similar to the histone genes and are probably redundant as well. When the search was narrowed to exclude "similar" sequences, the list extracted from the database contained only 28 entries. Visual examination confirmed that only verified histone sequences were listed. To gauge the importance of this last refinement, a comparable search was run for H3 sequences: excluding "similar" entries reduced the returned list from 568 hits to 36.
Nucleotide sequences extracted from the NIH histone project can be submitted to the same FINDPATTERNS search as other databases. The results should yield a more definitive list of third-strand binding targets in all the histone gene families.
H-DNA regions are a possible source of non-centromeric sequences useful for chromosomal binding by third-strands. H-DNA forms when two identical d(G/A)n regions lie as mirror images of each other in close proximity on the same DNA strand (Frank-Kamenetskii and Mirkin 1995). During the 'breathing' of the DNA molecule, the two mirror regions can fold back on each other, creating a Y:R·Y motif. H-DNA regions are fast becoming an important focus of genome research, since they have been found to be plentiful and to be located in control regions upstream of genes (Beasty and Behem 1988). It is not yet known what biological purpose, if any, these sequences have, but they have been shown to affect the normal functioning of the cell's replication and transcription machinery.
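The mirror-repeat arrangement described above can be searched for directly. The Python sketch below is a minimal illustration, not a method used in this work: the name `find_mirror_repeats` and the thresholds `min_arm` and `max_spacer` are hypothetical. It looks for two purine-only arms on one strand that read as mirror images of each other (not as reverse complements) across a short spacer. Scanning the reverse complement as well would cover tracts whose purine arms lie on the other strand.

```python
def find_mirror_repeats(seq, min_arm=8, max_spacer=4):
    """Return (start, end, arm_length, spacer) for purine mirror repeats.

    start and end are 0-based, end-exclusive offsets of the whole repeat.
    Overlapping reports for the same region are not merged.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for spacer in range(max_spacer + 1):
        for i in range(n - spacer - 1):
            j = i + 1 + spacer  # first base of the right-hand arm
            arm = 0
            # Walk outward from the spacer while the arms mirror each other.
            while (i - arm >= 0 and j + arm < n
                   and seq[i - arm] == seq[j + arm]
                   and seq[i - arm] in "AG"):
                arm += 1
            if arm >= min_arm:
                hits.append((i - arm + 1, j + arm, arm, spacer))
    return hits

# Two 10-residue purine arms mirrored across a 2-base spacer.
example = "CC" + "GAAGGAGGAA" + "TT" + "AAGGAGGAAG" + "CC"
print((2, 24, 10, 2) in find_mirror_repeats(example))  # True
```

Because the same region can satisfy the test with more than one spacer length, the longest arm reported for a region is the most informative one.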
To begin with, it has been shown that some nuclear proteins bind to simple repeat sequences such as (GAA)n (Epplen et al 1996), a viable region for third-strand binding. This observation argues against the proposed biological insignificance of these interspersed elements. Furthermore, regions implicated in H-DNA formation are known to contain similar sequences that exhibit the same residue pattern. H-DNA, therefore, might represent genomic targets where nuclear proteins preferentially bind. Sridhara-Rao (1994) has shown that such sequences in simian virus 40 (SV40) slow the rate of replication of the virus. Although this is not conclusive proof, he presents a strong argument that they might have regulatory properties. Interestingly, the regulation seems to be only partial. Grabczyk and Fishman (1995) describe how certain H-DNA sequences act as transcriptional diodes, allowing transcription in one direction but not the other. Whether this is a consequence of sequence or structure is not known.
A search of the nucleotide database at the NCBI for "H-DNA and HUMAN" found three possible human H-DNA sites (Table 8). The small number of hits is surprising considering that H-DNA has been reported to occur in many regulatory regions. A GeneMBL database search using GCG yielded similarly few (<10) results. Future work will need to identify other databases that provide more concise and useful returns. One resource that might provide better results is the NCBI's GenBank Database Query engine.
Several long homopurine runs have also been identified in the genome (Table 9, top two entries). They are recorded here as they might be useful for future work in the Fresco lab. These sequences obviously provide ample sites for third-strand binding. The known sequences, which range from approximately 60 bp to over 400 bp, suggest that there are probably many more regions of the human genome yet to be sequenced that contain homopurine·homopyrimidine tracts. Similar long homopurine sequences also appear in other animal genomes, particularly in the rat, further indicating that such long stretches might be common motifs in animal chromosomes. As of this writing, there is no known biological role for these sequences.

"H-DNA, HUMAN" Query

THIRD-STRAND (5'→3') | binding strand (5'→3') | template strand (3'→5') | Target Size and Mispairs | Locus | Accession No. | NID | Notes |
---|---|---|---|---|---|---|---|
CTCTCACCCCTTTTCTTGCTCCCT | gagagtggggaaaagaacgaggga | ctctcaccccttttcttgctccct | 24.2 | HUMAAE | L28809 | g454151 | γ-globin clone="hBP5" |
TTTTCTTTTCCTTTTGTCCTTC | aaaagaaaaggaaaacaggaag | ttttcttttccttttgtccttc | 22.1 | HS1014CT | X16734 | g525225 | chrm 10 t(10;14)(q24;q11) |
CTTCTCTCTCTTCTCCCTTGTTCC | gaagagagagaagagggaacaagg | cttctctctcttctcccttgttcc | 24.1 | HS1014CT | X16734 | g525225 | chrm 10 t(10;14)(q24;q11) |
CTCCCTCCCCTCCCCTCCCCCTCTCCTTCC | gagggaggggaggggagggggagaggaagg | ctccctcccctcccctccccctctccttcc | 30.0 | HS1014CT | X16734 | g525225 | chrm 10 t(10;14)(q24;q11) |
TTTTTTTTTTTGTTTTTGTTTTGTTTTTGT TTTTTTTTTTTGTTTCTTCCTCTTTC | aaaaaaaaaaacaaaaacaaaacaaaaaca aaaaaaaaaaacaaagaaggagaaag | tttttttttttgtttttgttttgtttttgt tttttttttttgtttcttcctctttc | 56.5 | HSC1INHIB | X54486 | g29534 | hum C1 inhib LNIA |

Table 8. Results of database query for homopurine segments. Full sequences were extracted from GenBank using the keywords "H-DNA" and "HUMAN". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand targets. Only the putative binding regions of those genes with possible targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.

"HOMOPURINE" Query

THIRD-STRAND (5'→3') | binding strand (5'→3') | template strand (3'→5') | Target Size and Mispairs | Locus | Accession No. | NID | Notes |
---|---|---|---|---|---|---|---|
 | tttcatctctgtgtttttctttatttcctt ccttccttccttccctccctccctcaatcc ctccctctcttgctcttcctcttcctttcc tttctttcctttcctttcctgaccttccct tcctttcatttcctttcccttcccttccct ttctttcccttcccttcccttcctttccct tcccttcccttcctttcccttcccttccct tcctttccctccccttcccttccctcccct tcccttccctccccttccctccccttccct ccccttccctcccctcccatcccctcccct ccctttttctttttcttttttctcttctct tctcttcctctcctctcctgtctttttctt tttcttatcttttcttttcttgtttctttt ctc† | | 393.12 | HSMDR1I | X78081 | g587421 | mdr1 chr 7q21.1 |
 | gaagaggaagaagaaagaggaggaggagga aagaaggaagaagaaggaggagaagaagaa gaggaggaggaggaagaggatgaggaggaa gaggaggaggtggaagaggaagaggaagaa gagga† | | 125.2 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
TCTCTTTCTTTCTCTCTTTTTCTTCCTTT | agagaaagaaagagagaaaaagaaggaaa | tctctttctttctctctttttcttccttt | 29.0 | HUMSFTP2A | L40486 | g1280227 | BMP1/mTld btw. D8S298 & D8S5 |
TTTTTTTTTTTTTTCTTTCTTTCTTTTCTT TTTT | aaaaaaaaaaaaaagaaagaaagaaaagaa aaaa | ttttttttttttttctttctttcttttctt tttt | 34.0 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
TTTTTTTCTTTTCTTTTTTCTTC | aaaaaaagaaaagaaaaaagaag | tttttttcttttcttttttcttc | 23.0 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
CTCTCTCTTTTTTTCGTC | gagagagaaaaaaagcag | ctctctctttttttcgtc | 18.1 | HSMDR1I | X78081 | g587421 | mdr1 chr 7q21.1 |
TTCTTTTCTTTATTTTCT | aagaaaagaaataaaaga | ttcttttctttattttct | 18.1 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
TTTTTTGTCCTTTTTTT | aaaaaacaggaaaaaaa | ttttttgtccttttttt | 17.1 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
TTCTTCCCCTCTATCCCTCT | aagaaggggagatagggaga | ttcttcccctctatccctct | 20.1 | HUMSFTP2A | L40486 | g1280227 | BMP1/mTld btw. D8S298 & D8S5 |
TCTGTCCTTCCTCCCTCCGCTC | agacaggaaggagggaggcgag | tctgtccttcctccctccgctc | 25.2 | HUMSFTP2A | L40486 | g1280227 | BMP1/mTld btw. D8S298 & D8S5 |
CTCCCTCCCTCTCCTCCTT | gagggagggagaggaggaa | ctccctccctctcctcctt | 19.0 | HSNG26 | X54171, X53282 | g35051 | nucleophosmin pseudogene LNIA |
TCTCCCTTCCCCTTCTCC | agagggaaggggaagagg | tctcccttccccttctcc | 18.0 | HUMSFTP2A | L40486 | g1280227 | BMP1/mTld btw. D8S298 & D8S5 |

† Third-strand and template strand omitted for clarity.

Table 9. Results of database query for homopurine segments. Full sequences were extracted from GenBank using the keyword "HOMOPURINE". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand targets. Only the putative binding regions of those genes with possible targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.