Appendix A: The Human Histone Genes And Other Targets

Introduction

Current fluorescence-based in situ protocols that use third-strand probes require multiple binding sites for a target to be visible, because each probe carries only a single fluorophore. The minimum copy number for such probes is probably 15-20 (MD Johnson, personal communication). The original literature reported a copy number of 30-40 for each human histone gene (Wilson et al 1977). The entire family of histone genes was later reported to be clustered (Carozzi et al 1984) in the short arm of chromosome 7 (Chandler et al 1979). These reports supported the view that the human histone genes are potential targets for extending third-strand in situ hybridization (TISH) from α-satellite sequences to non-centromeric targets.

The human genome codes for five main families of histone proteins (H1, H2A, H2B, H3, and H4). Each family has a number of variants. Depending on the family, homology between members can range from 100% (H4) to 68% (H1). A sixth histone, the poorly characterized H5, is often listed as a linker protein and is not always included with the other five members.

Because of this family relationship and their common roles in eukaryotes, the histone genes, human and otherwise, seemed particularly well suited to third-strand binding studies. Their arrangement and structure in the genome, as noted below, match most of the current TISH requirements.

Experimental design and methods

Previous database searches (Grasso 1994; Niederstrasser 1997) have established a methodology for finding and analyzing appropriate genomic sequences for third-strand binding targets. Online search engines and the GCG software suite (University of Wisconsin, © Genetics Computer Group, Inc.) are invaluable tools for retrieving and analyzing nucleic acid sequences.

Human histone sequences were first sought in the GenBank database by keyword search, using the nucleotide Entrez browser on the National Center for Biotechnology Information website. The retrieved sequences were then rapidly examined by eye for appropriate third-strand binding sites.

A subsequent search of the literature found a project at the National Human Genome Research Institute/NIH that had brought together and organized all known sequences of the human histone genes and their variants (Baxevanis and Landsman 1997). Data from this "HISTONE Project" was used to locate other sequences suitable for third-strand binding in the published open reading frames and intergenic regions of the histone genes.

Sequences obtained from GenBank and the EMBL database were screened for third-strand binding targets. The GCG program FINDPATTERNS was used to search the retrieved histone database for homopurine runs of sixteen or more residues, allowing at most one inverted base pair (i.e., one pyrimidine residue). Final targets were screened visually to exclude sequences containing the triplet GGG (as such sequences would require three adjacent positively charged C residues in a third strand using the pyrimidine parallel motif) or long non-random stretches.
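The screening criteria can be illustrated with a short sketch. The Python below is not FINDPATTERNS and does not reproduce its pattern syntax; it is a minimal stand-in, assuming that a target is a window that starts and ends on a purine, spans at least sixteen residues, and contains at most one pyrimidine. The function names and the "CC" flanks in the example are illustrative assumptions; the core sequence is the binding strand of the first entry in Table 7.

```python
PURINES = set("AG")


def find_homopurine_runs(seq, min_len=16, max_interruptions=1):
    """Return (start, end, interruptions) for purine-rich windows.

    A window starts and ends on a purine, is at least min_len long and
    holds at most max_interruptions non-purine residues. Coordinates are
    0-based with an exclusive end. Windows contained in an earlier hit
    are not reported again.
    """
    seq = seq.upper()
    hits = []
    last_end = 0
    for start in range(len(seq)):
        if seq[start] not in PURINES:
            continue
        interruptions = 0
        end = start   # exclusive end of the window, always just after a purine
        used = 0      # interruptions that lie inside the window
        for i in range(start, len(seq)):
            if seq[i] in PURINES:
                end = i + 1
                used = interruptions
            else:
                interruptions += 1
                if interruptions > max_interruptions:
                    break
        if end - start >= min_len and end > last_end:
            hits.append((start, end, used))
            last_end = end
    return hits


def contains_ggg(target):
    """True if the purine strand holds GGG, i.e. three adjacent C+ in the third strand."""
    return "GGG" in target.upper()


# Hypothetical context: the N92492 binding strand from Table 7 with made-up flanks.
example = "CC" + "GAAGAAAGCGAAGAAG" + "CC"
print(find_homopurine_runs(example))     # [(2, 18, 1)]
print(contains_ggg("GAAGAAAGCGAAGAAG"))  # False
```

The sketch scans only the strand it is given; a complete search would also scan the reverse complement so that purine runs on the opposite strand are found.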

As an internal control, a similar search was made for "alpha satellite" and "alphoid" on the GeneMBL database, mimicking M. Grasso's GCG search in 1994. Limited to human entries, the two queries yielded 260 and 317 hits, respectively. No attempt was made to eliminate duplicate hits. However, even if the two returned lists overlapped completely, the searches found at least 317 distinct entries, comparable to the 336 separate α-satellite sequence entries found by Grasso.
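Because duplicates were not removed, the number of distinct entries is only bounded, not known. The following sketch makes the bounds explicit; the function name is an illustrative assumption, and the only inputs are the two hit counts reported above.

```python
def distinct_hit_bounds(hits_a, hits_b):
    """Bounds on the number of distinct entries in two hit lists.

    Lower bound: one list is entirely contained in the other.
    Upper bound: the two lists share no entries.
    """
    return max(hits_a, hits_b), hits_a + hits_b


low, high = distinct_hit_bounds(260, 317)
print(low, high)  # 317 577
```

Grasso's count of 336 separate entries falls within this range.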

Results

The database and literature search described in the Methods for this section returned approximately 40 published nucleotide sequences of human histone genes. Several non-human sequences were also acquired by this process. These latter sequences were discarded without further analysis. Of the remaining human sequences, some redundant entries were eliminated as done by Baxevanis and Landsman (1997). The final data set of possible targets is shown in Table 6. All five main gene families and most variants are represented.

Table 7 lists the sequences that were selected by the FINDPATTERNS screen. Two of the five histone gene families are represented, as well as HMG-17, a non-histone chromosomal protein. The genomic structure of HMG-17 is unknown. Of the retrieved histone database, 8.8% (3 of 34 sequences) contained possible third-strand binding targets. One of these has a terminal cytosine and is therefore unlikely to be useful.
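Each target in Table 7 is shown as three aligned strands with a label such as 16.1. As a reading aid, the sketch below derives the other two strands and the label from the purine (binding) strand. It rests on assumptions inferred from the table rather than stated in the text: the label is read as length followed by the number of pyrimidine interruptions, the third strand follows the pyrimidine parallel motif (T opposite A, C opposite G), and at an interruption the third strand simply carries the Watson-Crick complement, as the printed rows appear to do. The function name is hypothetical.

```python
COMPLEMENT = {"a": "t", "g": "c", "c": "g", "t": "a"}


def describe_target(binding):
    """Derive the strands and size label for a purine-rich binding strand.

    binding is written 5'->3'. The template strand is returned 3'->5' so
    that it aligns base for base under the binding strand. The third
    strand is returned 5'->3', parallel to the binding strand.
    """
    binding = binding.lower()
    template = "".join(COMPLEMENT[base] for base in binding)
    third = template.upper()  # T over a, C over g; complement at interruptions
    interruptions = sum(base in "ct" for base in binding)
    return {
        "third": third,
        "binding": binding,
        "template": template,
        "label": f"{len(binding)}.{interruptions}",
    }


row = describe_target("gaagaaagcgaagaag")
print(row["third"])     # CTTCTTTCGCTTCTTC
print(row["template"])  # cttctttcgcttcttc
print(row["label"])     # 16.1
```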

Table 6

Locus       Accession No.   Definition

HUMH1T      M60094          Human testicular H1 histone (H1) gene
HUMHISAB    M60747          Human histone H1 (H1F3) gene
HUMHISAC    M60748          Human histone H1 (H1F4) gene
HUMHISH1T   M97755          Human histone H1T gene
HSHIS10G    X03473          Human gene for histone H1(0)
HSH11       X57129          H.sapiens H1.1 gene for histone H1
HSH12       X57130          H.sapiens H1.2 gene for histone H1

HUMH2A1B    L19778          Homo sapiens histone H2A.1b
HUMHIS2AZ   M37583          Human histone H2A.Z
HUMHISAG    M60752          Human histone H2A.1 gene
HSHISH2A    X00089          Human histone H2a
HSH2AX      X14850          Human histone H2A.X
HSH2AZ      X52317, X06885  Human histone H2A.Z

HUMHISAE    M60750          Human histone H2B.1 gene
HUMHISAF    M60751          Human histone H2B.1 gene
HSHISH2B    X00088          Human histone H2b gene
HSH2B1      X57127          H.sapiens H2B.1 histone
HSH2B2H2    X57138          Human H2B.2 and H2A.1 Histone

HUMHISH3C   M11353          Human H3.3 histone class
HUMHISH3B   M11354          Human H3.3 histone, class B
HUMHIS3PRM  M26150          Human histone H3 gene
HUMHISAA    M60746          Human histone H3.1 (H1F3) gene
HSHISH3     X00090          Human histone H3 gene
HSH31       X57128          H.sapiens H3.1 histone

HUMHIS4     M16707          Human histone H4 gene, clone FO108
HUMHISAD    M60749          Human histone H4 (H4) gene
HSHIH4      X00038          Human H4 histone gene
HSH4AHIS    X60481          H.sapiens H4/a histone
HSH4BHIS    X60482          H.sapiens H4/b histone
HSH4DHIS    X60483          H.sapiens H4/d histone
HSH4EHIS    X60484          H.sapiens H4/e histone
HSH4GHIS    X60486          H.sapiens H4/g histone
HSH4HHIS    X60487          H.sapiens H4/h histone
HSH4HIST    X67081

Table 6. The human histone gene families and their variations. Primary GenBank identification codes and description are listed for those genes of the five human histone families that were found as described in Methods.

Table 7
"HISTONE, HUMAN, REPETITIVE" Query

Sequence                Target Size   Locus     Accession No.  NID       Notes
5' THIRD-STRAND 3'      and Mispairs
5' binding strand 3'
3' template strand 5'

CTTCTTTCGCTTCTTC        16.1          N92492    N92492         g1264801  cDNA clone 301887, sw:H1D_HUMAN P16403 H1D
gaagaaagcgaagaag
cttctttcgcttcttc

TCCGTCTTCCTTTTTTCC      18.1          AA188780  AA188780       g1775871  cDNA clone 626210, gb:L19779 H2A.1
aggcagaaggaaaaaagg
tccgtcttccttttttcc

TCCTTCACTTTACTTTTACC    21.3          N91162    N91162         g1444489  cDNA clone 301811, gb:X13546 rna1 HMG-17
aggaagtgaaatgaaaatgg
tccttcactttacttttacc

CCCTTCTTTCCTGTT         15.1          W73650    W73650         g1383864  cDNA clone 344100, gb:M37583 H2A.Z
gggaagaaaggacaa
cccttctttcctgtt

CTTCTACTTTCCTCTTC       17.1          AA203446  AA203446       g1799157  cDNA clone 446556, gb:X13546 rna1 HMG-17
gaagatgaaaggagaag
cttctactttcctcttc

TTTCTTCCCTCTCTTCC       17.0          AA203446  AA203446       g1799157  cDNA clone 446556, gb:X13546 rna1 HMG-17
aaagaagggagagaagg
tttcttccctctcttcc

TTCCCTCTATTTCGTTTCC     19.2          AA203446  AA203446       g1799157  cDNA clone 446556, gb:X13546 rna1 HMG-17
aagggagataaagcaaagg
ttccctctatttcgtttcc

TTTCCCTTTTTCCCTTTTC     19.0          AA203446  AA203446       g1799157  cDNA clone 446556, gb:X13546 rna1 HMG-17
aaagggaaaaagggaaaag
tttccctttttcccttttc

All sequences are from Hillier et al (1995).

Table 7. Results of database query for human histone sequences. Full sequences were extracted from GenBank using the keywords "HUMAN", "HISTONE", and "REPETITIVE". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand binding targets. Only the putative binding regions of those genes with possible third-strand targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.

Discussion

Fewer than 10% of the various histone genes contained appropriate targets for third-strand binding, and even these targets would probably not be suitable for the purposes of this project because they are not closely packed into a repetitive arrangement. More recent analysis has shown that although the various histone genes are clustered, they are spread out almost randomly over an area of 260 kb (Albig et al 1997). Furthermore, the copy number of each gene in this area varies from 4 to 8, too low for adequate signal production in fluorescence microscopy. This information is at variance with the earlier work of Wilson et al (1977).

The goal of analyzing the human histone genes in order to find suitable targets for TISH was therefore not successful. While several targets were found, their chromosomal arrangement is not suitable for third-strand in situ hybridization. Nevertheless, the methodology employed here improved upon the previously used database search and is a step in the right direction. Further refinements in the process are still required, as only 43 sequences were found here, while Baxevanis and Landsman list 373 non-redundant human histone sequences in the major protein databases.

There are several possible reasons for this discrepancy. To begin with, their database was generated from SWISS-PROT, PIR, the Protein Data Bank (PDB), and CDS translations from GenBank, all of which are protein sequence sources. The search presented here was based solely on the GenBank and EMBL nucleotide databases. It is possible that certain published protein sequences have never been submitted to a nucleic acid database. In addition, the search presented here was initially aimed at finding repetitive sequences, so the keyword "repetitive" was added to all searches. Because genomic analysis is often not performed on cloned sequences, database entries do not always include this information, and many histone sequences were therefore not retrieved. Searching only for "histone" and "human", however, is believed to be too broad a query, one that yields inconclusive results with a low signal-to-noise ratio.

A later Entrez search of the databases for H1 histones returned 92 entries. At least 30 of these belonged to an EST project (Hillier et al 1995) and are redundant. Another 30 were sequenced cDNA clones listed only as similar to the histone genes and are probably redundant as well. When the search was narrowed to exclude "similar" sequences, the new list extracted from the database contained only 28 entries. Visual examination confirmed that only verified histone sequences were listed. To gauge the importance of this last refinement, a comparable search was run for H3 sequences: excluding "similar" entries reduced the returned list from 568 hits to 36 hits.
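The effect of this refinement can be sketched as a simple filter on definition lines. The snippet below is only an illustration of the idea, not the Entrez query that was run: the function names are hypothetical, the example definition lines are made up for demonstration, and the filter assumes that unverified entries carry the word "similar" in their definition.

```python
def exclude_similar(definitions):
    """Keep only entries whose definition line does not contain 'similar'."""
    return [d for d in definitions if "similar" not in d.lower()]


def percent_retained(before, after):
    """Share of hits that survive a refinement, in percent."""
    return round(100 * after / before, 1)


# Illustrative definition lines (not actual database output).
example = [
    "Human histone H1 (H1F3) gene",
    "cDNA clone, similar to histone H1",
    "H.sapiens H1.1 gene for histone H1",
]
print(exclude_similar(example))
print(percent_retained(92, 28))   # H1 search, 92 -> 28 entries: 30.4
print(percent_retained(568, 36))  # H3 search, 568 -> 36 hits: 6.3
```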

Nucleotide sequences extracted from the NIH histone project can be submitted to the same FINDPATTERNS search as other databases. The results should yield a more definitive list of third-strand binding targets in all the histone gene families.

Other possible human targets

H-DNA regions present possible sources of non-centromeric sequences useful for chromosomal binding by third-strands. H-DNA forms when two identical regions of d(G/A)n are mirrored across each other in close proximity on the same DNA strand (Frank-Kamenetskii and Mirkin 1995). During the 'breathing' of the DNA molecule, the two mirror regions can fold back on each other, creating a Y:R·Y motif. H-DNA regions are fast becoming an important focus of genome research since they have been found to be plentiful and are located in control regions upstream of genes (Beasty and Behem 1988). It is not known yet what, if any, biological purpose these sequences have, but they have been shown to affect the normal functioning of the cell's replication and transcription machinery.
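The mirror symmetry described above can be made concrete with a small sketch. It looks for an all-purine arm followed, after a short spacer, by the same arm read backwards on the same strand. The arm length, the maximum spacer, the function name, and the example sequence are all illustrative assumptions; none of them comes from the cited work, and the scan covers only the strand that is supplied.

```python
def find_mirror_repeats(seq, arm_len=8, max_spacer=6):
    """Return (start, spacer) for homopurine mirror repeats on one strand.

    A hit is an all-purine arm of arm_len bases at position start that is
    followed, after spacer bases, by the reversed (not complemented) arm.
    Only the shortest spacer is reported for each start position.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left = seq[start:start + arm_len]
        if set(left) - set("AG"):
            continue  # the arm must consist of purines only
        for spacer in range(max_spacer + 1):
            j = start + arm_len + spacer
            right = seq[j:j + arm_len]
            if len(right) < arm_len:
                break  # ran off the end of the sequence
            if right == left[::-1]:
                hits.append((start, spacer))
                break
    return hits


# Made-up example: arm GGAAGAAA, a 3-base spacer, then the mirrored arm.
example = "CC" + "GGAAGAAA" + "TTT" + "AAAGAAGG" + "CC"
print(find_mirror_repeats(example))  # [(2, 3)]
```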

To begin with, it has been shown that some nuclear proteins bind to simple repeat sequences like (GAA)n (Epplen et al 1996), a viable area for third-strand binding. This observation argues against the proposed biological insignificance of these interspersed elements. Furthermore, sequences implicated in H-DNA formation are known to contain similar repeats and regions that exhibit the same residue pattern. H-DNA, therefore, might represent genomic targets where nuclear proteins preferentially bind. Sridhara-Rao (1994) has shown that such sequences, when found in simian virus 40 (SV40), slow down the rate of replication of the virus. Although this is not conclusive proof, he presents a strong argument that they might have regulatory properties. Interestingly, the regulation seems to be only partial. Grabczyk and Fishman (1995) describe how certain H-DNA sequences act as transcriptional diodes, allowing transcription in one direction but not the other. Whether this is a consequence of sequence or structure is not known.

A search of the nucleotide database at the NCBI for "H-DNA and HUMAN" found three possible human H-DNA sites (Table 8). The low number of hits is surprising considering that H-DNA has been reported to occur in many regulatory regions. A GeneMBL database search using GCG yielded similarly few (<10) results. Future work will need to identify other databases that provide more concise and useful returns. One resource that might provide better results is the NCBI's GenBank Database Query engine.

Several long homopurine runs have also been identified in the genome (Table 9, top two entries). They are recorded here as they might be useful for future work in the Fresco lab. These sequences obviously provide ample sites for third-strand binding. The known sequences, which range from approximately 60 bp to over 400 bp, suggest that there are probably many more regions of the human genome yet to be sequenced that contain homopurine·homopyrimidine tracts. Similar long homopurine sequences also appear in other animal genomes, particularly in the rat, further indicating that such long stretches might be common motifs in animal chromosomes. As of this writing, there is no known biological role for these sequences.

Table 8
"H-DNA, HUMAN" Query

Sequence                        Target Size   Locus      Accession No.  NID       Notes
5' THIRD-STRAND 3'              and Mispairs
5' binding strand 3'
3' template strand 5'

CTCTCACCCCTTTTCTTGCTCCCT        24.2          HUMAAE     L28809         g454151   γ-globin, clone="hBP5"
gagagtggggaaaagaacgaggga
ctctcaccccttttcttgctccct

TTTTCTTTTCCTTTTGTCCTTC          22.1          HS1014CT   X16734         g525225   chrm 10, t(10;14)(q24;q11)
aaaagaaaaggaaaacaggaag
ttttcttttccttttgtccttc

CTTCTCTCTCTTCTCCCTTGTTCC        24.1          HS1014CT   X16734         g525225   chrm 10, t(10;14)(q24;q11)
gaagagagagaagagggaacaagg
cttctctctcttctcccttgttcc

CTCCCTCCCCTCCCCTCCCCCTCTCCTTCC  30.0          HS1014CT   X16734         g525225   chrm 10, t(10;14)(q24;q11)
gagggaggggaggggagggggagaggaagg
ctccctcccctcccctccccctctccttcc

TTTTTTTTTTTGTTTTTGTTTTGTTTTTGT  56.5          HSC1INHIB  X54486         g29534>   hum C1 inhib, LNIA
aaaaaaaaaaacaaaaacaaaacaaaaaca
tttttttttttgtttttgttttgtttttgt

TTTTTTTTTTTGTTTCTTCCTCTTTC      (cont.)
aaaaaaaaaaacaaagaaggagaaag
tttttttttttgtttcttcctctttc

Table 8. Results of database query for homopurine segments. Full sequences were extracted from GenBank using the keywords "H-DNA" and "HUMAN". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand targets. Only the putative binding regions of those genes with possible targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.

Table 9
"HOMOPURINE" Query

Sequence                        Target Size   Locus      Accession No.   NID       Notes
5' THIRD-STRAND 3'              and Mispairs
5' binding strand 3'
3' template strand 5'

tttcatctctgtgtttttctttatttcctt  393.12†       HSMDR1I    X78081          g587421   mdr1, chr 7q21.1
ccttccttccttccctccctccctcaatcc
ctccctctcttgctcttcctcttcctttcc
tttctttcctttcctttcctgaccttccct
tcctttcatttcctttcccttcccttccct
ttctttcccttcccttcccttcctttccct
tcccttcccttcctttcccttcccttccct
tcctttccctccccttcccttccctcccct
tcccttccctccccttccctccccttccct
ccccttccctcccctcccatcccctcccct
ccctttttctttttcttttttctcttctct
tctcttcctctcctctcctgtctttttctt
tttcttatcttttcttttcttgtttctttt
ctc

gaagaggaagaagaaagaggaggaggagga  125.2†        HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
aagaaggaagaagaaggaggagaagaagaa
gaggaggaggaggaagaggatgaggaggaa
gaggaggaggtggaagaggaagaggaagaa
gagga

TCTCTTTCTTTCTCTCTTTTTCTTCCTTT   29.0          HUMSFTP2A  L40486          g1280227  BMP1/mTld, btw. D8S298 & D8S5
agagaaagaaagagagaaaaagaaggaaa
tctctttctttctctctttttcttccttt

TTTTTTTTTTTTTTCTTTCTTTCTTTTCTT  34.0          HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
aaaaaaaaaaaaaagaaagaaagaaaagaa
ttttttttttttttctttctttcttttctt

TTTT                            (cont.)
aaaa
tttt

TTTTTTTCTTTTCTTTTTTCTTC         23.0          HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
aaaaaaagaaaagaaaaaagaag
tttttttcttttcttttttcttc

CTCTCTCTTTTTTTCGTC              18.1          HSMDR1I    X78081          g587421   mdr1, chr 7q21.1
gagagagaaaaaaagcag
ctctctctttttttcgtc

TTCTTTTCTTTATTTTCT              18.1          HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
aagaaaagaaataaaaga
ttcttttctttattttct

TTTTTTGTCCTTTTTTT               17.1          HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
aaaaaacaggaaaaaaa
ttttttgtccttttttt

TTCTTCCCCTCTATCCCTCT            20.1          HUMSFTP2A  L40486          g1280227  BMP1/mTld, btw. D8S298 & D8S5
aagaaggggagatagggaga
ttcttcccctctatccctct

TCTGTCCTTCCTCCCTCCGCTC          25.2          HUMSFTP2A  L40486          g1280227  BMP1/mTld, btw. D8S298 & D8S5
agacaggaaggagggaggcgag
tctgtccttcctccctccgctc

CTCCCTCCCTCTCCTCCTT             19.0          HSNG26     X54171, X53282  g35051    nucleophosmin pseudogene, LNIA
gagggagggagaggaggaa
ctccctccctctcctcctt

TCTCCCTTCCCCTTCTCC              18.0          HUMSFTP2A  L40486          g1280227  BMP1/mTld, btw. D8S298 & D8S5
agagggaaggggaagagg
tctcccttccccttctcc

† Third-strand and template strand omitted for clarity.

Table 9. Results of database query for homopurine segments. Full sequences were extracted from GenBank using the keyword "HOMOPURINE". These sequences were then queried as described in Methods by the GCG program FINDPATTERNS to find third-strand targets. Only the putative binding regions of those genes with possible targets are shown. Bases unfavorable to triplex formation are underlined. They include bases that interrupt homopurine·homopyrimidine target continuity and cytosine triplets in the third-strand.