Microarrays
TSI-Tissue specificity indices
SAGE-Serial Analysis of Gene Expression
Electronic Northern
Binary expression patterns
Human GeneAtlas HG-U133A
GeneAnnot based custom probesets
Microarrays
-
RNA source
PolyA+ RNA samples from twelve normal human tissues were purchased from Clontech (Palo Alto, CA).
This collection of major human tissues includes:
Bone marrow (catalog number: 6573-1), brain (6516-1), heart (6533-1), kidney (6538-1), liver (6510-1), lung (6524-1), pancreas (6539-1), prostate (6546-1), skeletal muscle (6541-1), spinal cord (6593-1), spleen (6542-1) and thymus (6536-1).
-
Data Normalization
Arrays were analyzed, and expression values, called signals, were calculated for each gene using Microarray Suite (MAS) version 5.0 software (Affymetrix, Santa Clara, CA) with default parameter settings. Scaling was not done via the MAS 5.0 option. Instead, the intensities of each array were log10 transformed and scaled to a constant reference value (global normalization). This reference value was the mean of all log intensities in all of the tissues.
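The global normalization described above can be sketched as follows. This is a minimal illustration, not the original pipeline: it assumes that "scaled" means shifting each array so that its mean log10 intensity equals the reference value, and the function name global_normalize and the toy intensities are hypothetical.

```python
import math

def global_normalize(arrays):
    """Log10-transform each array, then shift it so that its mean log
    intensity equals the grand mean over all arrays.
    ASSUMPTION: 'scaled' is read as an additive shift on the log scale."""
    logged = {name: [math.log10(v) for v in values]
              for name, values in arrays.items()}
    # Reference value: mean of all log intensities in all tissues.
    all_logs = [v for values in logged.values() for v in values]
    reference = sum(all_logs) / len(all_logs)
    normalized = {}
    for name, values in logged.items():
        array_mean = sum(values) / len(values)
        normalized[name] = [v - array_mean + reference for v in values]
    return normalized, reference

# Toy data (hypothetical): two arrays with two probe-sets each.
arrays = {"brain": [100.0, 1000.0], "liver": [10.0, 100.0]}
normalized, reference = global_normalize(arrays)
```

After this step every array has the same mean log intensity, so the remaining differences reflect individual probe-sets rather than overall array brightness.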
-
Expression Profiles
Duplicate measurements were obtained for twelve normal human tissues hybridized against Affymetrix GeneChips HG-U95A-E. The intensity values (shown on the y-axis) were normalized and drawn on a novel scale, which is an intermediate between log and linear scales. This enables displaying several orders of magnitude on the same graph, while emphasizing the differences between them.
-
Aggregate Expression
The bar graphs represent the averaged expression level calculated for a given gene. The calculation is done by averaging the individual profiles of all probe-sets for that gene. The detailed expression profile and the annotation for each individual probe-set are presented in the table on the web page of the given gene.
-
Variation plots
Multiple probe-sets corresponding to the given gene are included in its tissue vector calculation only if their normalized intensity levels reach a threshold in at least one tissue. The variation of included and excluded probe-sets is visualized in the x-y plane: the x-axis shows Pearson's correlation between each individual probe-set vector and the average tissue vector; the y-axis shows the relative length of an individual probe-set vector (its scalar length divided by that of the average vector). The average is shown as a black square, while individual probe-sets are depicted as colored circles.
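The two coordinates of the variation plot can be sketched as below. This is an illustration only: the helper names are hypothetical, the inclusion threshold is not applied (its value is not given here), and the average is simply taken over the probe-sets passed in.

```python
import math

def mean_vector(vectors):
    """Average tissue vector: element-wise mean over probe-set vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pearson(a, b):
    """Pearson's correlation between two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def length(v):
    """Scalar (Euclidean) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def variation_points(probe_sets):
    """One (x, y) point per probe-set: x is the correlation with the
    average vector, y is the length relative to the average vector."""
    avg = mean_vector(probe_sets)
    return [(pearson(p, avg), length(p) / length(avg)) for p in probe_sets]

# Toy data (hypothetical): two probe-sets measured in three tissues.
probe_sets = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
points = variation_points(probe_sets)
```

In this toy case both probe-sets correlate perfectly with the average but differ in relative length, so they would sit at the same x position and different y positions.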
-
Probe set annotation
The list of probes and their sequences was obtained from the Affymetrix public database (http://www.affymetrix.com/index.affx), wherein each probe-set on Affymetrix HG-U95 is constructed from 16 probes, each 25 nucleotides long. The 16 probes of each probe-set were aligned against the mRNA sequences from the most comprehensive public databases using the GeneAnnot algorithm (http://genecards.weizmann.ac.il/geneannot/). The alignment was performed using the BLAT algorithm (Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res, 12, 656-64.), allowing one mismatch along the sequence. The quality scores of a probe-set are given per specific gene, while taking into consideration the other genes to which its probes were aligned. Specificity and sensitivity parameters describe the quality of each probe-set.
-
Specificity calculation
If a probe is aligned to a single gene, its score is 1. If it is aligned to n genes, its score is 1/n. The specificity is calculated as the weighted sum of all probe scores in the set, divided by the total number of aligned probes in the set.
-
Sensitivity calculation
Probe set sensitivity is calculated by the summation of the successfully aligned probes divided by the total number of probes in the set (e.g., 16).
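The two quality scores can be illustrated with a short sketch. It assumes that, for a given gene, the "aligned probes" are those probes of the set that matched that gene; the function name probe_set_quality and the gene names GENE_A and GENE_B are hypothetical.

```python
def probe_set_quality(alignments, gene):
    """alignments: one set of matched gene names per probe in the set
    (an empty set means the probe did not align to any gene).
    Returns (specificity, sensitivity) of the probe-set for `gene`.
    ASSUMPTION: only probes that matched `gene` count as aligned."""
    hits = [genes for genes in alignments if gene in genes]
    if not hits:
        return 0.0, 0.0
    # Each aligned probe scores 1/n, where n is the number of genes it matched.
    specificity = sum(1.0 / len(genes) for genes in hits) / len(hits)
    # Fraction of the probes in the set that aligned successfully.
    sensitivity = len(hits) / len(alignments)
    return specificity, sensitivity

# Toy probe-set of 16 probes: 12 match GENE_A only, 2 match both
# GENE_A and GENE_B, and 2 do not align at all.
alignments = [{"GENE_A"}] * 12 + [{"GENE_A", "GENE_B"}] * 2 + [set()] * 2
spec, sens = probe_set_quality(alignments, "GENE_A")
```

Here the specificity is (12 + 2 x 0.5) / 14 = 13/14 and the sensitivity is 14/16.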
Tissue specificity indices (TSI):
-
Signal Quantilization
The MAS5.0 intensities, ranging on a decimal logarithmic scale from log10 30 to roughly 4, were converted into a quantile scale.
The expression data, averaged over the two replicates, were divided into 11 bins, whereby 10 equal-density bins (quantiles) spanned the values above log10 30, and an 11th zero bin included the remaining low-intensity values. Henceforth, the quantiled profiles were used in the analysis.
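The binning can be sketched as follows. This is one possible reading, not the original code: bin edges are taken from the sorted values above the threshold so that each of the 10 bins holds about the same number of values, and the function names and toy values are hypothetical.

```python
import math
from bisect import bisect_right

THRESHOLD = math.log10(30)  # values at or below this go to the zero bin

def quantile_edges(values, n_bins=10):
    """Inner edges of n_bins equal-density bins over values above THRESHOLD."""
    above = sorted(v for v in values if v > THRESHOLD)
    return [above[(k * len(above)) // n_bins] for k in range(1, n_bins)]

def quantilize(value, edges):
    """Map a log10 intensity to 0 (zero bin) or to a quantile 1..10."""
    if value <= THRESHOLD:
        return 0
    return bisect_right(edges, value) + 1

# Toy data (hypothetical): one low value and 20 values above the threshold.
values = [1.0] + [1.5 + 0.1 * i for i in range(20)]
edges = quantile_edges(values)
bins = [quantilize(v, edges) for v in values]
```

With 20 values above the threshold, each quantile from 1 to 10 receives two values and the single low value falls into the zero bin.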
-
Entropy-based tissue specificity index
These indices were defined as follows:
Entropy-based - TSIent - The quantiled profile was first normalized by dividing each intensity by the total intensity of that profile.
The TSI is then based upon Shannon's entropy (Shannon, C.E. (1963) The mathematical theory of communication, University of Illinois Press, Champaign, IL.):
where N is the number of tissues (12), and p is the normalized expression.
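An entropy-based index of this kind can be sketched as below. The exact formula is an assumption here: the sketch uses one common form, 1 - H / log2(N), where H is Shannon's entropy of the normalized profile, so that a uniform profile scores 0 and a single-tissue profile scores 1. The function name tsi_entropy is hypothetical.

```python
import math

def tsi_entropy(profile):
    """Entropy-based tissue specificity of a quantiled profile.
    ASSUMPTION: index = 1 - H / log2(N); the source formula may differ."""
    total = sum(profile)
    if total == 0:
        return None  # no expression in any tissue: index undefined
    n = len(profile)
    # Normalize by the total intensity, then compute Shannon's entropy.
    entropy = -sum((x / total) * math.log2(x / total)
                   for x in profile if x > 0)
    return 1.0 - entropy / math.log2(n)

uniform = tsi_entropy([5] * 12)           # same quantile in all 12 tissues
specific = tsi_entropy([10] + [0] * 11)   # expressed in a single tissue
```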
-
Statistical analysis of differential expression
Single-classification ANOVA with equal sample sizes (Sokal and Rohlf 2000)
was employed on the preprocessed 24 element expression vector composed of 12
tissues in duplicates. ANOVA could be applied due to the near normal shape of
the expression intensity distribution (Shmueli et al. 2003). Henceforth, we
refer to the tissue expression vector of a probeset as its 'profile'.
For each profile, the sum of the squares of the differences between the
replicates was compared with the sum of the squares of the differences between
the averages of the tissue expressions. A P-value was calculated using
the F statistic taking into account the degrees of freedom. To account for the
multiple comparison problem inherent in calculating the P-values for all 62,839
probesets, we calculated the false discovery rate of the P-values (Benjamini and Hochberg 1995). We chose a P-value cutoff of 0.0036, which
estimates a 1% error rate. This resulted in 22,936 profiles that are defined
as "differentially expressed". The remaining profiles were defined as
housekeeping when no differences were shown within replicates or between samples,
not-expressed when the expression in all samples was below the threshold, and
uncharacterized when the P-value was above the cutoff.
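The two statistical steps can be sketched as follows. This is a generic textbook illustration under stated assumptions, not the original analysis code: anova_f computes the single-classification F statistic for equal sample sizes, and benjamini_hochberg applies the standard step-up procedure to a list of P-values. Both function names and all toy numbers are hypothetical.

```python
def anova_f(replicates):
    """replicates: one tuple of replicate values per tissue (equal sizes).
    Returns (F, df_between, df_within)."""
    k = len(replicates)          # number of tissues
    n = len(replicates[0])       # replicates per tissue
    values = [v for group in replicates for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / n for group in replicates]
    # Sum of squares between tissue averages and within replicates.
    ss_between = n * sum((m - grand_mean) ** 2 for m in means)
    ss_within = sum((v - m) ** 2
                    for group, m in zip(replicates, means) for v in group)
    df_between, df_within = k - 1, k * (n - 1)
    ms_within = ss_within / df_within
    if ms_within == 0:
        return float("inf"), df_between, df_within
    return (ss_between / df_between) / ms_within, df_between, df_within

def benjamini_hochberg(p_values, q):
    """Return a list of booleans: True where the hypothesis is rejected
    at false discovery rate q (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    last_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            last_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= last_rank
    return rejected

# Toy profile with three tissues in duplicate (the real one has 12).
f_stat, df_b, df_w = anova_f([(1.0, 3.0), (5.0, 7.0), (9.0, 11.0)])
# A P-value could then be taken from the F distribution, for example
# with scipy.stats.f.sf(f_stat, df_b, df_w) if SciPy is available.
flags = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9], q=0.05)
```

For the full 24-element vector the degrees of freedom are 11 between tissues and 12 within replicates.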
-
Highest Value Referenced - TSIhvr
The profile is first normalized by dividing each intensity by the highest intensity of that profile. The TSI is then:
where N is the number of tissues (12) and x is the normalized expression vector.
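A highest-value-referenced index can be sketched as below. The formula is an assumption: the sketch uses the widely used form sum(1 - x_i) / (N - 1), which matches the definitions of N and x given above but is not confirmed by this text. The function name tsi_hvr is hypothetical.

```python
def tsi_hvr(profile):
    """Highest-value-referenced tissue specificity.
    ASSUMPTION: index = sum(1 - x_i) / (N - 1), with x_i the intensity
    divided by the highest intensity of the profile."""
    top = max(profile)
    if top == 0:
        return None  # no expression in any tissue: index undefined
    n = len(profile)
    return sum(1.0 - v / top for v in profile) / (n - 1)

housekeeping = tsi_hvr([7] * 12)           # equal in all tissues
single = tsi_hvr([10] + [0] * 11)          # one tissue only
two_tissue = tsi_hvr([10, 5] + [0] * 10)   # strong in one, weaker in another
```

Under this assumed form the index is 0 for a flat profile, 1 for a single-tissue profile, and 10.5/11 for the third example.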
-
Geometrical based - TSIgeo
The profile was first normalized by dividing each intensity by the
highest quantile (10), thus effectively representing the profile as a
point in a one unit 12-dimensional hypercube. We then compared each such
point with the diagonal vector representing the housekeeping profile:
where r is the distance to the diagonal and y is the distance to the closest axis.
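The two distances r and y can be computed as in the sketch below. It covers only the geometry and does not combine r and y into an index, since that formula is not given in the text. The function name geo_distances is hypothetical.

```python
import math

def geo_distances(profile, top_quantile=10):
    """Return (r, y) for a quantiled profile: r is the distance from the
    normalized point to the hypercube diagonal, y is the distance to the
    closest coordinate axis."""
    x = [v / top_quantile for v in profile]
    n = len(x)
    norm_sq = sum(v * v for v in x)
    # Squared length of the projection onto the unit diagonal (1,...,1)/sqrt(n).
    proj_sq = sum(x) ** 2 / n
    r = math.sqrt(max(0.0, norm_sq - proj_sq))
    # The distance to axis i is sqrt(|x|^2 - x_i^2); for non-negative
    # coordinates the closest axis is that of the largest coordinate.
    y = math.sqrt(max(0.0, norm_sq - max(x) ** 2))
    return r, y

r_flat, y_flat = geo_distances([10] * 12)          # on the diagonal
r_one, y_one = geo_distances([10] + [0] * 11)      # on an axis
```

A housekeeping-like profile lies on the diagonal (r = 0), while a single-tissue profile lies on an axis (y = 0).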
-
Gap based - TSIgap
We defined the 'gap' for each expression profile as the maximum
difference between the neighboring values in the sorted quantile vector.
When the same 'gap' was found more than once in a profile, the minimum was
taken. The 'gap' was then scaled relative to the maximum possible gap (10)
such that the index ranges from 0 to 1.
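The gap index can be sketched in a few lines. The function name tsi_gap is hypothetical; the tie rule mentioned above is not modelled, because it does not change the size of the largest gap.

```python
def tsi_gap(quantiles, max_gap=10):
    """Largest difference between neighboring values of the sorted
    quantile vector, scaled by the maximum possible gap."""
    ordered = sorted(quantiles)
    gap = max(b - a for a, b in zip(ordered, ordered[1:]))
    return gap / max_gap

flat = tsi_gap([5] * 12)                     # no gap at all
single = tsi_gap([10] + [0] * 11)            # maximum possible gap
mixed = tsi_gap([0, 1, 1, 2, 2, 3, 3, 3, 9, 9, 10, 10])
```

In the last example the sorted values jump from 3 to 9, so the gap is 6 and the index is 0.6.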
SAGE-Serial Analysis of Gene Expression
-
SAGE method
For ten normal human tissues (currently the relevant SAGE libraries are
not available for spleen and thymus, which are shown in lower case and flagged with
*), the CGAP datasets Hs.frequencies and Hs.libraries are mined for information
about the number of SAGE tags per tissue. Tags are reassigned to a Unigene
cluster and then to a particular gene by mining Hs.best_gene, Hs.best_tag and Hs_GeneData.
The expression level of a particular gene in a particular
tissue was calculated as the number of appearances of the corresponding tag
divided by the total number of tags in libraries derived from that tissue.
These fractions were then normalized by multiplying by 1.2M, and the obtained
normalized counts are presented on the same root scale as that used for the
electronic Northern pictures and experimental tissue vectors.
Please note: Currently, only associations with minimal ambiguity participate in the analysis.
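The normalization step can be sketched as below. It assumes that tag counts are pooled over all libraries of a tissue before dividing; the function name sage_level and the library sizes are hypothetical, and the root-scale transform used for display is not included.

```python
def sage_level(libraries, scale=1_200_000):
    """libraries: (tag_count, total_tags) pairs, one per SAGE library
    derived from the tissue. Returns the normalized count per 1.2M tags.
    ASSUMPTION: counts and totals are pooled across libraries."""
    tag_count = sum(count for count, _ in libraries)
    total_tags = sum(total for _, total in libraries)
    return tag_count / total_tags * scale

# Toy tissue with two libraries (hypothetical numbers): the tag appears
# 10 times among 20,000 tags and 15 times among 40,000 tags.
level = sage_level([(10, 20_000), (15, 40_000)])
```

Here 25 appearances among 60,000 tags correspond to a normalized count of about 500.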
-
Best matching tag
A tag that is the best match for that gene, and vice versa.
-
Unique tag
A tag that uniquely represents the gene and does not correspond to any other gene.
Electronic Northern
-
Electronic Northern method
For the shown set of normal human tissues, NCBI's Unigene dataset Hs.data
is mined for information about the number of unique clones
per gene per tissue. Clones are assigned to particular tissues by
applying data-mining heuristics to Unigene's library information file Hs.lib.info.
Electronic expression results were calculated by dividing
the number of clones per gene by the number of clones per tissue.
They were then normalized by multiplying by 1M, and the obtained
normalized counts are presented on the same root scale as the experimental
tissue vectors. This scale (shown on the y-axis) is an intermediate between
log and linear scales. This enables displaying several orders of magnitude
on the same graph, while emphasizing the differences between them.
Binary expression patterns
-
Binary expression patterns method
Arbitrary expression profiles are also presented in binary pattern
form when possible, with at most 5 unique binary patterns shown for
each gene. For each expression profile, all entries above a defined
relative cutoff (termed 'gap') receive the value of 1, are represented
by black, and are considered "over expressed"; those below
receive the value of 0, are represented by white, and are considered "under expressed".
Various binary patterns in different tissues are shown per gene, with their
counts on the left. (The grey stripes show undefined binary patterns.)
Please note: "Under expression" does not always mean a lack of expression.
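The binarization and pattern counting can be sketched as follows. How the cutoff is derived is not specified above, so the sketch takes it as a parameter and assumes it is a fraction of the highest value in the profile; the function names and toy profiles are hypothetical.

```python
from collections import Counter

def binarize(profile, relative_cutoff):
    """1 for entries above the cutoff ("over expressed"), 0 otherwise.
    ASSUMPTION: the cutoff is a fraction of the profile maximum."""
    top = max(profile)
    if top <= 0:
        return None  # undefined pattern
    return tuple(1 if v > relative_cutoff * top else 0 for v in profile)

def top_patterns(profiles, relative_cutoff=0.5, limit=5):
    """Count the distinct binary patterns of a gene's profiles and keep
    at most `limit` of them, most frequent first."""
    patterns = [binarize(p, relative_cutoff) for p in profiles]
    counts = Counter(p for p in patterns if p is not None)
    return counts.most_common(limit)

# Toy gene with three profiles over three tissues (hypothetical numbers).
profiles = [[10, 1, 0], [8, 2, 1], [1, 9, 0]]
patterns = top_patterns(profiles)
```

Two of the three toy profiles share the pattern (1, 0, 0), so it is listed first with a count of 2.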
Human GeneAtlas HG-U133A
-
Human GeneAtlas HG-U133A
Reference:
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G,
Cooke MP, Walker JR, Hogenesch JB (2004) A gene atlas of the mouse and human protein-encoding
transcriptomes. Proc Natl Acad Sci U S A. 2004 Apr 20;101(16):6062-7
GeneAnnot based custom probesets
-
Expression patterns based on custom CDFs method
Custom Chip Definition Files (CDFs) based on GeneAnnot
were used for gene expression data preprocessing with the MAS5.0 absolute analysis algorithm.
This novel set of CDFs allows a gene-centered analysis of gene expression data obtained from human Affymetrix GeneChips, removing the noise from probes with ambiguous matches to gene sequences.
MAS5.0 intensities were then normalized as described above. See also the article by
Itai Yanai et al.
for details on the data normalization procedure.