Digitalizing biologic diversity: how good is
our scanner, and are we using it correctly?
Rough draft - Last revision
Reviewers - if any! - are asked to read this disclaimer carefully
By Cesare Brizio
Address: Via Chiesa Vecchia, 45 - 44028 Poggio Renatico FE -
Cladistic
analyses based on discrete character matrices (DCMs) are presently the only
universally accepted method for displaying phylogenetic data and - with the
assistance of dedicated software such as PAUP, MacClade, Phylip, Hennig86 and so
on - for analyzing phyletic relationships in both paleontology and neontology.
Cladistic,
computer-assisted analyses are completely dependent on the underlying DCMs.
Creating a DCM is a digitalization process, strictly comparable with other
sampling processes in which discrete measurements are taken, yet DCM creation
has never been analyzed in this respect. The technicalities involved in defining
ideal sampling rates, signal-to-noise ratios and the quality of the original
seem not to have touched this particular kind of hand-made scanning, although
there are hints of important analogies between automatic, computer-controlled
image scanning and the structuring and compilation of a DCM.
The
activity of generating a DCM from "raw" data obtained by observation and
measurement is unfortunately too often performed without taking into account
many rules of thumb that even now, many years after this method entered the
scientific community, are overlooked or plainly ignored. It can surely be said
that cladistics is a perfect method for a perfect world. But is our world
perfect enough? The problem with DCMs and computer-based cladistic systematics
is that they are too fashionable, and that they seem to increase the
credibility of any publication. Sadly - at least from a philosophical point of
view - the misuse of DCM-based methods puts cladistic analysis as a whole at
risk of degenerating into a non-scientific practice. The purpose of this paper
is to illustrate the main drawbacks of current practice in DCM generation, and
to cast some doubt on the entire process of using DCMs to generate phylogenetic
trees, in the hope of promoting a wider discussion within the scientific
community.
The
starting point, too often overlooked, is that a tree diagram does not in any
way add strength to the reasoning behind it: it is nothing more than a way to
display the results obtained from observing the examined specimens. The DCM has
no intrinsic force; the force is in the specimens and in the nature of the
observations and measurements taken. So a cladogram says "This is the
conclusion I can draw from my data", not "This is why I think so!". It is not
uncommon to have the impression that an author is using the cladogram like
armor to protect himself and his work from criticism, using it to prevent -
rather than to promote - discussion.
I will
show how weak the protection and reassurance offered by any DCM are, by
specifically addressing three points:
Then, by
a comparison with another digitalization process, computerized image scanning,
I will illustrate some general points about this issue.
DCMs in the genetic context: what future?
It has
been known for many years that a single point mutation can affect more than
one phenotypic expression, even in different anatomical regions. Strangely
enough, nobody seems to have grasped the devastating implication of this fact
for DCMs. As explained below, if any two columns in a DCM are covariant, this
can heavily affect the resulting phylogenetic trees. And, simply put,
especially in paleontology we do not know for sure how deeply correlated our
characters are in this respect. So all that can be done is to adopt a
reasonable degree of caution.
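As a minimal sketch of what such caution could look like in practice, the Python fragment below flags pairs of columns in a binary matrix whose states always agree or always disagree across the sampled taxa. The function name `covariant_pairs` and the small matrix are hypothetical, invented for illustration. Perfect covariance in a sample does not prove a common genetic cause; it only marks character pairs that may deserve a second look.

```python
from itertools import combinations


def covariant_pairs(matrix):
    """Return pairs of column indices whose binary states always agree
    (identical columns) or always disagree (mirror-image columns)."""
    columns = list(zip(*matrix))  # transpose: one tuple per character
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        same = all(x == y for x, y in zip(a, b))
        mirrored = all(x != y for x, y in zip(a, b))
        if same or mirrored:
            pairs.append((i, j))
    return pairs


# Hypothetical matrix: rows are taxa, columns are characters.
dcm = [
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(covariant_pairs(dcm))  # -> [(0, 1), (0, 2), (1, 2)]
```

Here the first three characters carry the same split of the taxa three times over, so together they weigh as much as three independent characters would, whether or not they are independent.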
Fundamental precautions: how can I fill my matrix with useful, meaningful data?
Let's
define the conditions under which a DCM can be expected to work properly:
Leaving
aside subtle and purposeful manipulations of DCMs to make them lead to
predetermined results (what I define as the "political use" of DCMs), there is
always the possibility of bona fide errors. For the sake of our souls, we
should take all the precautions illustrated above.

Comparison among matrices: under which conditions does it make sense?
Discrete
Character Matrices - as used in present-day practice - are intrinsically
incomparable. Rather than trying to overcome this limitation, scientists seem
to gladly accept this "incomparability bonus" as one more instrument to limit,
or prevent, information exchange. The ever-increasing competition for primacy
among scientific institutions requires research materials to be treated like
well-kept secrets even well after their publication: the purpose of scientific
papers themselves is increasingly becoming personal and institutional
gratification rather than the diffusion of scientific knowledge. Results have
to be "publicly known", not "publicly available".
In my
opinion a compromise has to be made, for the sake of science, between
competing institutions and the scientific community as a whole. In the DCM
field, an attempt to enhance comparability - preserving the most jealously kept
data while favoring the widespread circulation of general information in a
standard reference frame - could be an important and risk-free effort.
In any
definition of "scientific method" the stress falls on REPLICABILITY. The
replicability of the phylogenetic results derived from a given DCM depends
upon the following conditions:
Only in
this case will the results be reliably replicated, because this is the only
way to preserve the arithmetical integrity of the data flow throughout the
process. Strictly speaking, any row or column added to or removed from a DCM
could lead to a phylogenetic tree different from the one generated from the
original DCM: for a given method of tree generation, there should be a
univocal, direct relation between the DCM and the phylogenetic tree(s)
generated. So the inclusion of portions of one DCM in another does not at all
imply endorsement of the phylogenetic hypotheses derived from the original DCM.
Replicability
in science is usually considered an indispensable requirement for the
comparability of results among scientists. As we cannot expect all scientists
to work over and over on the same DCMs, scientific practice should provide
methodological instruments capable of ensuring a reasonable degree of
comparability among DCMs. Unfortunately, this is not the case.
In fact,
it is common practice for each scientist to generate his own DCMs, which may
incorporate a number of rows and columns from previously published DCMs,
including those of other authors.
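To make the comparison problem concrete, here is a small Python sketch that, given two matrices, reports which taxa and characters they share and where their codings disagree. It assumes that both authors label taxa and characters identically, which is precisely the convention that current practice lacks; the function name `overlap` and all taxon and character labels are hypothetical.

```python
def overlap(a, b):
    """Compare two matrices stored as {taxon: {character: state}}.

    Returns (shared taxa, shared characters, conflicting cells)."""
    shared_taxa = sorted(set(a) & set(b))
    chars_a = {c for row in a.values() for c in row}
    chars_b = {c for row in b.values() for c in row}
    shared_chars = sorted(chars_a & chars_b)
    conflicts = [
        (t, c)
        for t in shared_taxa
        for c in shared_chars
        if c in a[t] and c in b[t] and a[t][c] != b[t][c]
    ]
    return shared_taxa, shared_chars, conflicts


# Two hypothetical matrices from different authors.
first = {
    "taxon_A": {"char_1": 0, "char_2": 1},
    "taxon_B": {"char_1": 1, "char_2": 1},
}
second = {
    "taxon_A": {"char_1": 1, "char_3": 0},
    "taxon_C": {"char_1": 1, "char_3": 1},
}
print(overlap(first, second))
# -> (['taxon_A'], ['char_1'], [('taxon_A', 'char_1')])
```

Even in this toy case the two matrices intersect in a single cell, and they disagree on it.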
But why
should we strive to make our matrices comparable?
Because
this could promote an unprecedented effort towards some sort of matrix
integration, which in turn would enhance the quality of information exchange
among scientists. But in the context outlined above, the only way to set a
common and firm ground under any comparison effort lies in STANDARDIZATION, a
process involving METHODS, ROWS, and COLUMNS.
As for
METHODS, the scientific community could not endorse any particular software
product; rather, an operational protocol should be defined, bringing into
sharper focus the acceptable standard rules for generating a standardized
matrix.
As for
ROWS, entirely predefining the set of species to be included in all
"standardized" analyses is absolutely impractical, as it would preclude the
inclusion of new taxa. All that can be done is to require a set of "most
representative taxa" for any given natural group, to be included in every
analysis pertaining to any (putative) member of that group.
As for
COLUMNS, the matter is even more intricate, because imposing the presence of
predefined characters (that is, predefined body regions) in every matrix is
even less justifiable - on logical grounds - than imposing the inclusion of any
particular species. Perhaps requiring some quite general character for each
skeletal zone (appendicular, cranial, axial) could be accepted into practice
quite easily.
Beyond
that, a much more difficult task has to be performed: we are not talking about
empty matrix cells! Rather than defining a standard empty frame, the aim of
the whole process is to create a common, undisputed general aggregate of data
that every "standardized" DCM dataset should intersect. After all, we are
talking about a general frame of characters for a limited number of reference
species in each natural group, and general acceptance of this practice
shouldn’t be impossible to obtain.
Proposal for the constitution of a Standardized DCM Worldwide Dataset
Let's
postulate some sort of standardizing authority, a project that I will call
"Standardized DCM Worldwide Dataset" (SDWD for short), something like an
extended Internet-based workgroup. Anyone, myself included, could start this
process, for example by publishing on the Internet ordered subsets of data
matrices collected from the literature. But the involvement of reference
scientific institutions is essential if there is to be any hope of a usable
final result.
A very
extensive, if not particularly advanced, use of the Internet is surely the most
practical instrument for managing the SDWD and also for structuring it. A
collaborative portal should therefore be expressly set up, with all the
security provisions needed to ensure that - particularly in the structuring
phase - only the participating institutions have access to the reserved area
for SDWD members, where votes are cast, while the SDWD application forms and
the general information pages are entirely public.
The
entire process could start from a university or a cultural institution trying
to involve zoological, paleontological and botanical societies in a survey or
a call for proposals on this subject. Three conditions will be posed to the
participants:
A range
for the number of participating institutions should be predefined: a minimum
number below which the initiative has no hope of representing the scientific
community, and a maximum number above which standardization efforts would
become too complicated. Since the concept of "majority" is involved in many of
the strategic decisions to be taken, an even number of participants could give
rise to complications. Provisions should be made, such as tie-breaking
procedures or some sort of external referee to resolve 50/50 situations.
The
process described below would start as soon as the minimum number of
participants is reached, or better, within a predefined time interval after
that, to allow other institutions - not exceeding the preset maximum number -
to join the group.
Then a
schedule of the steps that follow has to be set, with reasonable deadlines for
the accomplishment of each task. The schedule has to be published (via the
Web) and accepted by the majority of participating institutions. A participant
that fails to meet a deadline is excluded from that particular step, not from
the entire project. Timely participants should acquire some privilege in the
decision-making phases that follow.
The
first operational step is the definition of the reference species, which should
be performed by subcommittees of the involved societies, preferably one
subcommittee for each “natural group” (phylogenetic group!) involved.
A first
version of the lists (called a “Preliminary” or “extended” version) should be
made available on the web to the participating institutions, and each species
in a list should be voted on for inclusion in the final version of the list
(what we could call the “Official” or “restricted” version).
In a
very similar way the selection of the characters could be performed by means of
publication/voting via Internet.
During
these processes for the "rows" and "columns" definition,
another subcommittee should define the ethical code of SDWD, to be published as
soon as possible, and the policies of the SDWD workgroup, with particular
reference to the "Standard Rules in DCM Generation".
The
painstaking effort of filling the cells (defined in the preceding steps) would
follow: the empty matrices should be made available on the web to the
participating institutions' subcommittees and proposals of filled matrices
should be submitted to the SDWD by e-mail.
As soon
as a matrix is received, it is published on the web site and made accessible
only to SDWD participants.
As soon
as the deadline for the submission of matrices is reached, the matrices are
examined. All cells on which the submissions agree unanimously automatically
become part of the standard reference data set. Where the submitted matrices
differ, the decision is automatic as long as a majority of the participants has
opted for one solution. A tie will be treated according to the rules defined in
the preliminary phases.
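The cell-by-cell decision rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: all submissions are taken to have the same shape, "majority" is read as a strict majority of the submissions, and cells without one are merely flagged so that the separately agreed tie-breaking procedure can handle them. The name `merge_submissions` and the sample data are hypothetical.

```python
from collections import Counter


def merge_submissions(submissions):
    """Cell-by-cell vote over equally shaped matrices.

    Returns (consensus, unresolved). consensus holds the winning state of
    each cell, or None where no state has a strict majority; unresolved
    lists the (row, column) positions of those cells."""
    n_rows, n_cols = len(submissions[0]), len(submissions[0][0])
    consensus, unresolved = [], []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            votes = Counter(m[r][c] for m in submissions)
            state, count = votes.most_common(1)[0]
            if count * 2 > len(submissions):  # strict majority
                row.append(state)
            else:
                row.append(None)
                unresolved.append((r, c))
        consensus.append(row)
    return consensus, unresolved


# Four hypothetical submissions for a 2 x 2 matrix.
submitted = [
    [[0, 1], [1, 0]],
    [[0, 1], [1, 1]],
    [[0, 1], [0, 1]],
    [[0, 0], [0, 0]],
]
consensus, unresolved = merge_submissions(submitted)
print(consensus)   # -> [[0, 1], [None, None]]
print(unresolved)  # -> [(1, 0), (1, 1)]
```

With four participants, two of the four cells end in a 2/2 split, which illustrates why an even number of participants calls for a tie-breaking rule.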
The
final stage is the publication, in a final and easily downloadable text format,
of the standardized DCM data sets that the participants are committed to use
for the generation of standard, SDWD-endorsed DCMs.
In
parallel, another activity could be initiated, namely the collection of every
DCM published in the literature, to be stored in a relational database along
with reference data. This work, too, could do much for the scientific
community.
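One possible shape for such a database, sketched with Python's built-in `sqlite3` module: one table for the publications and one row per matrix cell. The table names, column names and helper functions are assumptions made for illustration, not part of any existing system, and the citations are placeholders.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two illustrative tables and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE publication (
            id INTEGER PRIMARY KEY,
            citation TEXT NOT NULL
        );
        CREATE TABLE cell (
            publication_id INTEGER NOT NULL REFERENCES publication(id),
            taxon_name TEXT NOT NULL,
            char_name TEXT NOT NULL,
            state TEXT NOT NULL,
            PRIMARY KEY (publication_id, taxon_name, char_name)
        );
    """)
    return conn


def store_matrix(conn, citation, matrix):
    """Store a matrix given as {taxon: {character: state}}."""
    cur = conn.execute(
        "INSERT INTO publication (citation) VALUES (?)", (citation,)
    )
    pub_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO cell VALUES (?, ?, ?, ?)",
        [(pub_id, taxon, char, str(state))
         for taxon, row in matrix.items()
         for char, state in row.items()],
    )
    return pub_id


def codings(conn, taxon, char):
    """All published states of one character in one taxon, with source."""
    return conn.execute(
        "SELECT p.citation, c.state FROM cell c "
        "JOIN publication p ON p.id = c.publication_id "
        "WHERE c.taxon_name = ? AND c.char_name = ? ORDER BY p.id",
        (taxon, char),
    ).fetchall()


db = open_db()
store_matrix(db, "hypothetical paper 1", {"taxon_A": {"char_1": 0}})
store_matrix(db, "hypothetical paper 2", {"taxon_A": {"char_1": 1}})
print(codings(db, "taxon_A", "char_1"))
```

Storing one row per cell, rather than one blob per matrix, is what makes it possible to ask how a given character has been coded for a given taxon across the literature.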
CONCLUSION - DIGITALIZING DIVERSITY: what is wrong with our scanner?
The
generation of a DCM containing the data collected from the observation and
measurement of the examined specimens is an apparently simple activity, which
sometimes seems to be managed like a manual craft in which each scientist
expresses his own peculiar talent (from time to time, character descriptions
appear that defy comprehension!).
In a
comparison with electronic image scanning devices, no practical automatic
digitizer could work in its own idiosyncratic way. The output must meet preset
requirements, such as compliance with standard formats. A digitized image
would be of no help if only I could see it on my computer, while my
colleagues could not decipher the information contained in the image file, and
I could not read theirs.
A
properly working image scanner should sample the image in a consistent,
unchanging way. Output quality may vary with the quality of the original, as
well as with that of the scanner, but the rules of output generation should
never change. Every black and white scanner should read "1" on a black pixel
and "0" on a white one, and allow a white/black threshold to be set.
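The rule is simple enough to be written down in full. The Python sketch below assumes 8-bit grayscale input (0 for black, 255 for white) and a hypothetical function name, `scan`. The point is that the rule itself never changes; only the threshold is a declared, adjustable parameter.

```python
def scan(gray_rows, threshold=128):
    """Binarize a grayscale image: 1 for pixels darker than the threshold,
    0 otherwise. Input values are assumed to run from 0 (black) to 255
    (white)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]


image = [
    [0, 200],
    [130, 90],
]
print(scan(image))                 # -> [[1, 0], [0, 1]]
print(scan(image, threshold=150))  # -> [[1, 0], [1, 1]]
```

Two operators who state their threshold will always be able to reconcile their outputs, because everything else about the procedure is fixed.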
In this
respect, all DCMs look similar: rows, columns, zeros, ones... This formal
congruence among matrices seems to match well the regularity in scanner
response and seems to suggest a stringent, indisputable cause-effect link
between the original and the character's value.
This
"unavoidable look" has an important psychological effect, as the
reader tends to think that - as long as there is a one-to-one relation between
the specimens' characters and the matrix cells - any conclusion drawn from the
matrix is the only possible conclusion that could be drawn from the specimens,
or that at least the matrix demonstrates the conclusions drawn.
As we
have seen above, this is very far from true: the scientist does not digitize
indisputable measurements, but digitizes disputable interpretations that are
themselves conclusions.
Digital matrices
do not sit at the same level in automatic image scanning and in phylogenetic
analysis: in the former case they do in fact constitute a digital surrogate of
the original, according to the chosen sampling rate. Suppose I shoot a digital
photo. Within the particular range sampled (audio or visual), any process
performed on the digital matrix has the same effect as if it were performed on
the original. This is how I can digitally paint blue a white square whose
image I scanned: the result is the same as if I had scanned a blue square.
On the
other hand, in the DCM case, the matrix has no relation with the original
except the relations arbitrarily set by the scientist. This is a much more
complex kind of "digitalization", and compares not to a digital photo but to a
painting: a painting is my representation of the "digital image" that my eyes
sent to my brain.
Another,
more stringent example, this time from audio, is hearing a bird song. Excluding
digital or analog recording, is there a way I can make the music I heard
available to other people? I can write the music down, according to the rules
of musical notation, so that everybody can play, on his own instrument, the
music I heard. By doing this I comply with predefined, universally accepted
standard rules: the ones that are missing in cladistic practice.
More or
less, this is how DCMs should work, as a best approximation of the observed
phenomenon, with the big difference that we have universally accepted standard
rules for transcribing music, but still have no such rules for DCM generation,
and everyone, at present, is playing his own music from the same bird song.
Which kinds of measurements are indisputable? Length
and weight, for example. And can we digitize them? Sure we can!
Can we
store on the Internet 3D scans of every bone of every specimen in the museums
of the world? Possible, but not easily feasible!
What I
am trying to say is that the only practical way to make our cladograms
comparable is to lay down the "universally accepted rules" for the
digitalization of discrete characters.
These
rules are represented by a standardized set of rows and columns for each major
natural group, representing what the scientific community as a whole regards as
"indisputable", according to the "Standard Rules in DCM Generation" outlined
above.
I think
that such an effort should at least be attempted, for the sake of science. This
could be the only way to improve the quality and the speed of our scanner.