Digitalizing biologic diversity: how good is our scanner, and are we using it correctly?

 

Rough draft - Last revision 21 January 2005

Reviewers - if any! - are asked to read this disclaimer carefully

 

By Cesare Brizio

Address: Via Chiesa Vecchia, 45 - 44028 Poggio Renatico FE - ITALY

 

 

Introduction

Cladistic analyses based on discrete character matrices (DCMs) are presently the only universally accepted method for displaying phylogenetic data and - with the assistance of dedicated software like PAUP, MacClade, Phylip, Hennig86 and so on - for analyzing phyletic relationships in both paleontology and neontology.

Computer-assisted cladistic analyses are completely dependent on the underlying DCMs. Creating a DCM is a digitalization process, strictly comparable with other sampling processes in which discrete measurements are taken, but DCM creation has never been analyzed in this respect. The technicalities involved in defining ideal sampling rates, signal-to-noise ratios and the quality of the original seem not to have touched this particular kind of hand-made scanning, yet there are hints of important analogies between automatic, computer-controlled image scanning and the structuring and compilation of DCMs.

The activity of generating a DCM from "raw" data obtained by observation and measurement is unfortunately too often performed without taking into account many rules of thumb that even now, after this method has been in use in the scientific community for many years, are overlooked or plainly ignored. It can surely be said that cladistics is a perfect method for a perfect world. But is our world perfect enough? The problem with DCMs and computer-based cladistic systematics is that they are too fashionable, and that they seem to increase the credibility of any publication. Sadly - at least from a philosophical point of view - the misuse of DCM-based methods puts cladistic analysis as a whole at risk of degenerating into a non-scientific practice. The purpose of this paper is to illustrate the main drawbacks of current practice in DCM generation, and to cast some doubt on the entire process of using DCMs to generate phylogenetic trees, in the hope of promoting a wider discussion within the scientific community.

The starting point, too often overlooked, is that this kind of diagram does not add any strength to the reasoning behind it: the tree diagram is nothing more than a way to display the results obtained from the observation of the examined specimens. The DCM has no intrinsic force; the force is in the specimens and in the nature of the observations and measurements taken. So it's a "This is the conclusion I can draw from my data", not a "This is why I think so!". It's not uncommon to have the impression that the author is using the cladogram like armor to protect himself and his work from criticism, using it to prevent - rather than to promote - discussion.

I will show how weak the protection and reassurance offered by any DCM are, by specifically addressing three points:

  • The present state of our knowledge casts doubts on the possibility of obtaining accurate phylogenetic reconstructions
  • The widespread misuse of DCMs: the scientists' attitude is very relaxed, and frequently nobody seems to care about very fundamental precautions
  • As long as no effort is put into standardization of DCMs, the non-comparability of results among scientists puts the cladistic methods at the fringe of the common concept of science

Then, by a comparison with another digitalization process, computerized image scanning, I will illustrate some general points about this issue.

DCMs in the genetic context: what future?

It has been known for many years that a single point mutation can affect more than one phenotypic trait, even in different anatomical regions. Strangely enough, nobody seems to have caught the devastating implication of this fact for DCMs. As explained below, if any two columns in a DCM are covariant, this can heavily affect the resulting phylogenetic trees. And, simply put, especially in paleontology we don't know for sure how deeply correlated our characters are in this respect. So all that can be done is to adopt a reasonable degree of caution.
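As a first, purely mechanical check, one can at least look for pairs of columns whose states are perfectly linked in the matrix at hand. The following Python sketch is only an illustration under my own assumptions: the matrix is a list of equal-length strings (one per taxon), "?" marks a missing observation, and the function name covariant_pairs is hypothetical. A perfect link across a handful of taxa may of course arise by chance or by common descent, so the output is a list of suspects, not a verdict.

```python
from itertools import combinations

def covariant_pairs(matrix, missing="?"):
    """Return pairs of column indices whose states are perfectly linked.

    Two columns count as linked when, over the taxa scored for both,
    each state of one column always co-occurs with exactly one state
    of the other (identical or mirrored columns, for example).
    """
    n_chars = len(matrix[0])
    pairs = []
    for i, j in combinations(range(n_chars), 2):
        scored = [(row[i], row[j]) for row in matrix
                  if row[i] != missing and row[j] != missing]
        # Columns that never vary carry no information about linkage.
        if len({a for a, _ in scored}) < 2 or len({b for _, b in scored}) < 2:
            continue
        forward, backward = {}, {}
        linked = True
        for a, b in scored:
            if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
                linked = False
                break
        if linked:
            pairs.append((i, j))
    return pairs

# Hypothetical matrix: 4 taxa (rows) by 4 characters (columns).
dcm = [
    "0010",
    "0011",
    "1100",
    "1101",
]
print(covariant_pairs(dcm))  # [(0, 1), (0, 2), (1, 2)]
```

In this toy matrix, columns 0 and 1 are identical and column 2 is their mirror image, so the three of them may be recording a single underlying change three times over.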

 

Fundamental precautions: how can I fill my matrix with useful, meaningful data?

Let's define the conditions under which a DCM can be expected to work properly:

  • When choosing characters, we should find a good balance between "general" characters (characters relative to a wider group of organisms, widening the scope of our analysis) and "particular" characters (narrowing the scope) to obtain the desired focus and perspective for our work.
  • We should absolutely avoid any unnecessary complication, remembering that the probability of making errors grows with the size of the matrix. This is why we should ban dispensable and uninformative taxa and characters from our matrices.
  • We should be absolutely positive about the polarity of the characters selected, unless the aim of the analysis is the determination of the polarity of some character in the taxa being examined.
  • The examined characters should be - if possible - unequivocally and unambiguously observable in every specimen examined, or at least in the great majority of specimens of the different taxa we have chosen. On the other hand, we shouldn't exclude interesting taxa from the analysis just because some of the characters are not observable in the respective specimens.
  • For EVERY character, or group of characters, the customary questions should be posed to decide whether reductive or composite coding has to be used: that is, stress will be put on the search for covariant characters, starting from a very reductive coding strategy, and then making it more composite whenever covariant characters are found.
  • Regardless of the coding strategy, we should aim at a matrix WITHOUT MULTISTATE CHARACTERS, possibly coded entirely as absent/present.
  • Once our matrix is complete, we shouldn't hesitate to analyze its consistency and its information potential by comparing it with randomly generated matrices of identical size. However, if we performed the steps above correctly, this could be unnecessary.
  • Different versions of our matrix should be generated, both by bootstrapping algorithms (which resample the characters with replacement) and by changing the row order. It's the group of matrices thus generated that should be analyzed, not just the original matrix.
  • We should use many different analysis methods among those available from our software, and for each method we should generate the most parsimonious cladograms.
  • All the cladograms generated will be used for the determination of the maximum consensus tree(s).
  • As a final consideration, we should avoid any unnecessary destabilization of the zoological nomenclature, made just for the author's personal gratification.
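The resampling step in the list above can be sketched in a few lines of Python. This is a minimal illustration under my own assumptions, not a description of what any particular package does: the matrix is a dictionary from taxon name to a string of states, the function name bootstrap_replicates is hypothetical, and each replicate both resamples the characters with replacement (the usual phylogenetic bootstrap) and shuffles the order of the rows.

```python
import random

def bootstrap_replicates(matrix, n_replicates, seed=0):
    """Build pseudo-replicate matrices from a DCM.

    matrix: dict mapping taxon name -> string of character states,
            all strings of equal length.
    Each replicate resamples the characters (columns) with replacement
    and lists the taxa (rows) in a freshly shuffled order.
    """
    rng = random.Random(seed)  # fixed seed, so the run can be repeated
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    replicates = []
    for _ in range(n_replicates):
        # Draw as many column indices as there are characters.
        columns = [rng.randrange(n_chars) for _ in range(n_chars)]
        order = taxa[:]
        rng.shuffle(order)
        replicates.append(
            {taxon: "".join(matrix[taxon][c] for c in columns)
             for taxon in order}
        )
    return replicates

# Hypothetical matrix: 4 taxa, 5 binary characters.
dcm = {
    "Taxon A": "00101",
    "Taxon B": "00111",
    "Taxon C": "11000",
    "Taxon D": "11010",
}
for replicate in bootstrap_replicates(dcm, n_replicates=3):
    print(replicate)
```

Each replicate would then be passed to the tree-building software, and the resulting cladograms summarized in the consensus step. Fixing the seed is a deliberate choice: it lets a colleague regenerate exactly the same set of matrices.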

Leaving aside subtle and purposeful manipulations of DCMs to make them lead to predetermined results (what I define as the "political use" of DCMs), there is always the possibility of honest errors. For the sake of our souls, we should take all the precautions listed above.

 

 

Comparison among matrices: under which conditions does it make sense?

Discrete Character Matrices - as used in present-day practice - are intrinsically incomparable. Rather than trying to overcome this limitation, scientists seem to gladly accept this "incomparability bonus" as one more instrument to limit, or prevent, information exchange. The ever-increasing competition for primacy among scientific institutions requires research materials to be treated like well-kept secrets even long after their publication: the purpose of scientific papers themselves is increasingly becoming personal and institutional gratification rather than the diffusion of scientific knowledge. Results have to be "publicly known", not "publicly available".

In my opinion a compromise has to be made, for the sake of science, between competing institutions and the scientific community as a whole. In the DCM field, an attempt to enhance comparability - protecting the most jealously kept data while favoring the widespread circulation of general information in a standard reference frame - could be an important and risk-free effort.

In any definition of "scientific method" the stress falls on REPLICABILITY. The replicability of the phylogenetic results derived from a given DCM depends upon the following conditions:

  • the same processing software should be used in all the analyses
  • the same software options should be activated in all the analyses
  • no row or column of the matrix should be altered in any of the analyses
  • the same sequence of operations (e.g. bootstrapping the dataset a given number of times) should be performed in all the analyses

Only in this case will the results be reliably replicated, because this is the only way to preserve the arithmetical integrity of the data flow throughout the process. Strictly speaking, any row or column added to or removed from a DCM could lead to a phylogenetic tree different from the one generated from the original DCM: for a given method of tree generation, there should be a univocal, direct relation between the DCM and the phylogenetic tree(s) generated. So the inclusion of portions of one DCM in another does not at all imply endorsement of the phylogenetic hypotheses derived from the original DCM.
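The four conditions can be made checkable by publishing, next to each cladogram, a small record of what was actually run. The Python sketch below is one possible way of doing it, under my own assumptions: the function names, the software name and the option names are all hypothetical placeholders, and the SHA-256 fingerprint is simply a convenient way to detect that a row or column has been altered, added, removed or reordered.

```python
import hashlib
import json

def matrix_fingerprint(taxa, rows):
    """SHA-256 digest of the matrix in its exact row and column order."""
    text = "\n".join(f"{taxon}\t{states}" for taxon, states in zip(taxa, rows))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def analysis_manifest(taxa, rows, software, options, steps):
    """Describe one analysis so that a colleague can repeat it exactly."""
    return json.dumps(
        {
            "matrix_sha256": matrix_fingerprint(taxa, rows),
            "software": software,   # same processing software
            "options": options,     # same software options
            "steps": steps,         # same sequence of operations
        },
        sort_keys=True,
        indent=2,
    )

# Hypothetical data and settings, for illustration only.
taxa = ["Taxon A", "Taxon B", "Taxon C"]
rows = ["0010", "0111", "1100"]
print(analysis_manifest(
    taxa, rows,
    software="ExampleCladisticsTool 1.0",
    options={"criterion": "parsimony"},
    steps=["bootstrap x100", "heuristic search", "strict consensus"],
))
```

Two analyses whose manifests are identical satisfy all four conditions; if the fingerprints differ, the matrices differ, and the resulting trees should not be presented as a replication.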

Replicability in science is usually considered an indispensable requirement for the comparability of results among scientists. As we cannot expect all scientists to work over and over on the same DCMs, scientific practice should provide methodological instruments capable of ensuring a reasonable degree of comparability among DCMs. And, unfortunately, this is not the case.

In fact, it is common practice for each scientist to generate his own DCMs, which may incorporate a number of rows and columns from previously published DCMs, including those of other authors.

But why should we strive to make our matrices comparable?

Because this could promote an unprecedented effort towards some sort of matrix integration, and this in turn would enhance the quality of information exchange among scientists. But in the context outlined above, the only way to put common and firm ground under the feet of any comparison effort lies in STANDARDIZATION, a process involving METHODS, ROWS, and COLUMNS.

As for METHODS, the scientific community could not endorse any particular software product; rather, an operational protocol should be defined, bringing into sharper focus the accepted standard rules for the generation of a standardized matrix.

As for ROWS, the hypothesis of entirely predefining the set of species to be included in all the "standardized" analyses is absolutely impractical, as it would preclude the inclusion of new taxa in the analyses. All that can be done is to impose some set of "most representative taxa" for any given natural group, to be included in every analysis pertaining to any (putative) member of that group.

As for COLUMNS, the matter is even more intricate, because imposing the presence of predefined characters (that is, predefined body regions) in every matrix is even less justifiable - on logical grounds - than imposing the inclusion of any particular species. Perhaps the imposition of some quite general character for each skeletal zone (appendicular, cranial, axial) could be accepted into practice quite easily.

Beyond that, a much more difficult task has to be performed: we are not talking about empty matrix cells! Rather than the definition of a standard empty frame, the aim of the whole process is the creation of a common, undisputed general aggregate of data that every "standardized" DCM dataset should intersect. After all, we are talking about a general frame of characters for a limited number of reference species in each natural group, and general acceptance of this practice shouldn’t be impossible to obtain.


Proposal for the constitution of a Standardized DCM Worldwide Dataset

Let's postulate some sort of standardizing authority, a project that I will call the "Standardized DCM Worldwide Dataset" (SDWD for short), something like an extended Internet-based workgroup. Anyone, myself included, could start this process, e.g. by publishing on the Internet ordered subsets of data matrices collected from the literature. But the involvement of reference scientific institutions is essential if there is to be any hope of a usable final result.

A very extensive, if not particularly advanced, use of the Internet is surely the most practical instrument for managing the SDWD, and also for structuring it. Thus, a collaborative portal should be set up expressly, with all the security provisions needed to ensure that - particularly in the structuring phase - only the participating institutions have access to the reserved area for SDWD members, where votes are cast, while the SDWD application forms and the general information pages are entirely public.

The entire process could start from a university or a cultural institution trying to involve zoological, paleontological and botanical societies in a survey or a call for proposals on this subject. Three conditions will be posed to the participants:

  • commitment to provide SDWD with all the DCM information available within the participating institution
  • commitment to use the SDWD standard, as soon as it is defined for any particular field, in the publications issued or endorsed by the participating institution
  • commitment to provide partnership, consulting and assistance to SDWD, e.g. by contributing to the definition of the reference species and reference characters

A range for the number of participating institutions should be predefined: a minimum number of participants below which the initiative has no hope of representing the scientific community, and a maximum number above which standardization efforts would become too complicated. Since the concept of "majority" is involved in many of the strategic decisions to be taken, an even number of participants could give rise to complications. Provisions should be made, like the definition of tie-breaking procedures or some sort of external referee to resolve 50/50 situations.

The process described below would start as soon as the minimum number of participants is reached or, better, after a predefined time interval once it has been reached, to allow other institutions - not exceeding the preset maximum number - to join the group.

Then a schedule of the steps that follow has to be set, with reasonable deadlines for the accomplishment of each task. The schedule has to be published (via the Web) and accepted by the majority of participating institutions. The failure of a participant to meet a deadline results in the exclusion of that participant from that particular step, not from the entire project. Timely participants should acquire some privilege in the decision-making phases that follow.

The first operational step is the definition of the reference species, which should be performed by subcommittees of the involved societies, preferably by one subcommittee for each “natural group” (phylogenetic group!) involved.

A first version of the lists (called a “Preliminary” or “extended” version) should be made available on the web to the participating institutions, and each species in the list should be voted on for inclusion in the final version of the list (what we could call the “Official” or “restricted” version).

In a very similar way the selection of the characters could be performed by means of publication/voting via Internet.

During these processes for the "rows" and "columns" definition, another subcommittee should define the ethical code of SDWD, to be published as soon as possible, and the policies of the SDWD workgroup, with particular reference to the "Standard Rules in DCM Generation".

The painstaking effort of filling the cells (defined in the preceding steps) would follow: the empty matrices should be made available on the web to the participating institutions' subcommittees and proposals of filled matrices should be submitted to the SDWD by e-mail.

As soon as a matrix is received, it is published on the web site and made accessible only to SDWD participants.

As soon as the deadline for the submission of matrices is reached, the matrices are examined. All the cells that were scored identically in every submission automatically become part of the standard reference data set. Where the submitted matrices differ, the decision is automatic as long as a majority of the participants has opted for one solution. A draw will be treated according to the rules defined in the preliminary phases.
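The examination step is simple enough to be automated. The following Python sketch shows one way it could work, under assumptions of my own: each submission is a dictionary from a (taxon, character) pair to a state, a "majority" means more than half of all the submissions received, and the name merge_submissions is hypothetical. Cells without a majority are not decided by the code; they are set aside for the tie-breaking rules.

```python
from collections import Counter

def merge_submissions(submissions):
    """Merge filled matrices submitted by the participants.

    submissions: list of dicts mapping (taxon, character) -> state.
    Returns (accepted, unresolved): cells decided unanimously or by a
    strict majority of all submissions, and cells left to the
    tie-breaking rules, with their vote counts.
    """
    cells = set().union(*(s.keys() for s in submissions))
    accepted, unresolved = {}, {}
    for cell in sorted(cells):
        votes = Counter(s[cell] for s in submissions if cell in s)
        state, count = votes.most_common(1)[0]
        if count * 2 > len(submissions):
            accepted[cell] = state
        else:
            unresolved[cell] = dict(votes)
    return accepted, unresolved

# Four hypothetical participants scoring one taxon for two characters.
submissions = [
    {("Taxon A", "char 1"): "1", ("Taxon A", "char 2"): "0"},
    {("Taxon A", "char 1"): "1", ("Taxon A", "char 2"): "1"},
    {("Taxon A", "char 1"): "1", ("Taxon A", "char 2"): "1"},
    {("Taxon A", "char 1"): "1", ("Taxon A", "char 2"): "0"},
]
accepted, unresolved = merge_submissions(submissions)
print(accepted)    # {('Taxon A', 'char 1'): '1'}
print(unresolved)  # {('Taxon A', 'char 2'): {'0': 2, '1': 2}}
```

With four participants the second character ends in a 2/2 draw, which is exactly the 50/50 situation that an even number of participants can produce.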

The final stage is the publication, in a final, easily downloadable textual format, of the standardized DCM data sets that the participants are committed to use for the generation of standard, SDWD-endorsed DCMs.

In parallel, another activity could be initiated, namely the collection of every DCM published in the literature, to be stored in a relational database along with its reference data. This work, too, could do much for the scientific community.
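To give an idea of how little is needed to begin such a collection, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption of mine for illustration: the table and column names, the two helper functions and the citations are hypothetical, and a real database would need much richer reference data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    citation TEXT NOT NULL
);
CREATE TABLE taxon (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE dcm_character (
    id INTEGER PRIMARY KEY,
    publication_id INTEGER NOT NULL REFERENCES publication(id),
    position INTEGER NOT NULL,
    description TEXT NOT NULL
);
CREATE TABLE dcm_cell (
    character_id INTEGER NOT NULL REFERENCES dcm_character(id),
    taxon_id INTEGER NOT NULL REFERENCES taxon(id),
    state TEXT NOT NULL,
    PRIMARY KEY (character_id, taxon_id)
);
"""

def store_matrix(conn, citation, characters, matrix):
    """Store one published DCM.

    characters: list of character descriptions, in column order.
    matrix: dict mapping taxon name -> string of states.
    """
    pub_id = conn.execute(
        "INSERT INTO publication (citation) VALUES (?)", (citation,)
    ).lastrowid
    char_ids = [
        conn.execute(
            "INSERT INTO dcm_character (publication_id, position, description)"
            " VALUES (?, ?, ?)",
            (pub_id, position, description),
        ).lastrowid
        for position, description in enumerate(characters)
    ]
    for name, states in matrix.items():
        # Taxa are shared among publications, so reuse existing rows.
        conn.execute("INSERT OR IGNORE INTO taxon (name) VALUES (?)", (name,))
        taxon_id = conn.execute(
            "SELECT id FROM taxon WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.executemany(
            "INSERT INTO dcm_cell (character_id, taxon_id, state)"
            " VALUES (?, ?, ?)",
            [(cid, taxon_id, state) for cid, state in zip(char_ids, states)],
        )
    conn.commit()
    return pub_id

def publications_scoring(conn, taxon_name):
    """List the citations of all stored matrices that include a taxon."""
    rows = conn.execute(
        "SELECT DISTINCT p.citation FROM publication p"
        " JOIN dcm_character c ON c.publication_id = p.id"
        " JOIN dcm_cell x ON x.character_id = c.id"
        " JOIN taxon t ON t.id = x.taxon_id"
        " WHERE t.name = ? ORDER BY p.citation",
        (taxon_name,),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_matrix(conn, "Hypothetical paper 1", ["char 1", "char 2"],
             {"Taxon A": "01", "Taxon B": "11"})
store_matrix(conn, "Hypothetical paper 2", ["char 1"],
             {"Taxon A": "1", "Taxon C": "0"})
print(publications_scoring(conn, "Taxon A"))
```

Keeping the taxa in a table of their own, shared by all publications, is the design choice that makes the question "who else has scored this species?" a one-line query.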

 

CONCLUSION - DIGITALIZING DIVERSITY: what is wrong with our scanner?

 

The generation of a DCM containing the data collected from the observation and measurement of the examined specimens is an apparently simple activity, which sometimes seems to be managed like some sort of manual craft in which each scientist expresses his own peculiar talent (from time to time character descriptions appear which defy comprehension!).

Compare this with electronic image scanning devices: no practical automatic digitizer could work in its own idiosyncratic way. The output must meet preset requirements, like compliance with standard formats. A digitized image would be of no help if only I could see it on my computer, while my colleagues could not decipher the information contained in the image file, and I could not read theirs.

A properly working image scanner should sample the image in a consistent, unchanging way. Output quality may change with the original's, as well as with the scanner's quality, but output generation rules should never change. Every black and white scanner should read "1" on a black pixel and "0" on a white one, and allow for a white/black threshold to be set.
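The rule just stated fits in a few lines of Python, and it is worth seeing how little room it leaves for interpretation. This is a sketch under assumptions of my own: gray levels run from 0 (black) to 255 (white), the default threshold of 128 is arbitrary, and the function name is hypothetical.

```python
def scan_black_and_white(gray_rows, threshold=128):
    """Turn gray levels (0 = black, 255 = white) into 1 for black, 0 for white.

    The same input and the same threshold always give the same output.
    """
    return [[1 if level < threshold else 0 for level in row]
            for row in gray_rows]

# A hypothetical 2 x 2 image.
image = [[0, 255],
         [100, 200]]
print(scan_black_and_white(image))                # [[1, 0], [1, 0]]
print(scan_black_and_white(image, threshold=50))  # [[1, 0], [0, 0]]
```

The only free parameter is the threshold, and it is stated explicitly. Anyone who knows its value can reproduce the output from the original.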

In this respect, all DCMs look similar: rows, columns, zeros, ones... This formal congruence among matrices seems to match well the regularity of a scanner's response, and seems to suggest a stringent, indisputable cause-effect link between the original and the character's value.

This "unavoidable look" has an important psychological effect, as the reader tends to think that - as long as there is a one-to-one relation between the specimens' characters and the matrix cells - any conclusion drawn from the matrix is the only possible conclusion that could be drawn from the specimens, or that at least the matrix demonstrates the conclusions drawn.

As we have seen above, this is very far from true: the scientist does not digitize undisputable measurements, but digitizes disputable interpretations that in turn are themselves conclusions.

Matrices do not stand at the same level in automatic image scanning and in phylogenetic analyses: in the former case they do in fact constitute a digital surrogate of the original, according to the chosen sampling rate. I shoot a digital photo. Within the particular wavelengths sampled (audio or visual), any process performed on the digital matrix has the same effect as if it were performed on the original. This is how I can digitally paint blue a white square whose image I scanned. The result is the same as if I had scanned a blue square.

On the other hand, in the DCM case, the matrix has no relation with the original except the relations arbitrarily set by the scientist. This is a much more complex kind of "digitalization", and does not compare to a digital photo, but to a painting: a painting is my representation of the "digital image" that my eyes sent to my brain.

Another, more stringent "audio" example is hearing a bird song. Excluding digital or analog recording, is there a way I can make the music I heard available to other people? I can write the music down, according to the rules of musical notation, so that everybody can play, with his own instrument, the music I heard. By doing this I comply with predefined, universally accepted standard rules: the ones that are missing in cladistic practice.

More or less, this is how DCMs should work, as a best approximation of the observed phenomenon, with the big difference that we have universally accepted standard rules for transcribing music, but still have no such rules for DCM generation, and everyone, at present, is playing his own music from the same bird song.

Which kinds of measurements are undisputable? Length and weight, for example. And can we digitize them? Sure we can!

Can we store on the Internet 3-D scans of every bone of every specimen in the museums of the world? Possible, but not easily feasible!

What I am trying to say is that the only practical way to make our cladograms comparable is to lay down the "universally accepted rules" for the digitalization of discrete characters.

These rules are represented by a standardized set of rows and columns for each major natural group, representing what the scientific community as a whole regards as "undisputable", according to the above outlined "Standard Rules in DCM Generation".

I think that such an effort should at least be attempted, for the sake of science. This could be the only way to improve the quality and the speed of our scanner.

Last reviewed on 21 January 2005