Recently I have been bug hunting in a large dataset (a Darwin Core Archive, DwC-A) that at a casual glance looks fine on the publisher side. Once it hit the parser, however, several records were rejected because the records themselves contained line-terminating characters (line feed, hex value 0A). Each such character splits the record in two. The parser sees the first part as a line with too few columns, drops it, and leaves an empty line in its place. The tail end of the record then looks like the start of a new record, which also has too few columns and is therefore replaced by a second empty line.
Here is an example of a record that would be replaced by an empty line:
The line-terminating characters seem to have been escaped, but without achieving the desired result. A secondary effect of this error is that the record count is miscalculated: the parser simply counts lines and so ends up with a larger number than the publisher expected (remember that each line-terminating character breaks a record into two lines, each with an incorrect number of columns). Incidentally, this can sometimes explain why we harvest MORE than 100% of the target records.
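To make the effect concrete, here is a minimal Python sketch of a purely line-based parser. It is not the actual harvesting code: the tab delimiter, the column count of four, the sample records, and the function name naive_parse are all assumptions made up for illustration.

```python
EXPECTED_COLUMNS = 4  # hypothetical column count, chosen for this illustration

# Three made-up records. The second one carries a raw line feed (0x0A)
# inside its second field, which is the kind of error described above.
DATA = (
    "1\tPuma concolor\tDenmark\tzoo\n"
    "2\tLynx\nlynx\tNorway\triverbank\n"
    "3\tVulpes vulpes\tSweden\troadside\n"
)


def naive_parse(text, expected_columns=EXPECTED_COLUMNS):
    """Split on line feeds only and sort the lines into kept and rejected."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after the final line feed

    kept, rejected = [], []
    for line in lines:
        columns = line.split("\t")
        if len(columns) == expected_columns:
            kept.append(columns)
        else:
            rejected.append(line)  # too few (or too many) columns
    return lines, kept, rejected


lines, kept, rejected = naive_parse(DATA)
print("records the publisher expects:", 3)
print("lines the parser counts:", len(lines))
print("lines kept:", len(kept))
print("lines rejected:", len(rejected))
```

In this sketch the publisher sent three records, but the parser counts four lines, keeps two, and rejects both halves of the broken record. A count based on lines rather than records therefore comes out higher than the publisher's own figure.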
Publishers who use the Integrated Publishing Toolkit (IPT) can avoid illegal characters of this kind and will benefit from a faster transition to having their data live in the GBIF portal. http://www.gbif.org/informatics/infrastructure/publishing/
Fortunately, I am working together with the publisher’s team on ironing out the bumps in this resource, so we can get the data published quickly and prevent future errors of this sort.