Now for yet another chapter in... Tales from Statistical Consulting:
A client came in with data from a survey (which shall remain nameless), looking for help evaluating the functional form of a logistic regression. The main covariate of interest was a proportion: it was positively skewed, with a clump of density at one extreme of the data. This clump was still within the possible range of a proportion, but it took only a single value, and it was distinct from the right tail of the data. From prior experience, I immediately began to suspect that this was not actual data but metadata: a missingness or exception code. Although the client had not read the codebook, they didn’t believe that anyone would use a plausible value for a variable as a missingness code.
People do strange things, and this is precisely why you should always read the codebook. Sure enough, the clump was metadata, but until this consultation it had been treated as actual data. This is an important cautionary tale, not just about the value of codebooks, but about the value of good data set design: try to make exception and missing data codes as obvious as possible.
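The kind of screening that raised the suspicion here can be roughed out in a few lines of Python. This is only a sketch, not what was done in the consultation: the function names, the thresholds (`min_share`, `gap_factor`), the sample values, and the code `0.99` are all invented for illustration. It flags a value at either end of the observed range that is both heavily repeated and well separated from its nearest neighbour. A flag is a reason to open the codebook, not a verdict.

```python
from collections import Counter


def flag_sentinel_candidates(values, min_share=0.05, gap_factor=5.0):
    """Return extreme values that look more like exception codes than data.

    A value is flagged when it sits at the minimum or maximum of the data,
    accounts for at least `min_share` of the observations, and is separated
    from the nearest distinct value by more than `gap_factor` times the
    median gap between neighbouring distinct values.
    Both thresholds are arbitrary illustrative defaults.
    """
    counts = Counter(values)
    distinct = sorted(counts)
    if len(distinct) < 3:
        return []  # too few distinct values to judge spacing

    n = len(values)
    # Median spacing between neighbouring distinct values.
    gaps = sorted(b - a for a, b in zip(distinct, distinct[1:]))
    typical_gap = gaps[len(gaps) // 2]

    candidates = []
    ends = ((distinct[0], distinct[1]), (distinct[-1], distinct[-2]))
    for extreme, neighbour in ends:
        share = counts[extreme] / n
        isolated = abs(extreme - neighbour) > gap_factor * typical_gap
        if share >= min_share and isolated:
            candidates.append(extreme)
    return candidates


def recode_as_missing(values, codes):
    """Replace confirmed exception codes with None before modelling."""
    return [None if v in codes else v for v in values]


# Made-up proportions: a right-skewed sample plus a clump at 0.99.
proportions = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.15, 0.22, 0.30, 0.41]
proportions += [0.99] * 4

suspects = flag_sentinel_candidates(proportions)
cleaned = recode_as_missing(proportions, suspects)
```

Recoding should only happen after the codebook confirms that the flagged value is a code. A genuine pile-up at a boundary (for example, many true zeros) would trip the same check.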