Dealing with Binary Junk in a File

Data from mainframes often contains "junk": typically random binary values that cause problems when the file is brought into your PC database.  There are four primary reasons for this junk:

  1. Databases that are not properly initialized when created can have literally any value in a byte.
  2. Unused fields are often initialized to nulls (hex 00), and if never populated they remain as nulls.
  3. It's common practice to reserve spare space in "filler" fields.  Filler fields are commonly not initialized, and can therefore contain anything.
  4. Files sometimes contain "internal use" fields that are not specified on the layout.  Since these fields are unspecified, they could hold anything, and they often hold binary values.

This binary junk can cause a number of problems, from funny characters in your data to crashing your database.  One of the most serious is a control-Z (1A hex) in a file; this signifies end-of-file to many PC applications, so the database will stop importing the file when it sees a control-Z.
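As a rough illustration of how much data is at risk, the following Python sketch reports where the first control-Z occurs and how many bytes follow it. The function name `first_ctrl_z` and the sample data are hypothetical, and the sketch assumes the file has already been read into memory as bytes.

```python
CTRL_Z = b"\x1a"  # control-Z, hex 1A


def first_ctrl_z(data):
    """Return (offset, bytes_after) for the first control-Z, or None if absent."""
    offset = data.find(CTRL_Z)
    if offset == -1:
        return None
    # Everything after the control-Z may be ignored by an importer
    # that treats it as end-of-file.
    return offset, len(data) - offset - 1


# Hypothetical three-record file with a control-Z inside the second record.
sample = b"REC1\r\nRE\x1aC2\r\nREC3\r\n"
print(first_ctrl_z(sample))  # prints (8, 10)
```

An importer that honors control-Z as end-of-file would read only the first 8 bytes of this sample and skip the remaining 10.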

DISC has written several programs to scan your files to catch these problems and fix them before they cause you any grief.  We routinely scan all jobs for control codes, bytes with the high bit set, irregular records (short or long records, a CR or LF in the middle of a record), control-Z, and other problems.  We don't just blindly convert your file.
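The kind of scan described above might look something like the following Python sketch. It is not DISC's actual software. It assumes the file has already been converted to ASCII, that records are delimited by CR LF, and that every record should have the same length; the names `classify_byte` and `scan_records` are hypothetical.

```python
from collections import Counter

CTRL_Z = 0x1A  # control-Z, treated as end-of-file by many PC applications


def classify_byte(b):
    """Return a problem label for byte value b, or None if it is printable ASCII."""
    if b == CTRL_Z:
        return "control-Z"
    if b >= 0x80:
        return "high bit set"
    if b < 0x20 or b == 0x7F:
        return "control code"  # includes nulls and a stray CR or LF inside a record
    return None


def scan_records(data, expected_len):
    """Scan CR LF delimited records.

    Returns (counts, irregular): a Counter of byte problems, and a list of
    (record_number, length) for records that are short or long.
    """
    records = data.split(b"\r\n")
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final CR LF
    counts = Counter()
    irregular = []
    for number, record in enumerate(records, start=1):
        if len(record) != expected_len:
            irregular.append((number, len(record)))
        for b in record:
            label = classify_byte(b)
            if label:
                counts[label] += 1
    return counts, irregular
```

A scan like this only reports problems; deciding what to do about them is a separate step.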

When we encounter "junk" in a file, how we deal with it depends on what it is.  If it's caused by binary fields that contain data you need, then it's not junk at all, and it must be converted.  But assuming it's not data you want, there are three ways to fix the problem.

  1. Remove the field from the record and shift the remaining fields up.
  2. Replace the field with one containing spaces or something clean.
  3. Replace any binary values anywhere in the record with a space.
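The first two options operate on a known field position taken from the record layout. A minimal Python sketch, assuming fixed-length records and a field identified by a zero-based start offset and a length (the function names and the sample record are hypothetical):

```python
def remove_field(record, start, length):
    """Option 1: drop the field and shift the remaining fields up."""
    return record[:start] + record[start + length:]


def blank_field(record, start, length, fill=b" "):
    """Option 2: overwrite the field with a clean fill byte, keeping the record length."""
    return record[:start] + fill * length + record[start + length:]


# Hypothetical record: a 5-byte name, a 3-byte binary filler field, a 6-byte city.
record = b"SMITH\x00\x07\xffBOSTON"
print(remove_field(record, 5, 3))  # prints b'SMITHBOSTON'
print(blank_field(record, 5, 3))   # prints b'SMITH   BOSTON'
```

Note that option 1 shortens the record, so the layout used downstream has to be updated to match, while option 2 leaves every other field where it was.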

The first two approaches are the cleanest, but they require programming, so they are usually the most expensive.  The third approach is an economical compromise.  It simply scans the entire record (so it doesn't require custom programming), replacing any binary value it finds with a space (or sometimes an "*", so you can distinguish a replacement from a normal space).  It can't be used in all cases, but when it works it's fairly inexpensive; we commonly do it at no charge on repeat jobs.
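The third approach can be sketched in a few lines of Python. This sketch assumes the data is already ASCII and treats any byte outside the printable range (hex 20 through 7E) as a binary value; it is meant to be applied to one record at a time, so that record delimiters are not replaced. The names `make_table` and `clean_record` are hypothetical.

```python
def make_table(fill=b" "):
    """Build a 256-entry translation table mapping non-printable bytes to fill."""
    if len(fill) != 1:
        raise ValueError("fill must be exactly one byte")
    return bytes(b if 0x20 <= b <= 0x7E else fill[0] for b in range(256))


def clean_record(record, fill=b" "):
    """Replace every non-printable byte in the record; the length is unchanged."""
    return record.translate(make_table(fill))


record = b"SMITH\x00\x1a\xffJR"
print(clean_record(record))        # prints b'SMITH   JR'
print(clean_record(record, b"*"))  # prints b'SMITH***JR'
```

Because the replacement is byte for byte, field positions are preserved and the existing layout still applies.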

Part of the compromise is that it may leave some strange-looking stuff behind.  Binary data can be any value, so sometimes a byte happens to have the same value as a printable character.  Since those are valid character codes, this method doesn't remove them, and they are left in your file.  Frequently these are punctuation, so you may end up with a field that looks like ")#  !".  Since these are valid characters, the database doesn't usually mind, and the file imports okay.
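To see how such leftovers arise, consider this small Python sketch. The five byte values are a made-up example of a binary field, and `replace_binary` is a hypothetical helper that blanks anything outside printable ASCII.

```python
def replace_binary(field):
    """Replace bytes outside printable ASCII (hex 20 through 7E) with a space."""
    return bytes(b if 0x20 <= b <= 0x7E else 0x20 for b in field)


# Hypothetical binary field: hex 29 23 00 01 21.
# Hex 29, 23 and 21 happen to be the codes for ")", "#" and "!".
field = bytes([0x29, 0x23, 0x00, 0x01, 0x21])
cleaned = replace_binary(field)
print(cleaned.decode("ascii"))  # prints ")#  !"
```

Only the two bytes that are not valid character codes are replaced; the other three survive because they cannot be told apart from real punctuation.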

Additional Information

For more articles on data conversion, see our TechTalk Index.

Disc Interchange Service Company, Inc.
15 Stony Brook Road
Westford, MA 01886
