Mainframe to PC Data Conversion Issues


Mainframe files can be very different from PC files, and they use concepts foreign to PC languages and applications. To ensure your conversion goes smoothly, you should understand a little about mainframe tapes and files. You should also review the layouts of the mainframe files and decide how to deal with any fields or data types your PC application can't handle directly. The following brief discussion is an overview of the main topics. For more detailed articles, see our TechTalk Index.

Here are a few things you should know about mainframe tapes and files:

Mainframe Media
Mainframe Tape Formats
Record Types
Data Types
Data Representations
Redefined Fields and Records
Binary Junk


Mainframe Media

Mainframe-type files can be written to any tape that supports variable block mode, which allows blocks of different sizes to be written to the tape. Nearly all linear tapes can be written this way, and even helical scan tapes can accommodate varying block sizes.

9-track round-reel tape was the standard mainframe tape for nearly 50 years, but has been replaced by higher capacity cartridge tapes. The first cartridge tapes to replace 9-track were the IBM 3480 and 3490, followed by the 3490E, then the 3590 series of drives, and currently the IBM 3592 series of drives.

IBM AS/400 midrange computers can also write mainframe tape formats. AS/400 systems may use the above tape drives, but more often use QIC/SLR, 8mm, 3570, and LTO.

Some third-party tape drives, such as the StorageTek 9840 and 9940, have features such as quick loading that make them especially attractive in high-volume tape libraries.

For pictures of these tapes, plus capacity and recording information, see Identifying Media.

All the drives mentioned are linear tapes except the 8mm, which is helical scan. Although the physical recording on helical scan tape is very different from that on linear tapes, and always uses a fixed block size, the 8mm drives can emulate variable block recording and appear to the computer like any linear drive.

The method of recording files on these tapes is identical for all these media.  All these tapes can contain multiple files on each tape, and a single file can span multiple tapes (a multivolume set).  For a deeper technical discussion, see Mainframe Tape Details.


Mainframe Tape Formats

This describes the method of recording data on the tape.  Mainframe tapes are generally either "Labeled" or "Unlabeled", and contain either "Fixed Blocks" or "Variable Blocks".

Labeled Tape:  Each data file on a labeled tape is preceded by a special file called a "header label", and is followed by a "trailer label" of a similar type.  These "Labels" contain information about the data file they bracket: the DSN (Data Set Name, or file name), record size, block size, creation date, and more.  The label also tells you the type of file and blocking: Fixed, Variable, or Undefined.

Unlabeled Tape: Unlabeled tapes omit the labels and write just the raw data to tape, again in either Fixed block or Variable block files. As such, there is no file name or record size information on the tape.

Fixed Block:  Fixed block tapes are by far the most common.  Fixed-length data records are written to tape in groups, resulting in a tape where all the blocks (except the last) are of the same size.

Variable Block: Variable block tapes write records of varying size to tape blocks which therefore also vary in size.  There are several methods for doing this.

For a more complete description, see Mainframe Tape Details.


Record Types

Fixed Length Records: Most mainframe data is stored in a fixed-field, fixed-record format, where every field, and therefore every record, is fixed in size.  There are no delimiters between either fields or records.  Most PC databases can import this type of file if we append a record delimiter to the mainframe data when we do the conversion.  This is generally the most direct and least expensive way to convert this data.
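
As a rough illustration of the delimiter step, the sketch below cuts an undelimited buffer into fixed-size records and appends CR-LF to each one. The function name `add_record_delimiters` and the 5-byte record length in the example are assumptions made for illustration, and the sketch assumes the data holds only character fields that have already been converted to ASCII.

```python
def add_record_delimiters(data: bytes, record_length: int,
                          delimiter: bytes = b"\r\n") -> bytes:
    """Split undelimited fixed-length records and append a delimiter to each."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    if len(data) % record_length != 0:
        # A leftover partial record usually means the record length is wrong.
        raise ValueError("data length is not a multiple of the record length")
    records = (data[i:i + record_length]
               for i in range(0, len(data), record_length))
    return b"".join(record + delimiter for record in records)


# Two 5-byte records become two CR-LF terminated lines.
converted = add_record_delimiters(b"AAAAABBBBB", 5)
```

Rejecting a buffer whose size is not a multiple of the record length is a cheap check that the layout and the file actually agree.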

Variable Length Records: Variable length records can be found on both mainframes and PCs, but the format is different. On a mainframe tape, variable length records are usually preceded by a binary value that gives the length of the record that follows. This value is called the Record Descriptor Word (RDW) or Record Control Word (RCW). There are no delimiters embedded in the data.

PC variable length files do not use an RDW, but add a special code at the end of the record to denote where the record ends. MS-DOS and Windows normally use CR-LF (Carriage-Return, Line-Feed), UNIX uses a Newline (LF), and the classic Mac OS uses CR. When we convert a mainframe variable-length tape to a PC file, we remove the RDW and add the record delimiter.
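
The RDW removal described above can be sketched as follows. This is a minimal example under stated assumptions, not a description of our production tools: it assumes the common IBM layout of a 4-byte RDW whose first two bytes are a big-endian length that includes the RDW itself, with the last two bytes reserved, and it assumes any block-level descriptor has already been stripped. Other systems use other layouts. The name `split_rdw_records` is hypothetical.

```python
import struct


def split_rdw_records(data: bytes) -> list:
    """Return the record payloads from a buffer of RDW-prefixed records."""
    records = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError(f"truncated RDW at offset {pos}")
        # Assumed layout: 2-byte big-endian length (RDW included), 2 reserved bytes.
        length, _reserved = struct.unpack(">HH", data[pos:pos + 4])
        if length < 4 or pos + length > len(data):
            raise ValueError(f"bad record length {length} at offset {pos}")
        records.append(data[pos + 4:pos + length])
        pos += length
    return records


# Two records of 3 and 1 payload bytes, then re-joined with PC delimiters.
raw = b"\x00\x07\x00\x00ABC" + b"\x00\x05\x00\x00Z"
pc_file = b"".join(rec + b"\r\n" for rec in split_rdw_records(raw))
```

Validating each length against the remaining data catches files that are not really RDW-prefixed before they produce scrambled output.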

Delimited:  A very popular record type for PCs, this trims trailing spaces from each field and puts a delimiter, usually a comma or tab, between fields to mark the end of the field.  Records are usually delimited with CR, LF, or CR-LF as noted above.  Delimited records are almost never found on a mainframe.

Databases:  Database programs store data internally in many proprietary formats.  Most programs will have import and export functions to read in standard data types and write out standard data types.


Data (Field) Types

Many mainframe data types are not compatible with PC data types.  If all the fields in the mainframe record are "character", or "alpha-numeric", meaning the entire file is composed of the letters A-Z, the numbers 0-9, spaces and punctuation (i.e. no binary types), then a simple EBCDIC to ASCII character conversion will usually work.  But mainframe numbers are often stored in a binary format. Some of the most common field types are listed below. There is considerably more detail in the article Mainframe Data Types.

Alpha-Numeric, or Character fields:   These fields are composed of only letters, punctuation, and the numbers 0-9 represented as characters.  Mainframe character fields are in EBCDIC, and can be converted to ASCII for a PC without loss of information by a simple translation table.
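
For character-only fields, the translation can be sketched with Python's built-in codecs. The example assumes code page 037 (`cp037`, US/Canada EBCDIC); your data may use a different EBCDIC code page such as `cp500`, so treat the default here as an assumption. The function name `ebcdic_to_text` is hypothetical.

```python
def ebcdic_to_text(field: bytes, codepage: str = "cp037") -> str:
    """Translate an EBCDIC character field to a Python string."""
    return field.decode(codepage)


# EBCDIC bytes for "HELLO 123": letters, a space (hex 40), and digits (hex F1-F3).
text = ebcdic_to_text(b"\xC8\xC5\xD3\xD3\xD6\x40\xF1\xF2\xF3")
ascii_bytes = text.encode("ascii")  # raises an error if a character has no ASCII equivalent
```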

Binary fields:  Binary fields can be integer fields, floating point fields, bit or coded fields, and other types.  Mainframe binary values are not usually stored the same way as PC binary.  To convert these we need to know the type of binary field, the number of bytes or words, and the byte and word order.
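
To show why byte order matters, the sketch below decodes the same four bytes both ways. It assumes a two's-complement integer field, which is how IBM mainframes store binary integers (in big-endian order); the helper name `decode_binary_int` is an assumption for illustration.

```python
def decode_binary_int(field: bytes, byteorder: str = "big",
                      signed: bool = True) -> int:
    """Decode a two's-complement integer field with the given byte order."""
    return int.from_bytes(field, byteorder=byteorder, signed=signed)


field = b"\x00\x00\x01\x00"
as_mainframe = decode_binary_int(field)             # big-endian: 256
as_pc_native = decode_binary_int(field, "little")   # little-endian: 65536
```

Copying such a field to a little-endian PC without swapping the bytes silently produces the second, wrong value.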

COBOL comp fields:  These are also binary fields, and the exact type is dependent on both the compiler and the CPU of the computer. We need the same information as for binary fields. See COBOL Computational Fields for more information.

COBOL comp-3 fields:  Also called "packed fields", this is a standard COBOL numeric data type that stores ("packs") two digits into each byte.  The last nybble (half byte) is the sign.  This format is standard across compilers and CPUs.  See our Tech-Talk article COBOL comp-3 Packed Fields for details.
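
Unpacking a comp-3 field can be sketched like this. The example assumes the usual sign nibbles (hex C or F for positive or unsigned, hex D for negative, with A and E also treated as positive and B as negative); the function name `unpack_comp3` is hypothetical, and any implied decimal is handled separately.

```python
def unpack_comp3(field: bytes) -> int:
    """Decode a COBOL comp-3 (packed decimal) field to an integer."""
    if not field:
        raise ValueError("empty field")
    digits = []
    for byte in field[:-1]:
        digits.append(byte >> 4)      # high nybble
        digits.append(byte & 0x0F)    # low nybble
    digits.append(field[-1] >> 4)     # last byte: one digit ...
    sign_nibble = field[-1] & 0x0F    # ... followed by the sign
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nybble in packed field")
    if sign_nibble < 0x0A:
        raise ValueError("invalid sign nybble in packed field")
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble in (0x0B, 0x0D) else value


positive = unpack_comp3(b"\x12\x34\x5C")   # 12345
negative = unpack_comp3(b"\x12\x34\x5D")   # -12345
```

Rejecting invalid nybbles is worthwhile because a field that fails this check is often not packed at all, or the layout offsets are wrong.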

IBM Signed fields:  Also called "Zoned", these fields "overpunch" the sign onto the last (or first) digit of the field.  The rest of the field is numeric (character) data.  These fields should not be converted to ASCII with a translation table because of the sign overpunch.  (But if that's happened to you, see our tech brief EBCDIC to ASCII Conversion of Signed Fields.)
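
The sketch below decodes a zoned field directly from the EBCDIC bytes and shows what a plain translation table does to the same data. It assumes a trailing overpunch in the high nybble of the last byte, with hex C or F meaning positive or unsigned and hex D meaning negative; a leading-sign field would need the first byte examined instead. The name `decode_zoned` is hypothetical.

```python
def decode_zoned(field: bytes) -> int:
    """Decode an EBCDIC zoned decimal field with a trailing sign overpunch."""
    if not field:
        raise ValueError("empty field")
    digits = [byte & 0x0F for byte in field]   # low nybble of each byte is the digit
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit in zoned field")
    sign_zone = field[-1] >> 4                 # high nybble of the last byte is the sign
    value = int("".join(str(d) for d in digits))
    return -value if sign_zone in (0x0B, 0x0D) else value


field = b"\xF1\xF2\xF3\xF4\xD5"       # -12345 with the sign overpunched on the 5
value = decode_zoned(field)           # -12345
translated = field.decode("cp037")    # "1234N": the last digit and sign are lost
```

The "1234N" result is the typical symptom of a signed field that went through a character translation table.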

Leading sign numeric:  This is the standard numeric data type on a PC.  It is composed of a leading sign, and all the digits are regular characters.  For example,  "-12345", or "+12345", or " 12345".  COBOL "display" fields are also of this type.

Implied Decimal: Implied decimal can apply to any kind of numeric field (character or binary), and simply means there is a decimal point implied at a specified location, but not actually stored in the file. For example, the number 123 with an implied decimal of two digits represents the actual value 1.23. Using implied decimal saves space in the file. See Implied Decimal for more.
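
Applying an implied decimal is a matter of scaling the stored integer. The sketch below uses Python's `decimal` module so that no rounding error is introduced; the function name `apply_implied_decimal` is an assumption for illustration.

```python
from decimal import Decimal


def apply_implied_decimal(raw: int, places: int) -> Decimal:
    """Shift the decimal point `places` digits to the left without rounding."""
    return Decimal(raw).scaleb(-places)


amount = apply_implied_decimal(123, 2)   # Decimal("1.23")
```

Using `Decimal` rather than floating point keeps money fields exact, which matters for financial data.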

Coded Fields:  COBOL programmers, and others,  sometimes assign binary codes or bit patterns to a field (usually a 1 byte field).  For example, hex 00 may represent a certain customer status, hex 01 another status, hex 02 another, etc.  Because these are binary codes they need to be converted for most PC applications.

If your mainframe data includes any of these field types other than alpha-numeric, we will need a layout in order to write a program to convert them, and will need to know what data types to convert each field to.


Data Representations

Apart from the technical issues of data type, there is the question of how the data is represented. We will look at three common situations, with the intent of getting you to think about the data you will be working with after the conversion. If you can identify data that needs to be altered or cleaned up prior to ordering your conversion, we may be able to perform that work at a lower cost as part of the conversion than if we do it as a separate job afterwards. This is only a sampling of the many possibilities.

Dates:  Dates can be stored many ways.  For example, a date of February 1st 1999 could be represented like this:

MMDDYY like 020199
MMDDYYYY like 02011999
DDMMYY like 010299
DDMMYYYY like 01021999
YYYYMMDD like 19990201
YYDDD like 99032
YYYYDDD like 1999032

The last two are called Julian dates (although YYDDD is a loose interpretation of a Julian date), where DDD is the day of the year, from 1 to 365 (366 in a leap year). Prior to 2000 most Julian dates used the two-digit-year YYDDD format, but since Y2K we are starting to see 4-digit years in Julian dates. Most PC applications don't understand Julian dates, so you may want us to convert them to Gregorian dates.
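
A Julian-to-Gregorian conversion can be sketched with Python's standard `datetime` module, whose `%j` directive is the day of the year. The function name `julian_to_date` is hypothetical. Note that for two-digit years Python applies its own century rule (69-99 become 1969-1999, 00-68 become 2000-2068), which is an assumption you should check against your data.

```python
from datetime import date, datetime


def julian_to_date(text: str) -> date:
    """Convert a YYDDD or YYYYDDD day-of-year string to a calendar date."""
    if len(text) == 7:
        fmt = "%Y%j"   # four-digit year, day of year
    elif len(text) == 5:
        fmt = "%y%j"   # two-digit year, day of year
    else:
        raise ValueError(f"expected 5 or 7 digits, got {text!r}")
    return datetime.strptime(text, fmt).date()


long_form = julian_to_date("1999032")    # February 1st 1999
short_form = julian_to_date("99032")     # also February 1st 1999
```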

Parsing:  In this context, parsing is the process of separating each element of a field into separate fields.  For example, if you have a list of names that you want to sort by last name but the name field is a full name (e.g. "John Smith"), then you need to parse the full name field into first name and last name fields.

Likewise, if you need to sort a list by ZIP CODE for a bulk mailing, but the city, state, and zip are all in one field, you need it parsed into separate fields for city, state, and zip.
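
As an illustration of parsing, the sketch below splits a combined city, state, and ZIP field. It assumes US-style data of the form "CITY ST 12345" or "City, ST 12345-6789"; real mailing lists contain many exceptions, so unparseable fields are returned as `None` for manual review rather than guessed at. The pattern and the name `parse_city_state_zip` are assumptions for illustration.

```python
import re

# City (anything), then a 2-letter state, then a 5-digit or ZIP+4 code.
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>.+?)[\s,]+(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def parse_city_state_zip(field: str):
    """Return (city, state, zip) or None if the field does not fit the pattern."""
    match = CITY_STATE_ZIP.match(field)
    if match is None:
        return None
    return match.group("city"), match.group("state").upper(), match.group("zip")


parsed = parse_city_state_zip("WESTFORD MA 01886   ")
```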

Case conversion & list cleanup:   If you buy a mailing list and the names and addresses are all in upper case, you may want to case convert it before mailing to those people.  Likewise, you may want to clean up punctuation and presentation.  Rather than mailing a letter that says "Dear JOHN      SMITH", you could mail one that says "Dear Mr. Smith".  Much better.
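
A first pass at this kind of cleanup can be sketched in a few lines. The helper name `tidy_name` is hypothetical, and simple title-casing is only an approximation: names such as "MCDONALD" or "VAN DER BERG" need additional rules.

```python
def tidy_name(raw: str) -> str:
    """Collapse runs of spaces and convert an upper-case name to title case."""
    return " ".join(raw.split()).title()


greeting = "Dear " + tidy_name("JOHN      SMITH")
```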

This only touches on the possibilities.  If you review your data you will likely find some things that need improvement.  Call us to see if we can improve your data.


Redefined Fields & Records

Redefined Fields:  Mainframe languages, especially COBOL, often reuse, or "redefine" an area in a record to save space. A common example is a mailing list where the addressee may be either a person or a company, but never both.  To include both an individual name field and a company name field would waste space, since only one of them would ever be filled, so the name field can be reused (redefined) as company name.  Further, the individual name is usually composed of two fields, last name and first name, so for example, bytes 1-12 might be last name, and bytes 13-20 first name.  But when redefined, bytes 1-20 would be the company name.  Most PC applications do not deal with this well, especially when the field boundaries are different.

For example, take two records, one with an individual's name of "Smith       John    " and the other with the company name "Disc Interchange    ". If you ignore the redefinition and treat the field as the company definition, then "Disc Interchange" will be correct, but the mail to John Smith will be addressed to "Smith       John". If you treat the fields as name fields and put the first name before the last name, then the individual's name will be correct, like "John Smith", but the company name will get scrambled, like "ange Disc Interch". If your application can't deal with this, we can convert the data to a record with both individual name fields and company name fields.
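
The scrambling described above is easy to reproduce. The sketch below uses the byte positions from the example (last name in bytes 1-12, first name in bytes 13-20, company name in bytes 1-20). The `is_company` flag is hypothetical: real files need some indicator, such as a record-type code elsewhere in the record, to tell which definition applies.

```python
def as_person(name_field: str) -> str:
    """Read the 20-byte area as last name (1-12) plus first name (13-20)."""
    last = name_field[0:12].strip()
    first = name_field[12:20].strip()
    return f"{first} {last}".strip()


def as_company(name_field: str) -> str:
    """Read the same 20-byte area as a single company name."""
    return name_field[0:20].strip()


def expand(name_field: str, is_company: bool) -> dict:
    """Build an output record with separate person and company fields."""
    if is_company:
        return {"first_name": "", "last_name": "",
                "company_name": as_company(name_field)}
    return {"first_name": name_field[12:20].strip(),
            "last_name": name_field[0:12].strip(),
            "company_name": ""}


person = "Smith".ljust(12) + "John".ljust(8)
company = "Disc Interchange".ljust(20)
scrambled = as_person(company)   # "ange Disc Interch"
```

Expanding into separate fields means the PC application never has to know about the redefinition.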

Often the redefined fields are of a different type altogether.  For example, redefining a character field as a binary field.  This is much more serious than the above example, and the original field and the redefined field require different conversions (character and binary).

DISC routinely deals with these situations and can offer several solutions.

Redefined Records: Complex data sets usually cannot store all their data in just one record type, so they have multiple record types.  For example, medical files may have one record type to identify a patient (name, address, etc.), another record type for treatment data, and a third for payment information.  These could be stored in three files, or in one.  If they are stored in one file, then that file has "multiple record types", or "redefined records".  PC databases can make use of relational tables (or files), but usually can't deal with all three record types in one file.  DISC can split the data into three files so you can build a relational database on your PC.
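
Splitting a multiple-record-type file can be sketched as below. The example assumes each record carries a one-byte record-type code in its first position, with hypothetical codes "P" (patient), "T" (treatment), and "B" (billing); in practice the code's position and values come from the file layout.

```python
from collections import defaultdict


def split_by_record_type(records, type_offset: int = 0, type_length: int = 1) -> dict:
    """Group records by the record-type code found at a fixed position."""
    groups = defaultdict(list)
    for record in records:
        code = record[type_offset:type_offset + type_length]
        groups[code].append(record)
    return dict(groups)


mixed = ["P0001SMITH JOHN", "T0001X-RAY", "B0001 0012500", "T0001CAST"]
tables = split_by_record_type(mixed)
# tables["T"] holds both treatment records, ready to write to their own file.
```

Each group can then be written to its own file and loaded as a separate table, linked by a shared key such as the patient number.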


Binary Junk

For a number of reasons, data from mainframes will often contain "junk".  That junk is often random binary values that will cause problems when brought into your PC database.  There are four primary reasons for this junk:

  1. Databases that are not properly initialized when created can have literally any value in a byte.
  2. Unused fields are often initialized to nulls (hex 00), and if never populated they remain as nulls.
  3. It's common practice to reserve spare space in "filler" fields.  Filler fields are commonly not initialized, and can therefore contain anything.
  4. Sometimes when you get a file there will be fields in the file for "internal use" that are not specified on the layout.  Since these are not specified, they could be anything, and are often binary values.

Binary junk can also be caused by various errors. A common cause is data which was not properly converted when upgrading to a new computer or program. Another common cause is when one company buys another and merges the different databases with less than perfect results. This last situation often results in binary data populating character fields. Since those fields are defined as character fields, you will not be expecting binary data in them.

This binary junk can cause a number of problems, from funny characters in your data to crashing your database.  One of the most serious is a control-Z (1A hex) in a file; this signifies end-of-file to many PC applications, so the database will stop importing the file when it sees a control-Z.

DISC has written several programs to scan your files to catch these problems and fix them before they cause you any grief. We routinely scan all jobs for control codes, bytes with the high bit set, irregular records (short or long records, a CR or LF in the middle of a record), control-Z, and other problems. We don't just blindly convert your file.
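
To give a sense of what such a scan looks for, here is a minimal sketch; it is not one of our production programs. It assumes a single record that has already been converted to ASCII, with its record delimiter removed, and the name `scan_for_junk` is hypothetical.

```python
def scan_for_junk(record: bytes) -> list:
    """Return (offset, problem) pairs for suspicious bytes in one ASCII record."""
    problems = []
    for offset, byte in enumerate(record):
        if byte == 0x1A:
            problems.append((offset, "control-Z"))
        elif byte in (0x0D, 0x0A):
            problems.append((offset, "embedded CR/LF"))
        elif byte < 0x20 or byte == 0x7F:
            problems.append((offset, "control code"))
        elif byte >= 0x80:
            problems.append((offset, "high bit set"))
    return problems


report = scan_for_junk(b"AB\x1aC\x00\xffD\n")
```

Reporting the offset of each problem makes it possible to match the junk against the layout and see which field it landed in.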

We hope this has given you some useful information from our many years of conversion experience, and has been food for thought when converting your own files.



Our Mainframe Conversion Services

Disc Interchange Service Company's primary business is converting mainframe data files to PCs.  From the simplest mailing list to the most complex financial data, we have the tools to properly convert and Q.C. your files efficiently and accurately.  With over 32 years of experience converting millions of files, we have the knowledge to catch problems with the data before they cause you grief.

Disc Interchange Service Company, Inc.
15 Stony Brook Road
Westford, MA 01886