(We have also published a Simple Data Conversion Tutorial.)
Data Conversion is the generic term given to the process of converting computer data between different applications and/or between different computers. Data Conversion usually also involves Media Conversion -- converting the files from one type of tape or disk to another.
Data Conversion is far more complex than this brief article can address, so we have referenced additional detailed articles at the end of this article. This article assumes the files will be exchanged via tape, which is by far the most common method.
The Four Issues of Data Conversion
Data conversion involves up to four different issues, any combination of which may be required for a particular conversion: the media, the tape format, the file type, and the file content.
The type of tape does not always indicate the physical recording format, and therefore the drive you need. For example, a DLT IV tape is used in DLT 4000, 7000, and 8000 drives, and they all write different numbers of tracks and densities. Likewise, an 8mm 112M tape could be written in 8200 format without compression, 8200C compressed format, 8500 uncompressed format, or 8505 compressed format. You simply can't tell by the type of tape what the recording format is. This is true of many tapes.
In some cases, such as IBM mainframe tapes, the tape format is dictated by the file type. A file with fixed-length records will dictate a fixed-block (FB) tape, whereas a file with variable-length records will dictate a variable-block (VB) tape. In other cases, such as UNIX tar and PC backup programs, files are written to tape in the same way, regardless of the type of file.
Tape programs vary widely, but each platform has some common methods. Here are a few:
IBM Mainframes
IBM Mainframe computers usually write an "IBM Standard Label (SL) tape". This format writes a small file called a "label" before and after each data file. The label describes the data file it accompanies: its name, type, date, etc. We have published several articles on IBM Mainframe tapes; see our TechTalk Index.
ASCII Mainframes
Mainframe computers that operate in ASCII, such as CDC systems, usually write an "ANSI Standard Label (SL) tape". The ANSI SL tape format is very similar to the IBM SL tape format, but both the labels and the data are in ASCII.
IBM AS/400
IBM AS/400 computers running the OS/400 operating system can write IBM SL format, but generally write a "SAV" (Save) format that is proprietary and unique to AS/400 tapes.
DEC VAX VMS and Alpha VMS
DEC (Digital Equipment Corporation / Compaq / HP) VAX VMS and Alpha VMS computers usually write a "Backup" format tape. This format is unique to VMS computers.
UNIX and Linux
UNIX and Linux computers come with a program called TAR (Tape ARchive), for writing tapes. cpio (CoPy In-Out) can also be used to write tapes on UNIX, as can Dump. All three formats are different. There is good interchangeability of TAR and some interchangeability of cpio tapes across UNIX systems.
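As a rough illustration of that interchangeability, a tar archive that has been copied from tape to disk can be listed with the tarfile module in Python's standard library. This is only a sketch: the function names and the sample file "customers.txt" are hypothetical, and the in-memory archive merely stands in for data read from a real tape.

```python
import io
import tarfile


def list_tar_members(fileobj):
    """Return (name, size) for every regular file in a tar archive."""
    # "r:*" lets tarfile detect the compression (if any) by itself.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


def build_sample_tar():
    """Build a tiny in-memory tar archive (hypothetical stand-in for a tape image)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        payload = b"SMITH,JOHN\n"
        info = tarfile.TarInfo(name="customers.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size in list_tar_members(build_sample_tar()):
        print(name, size)
```

Listing the archive before restoring it is a cheap way to confirm the tape really is in tar format and to see what it contains.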
Microsoft Windows
Although Windows systems come with a backup program, most users opt to use a third-party backup program like Arcserve, Veritas Backup Exec, Nova Backup, etc. Each of these programs writes data to tape in different ways, although some are able to read the tape format from competing products.
Apple Macintosh
Like Windows users, Macintosh users also use third-party backup programs like Retrospect.
The "File Type" we are discussing below is the file type on disk, either before it is written to tape, or after it is restored from tape. As noted above, the file type may determine the Tape Format.
What "File type" and "File content" refer to depends on both the operating system and the kind of file, so it's difficult to make global statements that apply to all situations. The issues are considerably different for mainframes and PCs, and are different for different kinds of files -- word processing files and database files, for example. In most cases the File Type and File Content are closely related, with overlapping issues and interactions.
File type generally refers to how the file is stored on disk, while File Content refers to what is stored in the file, including how the data is coded. However, in some cases, such as certain database files, the file type and file content are inseparably tied and generally referred to simply as the "file type". Furthermore, when the operating system doesn't support different file types, as is the case with UNIX and Windows, "file type" usually refers to the application file type, such as "an Access file" or "an SQL file", or a generic file type such as "a comma-delimited file".
Clearly these terms are not used consistently, and the meaning varies greatly between operating systems. To a mainframe user, "file type" would mean "indexed" or "sequential", and "fixed length" or "variable length", both of which have very specific and different structures on disk. But for a PC user, "file type" would typically mean, for example, a "comma-delimited" file or an "Access file". This makes it difficult to communicate "file type" unless the context of the discussion is understood, and even more difficult to discuss when converting between disparate operating systems.
Because "File type" has such different meanings on mainframes and PCs, we will discuss mainframe files and PC files separately. Following those descriptions we will briefly discuss converting files between mainframes and PCs. To keep this article brief, we will mainly discuss database files.
With this background information, let's look at file type and file content.
There are some fundamental differences in how computers store files. The operating systems of Mainframe computers, AS/400, DEC VMS, and others "understand" file and record structure, so you can define the type of file -- indexed or sequential, for example -- within the OS. You can also store characteristics of the file, such as the record type (fixed length or variable length, for example) and file parameters (such as record length).
But the UNIX and Windows operating systems don't use such concepts; to them a file is just a stream of bytes with no structure. Those computers rely on the application programs to handle the structure. Converting between these systems then means transferring the concept of "file type" from the operating-system side to the application-program side, or vice versa.
Macintosh computers store the data portion of a file in the "Data Fork", and information about the type of file in the "Resource Fork". When converting from Macintosh, you should read the Resource Fork and use that information to interpret the file, and when converting data to a Macintosh, you must create the proper Resource Fork.
Mainframe Computers and Mid-Range Computers
Mainframe and Mid-Range operating systems define not only the name, date, and size of a file, but the type of file. They define and manage the record structure of the file, and even manage the indexing. These computers have file-management services built into the operating system to handle file I/O on a record basis, including handling the indexing. On these computers you normally read and write whole records with each I/O request.
Personal Computers
Personal computers running Windows, UNIX, or the Macintosh operating system store files as a stream of bytes with no structure. The operating system simply reads or writes as many bytes as the application program tells it to, without regard to record boundaries. In fact, the operating system doesn't even know what the record size is; it regards a file only as a collection of bytes, with no structure.
It's up to the application program to handle the structure -- that is, to separate data into records. The application program will make an I/O request to the operating system which specifies the number of bytes the OS is to return, and the application will then treat that data as one complete record.
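The idea can be sketched in a few lines of Python. This is a minimal illustration, not a production reader: the function name read_records is hypothetical, and the record length must come from outside documentation, because the file itself does not carry it.

```python
import io


def read_records(stream, record_length):
    """Yield successive records of record_length bytes from a byte stream.

    The operating system only hands back bytes; it is this application
    code that decides each group of record_length bytes is one record.
    """
    while True:
        record = stream.read(record_length)
        if not record:
            break  # end of file
        if len(record) != record_length:
            raise ValueError(
                f"short record of {len(record)} bytes at end of file"
            )
        yield record


if __name__ == "__main__":
    data = io.BytesIO(b"AAAABBBBCCCC")  # three 4-byte records, no delimiters
    for rec in read_records(data, 4):
        print(rec)
```

Raising an error on a short final record is a design choice for this sketch; it flags a wrong record length early instead of silently returning a partial record.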
Notice that from the operating system point of view there is no information with the data file which specifies the record structure (although many applications will include that information within the file itself). In general, you can't determine the record structure from the disk file; you need separate documentation for that. However, PC files commonly delimit records with a CR-LF, and UNIX computers normally delimit records with a Newline (an LF), and those can be used to quickly determine the record size if there is no other documentation. But because there is no record structure imposed by the OS, there is nothing to prevent shorter or longer records within the file. You normally have to scan the entire file to be sure the records are all the same size.
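A full scan of that kind might look like the following sketch. It assumes the whole file fits in memory and that records end with either CR-LF or LF; the function name record_length_counts is hypothetical.

```python
from collections import Counter


def record_length_counts(data):
    """Count how many records of each length appear in delimited data.

    Records are assumed to end with LF, optionally preceded by CR.
    The delimiter bytes themselves are not counted in the length.
    """
    records = data.split(b"\n")
    if records and records[-1] == b"":
        records.pop()  # the file ended with a delimiter
    return Counter(
        len(r[:-1] if r.endswith(b"\r") else r) for r in records
    )


if __name__ == "__main__":
    sample = b"AAAA\r\nBBBB\r\nCC\r\n"
    counts = record_length_counts(sample)
    print(dict(counts))
    if len(counts) > 1:
        print("records are not all the same size")
```

If the result has a single entry, every record in the file has that length; more than one entry means the file contains shorter or longer records.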
The topic of file content could occupy many articles. We will briefly suggest a few issues.
File content obviously refers to the data within the file, but that also has different meanings. It can mean the code set, record layout, data types, and the variable data content of each record.
IBM mainframe and AS/400 computers encode the alphabet using the EBCDIC code set, while most other computers, including the IBM PC, use ASCII coding. So a simple character field on a mainframe cannot be used on a PC without an EBCDIC to ASCII conversion. Furthermore, the layout you receive with the tape will seldom specify which character set is used. That's assumed from the operating system. So the COBOL field:
    05 NAME PIC X(30).
will contain EBCDIC characters if it originates on a mainframe, and ASCII characters if it originates on a PC. But the layout generally won't tell you that.
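To make the character-set issue concrete, here is a small sketch that decodes a 30-byte EBCDIC NAME field using Python's built-in "cp037" codec. The choice of code page 037 is an assumption for illustration; EBCDIC exists in several code pages, and the correct one depends on the system that wrote the data. The function name and sample data are hypothetical.

```python
def ebcdic_to_text(field, codec="cp037"):
    """Decode an EBCDIC character field to a Python string.

    cp037 is assumed here; other EBCDIC code pages (for example cp500)
    differ in some characters.
    """
    return field.decode(codec)


# A PIC X(30) field holding "SMITH" padded with EBCDIC spaces (0x40).
name_field = bytes.fromhex("E2D4C9E3C8") + b"\x40" * 25

if __name__ == "__main__":
    print(repr(ebcdic_to_text(name_field).rstrip()))
    # Read without conversion, the same bytes are not recognizable as a name.
    print(repr(name_field[:5].decode("latin-1")))
```

This kind of decoding is only safe for character fields. Applying it to a whole record would also alter any binary fields the record contains.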
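Binary fields are common in mainframe data, but less common on PCs, which tend to store numbers as characters. Even when binary is used on a PC, the binary data type is not the same as binary on a mainframe: IBM mainframes store multi-byte binary integers with the most significant byte first (big-endian), whereas Intel-based PCs store the least significant byte first (little-endian). PC applications can seldom understand a mainframe binary field, and may simply return the wrong value without reporting an error.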
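One common cause of such silent errors is byte order. The sketch below assumes a 4-byte signed binary field written big-endian, as on an IBM mainframe, and shows what happens when it is read in the little-endian order used by Intel-based PCs. The function name and sample bytes are hypothetical.

```python
def read_binary_field(field, byteorder):
    """Interpret raw bytes as a signed binary integer in the given byte order."""
    return int.from_bytes(field, byteorder=byteorder, signed=True)


# The value 256 as a 4-byte big-endian binary field.
field = bytes([0x00, 0x00, 0x01, 0x00])

if __name__ == "__main__":
    print(read_binary_field(field, "big"))     # intended value
    print(read_binary_field(field, "little"))  # same bytes, wrong byte order
```

Both calls succeed and return a plausible-looking integer, which is why this kind of mistake can go unnoticed until the numbers are checked against known values.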
So what's considered a "standard" file type on one computer platform is not the same as a "standard" file on another platform. For example, Mainframe computers almost exclusively use fixed length records with no record delimiters, whereas Windows systems often use variable length records, and almost always use record delimiters, even on fixed length records. Macintosh computers seldom use fixed length records, preferring variable length records with a CR record delimiter.
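Converting between these conventions is largely a matter of adding or removing delimiters. The following sketch takes fixed-length records with no delimiters and writes them out with a delimiter after each record. The function name is hypothetical, and the record length is assumed to be known from the file's documentation.

```python
def fixed_to_delimited(data, record_length, delimiter=b"\r\n"):
    """Split undelimited fixed-length records and append a delimiter to each.

    Pass delimiter=b"\\n" for UNIX-style records or b"\\r" for
    Macintosh-style records; the default is the CR-LF used on PCs.
    """
    if len(data) % record_length != 0:
        raise ValueError("data length is not a multiple of the record length")
    records = (
        data[i:i + record_length] for i in range(0, len(data), record_length)
    )
    return b"".join(record + delimiter for record in records)


if __name__ == "__main__":
    mainframe_style = b"AAAABBBB"  # two 4-byte records, no delimiters
    print(fixed_to_delimited(mainframe_style, 4))
    print(fixed_to_delimited(mainframe_style, 4, delimiter=b"\r"))
```

This approach is only safe when the records contain character data. A binary field could contain bytes that happen to match the delimiter, and a program reading the output would then split the record in the wrong place.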
Media Conversion is the term generally used when you only need to change the media, while leaving the tape format, file type, and file content unchanged. If you have the right operating system and tape program to read the tape, and an application program that can use the files, but you just don't have the right tape drive, then a media conversion is probably all you need.
Be aware, though, that even the same tape program may not write the same way to all tapes. For example, Arcserve, BackupExec, and others write slightly differently to a DLT tape than they do to a 4mm DDS tape, and simply copying from one to the other does not always work.
Mainframe Tape Details: A detailed description of physical and logical recording on mainframe tapes.
Mainframe Tape Terminology: A brief overview of mainframe tapes, and definition of terms.
Mainframe Data Types: Discusses mainframe data types.
Character, Binary, and BCD Fields: Explains the three field types.
Understanding Record Size and Record Delimiters: Discusses differences between mainframe records and PC records.
Converting IBM Mainframe Tape Files to PCs: Discusses some practical considerations when converting mainframe tapes to PC.
For more articles on data conversion, see our TechTalk Index.
For information on our data conversion services, see Mainframe & AS/400 Conversion to PC.
Disc Interchange Service Company, Inc.
Media Conversion Specialists
15 Stony Brook Road
Westford, MA 01886
(978) 692-0050