
File format

A file format is a particular way to encode data for storage in a computer file.

Since hard drives store only bits, the computer must have some way of converting information to 0s and 1s and vice versa. There are different kinds of formats for different kinds of information. Within any one kind, however (word processor documents, for example), there are typically several different, and sometimes competing, formats.

A format is typically indicated by a suffix (the "file extension") of 1 to 4 letters added to the file's name. For example, if a picture is stored using the JPEG format, the file might be named mypicture.jpeg or the like.
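As a minimal sketch of how a program might read the extension off a file name, the Python snippet below uses the standard library's os.path.splitext. The helper name file_extension and the sample file names other than mypicture.jpeg are illustrative assumptions, not part of any particular system.

```python
import os.path

def file_extension(filename):
    """Return the extension without its leading dot, lowercased ('' if none)."""
    _, ext = os.path.splitext(filename)  # e.g. ("mypicture", ".jpeg")
    return ext[1:].lower()

print(file_extension("mypicture.jpeg"))  # jpeg
print(file_extension("README"))          # prints an empty line: no extension
```

Lowercasing is a design choice for this sketch; whether extensions are compared case-sensitively depends on the operating system and the application.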

Not every operating system relies on extensions. Older versions of Mac OS, for example, did not require file extensions, but instead kept file type/creator data that was hidden from the user and managed transparently by the operating system. On Microsoft Windows computers, extensions are required for applications to be recognised as executable (and many applications require them to recognise specific data formats). On Unix and Unix-like systems a file may be given an extension, but this is optional; extensions on these systems are a convenience, not a requirement. These systems treat every file as, essentially, a data file, a directory (itself a special kind of file), or an executable.

Operating system settings determine which program is executed by default on "opening" a file with a particular extension. For example, if a file has extension .htm, the setting determines whether a web browser is used to interpret the HTML (and which one) or whether it is an editor or text viewer that displays the HTML code.
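The idea of such a setting can be sketched as a lookup table from extension to program. Everything named here (DEFAULT_PROGRAMS, program_for, and the program names) is hypothetical and only illustrates the principle; real operating systems store these associations in their own ways.

```python
import os.path

# Hypothetical association table: extension -> program opened by default.
DEFAULT_PROGRAMS = {
    ".htm": "web-browser",
    ".html": "web-browser",
    ".txt": "text-editor",
}

def program_for(filename, associations=DEFAULT_PROGRAMS):
    """Return the program associated with the file's extension, or None."""
    ext = os.path.splitext(filename)[1].lower()
    return associations.get(ext)

# A user who prefers to see the HTML code changes the setting for .htm only.
user_settings = dict(DEFAULT_PROGRAMS)
user_settings[".htm"] = "text-editor"

print(program_for("index.htm"))                 # web-browser
print(program_for("index.htm", user_settings))  # text-editor
```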

Many file formats, including some well-known ones, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program treats a particular file format correctly. There are two kinds of exception to this, however. First, some file format developers view their specification documents as trade secrets, and therefore do not release them to the public. A prominent example is found in several formats used by Microsoft's Office suite of applications (Word, Excel, Outlook and PowerPoint). Second, some file format developers never bother to write a specification document; rather, the format is defined only implicitly, through the program(s) that manipulate data in the format.

Using file formats without a publicly available specification can be costly. Learning how the format works requires either (A) reverse-engineering it from a reference implementation or (B) acquiring the specification document from the format developers for a fee. (The second option, possible only when a specification document exists, typically requires one to sign a non-disclosure agreement.) Either route requires significant time, money, or both. Therefore, as a general rule, file formats with publicly available specifications are supported by a large number of programs, while non-public formats are supported by only a few programs.

Some file formats are designed to store very particular sorts of data; the JPEG format, for example, is designed only to store still images. Other file formats, however, are designed for storage of several different types of data; the GIF format supports storage of both pictures and simple animations, and the AVI format can support many different types of multimedia.

A text file stores any text or numbers with a one-to-one correspondence between the bytes and ordinary readable characters such as letters and digits, plus some control characters. The extension may be .txt, but it may also be more specific, such as .par for a parameter file or .pas for a Pascal program. On a lower level, an HTML file is a text file. The "text" is the coding for a webpage, so on a higher level the file is a webpage file.
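The one-to-one correspondence between bytes and characters in a text file can be illustrated with a short Python sketch. It assumes the ASCII encoding, which is one common choice; the sample text is made up for illustration.

```python
# One line of a hypothetical parameter file, including its line-ending
# control character.
text = "x = 42\n"

# Encoding as ASCII yields exactly one byte per character.
data = text.encode("ascii")

print(len(text), len(data))  # 7 7
print(list(data))            # [120, 32, 61, 32, 52, 50, 10]

# Decoding the bytes recovers the original characters unchanged.
assert data.decode("ascii") == text
```

Note that the number 42 is stored as the two digit characters "4" and "2" (bytes 52 and 50), not as a single binary value.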

Since files are seen by programs as streams of data, a method is required to determine the format of the file. One way to indicate such metadata is with a file extension. Another is with out-of-band data, if the filesystem supports it. A third way is in-band, within the file itself, by means of a distinctive sequence of bytes (often called the magic number).

For example, a GIF file can be recognized by its extension ".gif", by type metadata kept outside the file, or by its first four bytes, "GIF8".
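In-band detection by magic number can be sketched as follows. The GIF entry comes from the example above; the JPEG and PNG signatures are added here for illustration, and the names MAGIC_NUMBERS, sniff_format and sniff_file are hypothetical. A real detector would know many more formats.

```python
# Known leading byte sequences ("magic numbers") and the formats they indicate.
MAGIC_NUMBERS = [
    (b"GIF8", "GIF"),               # covers both "GIF87a" and "GIF89a"
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
]

def sniff_format(header):
    """Guess a format name from the first bytes of a file, or return None."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return None

def sniff_file(path):
    """Read the start of the file at `path` and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))

print(sniff_format(b"GIF89a\x01\x00\x01\x00"))  # GIF
print(sniff_format(b"plain text"))              # None
```

Unlike an extension, the magic number travels with the file's contents, so this check still works if the file has been renamed.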

It is sometimes possible to cause a program to read a file encoded in one format as if it were encoded in another format. For example, with a bit of work a music-playing program can be used to play a (specially modified) Microsoft Word document as if it were a song. The result does not sound very musical, however. This is so because a sensible arrangement of bits in one format is almost always nonsensical in another.
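A small Python sketch shows why the result is nonsensical. It takes the bytes of a piece of text and reinterprets them as 16-bit audio samples; the choice of signed little-endian samples and the helper name as_audio_samples are assumptions made for illustration only.

```python
import struct

def as_audio_samples(data):
    """Reinterpret raw bytes as signed 16-bit little-endian audio samples."""
    count = len(data) // 2  # two bytes per sample; a trailing odd byte is dropped
    return struct.unpack("<%dh" % count, data[:2 * count])

document = b"Dear Sir"      # sensible as text...
samples = as_audio_samples(document)
print(samples)              # ...but arbitrary numbers when read as sound

# The bits themselves are unchanged; only the interpretation differs.
assert struct.pack("<%dh" % len(samples), *samples) == document
```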

It is very difficult to make a principled distinction between a file format and a programming language, or between a "normal program" and a programming language interpreter. A programming language can be seen as a file format for storing algorithms, while even a simple image file viewer can be seen as an "interpreter" for, say, the GIF "language".

The most useful part of intellectual property law for protecting ownership of a file format appears to be patent law. Although patents for file formats are not permitted, some formats require the encoding of data with patented algorithms. For example, the GIF file format requires the use of a patented algorithm (LZW compression). Initially the patent owner did not collect fees for use of the algorithm, but later began to do so. This resulted in a significant decrease in the use of GIFs. The patent expired in the US in mid-2003 and in Europe, Japan and Canada in mid-2004.

See also: list of file formats, graphics file format, audio file format, video file format, object file format
