
PDF Section
Portable Document Format (PDF)
is a file format developed by Adobe Systems for representing documents
in a manner that is independent of the original application software,
hardware, and operating system used to create those documents.

HTML Section
In computing, HyperText Markup Language (HTML) is a markup language designed for
the creation of web pages and other information viewable in a browser.

OWL Section
OWL is an acronym for Web Ontology Language, a markup language for publishing and sharing data using ontologies on the Internet.

SMIL Section
SMIL (pronounced "smile") is an abbreviation for the Synchronized Multimedia Integration Language.

VRML Section
VRML (Virtual Reality Modeling Language, usually pronounced vermal)
is a standard file format for representing 3-dimensional (3D)
interactive vector graphics, designed particularly with the World Wide
Web in mind.
Infoformat.com - Data Format Resource Guide

A file format is a particular way to encode information for storage in a computer file.
Since
a disk drive, or indeed any computer storage, can store only bits, the
computer must have some way of converting information to 0s and 1s and
vice-versa. There are different kinds of formats for different kinds of
information. Within any format type, e.g., word processor documents,
there will typically be several different formats. Sometimes these
formats compete with each other.
Generality
Some
file formats are designed to store very particular sorts of data: the
JPEG format, for example, is designed only to store static images.
Other file formats, however, are designed for storage of several
different types of data: the GIF format supports storage of both still
images and simple animations, and the QuickTime format can act as a
container for many different types of multimedia. A text file is simply
one that stores any text, in a format such as ASCII or Unicode, with
few if any control characters. Some file formats, such as HTML, or the
source code of some particular programming language, are in fact also
text files, but adhere to more specific rules which allow them to be
used for specific purposes.
It is sometimes
possible to cause a program to read a file encoded in one format as if
it were encoded in another format. For example, one can play a
Microsoft Word document as if it were a song by using a music-playing
program that deals in "headerless" audio files. The result does not
sound very musical, however. This is so because a sensible arrangement
of bits in one format is almost always nonsensical in another.
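The point about reading one format as another can be sketched in a few lines. This is only an illustration: it assumes the "headerless" audio is signed 16-bit little-endian PCM, and the function name `bytes_as_pcm16` is hypothetical.

```python
import struct

def bytes_as_pcm16(data: bytes) -> list:
    """Reinterpret arbitrary bytes as signed 16-bit little-endian samples."""
    usable = len(data) - (len(data) % 2)   # ignore a trailing odd byte
    count = usable // 2
    return list(struct.unpack(f"<{count}h", data[:usable]))

# The same bytes are sensible as text but meaningless as audio samples.
text = b"Hello, file formats!"
samples = bytes_as_pcm16(text)
print(samples)
```

Nothing in the bytes changes; only the rules used to interpret them do, which is why the resulting "audio" is noise.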
Specifications
Many
file formats, including some of the most well-known file formats, have
a published specification document (often with a reference
implementation) that describes exactly how the data is to be encoded,
and which can be used to determine whether or not a particular program
treats a particular file format correctly. There are, however, two
reasons why this is not always the case. First, some file format
developers view their specification documents as trade secrets, and
therefore do not release them to the public. A prominent example of
this exists in the formats used by the Microsoft Office suite of
applications. Second, some file format developers never spend time
writing a separate specification document; rather, the format is
defined only implicitly, through the program(s) that manipulate data in
the format.
Using file formats without a publicly
available specification can be costly. Learning how the format works
will require either reverse engineering it from a reference
implementation or acquiring the specification document for a fee from
the format developers. This second approach is possible only when there
is a specification document, and typically requires the
signing of a non-disclosure agreement. Both strategies require
significant time, money, or both. Therefore, as a general rule, file
formats with publicly available specifications are supported by a large
number of programs, while non-public formats are supported by only a
few programs.
Patent law, rather than copyright, is
more often used to protect a file format. Although patents for file
formats are not directly permitted under US law, some formats require
the encoding of data with patented algorithms. For example, the GIF
file format requires the use of a patented algorithm, and although
initially the patent owner did not enforce it, they later began
collecting fees for use of the algorithm. This has resulted in a
significant decrease in the use of GIFs, and is partly responsible for
the development of the alternative PNG format. However, the patent
expired in the US in mid-2003 and worldwide in mid-2004. Algorithms
themselves are not currently patentable under European law.
Identifying the type of a file
Since
files are seen by programs as streams of data, a method is required to
determine the format of a particular file within the filesystem—an
example of metadata. Different operating systems have traditionally
taken different approaches to this problem, with each approach having
its own advantages and disadvantages.
Of course,
most modern operating systems, and individual applications, need to use
all of these approaches to process various files, at least to be able
to read 'foreign' file formats, if not work with them completely.
Filename extension
One
popular method in use by several operating systems, including Mac OS X,
CP/M, DOS, and Windows, is to determine the format of a file based on
the section of its name following the final period. This portion of the
filename is known as the filename extension. For example, HTML
documents are identified by names that end with .html (or .htm on older
systems), and GIF images by .gif. In the original FAT filesystem,
filenames were limited to an eight-character identifier and a
three-character extension, which is known as 8-dot-3. Many formats thus
still use three-character extensions, even though modern operating
systems and application programs no longer have this limitation. Since
there is no standard list of extensions, more than one format can use
the same extension, which can confuse the operating system and
consequently users.
One advantage of this approach
is that the system can easily be tricked into treating a file as a
different format simply by renaming it—an HTML file can, for instance,
be easily treated as plain text by renaming it from filename.html to
filename.txt. Although this strategy was useful to expert users who
could easily understand and manipulate this information, it was
frequently confusing to less technical users, who might accidentally
make a file unusable (or 'lose' it) by renaming it incorrectly. This
led more recent operating system shells, such as Windows 95 and Mac OS
X, to hide the extension when displaying lists of recognized files.
This separates the user from the complete filename, preventing the
accidental changing of a file type, while allowing expert users to
still retain the original functionality through enabling the displaying
of file extensions.
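A minimal sketch of extension-based identification follows. The lookup table `EXTENSION_FORMATS` and the function `format_from_name` are hypothetical; a real system would use a much larger, configurable mapping.

```python
from pathlib import PurePath

# Hypothetical table mapping extensions to format descriptions.
EXTENSION_FORMATS = {
    ".html": "HTML document",
    ".htm": "HTML document",
    ".gif": "GIF image",
    ".txt": "Plain text",
}

def format_from_name(filename: str) -> str:
    """Guess a format from the part of the name after the final period."""
    suffix = PurePath(filename).suffix.lower()
    return EXTENSION_FORMATS.get(suffix, "unknown")

# Renaming alone changes the answer; the file contents are never inspected.
print(format_from_name("filename.html"))
print(format_from_name("filename.txt"))
```

Because only the name is consulted, the check is cheap, but it is also exactly as trustworthy as the name itself.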
Magic number
An
alternative method, often associated with Unix and its derivatives, is
to store a "magic number" inside the file itself. Originally, this term
was used for a specific set of 2-byte identifiers at the beginning of a
file, but since any undecoded binary sequence can be regarded as a
number, any feature of a file format which uniquely distinguishes it
can be used for identification. GIF images, for instance, always begin
with the ASCII representation of either GIF87a or GIF89a, depending
upon the standard to which they adhere. Many file types, most
especially plain-text files, are harder to spot by this method. HTML
files, for example, might begin with the string <html> (which is
not case sensitive), or an appropriate document type definition that
starts with <!DOCTYPE, or, for XHTML, the XML identifier, which
begins with <?xml. The files could also begin with any random text
or several empty lines, but still be usable HTML.
This
approach offers better guarantees that the format will be identified
correctly, and can often determine more precise information about the
file. Since reliable "magic number" tests can be fairly complex, and
each file must effectively be tested against every possibility in the
magic database, this approach is also relatively inefficient,
especially for displaying large lists of files (in contrast, filename
and metadata-based methods need check only one piece of data, and match
it against a sorted index). Also, data must be read from the file
itself, increasing latency as opposed to metadata stored in the
directory. Where filetypes don't lend themselves to recognition in this
way, the system must fall back to metadata. It is, however, the best
way for a program to check if a file it has been told to process is of
the correct format: while the file's name or metadata may be altered
independently of its content, failing a well-designed magic number test
is a pretty sure sign that the file is either corrupt or of the wrong
type.
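The GIF and HTML examples above can be turned into a small content sniffer. This is a sketch only: the function name `sniff_format` is hypothetical, the HTML test is a loose heuristic, and it matches `<html` rather than `<html>` so that tags with attributes are also caught.

```python
def sniff_format(head: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    # GIF files always start with one of two ASCII signatures.
    if head[:6] in (b"GIF87a", b"GIF89a"):
        return "GIF image"
    # Text formats are fuzzier: skip leading whitespace and ignore case.
    text = head.lstrip().lower()
    if text.startswith((b"<html", b"<!doctype", b"<?xml")):
        return "probably HTML or XHTML"
    return "unknown"

print(sniff_format(b"GIF89a\x01\x00\x01\x00"))
print(sniff_format(b"\n\n<HTML><body>hi</body></HTML>"))
```

A file that starts with arbitrary text can still be usable HTML, so an "unknown" result here does not rule that format out.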
So-called shebang lines in script files are a
special case of magic numbers. Here, the magic number is human-readable
text that identifies a specific command interpreter and options to be
passed to the command interpreter.
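As a rough illustration, a shebang line can be split into the interpreter and the remaining option text. The helper name `parse_shebang` is hypothetical, and this sketch treats everything after the interpreter as one option string, which is a simplifying assumption.

```python
def parse_shebang(first_line: str):
    """Return (interpreter, options) for a shebang line, or None otherwise."""
    if not first_line.startswith("#!"):
        return None
    rest = first_line[2:].strip()
    if not rest:
        return None
    parts = rest.split(None, 1)            # split once on any whitespace
    interpreter = parts[0]
    options = parts[1] if len(parts) > 1 else ""
    return interpreter, options

print(parse_shebang("#!/bin/sh -e"))
print(parse_shebang("print('not a script header')"))
```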
Explicit metadata
A final way of storing the format of a file is to explicitly store information about the format in the file system.
This
approach keeps the metadata separate from both the main data and the
name, but is also less portable than either file extensions or "magic
numbers", since the format has to be converted from filesystem to
filesystem. While this is also true to an extent with filename
extensions — for instance, for compatibility with MS-DOS's three
character limit — most forms of storage have a roughly equivalent
definition of a file's data and name, but may have varying or no
representation of further metadata.
Note that zip
files and other archive files solve the problem of handling metadata. A utility
program collects multiple files together, along with metadata about each
file and the folders/directories they came from, all within one new file
(e.g. a zip file with the extension .zip). The new file is also compressed
and possibly encrypted, and can be transmitted as a single
file across operating systems by FTP or attached to email. At
the destination, it must be unzipped by a compatible utility to be
useful, but the problems of transmission are solved this way.
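The archive approach can be sketched with Python's standard `zipfile` module. The member names and contents below are made up for illustration, and the archive is built in memory rather than on disk.

```python
import io
import zipfile

# Build an archive in memory; paths and contents are illustrative only.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("docs/readme.txt", "plain text body")
    archive.writestr("site/index.html", "<html></html>")

# Reopen the single resulting file and read back the per-member metadata.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    for info in archive.infolist():
        print(info.filename, info.file_size, info.date_time)
    names = archive.namelist()
    readme = archive.read("docs/readme.txt")
```

Each member keeps its path, size and timestamp inside the archive, so that metadata travels with the data regardless of what the destination filesystem can represent.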
Mac OS type-codes
The Mac OS Hierarchical File System stores codes for creator and type
as part of the directory entry for each file. These codes are referred
to as OSTypes. For instance, an application written by Apple would
have a creator of AAPL and a type of APPL. RISC OS
uses a similar system, consisting of a 12-bit number which can be
looked up in a table of descriptions — e.g. the hexadecimal number FF5
is "aliased" to PoScript, representing a PostScript file.
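Both schemes reduce to small fixed-width codes. An OSType is a four-character code that fits in a 32-bit integer, and a RISC OS file type is a 12-bit number looked up in a table. The sketch below assumes plain ASCII codes for simplicity; the function names are hypothetical, and the table holds only the one entry mentioned above.

```python
def ostype_to_int(code: str) -> int:
    """Pack a four-character code into a 32-bit big-endian integer."""
    raw = code.encode("ascii")             # assumption: ASCII-only codes
    if len(raw) != 4:
        raise ValueError("an OSType is exactly four characters")
    return int.from_bytes(raw, "big")

def int_to_ostype(value: int) -> str:
    """Unpack a 32-bit integer back into its four-character code."""
    return value.to_bytes(4, "big").decode("ascii")

# Only the example from the text; a real table has many more entries.
RISC_OS_TYPES = {0xFF5: "PoScript"}

def riscos_type_name(number: int) -> str:
    if not 0 <= number <= 0xFFF:
        raise ValueError("RISC OS file types are 12-bit numbers")
    return RISC_OS_TYPES.get(number, "unknown")

print(hex(ostype_to_int("APPL")), riscos_type_name(0xFF5))
```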
Mac OS X Uniform Type Identifiers (UTIs)
A
Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely
identifying "typed" classes of entity, such as file formats. It was
developed by Apple as a replacement for OSType (type & creator
codes).
The UTI is a Core Foundation string, which
uses a reverse-DNS format. Common or standard types use the public
domain (e.g. public.png for a Portable Network Graphics image), while
other domains can be used for third-party types (e.g. com.adobe.pdf for
Portable Document Format). UTIs can be defined within a hierarchical
structure, known as a conformance hierarchy. Thus, public.png conforms
to a supertype of public.image, which itself conforms to a supertype of
public.data. A UTI can exist in multiple hierarchies, which provides
great flexibility.
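A conformance hierarchy can be modelled as a table of direct supertypes plus a recursive lookup. This is a toy model, not Apple's API: the table `CONFORMS_TO` and the function `conforms_to` are hypothetical, and the `com.example.*` identifiers are invented to show a type sitting in more than one hierarchy.

```python
# Each UTI maps to the supertypes it conforms to directly.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
    # Hypothetical third-party types, for illustration only.
    "com.example.annotation": [],
    "com.example.annotated-png": ["public.png", "com.example.annotation"],
}

def conforms_to(uti: str, target: str) -> bool:
    """True if `uti` is `target` or reaches it through its supertypes."""
    if uti == target:
        return True
    return any(conforms_to(parent, target) for parent in CONFORMS_TO.get(uti, []))

print(conforms_to("public.png", "public.data"))
print(conforms_to("public.data", "public.png"))
```

A program that can handle any `public.image` can therefore accept a `public.png` without knowing about PNG specifically.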
In addition to file formats, UTIs can also be used for other entities which can exist in the OS X file system, including:
- Pasteboard data
- Folders (directories)
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
OS/2 Extended Attributes
The
HPFS, FAT12 and FAT16 (but not FAT32) filesystems allow the storage of
"extended attributes" with files. These comprise an arbitrary set of
triplets with a name, a coded type for the value and a value, where the
names are unique and values can be up to 64 KB long. There are
standardized meanings for certain types and names (under OS/2). One
such is that the ".TYPE" extended attribute is used to determine the
file type. Its value comprises a list of one or more file types
associated with the file, each of which is a string, such as "Plain
Text" or "HTML document". Thus a file may have several types.
The NTFS filesystem also allows OS/2 extended attributes to be stored, as one of a file's forks,
but this feature is present merely to support the OS/2 subsystem (no
longer present in XP), so Windows treats this information as an opaque
block of data and does not use it. Instead, it relies on other file
forks to store meta-information in Windows-specific formats. OS/2
extended attributes can still be read and written by programs, but the
data must be entirely parsed by applications.
POSIX extended attributes
On
Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS,
JFS, FFS, and HFS+ filesystems allow the storage of extended attributes
with files. These include an arbitrary list of "name=value" strings,
where the names are unique, which can be accessed by their "name" parts.
PRONOM Unique Identifiers (PUIDs)
The
PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of
persistent, unique and unambiguous identifiers for file formats, which
has been developed by The National Archives of the UK as part of its
PRONOM technical registry service. PUIDs can be expressed as Uniform
Resource Identifiers using the info:pronom/ namespace. Although not yet
widely used outside of UK government and some digital preservation
programmes, the PUID scheme does provide greater granularity than most
alternative schemes.
MIME types
MIME
types are widely used in many Internet-related applications, and
increasingly elsewhere, although their usage for on-disc type
information is rare. These consist of a standardised system of
identifiers (managed by IANA) consisting of a type and a sub-type,
separated by a slash — for instance, text/html or image/gif. These were
originally intended as a way of identifying what type of file was
attached to an e-mail, independent of the source and target operating
systems. MIME types are used to identify files on BeOS, as well as
store unique application signatures for application launching.
There
are problems with MIME types, though: several organisations and
people have created their own MIME types without registering them
properly with IANA, which makes the use of this standard awkward in
some cases.
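The type/subtype structure is easy to work with in code. The sketch below uses Python's standard `mimetypes` module to guess a MIME type from a filename; the helper `split_mime` is hypothetical, and the guess depends on the platform's type tables.

```python
import mimetypes

def split_mime(mime_type: str):
    """Split 'type/subtype' into its two parts, lower-cased."""
    major, sep, minor = mime_type.partition("/")
    if not sep or not major or not minor:
        raise ValueError(f"not a type/subtype pair: {mime_type!r}")
    return major.lower(), minor.lower()

# guess_type returns (type, encoding); either may be None if nothing matches.
guessed, _encoding = mimetypes.guess_type("page.html")
if guessed is not None:
    print(guessed, split_mime(guessed))
```

Unregistered types still parse the same way, which is part of why they spread: nothing in the syntax distinguishes them from types registered with IANA.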
File format identifiers (FFIDs)
File
format identifiers are another, not widely used, way to identify file
formats according to their origin and their file category. The scheme was
created for the Description Explorer suite of software. An FFID is composed
of several digits in the form NNNNNNNNN-XX-YYYYYYY. The first part
indicates the organisation of origin or maintainer (this number represents a
value in a company/standards organisation database). The following two
digits categorize the type of file, in hexadecimal. The
final part is composed of the usual file extension of the file or the
international standard number of the file, padded left with zeros. For
example, the PNG file specification has the FFID
000000001-31-0015948, where 31 indicates an image file, 0015948 is the
standard number and 000000001 indicates the ISO organisation.
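The three-part layout can be parsed with a short regular expression. This is a sketch based only on the description above: the names `FFID` and `parse_ffid` are hypothetical, and the pattern assumes a nine-digit organisation number, two hexadecimal category digits, and a seven-character final part.

```python
import re
from typing import NamedTuple

FFID_PATTERN = re.compile(r"^([0-9]{9})-([0-9A-Fa-f]{2})-([0-9A-Za-z]{7})$")

class FFID(NamedTuple):
    organisation: int   # index into an organisation database
    category: int       # file category, decoded from hexadecimal
    identifier: str     # extension or standard number, zero padding removed

def parse_ffid(text: str) -> FFID:
    match = FFID_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid FFID: {text!r}")
    organisation, category, identifier = match.groups()
    return FFID(int(organisation), int(category, 16), identifier.lstrip("0"))

print(parse_ffid("000000001-31-0015948"))
```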