Text file

In information technology, a text file is a file that contains displayable characters. These may be structured by control characters such as line breaks and page breaks. The counterpart of the text file is the binary file. Strictly speaking, text files are also stored in binary form; the two terms are nevertheless used as opposites, because the interpretation of the binary content is the decisive factor: in a text file, the content is interpreted as a sequential series of characters from a character set, whereas in a binary file any other interpretation of the content is possible. Consequently, a text file, unlike a binary file, is readable without special programs and can be viewed and edited with a simple text editor, such as Notepad on Microsoft Windows or vi on Unix.

In contrast to this technical definition, in which the file format is decisive, colloquial usage of the term is often oriented primarily toward the content visible to the end user: here, somewhat loosely, all files created with the aim of presenting readable text are called "text files", regardless of the form in which they are stored. The files that common word processing and desktop publishing software produce for storage, however, often use complex file formats which, besides the text, contain meta-information describing the layout, the structure, and the fonts used; images or graphics may also be embedded. Such files are therefore not text files in the technical sense, since the file formats are often binary and special software is required to view them.

For a text file in the technical sense, the set of available characters is determined by the underlying character encoding. The most common are ASCII and the UTF-8 encoding of Unicode. Such a text file does not necessarily have to contain text; it may, for example, contain ASCII art, that is, pictures built from the available characters. If, however, the content is text and understanding it requires neither special processing steps nor knowledge of a special notation, the content is referred to as plain text. The set of characters used is also often limited by a natural or formal language. Text files that require a specific notation, such as HTML files, can be edited with a simple text editor, but there are often special-purpose programs that make editing easier, for example through syntax highlighting or automatic formatting.

History

In the early days of electronic data processing, the distinction between text and binary files was simpler than it is today. In a text file, each character was always converted directly into a specific bit pattern. The file could be transmitted directly, that is, character by character and without conversion by a special program, to a terminal, printer or teletype. The Baudot code used for transmission between teleprinters is also the origin of the control characters "line feed" and "carriage return" that are found in text files.

A character encoding is used to convert the physically stored bit sequences into text. In the past, a character was almost always translated into exactly one byte, normally a group of 8 bits, which allowed 256 different characters. With ASCII encoding in its original definition, only 7 bits are actually used.

With 7-bit or 8-bit character sets, only one script can be used in a file, so the use of different languages is limited. The East Asian writing systems, such as Japanese, Chinese and Korean, can practically not be represented. ISO 2022, introduced in 1986, was the first standard that allowed different scripts to be used in one text file and that also provided for scripts with more than 256 different characters. However, this standard achieved appreciable adoption only in East Asia and was superseded by Unicode, first published in 1991, whose long-term aim is to represent all existing writing systems.

Since the introduction of Unicode, converting a character into its binary representation has become more complicated, because there are several variants for doing so and a character is not always represented by the same number of bytes.

As the exchange of files between different computer systems has become more important, not least because of the Internet, and text files allow system-independent editing more easily than binary files, the text format has gained in importance. However, precisely because text files are now used in so many different ways, the term itself has become blurrier.

Distinction between binary and text files

Many operating systems have conventions for identifying the file type by an extension of the file name. On Windows, the suffix .txt is often appended to the name of a text file.

The Multipurpose Internet Mail Extensions (MIME), designed to standardize the technical format of emails, define so-called media types, which are now used in many areas besides email to identify the file type. The media type text denotes text. The complete type specification is supplemented by a subtype that specifies the purpose of the text. For text files that contain only the "actual" text, not intended for a specific kind of machine processing, the complete type specification is text/plain.
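
The link between file name extension and media type can be illustrated with a minimal Python sketch. It uses the standard library module mimetypes; the helper name media_type_for and the file names are hypothetical and chosen only for this example. The guess is based on the file name alone and says nothing about the actual content.

```python
import mimetypes


def media_type_for(filename):
    """Return the media type guessed from the file name extension, or None."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type


if __name__ == "__main__":
    # A .txt suffix is conventionally mapped to the media type text/plain.
    print(media_type_for("notes.txt"))
    # A name without an extension gives the guesser nothing to work with.
    print(media_type_for("README"))
```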

For the text contained in a text file, no special formatting such as italics or bold can be specified. Some encodings allow stacking diacritics or the display of bidirectional text.

A file created with a word processor (such as Microsoft Word) is normally not a text file, even if only text was written, because the text can be viewed and edited again only with a suitable word processing system. A Portable Document Format (PDF) version of a text is not a text file either, because it contains binary-encoded format information. Likewise, texts captured with a scanner are not text files. They are image files, unless they are converted to a text file after scanning by text recognition software (OCR).

Data compression generally achieves significantly greater savings in storage size for text files than for binary files. This is because the information density of text files is lower than that of most binary files, which common compression algorithms exploit, for example by using Huffman coding.
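
The difference can be made visible with a rough Python sketch using the standard zlib module (which combines LZ77 with Huffman coding). The sample data are assumptions made for illustration: the text sample is one sentence repeated, which exaggerates the effect compared with real prose, and pseudo-random bytes stand in for a binary file with high information density.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means better)."""
    return len(zlib.compress(data, 9)) / len(data)


def sample_text(size: int) -> bytes:
    # Illustrative stand-in for a text file: one sentence, repeated.
    sentence = b"A text file contains displayable characters and control characters.\n"
    return (sentence * (size // len(sentence) + 1))[:size]


def sample_dense(size: int, seed: int = 0) -> bytes:
    # Illustrative stand-in for dense binary data: pseudo-random bytes.
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


if __name__ == "__main__":
    print("text :", round(compression_ratio(sample_text(10_000)), 3))
    print("dense:", round(compression_ratio(sample_dense(10_000)), 3))
```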

Marking the end of line

There are basically two ways to determine where a new line begins in the text: defining a constant number of characters per line, or using defined special characters to mark the end of a line.

Defining a constant line length

A fixed line length has the advantage that the position of a particular line within the character sequence (byte sequence) of the file can be determined without reading the file line by line. It has the disadvantage, however, that lines with shorter content must be "filled up" (see padding); this is generally done with blanks. As a result, the file takes up more space than necessary whenever the line length is not fully used. Such a fixed line length is common only on mainframe systems. The record length is in this case managed by the file system or must be specified when accessing the file. Quite often the record length is 80 characters, since this number of characters can be displayed on one line of a character-based terminal.
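
The direct position calculation can be sketched in Python as follows. The function names pack_records and read_record are hypothetical, and the sketch assumes ASCII content and blank padding; note that stripping the padding on reading also removes any trailing blanks that belonged to the original line.

```python
RECORD_LENGTH = 80  # assumed record length, as on many character terminals


def pack_records(lines, record_length=RECORD_LENGTH):
    """Pad every line with blanks to the fixed record length and concatenate."""
    records = []
    for line in lines:
        data = line.encode("ascii")
        if len(data) > record_length:
            raise ValueError("line longer than the record length")
        records.append(data.ljust(record_length, b" "))
    return b"".join(records)


def read_record(blob, index, record_length=RECORD_LENGTH):
    """Fetch record number `index` by computing its byte offset directly."""
    start = index * record_length
    record = blob[start:start + record_length]
    if len(record) < record_length:
        raise IndexError("no such record")
    return record.rstrip(b" ").decode("ascii")


if __name__ == "__main__":
    blob = pack_records(["first", "second line", "third"])
    print(len(blob))             # three records of 80 bytes each
    print(read_record(blob, 1))  # found without scanning the first record
```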

Marking by means of control characters

The usual choice of characters for marking the end of a line goes back to the original direct output of text files to teleprinters or printers, which resembled a typewriter in their design. The "commands" carriage return (CR) and line feed (LF) were both necessary there to continue printing at the beginning of the next line; on a teletype these were two separate keys. These two control characters were therefore the obvious candidates for marking line ends in electronically stored files. In principle, however, one of the two characters is sufficient, and this freedom of choice led to inconsistent conventions, which to this day complicate the exchange of files between systems:

On Microsoft Windows and its predecessor MS-DOS, the sequence CR followed by LF is used to mark the end of a line.
On Unix, Linux and related systems, the end of a line is marked by LF alone.
On older operating systems from Apple, a third variant was common: the exclusive use of CR.
In the IBM mainframe world, EBCDIC provides a further special character (New Line, NL) in addition to these two.

Most problems in this respect arise when exchanging files between Windows and Unix platforms, since they largely use the same character codes and, apart from the end-of-line marker, generally no conversion of the files is required.
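
Converting between the CR LF, LF and CR conventions amounts to simple byte substitution, as the following Python sketch suggests. The function names to_lf and to_crlf are hypothetical, and the sketch assumes an ASCII-compatible encoding; it does not handle the EBCDIC NL character.

```python
def to_lf(data: bytes) -> bytes:
    """Normalize CR LF (Windows) and lone CR (older Apple systems) to LF."""
    # Replace the two-byte sequence first so it is not turned into two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_crlf(data: bytes) -> bytes:
    """Convert any of the three conventions to CR LF."""
    return to_lf(data).replace(b"\n", b"\r\n")


if __name__ == "__main__":
    mixed = b"Windows\r\nold Mac\rUnix\n"
    print(to_lf(mixed))
    print(to_crlf(mixed))
```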

Other control characters

Besides end-of-line markers, other control characters can occur in text files, especially when ASCII is used. They were particularly common when the content of text files was transferred directly to a terminal or printer. The most important are form feed (FF), which marks the position of a page break in the text, and horizontal tabulation (HT), the tab character, which indents the text.

To influence the display of the text in even more detail, escape sequences were sometimes used in text files. They consist of the introductory control character Escape (ESC) and a series of further characters that encode a display instruction. For the control of terminals, Digital Equipment Corporation (DEC) set the standard with its VT models (ANSI X3.41-1974 and X3.64-1977). For printing, the ESC/P standard introduced by Epson was widespread, so that such escape sequences were also found in text files in the era of dot matrix printers.

Character encoding

The binary content physically stored in a text file is converted to text according to a rule that is fixed for each file. Common character encodings include:

ASCII is the most widely used format, particularly when the various standards that extend it are included.
ISO 8859-1 (also known as Latin-1) and ISO 8859-15 are standardized extensions of ASCII, which form the basis of Windows-1252, the code used by Microsoft Windows for English and Western European languages.
EBCDIC is an encoding commonly used on IBM mainframes.
Unicode is an international standard that aims to cover all meaningful characters in use around the world. Unlike the encodings above, Unicode cannot make do with 8 bits (that is, one byte) per character, because it defines far more than 256 different characters.

With Unicode, the general rule that one character is converted into one byte no longer applies. There are different methods for converting Unicode into a byte sequence. The most widely used ones are designed to minimize file size when the most common characters occur. For this, however, the rule that each character is always encoded with the same number of bytes is "sacrificed". One example is the widespread UTF-8 encoding, which also has the property that all characters contained in the original ASCII are encoded as one byte in exactly the same way as in ASCII. The binary content of a file that consists exclusively of such characters is therefore identical regardless of whether it was encoded in ASCII or UTF-8.
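
Both properties of UTF-8, the variable number of bytes per character and the byte-for-byte agreement with ASCII, can be checked with a short Python sketch. The helper name utf8_length and the sample characters are chosen only for illustration.

```python
def utf8_length(character: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(character.encode("utf-8"))


if __name__ == "__main__":
    # Characters from ASCII, Latin-1, the currency symbols and the emoji range.
    for character in ["A", "ä", "€", "😀"]:
        print(repr(character), utf8_length(character))

    # A string of pure ASCII characters yields identical bytes in both encodings.
    sample = "plain ASCII text"
    print(sample.encode("ascii") == sample.encode("utf-8"))
```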

For Unicode there is also the convention of identifying the Unicode encoding used by means of specific byte sequences at the beginning of a file (called byte order marks). This is necessary not least because on many systems, including Windows, ASCII-based encodings and Unicode are used side by side. With such an encoding, the boundary to a binary file begins to blur.
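
A minimal Python sketch of recognizing a byte order mark might look as follows. It relies on the BOM constants of the standard codecs module; the function name detect_bom and the returned labels are hypothetical. The UTF-32 marks are tested before the UTF-16 marks because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark, which also means a UTF-16 file whose first character is U+0000 would be misread by this simple check.

```python
import codecs

# Order matters: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return a label for the byte order mark at the start of data, or None."""
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    return None


if __name__ == "__main__":
    print(detect_bom(codecs.BOM_UTF8 + "text".encode("utf-8")))
    print(detect_bom(b"no mark at all"))
```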

If a text file is interpreted using the wrong character encoding, it may be completely illegible when entirely incompatible encodings are involved, such as ASCII and EBCDIC. If, however, a different encoding that is also derived from ASCII is used, only the special characters, for example the German umlauts, are misrepresented, since these are not among the first 128 characters of the ASCII standard.
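
Both cases can be reproduced with a small Python sketch. The helper name misread is hypothetical; cp037 is one EBCDIC code page shipped with Python and is used here only as a representative of EBCDIC.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # Two ASCII-derived encodings: only the umlaut is garbled.
    print(misread("Käse", "utf-8", "latin-1"))
    # EBCDIC read as an ASCII-derived encoding: nothing stays legible.
    print(repr(misread("Hello", "cp037", "latin-1")))
```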

Exchange between different systems

When text files are transferred from one system to a system of a different type, it must be checked whether the character encodings used by the systems match. The method used for marking the end of a line must also be taken into account (see above). Exchanging files that use only the first 128 characters of ASCII is usually unproblematic between systems that use ASCII or an encoding derived from it. The Unicode encoding UTF-8 agrees exactly with ASCII when only these characters are used. If additional characters are used, however, a conversion is often required. Note that a conversion should be performed only if the file is actually to be displayed on the target system. If the file is merely stored there and later transferred back for display to a system that uses the original encoding, conversion would be unnecessary and possibly even harmful, since information can be lost through the double conversion.
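
How a double conversion can lose information is shown in the following Python sketch. The scenario is an assumption made for illustration: a file written in Windows-1252 is converted to ISO 8859-1 for storage and converted back later. The euro sign exists in Windows-1252 but not in ISO 8859-1, so the converter (here with the error handler "replace") has to substitute it.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings, replacing unrepresentable characters."""
    return data.decode(source).encode(target, errors="replace")


def round_trip(data: bytes, home: str, foreign: str) -> bytes:
    """Convert to the foreign encoding for storage and back again."""
    stored = transcode(data, home, foreign)
    return transcode(stored, foreign, home)


if __name__ == "__main__":
    original = "Preis: 5 €".encode("cp1252")
    returned = round_trip(original, "cp1252", "iso-8859-1")
    print(original)
    print(returned)  # the euro sign has been replaced by a question mark
```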

Inconsistencies may occur when text files are exchanged as email attachments. The problem usually lies with the sender: the mail client often cannot correctly determine the encoding of the text file, yet for the sake of ease of use does not ask the user for this information, and so enters no information or wrong information in the mail. In principle, most mail clients in common use today are able to convert the encoding if required.

For a direct file transfer between systems, a special program is usually used. It also handles the necessary conversions, even if the encodings of the two systems are completely different, as in an exchange between Windows and IBM mainframes. For a transfer it must normally be specified whether the file to be transferred is a text file or a binary file, in order to determine whether the file is to be converted or not; the content of a binary file would be destroyed by such a conversion.

Using text files

The original and simplest use of text files is to convey the actual information contained in the text (plain text). But text files can also convey complex data by means of a formal structure agreed beforehand. The file is then usually not intended primarily for direct use by the user, but is processed further by a particular program or maintained by a system administrator.

Today, text files are often used in this way even where binary files seem the obvious choice because the data is only processed further by machines. The crucial disadvantage of binary files here is that their structure is far less uniform across system boundaries than that of text files (see, for example, byte order). Text files, on the other hand, have the disadvantage that more storage is required for the same information and that in many cases the data must first be converted back into a binary format for further processing. However, since the exchange of data between systems has become increasingly important, especially through the Internet, storing data in text files is common today.

The text format is also often used for configuration files that are maintained by administrators or privileged users. With a binary format, a special configuration program would be needed in each case; with the text format, the configuration file can be edited directly with a text editor. This has always been the usual approach in the Unix and Linux world; with the widespread adoption of XML, however, configuration information is now stored predominantly in text files on all systems.

Tabular data

Text files are used for various reasons to store data with a table structure. Files structured in this way can be processed with a spreadsheet program (such as Calc from the LibreOffice and OpenOffice suites, or Microsoft Excel). Database data is also often exported in this form for exchange between very different application programs, even though the XML format now seems the obvious choice for this purpose.

There are various methods for tabular arrangement of data in text files, of which the following are the most common:

Separation of the columns by tabs: the tab character, a special control character, is used within a line to mark the column boundaries.
CSV format: this format, whose name originally stood for comma-separated values, is similar to tab separation, except that the comma is usually used as the delimiter in English-speaking countries and the semicolon in German-speaking ones.
Constant number of characters per column: to use such a file, the width of each column must be known. This definition is not stored in the file itself.
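
The three layouts listed above can be read with a few lines of Python, sketched here under simple assumptions. The function names, the sample data and the column widths are hypothetical; the delimited variants use the standard csv module, and the fixed-width variant needs the column widths supplied from outside, since they are not stored in the file.

```python
import csv
import io


def parse_delimited(text: str, delimiter: str):
    """Split each line into columns at the given delimiter character."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def parse_fixed_width(text: str, widths):
    """Cut each line into columns of known width and strip the padding."""
    rows = []
    for line in text.splitlines():
        row, position = [], 0
        for width in widths:
            row.append(line[position:position + width].strip())
            position += width
        rows.append(row)
    return rows


if __name__ == "__main__":
    print(parse_delimited("name\tcity\nAda\tLondon\n", "\t"))
    print(parse_delimited("name;city\nAda;London\n", ";"))
    print(parse_fixed_width("name city  \nAda  London\n", [5, 6]))
```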

XML

XML (Extensible Markup Language) is a meta file format: it specifies how the format defining the structure of a file is itself defined. XML is deliberately a text format and is meant to be readable by humans and machines alike; it is also intended to make the exchange of data across systems possible without problems.

XML files are thus text files whose coarse structure is standardized. They are used mainly for data exchange or data storage; the exact purpose is not prescribed by XML itself. One example of an XML-based format is SVG (Scalable Vector Graphics), a graphics format that is therefore stored as a text file and is in principle readable.
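
That an SVG graphic is ordinary text with a standardized coarse structure can be illustrated with a small Python sketch. The SVG content and the function name list_shapes are made up for this example; parsing uses the standard module xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written SVG document: plain text that describes a graphic.
SVG_TEXT = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">'
    '<circle cx="25" cy="25" r="20"/>'
    '<rect x="50" y="5" width="40" height="40"/>'
    "</svg>"
)


def list_shapes(svg_text: str):
    """Return the element names directly below the root, without namespace."""
    root = ET.fromstring(svg_text)
    return [element.tag.split("}")[-1] for element in root]


if __name__ == "__main__":
    print(list_shapes(SVG_TEXT))
    print(ET.fromstring(SVG_TEXT).get("width"))
```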

The file formats of the word processors OpenOffice.org (OpenDocument) and of newer versions of Microsoft Word (Office Open XML, recognizable by the file extension .docx instead of .doc) are based on XML, and the stored XML is accordingly text; in practice, however, the XML parts are normally packed into a ZIP archive, so the file as a whole is binary. It must also be noted that the "text" which is visible when such XML is processed directly is not the "real" text content of the document, but a description of the document on a meta level.

Additional file formats

Besides XML formats, there are several, mostly older, quite widespread markup languages whose documents are commonly stored as text files.

HTML, the language for the design of content on the World Wide Web, is related in structure to XML.
Rich Text Format (RTF) is a language for the exchange of formatted text between word processing programs on different platforms.
PostScript is a file format that allows professional print formatting and is stored in the form of a text file. The binary data of embedded graphics is converted into text as hexadecimal digits. Since many printers can interpret this format directly, many word processing and desktop publishing programs output their results in PostScript format. In some areas, however, PostScript is being displaced by PDF.

In addition, there are many other formats, some of them proprietary, whose structure can be understood only if a corresponding specification is available.

View and edit text files

Text editors are used for viewing and editing text files directly. Practically all text editors allow searching a file for specific text. Many text editors also support the editing of special file formats by highlighting the various syntax elements according to their meaning (for example, by coloring). A file can normally also be printed from a text editor.

Both when viewing a file in a text editor and when printing it, the problem may arise that the indentation of lines is not displayed correctly. This usually happens because the file contains the tab control character, for which it is not uniformly defined how far the indentation should go. How many characters are indented is thus a configuration setting of the editor or printer. To make matters worse, the difference between spaces and a tab character is usually difficult or impossible to see in the text editor.
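
The effect of the configured tab width can be reproduced with Python's built-in str.expandtabs, as in the following sketch. The function name render and the widths 4 and 8 are merely illustrative settings of a hypothetical editor.

```python
def render(line: str, tab_width: int) -> str:
    """Show a line the way an editor with the given tab width would."""
    return line.expandtabs(tab_width)


if __name__ == "__main__":
    line = "\tindented"
    # The same stored character produces different indentation on screen.
    print(repr(render(line, 4)))
    print(repr(render(line, 8)))
    # A tab advances to the next tab stop, not by a fixed number of blanks.
    print(repr(render("ab\tx", 4)))
```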

Text editors often insert "soft" line breaks automatically when the width of the window is not sufficient to display the entire line. Such "soft" line breaks may also be inserted when printing. These line breaks are not contained in the file itself and may fall at different positions when the text is output to another medium. Users often find them hard to distinguish from the actual "hard" line breaks, that is, the line breaks that users themselves have inserted, for example with the corresponding key, and that are stored in the file.
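
The distinction can be sketched in Python with the standard textwrap module. The function name soft_wrap is hypothetical: it splits the stored text at its hard line breaks and then wraps each line to a display width, without changing the stored text.

```python
import textwrap


def soft_wrap(text: str, width: int):
    """Return the display lines for the given window width."""
    display_lines = []
    for hard_line in text.split("\n"):
        # An empty hard line must still occupy one display line.
        display_lines.extend(textwrap.wrap(hard_line, width=width) or [""])
    return display_lines


if __name__ == "__main__":
    stored = "one two three four\nfive"
    print(soft_wrap(stored, 9))   # narrow window: extra soft breaks
    print(soft_wrap(stored, 40))  # wide window: only the hard break remains
    print(stored.count("\n"))     # the file still contains one line break
```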

