Russian text codec




















The second must be an integer and can be additional state info. The implementation should make sure that 0 is the most common additional state info. If this additional state info is 0 it must be possible to set the decoder to the state which has no input buffered and 0 as the additional state info, so that feeding the previously buffered input to the decoder returns it to the previous state without producing any output.
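The state round trip described above can be sketched with the standard incremental UTF-8 decoder. This is only an illustration: the choice of UTF-8 and of the character "Ж" (bytes D0 96) is an assumption made for the example, and the saved state is restored with setstate, the counterpart of getstate.

```python
import codecs

# Incremental UTF-8 decoder; "Ж" (U+0416) is the two-byte sequence D0 96.
decoder = codecs.getincrementaldecoder("utf-8")()

# Feed only the first byte: nothing can be decoded yet, so it is buffered.
partial = decoder.decode(b"\xd0")
buffered, extra = decoder.getstate()

# Finish the character, then rewind to the saved state and finish it again.
first = decoder.decode(b"\x96")
decoder.setstate((buffered, extra))
second = decoder.decode(b"\x96")

print(repr(partial), buffered, extra, first, second)
```

Because the additional state info is 0 here, the saved tuple is enough to put the decoder back exactly where it was.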

Set the state of the decoder to state. The StreamWriter and StreamReader classes provide generic working interfaces which can be used to implement new encoding submodules very easily. See encodings. The StreamWriter class is a subclass of Codec and defines the following methods which every stream writer must define in order to be compatible with the Python codec registry.

Constructor for a StreamWriter instance. All stream writers must provide this constructor interface. The stream argument must be a file-like object open for writing text or binary data, as appropriate for the specific codec.

The StreamWriter may implement different error handling schemes by providing the errors keyword argument. See Error Handlers for the standard error handlers the underlying stream codec may support. The errors argument is stored in an attribute of the same name; assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamWriter object.

The writelines method writes the concatenated list of strings to the stream, possibly by reusing the write method. The standard bytes-to-bytes codecs do not support this method. The reset method should ensure that the data on the output is put into a clean state that allows appending of new fresh data without having to rescan the whole stream to recover state.
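A minimal sketch of a stream writer in use, assuming the KOI8-R codec and an in-memory io.BytesIO stream (both chosen only for illustration). It also shows the error handling strategy being switched during the writer's lifetime.

```python
import codecs
import io

# Wrap a binary stream with the KOI8-R stream writer (codec choice is illustrative).
buf = io.BytesIO()
writer = codecs.getwriter("koi8-r")(buf, errors="strict")

writer.write("Привет")
writer.writelines([", ", "мир"])   # concatenated, then written like write()

# Switch the error handling strategy during the writer's lifetime.
writer.errors = "replace"
writer.write("€")                  # not representable in KOI8-R, becomes "?"

writer.reset()                     # leave the output in a clean state
data = buf.getvalue()
print(data)
```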

In addition to the above methods, the StreamWriter must also inherit all other methods and attributes from the underlying stream. The StreamReader class is a subclass of Codec and defines the following methods which every stream reader must define in order to be compatible with the Python codec registry.

Constructor for a StreamReader instance. All stream readers must provide this constructor interface. The stream argument must be a file-like object open for reading text or binary data, as appropriate for the specific codec. The StreamReader may implement different error handling schemes by providing the errors keyword argument. The errors argument is stored in an attribute of the same name; assigning to this attribute makes it possible to switch between different error handling strategies during the lifetime of the StreamReader object.

The chars argument indicates the number of decoded code points or bytes to return. The read method will never return more data than requested, but it might return less, if there is not enough available. The size argument indicates the approximate maximum number of encoded bytes or code points to read for decoding. The decoder can modify this setting as appropriate. The default value -1 indicates to read and decode as much as possible.

This parameter is intended to prevent having to decode huge files in one step. The firstline flag indicates that it would be sufficient to only return the first line, if there are decoding errors on later lines. The method should use a greedy read strategy, meaning that it should read as much data as is allowed within the definition of the encoding and the given size. Note that resetting the codec buffers should not reposition the stream.
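The chars and size arguments of read can be tried out on an in-memory stream. The sketch below assumes UTF-8 and a small io.BytesIO source, both chosen only for illustration.

```python
import codecs
import io

raw = io.BytesIO("Привет, мир\nвторая строка\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

# Ask for at most six decoded code points; size=-1 lets the reader
# pull as many encoded bytes as it likes from the underlying stream.
head = reader.read(size=-1, chars=6)

# Whatever was decoded but not returned stays buffered for the next call.
rest = reader.read()
print(head)
print(rest)
```

The second call returns the remaining decoded text without touching the position of the underlying stream explicitly; the reader keeps the surplus in its own buffer.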

The reset method is primarily intended to be able to recover from decoding errors. In addition to the above methods, the StreamReader must also inherit all other methods and attributes from the underlying stream.

The StreamReaderWriter is a convenience class that allows wrapping streams which work in both read and write modes. The design is such that one can use the factory functions returned by the lookup function to construct the instance. The constructor creates a StreamReaderWriter instance. Reader and Writer must be factory functions or classes providing the StreamReader and StreamWriter interface respectively.

Error handling is done in the same way as defined for the stream readers and writers. They inherit all other methods and attributes from the underlying stream.
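As a rough sketch, a StreamReaderWriter can be built from the factories returned by codecs.lookup. The cp1251 codec and the in-memory stream are assumptions made for the example.

```python
import codecs
import io

info = codecs.lookup("cp1251")   # codec choice is illustrative
stream = io.BytesIO()            # a stream that supports reading and writing

rw = codecs.StreamReaderWriter(stream, info.streamreader, info.streamwriter, "strict")

rw.write("Привет")   # encoded to cp1251 on the way in
rw.seek(0)           # seek is inherited from the stream and resets the codec state
text = rw.read()     # decoded from cp1251 on the way out
print(text)
```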

The StreamRecoder translates data from one encoding to another, which is sometimes useful when dealing with different encoding environments. Creates a StreamRecoder instance which implements a two-way conversion: encode and decode work on the frontend — the data visible to code calling read and write , while Reader and Writer work on the backend — the data in stream.
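A minimal sketch of the two-way conversion, assuming KOI8-R on the frontend and UTF-8 on the backend (both encodings are picked only for illustration). The arguments follow the order stream, encode, decode, Reader, Writer.

```python
import codecs
import io

front = codecs.lookup("koi8-r")   # what the calling code reads and writes
back = codecs.lookup("utf-8")     # what is actually stored in the stream

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend,
    front.encode, front.decode,             # frontend conversion
    back.streamreader, back.streamwriter,   # backend conversion
)

recoder.write("Привет".encode("koi8-r"))    # caller supplies KOI8-R bytes
stored = backend.getvalue()                 # the stream holds UTF-8 bytes

recoder.seek(0)
read_back = recoder.read()                  # caller gets KOI8-R bytes again
print(stored, read_back)
```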

The encode and decode arguments must adhere to the Codec interface. Reader and Writer must be factory functions or classes providing objects of the StreamReader and StreamWriter interface respectively. Strings are stored internally as sequences of code points in range 0x0 — 0x10FFFF. See PEP for more details about the implementation. Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue.

As with other codecs, serialising a string into a sequence of bytes is known as encoding, and recreating the string from the sequence of bytes is known as decoding. There are a variety of different text serialisation codecs, which are collectively referred to as text encodings. Single-byte charmap encodings can encode only 256 of the code points defined in Unicode. A simple and straightforward way to store every Unicode code point is to store each one as four consecutive bytes.

There are two possibilities: store the bytes in big endian or in little endian order. The disadvantage of a fixed byte order is that bytes must be swapped on every encode and decode whenever the machine's native order differs. UTF-32 avoids this problem: bytes will always be in natural endianness. When these bytes are read by a CPU with a different endianness, though, the bytes have to be swapped.
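The two byte orders can be compared directly. This sketch uses the character "Ж" (U+0416) as an arbitrary example; the plain utf-32 codec also prepends a byte order mark, which is why only its first four bytes are inspected.

```python
import codecs

text = "Ж"  # U+0416, chosen only as an example

big = text.encode("utf-32-be")      # most significant byte first
little = text.encode("utf-32-le")   # least significant byte first
native = text.encode("utf-32")      # byte order mark, then native order

print(big.hex(), little.hex(), native.hex())
```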

To make the byte order detectable, a byte sequence can begin with the byte order mark (BOM), the Unicode character U+FEFF. The byte swapped version of this character, 0xFFFE, is an illegal character that may not appear in a Unicode text. Each byte in a UTF-8 byte sequence consists of two parts: marker bits (the most significant bits) and payload bits.

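The marker bits are a sequence of zero to four 1 bits followed by a 0 bit. Unicode characters are encoded like this (with x being payload bits, which when concatenated give the Unicode character):

Range                     Encoding
U-00000000 … U-0000007F   0xxxxxxx
U-00000080 … U-000007FF   110xxxxx 10xxxxxx
U-00000800 … U-0000FFFF   1110xxxx 10xxxxxx 10xxxxxx
U-00010000 … U-0010FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Each charmap encoding can decode any random byte sequence, whereas UTF-8 byte sequences have a structure that does not allow arbitrary bytes. In the utf-8-sig variant, the BOM is therefore not used to determine the byte order used for generating the byte sequence, but as a signature that helps in guessing the encoding.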
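The marker and payload bits can be made visible by printing each encoded byte in binary. The helper name utf8_bits is hypothetical and exists only for this sketch.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

# One-, two-, three- and four-byte sequences.
for ch in ("A", "Ж", "€", "😀"):
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

In each printed sequence the first byte starts with as many 1 bits as there are bytes in the sequence (none for a single byte), and every continuation byte starts with 10.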

On encoding, the utf-8-sig codec will write 0xef, 0xbb, 0xbf as the first three bytes to the file. On decoding, utf-8-sig will skip those three bytes if they appear as the first three bytes in the file.
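A short sketch of the signature behaviour, using an arbitrary sample string. It compares utf-8-sig with plain utf-8, which leaves the signature in the decoded text as U+FEFF.

```python
import codecs

text = "Привет"  # sample text, chosen only for illustration

signed = text.encode("utf-8-sig")
has_signature = signed[:3] == codecs.BOM_UTF8   # the bytes EF BB BF

skipped = signed.decode("utf-8-sig")   # signature is skipped
kept = signed.decode("utf-8")          # signature survives as U+FEFF

print(has_signature, repr(skipped), repr(kept))
```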

Python comes with a number of codecs built-in, either implemented as C functions or with dictionaries as mapping tables. The following table lists the codecs by name, together with a few common aliases, and the languages for which the encoding is likely used. Neither the list of aliases nor the list of languages is meant to be exhaustive.

Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases. CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance.
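The alias rule can be checked with codecs.lookup, which resolves every spelling to the same codec. KOI8-R is used here only as an example encoding.

```python
import codecs

# Spellings that differ only in case or in hyphen versus underscore.
spellings = ["koi8-r", "KOI8-R", "koi8_r", "Koi8_R"]

names = {codecs.lookup(spelling).name for spelling in spellings}
utf8_name = codecs.lookup("UTF_8").name

print(names, utf8_name)
```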

These optimization opportunities are only recognized by CPython for a limited set of case insensitive aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.

Many of the character sets support the same languages. They vary in individual characters, and for the European languages in particular several variants typically exist. A number of predefined codecs are specific to Python, so their codec names have no meaning outside Python. These are listed in the tables below based on the expected input and output types (note that while text encodings are the most common use case for codecs, the underlying codec infrastructure supports arbitrary data transforms rather than just text encodings).

For asymmetric codecs, the stated meaning describes the encoding direction. The following codecs provide str to bytes encoding and bytes-like object to str decoding, similar to the Unicode text encodings.

Implement RFC , see also encodings. Implement RFC Stateful codecs are not supported. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.

Decode from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default. The following codecs provide binary transforms: bytes-like object to bytes mappings. They are not supported by bytes. The following codec provides a text transform: a str to str mapping. It is not supported by str. It builds upon the punycode encoding and stringprep. This conversion is carried out in the application; if possible invisible to the user: The application should transparently convert Unicode domain labels to IDNA on the wire, and convert back ACE labels to Unicode before presenting them to the user.
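The binary and text transforms mentioned above are reached through codecs.encode and codecs.decode rather than through the bytes and str methods. The sketch below assumes the hex and rot13 codecs as representatives of the two groups.

```python
import codecs

payload = b"codec"

# Binary transform: bytes-like object to bytes.
hexed = codecs.encode(payload, "hex")
restored = codecs.decode(hexed, "hex")

# Text transform: str to str.
rotated = codecs.encode("Codec", "rot13")

# The bytes method refuses codecs that are not text encodings.
try:
    payload.decode("hex")
    rejected = False
except LookupError:
    rejected = True

print(hexed, restored, rotated, rejected)
```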

Python supports this conversion in several ways: the idna codec performs conversion between Unicode and ACE, separating an input string into labels based on the separator characters defined in section 3.
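A small sketch of the idna codec converting between Unicode and ACE. The host name is a made-up example, not a reference to a real site.

```python
# Hypothetical Cyrillic host name, used only to illustrate the conversion.
host = "пример.рф"

ace = host.encode("idna")                 # Unicode labels to ACE labels
labels = ace.split(b".")                  # one ACE label per original label
back = ace.decode("idna")                 # ACE labels back to Unicode

print(ace, back)
```

Each non-ASCII label is converted separately, so the dot separators survive unchanged and every converted label carries the ACE prefix.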

Furthermore, the socket module transparently converts Unicode host names to ACE, so that applications need not be concerned about converting host names themselves when they pass them to the socket module. On top of that, modules that have host names as function parameters, such as http. When receiving host names from the wire such as in reverse name lookup , no automatic conversion to Unicode is performed: applications wishing to present such host names to the user should decode them to Unicode.

The module encodings. The nameprep functions can be used directly if desired. Return the nameprepped version of label. The implementation currently assumes query strings, so AllowUnassigned is true. Convert a label to Unicode, as specified in RFC Availability : Windows only. This module implements a variant of the UTF-8 codec.

For the stateful encoder this is only done once on the first write to the byte stream. The module defines the following functions for encoding and decoding with any codec: codecs.

The full details for each codec can also be looked up directly: codecs. To simplify access to the various codec components, the module provides these additional functions which use lookup for the codec lookup: codecs.
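As an illustration, codecs.lookup returns an object that bundles the codec components, and helper functions such as codecs.getdecoder fetch a single component through the same lookup. The cp1251 codec is an arbitrary choice for this sketch.

```python
import codecs

info = codecs.lookup("cp1251")   # codec choice is illustrative

# Stateless encode and decode functions return (output, length consumed).
encoded, consumed = info.encode("Привет")
decoded, used = codecs.getdecoder("cp1251")(encoded)

# Incremental variants are obtained as factories and then instantiated.
incremental = codecs.getincrementalencoder("cp1251")()
piecewise = incremental.encode("При") + incremental.encode("вет", final=True)

print(encoded, consumed, decoded, used, piecewise)
```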

Custom codecs are made available by registering a suitable codec search function: codecs. Note: Underlying encoded files are always opened in binary mode.

The following string values are defined and implemented by all standard Python codecs:

Value      Meaning
'strict'   Raise UnicodeError (or a subclass); this is the default.

In addition, the following error handler is specific to the given codecs:

Value            Codecs                                                              Meaning
'surrogatepass'  utf-8, utf-16, utf-32, utf-16-be, utf-16-le, utf-32-be, utf-32-le   Allow encoding and decoding of surrogate codes.
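The effect of the error handlers can be sketched as follows. Besides 'strict' and 'surrogatepass', the example uses 'replace', another standard handler, purely for contrast; the sample text is arbitrary.

```python
text = "Привет"

# 'strict' (the default) raises a UnicodeError subclass on unencodable input.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeError:
    strict_failed = True

# 'replace' substitutes a replacement marker instead of raising.
replaced = text.encode("ascii", errors="replace")

# 'surrogatepass' lets a lone surrogate through the UTF-8 codec.
surrogate = "\ud800".encode("utf-8", errors="surrogatepass")
round_trip = surrogate.decode("utf-8", errors="surrogatepass")

print(strict_failed, replaced, surrogate)
```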

If keepends is false, line endings are stripped from the lines returned. You can use these objects to do transparent transcodings.

The stream argument must be a file-like object.

Support for new text encodings can be added to Qt by creating QTextCodec subclasses. The pure virtual functions describe the encoder to the system and the coder is used as required in the different text file formats supported by QTextStream, and under X11, for the locale-specific character input and output.

To add support for another encoding to Qt, make a subclass of QTextCodec and implement the functions listed in the table below. You may find it more convenient to make your codec class available as a plugin; see How to Create Qt Plugins for details. It stores an OR combination of ConversionFlag values. Constructs a QTextCodec , and gives it the highest precedence.

The QTextCodec should always be constructed on the heap. Qt takes ownership and will delete it when the application terminates. Destroys the QTextCodec. Note that you should not delete codecs yourself: once created they become Qt's responsibility.

Standard aliases for codecs can be found in the IANA character-sets encoding file. Returns the list of all available codecs, by name. See also availableMibs , name , and aliases. Returns the list of MIBs for all available codecs. See also availableCodecs and mibEnum. Returns true if the Unicode character ch can be fully encoded with this codec; otherwise returns false.

If this function returns 0 (the default), QString assumes Latin-1. Tries to detect the encoding of the provided snippet of HTML in the given byte array, ba, by checking the BOM (Byte Order Mark) and the content-type meta header, and returns a QTextCodec instance that is capable of decoding the HTML to Unicode.

If the codec cannot be detected from the content provided, defaultCodec is returned. If the codec cannot be detected, this overload returns a Latin-1 QTextCodec. On Windows, the codec will be based on a system locale. On Unix systems, starting with Qt 4.

Note that in both cases the codec's name will be "System". Searches all installed QTextCodec objects and returns the one which best matches name ; the match is case-insensitive. Returns 0 if no codec matching the name name could be found.

Returns the codec used by QObject::tr on its argument. If this function returns 0 (the default), tr assumes Latin-1. Tries to detect the encoding of the provided snippet ba by using the BOM (Byte Order Mark) and returns a QTextCodec instance that is capable of decoding the text to Unicode. Converts the first number of characters from the input array from Unicode to the encoding of the subclass, and returns the result in a QByteArray.

If state is not 0, the codec should save the state after the conversion in state , and adjust the remainingChars and invalidChars members of the struct. Converts the first len characters of chars from the encoding of the subclass to Unicode, and returns the result in a QString. Converts str from Unicode to the encoding of this codec, and returns the result in a QByteArray. Converts the first number of characters from the input array from Unicode to the encoding of this codec, and returns the result in a QByteArray.

Subclasses of QTextCodec must reimplement this function. It is important that each QTextCodec subclass returns the correct unique value for this function.

QTextCodec subclasses must reimplement this function. It returns the name of the encoding supported by the subclass. If the codec is registered as a character set in the IANA character-sets encoding file, this method should return the preferred MIME name for the codec if defined, otherwise its name. If the codec is 0 (the default), QString assumes Latin-1. To avoid undesirable side-effects, we recommend avoiding such codecs with setCodecsForCString. Set the codec to c; this will be returned by codecForLocale.

If c is a null pointer, the codec is reset to the default. This might be needed for some applications that want to use their own mechanism for setting the locale.


