Detecting at a low level whether a file is UTF-8 in Java

Although it seems a very simple task at first, anyone who has had to detect whether a file is UTF-8 will have realized that the subject is not as obvious as it looks.

First, a theoretical introduction

Files are stored as arrays of bytes that are later mapped to characters. Different encodings (ASCII, ISO-8859-1, UTF-8, etc.) define this mapping.

Unicode was created to assign a numeric code to every character used by any language in the world. It is nothing more than a gigantic table that associates numeric codes with characters so that they can be represented on a computer.

In this context, UTF-8 is simply a way of encoding Unicode text to allow it to be serialized in files or data streams.

Since Unicode assigns codes to far more than the 256 values that fit in a single byte, more than one byte is needed to encode them, so UTF-8 uses a variable-length structure of 1 to 4 bytes per character.
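To make the variable length concrete, the following is a small sketch that encodes one character from each size class and counts the resulting bytes. The class and method names (`Utf8LengthDemo`, `utf8Length`) are made up for this illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): counts the bytes UTF-8 needs for a piece of text.
final class Utf8LengthDemo {

    // Returns the number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Characters are written as escapes so the encoding of this source file does not matter.
        String[] samples = {
            "A",                                    // U+0041, plain ASCII
            "\u00F1",                               // U+00F1, n with tilde
            "\u20AC",                               // U+20AC, euro sign
            new String(Character.toChars(0x1F600))  // U+1F600, outside the Basic Multilingual Plane
        };
        for (String s : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", s.codePointAt(0), utf8Length(s));
        }
    }
}
```

Run as is, the four samples should report 1, 2, 3 and 4 bytes respectively.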

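This variable size is also why files saved in one encoding are sometimes displayed with strange characters when they are read using the wrong one: a multi-byte UTF-8 sequence read with a single-byte encoding such as ISO-8859-1 shows up as several unrelated characters.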
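The effect can be reproduced in a few lines. This is only a sketch; `WrongCharsetDemo` and `misread` are hypothetical names chosen for the example. It encodes a string as UTF-8 and then decodes the same bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): reads UTF-8 bytes with the wrong charset.
final class WrongCharsetDemo {

    // Encodes the text as UTF-8, then decodes those bytes as if they were ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00F1"; // U+00F1, stored in UTF-8 as the two bytes 0xC3 0xB1
        String garbled = misread(original);
        // Print code points instead of glyphs so the console encoding does not affect the result.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The single character U+00F1 comes back as two characters, U+00C3 and U+00B1, one for each byte of its UTF-8 form.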

Algorithm Approach

The process is very simple: read the file byte by byte and check that every byte complies with the UTF-8 standard.

  • If the byte read is less than or equal to 0111 1111 (0x7F), that is, it matches the mask 0xxxxxxx, it is a valid byte. In this case the byte represents a UTF-8 character on its own (1 byte).
  • If the byte read matches the mask 110xxxxx, I check that the next byte matches the mask 10xxxxxx. In this case the two bytes read form the UTF-8 character.
  • Similarly, it is possible to detect 3-byte characters (1110xxxx followed by two 10xxxxxx bytes) and 4-byte characters (11110xxx followed by three 10xxxxxx bytes).
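The mask checks in the list above can be sketched as two small helpers. The names `Utf8LeadByte`, `sequenceLength` and `isContinuation` are assumptions made for this example, not part of any library.

```java
// Illustrative helpers (hypothetical names) for the bit masks described above.
final class Utf8LeadByte {

    // Returns the total length (1 to 4) of the sequence announced by a lead byte,
    // or -1 if the byte cannot start a UTF-8 sequence.
    static int sequenceLength(int b) {
        b &= 0xFF;                          // treat the value as an unsigned byte
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx
        return -1;                          // 10xxxxxx (continuation) or 11111xxx
    }

    // Returns true if the byte matches the continuation mask 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }
}
```

Masking with `0xFF` matters because Java bytes are signed: a value such as `(byte) 0xC3` is negative until it is reduced to its low 8 bits.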

If any of these conditions is not met at any point while processing the file, the file is not UTF-8; otherwise, it has a UTF-8 compatible encoding.

Although I am sure that there are much more efficient implementations in Java, I did not find anything after some searches on the Internet, so I decided to write my own validator.

Code snippet to detect if a file is UTF-8 encoded
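A minimal sketch of such a validator, written from the steps described above rather than taken from the original code, might look like this. The names `Utf8FileCheck` and `isUtf8` are hypothetical. The sketch assumes the file is small enough to be read into memory in one go, and it only performs the bit-mask checks from the algorithm: stricter rules of the UTF-8 standard (overlong forms, surrogates, code points above U+10FFFF) are not verified.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative validator (hypothetical name): structural UTF-8 check based on bit masks only.
final class Utf8FileCheck {

    // Reads the whole file into memory (assumption: it fits) and checks its bytes.
    static boolean isUtf8(Path file) throws IOException {
        return isUtf8(Files.readAllBytes(file));
    }

    // Walks the bytes one by one and checks lead bytes and continuation bytes.
    static boolean isUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int lead = data[i++] & 0xFF;    // unsigned value of the lead byte
            int continuationBytes;
            if ((lead & 0x80) == 0x00) {
                continuationBytes = 0;      // 0xxxxxxx: 1-byte character
            } else if ((lead & 0xE0) == 0xC0) {
                continuationBytes = 1;      // 110xxxxx: 2-byte character
            } else if ((lead & 0xF0) == 0xE0) {
                continuationBytes = 2;      // 1110xxxx: 3-byte character
            } else if ((lead & 0xF8) == 0xF0) {
                continuationBytes = 3;      // 11110xxx: 4-byte character
            } else {
                return false;               // stray continuation byte or invalid lead byte
            }
            for (int k = 0; k < continuationBytes; k++) {
                // Every following byte must exist and match 10xxxxxx.
                if (i >= data.length || (data[i++] & 0xC0) != 0x80) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8FileCheck <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println(file + (isUtf8(file) ? " has a UTF-8 compatible encoding" : " is not UTF-8"));
    }
}
```

Note that a pure ASCII file, and an empty one, pass this check as well, since every byte below 0x80 is a valid 1-byte UTF-8 character.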
