{"id":37690,"date":"2007-12-20T08:36:21","date_gmt":"2007-12-20T08:36:21","guid":{"rendered":"https:\/\/www.viafirma.com\/detect-at-a-low-level-if-a-file-is-utf-8-from-java\/"},"modified":"2026-01-23T12:05:31","modified_gmt":"2026-01-23T12:05:31","slug":"detect-at-a-low-level-if-a-file-is-utf-8-from-java","status":"publish","type":"post","link":"https:\/\/www.viafirma.com\/en\/detect-at-a-low-level-if-a-file-is-utf-8-from-java\/","title":{"rendered":"Detect at a low level if a file is UTF-8 from Java"},"content":{"rendered":"<p>Although initially it seems a very simple task, those who have had to detect if a file is UTF-8 will have realized that the subject is not as obvious as it seems.<\/p>\n<p><strong>First a theoretical introduction<\/strong><\/p>\n<p>Files are stored as arrays of bytes that are later mapped to characters. Different encodings (ASCII, ISO-8859-1, UTF-8, etc.) are used to make this mapping.<\/p>\n<p>Unicode was created to assign a numeric code to every character used by any language in the world. It is nothing more than a gigantic table that associates numeric codes with characters so that they can be represented on a computer.<\/p>\n<p>In this context, UTF-8 is simply a way of encoding Unicode text so that it can be serialized in files or data streams.<\/p>\n<p>Since Unicode tries to assign a code to every character, one byte is not enough to encode them all, so UTF-8 uses a variable-length structure of 1 to 4 bytes per character.<\/p>\n<p>This variable size is why files saved in one encoding are sometimes displayed with strange characters when they are read using the wrong encoding.<\/p>\n<p><strong>Algorithm Approach<\/strong><\/p>\n<p>The process is very simple: read the file byte by byte and check that every byte sequence complies with the UTF-8 standard.<\/p>\n<ul>\n<li>If the byte read is less than or equal to 0111 1111 (0x7F), that is, it matches the mask 0xxxxxxx, it is a valid byte. 
In this case the byte represents a 1-byte UTF-8 character.<\/li>\n<li>If the byte read matches the mask 110xxxxx, I check that the next byte matches the mask 10xxxxxx. In this case the two bytes read form one UTF-8 character.<\/li>\n<li>Similarly, 3-byte and 4-byte characters are detected with the lead-byte masks 1110xxxx and 11110xxx, followed by two and three bytes matching 10xxxxxx respectively.<\/li>\n<\/ul>\n<p>If any of these conditions is not met at any point while processing the file, the file is not UTF-8; otherwise its encoding is UTF-8 compatible.<\/p>\n<p>Although I am sure that there are much more efficient implementations in Java, after some searches on the Internet I did not find anything, so I wrote my own validator.<\/p>\n<h2><a title=\"Direct link to file\" href=\"https:\/\/www.viafirma.com\/wp-content\/uploads\/2007\/12\/testencodingjava1.txt\">Code snippet to detect if a file is UTF-8 encoded<\/a><\/h2>\n","protected":false},"excerpt":{"rendered":"Although initially it seems a very simple task, those 
who...","protected":false},"author":1,"featured_media":35648,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"default","adv-header-id-meta":"","stick-header-meta":"default","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"set","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[157],"class_list":["post-37690","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-others"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/posts\/37690","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/comments?post=37690"}],"version-history":[{"count":2,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/posts\/37690\/revisions"}],"predecessor-version":[{"id":97660,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/posts\/37690\/revisions\/97660"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/media\/35648"}],"wp:attachment":[{"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/media?parent=37690"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.viafirma.com\/en\/wp-json\/wp\/v2\/categories?post=37690"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}