Unicode Normalization

Unicode is a standard for representing characters from the world's writing systems in one machine-readable format. Files.com uses Unicode when working with text. Because the same visible text can be encoded in more than one way in Unicode, and a path must identify exactly one file or folder, Files.com follows a specific pattern for normalizing Unicode values in paths.

Files.com uses the NFKC (Normalization Form Compatibility Composition) algorithm for normalizing Unicode as part of path comparison.

Files.com is Unicode-preserving: the path name is stored using the actual Unicode representation used when the file or folder is first created, even though path comparison runs against the normalized form.
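
As an illustration of why comparison runs against a normalized form, the short Python sketch below uses the standard library's `unicodedata` module to show two different Unicode representations of the same visible file name becoming equal under NFKC. The file names and the `nfkc` helper are hypothetical and only demonstrate NFKC itself, not the full Files.com comparison logic.

```python
import unicodedata

# Two ways to write the same visible name (hypothetical example paths).
composed = "caf\u00e9.txt"      # "é" as one code point (U+00E9)
decomposed = "cafe\u0301.txt"   # "e" followed by a combining acute accent (U+0301)


def nfkc(path: str) -> str:
    """Apply Unicode NFKC normalization to a path string."""
    return unicodedata.normalize("NFKC", path)


print(composed == decomposed)              # False: the raw strings differ
print(nfkc(composed) == nfkc(decomposed))  # True: equal after NFKC
print(nfkc("\ufb01le.txt"))                # file.txt (the "fi" ligature U+FB01 is expanded)
```

The stored name keeps whichever representation was used at creation time; only the comparison sees the normalized form.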

Exact Algorithm For Path Normalization

Files.com uses two algorithms for path normalization. The Normalize algorithm is applied to every path provided to the Files.com service so that it complies with our path requirements. If you are building an SDK or a custom API integration for Files.com, apply this algorithm before sending paths to the Files.com API so that the server interprets each path exactly as you sent it.

The Normalize For Comparison algorithm is used to compare two paths to determine whether they are the same. If your SDK or API integration needs to determine whether two file paths are the same, implement this algorithm as well.

The official Files.com SDKs implement both algorithms natively. We recommend using the SDKs rather than implementing either algorithm by hand. The algorithms are described below, and sample code is available in the SDKs.

Normalize Algorithm

  • Convert the path to UTF-8
  • Remove any characters with byte value of 0
  • Convert any backslash \ characters to a forward slash /
  • Remove any trailing or leading slashes
  • Remove any path parts that are . or ..
  • Replace any duplicate forward slashes (such as ///) with a single forward slash /
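
The steps above can be sketched in Python roughly as follows. This is a minimal, unofficial sketch and not the code used by the Files.com SDKs. The function name `normalize` is hypothetical, byte input is assumed to already be UTF-8, and the sketch reads "remove any path parts that are . or .." literally: such parts are dropped, not resolved against their parent folder. It also assumes the result should never begin or end with a slash.

```python
def normalize(path) -> str:
    """Unofficial sketch of the Normalize steps listed above."""
    # Assumption: byte input is UTF-8; undecodable bytes are replaced.
    if isinstance(path, bytes):
        path = path.decode("utf-8", errors="replace")

    # Remove NUL characters (byte value 0).
    path = path.replace("\x00", "")

    # Convert backslashes to forward slashes.
    path = path.replace("\\", "/")

    # Splitting on "/" and discarding empty parts removes leading, trailing,
    # and duplicate slashes in one pass; "." and ".." parts are dropped too.
    parts = [part for part in path.split("/") if part not in ("", ".", "..")]
    return "/".join(parts)


print(normalize("\\Folder\\\\sub/./file.txt/"))  # Folder/sub/file.txt
print(normalize("a/../b"))                       # a/b
```

Note that `a/../b` becomes `a/b` under this literal reading, which differs from how a shell would resolve the same path.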

Normalize For Comparison Algorithm

  • Run the path through the Normalize Algorithm
  • Unicode Normalize the path using the Unicode NFKC algorithm
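  • Transliterate and remove accent marks using the official Files.com transliteration map below. The map is a comma-separated list of entries; in each entry, the first character is the character to replace and the remaining characters are its replacement (for example, the entry ÆAE replaces Æ with AE).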
  • Convert the path to lowercase using the case mapping found in Unicode 9.0. This version is older than the Unicode 15.0 implemented by many modern programming languages, but the only differences affect two very rare languages and have never caused issues in practice at Files.com. Use whichever version of Unicode your environment supports.
  • Remove any trailing whitespace (\r, \n, \t, or the space " " character)

Any two paths with the same resulting string from this algorithm are considered the same file on Files.com.

TRANSLITERATION_MAP = "ÀA,ÁA,ÂA,ÃA,ÄA,ÅA,ÆAE,ÇC,ÈE,ÉE,ÊE,ËE,ÌI,ÍI,ÎI,ÏI,ÐD,ÑN,ÒO,ÓO,ÔO,ÕO,ÖO,ØO,ÙU,ÚU,ÛU,ÜU,ÝY,ßss,àa,áa,âa,ãa,äa,åa,æae,çc,èe,ée,êe,ëe,ìi,íi,îi,ïi,ðd,ñn,òo,óo,ôo,õo,öo,øo,ùu,úu,ûu,üu,ýy,ÿy,ĀA,āa,ĂA,ăa,ĄA,ąa,ĆC,ćc,ĈC,ĉc,ĊC,ċc,ČC,čc,ĎD,ďd,ĐD,đd,ĒE,ēe,ĔE,ĕe,ĖE,ėe,ĘE,ęe,ĚE,ěe,ĜG,ĝg,ĞG,ğg,ĠG,ġg,ĢG,ģg,ĤH,ĥh,ĦH,ħh,ĨI,ĩi,ĪI,īi,ĬI,ĭi,ĮI,įi,İI,ĲIJ,ĳij,ĴJ,ĵj,ĶK,ķk,ĹL,ĺl,ĻL,ļl,ĽL,ľl,ŁL,łl,ŃN,ńn,ŅN,ņn,ŇN,ňn,ŉ'n,ŌO,ōo,ŎO,ŏo,ŐO,őo,ŒOE,œoe,ŔR,ŕr,ŖR,ŗr,ŘR,řr,ŚS,śs,ŜS,ŝs,ŞS,şs,ŠS,šs,ŢT,ţt,ŤT,ťt,ŨU,ũu,ŪU,ūu,ŬU,ŭu,ŮU,ůu,ŰU,űu,ŲU,ųu,ŴW,ŵw,ŶY,ŷy,ŸY,ŹZ,źz,ŻZ,żz,ŽZ,žz"
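
The Normalize For Comparison steps could be sketched in Python along the following lines. This is an unofficial illustration, not the SDK implementation. The names `normalize`, `normalize_for_comparison`, `TRANSLITERATION_MAP_EXCERPT`, and `TRANSLIT_TABLE` are hypothetical, and the map here is only a short excerpt of the full map above; a real integration would use the complete string. Lowercasing relies on Python's built-in `str.lower()`, which follows whatever Unicode version the interpreter ships with.

```python
import unicodedata

# Excerpt only, for illustration; substitute the full TRANSLITERATION_MAP.
TRANSLITERATION_MAP_EXCERPT = "ÄA,ÆAE,ÉE,ÑN,ÖO,ÜU,ßss,äa,æae,ée,ñn,öo,üu,łl"

# First character of each entry maps to the remaining characters.
TRANSLIT_TABLE = str.maketrans(
    {entry[0]: entry[1:] for entry in TRANSLITERATION_MAP_EXCERPT.split(",")}
)


def normalize(path: str) -> str:
    """Literal reading of the Normalize steps (assumes str input)."""
    path = path.replace("\x00", "").replace("\\", "/")
    parts = [part for part in path.split("/") if part not in ("", ".", "..")]
    return "/".join(parts)


def normalize_for_comparison(path: str) -> str:
    """Unofficial sketch of the Normalize For Comparison steps."""
    path = normalize(path)                      # 1. Normalize algorithm
    path = unicodedata.normalize("NFKC", path)  # 2. Unicode NFKC
    path = path.translate(TRANSLIT_TABLE)       # 3. transliterate
    path = path.lower()                         # 4. lowercase
    return path.rstrip("\r\n\t ")               # 5. trim trailing whitespace


a = normalize_for_comparison("/Docs/R\u00c9SUM\u00c9.txt")
b = normalize_for_comparison("docs\\re\u0301sume\u0301.TXT ")
print(a)       # docs/resume.txt
print(a == b)  # True
```

Because NFKC runs before transliteration, a decomposed "e" plus combining accent is first composed into a single character and only then matched against the map.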