feat: change all narrow filesystem paths and operations to speak UTF-8 #59

zao · 2024-04-19T11:40:45Z

Motivation

Historically, Path of Building's Lua code, SimpleGraphic and LuaJIT have used narrow byte-based paths encoded in the user's codepage, typically a single-byte Western one with ASCII in the low half and various Latin-adjacent characters in the high half.

As we predominantly live in the user's profile directory, with the default user/build path in Documents and the preferred program installation path in Application Data, the user's chosen username often appears in the absolute build and script paths. This happens to work somewhat OK if the username can be expressed in the user's codepage, but if it can't, things break miserably without much in the way of meaningful workarounds.

A recent complicating factor is that OneDrive is transparently enabled on more and more computers that are set up with Microsoft accounts out of the box. When Documents is managed under OneDrive, the true location for Documents is in places like C:\Users\username\OneDrive\Документи with the folder name using the user's human language term. The fallout and individual workarounds can be seen in issues like PathOfBuildingCommunity/PathOfBuilding#7431.

While newer Windows 10 and 11 builds have opt-in support for using UTF-8 as the locale, we can't realistically have all affected users configure that knob with indeterminate system impact. It would also not put a stop to the stream of issues filed about the problem.

Changes

This PR aims to fix this through a number of deep modifications to the runtime, together with some changes to the Lua code in the main project in PathOfBuildingCommunity/PathOfBuilding#7586.
The Launcher also has a PR in PathOfBuildingCommunity/PathOfBuilding-Launcher#6 to pass Unicode paths on the command line instead of ACP paths.

The individual commits have extensive essays on what they change and why, but in short here we:

- change file search (NewFileSearch) to accept and yield UTF-8 encoded paths and wildcard patterns;
- add a compiled Lua extension module starwing/luautf8 usable as lua-utf8 to perform UTF-8 string operations in our Lua;
- expose UTF-8 paths in GetScriptPath and GetUserPath, resolving any symbolic links along the way;
- enhance the text renderer to represent Unicode codepoints outside of what the font supports with a "tofu" representation where the codepoint ordinal is expressed as [U+1234] in a slightly smaller font size;
- enhance the text measurement and caret positioning code to understand UTF-8 and codepoints, always yielding byte offsets at codepoint boundaries;
- make all internal SimpleGraphic I/O use std::filesystem::path and Unicode when operating on the filesystem;
- patch LuaJIT itself to load DLLs with Unicode paths, do filesystem operations with Unicode paths, open files with wide paths, etc.

It also does some slightly related cleanup like:

- remove unused GIF and BLP file format support.
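The caret positioning item in the list of changes says that byte offsets are always yielded at codepoint boundaries. Below is a minimal sketch of how such snapping could work, assuming well-formed UTF-8. `SnapToCodepointStart` and `IsContinuationByte` are hypothetical names for illustration, not functions taken from SimpleGraphic.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool IsContinuationByte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: move a byte offset backwards until it rests on the
// first byte of a codepoint. Offsets past the end clamp to the string length.
inline std::size_t SnapToCodepointStart(std::string_view text, std::size_t offset) {
    if (offset >= text.size()) {
        return text.size();
    }
    while (offset > 0 && IsContinuationByte(static_cast<unsigned char>(text[offset]))) {
        --offset;
    }
    return offset;
}
```

A caret that would land in the middle of a multi-byte sequence is pulled back to the lead byte, so Lua code that slices the string at that offset never splits a codepoint.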

Some commits in the PR are later backed out because later changes superseded them. One example is the monkey-patching of io.open/os.remove/os.rename, which made those functions interpret given paths as UTF-8 and call wide CRT functions. The same thing is now done by more extensive patches to LuaJIT itself.

Of note is that the updater has not been touched and still uses the same old non-JIT Lua interpreter with the user codepage narrow functions. It can't be updated but it mostly works if you take care to only feed it relative paths.

On Windows the new OpenFile function opens a placeholder file with the standard `io.open` function to get a proper file object, then replaces the internal C file pointer with the actual file opened through `_wfopen`.
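As a rough illustration of the wide-open half of that approach, here is a sketch of getting a `FILE*` from a UTF-8 encoded path. `OpenUtf8File` and `WidenUtf8` are hypothetical names. Swapping the pointer into a Lua file object, as the commit describes, is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Widen a UTF-8 string to UTF-16; returns an empty string on invalid input.
static std::wstring WidenUtf8(const std::string& utf8) {
    if (utf8.empty()) {
        return std::wstring();
    }
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(),
                                        static_cast<int>(utf8.size()), nullptr, 0);
    if (len <= 0) {
        return std::wstring();
    }
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}
#endif

// Hypothetical helper: open a file whose path and mode are UTF-8 encoded.
// Returns nullptr on failure, like fopen.
std::FILE* OpenUtf8File(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    const std::wstring wpath = WidenUtf8(path);
    const std::wstring wmode = WidenUtf8(mode);
    if (wpath.empty() || wmode.empty()) {
        return nullptr;
    }
    return _wfopen(wpath.c_str(), wmode.c_str());
#else
    // POSIX systems take the bytes as-is, which in practice means UTF-8.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```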

Previously the file search objects exposed to Lua accepted and generated paths encoded in the user's narrow codepage. As a move toward UTF-8 awareness for paths and strings on the Lua side, we now instead accept and generate UTF-8 strings. This should be robust against legacy paths represented in the user's codepage; the only source of such paths should be ones historically persisted in locations like `Settings.xml`.

By squirreling away the original `io.open` function from Lua and replacing the entry in the interpreter with our own wrapper that interprets the path as UTF-8, we can now open files in locations that cannot be represented in the user's codepage, as long as the path is supplied as UTF-8.

Lua itself is very encoding-agnostic, assuming that all strings are soups of bytes and that paths are encoded in whatever the platform's `FILE*` functions accept. On Linux and macOS that's fairly friendly: path encoding used to depend on the locale but is these days de facto standardized on UTF-8. On Windows the story is worse: narrow paths are assumed to be in the user's current codepage, which can only express a fixed subset of Unicode and which breaks if paths from the filesystem cannot be encoded in it. On Windows with NTFS the underlying filesystem is UTF-16LE and supports all of Unicode, so standardizing on UTF-8 is good for consistency and future work, as Lua is very byte-oriented when it comes to text and paths.

The luautf8 library provides analogues of string functions that operate on UTF-8 strings instead of byte soups, allowing for intelligent editing and iterating through such strings, needed for things like Unicode paths and other internationalization.

Store the user path and base path as canonicalized std::filesystem paths and expect the Lua side to be aware enough of UTF-8 to manipulate them properly. This together with the monkeypatched UTF-8 `io.open` and functions like MakeDir and RemoveDir allows for use of a wider variety of installation and user profile locations.

As we need the ability to display UTF-8 encoded text in Lua code from here on, this change alters text rendering to transcode human strings to UTF-32 codepoints, keeping an association between original byte offset and codepoint index. This allows for font rendering that can intelligently translate unhandled codepoints to tofu symbols, for now represented as a bracketed codepoint ordinal in a smaller font like `[U+0123]`. This slightly misaligns mouse selection and navigation of text areas if they're in "scaled" font heights in-between or outside the "crisp" font sizes available in the runtime. The Lua code predominantly uses crisp fonts; a survey only found a few scaled/zero/negative heights, all of them on empty strings. The handling of zero-height and negative-height strings has also been made more robust, fixing a division by zero and an array overflow.
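The transcoding step described above can be sketched as follows. This is a simplified illustration and not the renderer's actual code: `DecodeUtf8`, `DecodedCodepoint` and `TofuLabel` are hypothetical names, malformed bytes are mapped to U+FFFD, and overlong or surrogate encodings are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

struct DecodedCodepoint {
    char32_t codepoint;      // decoded value, U+FFFD for malformed input
    std::size_t byteOffset;  // offset of the first byte in the source string
};

// Decode UTF-8 into codepoints while remembering where each one started.
std::vector<DecodedCodepoint> DecodeUtf8(std::string_view text) {
    std::vector<DecodedCodepoint> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 0;  // 0 marks an invalid lead byte
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= text.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(text[i + k]);
            if ((c & 0xC0) != 0x80) {
                ok = false;  // expected a continuation byte
            } else {
                cp = (cp << 6) | (c & 0x3F);
            }
        }
        if (!ok) {
            cp = 0xFFFD;  // replacement character; consume one byte and resync
            len = 1;
        }
        out.push_back({cp, i});
        i += len;
    }
    return out;
}

// Format the bracketed ordinal used for codepoints the font cannot draw.
std::string TofuLabel(char32_t cp) {
    assert(cp <= 0x10FFFF);
    char buf[16];
    std::snprintf(buf, sizeof buf, "[U+%04X]", static_cast<unsigned>(cp));
    return buf;
}
```

Keeping `byteOffset` next to each codepoint is what lets measurement and caret code report positions that Lua can use directly as string indices.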

The LuaJIT implementations of `os.rename` and `os.remove` use narrow functions in the system encoding. This change redirects the functions to variants that, on Windows, widen UTF-8 strings from Lua and call the wide CRT versions of the functions to properly rename and remove files.

No current code seems to use GIF or BLP anymore; removing the code cuts down the amount of surface area we have.

This change lays the groundwork for widening the rest of the runtime to use `std::filesystem::path` and tries to detect misuse where narrow strings are used as-is. Dependent code in the codebase will fail to build with this change.
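One way to make narrow-string misuse fail at build time is to delete the narrow overloads. The sketch below shows that general technique only; whether the commit uses this exact mechanism is an assumption, and `FileInputStream` here is a hypothetical stand-in class.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <type_traits>

// Hypothetical stream type that only accepts real filesystem paths.
class FileInputStream {
public:
    explicit FileInputStream(const std::filesystem::path& path) : path_(path) {}

    // Narrow strings have no known encoding, so refuse them outright.
    FileInputStream(const char*) = delete;
    FileInputStream(const std::string&) = delete;

    const std::filesystem::path& Path() const { return path_; }

private:
    std::filesystem::path path_;
};

// Compile-time checks: paths are accepted, narrow strings are rejected.
static_assert(std::is_constructible_v<FileInputStream, std::filesystem::path>);
static_assert(!std::is_constructible_v<FileInputStream, const char*>);
static_assert(!std::is_constructible_v<FileInputStream, std::string>);
```

With this shape, a call site that still passes a `char*` or `std::string` stops compiling until it is converted to a path with an explicit, encoding-aware step.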

Script paths are now Unicode and canonicalized to the full path, removing any extraneous "./" and "../" segments. Such segments typically arise from how the program is launched during development. Most API functions that use these paths are updated to use the new path types.
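A minimal sketch of this kind of canonicalization using the standard library. `CanonicalizeScriptPath` is a hypothetical name, and the use of `weakly_canonical` (which also tolerates paths that don't fully exist yet) is an assumption rather than a statement of what the commit calls.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: produce an absolute path without "." or ".." segments,
// resolving symbolic links for the part of the path that exists on disk.
fs::path CanonicalizeScriptPath(const fs::path& p) {
    std::error_code ec;
    fs::path resolved = fs::weakly_canonical(p, ec);
    if (ec) {
        // Fall back to purely lexical cleanup if the filesystem query failed.
        resolved = fs::absolute(p).lexically_normal();
    }
    return resolved;
}
```

For instance, a path like `<cwd>/./src/../Launch.lua` would come out as `<cwd>/Launch.lua`, provided `src` is not a symbolic link pointing elsewhere.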

Use modern paths in the configuration system as well as in utilities like subprocess spawning, cloud provider info and file search.

The texture library needs string keys to identify "shaders". These remain narrow UTF-8 strings, as all image paths in use are ASCII; while texture loading now uses proper paths, that doesn't need to permeate quite that far into such a performance-sensitive place.

As `io.open`, `os.remove` and `os.rename` can be fixed inside the LuaJIT interpreter itself, we no longer need the wrappers hooked into the main interpreter. A positive side effect is that subscripts can now also use UTF-8 in paths, as the changes are universal across all LuaJIT interpreters. The fallback regular Lua interpreter in `Update.exe` is still narrow ACP, and care needs to be taken when writing update op-files and modifying `UpdateApply.lua`.

This patch to the vendored LuaJIT build changes much of the internal machinery to use UTF-8 for paths, using W-style Windows API functions and _w-type CRT functions and transcoding between UTF-8 and UTF-16LE `wchar_t` at the edge. It also patches several functions exposed to Lua, like `os.getenv`, `os.remove`, `os.rename`, `os.system` and `os.tmpnam`, as well as `io.open`.
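To illustrate what transcoding at the edge involves, here is a portable sketch of the UTF-16 to UTF-8 direction, as would be needed when a wide API result is handed back to Lua. `Utf16ToUtf8` and `AppendUtf8` are hypothetical names and the patch itself may well rely on Windows API conversions instead. Unpaired surrogates are replaced with U+FFFD here, which is a choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Append one codepoint to a UTF-8 string using 1 to 4 bytes.
void AppendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to UTF-8, combining surrogate pairs.
std::string Utf16ToUtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a low surrogate right after it.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        AppendUtf8(out, cp);
    }
    return out;
}
```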

For `fs::path` the `u8string()` function yields paths in the native OS format, with backslashes on Windows. For consistency with existing forward slash use in PoB Lua code, paths are now instead exposed through `generic_u8string()`, which has consistent separators across platforms.
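A small sketch of exposing a path to Lua with forward slashes. `ToLuaPath` is a hypothetical name. The copy through iterators is there because `generic_u8string()` returns `std::string` in C++17 but `std::u8string` in C++20, and this keeps the sketch building under either standard.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper: UTF-8 bytes with '/' separators on every platform.
std::string ToLuaPath(const std::filesystem::path& p) {
    const auto generic = p.generic_u8string();
    // Copy element by element so both char and char8_t results are accepted.
    return std::string(generic.begin(), generic.end());
}
```

Lua code that concatenates or pattern-matches on `/` then behaves the same regardless of which platform produced the path.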

File search patterns previously split up and escaped individual UTF-8 code units instead of matching on the actual codepoint. This change transcodes to UTF-32 codepoints and emits non-special codepoints as curly hex escapes like `\x{10FFFF}`.
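A sketch of the escaping idea, working on already decoded codepoints. `GlobToRegex` is a hypothetical name, and the mapping of `*` and `?` to `.*` and `.` is an assumption about the wildcard syntax; the commit message only states that non-special codepoints become curly hex escapes.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: turn a wildcard pattern into a regex string where
// every literal codepoint is written as a \x{...} escape.
std::string GlobToRegex(const std::u32string& pattern) {
    std::string out;
    for (const char32_t cp : pattern) {
        if (cp == U'*') {
            out += ".*";  // assumed meaning: any run of characters
        } else if (cp == U'?') {
            out += ".";   // assumed meaning: exactly one character
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\x{%X}", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Escaping whole codepoints this way means a multi-byte character in a file name is matched as one unit and no literal can be mistaken for a regex metacharacter.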

zao added 17 commits April 2, 2024 23:44

feat: add OpenFile function to open UTF-8 paths

52c0486


feat: remove legacy GIF and BLP file formats

c8c4c4d


fix: harden string conversion against invalid data

af24736

feat: make images and file streams use fs::path

e4998ec


feat: revise config paths and various utilities

05db02a


zao mentioned this pull request Apr 19, 2024

Fix fallout from SimpleGraphic upgrade with wider Unicode support PathOfBuildingCommunity/PathOfBuilding#7586

Draft

zao marked this pull request as ready for review May 20, 2024 21:05
