feat: change all narrow filesystem paths and operations to speak UTF-8 #59

Open
wants to merge 17 commits into
base: master

@zao zao commented Apr 19, 2024

Motivation

Historically, Path of Building's Lua code, SimpleGraphic and LuaJIT have used narrow byte-based paths encoded in the user's codepage, typically a single-byte Western one with ASCII in the low half and various Latin-adjacent characters in the high half.

As we predominantly live in the user's profile directory, with both the default user/build path in Documents and the preferred program installation path in Application Data, the user's chosen username often appears in absolute paths that involve said build and script paths. This happens to work somewhat OK if the username can be expressed in the user's codepage, but if it can't, it breaks miserably without much in the way of meaningful workarounds.

A recent complicating factor is that OneDrive is transparently enabled on more and more computers that are set up with Microsoft accounts out of the box. When Documents is managed under OneDrive, the true location for Documents is in places like C:\Users\username\OneDrive\Документи with the folder name using the user's human language term. The fallout and individual workarounds can be seen in issues like PathOfBuildingCommunity/PathOfBuilding#7431.

While newer Windows 10 and 11 builds have opt-in support for using UTF-8 as the locale, we can't realistically have all affected users configure that knob with indeterminate system impact. It would also not put a stop to the stream of issues filed about the problem.

Changes

This PR aims to fix this by a number of deep modifications to the runtime as well as some changes to the Lua code in the main project in PathOfBuildingCommunity/PathOfBuilding#7586.
The Launcher also has a PR in PathOfBuildingCommunity/PathOfBuilding-Launcher#6 to pass Unicode paths on the command line instead of ACP paths.

The individual commits have extensive essays on what they change and why, but in short here we:

  • change file search (NewFileSearch) to accept and yield UTF-8 encoded paths and wildcard patterns;
  • add a compiled Lua extension module starwing/luautf8 usable as lua-utf8 to perform UTF-8 string operations in our Lua;
  • expose UTF-8 paths in GetScriptPath and GetUserPath, resolving any symbolic links along the way;
  • enhance the text renderer to represent Unicode codepoints outside of what the font supports with a "tofu" representation where the codepoint ordinal is expressed as [U+1234] in a slightly smaller font size;
  • enhance the text measurement and caret positioning code to understand UTF-8 and codepoints, always yielding byte offsets at codepoint boundaries;
  • make all internal SimpleGraphic I/O use std::filesystem::path and Unicode when operating on the filesystem;
  • patch LuaJIT itself to load DLLs with Unicode paths, do filesystem operations with Unicode paths, open files with wide paths, etc.

It also does some slightly related cleanup like:

  • remove unused GIF and BLP file format support.

Some commits in the PR are later backed out because later changes superseded their functionality. One example is the monkey-patching of io.open/os.remove/os.rename, which made those functions interpret given paths as UTF-8 and call wide CRT functions. More extensive patches to LuaJIT itself now do the same thing.

Of note is that the updater has not been touched and still uses the same old non-JIT Lua interpreter with the user codepage narrow functions. It can't be updated but it mostly works if you take care to only feed it relative paths.

zao added 17 commits April 2, 2024 23:44
On Windows this function opens a placeholder file with the standard
`io.open` function to get a proper file object, then replaces the
internal C file pointer with the actual file opened through _wfopen.
Previously the file search objects exposed to Lua accepted and generated
paths encoded in the user's narrow codepage. As a move toward UTF-8
awareness for paths and strings on the Lua side, we now instead accept
and generate UTF-8 strings.

This should be robust against legacy paths represented in the user's
codepage. The only source of such paths should be ones that were
historically persisted in locations like `Settings.xml`.
By squirreling away the original `io.open` function from Lua and
replacing the entry in the interpreter with our own wrapper that
interprets the path as UTF-8, we can now open files in locations that
cannot be represented in the user's codepage, as long as the path is
supplied as UTF-8.

Lua itself is very encoding agnostic, assuming that all strings are
soups of bytes and that paths are encoded in whatever the platform's
`FILE*` functions accept. On Linux and macOS that's fairly friendly: in
the past it depended on the locale, but these days it is de facto
standardized on UTF-8.

On Windows the story is worse and narrow paths are assumed to be in the
user's current codepage, which can only express a fixed subset of
Unicode and which breaks if paths from the filesystem cannot be encoded
in the codepage. On Windows with NTFS the underlying filesystem is
UTF-16LE and supports all of Unicode, so it's good for consistency and
future work if we standardize on UTF-8, as Lua is very byte-oriented
when it comes to text and paths.
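
The widening step such a wrapper relies on can be sketched in portable
C++. This is a minimal illustration, not the code from the PR: the names
`decode_utf8` and `widen_utf8` are hypothetical, malformed input is
replaced with U+FFFD as an assumption, and overlong forms are not
rejected. On Windows the same job could be done with
`MultiByteToWideChar` and `CP_UTF8` before handing the result to
`_wfopen`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one codepoint starting at s[i] and advance i.
// Returns U+FFFD for malformed input (simplified: overlong forms and encoded
// surrogates are not rejected).
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = 0;
    char32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    for (int k = 0; k < extra; ++k) {
        if (i >= s.size()) return 0xFFFD;  // sequence cut off by end of string
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // leave b for the next call
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Hypothetical helper: UTF-8 bytes -> UTF-16 code units, using surrogate
// pairs for codepoints above U+FFFF.
inline std::u16string widen_utf8(const std::string& utf8) {
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (cp > 0x10FFFF) cp = 0xFFFD;
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

A caller would pass the path string received from Lua through
`widen_utf8` and use the result with a wide file API.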
The luautf8 library provides analogues of string functions that operate
on UTF-8 strings instead of byte soups, allowing for intelligent editing
and iterating through such strings, needed for things like Unicode paths
and other internationalization.
Store the user path and base path as canonicalized std::filesystem paths
and expect the Lua side to be aware enough of UTF-8 to manipulate them
properly.

This together with the monkeypatched UTF-8 `io.open` and functions like
MakeDir and RemoveDir allows for use of a wider variety of installation
and user profile locations.
As we need the ability to display UTF-8 encoded text in Lua code from
now on, this change alters text rendering to transcode human strings to
UTF-32 codepoints, keeping an association between original byte offset
and codepoint index.

This allows for font rendering that can intelligently translate
unhandled codepoints as tofu symbols, for now represented as a bracketed
codepoint ordinal in a smaller font like `[U+0123]`.

This slightly misaligns mouse selection and navigation of text areas if
they're in "scaled" font heights in-between or outside the "crisp" font
sizes available in the runtime. The Lua code predominantly uses crisp
fonts; a survey only found a few scaled/zero/negative heights, all of
them on empty strings.

The handling of zero-height and negative height strings has also been
made more robust, fixing a division by zero and an array overflow.
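
Two small pieces of the behavior described above can be sketched as
follows. Both function names are hypothetical and the code is an
illustration under assumptions, not the renderer's implementation:
`tofu_label` formats the bracketed ordinal with at least four hex
digits, and `snap_to_codepoint` keeps a byte offset from landing in the
middle of a UTF-8 sequence.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: the bracketed label drawn in place of a glyph the
// font lacks, e.g. U+0123 -> "[U+0123]". At least four hex digits are used.
inline std::string tofu_label(char32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "[U+%04lX]", static_cast<unsigned long>(cp));
    return buf;
}

// Hypothetical helper: move a byte offset back to the first byte of the
// codepoint it falls inside. UTF-8 continuation bytes look like 10xxxxxx.
inline std::size_t snap_to_codepoint(const std::string& s, std::size_t offset) {
    if (offset >= s.size()) return s.size();
    while (offset > 0 && (static_cast<unsigned char>(s[offset]) & 0xC0) == 0x80) {
        --offset;
    }
    return offset;
}
```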
The LuaJIT implementations of `os.rename` and `os.remove` use narrow
functions in the system encoding.

This change redirects the functions to variants that widen UTF-8
strings from Lua on Windows and call wide CRT versions of the functions
to properly rename and remove files.
No current code seems to use GIF or BLP anymore; removing the code cuts
down the amount of surface area we have.
This change adds some foundational changes to widen the rest of the
runtime to use `std::filesystem::path` and try to detect misuse where
narrow strings are used as-is.

Dependent code in the codebase will fail to build with this change.
Script paths are now Unicode and canonicalized to the full path, which
removes any extraneous "./" and "../" segments. These segments typically
arise from how the program is launched during development.

Most API functions that use these are updated to use the new path types.
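
One plausible way to get this kind of canonicalization from the standard
library is `std::filesystem::weakly_canonical`. The sketch below assumes
that call and a hypothetical helper name `resolve_script_path`; the
commit may use a different function such as `std::filesystem::canonical`.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: resolve a script path given relative to `base`.
// weakly_canonical resolves symlinks for the part of the path that exists
// and normalizes the remainder lexically, so "./" and "../" segments vanish.
inline fs::path resolve_script_path(const fs::path& base, const fs::path& script) {
    return fs::weakly_canonical(base / script);
}
```

For example, resolving `src/../Launch.lua` against a base directory
yields the absolute path of `Launch.lua` directly under that directory.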
Use modern paths in the configuration system as well as in utilities
like subprocess spawning, cloud provider info and file search.
The texture library needs string keys to identify "shaders". They remain
narrow UTF-8 as all image paths in use are ASCII, and while there's now
proper path use for texture loading, it doesn't need to permeate quite
that far into that kind of performance-sensitive place.
As these functions can be fixed inside of the LuaJIT interpreter itself,
we don't need these hooked into the main interpreter anymore.

A positive side effect of this is that subscripts now also can use UTF-8
in paths as the changes are universal across all LuaJIT interpreters.

The fallback regular Lua interpreter in `Update.exe` is still narrow ACP
and care needs to be taken when writing update op-files and modifying
`UpdateApply.lua`.
This patch to the vendored LuaJIT build changes much of the internal
machinery to use UTF-8 for paths, using W-style Windows API functions
and _w-type CRT functions, transcoding between UTF-8 and UTF-16LE
wchar_t at the edge.

This also patches several functions exposed to Lua like `os.getenv`,
`os.remove`, `os.rename`, `os.system`, `os.tmpnam` as well as `io.open`.
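
For values that travel back from a wide API to Lua, the narrowing
direction is needed. The following is a minimal portable sketch with
hypothetical names (`append_utf8`, `narrow_utf16`); replacing unpaired
surrogates with U+FFFD is an assumption, and on Windows
`WideCharToMultiByte` with `CP_UTF8` could serve the same purpose.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one codepoint to a UTF-8 string.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: UTF-16 -> UTF-8. Surrogate pairs are combined into
// one codepoint; unpaired surrogates become U+FFFD.
inline std::string narrow_utf16(const std::u16string& wide) {
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        char32_t cp = wide[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 < wide.size() && wide[i + 1] >= 0xDC00 && wide[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (wide[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```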
For `fs::path` the `u8string()` function yields paths in the native OS
format, with backslashes on Windows. For consistency with existing
forward slash use in PoB Lua code they're now instead exposed through
`generic_u8string()`, which has consistent separators across platforms.
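
A minimal sketch of exposing a path this way is shown below. The helper
name `path_for_lua` is hypothetical. Because `generic_u8string()`
returns `std::string` in C++17 and `std::u8string` in C++20, the bytes
are copied into a `std::string` so the sketch builds under either
standard.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper: expose a path to Lua as UTF-8 with forward slashes.
// The iterator-range constructor copies the bytes regardless of whether the
// source is std::string (C++17) or std::u8string (C++20).
inline std::string path_for_lua(const std::filesystem::path& p) {
    const auto u8 = p.generic_u8string();
    return std::string(u8.begin(), u8.end());
}
```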
File search patterns previously split up and escaped individual UTF-8
code units instead of matching on the actual codepoint.

This change transcodes to UTF-32 codepoints and emits non-special
codepoints as curly hex escapes like `\x{10FFFF}`.
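
The escaping can be sketched as below, assuming the pattern has already
been decoded to UTF-32. The function name `glob_to_regex` and the
mapping of `*` and `?` to regular expression syntax are assumptions for
illustration; only the curly hex escape form comes from the commit
message.

```cpp
#include <cstdio>
#include <string>

// Hypothetical sketch: turn a wildcard pattern (already decoded to UTF-32)
// into a regular expression string. The handling of '*' and '?' is assumed.
inline std::string glob_to_regex(const std::u32string& pattern) {
    std::string out;
    for (char32_t cp : pattern) {
        if (cp == U'*') {
            out += ".*";  // any run of characters
        } else if (cp == U'?') {
            out += ".";   // exactly one character
        } else {
            // Every other codepoint is emitted as a whole-codepoint escape,
            // so multi-byte characters are never split into code units.
            char buf[16];
            std::snprintf(buf, sizeof buf, "\\x{%lX}", static_cast<unsigned long>(cp));
            out += buf;
        }
    }
    return out;
}
```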