feat: change all narrow filesystem paths and operations to speak UTF-8 #59
Open: zao wants to merge 17 commits into PathOfBuildingCommunity:master from zao:feat/wide-path-io
On Windows this function opens a placeholder file with the standard `io.open` function to get a proper file object, then replaces the internal C file pointer with the actual file opened through `_wfopen`.
Previously the file search objects exposed to Lua accepted and generated paths encoded in the user's narrow codepage. As a move toward UTF-8 awareness for paths and strings on the Lua side, we now instead accept and generate UTF-8 strings. This should be robust against legacy paths represented in the user's codepage; the only source of such paths should be locations where they have historically been persisted, like `Settings.xml`.
By squirreling away the original `io.open` function from Lua and replacing the entry in the interpreter with our own wrapper that interprets the path as UTF-8, we can now open files in locations that cannot be represented in the user's codepage, as long as the path is supplied as UTF-8. Lua itself is very encoding-agnostic: it assumes that all strings are soups of bytes and that paths are encoded in whatever the platform's `FILE*` functions accept. On Linux and macOS that's fairly friendly; in the past the encoding depended on the locale, but these days it is de facto standardized on UTF-8. On Windows the story is worse: narrow paths are assumed to be in the user's current codepage, which can only express a fixed subset of Unicode and which breaks if paths from the filesystem cannot be encoded in the codepage. On Windows with NTFS the underlying filesystem is UTF-16LE and supports all of Unicode, so it's good for consistency and future work if we standardize on UTF-8, as Lua is very byte-oriented when it comes to text and paths.
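The widening step behind such a wrapper can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `widen_utf8` and `open_utf8` are hypothetical names, malformed input is simply replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper (not from the patch): decode UTF-8 into UTF-16 code units.
// Malformed sequences become U+FFFD; overlong encodings are not rejected here.
std::u16string widen_utf8(std::string_view s) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0xFFFD;
        std::size_t need = 0;  // number of continuation bytes expected
        if (b < 0x80) cp = b;
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
        bool ok = i + need < s.size();
        for (std::size_t k = 1; ok && k <= need; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;  // replace anything invalid and skip one byte
            need = 0;
        }
        i += need + 1;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Hypothetical wrapper: treat `path` as UTF-8 on every platform.
std::FILE* open_utf8(const std::string& path, const char* mode) {
#ifdef _WIN32
    std::u16string p16 = widen_utf8(path), m16 = widen_utf8(mode);
    std::wstring wpath(p16.begin(), p16.end()), wmode(m16.begin(), m16.end());
    return _wfopen(wpath.c_str(), wmode.c_str());  // wide CRT open
#else
    return std::fopen(path.c_str(), mode);  // narrow paths are already UTF-8 in practice
#endif
}
```

With this shape, a caller passes the same UTF-8 string on every platform and only the Windows branch pays for the transcoding.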
The luautf8 library provides analogues of string functions that operate on UTF-8 strings instead of byte soups, allowing for intelligent editing and iterating through such strings, needed for things like Unicode paths and other internationalization.
Store the user path and base path as canonicalized std::filesystem paths and expect the Lua side to be aware enough of UTF-8 to manipulate them properly. This together with the monkeypatched UTF-8 `io.open` and functions like MakeDir and RemoveDir allows for use of a wider variety of installation and user profile locations.
As we need the ability to display UTF-8 encoded text onward in Lua code, this change alters text rendering to transcode human strings to UTF-32 codepoints, keeping an association between original byte offset and codepoint index. This allows for font rendering that can intelligently translate unhandled codepoints as tofu symbols, for now represented as a bracketed codepoint ordinal in a smaller font like `[U+0123]`. This slightly misaligns mouse selection and navigation of text areas if they're in "scaled" font heights in-between or outside the "crisp" font sizes available in the runtime. The Lua code predominantly uses crisp fonts; a survey only found a few scaled/zero/negative heights, all of empty strings. The handling of zero-height and negative-height strings has also been made more robust, fixing a division by zero and an array overflow.
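As an illustration of the offset bookkeeping described here, the following sketch decodes UTF-8 into codepoints while remembering where each one started in the byte string, and formats a tofu label. The names `DecodedChar`, `decode_with_offsets` and `tofu_label` are assumptions made for this example and are not taken from the runtime; invalid bytes are mapped to U+FFFD one byte at a time.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical record: one decoded codepoint and the byte offset it started at.
struct DecodedChar {
    char32_t cp;
    std::size_t byteOffset;
};

// Decode UTF-8 to UTF-32, keeping the original byte offset of each codepoint.
std::vector<DecodedChar> decode_with_offsets(std::string_view s) {
    std::vector<DecodedChar> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0xFFFD;
        std::size_t need = 0;  // continuation bytes expected after the lead byte
        if (b < 0x80) cp = b;
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; need = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; need = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; need = 3; }
        bool ok = i + need < s.size();
        for (std::size_t k = 1; ok && k <= need; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { cp = 0xFFFD; need = 0; }
        out.push_back({cp, i});
        i += need + 1;
    }
    return out;
}

// Format an unhandled codepoint as a bracketed ordinal, e.g. "[U+0123]".
std::string tofu_label(char32_t cp) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "[U+%04X]", static_cast<unsigned>(cp));
    return buf;
}
```

A cursor position expressed as a codepoint index can then be mapped back to a byte offset through `byteOffset`, which is what string editing on the Lua side would need.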
The LuaJIT implementations of `os.rename` and `os.remove` use narrow functions in the system encoding. This change redirects the functions to variants that widen UTF-8 strings from Lua on Windows and call the wide CRT versions of the functions to properly rename and remove files.
No current code seems to use GIF or BLP anymore; removing the code cuts down the amount of surface area we have.
This change lays the foundation for widening the rest of the runtime to use `std::filesystem::path` and tries to detect misuse where narrow strings are used as-is. Dependent code in the codebase will fail to build with this change.
Script paths are now Unicode and canonicalized to the full path, removing any extraneous "./" and "../" segments. These segments typically arise from how the program is launched during development. Most API functions that use these paths are updated to use the new path types.
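A rough sketch of this kind of canonicalization with the standard library follows. It assumes `std::filesystem::weakly_canonical` as the resolving call, which the commit message does not confirm, and `canonical_script_path` is a hypothetical name.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: resolve a possibly relative, "./"- and "../"-laden
// path into a full normalized path. weakly_canonical resolves symbolic links
// for the part of the path that exists and normalizes the remainder.
fs::path canonical_script_path(const fs::path& p) {
    std::error_code ec;
    fs::path resolved = fs::weakly_canonical(p, ec);
    if (ec) {
        // Fall back to a purely lexical cleanup if the filesystem query failed.
        return p.lexically_normal();
    }
    return resolved;
}
```

For example, launching from a build directory with a script path like `./runtime/../src/Launch.lua` (an illustrative path) would yield an absolute path with the dot segments collapsed.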
Use modern paths in the configuration system as well as in utilities like subprocess spawning, cloud provider info and file search.
The texture library needs string keys to identify "shaders". They remain narrow UTF-8, as all image paths in use are ASCII; while there's now proper path use for texture loading, it doesn't need to permeate quite that far into that kind of performance-sensitive place.
As these functions can be fixed inside of the LuaJIT interpreter itself, we don't need these hooked into the main interpreter anymore. A positive side effect of this is that subscripts can now also use UTF-8 in paths, as the changes are universal across all LuaJIT interpreters. The fallback regular Lua interpreter in `Update.exe` is still narrow ACP, and care needs to be taken when writing update op-files and modifying `UpdateApply.lua`.
This patch to the vendored LuaJIT build changes much of the internal machinery to use UTF-8 for paths, using W-style Windows API functions and `_w`-type CRT functions, transcoding between UTF-8 and UTF-16LE `wchar_t` at the edge. This also patches several functions exposed to Lua like `os.getenv`, `os.remove`, `os.rename`, `os.system`, `os.tmpnam` as well as `io.open`.
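For results coming back from wide APIs (as a function like `os.getenv` would need), the transcoding runs in the other direction. The sketch below shows one way to narrow UTF-16 to UTF-8; `narrow_utf16` is a hypothetical name, and replacing unpaired surrogates with U+FFFD is an assumption of this example, not documented behaviour of the patch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: encode UTF-16 code units as UTF-8.
// Unpaired surrogates are replaced with U+FFFD in this sketch.
std::string narrow_utf16(std::u16string_view s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one codepoint.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```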
For `fs::path` the `u8string()` function yields paths in the native OS format with backward slashes on Windows. For consistency with existing forward slash use in PoB Lua code they're now instead exposed as `generic_u8string()` which has consistent separators across platforms.
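The difference can be illustrated with a small helper. `to_lua_path` is a hypothetical name for this example; the copy through iterators is only there so the same code compiles whether `generic_u8string()` returns `std::string` (C++17) or `std::u8string` (C++20).

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: produce the UTF-8 string handed to Lua, always with
// forward slashes, regardless of the platform's preferred separator.
std::string to_lua_path(const fs::path& p) {
    auto u8 = p.generic_u8string();
    return std::string(u8.begin(), u8.end());
}
```

On Windows, `fs::path("builds") / "sub" / "file.xml"` is joined with backslashes internally, yet `to_lua_path` would return `builds/sub/file.xml`, matching the separators existing Lua code expects.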
File search patterns previously split up and escaped individual UTF-8 code units instead of matching on the actual codepoint. This change transcodes to UTF-32 codepoints and emits non-special codepoints as curly hex escapes like `\x{10FFFF}`.
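A minimal sketch of this escaping scheme, assuming the pattern has already been transcoded to UTF-32: which characters count as special is an assumption here (only `*` and `?` are treated as wildcards), and `glob_to_regex` is a hypothetical name rather than the runtime's function.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: turn a wildcard pattern (as UTF-32 codepoints) into a
// regular expression. Wildcards map to regex operators; every other codepoint
// is emitted as a curly hex escape so it matches as one whole codepoint.
std::string glob_to_regex(std::u32string_view pattern) {
    std::string out;
    char buf[16];
    for (char32_t cp : pattern) {
        if (cp == U'*') {
            out += ".*";  // any run of characters
        } else if (cp == U'?') {
            out += '.';   // exactly one character
        } else {
            std::snprintf(buf, sizeof buf, "\\x{%X}", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Escaping every literal codepoint this way avoids having to enumerate regex metacharacters, at the cost of a less readable pattern.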
Motivation
Historically Path of Building's Lua code, SimpleGraphic and LuaJIT have used narrow byte-based paths encoded in the user's codepage, typically a single-byte western one with ASCII in the low half and various latin-adjacent characters in the high half.
As we predominantly live in the user's profile directory with both the default user/build path in Documents and the preferred program installation path in Application Data, the user's chosen username often appears in absolute paths that involve said build and script paths. This happens to work somewhat OK if it can be expressed in the user's codepage but if it can't, it breaks miserably without much in the way of meaningful workarounds.
A recent complicating factor is that OneDrive is transparently enabled on more and more computers that are set up with Microsoft accounts out of the box. When Documents is managed under OneDrive, the true location for Documents is in places like
C:\Users\username\OneDrive\Документи
with the folder name using the user's human language term. The fallout and individual workarounds can be seen in issues like PathOfBuildingCommunity/PathOfBuilding#7431.
While newer Windows 10 and 11 builds have opt-in support for using UTF-8 as the locale, we can't realistically have all affected users configure that knob with indeterminate system impact. It would also not put a stop to the stream of issues filed about the problem.
Changes
This PR aims to fix this by a number of deep modifications to the runtime as well as some changes to the Lua code in the main project in PathOfBuildingCommunity/PathOfBuilding#7586.
The Launcher also has a PR in PathOfBuildingCommunity/PathOfBuilding-Launcher#6 to pass Unicode paths on the command line instead of ACP paths.
The individual commits have extensive essays on what they change and why, but in short here we:
- change the file search API (`NewFileSearch`) to accept and yield UTF-8 encoded paths and wildcard patterns;
- add `lua-utf8` to perform UTF-8 string operations in our Lua;
- canonicalize the paths returned by `GetScriptPath` and `GetUserPath`, resolving any symbolic links along the way;
- render unhandled codepoints as tofu like `[U+1234]` in a slightly smaller font size;
- use `std::filesystem::path` and Unicode when operating on the filesystem;

It also does some slightly related cleanup, like removing the unused GIF and BLP code.
Some commits in the PR are later backed out as their functionality became superseded by later changes, like the monkey-patching of `io.open`/`os.remove`/`os.rename`, which were made to interpret given paths as UTF-8 and call wide CRT functions for those tasks. They have been superseded by doing the same thing but in more extensive patches to LuaJIT itself.
Of note is that the updater has not been touched and still uses the same old non-JIT Lua interpreter with the user codepage narrow functions. It can't be updated but it mostly works if you take care to only feed it relative paths.