-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix line ending handling in reverse_readfile/readline
across OS, and not skipping empty lines
#712
Fix line ending handling in reverse_readfile/readline
across OS, and not skipping empty lines
#712
Conversation
WalkthroughThe changes in this pull request primarily enhance file handling capabilities in the Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #712 +/- ##
==========================================
+ Coverage 83.90% 84.19% +0.28%
==========================================
Files 27 27
Lines 1640 1670 +30
Branches 302 314 +12
==========================================
+ Hits 1376 1406 +30
Misses 202 202
Partials 62 62 ☔ View full report in Codecov by Sentry. |
reverse_readfile
and reverse_readline
reverse_readfile
and reverse_readline
If, for some reason, the OUTCAR file has mixed style line endings, then how should it be handled? Therefore, I believe that the safest approach might be to first convert the given OUTCAR file to a unified UNIX line ending style. |
Hi @hongyi-zhao thanks for the comment.
I believe that's technically possible, but I don't think |
It should be |
Either serves the same purpose to unify line ending ( BTW, |
Don't mind, I mean you have a typo in your previous posting: |
Ah okay, didn't notice that, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for fixing this issue @DanielYang59! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
Outside diff range and nitpick comments (7)
benchmark/benchmark.py (2)
169-171
: Includemonty
module version in the outputFor better reproducibility and clarity, it's beneficial to display the version of the
monty
module being used in the benchmark. This information can be crucial when comparing results across different environments or after updates.You can modify your script to include the
monty
version:import sys +import monty os_info = platform.platform() python_version = sys.version.split()[0] print(f"\nRunning on OS: {os_info}, Python {python_version}") +print(f"Monty version: {monty.__version__}")
6-6
: Evaluate the necessity of the__future__
importThe
from __future__ import annotations
statement is used to postpone the evaluation of type annotations, which can be helpful in some cases. However, since the script isn't extensively utilizing annotations that require this import, and if you're running on Python 3.7 or later, this import may not be necessary.Consider removing the import if it's not needed:
-from __future__ import annotations
Alternatively, if you plan to expand the use of annotations, especially with forward references or complex type hints, you might decide to retain it.
benchmark/pypi-7.12-win11.txt (2)
5-5
: Ensure Consistent Phrasing for Timing OutputsThe lines reporting the creation of test files use the phrase "time used," whereas other timing outputs use "time taken." For consistency and clarity, consider changing "time used" to "time taken" in these lines.
Apply this diff to maintain consistency:
-Test file of size 1 MB created with 40757 lines, time used 0.02 seconds. +Test file of size 1 MB created with 40757 lines, time taken: 0.02 seconds. -Test file of size 10 MB created with 392476 lines, time used 0.17 seconds. +Test file of size 10 MB created with 392476 lines, time taken: 0.17 seconds. -Test file of size 100 MB created with 3784596 lines, time used 1.71 seconds. +Test file of size 100 MB created with 3784596 lines, time taken: 1.71 seconds. -Test file of size 500 MB created with 18462038 lines, time used 8.34 seconds. +Test file of size 500 MB created with 18462038 lines, time taken: 8.34 seconds. -Test file of size 1000 MB created with 36540934 lines, time used 16.35 seconds. +Test file of size 1000 MB created with 36540934 lines, time taken: 16.35 seconds. -Test file of size 5000 MB created with 178466370 lines, time used 85.79 seconds. +Test file of size 5000 MB created with 178466370 lines, time taken: 85.79 seconds.Also applies to: 29-29, 53-53, 77-77, 101-101, 125-125
1-145
: Consider Excluding Benchmark Result Files from the RepositoryIncluding raw benchmark result files like
benchmark/pypi-7.12-win11.txt
in the repository can increase its size unnecessarily and may not provide significant value to other developers. Benchmark results can vary between environments and are typically regenerated as needed.Consider removing the benchmark result file from the repository. Instead, you can summarize key findings in the project's documentation or the pull request description. This approach keeps the repository lean and focuses on the most relevant information.
tests/test_io.py (3)
46-49
: Typo in variable namestart_pot
; consider renaming tostart_pos
The variable
start_pot
in lines 46 and 48 appears to be a typo. Renaming it tostart_pos
would better represent its purpose as the starting position of the file pointer.Apply this diff to correct the variable name:
with open(test_file, "r", encoding="utf-8") as f: - start_pot = f.tell() + start_pos = f.tell() assert _get_line_ending(f) == l_end - assert f.tell() == start_pot + assert f.tell() == start_pos
52-55
: Typo in variable namestart_pot
; consider renaming tostart_pos
Similarly, in the binary mode test, consider renaming
start_pot
tostart_pos
for consistency.Apply this diff to correct the variable name:
with open(test_file, "rb") as f: - start_pot = f.tell() + start_pos = f.tell() assert _get_line_ending(f) == l_end - assert f.tell() == start_pot + assert f.tell() == start_pos
198-198
: Clarify the failure message in the empty file testThe failure message
"No error should be thrown."
might be misleading, as the test ensures no lines are read from an empty file. Consider updating the message for clarity.Apply this diff to update the message:
for _line in reverse_readline(f): - pytest.fail("No error should be thrown.") + pytest.fail("No lines should be read from an empty file.")
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (3)
tests/test_files/3000_lines.txt.bz2
is excluded by!**/*.bz2
tests/test_files/3000_lines.txt.gz
is excluded by!**/*.gz
tests/test_files/3000lines.txt.gz
is excluded by!**/*.gz
Files selected for processing (12)
- .github/workflows/test.yml (2 hunks)
- benchmark/benchmark.py (1 hunks)
- benchmark/develop-ubuntu2204.txt (1 hunks)
- benchmark/develop-win11.txt (1 hunks)
- benchmark/pypi-7.12-ubuntu2204.txt (1 hunks)
- benchmark/pypi-7.12-win11.txt (1 hunks)
- src/monty/io.py (5 hunks)
- tests/test_files/3000_lines.txt (1 hunks)
- tests/test_io.py (4 hunks)
- tests/test_multiprocessing.py (1 hunks)
- tests/test_shutil.py (1 hunks)
- tests/test_tempfile.py (2 hunks)
Files skipped from review due to trivial changes (3)
- tests/test_files/3000_lines.txt
- tests/test_multiprocessing.py
- tests/test_shutil.py
Additional comments not posted (4)
.github/workflows/test.yml (2)
9-9
: Confirm the necessity offail-fast: false
in the build strategyAdding
fail-fast: false
ensures that all matrix jobs continue to run even if one fails, allowing for comprehensive test results across all environments. Please verify that this aligns with your intended testing workflow and that the extended build times are acceptable.
33-33
: Ensure compatibility with Codecov Action V4Updating the Codecov action from
@v3
to@v4
may introduce changes that affect coverage report uploads. Please confirm that the new version is compatible with your configuration and update any settings if necessary.You can review the release notes for
codecov/codecov-action@v4
to check for breaking changes or required adjustments:tests/test_tempfile.py (2)
4-4
: Addition ofplatform
import is appropriate and necessary.The
platform
module is required for the operating system check usingplatform.system()
later in the code.
124-124
: OS check updated for clarity and compatibility.Replacing the OS check with
platform.system() != "Windows"
provides a clearer and more explicit way to determine if the code is not running on a Windows system, enhancing readability and maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (4)
tests/test_tempfile.py (1)
125-125
: LGTM: Improved OS check for better clarity and consistency.The change from
os.name != "nt"
toplatform.system() != "Windows"
improves the clarity and reliability of the OS check. This aligns well with similar changes made in other files of the project.For even better consistency across the codebase, consider using a constant for the "Windows" string:
WINDOWS = "Windows" # ... if platform.system() != WINDOWS:This would make it easier to maintain if you need to change the string in the future.
tests/test_shutil.py (1)
Line range hint
170-170
: Appropriate test skipping for Windows platformThe addition of
@unittest.skipIf(platform.system() == "Windows", "Skip on windows")
to all test methods in the TestRemove class is a good improvement. This change aligns well with the PR objectives of enhancing cross-platform compatibility and prevents test failures on Windows due to unsupported operations.Consider making the skip message more informative, e.g., "Skipping file removal test on Windows due to different file system behavior". This would provide more context about why the test is being skipped.
Also applies to: 178-178, 186-186, 198-198
src/monty/io.py (1)
209-212
: Consider simplifying line ending handlingSince
l_end
can only be"\n"
or"\r\n"
, and their lengths are known, you may simplify the code by directly assigning the length without usinglen()
andcast
.Apply this diff to simplify:
l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file) - len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end)) + len_l_end: int = 2 if l_end == "\r\n" else 1This removes the need for
cast
and makes the code clearer.tests/test_io.py (1)
264-266
: Clarify line ending handling in assertions for consistency.The assertion combines
line.rstrip(os.linesep)
withl_end
, which may lead to confusion due to OS-specific line endings.Consider normalizing the line endings before comparison to make the assertion clearer:
- assert ( - line.rstrip(os.linesep) + l_end - == contents[len(contents) - idx - 1] - ) + expected_line = contents[len(contents) - idx - 1].replace('\r\n', '\n').rstrip('\n') + actual_line = line.replace('\r\n', '\n').rstrip('\n') + assert actual_line == expected_line
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/test.yml (2 hunks)
- src/monty/io.py (5 hunks)
- tests/test_io.py (4 hunks)
- tests/test_shutil.py (1 hunks)
- tests/test_tempfile.py (2 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/test.yml
10-10: key "fail-fast" is duplicated in "strategy" section. previously defined at line:8,col:7
(syntax-check)
🪛 yamllint
.github/workflows/test.yml
[error] 10-10: duplication of key "fail-fast" in mapping
(key-duplicates)
🔇 Additional comments (15)
tests/test_tempfile.py (2)
4-4
: LGTM: Import statement added correctly.The
platform
module import is correctly placed and necessary for the subsequent changes in the file.
Line range hint
1-153
: Summary: Improved OS detection for better cross-platform compatibility.The changes in this file enhance the OS detection mechanism, making it more explicit and consistent with other parts of the project. The addition of the
platform
module and the updated condition intest_symlink
method improve the clarity and reliability of the tests on different operating systems.These modifications align well with the PR objectives of addressing line ending handling issues across different OS, particularly for Windows systems. The changes do not introduce any new issues and maintain the existing functionality of the tests.
tests/test_shutil.py (1)
34-34
: Improved platform check for symlink creationThe change from
os.name != "nt"
toplatform.system() != "Windows"
is a good improvement. It makes the condition more explicit and easier to understand, enhancing code readability and maintainability. This change is consistent with similar updates in other files, as mentioned in the PR summary, and aligns well with the goal of improving cross-platform compatibility.src/monty/io.py (11)
16-16
: LGTMThe import of
warnings
is appropriate and enhances the module's ability to issue warnings.
18-19
: LGTMImporting
TYPE_CHECKING
,Literal
, andcast
fromtyping
is appropriate for type annotations and conditional imports used later in the code.
26-26
: LGTMImporting
IO
,Iterator
, andUnion
inside theTYPE_CHECKING
block is standard practice for optional type annotations that assist with static analysis without impacting runtime.
115-116
: LGTMThe updated function signature for
reverse_readfile
includes type hints, improving code readability and static analysis.
165-168
: LGTMThe function
reverse_readline
now includes detailed type annotations and defaults, improving clarity and type safety.
210-211
: Verify the correctness ofcast
usage forlen_l_end
The use of
cast
to assign the length ofl_end
tolen_l_end
with typeLiteral[1, 2]
is intended to aid static type checking. Ensure that this use ofcast
is appropriate and does not introduce type inconsistencies.Review the necessity of the
cast
and consider if a simple type annotation would suffice.
264-266
: Ensure correct decoding based on file modeWhen reading blocks from the file, ensure that decoding is handled properly depending on whether the file is opened in text or binary mode.
135-138
:⚠️ Potential issueAddress potential high memory usage when reading compressed files
In the
reverse_readfile
function, these lines usefile.readlines()
to read compressed files, which loads the entire file into memory. For large gzip or bz2 files, this can lead to excessive memory consumption.Consider processing the compressed files in a memory-efficient way. For example, you could read and yield lines one at a time:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)): - for line in reversed(file.readlines()): - # "readlines" would keep the line end character - yield line.decode("utf-8") + lines = [] + for line in file: + lines.append(line) + if len(lines) >= max_mem // blk_size: + for rev_line in reversed(lines): + yield rev_line.decode("utf-8") + lines.clear() + for rev_line in reversed(lines): + yield rev_line.decode("utf-8")This approach processes the file in chunks, reducing peak memory usage.
Likely invalid or redundant comment.
105-112
:⚠️ Potential issueConsider supporting additional line endings or clarifying the error message in
_get_line_ending
The
_get_line_ending
function currently checks only for"\r\n"
(Windows) and"\n"
(Unix/Linux) line endings. If a file uses an older Mac OS line ending ("\r"
), the function will raise aValueError
. Consider adding support for"\r"
line endings or updating the error message to specify the supported line endings.To handle
"\r"
line endings, apply this diff:if first_line.endswith(b"\r\n"): return "\r\n" if first_line.endswith(b"\n"): return "\n" + if first_line.endswith(b"\r"): + return "\r" # It's likely the line is missing a line ending for the first line raise ValueError(f"Unknown line ending in line {repr(first_line)}.")Alternatively, clarify the error message:
# It's likely the line is missing a line ending for the first line - raise ValueError(f"Unknown line ending in line {repr(first_line)}.") + raise ValueError(f"Unsupported line ending in line {repr(first_line)}. Only '\\n' and '\\r\\n' are supported.")Likely invalid or redundant comment.
20-23
: Ensure conditional import oflzma
handles availability correctlyThe conditional import of
lzma
allows the code to handle environments wherelzma
is unavailable. Verify that subsequent code gracefully handleslzma
beingNone
.Run the following script to check usage of
lzma
when it isNone
:
129-132
: Ensure efficient retrieval of line endingsThe use of
_get_line_ending
enhances the robustness of line ending detection. Verify that this does not introduce significant overhead for large files.Run the following script to benchmark the performance impact:
tests/test_io.py (1)
3-4
: Necessary imports for compression handling.The imports
bz2
andgzip
are correctly included to handle bzip2 and gzip files in the tests.
7515632
to
3baea3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
src/monty/io.py (4)
45-45
: LGTM: Enhanced file format support and Path handlingThe addition of
.XZ
and.LZMA
file support improves the function's versatility. The explicit string conversion ofPath
objects is a good practice for ensuring compatibility.Consider using
os.fspath()
for convertingPath
objects to strings, as it's the recommended way to handle both string and path-like objects:- if filename is not None and isinstance(filename, Path): - filename = str(filename) + if filename is not None: + filename = os.fspath(filename)This change would handle both
str
andPath
objects uniformly.Also applies to: 50-51
55-107
: LGTM: Well-implemented line ending detectionThe
_get_line_ending
function is a robust implementation for detecting file line endings. It handles various file types, follows standards, and uses appropriate error handling.Consider adding a comment explaining why
\r
(CR) line endings are not supported, as some legacy systems might still use them:# Note: We don't check for '\r' (CR) line endings as they are rarely used in modern systems. # If support for CR line endings is needed, add an additional check here. if first_line.endswith(b"\r\n"): return "\r\n" if first_line.endswith(b"\n"): return "\n"This addition would clarify the design decision and make it easier to add support for CR line endings if needed in the future.
Line range hint
110-157
: LGTM: Improved line ending handling and file type supportThe changes to
reverse_readfile
enhance its robustness and consistency in handling different file types and line endings. The use of_get_line_ending
and yielding lines with original endings are good improvements.Consider caching the result of
len(filemap)
to avoid repeated calls:filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) +filemap_len = len(filemap) -file_size = len(filemap) +file_size = filemap_len while file_size > 0: # ... existing code ... - elif file_size != len(filemap): + elif file_size != filemap_len:This minor optimization could slightly improve performance, especially for large files.
160-272
: LGTM: Enhanced type safety and line ending handlingThe updates to
reverse_readline
improve type safety with better annotations and ensure consistent line ending handling across different file types. The logic for different file handling scenarios is well-implemented.Consider adding a comment explaining the rationale behind the in-memory reversal for small files and gzip files:
# For small files or gzip files, we reverse the entire file in memory. # This is more efficient for small files and necessary for gzip files # which don't support reverse seeking. if file_size < max_mem or isinstance(m_file, gzip.GzipFile): for line in reversed(m_file.readlines()): yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")This comment would clarify the design decision and make the code more maintainable.
tests/test_io.py (1)
281-281
: Consider consistent naming for class variablesFor consistency, consider using uppercase for all class-level constants. In this case,
NUM_LINES
is uppercase, whileNUMLINES
in theTestReverseReadline
class is not.Apply this diff for consistency:
- NUMLINES = 3000 + NUM_LINES = 3000Also update all references to
NUMLINES
in theTestReverseReadline
class toNUM_LINES
.Also applies to: 284-284
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/test.yml (1 hunks)
- src/monty/io.py (5 hunks)
- src/monty/json.py (1 hunks)
- src/monty/re.py (1 hunks)
- tests/test_io.py (4 hunks)
✅ Files skipped from review due to trivial changes (2)
- src/monty/json.py
- src/monty/re.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/test.yml
🧰 Additional context used
🔇 Additional comments (6)
src/monty/io.py (3)
17-17
: LGTM: Improved imports and type annotationsThe new imports and type annotations enhance the code's type safety and provide better tooling support. The use of
warnings
is appropriate for the new functionality.Also applies to: 19-19, 22-22
374-374
: LGTM: Improved docstring clarityThe updated docstring now clearly states that the function only works on UNIX-like operating systems, which is an important clarification for users.
Line range hint
1-384
: LGTM: Significant improvements to file I/O handlingThe changes in this file represent a substantial improvement in file I/O operations, particularly in handling line endings and supporting various file types. Key improvements include:
- Centralized line ending detection with the new
_get_line_ending
function.- Enhanced support for compressed file formats.
- Improved type annotations for better type safety and code clarity.
- More robust handling of edge cases, such as empty files.
These changes make the code more reliable, maintainable, and consistent across different file types and operating systems. The additions are well-integrated with the existing code and follow good coding practices.
tests/test_io.py (3)
24-118
: Excellent addition of comprehensive test cases for_get_line_ending
The new
TestGetLineEnding
class provides a thorough set of test cases for the_get_line_ending
function. It covers various scenarios including different line endings, file types (text, binary, compressed), and edge cases. This comprehensive approach ensures robust testing of the function's behavior across different situations.
Line range hint
122-277
: Improved test coverage forreverse_readline
The updates to
TestReverseReadline
class enhance the test coverage by including tests for different line endings, empty files, and both text and binary modes. This ensures that thereverse_readline
function works correctly across various scenarios.
Line range hint
279-370
: Enhanced test coverage forreverse_readfile
The updates to
TestReverseReadfile
class improve the test coverage by including tests for different line endings, empty files, and files with empty lines. This ensures that thereverse_readfile
function works correctly across various scenarios.
Pls fix the merge conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (5)
src/monty/io.py (3)
143-156
: Consider adding support for more file types.The function handles common file types but could be extended to support more compressed formats (e.g., xz/lzma) for consistency with
zopen
.Apply this diff to add support for more file types:
- elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File)): + elif isinstance(file, (io.BufferedReader, gzip.GzipFile, bz2.BZ2File, lzma.LZMAFile)):
157-169
: Consider adding support for classic Mac OS line endings.The function only handles Windows (
\r\n
) and Unix (\n
) line endings. While classic Mac OS line endings (\r
) are rare, supporting them would improve compatibility.Apply this diff to add support for classic Mac OS line endings:
if first_line.endswith(b"\r\n"): return "\r\n" if first_line.endswith(b"\n"): return "\n" + if first_line.endswith(b"\r"): + return "\r" # It's likely the line is missing a line ending for the first line - raise ValueError(f"Unknown line ending in line {repr(first_line)}.") + raise ValueError( + f"Unknown line ending in line {repr(first_line)}. " + "Supported line endings are: \\r\\n (Windows), \\n (Unix), \\r (Classic Mac OS)." + )
262-268
: Consider using a type guard for line ending length.The cast to
Literal[1, 2]
could be replaced with a type guard for better type safety.Apply this diff to improve type safety:
- len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end)) + def get_line_ending_length(l_end: Literal["\r\n", "\n"]) -> Literal[1, 2]: + if l_end == "\r\n": + return 2 + return 1 + len_l_end = get_line_ending_length(l_end)tests/test_io.py (2)
152-180
: Consider adding performance benchmarks.The big file test verifies functionality but doesn't measure performance. Consider adding benchmarks to ensure performance doesn't regress.
Add performance benchmarks using
timeit
:def test_big_file_performance(self, l_end): """Test performance of reading big files.""" import timeit setup_code = """ file_name = "big_file.txt" with open(file_name, "rb") as file: lines = list(reverse_readline(file, max_mem=4096)) """ # Measure time time_taken = timeit.timeit(setup=setup_code, number=100) assert time_taken < 1.0 # Adjust threshold as needed
244-274
: Consider testing mixed line endings.While the current implementation doesn't handle mixed line endings, it would be valuable to have a test that verifies the behavior (raising an error) when encountering mixed line endings.
Add a test for mixed line endings:
def test_mixed_line_endings(self): """Test behavior with mixed line endings.""" contents = ("Line1\n", "Line2\r\n", "Line3\n") file_name = "mixed_endings.txt" with ScratchDir("."): with open(file_name, "wb") as file: for line in contents: file.write(line.encode()) with pytest.raises(ValueError, match="Inconsistent line endings"): with open(file_name, "rb") as file: list(reverse_readline(file))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
src/monty/io.py
(5 hunks)src/monty/json.py
(1 hunks)src/monty/re.py
(1 hunks)tests/test_io.py
(4 hunks)tests/test_shutil.py
(1 hunks)tests/test_tempfile.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/test_shutil.py
- tests/test_tempfile.py
- src/monty/re.py
- src/monty/json.py
🧰 Additional context used
🪛 Ruff (0.8.0)
src/monty/io.py
173-173: Undefined name Iterator
(F821)
225-225: Undefined name Iterator
(F821)
tests/test_io.py
197-198: Use a single with
statement with multiple contexts instead of nested with
statements
(SIM117)
🪛 GitHub Check: codecov/patch
src/monty/io.py
[warning] 274-274: src/monty/io.py#L274
Added line #L274 was not covered by tests
🔇 Additional comments (9)
src/monty/io.py (6)
19-19
: LGTM!
The Literal
import is correctly placed and used for type hints in the new _get_line_ending
function.
116-142
: LGTM!
The function signature and docstring are well-defined. The docstring includes:
- Clear description of functionality
- POSIX standard reference
- Return values and error cases
- Warning cases
186-189
: LGTM!
The line ending handling is improved by using the new _get_line_ending
helper function.
222-225
: LGTM!
The type hints improve type safety by:
- Specifying supported file types
- Using
Iterator[str]
return type
🧰 Tools
🪛 Ruff (0.8.0)
225-225: Undefined name Iterator
(F821)
190-195
: 🛠️ Refactor suggestion
Consider optimizing memory usage for compressed files.
Loading the entire compressed file into memory with readlines()
could be inefficient for large files.
Consider processing compressed files in chunks:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
- for line in reversed(file.readlines()):
- # "readlines" would keep the line end character
- yield line.decode("utf-8")
+ lines = []
+ chunk_size = 1024 * 1024 # 1MB chunks
+ while True:
+ chunk = file.read(chunk_size)
+ if not chunk:
+ break
+ lines.extend(chunk.splitlines(keepends=True))
+ for line in reversed(lines):
+ yield line.decode("utf-8")
Likely invalid or redundant comment.
196-218
: 🛠️ Refactor suggestion
Consider improving error handling for memory mapping.
The broad except ValueError
could mask other issues. Consider checking file size before attempting memory mapping.
Apply this diff to improve error handling:
else:
+ file_size = os.path.getsize(filename)
+ if file_size == 0:
+ warnings.warn("File is empty.", stacklevel=2)
+ return
try:
- filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
- except ValueError:
+ filemap = mmap.mmap(file.fileno(), file_size, access=mmap.ACCESS_READ)
+ except mmap.error as e:
+ warnings.warn(f"Memory mapping failed: {e}", stacklevel=2)
return
Likely invalid or redundant comment.
tests/test_io.py (3)
25-120
: LGTM! Comprehensive test coverage.
The test class thoroughly covers:
- Different line endings (Unix, Windows)
- Various file types (text, binary, compressed)
- Edge cases (empty files, missing line endings)
- Error cases (unknown file types, unknown line endings)
202-243
: LGTM! Thorough testing of empty lines.
The test ensures empty lines are preserved across different file types and memory modes.
327-371
: LGTM! Consistent test coverage with reverse_readline
.
The tests ensure consistent behavior between reverse_readfile
and reverse_readline
for:
- Empty lines
- Different line endings
- Various file types
Thanks, I was about to fix these :) |
I. Fix line ending handling (mainly in Windows with
"\r\n"
)\r
for Windows1
preventsreverse_readline
from working for Windowsrstrip
):monty/src/monty/io.py
Lines 131 to 133 in 1270c7b
II. Performance benchmark
Compare with current implementation to make sure there's no performance regression, with Python 3.12, using script e4940e0.
Ubuntu 22.04-WSL2
Windows 11
III. Test downstream packages
pymatgen
: Fixmonty
imports, enhance test forOutcar
parser to cover uncompressed format materialsproject/pymatgen#4068