diff --git a/.devel/sphinx/install.md b/.devel/sphinx/install.md index 9f501e09..f5c03f8f 100644 --- a/.devel/sphinx/install.md +++ b/.devel/sphinx/install.md @@ -101,7 +101,7 @@ Some environment variables: -## Final Notes +## Getting Help If you do not manage to set up a successful build, do not hesitate to [file a bug report](https://github.com/gagolews/stringi/issues). diff --git a/.devel/sphinx/news.md b/.devel/sphinx/news.md index 24073d08..d8e62663 100644 --- a/.devel/sphinx/news.md +++ b/.devel/sphinx/news.md @@ -1,26 +1,36 @@ # Changelog -## 1.8.1 (2023-11-08) +## 1.8.1 (2023-11-09) * [GENERAL] ICU bundle updated to version 74.1 (Unicode 15.1, CLDR 44). -* [BACKWARD INCOMPATIBLE] [BUILD TIME] Support for Solaris has now been dropped. - The package is no longer shipped with the very outdated ICU55 bundle. +* [BACKWARD INCOMPATIBILITY] [BUILD TIME] Support for Solaris has now been + dropped. The package is no longer shipped with the very outdated ICU55 bundle. A compiler supporting at least C++11 as well as ICU >= 61 are now required. -* [BACKWARD INCOMPATIBLE] #469: Missing date-time fields in +* [BACKWARD INCOMPATIBILITY] #469: Missing date-time fields in `stri_datetime_parse` and `stri_datetime_create` now default to today's midnight local time. +* [BACKWARD INCOMPATIBILITY] Removed the long-deprecated and defunct + `fallback_encoding` parameter of `stri_read_lines` and the ellipsis + parameter of `stri_opts_collator`, `stri_opts_regex`, `stri_opts_fixed`, + and `stri_opts_brkiter`. + * [BUILD TIME] As per the suggestion of Prof. Brian Ripley, `icudt74l` (ICU data - little endian) is now included in the source tarball (compressed with xz to save space). This allows for building *stringi* on systems with no internet access. -* [NEW FEATURE] #476: A warning is emitted when selecting an unknown locale - for collation as it most likely indicates that a wrong resource is being - returned.
+* [NEW FEATURE] #476: In break iterator-, date-time-, and collator-based + operations (e.g., `stri_sort`), a warning is emitted when the *root* ICU + resource bundle is returned when using an *explicitly* requested locale. + This might happen when we pass an 'unknown' `locale` argument to these + functions. Note that when relying on the default `locale=NULL` argument, + no warning is emitted. In such a case, checking + if the default locale as returned by `stri_locale_get` is amongst + those listed in `stri_locale_list` is recommended. * [NEW FEATURE] The `C` locale identifier now resolves to `en_US_POSIX`. diff --git a/.devel/sphinx/rapi/about_locale.md b/.devel/sphinx/rapi/about_locale.md index dd8ca149..b342f04f 100644 --- a/.devel/sphinx/rapi/about_locale.md +++ b/.devel/sphinx/rapi/about_locale.md @@ -34,7 +34,7 @@ Your program should avoid changing the default locale. All locale-sensitive func One of many examples of locale-dependent services is the Collator, which performs a locale-aware string comparison. It is used for string comparing, ordering, sorting, and searching. See [`stri_opts_collator`](stri_opts_collator.md) for the description on how to tune its settings, and its `locale` argument in particular. -When choosing a resource bundle that is not available in the requested locale nor in its more general variants (e.g., \'es_ES\' vs \'es\'), a warning is emitted. +When choosing a resource bundle that is not available in the explicitly requested locale (but not when using the default locale) nor in its more general variants (e.g., \'es_ES\' vs \'es\'), a warning is emitted. Other locale-sensitive functions include, e.g., [`stri_trans_tolower`](stri_trans_casemap.md) (that does character case mapping).
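Reviewer note on the #476 entry: the explicit-versus-default locale behaviour described in the changelog could be exercised roughly as sketched below. This is a minimal, non-authoritative sketch. `catch_warnings` is a hypothetical helper written for this note (not part of stringi), and `"xx_YY"` is only an assumed example of a locale identifier that ICU has no resource bundle for. Whether a warning is actually raised depends on the installed stringi and ICU versions, so no output is asserted.

```r
library(stringi)

# Hypothetical helper (not part of stringi): evaluate an expression and
# return the messages of any warnings it raised, muffling them on the way.
catch_warnings <- function(expr) {
    msgs <- character(0)
    withCallingHandlers(
        expr,
        warning = function(w) {
            msgs <<- c(msgs, conditionMessage(w))
            invokeRestart("muffleWarning")
        }
    )
    msgs
}

# Explicitly requested locale; "xx_YY" is an assumed 'unknown' identifier.
# Per the changelog entry, this is the case where a warning may be emitted.
explicit <- catch_warnings(stri_sort(c("b", "a"), locale = "xx_YY"))

# Default locale (locale = NULL): per the changelog entry no warning is
# emitted, so the default locale is checked against the known ones manually.
default_known <- stri_locale_get() %in% stri_locale_list()
```

Here `explicit` holds whatever warning messages were raised (possibly none), and `default_known` is `TRUE` when the default locale is among those reported by `stri_locale_list()`.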
diff --git a/.devel/sphinx/rapi/stri_datetime_add.md b/.devel/sphinx/rapi/stri_datetime_add.md index b0764572..a5c287a0 100644 --- a/.devel/sphinx/rapi/stri_datetime_add.md +++ b/.devel/sphinx/rapi/stri_datetime_add.md @@ -68,7 +68,7 @@ print(x) ``` ``` -## [1] "2024-01-08 17:00:48 AEDT" +## [1] "2024-01-09 11:24:49 AEDT" ``` ```r @@ -76,7 +76,7 @@ stri_datetime_add(x, -2, units='months') ``` ``` -## [1] "2023-11-08 17:00:48 AEDT" +## [1] "2023-11-09 11:24:49 AEDT" ``` ```r diff --git a/.devel/sphinx/rapi/stri_datetime_create.md b/.devel/sphinx/rapi/stri_datetime_create.md index 2271b27d..4dbbf9e1 100644 --- a/.devel/sphinx/rapi/stri_datetime_create.md +++ b/.devel/sphinx/rapi/stri_datetime_create.md @@ -96,5 +96,5 @@ stri_datetime_create(hour=15, minute=59) ``` ``` -## [1] "2023-11-08 15:59:00 AEDT" +## [1] "2023-11-09 15:59:00 AEDT" ``` diff --git a/.devel/sphinx/rapi/stri_datetime_fields.md b/.devel/sphinx/rapi/stri_datetime_fields.md index 38cc2c08..1c75bd3b 100644 --- a/.devel/sphinx/rapi/stri_datetime_fields.md +++ b/.devel/sphinx/rapi/stri_datetime_fields.md @@ -77,9 +77,9 @@ stri_datetime_fields(stri_datetime_now()) ``` ## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth -## 1 2023 11 8 17 0 49 139 46 2 +## 1 2023 11 9 11 24 49 982 46 2 ## DayOfYear DayOfWeek Hour12 AmPm Era -## 1 312 4 5 2 2 +## 1 313 5 11 1 2 ``` ```r @@ -88,9 +88,9 @@ stri_datetime_fields(stri_datetime_now(), locale='@calendar=hebrew') ``` ## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth -## 1 5784 2 24 17 0 49 148 9 4 +## 1 5784 2 25 11 24 49 986 9 4 ## DayOfYear DayOfWeek Hour12 AmPm Era -## 1 54 4 5 2 1 +## 1 55 5 11 1 1 ``` ```r diff --git a/.devel/sphinx/rapi/stri_datetime_format.md b/.devel/sphinx/rapi/stri_datetime_format.md index ac5933d1..26679e31 100644 --- a/.devel/sphinx/rapi/stri_datetime_format.md +++ b/.devel/sphinx/rapi/stri_datetime_format.md @@ -221,5 +221,5 @@ stri_datetime_format(stri_datetime_now(), 
'datetime_relative_medium') ``` ``` -## [1] "today, 5:00:49 pm" +## [1] "today, 11:24:50 am" ``` diff --git a/.devel/sphinx/rapi/stri_enc_detect2.md b/.devel/sphinx/rapi/stri_enc_detect2.md index 52f70bcc..b8187cf2 100644 --- a/.devel/sphinx/rapi/stri_enc_detect2.md +++ b/.devel/sphinx/rapi/stri_enc_detect2.md @@ -12,10 +12,10 @@ stri_enc_detect2(str, locale = NULL) ## Arguments -| | | -|----------|-------------------------------------------------------------------------------------------------------------------------| -| `str` | character vector, a raw vector, or a list of `raw` vectors | -| `locale` | `NULL` or `''` for default locale, `NA` for just checking the UTF-\* family, or a single string with locale identifier. | +| | | +|----------|-----------------------------------------------------------------------------------| +| `str` | character vector, a raw vector, or a list of `raw` vectors | +| `locale` | `NULL` or `''` for the default locale, or a single string with locale identifier. | ## Details diff --git a/.devel/sphinx/rapi/stri_opts_brkiter.md b/.devel/sphinx/rapi/stri_opts_brkiter.md index 79747583..8dc19461 100644 --- a/.devel/sphinx/rapi/stri_opts_brkiter.md +++ b/.devel/sphinx/rapi/stri_opts_brkiter.md @@ -18,8 +18,7 @@ stri_opts_brkiter( skip_line_soft, skip_line_hard, skip_sentence_term, - skip_sentence_sep, - ... + skip_sentence_sep ) ``` @@ -38,7 +37,6 @@ stri_opts_brkiter( | `skip_line_hard` | logical; perform no action for hard, or mandatory line breaks | | `skip_sentence_term` | logical; perform no action for sentences ending with a sentence terminator (\'`.`\', \'`,`\', \'`?`\', \'`!`\'), possibly followed by a hard separator (`CR`, `LF`, `PS`, etc.) 
| | `skip_sentence_sep` | logical; perform no action for sentences that do not contain an ending sentence terminator, but are ended by a hard separator or end of input | -| `...` | \[DEPRECATED\] any other arguments passed to this function generate a warning; this argument will be removed in the future | ## Details diff --git a/.devel/sphinx/rapi/stri_opts_collator.md b/.devel/sphinx/rapi/stri_opts_collator.md index b3158cff..b031d7a0 100644 --- a/.devel/sphinx/rapi/stri_opts_collator.md +++ b/.devel/sphinx/rapi/stri_opts_collator.md @@ -16,8 +16,7 @@ stri_opts_collator( case_level = FALSE, normalization = FALSE, normalisation = normalization, - numeric = FALSE, - ... + numeric = FALSE ) stri_coll( @@ -29,8 +28,7 @@ stri_coll( case_level = FALSE, normalization = FALSE, normalisation = normalization, - numeric = FALSE, - ... + numeric = FALSE ) ``` @@ -47,7 +45,6 @@ stri_coll( | `normalization` | single logical value; if `TRUE`, then incremental check is performed to see whether the input data is in the FCD form. If the data is not in the FCD form, incremental NFD normalization is performed | | `normalisation` | alias of `normalization` | | `numeric` | single logical value; when turned on, this attribute generates a collation key for the numeric value of substrings of digits; this is a way to get \'100\' to sort AFTER \'2\'; note that negative or non-integer numbers will not be ordered properly | -| `...` | \[DEPRECATED\] any other arguments passed to this function generate a warning; this argument will be removed in the future | ## Details diff --git a/.devel/sphinx/rapi/stri_opts_fixed.md b/.devel/sphinx/rapi/stri_opts_fixed.md index 583454eb..f67e9d01 100644 --- a/.devel/sphinx/rapi/stri_opts_fixed.md +++ b/.devel/sphinx/rapi/stri_opts_fixed.md @@ -7,16 +7,15 @@ A convenience function used to tune up the behavior of `stri_*_fixed` functions, ## Usage ``` r -stri_opts_fixed(case_insensitive = FALSE, overlap = FALSE, ...) 
+stri_opts_fixed(case_insensitive = FALSE, overlap = FALSE) ``` ## Arguments -| | | -|--------------------|----------------------------------------------------------------------------------------------------------------------------| -| `case_insensitive` | logical; enable simple case insensitive matching | -| `overlap` | logical; enable overlapping matches\' detection | -| `...` | \[DEPRECATED\] any other arguments passed to this function generate a warning; this argument will be removed in the future | +| | | +|--------------------|--------------------------------------------------| +| `case_insensitive` | logical; enable simple case insensitive matching | +| `overlap` | logical; enable overlapping matches\' detection | ## Details diff --git a/.devel/sphinx/rapi/stri_opts_regex.md b/.devel/sphinx/rapi/stri_opts_regex.md index f69464c2..5f2d6eff 100644 --- a/.devel/sphinx/rapi/stri_opts_regex.md +++ b/.devel/sphinx/rapi/stri_opts_regex.md @@ -19,8 +19,7 @@ stri_opts_regex( uword, error_on_unknown_escapes, time_limit = 0L, - stack_limit = 0L, - ... 
+ stack_limit = 0L ) ``` @@ -40,7 +39,6 @@ stri_opts_regex( | `error_on_unknown_escapes` | logical; whether to generate an error on unrecognized backslash escapes; if set, fail with an error on patterns that contain backslash-escaped ASCII letters without a known special meaning; otherwise, these escaped letters represent themselves | | `time_limit` | integer; processing time limit, in \~milliseconds (but not precisely so, depends on the CPU speed), for match operations; setting a limit is desirable if poorly written regexes are expected on input; 0 for no limit | | `stack_limit` | integer; maximal size, in bytes, of the heap storage available for the match backtracking stack; setting a limit is desirable if poorly written regexes are expected on input; 0 for no limit | -| `...` | \[DEPRECATED\] any other arguments passed to this function generate a warning; this argument will be removed in the future | ## Details diff --git a/.devel/sphinx/rapi/stri_rand_lipsum.md b/.devel/sphinx/rapi/stri_rand_lipsum.md index 6d213ca7..05bb04b5 100644 --- a/.devel/sphinx/rapi/stri_rand_lipsum.md +++ b/.devel/sphinx/rapi/stri_rand_lipsum.md @@ -16,7 +16,7 @@ stri_rand_lipsum(n_paragraphs, start_lipsum = TRUE, nparagraphs = n_paragraphs) |----------------|------------------------------------------------------------------------------------------| | `n_paragraphs` | single integer, number of paragraphs to generate | | `start_lipsum` | single logical value; should the resulting text start with *Lorem ipsum dolor sit amet*? 
| -| `nparagraphs` | deprecated alias of `n_paragraphs` | +| `nparagraphs` | \[DEPRECATED\] alias of `n_paragraphs` | ## Details diff --git a/.devel/sphinx/rapi/stri_read_lines.md b/.devel/sphinx/rapi/stri_read_lines.md index ec121e44..8ae3ea0a 100644 --- a/.devel/sphinx/rapi/stri_read_lines.md +++ b/.devel/sphinx/rapi/stri_read_lines.md @@ -12,12 +12,11 @@ stri_read_lines(con, encoding = NULL, fname = con, fallback_encoding = NULL) ## Arguments -| | | -|---------------------|---------------------------------------------------------------------------------| -| `con` | name of the output file or a connection object (opened in the binary mode) | -| `encoding` | single string; input encoding; `NULL` or `''` for the current default encoding. | -| `fname` | deprecated alias of `con` | -| `fallback_encoding` | deprecated argument, no longer used | +| | | +|------------|---------------------------------------------------------------------------------| +| `con` | name of the output file or a connection object (opened in the binary mode) | +| `encoding` | single string; input encoding; `NULL` or `''` for the current default encoding. 
| +| `fname` | \[DEPRECATED\] alias of `con` | ## Details diff --git a/.devel/sphinx/rapi/stri_read_raw.md b/.devel/sphinx/rapi/stri_read_raw.md index 77f9d3c0..a9319486 100644 --- a/.devel/sphinx/rapi/stri_read_raw.md +++ b/.devel/sphinx/rapi/stri_read_raw.md @@ -15,7 +15,7 @@ stri_read_raw(con, fname = con) | | | |---------|----------------------------------------------------------------------------| | `con` | name of the output file or a connection object (opened in the binary mode) | -| `fname` | deprecated alias of `con` | +| `fname` | \[DEPRECATED\] alias of `con` | ## Details diff --git a/.devel/sphinx/rapi/stri_sprintf.md b/.devel/sphinx/rapi/stri_sprintf.md index fa94534a..0a357fc7 100644 --- a/.devel/sphinx/rapi/stri_sprintf.md +++ b/.devel/sphinx/rapi/stri_sprintf.md @@ -188,7 +188,7 @@ stri_sprintf("UNIX time %1$f is %1$s.", Sys.time()) ``` ``` -## [1] "UNIX time 1699423258.691701 is 2023-11-08 17:00:58.691701." +## [1] "UNIX time 1699489499.508911 is 2023-11-09 11:24:59.508911." ``` ```r @@ -213,7 +213,7 @@ stri_sprintf("%1$s is %1$f UNIX time.", Sys.time()) # re-coercion needed ``` ``` -## [1] "2023-11-08 17:00:58.69345 is 1699423258.693450 UNIX time." +## [1] "2023-11-09 11:24:59.510592 is 1699489499.510592 UNIX time." 
``` ```r diff --git a/.devel/sphinx/rapi/stri_write_lines.md b/.devel/sphinx/rapi/stri_write_lines.md index 7c0dad5d..10068821 100644 --- a/.devel/sphinx/rapi/stri_write_lines.md +++ b/.devel/sphinx/rapi/stri_write_lines.md @@ -24,7 +24,7 @@ stri_write_lines( | `con` | name of the output file or a connection object (opened in the binary mode) | | `encoding` | output encoding, `NULL` or `''` for the current default one | | `sep` | newline separator | -| `fname` | deprecated alias of `con` | +| `fname` | \[DEPRECATED\] alias of `con` | ## Details diff --git a/.devel/tinytest/test-count-coll.R b/.devel/tinytest/test-count-coll.R index 61853f46..d6dc09a2 100644 --- a/.devel/tinytest/test-count-coll.R +++ b/.devel/tinytest/test-count-coll.R @@ -33,7 +33,7 @@ expect_equivalent(stri_count_coll("bababababaab", "aab", opts_collator = stri_op 1L) old_loc <- stri_locale_set("UNKNOWN") -expect_warning(stri_count_coll("bababababaab", "aab")) +expect_equivalent(stri_count_coll("bababababaab", "aab"), 1L) stri_locale_set(old_loc) expect_equivalent(stri_count_coll("bababababaab", "aab", opts_collator = stri_opts_collator(locale = "C")), diff --git a/.devel/tinytest/test-enc-detect.R b/.devel/tinytest/test-enc-detect.R index 2a51f53f..436f32ae 100644 --- a/.devel/tinytest/test-enc-detect.R +++ b/.devel/tinytest/test-enc-detect.R @@ -124,108 +124,4 @@ expect_equivalent(stri_enc_isascii(x1), !c(T, NA, T, T, T)) expect_equivalent(stri_enc_isutf8(x2), c(T, NA, T, T, T)) expect_equivalent(stri_enc_isutf8(x1), c(T, NA, T, T, T)) - - expect_equivalent(stri_enc_detect(as.raw(c(65:100)))[[1]]$Encoding[1], "UTF-8") - -expect_equivalent(stri_enc_detect2("abc")[[1]]$Encoding, "US-ASCII") - -expect_error(stri_enc_detect2("abc", encodings=c("don't know what's that"))) - -expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-16", to_raw=TRUE))[[1]]$Encoding, "UTF-16") -expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-16LE", to_raw=TRUE))[[1]]$Encoding, "UTF-16LE") 
-expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-16BE", to_raw=TRUE))[[1]]$Encoding, "UTF-16BE") -expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-32", to_raw=TRUE))[[1]]$Encoding, "UTF-32") -expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-32LE", to_raw=TRUE))[[1]]$Encoding, "UTF-32LE") -expect_equivalent(stri_enc_detect2(stri_encode("abc", "UTF-8", "UTF-32BE", to_raw=TRUE))[[1]]$Encoding, "UTF-32BE") - -text <- stri_flatten(stri_enc_fromutf32(65:127)) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-8", to_raw=TRUE))[[1]]$Encoding, "US-ASCII") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16", to_raw=TRUE))[[1]]$Encoding, "UTF-16") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16LE", to_raw=TRUE))[[1]]$Encoding, "UTF-16LE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16BE", to_raw=TRUE))[[1]]$Encoding, "UTF-16BE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32", to_raw=TRUE))[[1]]$Encoding, "UTF-32") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32LE", to_raw=TRUE))[[1]]$Encoding, "UTF-32LE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32BE", to_raw=TRUE))[[1]]$Encoding, "UTF-32BE") - -text <- stri_flatten(stri_enc_fromutf32(65:1024)) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-8", to_raw=TRUE))[[1]]$Encoding, "UTF-8") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16", to_raw=TRUE))[[1]]$Encoding, "UTF-16") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16LE", to_raw=TRUE))[[1]]$Encoding, "UTF-16LE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16BE", to_raw=TRUE))[[1]]$Encoding, "UTF-16BE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32", to_raw=TRUE))[[1]]$Encoding, "UTF-32") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", 
"UTF-32LE", to_raw=TRUE))[[1]]$Encoding, "UTF-32LE") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32BE", to_raw=TRUE))[[1]]$Encoding, "UTF-32BE") - -if (file.exists('datasets/CS_utf8.txt')) { - path <- 'datasets' -} else if (file.exists('../datasets/CS_utf8.txt')) { - path <- '../datasets' -} else if (file.exists('../../datasets/CS_utf8.txt')) { - path <- '../../datasets' -} - -fnames <- c(file.path(path, 'CS_utf8.txt'), - file.path(path, 'DE_utf8.txt'), - file.path(path, 'PL_utf8.txt'), - file.path(path, 'ES_utf8.txt'), - file.path(path, 'RU_utf8.txt') -# file.path(path, 'TH_utf8.txt') - ) - -for (f in fnames) { - text <- stri_read_raw(f) - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-8", to_raw=TRUE))[[1]]$Encoding, "UTF-8") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16", to_raw=TRUE))[[1]]$Encoding, "UTF-16") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16LE", to_raw=TRUE))[[1]]$Encoding, "UTF-16LE") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-16BE", to_raw=TRUE))[[1]]$Encoding, "UTF-16BE") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32", to_raw=TRUE))[[1]]$Encoding, "UTF-32") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32LE", to_raw=TRUE))[[1]]$Encoding, "UTF-32LE") - expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-32BE", to_raw=TRUE))[[1]]$Encoding, "UTF-32BE") -} - -text <- stri_read_raw(file.path(path, 'PL_utf8.txt')) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "cp1250", to_raw=TRUE), - "pl_PL")[[1]]$Encoding[1], "windows-1250") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "latin2", to_raw=TRUE), - "pl_PL")[[1]]$Encoding[1], "ISO-8859-2") -#expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "cp852", to_raw=TRUE), -# "pl_PL")[[1]]$Encoding[1], "cp852") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", 
"x-mac-centraleurroman", to_raw=TRUE), - "pl_PL")[[1]]$Encoding[1], "x-mac-centraleurroman") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-8", to_raw=TRUE), - "pl_PL")[[1]]$Encoding[1], "UTF-8") - -text <- stri_read_raw(file.path(path, 'CS_utf8.txt')) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "cp1250", to_raw=TRUE), - "cs_CZ")[[1]]$Encoding[1], "windows-1250") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "latin2", to_raw=TRUE), - "cs_CZ")[[1]]$Encoding[1], "ISO-8859-2") -#expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "cp852", to_raw=TRUE), -# "cs_CZ")[[1]]$Encoding[1], "cp852") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "x-mac-centraleurroman", to_raw=TRUE), - "cs_CZ")[[1]]$Encoding[1], "x-mac-centraleurroman") -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "UTF-8", to_raw=TRUE), - "cs_CZ")[[1]]$Encoding[1], "UTF-8") - -text <- stri_read_raw(file.path(path, 'DE_utf8.txt')) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "latin1", to_raw=TRUE), - "de_DE")[[1]]$Encoding[1], "ISO-8859-1") # windows-1250 is a superset -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "utf-8", to_raw=TRUE), - "de_DE")[[1]]$Encoding[1], "UTF-8") - -text <- stri_read_raw(file.path(path, 'ES_utf8.txt')) -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "latin1", to_raw=TRUE), - "es_ES")[[1]]$Encoding[1], "ISO-8859-1") # windows-1250 is a superset -expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "utf-8", to_raw=TRUE), - "es_ES")[[1]]$Encoding[1], "UTF-8") - -# distinguishing between KOI8-R and Windows-1251 is not so easy -#text <- stri_encode(stri_read_raw(file.path(path, 'RU_utf8.txt')), "UTF-8", "UTF-8") -#expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "KOI8-R", to_raw=TRUE), -# "ru_RU")[[1]]$Encoding[1], "KOI8-R") -#expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "windows-1251", to_raw=TRUE), 
-# "ru_RU")[[1]]$Encoding[1], "windows-1251") -#expect_equivalent(stri_enc_detect2(stri_encode(text, "UTF-8", "utf-8", to_raw=TRUE), -# "ru_RU")[[1]]$Encoding[1], "UTF-8") - diff --git a/DESCRIPTION b/DESCRIPTION index 3c654f35..a5b16622 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: stringi Version: 1.8.1 -Date: 2023-11-08 +Date: 2023-11-09 Title: Fast and Portable Character String Processing Facilities Description: A collection of character string/text/natural language processing tools for pattern searching (e.g., with 'Java'-like regular diff --git a/INSTALL b/INSTALL index 9f501e09..f5c03f8f 100644 --- a/INSTALL +++ b/INSTALL @@ -101,7 +101,7 @@ Some environment variables: -## Final Notes +## Getting Help If you do not manage to set up a successful build, do not hesitate to [file a bug report](https://github.com/gagolews/stringi/issues). diff --git a/NEWS b/NEWS index 24073d08..d8e62663 100644 --- a/NEWS +++ b/NEWS @@ -1,26 +1,36 @@ # Changelog -## 1.8.1 (2023-11-08) +## 1.8.1 (2023-11-09) * [GENERAL] ICU bundle updated to version 74.1 (Unicode 15.1, CLDR 44). -* [BACKWARD INCOMPATIBLE] [BUILD TIME] Support for Solaris has now been dropped. - The package is no longer shipped with the very outdated ICU55 bundle. +* [BACKWARD INCOMPATIBILITY] [BUILD TIME] Support for Solaris has now been + dropped. The package is no longer shipped with the very outdated ICU55 bundle. A compiler supporting at least C++11 as well as ICU >= 61 are now required. -* [BACKWARD INCOMPATIBLE] #469: Missing date-time fields in +* [BACKWARD INCOMPATIBILITY] #469: Missing date-time fields in `stri_datetime_parse` and `stri_datetime_create` now default to today's midnight local time. +* [BACKWARD INCOMPATIBILITY] Removed the long-deprecated and defunct + `fallback_encoding` parameter of `stri_read_lines` and the ellipsis + parameter of `stri_opts_collator`, `stri_opts_regex`, `stri_opts_fixed`, + and `stri_opts_brkiter`. + * [BUILD TIME] As per the suggestion of Prof.
Brian Ripley, `icudt74l` (ICU data - little endian) is now included in the source tarball (compressed with xz to save space). This allows for building *stringi* on systems with no internet access. -* [NEW FEATURE] #476: A warning is emitted when selecting an unknown locale - for collation as it most likely indicates that a wrong resource is being - returned. +* [NEW FEATURE] #476: In break iterator-, date-time-, and collator-based + operations (e.g., `stri_sort`), a warning is emitted when the *root* ICU + resource bundle is returned when using an *explicitly* requested locale. + This might happen when we pass an 'unknown' `locale` argument to these + functions. Note that when relying on the default `locale=NULL` argument, + no warning is emitted. In such a case, checking + if the default locale as returned by `stri_locale_get` is amongst + those listed in `stri_locale_list` is recommended. * [NEW FEATURE] The `C` locale identifier now resolves to `en_US_POSIX`. diff --git a/R/encoding_detection.R b/R/encoding_detection.R index 79cb8699..7e61763c 100644 --- a/R/encoding_detection.R +++ b/R/encoding_detection.R @@ -311,11 +311,9 @@ stri_enc_detect <- function(str, filter_angle_brackets = FALSE) #' \code{\link{stri_enc_detect}} (uses \pkg{ICU} facilities). #' #' @param str character vector, a raw vector, or -#' a list of \code{raw} vectors -#' @param locale \code{NULL} or \code{''} -#' for default locale, -#' \code{NA} for just checking the UTF-* family, -#' or a single string with locale identifier. +#' a list of \code{raw} vectors +#' @param locale \code{NULL} or \code{''} for the default locale, +#' or a single string with locale identifier.
#' #' @return #' Just like \code{\link{stri_enc_detect}}, @@ -335,6 +333,7 @@ stri_enc_detect <- function(str, filter_angle_brackets = FALSE) #' @export stri_enc_detect2 <- function(str, locale = NULL) { + warning("stri_enc_detect2 is deprecated and will be removed in a future release of 'stringi'.") suppressWarnings(lapply(.Call(C_stri_enc_detect2, str, locale), as.data.frame, stringsAsFactors = FALSE)) } diff --git a/R/files.R b/R/files.R index 9669f272..dcb0a9a9 100644 --- a/R/files.R +++ b/R/files.R @@ -46,7 +46,7 @@ #' #' @param con name of the output file or a connection object #' (opened in the binary mode) -#' @param fname deprecated alias of \code{con} +#' @param fname [DEPRECATED] alias of \code{con} #' #' @return #' Returns a vector of type \code{raw}. @@ -56,7 +56,7 @@ stri_read_raw <- function(con, fname = con) { if (!missing(fname) && missing(con)) { # DEPRECATED - # TODO: warning + warning("The 'fname' argument in stri_read_raw is a deprecated alias of 'con' and will be removed in a future release of 'stringi'.") con <- fname } @@ -102,8 +102,7 @@ stri_read_raw <- function(con, fname = con) #' (opened in the binary mode) #' @param encoding single string; input encoding; #' \code{NULL} or \code{''} for the current default encoding. -#' @param fname deprecated alias of \code{con} -#' @param fallback_encoding deprecated argument, no longer used +#' @param fname [DEPRECATED] alias of \code{con} #' #' @return #' Returns a character vector, each text line is a separate string. 
@@ -115,14 +114,10 @@ stri_read_lines <- function(con, encoding = NULL, fname = con, fallback_encoding = NULL) { if (!missing(fname) && missing(con)) { # DEPRECATED - # TODO: warning + warning("The 'fname' argument in stri_read_lines is a deprecated alias of 'con' and will be removed in a future release of 'stringi'.") con <- fname } - if (!missing(fallback_encoding)) { # DEPRECATED - warning('`fallback_encoding` is no longer used and has been scheduled for removal') - } - stopifnot(is.null(encoding) || is.character(encoding)) if (is.null(encoding) || encoding == "") @@ -158,7 +153,7 @@ stri_read_lines <- function(con, encoding = NULL, #' @param encoding output encoding, \code{NULL} or \code{''} for #' the current default one #' @param sep newline separator -#' @param fname deprecated alias of \code{con} +#' @param fname [DEPRECATED] alias of \code{con} #' #' @return #' This function returns nothing noteworthy. @@ -171,7 +166,7 @@ stri_write_lines <- function(str, con, fname = con) { if (!missing(fname) && missing(con)) { # DEPRECATED - # TODO: warning + warning("The 'fname' argument in stri_write_lines is a deprecated alias of 'con' and will be removed in a future release of 'stringi'.") con <- fname } diff --git a/R/locale.R b/R/locale.R index 2757dfd8..b902c31d 100644 --- a/R/locale.R +++ b/R/locale.R @@ -106,6 +106,8 @@ #' behavior of \pkg{stringi} while using a modified default locale. #' #' +#' +#' #' @section Locale-Sensitive Functions in \pkg{stringi}: #' #' One of many examples of locale-dependent services is the Collator, which @@ -114,8 +116,9 @@ #' for the description on how to tune its settings, and its \code{locale} #' argument in particular. 
 #'
-#' When choosing a resource bundle that is not available in the requested
-#' locale nor in its more general variants (e.g., `es_ES` vs `es`),
+#' When choosing a resource bundle that is not available in the explicitly
+#' requested locale (but not when using the default locale)
+#' nor in its more general variants (e.g., `es_ES` vs `es`),
 #' a warning is emitted.
 #'
 #' Other locale-sensitive functions include, e.g.,
diff --git a/R/opts.R b/R/opts.R
index 4d8a2613..a5c01bba 100644
--- a/R/opts.R
+++ b/R/opts.R
@@ -79,8 +79,6 @@
 #' the numeric value of substrings of digits;
 #' this is a way to get '100' to sort AFTER '2';
 #' note that negative or non-integer numbers will not be ordered properly
-#' @param ... [DEPRECATED] any other arguments passed to this function
-#' generate a warning; this argument will be removed in the future
 #'
 #' @return
 #' Returns a named list object; missing settings are left with default values.
@@ -108,12 +106,11 @@
 #' stri_cmp('number100', 'number2', numeric=TRUE) # equivalent
 #' stri_cmp('above mentioned', 'above-mentioned')
 #' stri_cmp('above mentioned', 'above-mentioned', alternate_shifted=TRUE)
-stri_opts_collator <- function(locale = NULL, strength = 3L, alternate_shifted = FALSE,
+stri_opts_collator <- function(
+    locale = NULL, strength = 3L, alternate_shifted = FALSE,
     french = FALSE, uppercase_first = NA, case_level = FALSE, normalization = FALSE,
-    normalisation = normalization, numeric = FALSE, ...) {
-    if (!missing(...))
-        warning("Unknown option to `stri_opts_collator`.")
-
+    normalisation = normalization, numeric = FALSE
+) {
     opts <- list()
     if (!missing(locale))
         opts["locale"] <- locale
@@ -192,8 +189,6 @@ stri_coll <- stri_opts_collator
 #' @param stack_limit integer; maximal size, in bytes, of the heap storage available
 #' for the match backtracking stack; setting a limit is desirable if poorly
 #' written regexes are expected on input; 0 for no limit
-#' @param ... [DEPRECATED] any other arguments passed to this function
-#' generate a warning; this argument will be removed in the future
 #'
 #' @return
 #' Returns a named list object; missing settings are left with default values.
@@ -214,17 +209,14 @@ stri_coll <- stri_opts_collator
 #' stri_detect_regex('ala', 'ALA', opts_regex=stri_opts_regex(case_insensitive=TRUE))
 #' stri_detect_regex('ala', 'ALA', case_insensitive=TRUE) # equivalent
 #' stri_detect_regex('ala', '(?i)ALA') # equivalent
-stri_opts_regex <- function(case_insensitive, comments,
+stri_opts_regex <- function(
+    case_insensitive, comments,
     dotall, dot_all = dotall,
     literal,
     multiline, multi_line = multiline,
     unix_lines, uword, error_on_unknown_escapes,
-    time_limit = 0L, stack_limit = 0L,
-    ...)
-{
-    if (!missing(...))
-        warning("Unknown option to `stri_opts_regex`.")
-
+    time_limit = 0L, stack_limit = 0L
+) {
     opts <- list()
     if (!missing(case_insensitive))
         opts["case_insensitive"] <- case_insensitive
@@ -303,8 +295,6 @@ stri_opts_regex <- function(case_insensitive, comments,
 #' @param skip_sentence_sep logical; perform no action for sentences
 #' that do not contain an ending sentence terminator, but are ended
 #' by a hard separator or end of input
-#' @param ... [DEPRECATED] any other arguments passed to this function
-#' generate a warning; this argument will be removed in the future
 #'
 #' @return
 #' Returns a named list object.
@@ -319,13 +309,11 @@ stri_opts_regex <- function(case_insensitive, comments,
 #'
 #' \emph{Boundary Analysis} -- ICU User Guide,
 #' \url{https://unicode-org.github.io/icu/userguide/boundaryanalysis/}
-stri_opts_brkiter <- function(type, locale, skip_word_none, skip_word_number,
+stri_opts_brkiter <- function(
+    type, locale, skip_word_none, skip_word_number,
     skip_word_letter, skip_word_kana, skip_word_ideo, skip_line_soft,
-    skip_line_hard, skip_sentence_term, skip_sentence_sep, ...)
-{
-    if (!missing(...))
-        warning("Unknown option to `stri_opts_brkiter`.")
-
+    skip_line_hard, skip_sentence_term, skip_sentence_sep
+) {
     opts <- list()
     if (!missing(type))
         opts["type"] <- type
@@ -373,8 +361,6 @@ stri_opts_brkiter <- function(type, locale, skip_word_none, skip_word_number,
 #'
 #' @param case_insensitive logical; enable simple case insensitive matching
 #' @param overlap logical; enable overlapping matches' detection
-#' @param ... [DEPRECATED] any other arguments passed to this function
-#' generate a warning; this argument will be removed in the future
 #'
 #' @return
 #' Returns a named list object.
@@ -390,11 +376,8 @@ stri_opts_brkiter <- function(type, locale, skip_word_none, skip_word_number,
 #' stri_detect_fixed('ala', 'ALA') # case-sensitive by default
 #' stri_detect_fixed('ala', 'ALA', opts_fixed=stri_opts_fixed(case_insensitive=TRUE))
 #' stri_detect_fixed('ala', 'ALA', case_insensitive=TRUE) # equivalent
-stri_opts_fixed <- function(case_insensitive = FALSE, overlap = FALSE, ...)
+stri_opts_fixed <- function(case_insensitive = FALSE, overlap = FALSE)
 {
-    if (!missing(...))
-        warning("Unknown option to `stri_opts_fixed`.")
-
     opts <- list()
     if (!missing(case_insensitive))
         opts["case_insensitive"] <- case_insensitive
diff --git a/R/random.R b/R/random.R
index d7a3a602..2fdc9aad 100644
--- a/R/random.R
+++ b/R/random.R
@@ -131,7 +131,7 @@ stri_rand_strings <- function(n, length, pattern = "[A-Za-z0-9]")
 #' @param n_paragraphs single integer, number of paragraphs to generate
 #' @param start_lipsum single logical value; should the resulting
 #' text start with \emph{Lorem ipsum dolor sit amet}?
-#' @param nparagraphs deprecated alias of \code{n_paragraphs}
+#' @param nparagraphs [DEPRECATED] alias of \code{n_paragraphs}
 #'
 #' @return Returns a character vector of length \code{n_paragraphs}.
 #'
@@ -147,7 +147,7 @@ stri_rand_lipsum <- function(n_paragraphs, start_lipsum = TRUE,
     nparagraphs=n_paragraphs)
 {
     if (!missing(nparagraphs) && missing(n_paragraphs)) { # DEPRECATED
-        # TODO: warning
+        warning("The 'nparagraphs' argument in stri_rand_lipsum is a deprecated alias of 'n_paragraphs' and will be removed in a future release of 'stringi'.")
         n_paragraphs <- nparagraphs
     }
 
diff --git a/R/sort.R b/R/sort.R
index 9ed29312..0117bf09 100644
--- a/R/sort.R
+++ b/R/sort.R
@@ -260,8 +260,10 @@ stri_unique <- function(str, ..., opts_collator = NULL)
 #' @export
 stri_duplicated <- function(str, from_last = FALSE,
     fromLast = from_last, ..., opts_collator = NULL) {
-    if (!missing(fromLast)) # DEPRECATED
+    if (!missing(fromLast)) {
+        warning("The 'fromLast' argument in stri_duplicated is a deprecated alias of 'from_last' and will be removed in a future release of 'stringi'.")
         from_last <- fromLast
+    }
     if (!missing(...))
         opts_collator <- do.call(stri_opts_collator, as.list(c(opts_collator, ...)))
     .Call(C_stri_duplicated, str, from_last, opts_collator)
@@ -272,8 +274,10 @@ stri_duplicated <- function(str, from_last = FALSE,
 #' @export
 stri_duplicated_any <- function(str, from_last = FALSE,
     fromLast = from_last, ..., opts_collator = NULL) {
-    if (!missing(fromLast)) # DEPRECATED
+    if (!missing(fromLast)) { # DEPRECATED
+        warning("The 'fromLast' argument in stri_duplicated_any is a deprecated alias of 'from_last' and will be removed in a future release of 'stringi'.")
         from_last <- fromLast
+    }
     if (!missing(...))
         opts_collator <- do.call(stri_opts_collator, as.list(c(opts_collator, ...)))
     .Call(C_stri_duplicated_any, str, from_last, opts_collator)
diff --git a/docs/genindex.html b/docs/genindex.html
index 880ce91e..d495d6a7 100644
--- a/docs/genindex.html
+++ b/docs/genindex.html
@@ -357,7 +357,7 @@
--with-extra-libs
.
-If you do not manage to set up a successful build, do not hesitate to file a bug report. However, please check the list of archived (closed) issues first – @@ -470,7 +470,7 @@
[GENERAL] ICU bundle updated to version 74.1 (Unicode 15.1, CLDR 44).
[BACKWARD INCOMPATIBLE] [BUILD TIME] Support for Solaris has now been dropped. -The package is no longer shipped with the very outdated ICU55 bundle. +
[BACKWARD INCOMPATIBILITY] [BUILD TIME] Support for Solaris has now been +dropped. The package is no longer shipped with the very outdated ICU55 bundle. A compiler supporting at least C++11 as well as ICU >= 61 are now required.
[BACKWARD INCOMPATIBLE] #469: Missing date-time fields in +
[BACKWARD INCOMPATIBILITY] #469: Missing date-time fields in
stri_datetime_parse
and stri_datetime_create
now default to today’s
midnight local time.
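A minimal sketch of the behaviour described in #469, assuming *stringi* >= 1.8.1 is installed. The date string, format pattern, and time zone are illustrative choices, not taken from the changelog. Only a date is supplied, so according to the entry above the time-of-day fields should fall back to midnight:

```r
library(stringi)

# Parse a string that carries a date but no time-of-day fields.
# The pattern and the time zone are arbitrary example values.
parsed <- stri_datetime_parse("2023-11-09", format = "yyyy-MM-dd", tz = "UTC")

# Inspect the individual fields; the Hour, Minute, and Second columns
# show what the missing fields defaulted to.
fields <- stri_datetime_fields(parsed, tz = "UTC")
print(fields[, c("Year", "Month", "Day", "Hour", "Minute", "Second")])
```

With older releases the unspecified fields may come out differently, so comparing the printed row across versions is one way to see the change.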
[BACKWARD INCOMPATIBILITY] Removed the long-deprecated and defunct
+fallback_encoding
parameter of stri_read_lines
and the ellipsis
+parameter of stri_opts_collator
, stri_opts_regex
, stri_opts_fixed
,
+and stri_opts_brkiter
.
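The practical effect of dropping the ellipsis can be sketched with a stand-in function. `opts_fixed_sketch` below is hypothetical: it only mirrors the new, ellipsis-free signature shown in the diff and is not the package's own code. Once `...` is gone, a misspelled option is rejected by R's argument matching as an error rather than producing a warning:

```r
# Hypothetical stand-in mirroring the ellipsis-free signature;
# not the actual stringi implementation.
opts_fixed_sketch <- function(case_insensitive = FALSE, overlap = FALSE) {
    opts <- list()
    if (!missing(case_insensitive)) opts["case_insensitive"] <- case_insensitive
    if (!missing(overlap)) opts["overlap"] <- overlap
    opts
}

# A correctly spelled option is stored in the returned list.
ok <- opts_fixed_sketch(case_insensitive = TRUE)

# A misspelled option can no longer be absorbed by `...`,
# so the call fails; the error is captured here for inspection.
result <- tryCatch(
    opts_fixed_sketch(case_insensitve = TRUE),  # note the typo
    error = function(e) e
)
print(inherits(result, "error"))
```

Code that relied on the old warning to pass extra arguments through will therefore need to be corrected rather than merely silenced.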
[BUILD TIME] As per the suggestion of Prof. Brian Ripley, icudt74l
(ICU data - little endian) is now included in the source tarball (compressed
with xz to save space). This allows for building stringi on systems with
no internet access.
[NEW FEATURE] #476: A warning is emitted when selecting an unknown locale -for collation as it most likely indicates that a wrong resource is being -returned.
[NEW FEATURE] #476: In break iterator-, date-time-, and collator-based
+operations (e.g., stri_sort
), a warning is emitted when the root ICU
+resource bundle is returned when using an explicitly requested locale.
+This might happen when we pass an ‘unknown’ locale
argument to these
+functions. Note that when relying on the default locale=NULL
argument,
+no warning is emitted. In such a case, checking
+if the default locale as returned by stri_locale_get
is amongst
+those listed in stri_locale_list
is recommended.
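The recommended check could look roughly like this. It is a sketch that assumes *stringi* is installed and that the locale accessors `stri_locale_get` and `stri_locale_list` are the functions intended for the comparison; the message text is made up for illustration:

```r
library(stringi)

default_locale <- stri_locale_get()   # identifier of the current default locale
known_locales <- stri_locale_list()   # identifiers of the locales known to ICU

# No warning is emitted for the default locale, so test membership by hand.
if (!(default_locale %in% known_locales)) {
    message("Default locale '", default_locale,
            "' is not among the locales listed by stri_locale_list().")
}
```

An exact string match is the simplest test; whether a more general variant of the default locale (e.g., `es` for `es_ES`) should also count as acceptable is a decision left to the caller.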
[NEW FEATURE] The C
locale identifier now resolves to en_US_POSIX
.
[BUGFIX] #469: stri_datetime_parse
did not reset the Calendar
object when parsing multiple dates.
One of many examples of locale-dependent services is the Collator, which performs a locale-aware string comparison. It is used for string comparing, ordering, sorting, and searching. See stri_opts_collator
for the description on how to tune its settings, and its locale
argument in particular.
When choosing a resource bundle that is not available in the requested locale nor in its more general variants (e.g., ‘es_ES’ vs ‘es’), a warning is emitted.
+When choosing a resource bundle that is not available in the explicitly requested locale (but not when using the default locale) nor in its more general variants (e.g., ‘es_ES’ vs ‘es’), a warning is emitted.
Other locale-sensitive functions include, e.g., stri_trans_tolower
(that does character case mapping).
stri_datetime_fields(stri_datetime_now())
## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
-## 1 2023 11 8 17 0 49 139 46 2
+## 1 2023 11 9 11 24 49 982 46 2
## DayOfYear DayOfWeek Hour12 AmPm Era
-## 1 312 4 5 2 2
+## 1 313 5 11 1 2
stri_datetime_fields(stri_datetime_now(), locale='@calendar=hebrew')
## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
-## 1 5784 2 24 17 0 49 148 9 4
+## 1 5784 2 25 11 24 49 986 9 4
## DayOfYear DayOfWeek Hour12 AmPm Era
-## 1 54 4 5 2 1
+## 1 55 5 11 1 1
stri_datetime_symbols(locale='@calendar=hebrew')$Month[
stri_datetime_fields(stri_datetime_now(), locale='@calendar=hebrew')$Month
]
@@ -470,7 +470,7 @@ Examples
Some rights reserved. Licensed under CC BY-NC-ND 4.0.
Built with Sphinx
and a customised Furo theme.
- Last updated on 2023-11-08T17:01:04+1100.
+ Last updated on 2023-11-09T11:25:05+1100.
This site will never display any ads: it is a non-profit project.
It does not collect any data.
locale
NULL
or ''
for default locale, NA
for just checking the UTF-* family, or a single string with locale identifier.
NULL
or ''
for the default locale, or a single string with locale identifier.
skip_sentence_sep
logical; perform no action for sentences that do not contain an ending sentence terminator, but are ended by a hard separator or end of input
...
[DEPRECATED] any other arguments passed to this function generate a warning; this argument will be removed in the future
numeric
single logical value; when turned on, this attribute generates a collation key for the numeric value of substrings of digits; this is a way to get ‘100’ to sort AFTER ‘2’; note that negative or non-integer numbers will not be ordered properly
...
[DEPRECATED] any other arguments passed to this function generate a warning; this argument will be removed in the future
stri_opts_fixed(case_insensitive = FALSE, overlap = FALSE, ...)
+stri_opts_fixed(case_insensitive = FALSE, overlap = FALSE)
@@ -373,9 +373,6 @@ Arguments
overlap
logical; enable overlapping matches’ detection
-...
-[DEPRECATED] any other arguments passed to this function generate a warning; this argument will be removed in the future
-
stack_limit
integer; maximal size, in bytes, of the heap storage available for the match backtracking stack; setting a limit is desirable if poorly written regexes are expected on input; 0 for no limit
...
[DEPRECATED] any other arguments passed to this function generate a warning; this argument will be removed in the future
nparagraphs
deprecated alias of n_paragraphs
[DEPRECATED] alias of n_paragraphs
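The deprecated-alias pattern behind this entry can be sketched in isolation. `lipsum_sketch` is a hypothetical function that follows the same `missing()` logic as the diff of `stri_rand_lipsum`, with a shortened warning message; it is not the package's own code:

```r
# Hypothetical function illustrating the deprecated-alias pattern.
lipsum_sketch <- function(n_paragraphs, nparagraphs = n_paragraphs) {
    if (!missing(nparagraphs) && missing(n_paragraphs)) {
        # Only the old name was supplied: warn, then fall back to it.
        warning("'nparagraphs' is a deprecated alias of 'n_paragraphs'.")
        n_paragraphs <- nparagraphs
    }
    n_paragraphs
}

new_style <- lipsum_sketch(n_paragraphs = 2)                    # silent
old_style <- suppressWarnings(lipsum_sketch(nparagraphs = 2))   # same value
caught <- tryCatch(lipsum_sketch(nparagraphs = 2), warning = function(w) w)
print(inherits(caught, "warning"))
```

Because the alias defaults to the new argument, callers using the new name never evaluate the deprecated branch, and callers using the old name keep working while being told to migrate.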
''
for the current default encoding.
fname
deprecated alias of con
fallback_encoding
deprecated argument, no longer used
[DEPRECATED] alias of con
fname
deprecated alias of con
[DEPRECATED] alias of con
fname
deprecated alias of con
[DEPRECATED] alias of con