-
-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(starr): expand dual audio regex #1979
base: master
Are you sure you want to change the base?
feat(starr): expand dual audio regex #1979
Conversation
Another thought: it would be nice if this could also be captured in the {custom formats} for renaming, to better track which files have dual audio and which don't. But that would require it having a shorter name like |
Ah, nevermind—I realized that's already accounted for in the |
@rg9400 could I get you to take a look at this given the regex changes |
5783172
to
97a0780
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could probably be improved further but this seems to do the same job.
not sure why the varyg is case sensitive, that seems like its unnecessary - when would this be lowercase and not be a match?
Fair enough—I think I was just trying to be super safe, but you make a good point. And I appreciate the additional refactor! Changes accepted. |
still needs review from anime team - my changes have been committed
Co-authored-by: zakary <[email protected]>
Co-authored-by: zakary <[email protected]>
370a931
to
41b551e
Compare
Sorry, was on vacation for the last few weeks. Can you share some test cases against this regex? The VARYG format has The regex itself looks good, but sometimes it's hard to identify potential issues, so just trying to do a bit of due diligence to validate the changes are solving a problem. |
@rg9400 given your recent changes is this still valid/needed? |
Hi! My apologies for leaving and then resurrecting this PR after so long. Everyone in this thread gave a lot of great feedback, and I've been giving it a ton of consideration. I'm trying to approach this matter very thoughtfully, as my goal (perhaps lofty) is to "solve" dual audio anime for the foreseeable future. To that end, I'd love feedback and recommendations on the different approaches I'm about to lay out. TLDR:
If that kind of performance hit is not a concern (given we're still talking a couple millisecs here), I advocate for high coverage. If it is, or there are other concerns, medium will still be a huge improvement. Improvements over current regex(All)
(High coverage only)
Responding to previous comments
The problem as I understand it currently
Approaches and tradeoffsIt's possible to maximize coverage for all cases, with some tradeoffs: namely, possibility of false positives, and performance (relative to current regex) if we then try to control for them. Here are the options. MEDIUM COVERAGE: Match as many cases EXCEPT for the word
|
In regards to the It's also much less of a concern for trailing spaces to be used, in my opinion, than for different languages to have different whitespace usage. Just my opinion, though. |
Heard and appreciated! Whatever the ultimate PR winds up being, I'll make sure to replace spaces with |
@adapowers I have this flagged to look at, but I'll need to see when I find time due to the holidays. Prima facie, I think we can certainly expand it to handle some of the cases you've highlighted. I've jotted down some quick thoughts from reading your post, but I haven't gone over it in detail or spent time with the regex, I will do that later. Do you see lots of results with [English + Japanese]? I only included the [EN+JA] cases because that's what the Arr's rename the file to include if multi-languages are detected. The other sort of thing I noticed was your reasoning for the Dual tradeoff. I think it makes sense but is something I am wary of in terms of maintenance. The other potential concern could be episode titles since it could match on those, though I think most anime groups do not include them (though P2P do sometimes). Also, I do not think we should match on |
Given the number of regex steps and substantially increased time processing for every individual case of multiple languages takes, I'm not entirely sure the benefits outweigh the cost to cover every single language variation in one, and the precedent that every language is covered is only tech-debt. I would recommend taking a rather lax approach to what you include and maybe write an additional CF or 2 for users that need them. Or a brief "this is how to add your own language" Bundling everything into one only introduces complexity for maintaining, cost for processing, and precedent to cover every language in the future. None of that sounds particularly great to me, but I will defer to you all as anime isn't my thing to begin with. |
I appreciate both your notes, and will respond meaningfully ASAP. To ask a Q now though, @zakkarry, in response to something you said: is there precedent for a (TRaSH-synced) "advanced" version of a CF with qualified tradeoffs? My hopeful goal is to improve everyone's *arr experience, else I'd just keep my CF for myself and unsync from the guide's. :) Overall, I know we have to minimize debt+maximize compatibility to keep the UX bar high, just knowing what other options exist could be helpful for deciding how to split this out. (And to your other point: anime naming is a monster, and we could use all the CPU cycles in the world trying to account for it; it's likely always going to be a little more expensive compared to other CFs for even decent coverage, but obviously cost:benefit is always a relevant factor.) |
I don't think that DUAL would ever be considered a default CF, but I don't use the sync'ing software and use mostly custom CF's and scores I wrote and manage to begin with, so I wouldn't be the person to ask about the feasibility or anything of that. It was just what I would suggest if all things were equal. |
@adapowers been going through your medium regex for now (I know that we have to make a decision on the riskier ones, will get to that in a bit). Some notes on it below, but otherwise, I think that is easy enough to adopt without much concern.
For the more aggressive regex, I have verified it works, but we just have to make a call here. Mainly P2P groups use this format, and the guide isn't super oriented around them (though they do get scored). On the other hand, your regex is controlling for the edge cases we can predict so far, though it seems the processing time and maintenance will be the tradeoff (as well as those edge cases we cannot predict such as episode titles which might be included in some p2p releases). I am still thinking through this, but we can fix the above since it is consistent across all the formats in the meantime. |
Pull Request
Purpose
Two common conventions for indicating dual audio releases are not currently captured by the regex:
DUAL-VARYG
DUAL {Resolution}
or{Resolution} DUAL
I attempted to add a pattern which was flexible, while having a low risk of false positives.
Approach
The following patterns have been added into the regex:
(?-i)DUAL-VARYG(?i)
(?-i)
disables case-insensitivity for the rest of the patternDUAL-VARYG
matches literally(?i)
turns case-insensitivity back on for the rest of the patterndual[ ._-]?(\d{3,4}p|ultrahd|4k)
dual[ ._-]
matchesdual
(case insensitive) plus common separator characters\d{3,4}p|ultrahd|4k
matches any 3-4 digits followed byp
(1080p
,720p
, etc.) or4K
orUltraHD
(\d{3,4}p|ultrahd|4k)[ ._-]?dual
1080p.DUAL
,4K UltraHD DUAL
, etc.Notes:
dual
is a dictionary word, we want to prevent matching on any uppercase, hyphen-delimited filename that contains it. This is the same reason I chose to encode the release group directly. Without both, there could be too many false positives.Regex
https://regex101.com/r/p1Rt67/6
Open Questions and Pre-Merge TODOs
Requirements