mzXML files can have multiple binary encodings #35

Open
wkumler opened this issue Mar 6, 2024 · 2 comments
Labels: bug (Something isn't working)

Comments

wkumler commented Mar 6, 2024

The DDA file from the Skyline folks in their DIA tutorial throws a bunch of warnings when run. Digging into these revealed that the mzXML file sometimes has a compressionType of "none" and sometimes "zlib" (which R's memDecompress reads with type = "gzip").

Scan 1: none
Scan 2: zlib
Scan 3: none
Scan 4: none
Scan 5: none
Scan 6: none
Scan 7: zlib

with no clear pattern. I had always assumed that the same compression would be applied to every binary array in a file (and RaMS only reads the encoding from the first peaks node). I'll have to switch it over to per-scan encodings, which is super annoying.

<scan num="1" msLevel="1" peaksCount="17" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.1821S" injectionTime="PT0.1000S" lowMz="423.914" highMz="1765.56" basePeakMz="1765.56" basePeakIntensity="779.182" totIonCurrent="10071.9">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q9P1BUPwxShD3HUiRAlb1EPe6yBD+KoQRAOuX0PvpUdEBuBDRAdZKkQH489EAu6lRAmIZEQfVKJEGxU5RBL2OUQj+/pEIzpVRDzChUQky15EXYj5RAlMgERrSNREDDeVRHfOj0QNYvNEfzuWRDAJVESVvzREGRwVRK8DZUQsofNE3LIIRELLow==</peaks>
</scan>
<scan num="2" msLevel="1" peaksCount="352" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.6401S" injectionTime="PT0.1000S" lowMz="400.251" highMz="1753.41" basePeakMz="445.117" basePeakIntensity="2.62395e+006" totIonCurrent="1.77468e+007">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="zlib" compressedLen="2622">eJwN1vlfzWkfx/GcFnXqtKAmKqQI2UY0JrJ93tf1PaftFJP1niTLhAnNiDv3GLK1nTqRUBNZk5GkRbJlG057Km2kkjZLU9ZKy+2n1+P5H7xINXoi3/bDDFJN2s6uyB+TalcCN3ObQ6pT63Bk8G1SJZfhWFwMqfoOsaT9CZS7c4CtNhVR7q6xaLZwotyvRthgOYvydh5GkEJJeRFOXD7bjfISRfxobyPlpfewhXNXUr7Ehy3dkkr547awxzeaKD/gLA5tEFO+ohdHeB3lZ7Zhl8dMKsh6jpD9elQoXYqfxfepMCZBCGCnqDC1Gxs8t1Nh2nlWoN1GRSYWPNA+j4pSNFhlzU4qulKHhv3mVDyFIdBXl4r3ewhrVUoq0XDiR4sLqCQog7m1r6aSfX3cvuU6lYSlsev631PJ+VY0PE6mJxrq7KBlID0RJyEhbhI92S3iVavK6Mk+d8TlCPQkSV16s0dJTzJrEPiqk54MiLh2+xcqHe8vtMp6qPTPWNZ+JpxKk8r5nj+6qTTrLd48r6LS/jJ8WGVPZZbZzGdBKZVNmMlc3e5S2Q0R81nsQuUXtkPlfI/KU4ayCMer9HTKdpRP1aOnCR8Qtmo3PU014B/yMujp1Q8oun2WKqYlsQbJIqrwz5KJo/2oIuyrTPNXRhUJW+Fv5UkV760ETx8zqtTVkC78okOVHptx+8B3VOkvl/3p6EmVIUuFaQFrqDIsQfjQ3kWVndlC8QM9qtJezhfbelGVOEP4bvUBqtraKnP17qCqAxe49solVBXcyU1zs6gqbD4fNOccVUVZI66rkaou9gh+ojNUldWL0Ht2VNW5WEh89pmqNVW4ZexH1YPv4OFrXaoWH0CO3J+qJ3rx1Ta+VL11rzSl/5uV5TjsakrV2S047jqWqjs6Bb2fd1PNVnPhQvYFqsmah8tLa6jmZg+7clJENR0nuNGjWnq2pYzPW5xGzzrcsD5oMj1P00W6ZwfVHkoURB4nqTZDgwXsNaQXRmncyIDohZ8L+29wBb3YsgVFK5vpRWS302gbB3pxJYsVz9hIL96F4vKs+1RnoCGLuLuN6qbeQNnoLqpTrBY2nGigushjsrGNi6juUqv0bE0u1eu58v8Z7qJ6/d3S6Ws9qH7SEuHhL01Ub3cVmzbLqT70KpsyrpTqw5u5d0A41UfOF+xeiag+WcUDZFnUoPMcvtYO1KD7luHB39Sg78Ht445Rw9i9qAv+Sg2TLfBl1StqiHiJ/Iyh9PIvBVf9HUgv2+wRnFFNjWYvuYmBNzWmN+HoMRW9iqnAwwmf6NW5fI77f1DT2MlMrq5LTTHxPDl2BjUbF3CLm98auAPHpg5Qc4wnS/Jqp+bLH7npL97UMsRTmLOtl1qmbMSXO7HUEsVkmiuyqSVNgYOZN6jV0F3qWrmcWpUNQk+lH7XmVKFZ6xS1GZQJ+imrqU35Gx9+5yi9zuxn6zNy6M2xOgRNLqR3e3RYzqPj1K62AKX2SmofLmG/GY2i9t0PmW/LSmqPnY70R+b074Q0tqJ2FHVkvkHYA0vqvK7Jrfzr6f2sYPTVjKCPB9O4us08+qR1mx2fVk+f9nXjhH84fTq4G2+2jqMvp2NZYnU/dVl2o2JdHHWlnkLFqEPUlf2etwT/RN0OPnxkYCF138hFSHY9fY2QCpWmVvQ1YwhbFvmOeiXLhAXXBlGvqy/+9L9HvWHxgmXkaeqNkPCvQ65Tb/QedmRtAvWJb/CQW87UJ7Fmt9cFUt/QJ8zVyZb6Qobio3ET9YUt5dNb7ahPsQQv//OM+qKHIHyhGfWL3VhMWzT1h2ZjV1oEDcQqmG/pFRq42SuM8X4MtZG2fHW2Empe3vhHqMMgWT9PiZ+BQaH9XC3RDINU52XOsgoMqrXAif5EiLRTpY
8qN0Ek3Sn1fxgFUbAT8zAbDFFIoHDFzAGif/5li/zvQPTYjIn7cyBSyXl72Amoa09kFY2VUPdqZ6pKB6gnWOOO3nRouG7lWaG3oRGxh1XMfwCNwrnspbsXNIqiEPd9EzQlTogfroSm82dhoNEUmgo1IbRxAjTzEtjOy1ehmV8gDO3+AVriFfwF7KClW84Pnv4ILdki9CbsgpbTcs7tvtk5hf1xfz20wqaz9DsHoRVejovzMqGV54Doe0OhswDImb8DOhHX2Lt5b6BTeJIFnboNsUSJnm1rIBb24NbNh9B7GIwUr0+QaOxEx4j5kJAhriZJIBH2orPsOgyFEmHTHicYBtfzkubZMHwUwL237Ybh42x+1GsbjLQ2s4v1FjAafI+9uTQVRtyHHd++HEYHfdDtPoCh+SP5ktHWGKY7jWXnj8EwWYk0epKAYaFVUvO7+himOsqebp2NYbnpQot6N4y1I9mIthkw1kkVRn5aAWPpAra4wQ7GstPc45A/jEOD2PmvCTDR9+Pl/66EidsB7mz6DiaR3qw8eR5MilbhuO9IfFdoKU33HAZTyWTpmqvnYeoyIB0W1wzT8ABuFbMfphHagp9JFkzzd7DzNeUwLWjix5LnYLjuOjavbQDD9bpYwwQZzPXzeeZIe5i7FfHlY6JhHpnO9XQ0YV6Uzf7AZFjoX0JJfjtG6tnj47LfMUb1lf8TuxZWOhIu1bgGKyGKbTySDSuZIVdnF2EV/CtKms1gFfKWrRl8AtbOwTj9KgzjFGuRem0SxuV1SYfUpcBGV024kvoGNk5NwmHzMNiEtfAXw1phk/eQz1iWhfHiRub3Kg8T3WRS69mOmBi5RKj4/AATi8YLyht6sNVTsOjBI2CrP1PQv+sLW5d1KPRKga3bHM7GK2Cr8MNftybANnIGW9R5GFOUHdxYTQtTSky4oXgXphp0s9jvN2Cq/NsPDFGDnUQPipaNsHNpQMVuJWY6WbEyrfGYGWbB4nkpZubZsB3BnbAX6+HE+0T86GIpfJ4VhR8VU4T8gHj8WKAvhPuNhYNuH681T4SDcyUPXLYdDgoJqzibDof8t8jMmIY5xf/jTR71cDQI4dHLRsBRvoZfyDgDx4gHXOSZCkflXa6Rq4Rj4VXmmecMx+LDUH3JwVzJc6Yw8sACww6kDLfEgqi3CHH9CxDacGRUOfDoN5yRb4aQ746cB71wLnQVXB54wkUiF+aec4TLhPfY8PwNXFw9eJ98LlwifLhCZAiXQkuuu78LrhILpK/dDrl7Mte4dw3yqIssNfcj5CWJrLbtJdz17zH3xn1wN/yAv+9Zw92tB/F7f4Z7ZC8KFu7DCsUBJPZ0YEXBKubfcAkr9ZSIN+uDl1sAH2gwgFdkAK81bIZX0a+8hGVilf42PpudxSq3n1hH4iv4RM3E4ZZNWGM4GzErfoefJAaBP6zH1mIl14wUw9/gEh8tNMFfHs+nvk6CvzKevetygH+xN1vq8AFBSXLmm6SJIFUAW5d8HkGtXuz0jJPYq72GedfOw75iJyjV3LF/YwIOLZyOsNbXyLwchPAJYjRdP4VYk2c4dP0Qzn7Oxp6ENOQ+2odzz7XwbLINjiRb/B8ZS+lz</peaks>
</scan>
<scan num="3" msLevel="1" peaksCount="84" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.9075S" injectionTime="PT0.0023S" lowMz="407.983" highMz="1790.98" basePeakMz="445.12" basePeakIntensity="3.00796e+006" totIonCurrent="1.64408e+007">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q8v920asXxtDzZYNRwcRFEPPhMhHQR1XQ8/i7kagqMtD0aWgRqjUyEPRqHNIwsWoQ9Io30fvKtBD1nIdSQBwQkPWi2pJLuT8Q9bykUexHVZD1wttSF/ySUPXccBJcLopQ9eLAkg0avND1/InR0Jy3UPYcVZJMPqXQ9jx1UfkrS5D2XEDSKTaGEPacJNHqFcdQ96Pa0o3l1ND3w94SYGzm0PfjNZHMKAUQ9+O9UlKvzZD36yESJBP00PgDvJIKL3bQ+As6kedu99D55DHSRq49kPnksFGptqKQ+ffWEamyqND6BDLSGjWckPokHtH7zsSQ+kQnkcFr3RD+LpLRw7pfEP7jcZHzBPnQ/wOBkcmXgpD/7xoR7XF/EQBil9G0CXGRAHI7UlzCfREAgjtSKzMwEQCSLdIZwmfRAKIxUeX5FxEBUw3SCa2UEQFjFZHHFy2RAWvRUbFIHpEBcwiRsxJ+kQLNWJGsVU6RAvMcEbNe0FEEMbIRvEJKUQUMsJGr9ZvRBRKKEiE0AhEFIosSExdg0QUygJH74qPRBUJ/0cuEa1EF817SMdhy0QYDX9Iioq7RBhNNUgOsMVEGI02R2ylHEQgUlpGoqg8RCbLZUfS0XBEJwtbR8oaPEQnS0dHQpPURCpOoUgT0eJEKo66R7HmVEQqzshHJfxdRCsOhUbtkfhENQntRsdXuUQ14uRGuu73RDZ3XkbuMw9ENxdNRql+CUQ4zf1GySnfRDlMkUe1PBJEOYyaR53SZUQ5zKxG3sT1RDoMYkbon1NEOvsNRqo4MkRLFOVGyFTLREtyJkbE4DFES4PARrhXdERLzbFHHdvDREwN60cPEzxElaiKRtVP1ESa4JNG6GD2RKnHHUbZld5EuR0xRuc/PkTf34JG/H+s</peaks>
</scan>
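
A quick way to confirm the mix is to tabulate the compressionType attribute scan by scan. This is only a sketch on a toy document: the `toy_mzxml` string and its placeholder payloads are made up for illustration, and it has no namespace, so the XPath is plain "//scan" rather than the "d1:"-prefixed form needed for real mzXML files.

```r
library(xml2)

# Toy stand-in for an mzXML file (hypothetical content, no namespace).
# Real mzXML declares a default namespace, so the XPath there needs the
# "d1:" prefix (e.g. "d1:peaks" as used elsewhere in this issue).
toy_mzxml <- '<msRun>
  <scan num="1"><peaks compressionType="none">AAAA</peaks></scan>
  <scan num="2"><peaks compressionType="zlib">AAAA</peaks></scan>
  <scan num="3"><peaks compressionType="none">AAAA</peaks></scan>
</msRun>'

doc <- read_xml(toy_mzxml)
scan_nodes <- xml_find_all(doc, "//scan")

# One compressionType per scan, read from that scan's own peaks node
scan_nums <- as.integer(xml_attr(scan_nodes, "num"))
peak_encs <- xml_attr(xml_find_first(scan_nodes, "./peaks"), "compressionType")

enc_table <- data.frame(scan = scan_nums, compression = peak_encs)
mixed_encodings <- length(unique(peak_encs)) > 1
print(enc_table)
```

If `mixed_encodings` is TRUE, a single file-level compression setting cannot be right for every scan.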

So instead of decoding every scan with the single file-level compression type:

  vals <- lapply(all_peak_nodes, function(binary){
    if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
    decoded_binary <- base64enc::base64decode(binary)
    raw_binary <- as.raw(decoded_binary)
    decomp_binary <- memDecompress(raw_binary, type = file_metadata$compression)
    final_binary <- readBin(decomp_binary, what = "numeric",
                            n=length(decomp_binary)/file_metadata$precision,
                            size = file_metadata$precision,
                            endian = file_metadata$endi_enc)
    matrix(final_binary, ncol = 2, byrow = TRUE)
  })

I'll have to read the compression type for each scan and do something like

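peak_nodes <- xml2::xml_find_all(xml_nodes, xpath = "d1:peaks")
all_peak_nodes <- xml2::xml_text(peak_nodes)
all_peak_encs <- xml2::xml_attr(peak_nodes, "compressionType")
# memDecompress() has no "zlib" type; zlib streams are read with type = "gzip".
# Anything else (including a missing attribute) is treated as uncompressed.
all_peak_encs <- ifelse(all_peak_encs %in% "zlib", "gzip", "none")
vals <- mapply(function(binary, encoding_i){
    if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
    decoded_binary <- base64enc::base64decode(binary)
    raw_binary <- as.raw(decoded_binary)
    # Decompress each scan with its own compression type
    decomp_binary <- memDecompress(raw_binary, type = encoding_i)
    final_binary <- readBin(decomp_binary, what = "numeric",
                            n=length(decomp_binary)/file_metadata$precision,
                            size = file_metadata$precision,
                            endian = file_metadata$endi_enc)
    matrix(final_binary, ncol = 2, byrow = TRUE)
# SIMPLIFY = FALSE keeps one matrix per scan; USE.NAMES = FALSE avoids
# naming the list with the base64 strings
}, all_peak_nodes, all_peak_encs, SIMPLIFY = FALSE, USE.NAMES = FALSE)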
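
To sanity-check the per-scan approach without the full file, here is a minimal round-trip sketch. It assumes 32-bit big-endian ("network") floats as in the scans above; `encode_peaks` and `decode_peaks` are hypothetical helpers written for this example, not RaMS functions.

```r
# Hypothetical helper: pack an m/z-intensity matrix the way the <peaks>
# nodes above are packed (32-bit big-endian floats, optionally zlib
# compressed, then base64 encoded).
encode_peaks <- function(mat, compress) {
  raw_vals <- writeBin(as.vector(t(mat)), raw(), size = 4, endian = "big")
  if (compress) raw_vals <- memCompress(raw_vals, type = "gzip")
  base64enc::base64encode(raw_vals)
}

# Hypothetical helper: decode one <peaks> string using that scan's own
# compressionType attribute.
decode_peaks <- function(binary, compression_type) {
  if (!nchar(binary)) return(matrix(numeric(0), ncol = 2, nrow = 0))
  raw_binary <- base64enc::base64decode(binary)
  # mzXML says "zlib"; memDecompress() calls the same format "gzip"
  mem_type <- if (identical(compression_type, "zlib")) "gzip" else "none"
  decomp_binary <- memDecompress(raw_binary, type = mem_type)
  vals <- readBin(decomp_binary, what = "numeric",
                  n = length(decomp_binary) / 4, size = 4, endian = "big")
  matrix(vals, ncol = 2, byrow = TRUE)
}

# Two made-up scans: the first uncompressed, the second zlib compressed
peaks_a <- matrix(c(100.5, 10, 200.25, 20), ncol = 2, byrow = TRUE)
peaks_b <- matrix(c(300.125, 30, 400.75, 40, 500, 50), ncol = 2, byrow = TRUE)
all_peak_nodes <- c(encode_peaks(peaks_a, FALSE), encode_peaks(peaks_b, TRUE))
all_peak_encs <- c("none", "zlib")

vals <- mapply(decode_peaks, all_peak_nodes, all_peak_encs,
               SIMPLIFY = FALSE, USE.NAMES = FALSE)
```

The test values were chosen to be exactly representable as 32-bit floats, so the decoded matrices should match the inputs.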

wkumler commented Mar 6, 2024

This has been implemented on the enc_hotfix branch - I wonder if it's easier to just convert to mzML though??


wkumler commented Mar 6, 2024

A possible implementation would be to request all the encodings in grabEncodings and, if there are multiple (maybe length(unique()) > 1?), switch to this method and otherwise use the original?
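
A rough sketch of that dispatch, working on already base64-decoded raw vectors. `decompress_peaks` is a hypothetical name, and `mem_types` is assumed to already hold memDecompress-style values ("gzip" or "none") rather than the raw mzXML attribute.

```r
# Hypothetical helper: take the per-scan path only when the file really
# mixes compression types, otherwise keep the single-type path.
decompress_peaks <- function(raw_list, mem_types) {
  if (length(unique(mem_types)) > 1) {
    # Mixed file: pair each scan with its own compression type
    mapply(function(raw_i, type_i) memDecompress(raw_i, type = type_i),
           raw_list, mem_types, SIMPLIFY = FALSE, USE.NAMES = FALSE)
  } else {
    # Uniform file: one compression type for every scan
    lapply(raw_list, memDecompress, type = mem_types[1])
  }
}

# Made-up payload standing in for decoded peak bytes
payload <- charToRaw("peaks")
mixed_scans <- list(payload, memCompress(payload, type = "gzip"))
mixed_out <- decompress_peaks(mixed_scans, c("none", "gzip"))

uniform_scans <- list(memCompress(payload, type = "gzip"),
                      memCompress(payload, type = "gzip"))
uniform_out <- decompress_peaks(uniform_scans, c("gzip", "gzip"))
```

Whether the extra branch is worth it depends on how much slower the per-scan path is in practice; always using the per-scan path would be simpler.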

wkumler added the bug label on Sep 18, 2024