Run CRISP-Correct with R1 fastq only #3

Edert · 2024-11-18T11:56:40Z

Hi,

I am trying to run CRISPR-Correct with just one fastq file (R1 only). According to the readme it should be possible to do so.
How should I define the fastq_r2_fn parameter as it seems to be required? None, NULL, or an empty string '' does not work.

My python lines:

import crispr_ambiguous_mapping
import pandas as pd
import gzip
import shutil

PROTOSPACER_HAMMING_THRESHOLD = 7
CPUS=16

GUIDE_LIBRARY_DATAFRAME = pd.read_table("ref/lib1_for_CRISPR_correct.txt")

GUIDE_LIBRARY_DATAFRAME
                protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]


INPUT_GZ_FILE= "PlasmidPool_R1.fastq.gz"
FASTQ_FILE ="PlasmidPool_R1.fastq" 

#unzip
with gzip.open(INPUT_GZ_FILE, 'rb') as gz_file:
    with open(FASTQ_FILE, 'wb') as extracted_file:
        shutil.copyfileobj(gz_file, extracted_file)

 
count_result =  crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_umitools_output(whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME, fastq_r1_fn=FASTQ_FILE, fastq_r2_fn='None',protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,cores=CPUS)

and this is my fastq file:

head PlasmidPool_R1.fastq
@LH00392:128:22F7GGLT4:3:1101:41340:1042 1:N:0:ATACCTGT
CATGCAAGACAGGTCACAAG
+
IIIIIIIIIIIIIIIIIIII
@LH00392:128:22F7GGLT4:3:1101:42100:1042 1:N:0:ATACCTGT
GTTCTGGACATTCACCATCC
+
IIIIIIIIIIIIIIIIIIII
@LH00392:128:22F7GGLT4:3:1101:43152:1042 1:N:0:ATACCTGT
GTGCTCTAGCTGTCAAGCTT

Best regards,
Thomas

The text was updated successfully, but these errors were encountered:

CodingBash · 2024-11-22T20:11:50Z

Hi Thomas,

I just updated the repository, please see the updated README and install the latest version of CRISPR-Correct: pip install crispr-ambiguous-mapping==0.0.177

Here is a draft of what the function should look like with R1 only:

result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
       whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME, 
       fastq_r1_fn=INPUT_GZ_FILE,  # Tool accepts GZ files 

       protospacer_start_position = 0,
       protospacer_length = 20,

       is_protospacer_r1 = True, 
       is_protospacer_header = False, 
       revcomp_protospacer = False, 
       protospacer_hamming_threshold_strict=7,
       cores=CPUS)

The updated version isn't as robustly tested so let me know if you still have issues.

Edert · 2024-11-25T07:58:03Z

Dear Basheer,

thank you for the quick reply. I installed version 0.0.177 but it still requires a second fastq file:

>>> count_result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
...        whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME,
...        fastq_r1_fn=INPUT_GZ_FILE,  # Tool accepts GZ files
...
...        protospacer_start_position = 0,
...        protospacer_length = 20,
...
...        is_protospacer_r1 = True,
...        is_protospacer_header = False,
...        revcomp_protospacer = False,
...        protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,
...        cores=CPUS)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: get_whitelist_reporter_counts_from_fastq() missing 1 required positional argument: 'fastq_r2_fn'

I did then change line 99 in main_mapping.py
from
fastq_r2_fn: Optional[str], to
fastq_r2_fn: Optional[str] = None,

Then I get this:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
1 seconds for parsing
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 227, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 52, in get_whitelist_reporter_counts_with_umi
    encoded_whitelist_protospacer_sequences_series = crispr_sequence_encoding.encode_guide_series(padded_whitelist_guide_reporter_df["protospacer"])
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_sequence_encoding.py", line 62, in encode_guide_series
    guide_numpy_encoding = encode_DNA_base_vectorized(guide_numpy_char)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2372, in __call__
    return self._call_as_normal(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2365, in _call_as_normal
    return self._vectorize_call(func=func, args=vargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2446, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2486, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_sequence_encoding.py", line 48, in encode_DNA_base
    return full_encoding_dict[char]
           ~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: ' '

I have the feeling that the guides df might not be in the correct format, how should it look like?

Currently it is:

>>> GUIDE_LIBRARY_DATAFRAME
                protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]

>>> type(GUIDE_LIBRARY_DATAFRAME)
<class 'pandas.core.frame.DataFrame'>

>>> GUIDE_LIBRARY_DATAFRAME.dtypes
protospacer    object
dtype: object

>>> GUIDE_LIBRARY_DATAFRAME.info
<bound method DataFrame.info of                 protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]>

>>> GUIDE_LIBRARY_DATAFRAME.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56105 entries, 0 to 56104
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   protospacer  56105 non-null  object
dtypes: object(1)
memory usage: 438.4+ KB

Best regards,
Thomas

CodingBash · 2024-11-25T16:36:54Z

Hi Thomas,

I have updated the package: 1) added the "= None" to the function to accept missing fastq_r2_fn, good catch, and 2) added a pre-processing to strip all trailing whitespace in the guide library dataframe sequences.

RE point 2, I think the error came from there being trailing whitespace in one or multiple of the protospacer sequences in your library, so there was an issue with my tool encoding spaces. You don't have to change anything with your library since the tool will be able to handle trailing whitespaces, just need to rerun function with updated version.

Install updated version: pip install crispr-ambiguous-mapping==0.0.179

Best,
Basheer

Edert · 2024-11-26T09:22:56Z

Dear Basheer,

yes, you were right somehow some white spaces slipped into my guide list. Thank you for handling them!

Unfortunately I ran into another error:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
1 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 119, in get_whitelist_reporter_counts_with_umi
    encoded_whitelist_surrogate_sequences_series=encoded_whitelist_surrogate_sequences_series,
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'encoded_whitelist_surrogate_sequences_series' where it is not associated with a value

So I did initialize those four variables in the get_whitelist_reporter_counts_with_umi function in crispr_guide_contring.py:

    encoded_whitelist_surrogate_sequences_series = np.array([])
    encoded_whitelist_barcode_sequences_series = np.array([])
    surrogate_hamming_threshold = surrogate_hamming_threshold_strict
    barcode_hamming_threshold = barcode_hamming_threshold_strict

but ended up with:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
0 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Running inference parallelized on 84745 observed seqeunces with cores 16
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_inference.py", line 33, in infer_whitelist_sequence
    def infer_whitelist_sequence(observed_guide_reporter_sequence_input: Tuple[str, Optional[str], Optional[str]],
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_functions.py", line 113, in check_argument_types
    check_type_internal(value, expected_type, memo=memo)
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_checkers.py", line 676, in check_type_internal
    checker(value, origin_type, args, memo)
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_checkers.py", line 334, in check_tuple
    raise TypeCheckError("is not a tuple")
typeguard.TypeCheckError: argument "observed_guide_reporter_sequence_input" (str) is not a tuple
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 138, in get_whitelist_reporter_counts_with_umi
    inferred_true_reporter_sequences = pool.map(
                                       ^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
typeguard.TypeCheckError: argument "observed_guide_reporter_sequence_input" (str) is not a tuple

I guess observed_guide_reporter_sequence_input comes as string but the infer_whitelist_sequence function requires a tuple.

Best,
Thomas

CodingBash · 2024-12-09T15:55:29Z

Hi Thomas,

I set encoded_whitelist_surrogate_sequences_series=None. Hopefully this error is fixed. There may be some other variables regarding the surrogate sequence and barcode sequence that I mistakenly forgot to initialize when surrogate and barcode isn't inputted, let me know if these errors pop up

new version 0.0.184

Best,
Basheer

Edert · 2024-12-10T12:59:06Z

Dear Basheer,

yes, there are more not initialized variables I guess.

>>> count_result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
...        whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME,
...        fastq_r1_fn=INPUT_GZ_FILE,  
...        protospacer_start_position = 0,
...        protospacer_length = 20,
...        is_protospacer_r1 = True,
...        is_protospacer_header = False,
...        revcomp_protospacer = False,
...        protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,
...        cores=CPUS)
Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
0 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 124, in get_whitelist_reporter_counts_with_umi
    surrogate_hamming_threshold=surrogate_hamming_threshold,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'surrogate_hamming_threshold' where it is not associated with a value

Best,
Thomas

CodingBash · 2024-12-13T03:56:14Z

Hi Thomas,

I believe I fixed all these issues (tested on one of my protospacer-only samples). Try running again.

version 0.0.191

Then you can retrieve your counts from using either of the three:
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_accepted_counterseries
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_ignored_counterseries
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_spread_counterseries

depending on how you want to treat sequences ambiguously mapped to more than one protospacer (tally +1 to all mapped protospacer, ignore ambiguous mapping and don't tally any protospacer, tally fractional +1/#_mapped to all mapped sequence).

Best,
Basheer

Edert · 2024-12-20T09:01:57Z

Dear Basheer,

thank you that´s great! I did test it and there was no more error message and I do receive counts.

I would have two more questions:

Is there a way to set the hamming threshold lower than 1? I guess not because then it would count only perfect matches?
How do you usually continue with the counts in terms of normalization and testing against the pool? MAGeCK or edgeR/Deseq2 maybe?

Best,
Thomas

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run CRISP-Correct with R1 fastq only #3

Run CRISP-Correct with R1 fastq only #3

Edert commented Nov 18, 2024

CodingBash commented Nov 22, 2024

Edert commented Nov 25, 2024

CodingBash commented Nov 25, 2024

Edert commented Nov 26, 2024

CodingBash commented Dec 9, 2024

Edert commented Dec 10, 2024

CodingBash commented Dec 13, 2024

Edert commented Dec 20, 2024

Run CRISP-Correct with R1 fastq only #3

Run CRISP-Correct with R1 fastq only #3

Comments

Edert commented Nov 18, 2024

CodingBash commented Nov 22, 2024

Edert commented Nov 25, 2024

CodingBash commented Nov 25, 2024

Edert commented Nov 26, 2024

CodingBash commented Dec 9, 2024

Edert commented Dec 10, 2024

CodingBash commented Dec 13, 2024

Edert commented Dec 20, 2024