Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run CRISP-Correct with R1 fastq only #3

Open
Edert opened this issue Nov 18, 2024 · 8 comments
Open

Run CRISP-Correct with R1 fastq only #3

Edert opened this issue Nov 18, 2024 · 8 comments

Comments

@Edert
Copy link

Edert commented Nov 18, 2024

Hi,

I am trying to run CRISPR-Correct with just one fastq file (R1 only). According to the readme it should be possible to do so.
How should I define the fastq_r2_fn parameter as it seems to be required? None, NULL, or an empty string '' does not work.

My python lines:

import crispr_ambiguous_mapping
import pandas as pd
import gzip
import shutil

PROTOSPACER_HAMMING_THRESHOLD = 7
CPUS=16

GUIDE_LIBRARY_DATAFRAME = pd.read_table("ref/lib1_for_CRISPR_correct.txt")

GUIDE_LIBRARY_DATAFRAME
                protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]


INPUT_GZ_FILE= "PlasmidPool_R1.fastq.gz"
FASTQ_FILE ="PlasmidPool_R1.fastq" 

#unzip
with gzip.open(INPUT_GZ_FILE, 'rb') as gz_file:
    with open(FASTQ_FILE, 'wb') as extracted_file:
        shutil.copyfileobj(gz_file, extracted_file)

 
count_result =  crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_umitools_output(whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME, fastq_r1_fn=FASTQ_FILE, fastq_r2_fn='None',protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,cores=CPUS)

and this is my fastq file:

head PlasmidPool_R1.fastq
@LH00392:128:22F7GGLT4:3:1101:41340:1042 1:N:0:ATACCTGT
CATGCAAGACAGGTCACAAG
+
IIIIIIIIIIIIIIIIIIII
@LH00392:128:22F7GGLT4:3:1101:42100:1042 1:N:0:ATACCTGT
GTTCTGGACATTCACCATCC
+
IIIIIIIIIIIIIIIIIIII
@LH00392:128:22F7GGLT4:3:1101:43152:1042 1:N:0:ATACCTGT
GTGCTCTAGCTGTCAAGCTT

Best regards,
Thomas

@CodingBash
Copy link
Collaborator

Hi Thomas,

I just updated the repository, please see the updated README and install the latest version of CRISPR-Correct: pip install crispr-ambiguous-mapping==0.0.177

Here is a draft of what the function should look like with R1 only:

result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
       whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME, 
       fastq_r1_fn=INPUT_GZ_FILE,  # Tool accepts GZ files 

       protospacer_start_position = 0,
       protospacer_length = 20,

       is_protospacer_r1 = True, 
       is_protospacer_header = False, 
       revcomp_protospacer = False, 
       protospacer_hamming_threshold_strict=7,
       cores=CPUS)

The updated version isn't as robustly tested so let me know if you still have issues.

@Edert
Copy link
Author

Edert commented Nov 25, 2024

Dear Basheer,

thank you for the quick reply. I installed version 0.0.177 but it still requires a second fastq file:

>>> count_result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
...        whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME,
...        fastq_r1_fn=INPUT_GZ_FILE,  # Tool accepts GZ files
...
...        protospacer_start_position = 0,
...        protospacer_length = 20,
...
...        is_protospacer_r1 = True,
...        is_protospacer_header = False,
...        revcomp_protospacer = False,
...        protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,
...        cores=CPUS)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: get_whitelist_reporter_counts_from_fastq() missing 1 required positional argument: 'fastq_r2_fn'

I did then change line 99 in main_mapping.py
from
fastq_r2_fn: Optional[str], to
fastq_r2_fn: Optional[str] = None,

Then I get this:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
1 seconds for parsing
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 227, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 52, in get_whitelist_reporter_counts_with_umi
    encoded_whitelist_protospacer_sequences_series = crispr_sequence_encoding.encode_guide_series(padded_whitelist_guide_reporter_df["protospacer"])
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_sequence_encoding.py", line 62, in encode_guide_series
    guide_numpy_encoding = encode_DNA_base_vectorized(guide_numpy_char)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2372, in __call__
    return self._call_as_normal(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2365, in _call_as_normal
    return self._vectorize_call(func=func, args=vargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2446, in _vectorize_call
    res = self._vectorize_call_with_signature(func, args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/numpy/lib/function_base.py", line 2486, in _vectorize_call_with_signature
    results = func(*(arg[index] for arg in args))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_sequence_encoding.py", line 48, in encode_DNA_base
    return full_encoding_dict[char]
           ~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: ' '

I have the feeling that the guides df might not be in the correct format, how should it look like?

Currently it is:

>>> GUIDE_LIBRARY_DATAFRAME
                protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]

>>> type(GUIDE_LIBRARY_DATAFRAME)
<class 'pandas.core.frame.DataFrame'>

>>> GUIDE_LIBRARY_DATAFRAME.dtypes
protospacer    object
dtype: object

>>> GUIDE_LIBRARY_DATAFRAME.info
<bound method DataFrame.info of                 protospacer
0      GCCTCTGCCTGGTCTGTGGG
1      TGGTCTGTGGGGACGTGGCC
2      CATCCTGTGAGGCCTGCAAA
3      AGGCCTGCAAAGCCTTCTTC
4      ACAGCTGTCCGGCCTCCAAC
...                     ...
56100  TGGGGGTGTTCTGCTGGTAG
56101  TGGTTGTCGGGCAGCAGCAC
56102  TGTACTCCAGCTTGTGCCCC
56103  TGTGATCGCGCTTCTCGTTG
56104  TTCAAGTCCGCCATGCCCGA

[56105 rows x 1 columns]>

>>> GUIDE_LIBRARY_DATAFRAME.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56105 entries, 0 to 56104
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   protospacer  56105 non-null  object
dtypes: object(1)
memory usage: 438.4+ KB

Best regards,
Thomas

@CodingBash
Copy link
Collaborator

Hi Thomas,

I have updated the package: 1) added the "= None" to the function to accept missing fastq_r2_fn, good catch, and 2) added a pre-processing to strip all trailing whitespace in the guide library dataframe sequences.

RE point 2, I think the error came from there being trailing whitespace in one or multiple of the protospacer sequences in your library, so there was an issue with my tool encoding spaces. You don't have to change anything with your library since the tool will be able to handle trailing whitespaces, just need to rerun function with updated version.

Install updated version: pip install crispr-ambiguous-mapping==0.0.179

Best,
Basheer

@Edert
Copy link
Author

Edert commented Nov 26, 2024

Dear Basheer,

yes, you were right somehow some white spaces slipped into my guide list. Thank you for handling them!

Unfortunately I ran into another error:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
1 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 119, in get_whitelist_reporter_counts_with_umi
    encoded_whitelist_surrogate_sequences_series=encoded_whitelist_surrogate_sequences_series,
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'encoded_whitelist_surrogate_sequences_series' where it is not associated with a value

So I did initialize those four variables in the get_whitelist_reporter_counts_with_umi function in crispr_guide_contring.py:

    encoded_whitelist_surrogate_sequences_series = np.array([])
    encoded_whitelist_barcode_sequences_series = np.array([])
    surrogate_hamming_threshold = surrogate_hamming_threshold_strict
    barcode_hamming_threshold = barcode_hamming_threshold_strict

but ended up with:


Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
0 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Running inference parallelized on 84745 observed seqeunces with cores 16
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
           ^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_inference.py", line 33, in infer_whitelist_sequence
    def infer_whitelist_sequence(observed_guide_reporter_sequence_input: Tuple[str, Optional[str], Optional[str]],
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_functions.py", line 113, in check_argument_types
    check_type_internal(value, expected_type, memo=memo)
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_checkers.py", line 676, in check_type_internal
    checker(value, origin_type, args, memo)
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/typeguard/_checkers.py", line 334, in check_tuple
    raise TypeCheckError("is not a tuple")
typeguard.TypeCheckError: argument "observed_guide_reporter_sequence_input" (str) is not a tuple
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 138, in get_whitelist_reporter_counts_with_umi
    inferred_true_reporter_sequences = pool.map(
                                       ^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
typeguard.TypeCheckError: argument "observed_guide_reporter_sequence_input" (str) is not a tuple

I guess observed_guide_reporter_sequence_input comes as string but the infer_whitelist_sequence function requires a tuple.

Best,
Thomas

@CodingBash
Copy link
Collaborator

Hi Thomas,

I set encoded_whitelist_surrogate_sequences_series=None. Hopefully this error is fixed. There may be some other variables regarding the surrogate sequence and barcode sequence that I mistakenly forgot to initialize when surrogate and barcode isn't inputted, let me know if these errors pop up

new version 0.0.184

Best,
Basheer

@Edert
Copy link
Author

Edert commented Dec 10, 2024

Dear Basheer,

yes, there are more not initialized variables I guess.

>>> count_result = crispr_ambiguous_mapping.mapping.get_whitelist_reporter_counts_from_fastq(
...        whitelist_guide_reporter_df=GUIDE_LIBRARY_DATAFRAME,
...        fastq_r1_fn=INPUT_GZ_FILE,  
...        protospacer_start_position = 0,
...        protospacer_length = 20,
...        is_protospacer_r1 = True,
...        is_protospacer_header = False,
...        revcomp_protospacer = False,
...        protospacer_hamming_threshold_strict=PROTOSPACER_HAMMING_THRESHOLD,
...        cores=CPUS)
Opening FASTQ.gz file with gzip, filename=PlasmidPool_R1.fastq.gz
0 seconds for file loading
0 seconds for parsing
Protospacer hamming threshold is 7
Inferring the true guides from observed guides
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/mapping/main_mapping.py", line 226, in get_whitelist_reporter_counts_from_fastq
    return get_whitelist_reporter_counts_with_umi_PARTIAL(contains_surrogate=contains_surrogate, contains_barcode=contains_barcode, contains_umi = contains_umi)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eder/.conda/envs/CRISPR-Correct/lib/python3.11/site-packages/crispr_ambiguous_mapping/processing/crispr_guide_counting.py", line 124, in get_whitelist_reporter_counts_with_umi
    surrogate_hamming_threshold=surrogate_hamming_threshold,
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'surrogate_hamming_threshold' where it is not associated with a value

Best,
Thomas

@CodingBash
Copy link
Collaborator

Hi Thomas,

I believe I fixed all these issues (tested on one of my protospacer-only samples). Try running again.

version 0.0.191

Then you can retrieve your counts from using either of the three:
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_accepted_counterseries
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_ignored_counterseries
count_result.all_match_set_whitelist_reporter_counter_series_results.protospacer_match.ambiguous_spread_counterseries

depending on how you want to treat sequences ambiguously mapped to more than one protospacer (tally +1 to all mapped protospacer, ignore ambiguous mapping and don't tally any protospacer, tally fractional +1/#_mapped to all mapped sequence).

Best,
Basheer

@Edert
Copy link
Author

Edert commented Dec 20, 2024

Dear Basheer,

thank you that´s great! I did test it and there was no more error message and I do receive counts.

I would have two more questions:

  1. Is there a way to set the hamming threshold lower than 1? I guess not because then it would count only perfect matches?

  2. How do you usually continue with the counts in terms of normalization and testing against the pool? MAGeCK or edgeR/Deseq2 maybe?

Best,
Thomas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants