Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function estimate_oc #38

Open
NPJuncal opened this issue Jan 15, 2024 · 9 comments
Open

function estimate_oc #38

NPJuncal opened this issue Jan 15, 2024 · 9 comments

Comments

@NPJuncal
Copy link
Collaborator

NPJuncal commented Jan 15, 2024

Estimation of organic Carbon is done by means of linear regressions on log(organic carbon) ~ log(organic matter). It gives back a organic carbon value for each organic matter value provided. If there is a organic carbon value for that sample it return the same value, else, generates a model for that site, else, model for specie, else, model for Ecosystem. If a model can not be created due to the low number of samples (<10) it uses the equations in Howard et al. 2014 to estimate the organic carbon.

outputs:
The initial tibble or data.frame with three new columns:

  • one column with estimated organic carbon values (eOC) in %
  • the standard error of the prediction (eOC_se)
  • the type of model used for estimation (origin)
    In addition, a plot with the relationship between organic matter and estimated organic carbon

Works with example data but there is at least one bug.

tasks:

-the function does not work if the oc column is empty. The function estimate several models and then check p values and r2 to see if they are good to use. If there are not good models utilize published equations to estimate oc. But if the oc column is empty it can not estimate any model and it breaks. In this case it should go directly to to the published equations but it does not.

-change equation to estimate oc in salt marsh data if models can not be built. Now it comes from Craft et al. 1991. New equation in Maxwell et al. (2023). https://doi.org/10.1038/s41597-023-02633-x

-add a new output: df with the parameters of the models estimated

-write more test (only a few to check that the function checks type of data provided (numeric vs not numeric), but not about the funtionality of the function itself

-check documentation

enhancement 1: the function divide the df by ecossytem, dominant species and station. Due to this, it is compulsory to provide three columns with ecosystem, species and station. The user may not have diferent species or stations. Allow user to chose if it want this groups or not. Ecosystem column need to be there anyway just in case the models are no good and need to use published functions.

NPJuncal pushed a commit that referenced this issue Jan 17, 2024
…xtra columns with the estimated oc and the model used to estimate each sample, to a list with tho elements. The data frame that was the former output and a aditional object with the parameters of the models.

#38
@NPJuncal
Copy link
Collaborator Author

@costavale
I have tried the function with a database that had the oc column empty and it work perfectly using the literature functions... could you send me the dataframe you try to use to check what the problem was?

@NPJuncal
Copy link
Collaborator Author

Craft et al 1991 function already changed for Maxwell et al 2023

@NPJuncal
Copy link
Collaborator Author

updated tasks:

-check function with @costavale data. The function does work if the oc column is empty. This was not the problem.

-Change 10 sample limit for a warning (see issue #44 ). Right now the function does not fit models if there are less than 10 samples. We can allow it and give a warning. Not all models will be used (models with low r2 or p value are deleted). Need to save which models had less than 10 samples and check with final database if any of this model were used and provide the names of the cores in the warning.

-Check if the om values used to estimate oc are within the range of the samples used to fit the models (see issue #44 ). Possible approach: add the minimum and maximum input om value of each model in a df and then check if the values used to estimate the oc in each core are between them.

-write more test (only a few to check that the function checks type of data provided (numeric vs not numeric), but not about the funtionality of the function itself

-check documentation

enhancement 1: the function divide the df by ecossytem, dominant species and station. Due to this, it is compulsory to provide three columns with ecosystem, species and station. The user may not have diferent species or stations. Allow user to chose if it want this groups or not. Ecosystem column need to be there anyway just in case the models are no good and need to use published functions.

NPJuncal pushed a commit that referenced this issue Feb 23, 2024
… to 3

#38

Also added a warning if the model used had less than 10 samples and if the oc estimated is below or abouve the initial oc range to estimate the mode. Second and third task in issue #38.
Seeing the warning is dificult because the warning for mangrove data with less than 5% of om prints for every sample insted of ones.
I will try to fix this or add it to the issue tasks.
@NPJuncal
Copy link
Collaborator Author

updated tasks:

-check function with @costavale data. The function does work if the oc column is empty. This was not the problem.

-write more test (only a few to check that the function checks type of data provided (numeric vs not numeric), but not about the funtionality of the function itself

-check documentation

enhancement 1: the function divide the df by ecossytem, dominant species and station. Due to this, it is compulsory to provide three columns with ecosystem, species and station. The user may not have diferent species or stations. Allow user to chose if it want this groups or not. Ecosystem column need to be there anyway just in case the models are no good and need to use published functions.

@NPJuncal
Copy link
Collaborator Author

I have solved three bugs I found while using this function with new data:

fit_models would not fit site models if there was only one specie per station
choose model function should not work if site_models or species_models were null
it the input dataset has a column name as one of the temporal columns created by any of the functions they would break

I have already fixed all three. Maybe those were the one @costavale found.

@costavale
Copy link
Collaborator

Hi all, I am going to check these updates and test again the functions with that dataset.
Regarding the rest, how did we split the work? I am sorry I missed some of the previous updates

@Julenasti
Copy link
Collaborator

Yes sorry, I'm a little lost right now. Nerea, if you think I can help you with a specific job let me know. I think I'll be more helpful if the issues you assign me are code related - if you have any

@NPJuncal
Copy link
Collaborator Author

NPJuncal commented Mar 19, 2024

This function is huge and writing test that would test specific steps is a bit messi. I have decided to write test for each of the small functions within it.
For example:
test_that("fit_ecosystem_model function works as expected", {

what do you think @Pakillo ? Maybe I should put all test in the same script insted of opening a new script for each function?

@NPJuncal
Copy link
Collaborator Author

The function crashes
Furthermore, change the Kauffman equation for new equation

NPJuncal pushed a commit that referenced this issue Jul 30, 2024
choose model function tested if the species of that sample was included in the species model, but didnt give and alternative if not.
is fixed, now it skip the species model and select ecosystem model :)
#38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants