Hashes to identify input, outputs and output_annotations data entries #155

Open
oplatek opened this issue Nov 27, 2024 · 0 comments
Labels
enhancement New feature or request

oplatek commented Nov 27, 2024

Our dataset management can be illustrated by the dependencies between entries, i.e. by how each entry is generated.

input(dataset, split) -> NLG process -> output(NLG_system_id)
output(NLG_system_id) -> ANNOTATION_PROCESS -> annotations_of_output(campaign_details, ...)

Since many properties could identify input, output, and output_annotations, I think it is best to use hashes to identify inputs, outputs, and list_of_example_annotations.
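To make the idea concrete, here is a minimal sketch of how a set of identifying properties could be turned into one stable hash. The function name `entry_hash` and the property names are hypothetical and not part of factgenie; SHA-256 over canonical JSON is just one possible choice.

```python
import hashlib
import json


def entry_hash(properties: dict) -> str:
    """Return a stable hex digest for a dict of identifying properties."""
    # Canonical JSON (sorted keys, fixed separators) so that key order
    # does not influence the resulting hash.
    canonical = json.dumps(
        properties, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical property names, for illustration only.
a = entry_hash({"dataset": "my-dataset", "split": "test", "example": 0})
b = entry_hash({"split": "test", "example": 0, "dataset": "my-dataset"})
c = entry_hash({"dataset": "my-dataset", "split": "dev", "example": 0})
```

With this sketch, `a` and `b` are identical because only the key order differs, while `c` differs because the split changed.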

I imagine that each data entry will have a hash:

input
  - input_idx  # identifies the dataset, split, and particular example, how the example was preprocessed/rendered by factgenie, etc.

output
  - input_idx  # reference to the exact input which was used for generation
  - output_idx  # uniquely identifying the output

annotations_list
  - output_idx  # uniquely identifying which output was annotated
  - annotations_idx  # uniquely identifying the annotation list
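
The layout above could be sketched roughly as follows. This is only an illustration under assumptions: the class names, the `make_*` helpers, and the fields that feed each hash (`system_id`, `campaign_id`, and so on) are hypothetical and would need to match what factgenie actually stores. The point is that each hash includes the hash of the entry it depends on, so the references chain back to the exact input.

```python
import hashlib
import json
from dataclasses import dataclass


def _digest(payload: dict) -> str:
    # Stable hash of a JSON-serializable payload (hypothetical helper).
    data = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Input:
    input_idx: str


@dataclass(frozen=True)
class Output:
    input_idx: str   # reference to the exact input used for generation
    output_idx: str  # uniquely identifies the output


@dataclass(frozen=True)
class AnnotationsList:
    output_idx: str       # which output was annotated
    annotations_idx: str  # uniquely identifies the annotation list


def make_input(dataset: str, split: str, example: int, rendered: str) -> Input:
    return Input(_digest({"dataset": dataset, "split": split,
                          "example": example, "rendered": rendered}))


def make_output(inp: Input, system_id: str, text: str) -> Output:
    # The input hash is part of the payload, so the output hash depends on it.
    return Output(inp.input_idx, _digest({"input_idx": inp.input_idx,
                                          "system_id": system_id, "text": text}))


def make_annotations(out: Output, campaign_id: str, annotations: list) -> AnnotationsList:
    return AnnotationsList(out.output_idx, _digest({"output_idx": out.output_idx,
                                                    "campaign_id": campaign_id,
                                                    "annotations": annotations}))


inp = make_input("my-dataset", "test", 0, "rendered example")
out = make_output(inp, "system-a", "generated text")
ann = make_annotations(out, "campaign-1", [{"start": 0, "type": 1}])

# Same output text generated from a different input gets a different output_idx.
other = make_output(make_input("my-dataset", "test", 1, "rendered example"),
                    "system-a", "generated text")
```

One consequence of chaining the hashes this way is that re-rendering an input yields new `output_idx` and `annotations_idx` values downstream, which may or may not be the desired behavior.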
  
oplatek added the enhancement label on Nov 27, 2024