Uniform documentation and example Notebooks for all transforms! #753

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Oct 29, 2024 · 15 comments
Labels
enhancement New feature or request simplify-DPK

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

This is a "super-issue" that will affect all transforms! Each transform owner will be assigned two tasks for the transform they own:

  1. Better documentation of each transform, based on a given template (higher priority task)
  2. An example notebook for every transform. The notebooks should be simple to use by taking in the user data, calling the API, and showing the output result. All other code (extra imports, parameter settings) should be hidden away. This will be done based on a notebook template.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@touma-I
Collaborator

touma-I commented Nov 5, 2024

@shahrokhDaijavad can you please follow up with the individual transform owners?

@Bytes-Explorer
Collaborator

@shahrokhDaijavad @agoyal26 Here is my suggestion for the first batch of transforms to be updated. What do you guys think?

  • Exact Dedup
  • Fuzzy Dedup
  • PDF2Parquet
  • HTML2Parquet
  • Doc_chunk
  • Doc_quality
  • text_encoder
  • doc_id
  • HAP

@shahrokhDaijavad
Member Author

This is a good first batch, @Bytes-Explorer. The owners of these transforms are: Boris, Michele, Sung, Tsuzuku-san, and Yang Zhao. We can go ahead and assign them, but we need to discuss what (and a timeline) to expect if they are doing other projects at the moment.

@Bytes-Explorer
Collaborator

I believe you already have the documentation template, which should help all the owners fill it in. Btw, Constantin has taken over all the work from Boris.

@shahrokhDaijavad
Member Author

@cmadam, @dolfim-ibm, @sungeunan-ibm, @dtsuzuku-ibm and @ian-cho: We have started a significant effort to simplify the use of DPK for first-time users.
A high-priority item is to have better and more unified documentation (README files) for each transform. Beyond that, we want an example Jupyter notebook for each transform. For the second step, we will develop a template notebook and share it with you later. But the first step is first!

For the first step, we have a template for what we think should be in the README for each transform (attached).
DPK_Documentation_template.docx

Since you are the owners of the first batch of transforms (listed below), I am going to assign you the task of documenting your transform, with a target date of Nov. 22. Please use your good judgment to do this, based on what the current README has and what the common template is trying to achieve. If current work commitments prevent you from doing this, please comment and suggest a way forward (e.g., a later date, a different person to assign this task to, etc.).

Owners:
Exact dedup, Fuzzy dedup, doc_id: Constantin
PDF2Parquet, Doc_chunk, text_encoder: Michele
HTML2Parquet: Sung
Doc_quality: Tsuzuku-san
HAP: Yang Zhao

@shahrokhDaijavad
Member Author

Hi, @cmadam, @dolfim-ibm, @sungeunan-ibm and @ian-cho . I just looked at the PR @dtsuzuku-ibm has submitted for the documentation of Doc_quality (PR #790) and if you haven't started doing this, you can use that README as a model (easier than the template above).

@shahrokhDaijavad
Member Author

BTW, we are working towards a template for the Jupyter notebook in issue #754, and we will make it more solid in the next couple of days as a model to follow.

@dolfim-ibm
Member

@shahrokhDaijavad here we go #800

@shahrokhDaijavad
Member Author

@dtsuzuku-ibm, @cmadam, @dolfim-ibm, @sungeunan-ibm and @ian-cho (cc: @agoyal26 and @touma-I):
@dtsuzuku-ibm and @dolfim-ibm have finished the documentation of their transforms in PR #790 and PR #800. Based on the discussion we have been having with them about whether to add example code to the README file, we think we should add a simple example Notebook now and link to it from the README (combining steps 1 and 2 above), rather than wait for a "perfect" Notebook template. Having this Notebook example obviates the need for code in the README. The template for this "minimal" Notebook is the example Notebook that Maroun did for html2parquet: https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb. You should adapt this notebook to your transform and add some explanation of what each cell does. The notebook goes in a directory named "notebooks", alongside the python, ray, ... directories for your transform.

@Bytes-Explorer
Collaborator

@shahrokhDaijavad Should we not keep all notebooks in the example folder? Readme can have the link.

@shahrokhDaijavad
Member Author

@Bytes-Explorer I think we should use the example folder for "use cases" that use a sequence of transforms to showcase that use case, e.g., RAG, fine-tuning, etc. IMHO, a single-function Notebook that only shows how to use the transform belongs to the directory of that transform, as it complements the README file of that transform with some real code. If you have a good reason for putting all these single-function notebooks in the examples folder, I change my opinion easily!

@dolfim-ibm
Member

Do I get it right that such a notebook will run only when the transform (in its latest state) is published to PyPI?

When changing the transform, we need to wait for the (pre)release before we can update the notebook, right?

@Bytes-Explorer
Collaborator

> @Bytes-Explorer I think we should use the example folder for "use cases" that use a sequence of transforms to showcase that use case, e.g., RAG, fine-tuning, etc. IMHO, a single-function Notebook that only shows how to use the transform belongs to the directory of that transform, as it complements the README file of that transform with some real code. If you have a good reason for putting all these single-function notebooks in the examples folder, I change my opinion easily!


@shahrokhDaijavad I see where you are coming from. My point of view is that if all examples are in the same folder like this one, a beginner has one place to look for things and get started. Application-specific examples can be in subfolders, like this one and this one.

@touma-I
Collaborator

touma-I commented Nov 15, 2024

> Do I get it right that such a notebook will run only when the transform (in its latest state) is published to PyPI?
>
> When changing the transform, we need to wait for the (pre)release before we can update the notebook, right?

@dolfim-ibm We could set up the venv environment for running the notebook based on either pip install or make venv. I have the feeling most developers will want to use make venv to do any testing of their notebooks and won't want to hit possible issues that the packaging may introduce. I would keep it confined to the specific transform, using make venv for that transform.

@touma-I
Collaborator

touma-I commented Nov 15, 2024

@shahrokhDaijavad @Bytes-Explorer I would lean toward keeping this in the transform folder and not requiring the developer to make it work with Colab. What we are asking here is very specific: for the transform owner to think through how folks use their transform in a notebook, and to reduce as much as possible the number of variables that developers need to deal with.

cmadam added a commit that referenced this issue Nov 26, 2024
cmadam added a commit that referenced this issue Nov 26, 2024
touma-I added a commit that referenced this issue Dec 5, 2024
Update doc for doc_id and ededup to follow template in issue #753