Commit

deploy: 5cfdba4
github-actions[bot] committed Jul 5, 2024
1 parent 24bfb36 commit ce065f5
Showing 4 changed files with 10 additions and 10 deletions.
Binary file modified .doctrees/developer_notes/base_data_class.doctree
Binary file not shown.
Binary file modified .doctrees/environment.pickle
Binary file not shown.
10 changes: 5 additions & 5 deletions _sources/developer_notes/base_data_class.rst.txt
@@ -10,7 +10,7 @@ DataClass
In `PyTorch`, ``Tensor`` is the data type used in ``Module`` and ``Optimizer`` across the library.
Tensor wraps a multi-dimensional matrix to better support its operations and computations.
In LLM applications, data constantly needs to interact with LLMs in the form of strings via prompt and be parsed back to structured data from LLMs' text prediction.
:class:`core.base_data_class.DataClass` is designed to ease the data interaction with LLMs via prompt(input) and text prediction(output).
:class:`DataClass<core.base_data_class.DataClass>` is designed to ease the data interaction with LLMs via prompt(input) and text prediction(output).

.. figure:: /_static/images/dataclass.png
:align: center
@@ -28,7 +28,7 @@ This overlaps with the serialization and deserialization of the data in the conv
Packages like ``Pydantic`` or ``Marshmallow`` can cover the serialization and deserialization, but they end up with more complexity and less transparency to users.
LLM prompts are known to be sensitive, so the details, controllability, and transparency of the data format are crucial here.

We eventually created a base class :class:`core.base_data_class.DataClass` to handle data that will interact with LLMs, which builds on top of Python's native ``dataclasses`` module.
We eventually created a base class :class:`DataClass<core.base_data_class.DataClass>` to handle data that will interact with LLMs, which builds on top of Python's native ``dataclasses`` module.
Here is our reasoning:

1. ``dataclasses`` module is lightweight, flexible, and is already widely used in Python for data classes.
@@ -60,7 +60,7 @@ Here is how users typically use the ``dataclasses`` module:

We also made the effort to provide more control:

1. **Keep the ordering of your data fields.** We provided :func:`core.base_data_class.required_field` with ``default_factory`` to mark the field as required even if it is after optional fields. We also had to customize the conversion to preserve their ordering when the data is converted to a dictionary, JSON, or YAML string.
1. **Keep the ordering of your data fields.** We provided :func:`required_field<core.base_data_class.required_field>` with ``default_factory`` to mark the field as required even if it is after optional fields. We also had to customize the conversion to preserve their ordering when the data is converted to a dictionary, JSON, or YAML string.
2. **Exclude some fields from the output.** All serialization methods support an `exclude` parameter to exclude some fields, even for nested dataclasses.
3. **Allow nested dataclasses, lists, and dictionaries.** All methods support nested dataclasses, lists, and dictionaries.
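
The three controls above can be sketched with the standard library alone. This is a minimal illustration, not the library's implementation: the helper ``to_ordered_dict``, the shape of its ``exclude`` argument (class name mapped to field names), and the ``Metadata`` and ``Question`` classes are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, Dict, List, Optional


def to_ordered_dict(obj: Any, exclude: Optional[Dict[str, List[str]]] = None) -> Any:
    """Hypothetical helper: convert nested dataclasses, lists, and dicts to plain data.

    `exclude` maps a dataclass name to the field names to drop (assumed shape).
    """
    exclude = exclude or {}
    if is_dataclass(obj) and not isinstance(obj, type):
        skipped = set(exclude.get(type(obj).__name__, []))
        # fields() yields fields in declaration order, and dicts keep insertion
        # order, so the output follows the order the fields were declared in.
        return {
            f.name: to_ordered_dict(getattr(obj, f.name), exclude)
            for f in fields(obj)
            if f.name not in skipped
        }
    if isinstance(obj, (list, tuple)):
        return [to_ordered_dict(item, exclude) for item in obj]
    if isinstance(obj, dict):
        return {key: to_ordered_dict(value, exclude) for key, value in obj.items()}
    return obj


@dataclass
class Metadata:
    source: str
    internal_id: int


@dataclass
class Question:
    question: str
    label: int
    metadata: Metadata
    tags: List[str] = field(default_factory=list)


example = Question("What is the capital of France?", 5, Metadata("trec", 42), ["geo"])
# Drop `internal_id` from the nested Metadata instance only.
as_dict = to_ordered_dict(example, exclude={"Metadata": ["internal_id"]})
as_json = json.dumps(as_dict, indent=2)
```

Keying ``exclude`` by class name is one way to reach into nested dataclasses without excluding a same-named field on the parent; the real methods may use a different convention.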

@@ -113,7 +113,7 @@ Work with Data Instance
* - ``format_example_str(self, format_type, exclude) -> str``
- Generate data examples string, covers ``to_json`` and ``to_yaml``.

We have :class:`core.base_data_class.DataClassFormatType` to specify the format type for the data format methods.
We have :class:`DataclassFormatType<core.base_data_class.DataClassFormatType>` to specify the format type for the data format methods.

.. note::

@@ -173,7 +173,7 @@ Describe the data format to LLMs
We will create a ``TrecData2`` class that subclasses `DataClass`.
You decide to add a field ``metadata`` to the ``TrecData`` class to store the metadata of the question.
For your own reasons, you want ``metadata`` to be a required field and you want to keep the ordering of your fields when they are converted to strings.
``DataClass`` will help you achieve this using :func:`core.base_data_class.required_field` on the `default_factory` of the field.
``DataClass`` will help you achieve this using :func:`required_field<core.base_data_class.required_field>` on the `default_factory` of the field.
Normally, this is not possible with the native `dataclasses` module as it will raise an error if you put a required field after an optional field.
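
A minimal sketch of the idea, using only the standard library: the native error is shown first, then a ``default_factory`` that raises when no value is supplied. ``demo_required_field`` and ``TrecDataSketch`` are hypothetical stand-ins written for this example; they are not the library's ``required_field`` or the ``TrecData2`` class.

```python
from dataclasses import dataclass, field, fields


def demo_required_field():
    """Hypothetical stand-in: a default_factory that raises when no value is given."""
    raise TypeError("a required field was not provided")


# Native behavior: a field without a default placed after a field with a
# default is rejected when the class is created.
try:
    @dataclass
    class Broken:
        question: str = ""
        metadata: dict  # no default after a defaulted field
    native_error = None
except TypeError as err:
    native_error = str(err)


# Workaround sketch: give the field a default_factory so the class definition
# is accepted, but let the factory raise so the field is still effectively required.
@dataclass
class TrecDataSketch:
    question: str = ""
    metadata: dict = field(default_factory=demo_required_field)
    label: int = 0


ok = TrecDataSketch(question="What is AI?", metadata={"source": "trec"})
field_order = [f.name for f in fields(TrecDataSketch)]

try:
    TrecDataSketch(question="What is AI?")  # metadata omitted
    missing_error = None
except TypeError as err:
    missing_error = str(err)
```

Because the generated ``__init__`` only calls the factory when the argument is omitted, passing ``metadata`` explicitly works as usual, and the declared field order is left untouched.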

.. note::
10 changes: 5 additions & 5 deletions developer_notes/base_data_class.html
@@ -481,7 +481,7 @@
<p>In <cite>PyTorch</cite>, <code class="docutils literal notranslate"><span class="pre">Tensor</span></code> is the data type used in <code class="docutils literal notranslate"><span class="pre">Module</span></code> and <code class="docutils literal notranslate"><span class="pre">Optimizer</span></code> across the library.
Tensor wraps a multi-dimensional matrix to better support its operations and computations.
In LLM applications, data constantly needs to interact with LLMs in the form of strings via prompt and be parsed back to structured data from LLMs’ text prediction.
<a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClass" title="core.base_data_class.DataClass"><code class="xref py py-class docutils literal notranslate"><span class="pre">core.base_data_class.DataClass</span></code></a> is designed to ease the data interaction with LLMs via prompt(input) and text prediction(output).</p>
<a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClass" title="core.base_data_class.DataClass"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataClass</span></code></a> is designed to ease the data interaction with LLMs via prompt(input) and text prediction(output).</p>
<figure class="align-center" id="id1">
<a class="reference internal image-reference" href="../_images/dataclass.png"><img alt="DataClass" src="../_images/dataclass.png" style="width: 680px;" />
</a>
@@ -496,7 +496,7 @@ <h2>Design<a class="headerlink" href="#design" title="Link to this heading">#</a
This overlaps with the serialization and deserialization of the data in the conventional programming.
Packages like <code class="docutils literal notranslate"><span class="pre">Pydantic</span></code> or <code class="docutils literal notranslate"><span class="pre">Marshmallow</span></code> can cover the serialization and deserialization, but they end up with more complexity and less transparency to users.
LLM prompts are known to be sensitive, so the details, controllability, and transparency of the data format are crucial here.</p>
<p>We eventually created a base class <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClass" title="core.base_data_class.DataClass"><code class="xref py py-class docutils literal notranslate"><span class="pre">core.base_data_class.DataClass</span></code></a> to handle data that will interact with LLMs, which builds on top of Python’s native <code class="docutils literal notranslate"><span class="pre">dataclasses</span></code> module.
<p>We eventually created a base class <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClass" title="core.base_data_class.DataClass"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataClass</span></code></a> to handle data that will interact with LLMs, which builds on top of Python’s native <code class="docutils literal notranslate"><span class="pre">dataclasses</span></code> module.
Here is our reasoning:</p>
<ol class="arabic simple">
<li><p><code class="docutils literal notranslate"><span class="pre">dataclasses</span></code> module is lightweight, flexible, and is already widely used in Python for data classes.</p></li>
@@ -525,7 +525,7 @@ <h2>Design<a class="headerlink" href="#design" title="Link to this heading">#</a
</ol>
<p>We also made the effort to provide more control:</p>
<ol class="arabic simple">
<li><p><strong>Keep the ordering of your data fields.</strong> We provided <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.required_field" title="core.base_data_class.required_field"><code class="xref py py-func docutils literal notranslate"><span class="pre">core.base_data_class.required_field()</span></code></a> with <code class="docutils literal notranslate"><span class="pre">default_factory</span></code> to mark the field as required even if it is after optional fields. We also had to customize the conversion to preserve their ordering when the data is converted to a dictionary, JSON, or YAML string.</p></li>
<li><p><strong>Keep the ordering of your data fields.</strong> We provided <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.required_field" title="core.base_data_class.required_field"><code class="xref py py-func docutils literal notranslate"><span class="pre">required_field</span></code></a> with <code class="docutils literal notranslate"><span class="pre">default_factory</span></code> to mark the field as required even if it is after optional fields. We also had to customize the conversion to preserve their ordering when the data is converted to a dictionary, JSON, or YAML string.</p></li>
<li><p><strong>Exclude some fields from the output.</strong> All serialization methods support an <cite>exclude</cite> parameter to exclude some fields, even for nested dataclasses.</p></li>
<li><p><strong>Allow nested dataclasses, lists, and dictionaries.</strong> All methods support nested dataclasses, lists, and dictionaries.</p></li>
</ol>
@@ -604,7 +604,7 @@ <h3>Work with Data Instance<a class="headerlink" href="#work-with-data-instance"
</tbody>
</table>
</div>
<p>We have <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClassFormatType" title="core.base_data_class.DataClassFormatType"><code class="xref py py-class docutils literal notranslate"><span class="pre">core.base_data_class.DataClassFormatType</span></code></a> to specify the format type for the data format methods.</p>
<p>We have <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.DataClassFormatType" title="core.base_data_class.DataClassFormatType"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataclassFormatType</span></code></a> to specify the format type for the data format methods.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>To use <code class="docutils literal notranslate"><span class="pre">DataClass</span></code>, you have to decorate your class with the <code class="docutils literal notranslate"><span class="pre">dataclass</span></code> decorator from the <code class="docutils literal notranslate"><span class="pre">dataclasses</span></code> module.</p>
@@ -640,7 +640,7 @@ <h3>Describe the data format to LLMs<a class="headerlink" href="#describe-the-da
<p>We will create a <code class="docutils literal notranslate"><span class="pre">TrecData2</span></code> class that subclasses <cite>DataClass</cite>.
You decide to add a field <code class="docutils literal notranslate"><span class="pre">metadata</span></code> to the <code class="docutils literal notranslate"><span class="pre">TrecData</span></code> class to store the metadata of the question.
For your own reasons, you want <code class="docutils literal notranslate"><span class="pre">metadata</span></code> to be a required field and you want to keep the ordering of your fields when they are converted to strings.
<code class="docutils literal notranslate"><span class="pre">DataClass</span></code> will help you achieve this using <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.required_field" title="core.base_data_class.required_field"><code class="xref py py-func docutils literal notranslate"><span class="pre">core.base_data_class.required_field()</span></code></a> on the <cite>default_factory</cite> of the field.
<code class="docutils literal notranslate"><span class="pre">DataClass</span></code> will help you achieve this using <a class="reference internal" href="../apis/core/core.base_data_class.html#core.base_data_class.required_field" title="core.base_data_class.required_field"><code class="xref py py-func docutils literal notranslate"><span class="pre">required_field</span></code></a> on the <cite>default_factory</cite> of the field.
Normally, this is not possible with the native <cite>dataclasses</cite> module as it will raise an error if you put a required field after an optional field.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>