diff --git a/sigpro/sigpro_pipeline_demo.ipynb b/sigpro/sigpro_pipeline_demo.ipynb new file mode 100644 index 0000000..e50e8dc --- /dev/null +++ b/sigpro/sigpro_pipeline_demo.ipynb @@ -0,0 +1,485 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "559d158c", + "metadata": {}, + "source": [ + "# Processing Signals with Pipelines\n", + "\n", + "Now that we have identified and/or generated several primitives for our signal feature generation, we would like to define a reusable *pipeline* for doing so. \n", + "\n", + "First, let's import the required libraries and functions.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "29d47100", + "metadata": {}, + "outputs": [], + "source": [ + "import sigpro\n", + "import numpy as np\n", + "import pandas as pd\n", + "from matplotlib import pyplot as plt\n", + "from sigpro.demo import _load_demo as get_demo" + ] + }, + { + "cell_type": "markdown", + "id": "85e23607", + "metadata": {}, + "source": [ + "\n", + "## Defining Primitives\n", + "\n", + "Recall that we can obtain the list of available primitives with the `get_primitives` method:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "fd8e7afe", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['sigpro.SigPro',\n", + " 'sigpro.aggregations.amplitude.statistical.crest_factor',\n", + " 'sigpro.aggregations.amplitude.statistical.kurtosis',\n", + " 'sigpro.aggregations.amplitude.statistical.mean',\n", + " 'sigpro.aggregations.amplitude.statistical.rms',\n", + " 'sigpro.aggregations.amplitude.statistical.skew',\n", + " 'sigpro.aggregations.amplitude.statistical.std',\n", + " 'sigpro.aggregations.amplitude.statistical.var',\n", + " 'sigpro.aggregations.frequency.band.band_mean',\n", + " 'sigpro.transformations.amplitude.identity.identity',\n", + " 'sigpro.transformations.amplitude.spectrum.power_spectrum',\n", + " 'sigpro.transformations.frequency.band.frequency_band',\n", + " 
'sigpro.transformations.frequency.fft.fft',\n", + "  'sigpro.transformations.frequency.fft.fft_real',\n", + "  'sigpro.transformations.frequency_time.stft.stft',\n", + "  'sigpro.transformations.frequency_time.stft.stft_real']" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from sigpro import get_primitives\n", + "\n", + "get_primitives()" + ] + }, + { + "cell_type": "markdown", + "id": "e7a5d87d", + "metadata": {}, + "source": [ + "In addition, we can also define our own custom primitives.\n", + "\n", + "## Building a Pipeline\n", + "\n", + "Let’s go ahead and define a feature processing pipeline that sequentially applies the `identity` and `fft` transformations before applying the `std` aggregation. To pass these primitives into the signal processor, we must write each primitive as a dictionary with the following fields:\n", + "\n", + "- `name`: Name of the transformation / aggregation.\n", + "- `primitive`: Name of the primitive to apply.\n", + "- `init_params` (optional): Dictionary containing the initialization parameters for the primitive.\n", + "\n", + "Since we choose not to specify any initialization parameters, we do not set `init_params` in these dictionaries." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "748893d2", + "metadata": {}, + "outputs": [], + "source": [ + "identity_transform = {'name': 'identity1',\n", + "                      'primitive': 'sigpro.transformations.amplitude.identity.identity'}\n", + "\n", + "fft_transform = {'name': 'fft1',\n", + "                 'primitive': 'sigpro.transformations.frequency.fft.fft'}\n", + "\n", + "std_agg = {'name': 'std1',\n", + "           'primitive': 'sigpro.aggregations.amplitude.statistical.std'}" + ] + }, + { + "cell_type": "markdown", + "id": "bb97a448", + "metadata": {}, + "source": [ + "\n", + "We now define a new pipeline containing the primitives we would like to apply. 
At minimum, we will need to pass in a list of transformations and a list of aggregations; the full list of available arguments is given below.\n", + "\n", + "- Inputs:\n", + "    - `transformations (list)`: List of dictionaries containing the transformation primitives.\n", + "    - `aggregations (list)`: List of dictionaries containing the aggregation primitives.\n", + "    - `values_column_name (str)` (optional): The name of the column that contains the signal values. Defaults to `'values'`.\n", + "    - `keep_columns (Union[bool, list])` (optional): Whether to keep non-feature columns in the output DataFrame. If a list of column names is passed, those columns are kept. Defaults to `False`.\n", + "    - `input_is_dataframe (bool)` (optional): Whether the input is a pandas DataFrame. Defaults to `True`.\n", + "\n", + "Returning to the example:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "755c6442", + "metadata": {}, + "outputs": [], + "source": [ + "transformations = [identity_transform, fft_transform]\n", + "\n", + "aggregations = [std_agg]\n", + "\n", + "mypipeline = sigpro.SigPro(transformations, aggregations, values_column_name='yvalues', keep_columns=True)" + ] + }, + { + "cell_type": "markdown", + "id": "1a69f6b7", + "metadata": {}, + "source": [ + "\n", + "SigPro will proceed to build an `MLPipeline` that can be reused to build features.\n", + "\n", + "To check that `mypipeline` was defined correctly, we can inspect its input and output arguments with the `get_input_args` and `get_output_args` methods." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "efea71e3", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[{'name': 'readings', 'keyword': 'data', 'type': 'pandas.DataFrame'}, {'name': 'feature_columns', 'default': None, 'type': 'list'}]\n", + "[{'name': 'readings', 'type': 'pandas.DataFrame'}, {'name': 'feature_columns', 'type': 'list'}]\n" + ] + } + ], + "source": [ + "input_args = mypipeline.get_input_args()\n", + "output_args = mypipeline.get_output_args()\n", + "\n", + "print(input_args)\n", + "print(output_args)" + ] + }, + { + "cell_type": "markdown", + "id": "7de8fe01", + "metadata": {}, + "source": [ + "## Applying a Pipeline with `process_signal`\n", + "\n", + "Once our pipeline is correctly defined, we apply the `process_signal` method to a demo dataset. Recall that `process_signal` is defined as follows:\n", + "\n", + "```python\n", + "def process_signal(self, data=None, window=None, time_index=None, groupby_index=None,\n", + "                   feature_columns=None, **kwargs):\n", + "    ...\n", + "    return data, feature_columns\n", + "```\n", + "\n", + "`process_signal` accepts the following input arguments:\n", + "\n", + "- `data (pd.DataFrame)`: DataFrame with a column containing signal values.\n", + "- `window (str)`: Duration of the aggregation window, e.g. `'1h'`.\n", + "- `time_index (str)`: Name of the column in `data` that represents the time index.\n", + "- `groupby_index (str or list[str])`: Name(s) of the column(s) to group by before taking the window over each group.\n", + "- `feature_columns (list)`: List of columns from the input data that should be considered as features (and not dropped).\n", + "\n", + "`process_signal` outputs the following:\n", + "\n", + "- `data (pd.DataFrame)`: DataFrame containing the output feature values as constructed from the signal.\n", + "- `feature_columns (list)`: List of (generated) feature names.\n", + "\n", + "We now apply our pipeline to the demo dataset. 
We load the demo dataset as follows: " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "05a2d02e", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
turbine_idsignal_idxvaluesyvaluessampling_frequency
0T001Sensor1_signal12020-01-01 00:00:00[0.43616983763682876, -0.17662312586241055, 0....1000
1T001Sensor1_signal12020-01-01 01:00:00[0.8023828754411122, -0.14122063493312714, -0....1000
2T001Sensor1_signal12020-01-01 02:00:00[-1.3143142430046044, -1.1055740033788437, -0....1000
3T001Sensor1_signal12020-01-01 03:00:00[-0.45981995520032104, -0.3255426061995603, -0...1000
4T001Sensor1_signal12020-01-01 04:00:00[-0.6380405111460377, -0.11924167777027689, 0....1000
\n", + "
" + ], + "text/plain": [ + " turbine_id signal_id xvalues \\\n", + "0 T001 Sensor1_signal1 2020-01-01 00:00:00 \n", + "1 T001 Sensor1_signal1 2020-01-01 01:00:00 \n", + "2 T001 Sensor1_signal1 2020-01-01 02:00:00 \n", + "3 T001 Sensor1_signal1 2020-01-01 03:00:00 \n", + "4 T001 Sensor1_signal1 2020-01-01 04:00:00 \n", + "\n", + " yvalues sampling_frequency \n", + "0 [0.43616983763682876, -0.17662312586241055, 0.... 1000 \n", + "1 [0.8023828754411122, -0.14122063493312714, -0.... 1000 \n", + "2 [-1.3143142430046044, -1.1055740033788437, -0.... 1000 \n", + "3 [-0.45981995520032104, -0.3255426061995603, -0... 1000 \n", + "4 [-0.6380405111460377, -0.11924167777027689, 0.... 1000 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "demo_dataset = get_demo()\n", + "demo_dataset.columns = ['turbine_id', 'signal_id', 'xvalues', 'yvalues', 'sampling_frequency']\n", + "demo_dataset.head()" + ] + }, + { + "cell_type": "markdown", + "id": "74f2f337", + "metadata": {}, + "source": [ + "Finally, we apply the `process_signal` method of our previously defined pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "b97344a2", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
turbine_idsignal_idxvaluesyvaluessampling_frequencyidentity1.fft1.std1.std_value
0T001Sensor1_signal12020-01-01 00:00:00[0.43616983763682876, -0.17662312586241055, 0....100014.444991
1T001Sensor1_signal12020-01-01 01:00:00[0.8023828754411122, -0.14122063493312714, -0....100012.326223
2T001Sensor1_signal12020-01-01 02:00:00[-1.3143142430046044, -1.1055740033788437, -0....100012.051415
3T001Sensor1_signal12020-01-01 03:00:00[-0.45981995520032104, -0.3255426061995603, -0...100010.657243
4T001Sensor1_signal12020-01-01 04:00:00[-0.6380405111460377, -0.11924167777027689, 0....100012.640728
\n", + "
" + ], + "text/plain": [ + " turbine_id signal_id xvalues \\\n", + "0 T001 Sensor1_signal1 2020-01-01 00:00:00 \n", + "1 T001 Sensor1_signal1 2020-01-01 01:00:00 \n", + "2 T001 Sensor1_signal1 2020-01-01 02:00:00 \n", + "3 T001 Sensor1_signal1 2020-01-01 03:00:00 \n", + "4 T001 Sensor1_signal1 2020-01-01 04:00:00 \n", + "\n", + " yvalues sampling_frequency \\\n", + "0 [0.43616983763682876, -0.17662312586241055, 0.... 1000 \n", + "1 [0.8023828754411122, -0.14122063493312714, -0.... 1000 \n", + "2 [-1.3143142430046044, -1.1055740033788437, -0.... 1000 \n", + "3 [-0.45981995520032104, -0.3255426061995603, -0... 1000 \n", + "4 [-0.6380405111460377, -0.11924167777027689, 0.... 1000 \n", + "\n", + " identity1.fft1.std1.std_value \n", + "0 14.444991 \n", + "1 12.326223 \n", + "2 12.051415 \n", + "3 10.657243 \n", + "4 12.640728 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "processed_data, feature_columns = mypipeline.process_signal(demo_dataset, time_index = 'xvalues')\n", + "\n", + "processed_data.head()\n" + ] + }, + { + "cell_type": "markdown", + "id": "f04273a4", + "metadata": {}, + "source": [ + "\n", + "Success! We have managed to apply the primitives to generate features on the input dataset.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d804c3d2", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.13" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}