Merge pull request #11 from NERC-CEH/diagram_view
Diagram view of first stage pipeline (from sampling instrument to shared storage)
metazool authored Aug 8, 2024
2 parents 043c6a5 + 830a4fb commit 4171bb1
Showing 10 changed files with 282 additions and 2 deletions.
10 changes: 8 additions & 2 deletions .github/workflows/lint.yml
@@ -1,7 +1,13 @@
 name: flake8 Lint
 
-on: [push, pull_request]
-
+on:
+  push:
+    paths:
+      - "cyto_ml/**"
+  pull_request:
+    paths:
+      - "cyto_ml/**"
+
 jobs:
   flake8-lint:
     runs-on: ubuntu-latest
59 changes: 59 additions & 0 deletions .github/workflows/pages_graphs.yml
@@ -0,0 +1,59 @@
name: Pages and Graphviz re-render
on:
  push:
    paths: 'docs/**/*'

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    name: Rebuild graphs and pages
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: docs
    steps:
      - uses: actions/checkout@v4
      - name: Setup Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.3' # Not needed with a .ruby-version file
          bundler-cache: true # runs 'bundle install' and caches installed gems automatically
          cache-version: 0 # Increment this number if you need to re-download cached gems
          working-directory: '${{ github.workspace }}/docs'
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v3
      - name: Build with Jekyll
        # Outputs to the './_site' directory by default
        # Will this copy the diagrams tho
        run: bundle exec jekyll build --baseurl "${{ steps.pages.outputs.base_path }}"
        env:
          JEKYLL_ENV: production
      - uses: ts-graphviz/setup-graphviz@v2
      - name: Diagrams
        run: chmod +x ../scripts/render_diagrams.sh; bash ../scripts/render_diagrams.sh
      - name: Upload artifact
        # Automatically uploads an artifact from the './_site' directory by default
        uses: actions/upload-pages-artifact@v1
        with:
          path: "docs/_site"

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
35 changes: 35 additions & 0 deletions docs/Gemfile
@@ -0,0 +1,35 @@
source "https://rubygems.org"
# Hello! This is where you manage which Jekyll version is used to run.
# When you want to use a different version, change it below, save the
# file and run `bundle install`. Run Jekyll with `bundle exec`, like so:
#
# bundle exec jekyll serve
#
# This will help ensure the proper Jekyll version is running.
# Happy Jekylling!
#gem "jekyll", "~> 4.3.3"
# This is the default theme for new Jekyll sites. You may change this to anything you like.
gem "minima", "~> 2.5"
# If you want to use GitHub Pages, remove the "gem "jekyll"" above and
# uncomment the line below. To upgrade, run `bundle update github-pages`.
gem "github-pages", "~> 231", group: :jekyll_plugins
gem "webrick"
gem "just-the-docs"
# If you have any plugins, put them here!
group :jekyll_plugins do
  gem "jekyll-feed", "~> 0.12"
end

# Windows and JRuby do not include zoneinfo files, so bundle the tzinfo-data gem
# and associated library.
platforms :mingw, :x64_mingw, :mswin, :jruby do
  gem "tzinfo", ">= 1", "< 3"
  gem "tzinfo-data"
end

# Performance-booster for watching directories on Windows
gem "wdm", "~> 0.1.1", :platforms => [:mingw, :x64_mingw, :mswin]

# Lock `http_parser.rb` gem to `v0.6.x` on JRuby builds since newer versions of the gem
# do not have a Java counterpart.
gem "http_parser.rb", "~> 0.6.0", :platforms => [:jruby]
12 changes: 12 additions & 0 deletions docs/_config.yml
@@ -0,0 +1,12 @@
title: Plankton ML / pipelines
email: [email protected]
description: >- # this means to ignore newlines until "baseurl:"
  This repository contains code, proofs of concept, test cases and workflows for low-investment methods to apply image machine learning to plankton characterisation.
baseurl: "" # the subpath of your site, e.g. /blog
url: "" # the base hostname & protocol for your site, e.g. http://example.com
github_username: metazool

# Build settings
theme: just-the-docs
plugins:
- jekyll-feed
34 changes: 34 additions & 0 deletions docs/diagrams/as_is/instrument_to_store.dot
@@ -0,0 +1,34 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;
graph [fontname = "Handlee"];
node [fontname = "Handlee"];
edge [fontname = "Handlee"];

bgcolor=transparent;

scope [shape=rect label="Microscope \n(FlowCam)"];
pc [shape=rect label="Local PC"]

scope2 [shape=rect label="Laser Imaging \n(Flow Cytometer)"];
pc2 [shape=rect label="Local PC"]

san [shape=cylinder label="SAN \nprivate cloud"]
vm [shape=rect label="VM \nprivate cloud"]
store [shape=cylinder label="S3 \nobject store"]

vm->store [label="triggered by app?" fontsize=10];
scope->pc
scope2->pc2

pc2->san [label="physically, via USB stick", fontsize=10];
pc->san [label="physically, via USB stick", fontsize=10];


san->vm [dir=back] [label="manually run script" fontsize=10];

}



33 changes: 33 additions & 0 deletions docs/diagrams/could_be/instrument_to_store.dot
@@ -0,0 +1,33 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;
graph [fontname = "Handlee"];
node [fontname = "Handlee"];
edge [fontname = "Handlee"];

bgcolor=transparent;

scope [shape=rect label="Microscope \n(FlowCam)"];
pc [shape=rect label="Local PC"]

scope2 [shape=rect label="Laser imaging \n(Flow Cytometer)"];
pc2 [shape=rect label="Local PC"]

san [shape=cylinder label="SAN \nprivate cloud"]
engine [shape=rect label="Workflow engine"]
tasks [label="Task graph"]
store [shape=cylinder label="S3 \nobject store"]

engine->tasks
tasks->san;
tasks->store [];
scope->pc
scope2->pc2

pc2->san [label="pull on a schedule?", dir=back,fontsize=10];

pc->san [label="push on a schedule?", fontsize=10];

}

22 changes: 22 additions & 0 deletions docs/diagrams/could_be/task_graph.dot
@@ -0,0 +1,22 @@
# http://www.graphviz.org/content/cluster

digraph G {
rankdir=LR;

edge [fontname = "Handlee"];

graph [fontsize=10 fontname="Handlee"];
node [shape=record fontsize=10 fontname="Handlee"];

bgcolor=transparent;

subgraph cluster_0 {
style=filled;
color=lightgrey;
node [color=white,style=filled];
store -> chunk -> sift -> profile -> upload;
label = "Task flow";
fontsize = 20;
}
}

33 changes: 33 additions & 0 deletions docs/diagrams/index.md
@@ -0,0 +1,33 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: home
title: Plankton ML - workflow diagrams
---

# Workflow Diagrams

Views of the flow of data from the imaging instrument to cloud-accessible storage.

### As is

Data saved during a session with the microscope is downloaded onto a USB key, then uploaded from a researcher's laptop into a shared storage area on a site-specific SAN.

Later, a data scientist logs into a virtual machine in the on-premises "private cloud" and runs several scripts to read the data, process it for analysis, and then upload it to S3 storage hosted at JASMIN. Authorisation in this chain requires personal credentials.

<object data="as_is/instrument_to_store.svg" type="image/svg+xml">
</object>

File naming conventions carry metadata, which does not follow the same path as the data, and the samples have spatio-temporal properties which could be recorded.

### Could be

The PC that drives the instrument is connected to the storage network, but not to the internet (for security standards compliance reasons). What are the current precedents for either saving output directly to shared storage, or running a watcher process that pulls or pushes data from a lab PC to networked storage?
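
A watcher of the kind asked about above could, in its simplest form, be a function run on a schedule that copies any files not yet present on the shared storage. The following is a minimal sketch only: `push_new_files` and both directory arguments are hypothetical names, and it assumes the shared storage is mounted as an ordinary directory on the lab PC.

```python
import shutil
from pathlib import Path


def push_new_files(source: Path, dest: Path) -> list[Path]:
    """Copy files in `source` that are not yet in `dest`; return the new copies."""
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        target = dest / path.name
        # Skip subdirectories and anything already copied on an earlier run
        if path.is_file() and not target.exists():
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

Because files that already exist at the destination are skipped, the function can be called repeatedly (for example from cron or Windows Task Scheduler) without re-copying data. It does not detect files that are still being written by the instrument, which a real watcher would need to handle.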

An automated workflow (possibly based on Apache Airflow or Beam; the FDRI project is trialling components) would watch for new source data, distribute the preprocessing with Dask or Spark if necessary, and continuously publish analysis-ready data _and metadata_ to cloud storage.

<object data="could_be/instrument_to_store.svg" type="image/svg+xml">
</object>
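
As a rough illustration of the task flow, the sketch below passes a batch of file names through named stages in order. The stage names (chunk, sift, profile, upload) come from the task graph diagram in this commit; what each stage actually does is not specified here, so the stage bodies are placeholders and the `.tif` filter in `sift` is purely an assumption. A workflow engine would replace `run_stages` with its own scheduling.

```python
from typing import Callable

Batch = list[str]
Stage = Callable[[Batch], Batch]


def run_stages(batch: Batch, stages: list[tuple[str, Stage]]) -> Batch:
    """Pass a batch of file names through each named stage in order."""
    for name, stage in stages:
        batch = stage(batch)
        print(f"{name}: {len(batch)} item(s) remaining")
    return batch


def passthrough(batch: Batch) -> Batch:
    """Placeholder for a stage whose behaviour is not defined yet."""
    return batch


def sift(batch: Batch) -> Batch:
    """Hypothetical filter: keep only names that look like TIFF images."""
    return [name for name in batch if name.endswith(".tif")]


STAGES: list[tuple[str, Stage]] = [
    ("chunk", passthrough),
    ("sift", sift),
    ("profile", passthrough),
    ("upload", passthrough),
]

result = run_stages(["a.tif", "notes.txt", "b.tif"], STAGES)
```

Keeping each stage as a plain function from batch to batch means the same functions could later be wrapped as tasks in whichever engine is chosen.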


20 changes: 20 additions & 0 deletions docs/index.md
@@ -0,0 +1,20 @@
---
# Feel free to add content and custom Front Matter to this file.
# To modify the layout, see https://jekyllrb.com/docs/themes/#overriding-theme-defaults

layout: home
title: Plankton ML
---

# Plankton ML

This is a small experimental project on automating the analysis of plankton images. Its aims are to:

* Inform related work on reproducible analytical pipelines for bioimage machine learning by grounding them in a concrete use case
* Evaluate reusable components (e.g. the Cefas plankton model from scivision) and associated trade-offs
* Evolve a shared template for similar smaller projects undertaken by members of the RSE group in the Environmental Data Service, UK Centre for Ecology and Hydrology

Please see the associated GitHub repository, which has [outline tasks in Issues](https://github.com/NERC-CEH/plankton_ml/issues) and [prototype work in pull requests](https://github.com/NERC-CEH/plankton_ml/pulls).



26 changes: 26 additions & 0 deletions scripts/render_diagrams.sh
@@ -0,0 +1,26 @@
#!/bin/bash
# Copilot generated script to render diagrams as SVG

# Skip the loops entirely when a glob matches nothing
shopt -s nullglob

# Set the directory paths (relative to docs/, where the workflow runs this script)
DIR="./diagrams/"
SITE="_site/"

# Loop through each subdirectory
for sub_dir in "$DIR"*/; do
  # Loop through each dot file in the subdirectory
  for dotfile in "$sub_dir"*.dot; do
    # Get the base name without extension
    base_name=$(basename "$dotfile" .dot)
    # Mirror the source layout under the site directory, e.g. _site/diagrams/as_is/
    dir_path="$SITE${sub_dir#./}"
    mkdir -p "$dir_path"
    output="$dir_path$base_name.svg"

    # Render the dot file to SVG, reporting success only if dot succeeded
    if dot -Tsvg "$dotfile" -o "$output"; then
      echo "Rendered $dotfile to $output"
    fi
  done
done

