In this project we define a generic workflow description specification, FlowSpec.
Based on FlowSpec, we developed a migration tool, MigrationX, that migrates workflow models from different workflow scheduling systems to the DataWorks workflow model.
MigrationX can also serve as a base for developing conversion tools for the workflow models of other scheduling systems.
CycleWorkflow is the specification of a scheduled workflow that contains task nodes and their dependencies.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| version | string | Yes | version info |
| kind | string | Yes | `CycleWorkflow` |
| metadata | Metadata | No | extra metadata of the workflow |
| spec | CycleWorkflowSpec | Yes | specific definition of the CycleWorkflow |
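As a minimal sketch of the top-level structure, the following builds a CycleWorkflow document as a Python dictionary and checks the fields the table marks as required. The version string, the metadata values, and the helper name `missing_top_level_fields` are illustrative assumptions, not part of the specification.

```python
import json

# Required top-level fields, taken from the CycleWorkflow table.
REQUIRED_TOP_LEVEL = ("version", "kind", "spec")


def missing_top_level_fields(doc):
    """Return the required top-level fields that are absent from doc."""
    return [name for name in REQUIRED_TOP_LEVEL if name not in doc]


# Hypothetical document: the version string and metadata are placeholders.
workflow = {
    "version": "1.0.0",
    "kind": "CycleWorkflow",
    "metadata": {"owner": "alice", "description": "demo workflow"},
    "spec": {"nodes": []},
}

print(json.dumps(workflow, indent=2))
print(missing_top_level_fields(workflow))                   # []
print(missing_top_level_fields({"kind": "CycleWorkflow"}))  # ['version', 'spec']
```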
ManualWorkflow is the specification of a manually triggered workflow that consists of task nodes and their dependencies.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| version | string | Yes | version info |
| kind | string | Yes | `ManualWorkflow` |
| metadata | Metadata | No | extra metadata of the workflow |
| spec | ManualWorkflowSpec | Yes | specific definition of the ManualWorkflow |
CycleWorkflowSpec is the specification of a CycleWorkflow.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| nodes | `Array<Node>` | Yes | list of cycle node definitions |
| variables | `Array<Variable>` | No | list of variable definitions |
| scripts | `Array<Script>` | No | list of script definitions |
| triggers | `Array<Trigger>` | No | list of cycle trigger definitions |
| artifacts | `Array<Artifact>` | No | list of artifact definitions |
| runtimeResources | `Array<RuntimeResource>` | No | list of runtime resource definitions |
| fileResources | `Array<FileResource>` | No | list of file resource definitions |
| functions | `Array<Function>` | No | list of function definitions |
| flow | `Array<Flow>` | No | list of flows that define the dependency relationships between nodes |
ManualWorkflowSpec is the specification of a ManualWorkflow.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| nodes | `Array<Node>` | Yes | list of manual node definitions |
| variables | `Array<Variable>` | No | list of variable definitions |
| scripts | `Array<Script>` | No | list of script definitions |
| artifacts | `Array<Artifact>` | No | list of artifact definitions |
| runtimeResources | `Array<RuntimeResource>` | No | list of runtime resource definitions |
| fileResources | `Array<FileResource>` | No | list of file resource definitions |
| functions | `Array<Function>` | No | list of function definitions |
| flow | `Array<Flow>` | No | list of flows that define the dependency relationships between nodes |
Metadata defines extra metadata of a resource.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| owner | string | No | owner of the spec |
| description | string | No | description of the spec |
Node is the definition of a workflow node.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier of the node in the spec |
| name | string | Yes | node name |
| script | Script | Yes | script of the node, given as a definition or as a reference |
| functions | `Array<Function>` | No | functions used by the node, given as definitions or as references |
| fileResources | `Array<FileResource>` | No | file resources used by the node, given as definitions or as references |
| inputs | Inputs | No | inputs of the node; TableArtifact, NodeArtifact, and Variable can be used as inputs of a Node |
| outputs | Outputs | No | outputs of the node; TableArtifact, NodeArtifact, and Variable can be used as outputs of a Node |
| runtimeResource | RuntimeResource | No | runtime resource of the node, given as a definition or as a reference |
| recurrence | string | No | cycle schedule state of the node, see enumerated values of Recurrence |
| priority | integer | No | priority of the node; the larger the value, the higher the priority |
| timeout | integer | No | timeout of the node in seconds; a node instance is killed once it has run longer than this |
| instanceMode | string | No | instance mode of the node, see enumerated values of InstanceMode |
| rerunMode | string | No | rerun strategy of the node instance, see enumerated values of RerunMode |
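To illustrate the split between required and optional Node fields, here is a small sketch using a Python dataclass. Only a few of the optional fields are modeled, and the script contents (including the `command` value) are placeholders chosen for illustration, not values defined by the spec.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Node:
    # Required fields according to the Node table.
    id: str
    name: str
    script: dict
    # A subset of the optional fields; None means "not set".
    priority: Optional[int] = None
    timeout: Optional[int] = None   # seconds
    rerunMode: Optional[str] = None

    def to_spec(self):
        """Serialize to a dict, leaving out optional fields that are not set."""
        return {key: value for key, value in asdict(self).items() if value is not None}


# Hypothetical node; the script and runtime values are placeholders.
node = Node(
    id="n1",
    name="demo_shell_1",
    script={"id": "s1", "path": "demo_shell_1.sh", "runtime": {"command": "SHELL"}},
    timeout=3600,
)
print(node.to_spec())
```

Omitting unset optional fields keeps the serialized spec close to what a hand-written file would contain.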
The flow section of the spec defines the dependencies between workflow nodes.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| nodeId | string | Yes | identifier of a node defined in the spec |
| depends | `Array<FlowDepend>` | Yes | list of nodes that this node depends on |
FlowDepend defines a dependency relationship between workflow nodes.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| nodeId | string | Yes | identifier of a node defined in the spec |
| type | string | Yes | dependency type, see enumerated values of DependencyType |
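The flow entries form a dependency graph. As a rough sketch, the following derives a possible execution order within one cycle round from a list of Flow entries. It assumes that only `Normal` dependencies constrain the order inside a round, since the cross-cycle types refer to instances of the previous round; the node ids and the function name `same_cycle_order` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical flow section with three nodes.
flow = [
    {"nodeId": "b", "depends": [{"nodeId": "a", "type": "Normal"}]},
    {
        "nodeId": "c",
        "depends": [
            {"nodeId": "b", "type": "Normal"},
            {"nodeId": "c", "type": "CrossCycleDependsOnSelf"},
        ],
    },
]


def same_cycle_order(flow_entries):
    """Order nodes so that every Normal dependency runs before its dependent."""
    graph = {}
    for entry in flow_entries:
        upstream = {d["nodeId"] for d in entry["depends"] if d["type"] == "Normal"}
        graph[entry["nodeId"]] = upstream
    return list(TopologicalSorter(graph).static_order())


print(same_cycle_order(flow))  # ['a', 'b', 'c']
```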
Variable defines a variable of the workflow. Variables can be used in workflow nodes.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| name | string | Yes | variable name |
| scope | string | Yes | variable scope, see enumerated values of VariableScope |
| type | string | Yes | variable type, see enumerated values of VariableType |
| value | string | Yes | variable value expression |
Script defines a script source file resource. Scripts can be used in a workflow by nodes, functions, or resources.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| path | string | Yes | script path |
| language | string | No | script language |
| runtime | Runtime | Yes | runtime definition of the script |
| parameters | `Array<Variable>` | No | list of parameter definitions used by the script |
Trigger defines the rules that determine when scheduled nodes are fired.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| type | string | Yes | trigger type, values: `Scheduler`, `Manual` |
| cron | string | No | cron expression of a Scheduler trigger |
| startTime | string | No | start of the effective time range of a Scheduler trigger; nodes are instantiated only within the range from startTime to endTime |
| endTime | string | No | end of the effective time range of a Scheduler trigger |
| timezone | string | No | timezone of a Scheduler trigger |
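The effect of startTime and endTime can be sketched as a range check. The spec excerpt does not state the time format or whether the bounds are inclusive, so the format `YYYY-MM-DD HH:MM:SS` and the inclusive comparison below are assumptions, as is the function name `in_effective_range`.

```python
from datetime import datetime

# Assumed time format; the spec excerpt does not define it.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def in_effective_range(trigger, moment):
    """Return True if moment lies within the trigger's startTime..endTime range.

    A missing startTime or endTime is treated as an open bound.
    """
    start = trigger.get("startTime")
    end = trigger.get("endTime")
    if start is not None and moment < datetime.strptime(start, TIME_FORMAT):
        return False
    if end is not None and moment > datetime.strptime(end, TIME_FORMAT):
        return False
    return True


# Hypothetical Scheduler trigger.
trigger = {
    "id": "t1",
    "type": "Scheduler",
    "startTime": "2024-01-01 00:00:00",
    "endTime": "2024-12-31 23:59:59",
}
print(in_effective_range(trigger, datetime(2024, 6, 1)))  # True
print(in_effective_range(trigger, datetime(2025, 1, 1)))  # False
```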
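Artifacts can be of types such as NodeOutput, Table, and Variable. A Variable artifact can be a context variable produced by workflow nodes.

A Table artifact has the following fields:

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| guid | string | Yes | guid of the table artifact |

A NodeOutput artifact has the following fields:

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| output | string | Yes | output identifier string of the node |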
RuntimeResource defines the configuration of the runtime resources used to run workflow nodes, such as a resource group or a YARN cluster.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| resourceGroup | string | Yes | global identifier of the resource group |
FileResource defines a resource file used by workflow nodes, such as a jar, a Python file, a text file, or an archive.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| name | string | Yes | resource file name |
| script | Script | Yes | script definition of the resource file |
Function is the definition of a user-defined function (UDF) used by workflow nodes.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| id | string | Yes | local identifier in the spec |
| name | string | Yes | name of the UDF |
| script | Script | Yes | script file of the UDF |
| fileResources | `Array<FileResource>` | No | list of related file resources |
Runtime defines the runtime environment of a script, such as the command, the runtime engine, or the image.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| engine | string | No | runtime engine |
| command | string | No | command identifier of the script's runtime execution environment |
Outputs hold the parameters, artifacts, and results produced by a workflow node. Outputs can be consumed by another workflow node.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| tables | `Array<Table>` | No | list of artifact tables produced by the node |
| variables | `Array<Variable>` | No | list of Variable items produced by the node |
| nodeOutputs | `Array<NodeOutput>` | No | list of pre-defined node output identifier strings produced by the node |
Inputs are the mechanism for passing parameters, artifacts, and volumes from one workflow node to another.

| Field Name | Field Type | Required | Description |
|---|---|---|---|
| tables | `Array<Artifact>` | No | list of artifact tables passed as inputs |
| variables | `Array<Variable>` | No | list of Variable items passed as inputs |
| nodeOutputs | `Array<Artifact>` | No | list of pre-defined node output identifier strings passed as inputs |
Enumerated values of VariableScope:

| Enum Name | Description |
|---|---|
| NodeParameter | the variable is available in one specific node |
| NodeContext | the variable is available in downstream nodes that depend on the current node |
| Workflow | the variable is available in all nodes of the current workflow |
| Workspace | the variable is available in all nodes of the current workspace |
| Tenant | the variable is available in all nodes of all workspaces of the current tenant |
Enumerated values of VariableType:

| Enum Name | Description |
|---|---|
| System | the variable is a system variable, such as `$yyyymmdd` |
| Constant | the variable is a constant value |
Enumerated values of DependencyType:

| Enum Name | Description |
|---|---|
| Normal | the node instance of the current cycle depends on the instance of the specified node in the same cycle round |
| CrossCycleDependsOnSelf | the current cycle instance of the node depends on the instance of the node itself in the previous cycle round |
| CrossCycleDependsOnChildren | the current cycle instance of the node depends on the instances of its child nodes in the previous cycle round |
| CrossCycleDependsOnOtherNode | the current cycle instance of the node depends on the instance of another specified node in the previous cycle round |
Enumerated values of Recurrence:

| Enum Name | Description |
|---|---|
| Normal | the node is instantiated according to the defined cycle and the code of each instance is executed |
| Skip | the node is instantiated according to the defined cycle, but each instance is set to success without executing any code |
| Pause | the node is instantiated according to the defined cycle, but each instance is set to failure without executing any code |
Enumerated values of RerunMode:

| Enum Name | Description |
|---|---|
| Allowed | the node instance can be rerun without any precondition |
| Denied | the node instance cannot be rerun under any condition |
| FailureAllowed | the node instance can be rerun only when it is in the failure state |
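The three rerun strategies reduce to a small decision function. The sketch below is only an illustration of the table; the function name `can_rerun` and the boolean `instance_failed` flag are hypothetical and do not correspond to a known MigrationX or DataWorks API.

```python
def can_rerun(rerun_mode, instance_failed):
    """Decide whether a node instance may be rerun under the given RerunMode."""
    if rerun_mode == "Allowed":
        return True
    if rerun_mode == "Denied":
        return False
    if rerun_mode == "FailureAllowed":
        # Rerun is permitted only for instances in the failure state.
        return instance_failed
    raise ValueError(f"unknown rerunMode: {rerun_mode!r}")


print(can_rerun("Allowed", instance_failed=False))         # True
print(can_rerun("FailureAllowed", instance_failed=False))  # False
print(can_rerun("FailureAllowed", instance_failed=True))   # True
```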
Enumerated values of InstanceMode:

| Enum Name | Description |
|---|---|
| T+1 | a modification of the node takes effect on the next day |
| Immediately | a modification of the node takes effect immediately |
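Because `T+1` is not a valid identifier in most languages, code that models this enumeration usually needs a separate member name. A minimal Python sketch follows; the member names and the helper `effective_date` are assumptions made for illustration.

```python
from datetime import date, timedelta
from enum import Enum


class InstanceMode(Enum):
    # The values match the spec strings; the member names are our own choice.
    T_PLUS_1 = "T+1"
    IMMEDIATELY = "Immediately"


def effective_date(mode, modified_on):
    """Return the day on which a node modification takes effect."""
    if mode is InstanceMode.T_PLUS_1:
        return modified_on + timedelta(days=1)
    return modified_on


mode = InstanceMode("T+1")  # look up a member by its spec string
print(effective_date(mode, date(2024, 1, 31)))                      # 2024-02-01
print(effective_date(InstanceMode.IMMEDIATELY, date(2024, 1, 31)))  # 2024-01-31
```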
FlowSpec describes a workflow as a JSON file that contains a list of nodes.
example without id variables references
example with id variables references
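The two example files named above are not reproduced here. As a rough sketch of the difference between the two styles, the following resolves id references against top-level definitions. It assumes that a reference is written as an object carrying only an `id`; the actual FlowSpec reference syntax may differ, and all names and values below are hypothetical.

```python
def resolve(item, definitions):
    """Return the full definition for an id-only reference, or item unchanged."""
    if set(item) == {"id"}:
        return definitions[item["id"]]
    return item


# Top-level definitions, indexed by their local id (placeholder values).
variables = {
    "v1": {"id": "v1", "name": "bizdate", "scope": "NodeParameter",
           "type": "System", "value": "$yyyymmdd"},
}

# Style 1: the variable is defined inline where it is used.
inline_param = {"id": "v2", "name": "region", "scope": "NodeParameter",
                "type": "Constant", "value": "cn"}
# Style 2: the variable is referenced by id (assumed reference shape).
reference_param = {"id": "v1"}

print(resolve(inline_param, variables)["name"])     # region
print(resolve(reference_param, variables)["name"])  # bizdate
```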
DataWorks migration assistant spec package demo
The directory structure of the spec package mirrors the Business Flow directory tree shown in DataWorks DataStudio.
`*.sql`, `*.sh`, and `*.hql` files are user script source files.
A `*.flow` file is the spec file that corresponds to a user script source file.
➜ project_c_dw tree
.
└── Business Flow
    ├── project_c_dag_3zq3ei4d6
    │   └── ClickHouse
    │       └── Data Analytics
    │           ├── clickhouse_sql_1.flow
    │           └── clickhouse_sql_1.sql
    └── project_c_demo_workflow_1
        ├── EMR
        │   └── Data Analytics
        │       ├── demo_hive_sql_1.flow
        │       ├── demo_hive_sql_1.hql
        │       ├── demo_pg_sql_1.flow
        │       ├── demo_pg_sql_1.hql
        │       ├── demo_shell_1.flow
        │       ├── demo_shell_1.sh
        │       ├── demo_sql_1.flow
        │       └── demo_sql_1.hql
        └── General
            ├── dep_ck_1
            └── dep_ck_1.flow
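In the tree above, each `.flow` file sits next to a script file with the same base name. The following sketch pairs them from a list of relative paths. The pairing rule (same directory, same base name) is inferred from the listing rather than stated by the spec, and `pair_specs` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def pair_specs(paths):
    """Map each .flow file to the script file with the same directory and base name."""
    files = [PurePosixPath(p) for p in paths]
    scripts = {(f.parent, f.stem): f for f in files if f.suffix != ".flow"}
    pairs = {}
    for f in files:
        if f.suffix == ".flow" and (f.parent, f.stem) in scripts:
            pairs[str(f)] = str(scripts[(f.parent, f.stem)])
    return pairs


base = "Business Flow/project_c_demo_workflow_1"
paths = [
    f"{base}/EMR/Data Analytics/demo_shell_1.flow",
    f"{base}/EMR/Data Analytics/demo_shell_1.sh",
    f"{base}/General/dep_ck_1",
    f"{base}/General/dep_ck_1.flow",
]
for flow_file, script_file in pair_specs(paths).items():
    print(flow_file, "->", script_file)
```

Note that `dep_ck_1` has no file extension, so the lookup keys on the base name rather than on a fixed set of suffixes.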
MigrationX is a workflow model transformation tool based on FlowSpec.
MigrationX
One-click command to migrate DolphinScheduler workflows to DataWorks DataStudio
Command-line tool to export DolphinScheduler workflows
Conversion of DolphinScheduler workflows to DataWorks FlowSpec
Import of FlowSpec packages into DataWorks DataStudio through the DataWorks OpenAPI
Domain model definitions for different workflow engines, containing the domain entities and the corresponding operation services
Export reader implementations for specific workflow engines
Transformation logic implementations between specific workflow engines
Import writer implementations for specific workflow engines
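The reader, transformer, and writer roles described above suggest a three-stage pipeline. The sketch below shows that shape with plain Python callables. It is not the MigrationX API: the function `migrate` and the stub stages are hypothetical and only illustrate how the stages compose.

```python
def migrate(read, transform, write):
    """Run the export, conversion, and import stages in sequence."""
    source_model = read()               # reader: export from the source engine
    flowspec = transform(source_model)  # transformer: convert to FlowSpec
    return write(flowspec)              # writer: import into the target engine


# Stub stages standing in for real reader, transformer, and writer implementations.
def read_stub():
    return {"tasks": [{"name": "demo_shell_1"}]}


def transform_stub(model):
    nodes = [{"id": f"n{i}", "name": task["name"]}
             for i, task in enumerate(model["tasks"], start=1)]
    return {"kind": "CycleWorkflow", "spec": {"nodes": nodes}}


def write_stub(flowspec):
    return len(flowspec["spec"]["nodes"])  # number of nodes "imported"


print(migrate(read_stub, transform_stub, write_stub))  # 1
```

Keeping the stages independent means a new source scheduler only needs its own reader and transformer, while the writer can be reused.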
Usage
migrationx-common: common module
migrationx-domain: domain models of specific workflow engines
migrationx-reader: export reader implementations for specific workflow engines
migrationx-transformer: transformer implementations for specific workflow engines
migrationx-writer: import writer implementations for specific workflow engines
Development guide
Alibaba Cloud-DataWorks-Develop & Modeling & Analytics Team