Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
README_CN.md	README_CN.md
visionlan_resnet45_LA.yaml	visionlan_resnet45_LA.yaml
visionlan_resnet45_LF_1.yaml	visionlan_resnet45_LF_1.yaml
visionlan_resnet45_LF_2.yaml	visionlan_resnet45_LF_2.yaml

English | 中文

VisionLAN

VisionLAN: From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

Introduction

VisionLAN

Visual Language Modeling Network (VisionLAN) [1] is a text recognion model that learns the visual and linguistic information simultaneously via character-wise occluded feature maps in the training stage. This model does not require an extra language model to extract linguistic information, since the visual and linguistic information can be learned as a union.

Figure 1. The architecture of visionlan [1]

As shown above, the training pipeline of VisionLAN consists of three modules:

The backbone extract visual feature maps from the input image;
The Masked Language-aware Module (MLM) takes the visual feature maps and a randomly selected character index as inputs, and generates position-aware character mask map to create character-wise occluded feature maps;
Finally, the Visual Reasonin Module (VRM) takes occluded feature maps as inputs and makes prediction under the complete word-level supervision.

While in the test stage, MLM is not used. Only the backbone and VRM are used for prediction.

Requirements

mindspore	ascend driver	firmware	cann toolkit/kernel
2.3.1	24.1.RC2	7.3.0.1.231	8.0.RC2.beta1

Quick Start

Installation

Please refer to the installation instruction in MindOCR.

Dataset preparation

Training sets

The authors of VisionLAN used two synthetic text datasets for training: SynthText(800k) and MJSynth. Please follow the instructions of the original VisionLAN repository to download the training sets.

After download SynthText.zip and MJSynth.zip, please unzip and place them under ./datasets/train. The training set contain 14,200,701 samples in total. More details are as follows:

SynText: 25GB, 6,976,115 samples
MJSynth: 21GB, 7,224,586 samples

Validation sets

The authors of VisionLAN used six real text datasets for evaluation: IIIT5K Words (IIIT5K_3000) ICDAR 2013 (IC13_857), Street View Text (SVT), ICDAR 2015 (IC15), Street View Text-Perspective (SVTP), CUTE80 (CUTE). We used the sum of the six benchmarks as validation sets. Please follow the instructions of the original VisionLAN repository to download the validation sets.

After download evaluation.zip, please unzip this zip file, and place them under ./datasets. Under ./datasets/evaluation, there are seven folders:

IIIT5K: 50M, 3000 samples
IC13: 72M, 857 samples
SVT: 2.4M, 647 samples
IC15: 21M, 1811 samples
SVTP: 1.8M, 645 samples
CUTE: 8.8M, 288 samples
Sumof6benchmarks: 155M, 7248 samples

During training, we only used the data under ./datasets/evaluation/Sumof6benchmarks as the validation sets. Users can delete the other folders ./datasets/evaluation optionally.

Test Sets

We choose ten benchmarks as the test sets to evaluate the model's performance. Users can download the test sets from here (ref: deep-text-recognition-benchmark). Only the evaluation.zip is required for testing.

After downloading the evaluation.zip, please unzip it, and rename the folder name from evaluation to test. Please place this folder under ./datasets/.

The test sets contain 12,067 samples in total. The detailed information is as follows:

CUTE80: 8.8 MB, 288 samples
IC03_860: 36 MB, 860 samples
IC03_867: 4.9 MB, 867 samples
IC13_857: 72 MB, 857 samples
IC13_1015: 77 MB, 1015 samples
IC15_1811: 21 MB, 1811 samples
IC15_2077: 25 MB, 2077 samples
IIIT5k_3000: 50 MB, 3000 samples
SVT: 2.4 MB, 647 samples
SVTP: 1.8 MB, 645 samples

In the end of preparation, the file structure should be like:

datasets
├── test
│   ├── CUTE80
│   ├── IC03_860
│   ├── IC03_867
│   ├── IC13_857
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IC15_2077
│   ├── IIIT5k_3000
│   ├── SVT
│   ├── SVTP
├── evaluation
│   ├── Sumof6benchmarks
│   ├── ...
└── train
    ├── MJSynth
    └── SynText

Update yaml config file

If the datasets are placed under ./datasets, there is no need to change the train.dataset.dataset_root in the yaml configuration file configs/rec/visionlan/visionlan_L*.yaml.

Otherwise, change the following fields accordingly:

...
train:
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: train                       <--- Update
...
eval:
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/dataset          <--- Update
    data_dir: evaluation/Sumof6benchmarks <--- Update
...

Optionally, change train.loader.num_workers according to the cores of CPU.

Apart from the dataset setting, please also check the following important args: system.distribute, system.val_while_train, common.batch_size. Explanations of these important args:

system:
  distribute: True                                                    # `True` for distributed training, `False` for standalone training
  amp_level: 'O0'
  seed: 42
  val_while_train: True                                               # Validate while training
common:
  ...
  batch_size: &batch_size 192                                          # Batch size for training
...
  loader:
      shuffle: False
      batch_size: 64                                                  # Batch size for validation/evaluation
...

Notes:

As the global batch size (batch_size x num_devices) is important for reproducing the result, please adjust batch_size accordingly to keep the global batch size unchanged for a different number of NPUs, or adjust the learning rate linearly to a new global batch size.

Training

The training stages include Language-free (LF) and Language-aware (LA) process, and in total three steps for training:

LF_1: train backbone and VRM, without training MLM
LF_2: train MLM and finetune backbone and VRM
LA: using the mask generated by MLM to occlude feature maps, train backbone, MLM, and VRM

We used distributed training for the three steps. For standalone training, please refer to the recognition tutorial.

mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LF_1.yaml
mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LF_2.yaml
mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/visionlan/visionlan_resnet45_LA.yaml

The training result (including checkpoints, per-epoch performance and curves) will be saved in the directory parsed by the arg ckpt_save_dir in yaml config file. The default directory is ./tmp_visionlan.

Test

After all three steps training, change the system.distribute to False in configs/rec/visionlan/visionlan_resnet45_LA.yaml before testing.

To evaluate the model's accuracy, users can choose from two options:

Option 1: Repeat the evaluation step for all individual datasets: CUTE80, IC03_860, IC03_867, IC13_857, IC131015, IC15_1811, IC15_2077, IIIT5k_3000, SVT, SVTP. Then take the average score.

An example of evaluation script fort the CUTE80 dataset is shown below.

model_name="e8"
yaml_file="configs/rec/visionlan/visionlan_resnet45_LA.yaml"
training_step="LA"

python tools/eval.py --config $yaml_file --opt eval.dataset.data_dir=test/CUTE80 eval.ckpt_load_path="./tmp_visionlan/${training_step}/${model_name}.ckpt"

Option 2: Given that all the benchmark datasets folder are under the same directory, e.g. test/. And use the script tools/benchmarking/multi_dataset_eval.py. The example evaluation script is like:

model_name="e8"
yaml_file="configs/rec/visionlan/visionlan_resnet45_LA.yaml"
training_step="LA"

python tools/benchmarking/multi_dataset_eval.py --config $yaml_file --opt eval.dataset.data_dir="test" eval.ckpt_load_path="./tmp_visionlan/${training_step}/${model_name}.ckpt"

Results

Accuracy

According to our experiments, the evaluation results on ten public benchmark datasets is as follow:

model name	backbone	train dataset	params(M)	cards	batch size	jit level	graph compile	ms/step	img/s	accuracy	recipe	weight
visionlan	Resnet45	MJ+ST	42.22	4	128	O2	191.52 s	280.29	1826.63	90.62%	yaml(LF_1) yaml(LF_2) yaml(LA)	ckpt files \| mindir(LA)

Detailed accuracy results for ten benchmark datasets

model name	backbone	cards	IC03_860	IC03_867	IC13_857	IC13_1015	IC15_1811	IC15_2077	IIIT5k_3000	SVT	SVTP	CUTE80	average
visionlan	Resnet45	1	96.16%	95.16%	95.92%	94.19%	84.04%	77.47%	95.53%	92.27%	85.89%	89.58%	90.62%

Notes:

Train datasets: MJ+ST stands for the combination of two synthetic datasets, SynthText(800k) and MJSynth.
To reproduce the result on other contexts, please ensure the global batch size is the same.
The models are trained from scratch without any pre-training. For more dataset details of training and evaluation, please refer to Dataset preparation section.
The input Shape of MindIR of VisionLAN is (1, 3, 64, 256).

MindSpore Lite Inference

To inference with MindSpot Lite on Ascend 310, please refer to the tutorial MindOCR Inference. In short, the whole process consists of the following steps:

Model Export

Please download the exported MindIR file first, or refer to the Model Export tutorial and use the following command to export the trained ckpt model to MindIR file:

# For more parameter usage details, please execute `python tools/export.py -h`
python tools/export.py --model_name_or_config visionlan_resnet45 --data_shape 64 256 --local_ckpt_path /path/to/visionlan-ckpt

The data_shape is the model input shape of height and width for MindIR file. The shape value of MindIR in the download link can be found in Notes under results table.

Environment Installation

Please refer to Environment Installation tutorial to configure the MindSpore Lite inference environment.

Model Conversion

Please refer to Model Conversion, and use the converter_lite tool for offline conversion of the MindIR file.

Inference

Assuming that you obtain output.mindir after model conversion, go to the deploy/py_infer directory, and use the following command for inference:

python infer.py \
    --input_images_dir=/your_path_to/test_images \
    --rec_model_path=your_path_to/output.mindir \
    --rec_model_name_or_config=../../configs/rec/visionlan/visionlan_resnet45_LA.yaml \
    --res_save_dir=results_dir

References

[1] Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang: From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network. ICCV 2021: 14174-14183

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

visionlan

visionlan

README.md

VisionLAN

Introduction

VisionLAN

Requirements

Quick Start

Installation

Dataset preparation

Update yaml config file

Training

Test

Results

Accuracy

MindSpore Lite Inference

References

Files

visionlan

Directory actions

More options

Directory actions

More options

Latest commit

History

visionlan

Folders and files

parent directory

README.md

VisionLAN

Introduction

VisionLAN

Requirements

Quick Start

Installation

Dataset preparation

Update yaml config file

Training

Test

Results

Accuracy

MindSpore Lite Inference

References