Test Result Push Model

Rishi edited this page Nov 7, 2017 · 14 revisions

Transfer2Go Tests - Push Model

Test Architecture

[Test architecture diagram]

Resource requirements and Deployment

  • To perform these tests we used four machines: 1) Main-agent (CERN VM), 2) Source-server (CERN VM), 3) Destination-agent (uibo-cms-02.cr.cnaf.infn.it), 4) Source2 (uibo-cms-03.cr.cnaf.infn.it).

  • uibo-cms-02.cr.cnaf.infn.it and uibo-cms-03.cr.cnaf.infn.it are grid UIs located at the INFN-CNAF Tier-1 centre of WLCG in Bologna (Italy). They run Scientific Linux release 6.9 (Carbon), and each has 36 GB of RAM and 12 AMD cores at 2.6 GHz.

  • Each of the CERN VMs is of the m2.small flavor and has 1.8 GB RAM, 1 VCPU, and 10 GB of disk space.

  • This test requires Go 1.6 or newer. Here is the manual to set up the code: [link]

  • For this test, we used version v00.00.09 of the code.

  • Predefined config files are available under the test/config folder. We used the HTTP protocol to transfer the data, and an sqlite3 database to store the catalogs and requests.

  • Here is the schema of the created database: [link]

  • After the setup, build and run the agents on each of these machines. Start the main agent in the pull-based model, then register the source and destination agents with the main agent.

  • For this test the VMs are behind the CERN firewall, so we need to create SSH tunnels between the destination and source agents and between the destination and main agents.

Network monitoring tool: iftop (version 1.0pre4)

Setup

To perform the tests we need several nodes: the main agent (coordinator), the source agents (which hold our data), and the destination agent (where we'll relocate the data). We define them as follows:

mainhost=main-agent.cern.ch (Ram: 1.8 GB, Type: m2.small, VCPUs: 1)
sourcehost1=source-server.cern.ch (Ram: 1.8 GB, Attached Volume: 15 GB, VCPUs: 1)
sourcehost2=source-2.cern.ch (Ram: 3.7 GB, Type: m2.medium, VCPUs: 2)
sourcehost3=uibo-cms-03.cr.cnaf.infn.it (VCPUs: 12, Ram: 35 GB)
desthost=uibo-cms-02.cr.cnaf.infn.it (VCPUs: 12, Ram: 35 GB)

Visit this page for the detailed setup guide.


Tests

Note: Every test is monitored with the iftop (network bandwidth)
and top (CPU usage/memory) tools. The table data reports the
maximum CPU and memory usage observed while transferring the data.
The CPU time indicates how much time the source agent and the
destination agent need to complete the transfer process; it is the
total CPU time the task has used since it started.

Test 1/2: "TRANSFER MONITORING TEST".

Check that the file arrives at the destination in the same health as the original (via cksum, for example), and monitor a single transfer and its behavior, performance, and CPU load on all involved machines.

sourcehost1
| File Size | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
| --- | --- | --- | --- |
| 1 GB | 10.0 | 0.6 | 00:00.87 |
| 1.3 GB | 11.3 | 0.5 | 00:01.23 |
| 5.2 GB | 13.6 | 0.5 | 00:06.33 |
| 15 GB | 9.0 | 0.6 | 00:14.42 |

*Hun=Hundredth

desthost
| File Size | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
| --- | --- | --- | --- |
| 1 GB | 86.3 | 0.0 | 00:09.54 |
| 1.3 GB | 91.0 | 0.0 | 00:13.14 |
| 5.2 GB | 88.0 | 0.0 | 00:41.75 |
| 15 GB | 70.3 | 0.0 | 01:58.36 |

Note: memory usage is not actually zero; the host has a large amount of RAM,
so the percentage rounds to 0.0 even though our Go program uses a
noticeable absolute amount of memory.

Table Data: case0, case1, case2, case3

Network Rate: Peak rate: 886 Mb/s, Transferred data: 1 GB, Tool: iftop

Observation:

  • Successfully transferred the files between sourcehost1 (VM) and the desthost site, with complete verification using the adler32 hash.

  • Measured CPU and memory usage with the top command.

  • During the transfer, the data passes through SSH tunnels. There are two of them: one between mainhost and desthost, and another between sourcehost1 and desthost.

  • In the tables, CPU Usage (%) and Memory Usage (%) represent the maximum observed values.

  • According to the data above, memory usage is constant, which means our file-copying logic (io.Copy) uses a fixed-size buffer. We could increase the buffer size to reduce CPU time.

  • The CPU time on the destination is comparatively high because we recompute the hash there to verify the received data.

  • It takes around 2-3 minutes of CPU time to transfer a 15 GB file; even larger files can be transferred.


Test 3: "DATASET TEST".

Mimic a dataset (multi-file) transfer: take a bunch of random files (e.g. 10), which could even be replicas of the same file with different names, trigger a transfer of the entire set (still on the same A->B path as in point 1), and have them all moved to the destination. Monitor how the system behaves and whether it meets expectations (e.g. sequential? in parallel? how parallel? which file arrives first? all cksum-OK? how long did it take? any configuration change to be tried out?)

Case 1: (Sequential) Transferring 10 files sequentially. This is the case when the 10 files belong to the same dataset or the same blocks. We transfer all 10 files together in one common request for a particular dataset.

sourcehost1
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
| --- | --- | --- | --- |
| 1 GB | 8.7 | 0.6 | 00:11.07 |

*Hun=Hundredth

desthost
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
| --- | --- | --- | --- |
| 1 GB | 69.1 | 0.0 | 01:35.60 |

Note: as above, memory usage is not actually zero; the percentage rounds
to 0.0 due to the host's large RAM.

Table Data: link

Network Rate: Peak rate: 812 Mb/s, Transferred data: 10 GB, Tool: iftop

Observation:

  • Successfully transferred 10 files between sourcehost1 and desthost, where each file belongs to the same dataset and block.
  • The files were transferred sequentially because we registered only one transfer request for the whole dataset. We can also transfer the same files by registering 10 separate transfer requests, one per file (Case 2).
  • In this approach we do not fully utilize the bandwidth, because the 10 files are transferred one by one.
  • Assume our maximum transfer rate is about 850 Mb/s. When transferring chunk-wise, a single HTTP request sometimes splits the data into more than two chunks, so it can use at most about 350 Mb/s (1 GB / 3 chunks). See these screenshots (link), which help explain why a single HTTP request does not use the entire bandwidth.
  • The behavior of this transfer process was quite regular: the transfer rate rose, fell suddenly, then jumped up again.

Case 2: (Parallel) Transferring the same 10 files in parallel. This is the case when we register a separate transfer request for each file and then transfer all of them in parallel.

sourcehost1
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
| --- | --- | --- | --- |
| 1 GB | 67.0 | 0.6 | 00:10.1 |

*Hun=Hundredth

desthost
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
| --- | --- | --- | --- |
| 1 GB | 84.6 | 0.0 | 02:17.06 |

Note: as above, memory usage is not actually zero; the percentage rounds
to 0.0 due to the host's large RAM.

Table Data: link

Network Rate: Peak rate: 820 Mb/s, Transferred data: 10 GB, Tool: iftop

Observation:

  • In this case, bandwidth utilization depends entirely on how the Go runtime schedules its goroutines.
  • The behavior was quite random: sometimes the transfer rate jumped suddenly, while at other times it dropped to its lower limit.

Test 4: "INCOMPLETE DATASET TEST".

Redo test 3 after renaming, e.g., one file on site A so that it cannot be found. Is the system able to tell you that 9 files were transferred and the 10th is impossible?

sourcehost1
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
| --- | --- | --- | --- |
| 1 GB | 7.0 | 0.6 | |

*Hun=Hundredth

desthost
| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
| --- | --- | --- | --- |
| 1 GB | 75.0 | 0.0 | |

Note: as above, memory usage is not actually zero; the percentage rounds
to 0.0 due to the host's large RAM.

Table Data: link

Network Rate: Transferred data: 8.1 GB, Tool: iftop

Observation:

  • Renamed two files and then tried to transfer the 10 files of the same dataset.
  • Our agent successfully transfers the remaining eight files and shows errors for the two missing ones.
  • Check the errors thrown by the source agent here: Link
  • We need to add a method or field to notify the client about untransferred files; currently our agent reports the transfer request as finished instead of flagging the failed files.

Test 5: "MULTI-SOURCE TEST".

Transfer files from multiple sites, e.g. 7 files on site A and 3 files on site B, and try to do the same test as point 3 (i.e. complete dataset) and move all 10 files to site C. Is the system able to understand, if 10 files are needed in C, where they are and where to pick them from? (perhaps not yet in the code?)

sourcehost1
| File Size (7 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
| --- | --- | --- | --- |
| 7.7 GB | 9.3 | 0.7 | 7:40.00 |

*Hun=Hundredth

sourcehost2
| File Size (3 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
| --- | --- | --- | --- |
| 3.3 GB | 21.6 | 0.0 | 5:00.00 (approx) |

Note: as above, memory usage is not actually zero; the percentage rounds
to 0.0 due to the host's large RAM.

Table Data: link

Network Rate: Peak rate: 828 Mb/s, Transferred data: 11 GB, Tool: iftop

Observation:

  • It works based on this scheduling algorithm: [link]
  • First it fetches all the files for the given dataset, takes the union of these files, and filters the agents that hold the dataset (partially or fully).
  • Then it sorts the filtered agents by the prediction values of the ML model and serves requests from the highest prediction value to the lowest. It checks whether each source agent is up; if it is, it submits that request to the destination, which performs the pull transfer.
  • The router successfully transfers files from two sources to one destination. In this case there were two sources: the uibo-cms-03.cr.cnaf.infn.it site (3 files) and one VM (7 files).

Test 6: "Site FAILURE TEST".

Same as 5, i.e. A+B transferring to C, but switch A off (or rename all files there). How does the system behave?

Observation:

  • It only transfers the files that are available; hence, in this case it transfers only the 3 files located on agent B.

Test 7: Transfer throughput of a single node.

Test how many simultaneous transfers a single node can sustain and how its throughput drops as new transfers are added. Create N files of equal size and start transferring them one at a time, i.e. start one, then another, then another. Each time, measure how the throughput changes when a new transfer is added to the pool.

Conclusion:

  • One thing we can conclude: if two requests share resources, their transfer rates eventually drop and become extremely low (around 20 Mb/s). In the beginning, when transferring two copies of the same file (a shared resource), the rate was around 20 Mb/s. While those two requests were being processed, we approved a request of a different kind, and the transfer rate suddenly increased, reaching up to 450 Mb/s.
  • With each new transfer request, the node's throughput sometimes increases and sometimes decreases.
  • Without shared resources, the node sends data at an average rate of 400-500 Mb/s.

Source: source-server.cern.ch (VM) Ram: 1.8 GB Type: m2.small Attached Volume: 15 GB

Destination: uibo-cms-03.cr.cnaf.infn.it

File size : 1 GB

Sequential Transfer: adding requests one by one.

| File Number | CPU Usage (%) | RAM Usage (%) | Throughput |
| --- | --- | --- | --- |
| Initially | 0.7 | 0.7 | 0 |
| 1 | 7.3 | 0.7 | 0 |
| 2 | 9.3 | 0.7 | 300 Mb/s |
| 3 | 7.0 | 0.7 | 535 Mb/s |
| 4 | 8.6 | 0.7 | 620 Mb/s |
| 5 | 7.6 | 0.7 | 551 Mb/s |
| 6 | 6.0 | 0.7 | 624 Mb/s |
| 7 | 7.3 | 0.8 | 605 Mb/s |
| 8 | 7.6 | 0.8 | 610 Mb/s |
| 9 | 7.3 | 0.8 | 627 Mb/s |
| 10 | 6.6 | 0.8 | 676 Mb/s |
| End | 0.3 | 0.7 | decreased to 0 |

Parallel Transfer: submitting 1, 2, …, 10 transfer requests together. Here peak throughput refers to the throughput of the node, i.e. all files in transfer combined.

| Total Files (N) | Max CPU Usage (%) | Max RAM Usage (%) | Peak Throughput | Errors |
| --- | --- | --- | --- | --- |
| 1 | 6.3 | 0.5 | 750 Mb/s | |
| 2 | 4.3 | 0.6 | 485 Mb/s | |
| 3 | 7.6 | 0.7 | 701 Mb/s | |
| 4 | 8.0 | 0.7 | 661 Mb/s | |
| 7 | 11.0 | 0.7 | 911 Mb/s | On main-agent: database is locked |
| 10 | 10.6 | 0.8 | 907 Mb/s | On main-agent: database is locked |

Throughput Calculation

Source:
Host: source-server.cern.ch
RAM: 1.8 GB
CPU: 1
Type: m2.small
Attached Volume: 15 GB

Destination:
Host: uibo-cms-03.cr.cnaf.infn.it
RAM: 36 GB
Cores: 12

Data: 10 files of same dataset. Size of each file is 1GB.

Code Version: v01.00.05

Parallel Transfer

Note: the throughput below was measured in parallel mode.
The overall throughput is greater than the throughput of any
single file. In the pull model, the data passes through an SSH
tunnel; one SSH tunnel is required while pulling data from the
source agent. In the push model, the source agent pushes the
data directly to the destination agent, i.e. without using an
SSH tunnel.
| File Name | Start Time (sec) | End Time (sec) | Total Time (sec) | Throughput (MB/s) | Throughput (Mb/s) |
| --- | --- | --- | --- | --- | --- |
| file1.root | 1509217078 | 1509217842 | 764 | 1.28 | 10.24 |
| file2.root | 1509217078 | 1509217891 | 813 | 1.26 | 10.08 |
| file3.root | 1509217078 | 1509217645 | 567 | 1.81 | 14.48 |
| file4.root | 1509217078 | 1509217609 | 531 | 1.93 | 15.44 |
| file5.root | 1509217078 | 1509217785 | 707 | 1.45 | 11.6 |
| file6.root | 1509217078 | 1509217766 | 688 | 1.49 | 11.92 |
| file7.root | 1509217078 | 1509217722 | 644 | 1.59 | 12.72 |
| file8.root | 1509217078 | 1509217878 | 800 | 1.26 | 10.08 |
| file9.root | 1509217078 | 1509217785 | 707 | 1.34 | 10.72 |
| file10.root | 1509217078 | 1509217757 | 679 | 1.51 | 12.08 |

Sum of throughput (Mb/s): 119.36

Sequential Transfer

Note: we transferred files of different sizes one by one. In this
case, the throughput of each transfer is equivalent to the overall
system's throughput.

| File Size | Start Time (sec) | End Time (sec) | Total Time (sec) | Throughput (MB/s) | Throughput (Mb/s) |
| --- | --- | --- | --- | --- | --- |
| 1 GB | 1509271362 | 1509271477 | 115 | 8.95 | 71.6 |
| 3 GB | 1509270793 | 1509271150 | 357 | 8.64 | 69.12 |