Test Result Push Model
- To perform this test we used four machines: 1) main-agent (CERN VM), 2) source-server (CERN VM), 3) destination-agent (uibo-cms-02.cr.cnaf.infn.it), and 4) source2 (uibo-cms-03.cr.cnaf.infn.it).
- uibo-cms-02.cr.cnaf.infn.it and uibo-cms-03.cr.cnaf.infn.it are grid UIs located at the INFN-CNAF Tier-1 centre of WLCG in Bologna (Italy). They run Scientific Linux release 6.9 (Carbon), and each has 36 GB of RAM and 12 cores (AMD, 2.6 GHz).
- Each of the CERN VMs is of the m2.small flavor, with 1.8 GB RAM, 1 VCPU, and 10 GB of disk space.
- This test requires Go 1.6 or more recent. Here is the manual to set up the code. [link]
- For this test we used version v00.00.09 of the code.
- Predefined config files are available under the test/config folder. We used the HTTP protocol to transfer the data and an SQLite3 database to store the catalogs and requests.
- Here is the schema of the created database. [link]
- After the setup, build and run the agents on each of these machines. Start the main agent in pull-based mode, then register the source and destination agents with the main agent.
- For this test the VMs are behind the CERN firewall, so we need to create SSH tunnels: one between the destination and the source agent, and one between the destination and the main agent.
- We used the iftop tool to monitor bandwidth, latency, and throughput.
- How to read the output of the iftop tool: https://linoxide.com/monitoring-2/iftop-network-traffic/
To perform the tests we use the following nodes: the main agent (the coordinator), the source agents (which keep our data), and the destination agent (to which we relocate the data). We define them as follows:
- mainhost = main-agent.cern.ch (RAM: 1.8 GB, Type: m2.small, VCPUs: 1)
- sourcehost1 = source-server.cern.ch (RAM: 1.8 GB, Attached Volume: 15 GB, VCPUs: 1)
- sourcehost2 = source-2.cern.ch (RAM: 3.7 GB, Type: m2.medium, VCPUs: 2)
- sourcehost3 = uibo-cms-03.cr.cnaf.infn.it (VCPUs: 12, RAM: 35 GB)
- desthost = uibo-cms-02.cr.cnaf.infn.it (VCPUs: 12, RAM: 35 GB)
Visit this page for the detailed setup guide.
Note: Every test is monitored with the iftop (network bandwidth) and top (CPU/memory usage) tools. The tables report the maximum CPU and memory usage observed while transferring the data. The CPU time is the total CPU time the task has used since it started, i.e. how much processor time the source and destination agents need to complete the transfer.
Check that a file arrives at the destination in the same health as the original (via cksum, for example), and monitor a single transfer and its behavior: performance and CPU load on all involved machines.
sourcehost1:

| File Size | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
|-----------|---------------|------------------|-------------------------|
| 1 GB      | 10.0          | 0.6              | 00:00.87                |
| 1.3 GB    | 11.3          | 0.5              | 00:01.23                |
| 5.2 GB    | 13.6          | 0.5              | 00:06.33                |
| 15 GB     | 9.0           | 0.6              | 00:14.42                |

\*Hun = hundredths of a second
desthost:

| File Size | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
|-----------|---------------|------------------|------------------------|
| 1 GB      | 86.3          | 0.0              | 00:09.54               |
| 1.3 GB    | 91.0          | 0.0              | 00:13.14               |
| 5.2 GB    | 88.0          | 0.0              | 00:41.75               |
| 15 GB     | 70.3          | 0.0              | 01:58.36               |
Note: the memory usage here is not actually zero; the machine has a large amount of RAM, so the percentage rounds to 0.0 even though our Go program uses a significant amount of memory.
Table Data: case0, case1, case2, case3
Network Rate: Peak rate: 886 Mb/s, Transferred data: 1 GB, Tool: iftop
Observation:
- Successfully transferred the files between sourcehost1 (VM) and the desthost site, with complete verification using the adler32 hash.
- Measured CPU and memory usage with the top command.
- During the transfer the data passes through SSH tunnels. There are two of them: one between mainhost and desthost, and another between sourcehost1 and desthost.
- In the tables, CPU Usage (%) and Memory Usage (%) are the maximum observed values.
- According to the above data, memory usage is constant, which means our file-copying logic (io.Copy) uses a fixed-size buffer. We could increase the buffer size to reduce the CPU time; see the sketch after these observations.
- The CPU time on the destination is usually high because we calculate the hash again to verify the received data.
- It takes around 2 minutes of CPU time to transfer the 15 GB file. Even larger files can be transferred.
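As a concrete illustration of the last two observations, here is a minimal sketch (not the project's actual transfer code; the file names are placeholders) of copying with a larger buffer via io.CopyBuffer while computing the adler32 checksum on the fly with io.TeeReader:

```go
package main

import (
	"fmt"
	"hash/adler32"
	"io"
	"log"
	"os"
)

func main() {
	src, err := os.Open("input.root") // placeholder source file
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dst, err := os.Create("output.root") // placeholder destination file
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()

	// io.Copy allocates a fixed 32 KB buffer internally; io.CopyBuffer
	// lets us supply a larger one, reducing the number of read/write
	// calls and therefore the CPU time.
	buf := make([]byte, 4<<20) // 4 MB buffer

	// TeeReader feeds every copied byte into the adler32 hash, so the
	// checksum is computed during the transfer instead of re-reading
	// the whole file afterwards.
	h := adler32.New()
	n, err := io.CopyBuffer(dst, io.TeeReader(src, h), buf)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("copied %d bytes, adler32=%08x\n", n, h.Sum32())
}
```

Computing the checksum while writing, as above, would also cut the separate verification pass that inflates the destination's CPU time.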
Mimic a dataset (multi-file) transfer: take a bunch of random files (e.g. 10), which could even be replicas of the same file with different names, trigger a transfer of the entire set (still on the same A->B path as in point 1), and have them all moved to the destination. Monitor how the system behaves and whether it is as expected (e.g. sequential or parallel? how parallel? who arrives first? all cksum-ok? how long did it take? any configuration change to be tried out?).
Case 1 (Sequential): Transferring 10 files sequentially. This is the case when the 10 files belong to the same dataset or the same blocks: we transfer all 10 files together in one common request for a particular dataset.
sourcehost1:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
|----------------------|---------------|------------------|-------------------------|
| 1 GB each            | 8.7           | 0.6              | 00:11.07                |

\*Hun = hundredths of a second
desthost:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
|----------------------|---------------|------------------|------------------------|
| 1 GB each            | 69.1          | 0.0              | 01:35.60               |
Note: the memory usage here is not actually zero; the machine has a large amount of RAM, so the percentage rounds to 0.0 even though our Go program uses a significant amount of memory.
Table Data: link
Network Rate: Peak rate: 812 Mb/s, Transferred data: 10 GB, Tool: iftop
Observation:
- Successfully transferred 10 files between sourcehost1 and desthost, where each file belongs to the same dataset and block.
- The process transferred the files sequentially because we registered only one transfer request for the whole dataset. We can also transfer the same files by registering 10 separate transfer requests, one per file (case 2).
- In this approach we do not fully utilize our bandwidth, because we transfer the 10 files one by one.
- Let's assume our maximum transfer rate is 850 Mb/s. When transferring a file chunk-wise, a single HTTP request sometimes divides the data into more than two chunks; each chunk then gets at most around 350 Mb/s (a 1 GB file split into 3 chunks). Take a look at these screenshots (link), which help explain why the entire bandwidth is not used for each HTTP request; see also the sketch after these observations.
- The behavior of this transfer was quite regular: the transfer rate first rises, then falls suddenly, then jumps up again.
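The chunking effect is easy to picture with a sketch: when one logical transfer is served as several concurrent byte-range requests, the streams share the same link, so each one sees only a fraction of the peak rate. The endpoint below is hypothetical, and we assume the serving agent honors HTTP Range headers.

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"sync"
)

// fetchChunk downloads one byte range of the file. Chunks in flight at
// the same time compete for the link, so each individual stream runs at
// roughly the peak rate divided by the number of concurrent chunks.
func fetchChunk(url string, start, end int64, wg *sync.WaitGroup) {
	defer wg.Done()
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		log.Println(err)
		return
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", start, end))
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Println(err)
		return
	}
	defer resp.Body.Close()
	n, _ := io.Copy(ioutil.Discard, resp.Body) // stand-in for writing to disk
	fmt.Printf("chunk %d-%d: %d bytes\n", start, end, n)
}

func main() {
	const url = "http://source-agent:8989/file1.root" // hypothetical endpoint
	const size = int64(1 << 30)                       // one 1 GB file, as in the test
	const chunks = 3

	var wg sync.WaitGroup
	for i := int64(0); i < chunks; i++ {
		start, end := i*size/chunks, (i+1)*size/chunks-1
		wg.Add(1)
		go fetchChunk(url, start, end, &wg)
	}
	wg.Wait()
}
```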
Case 2 (Parallel): Transferring the same 10 files in parallel. This is the case when we register a separate transfer request for each file and then transfer all of them in parallel.
sourcehost1:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
|----------------------|---------------|------------------|-------------------------|
| 1 GB each            | 67.0          | 0.6              | 00:10.10                |

\*Hun = hundredths of a second
desthost:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
|----------------------|---------------|------------------|------------------------|
| 1 GB each            | 84.6          | 0.0              | 02:17.06               |
Note: the memory usage here is not actually zero; the machine has a large amount of RAM, so the percentage rounds to 0.0 even though our Go program uses a significant amount of memory.
Table Data: link
Network Rate: Peak rate: 820 Mb/s, Transferred data: 10 GB, Tool: iftop
Observation:
- In this case, the use of bandwidth depends entirely on how the Go runtime schedules its goroutines; see the sketch after these observations.
- The behavior was quite random: sometimes there was a sudden jump in the transfer rate, while at other times the rate dropped to its lowest level.
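For reference, here is a stripped-down sketch of what case 2 amounts to: each registered request becomes its own goroutine, and the Go scheduler decides which streams make progress at any instant, which is why the observed rate looks random. The pull helper is a stand-in, not the agent's real function.

```go
package main

import (
	"fmt"
	"sync"
)

// pull is a stand-in for the agent's per-request transfer routine
// (open an HTTP connection to the source, copy the data, verify the hash).
func pull(request string) error {
	fmt.Println("finished", request)
	return nil
}

func main() {
	// Ten separate transfer requests, one per file, as in case 2.
	requests := []string{
		"file1.root", "file2.root", "file3.root", "file4.root", "file5.root",
		"file6.root", "file7.root", "file8.root", "file9.root", "file10.root",
	}

	// One goroutine per request: the Go runtime multiplexes them onto OS
	// threads, so the share of bandwidth each transfer gets at any moment
	// depends on how the scheduler interleaves them.
	var wg sync.WaitGroup
	for _, r := range requests {
		wg.Add(1)
		go func(req string) {
			defer wg.Done()
			if err := pull(req); err != nil {
				fmt.Println("transfer failed:", req, err)
			}
		}(r)
	}
	wg.Wait()
}
```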
Redo test 3, renaming e.g. 1 file on site A so it cannot be found. Is the system able to tell you that 9 files were transferred and the 10th is impossible?
sourcehost1:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
|----------------------|---------------|------------------|-------------------------|
| 1 GB each            | 7.0           | 0.6              |                         |

\*Hun = hundredths of a second
desthost:

| File Size (10 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
|----------------------|---------------|------------------|------------------------|
| 1 GB each            | 75.0          | 0.0              |                        |
Note: the memory usage here is not actually zero; the machine has a large amount of RAM, so the percentage rounds to 0.0 even though our Go program uses a significant amount of memory.
Table Data: link
Network Rate: Transferred data: 8.1 GB, Tool: iftop
Observation:
- We changed the names of two files and then tried to transfer the 10 files of the same dataset.
- Our agent successfully transfers the remaining eight files and shows errors for the two renamed files.
- Check out the errors thrown by the source agent here. Link
- We need to add a method or field to notify the client about these untransferred files: currently our agent reports that the transfer request finished instead of flagging errors for the missing files. A possible shape is sketched below.
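One possible shape for such a notification, sketched as an assumption rather than the project's actual API (the type and field names are illustrative): attach a per-file record to the transfer-request status, so a finished request can still report which files failed.

```go
package agent // hypothetical package, for illustration only

// FileStatus records the outcome for a single file of a transfer request.
type FileStatus struct {
	LFN   string `json:"lfn"`   // logical file name
	Done  bool   `json:"done"`  // transferred and checksum-verified
	Error string `json:"error"` // e.g. "file not found on source agent"
}

// RequestStatus is what the client would see: "finished" no longer hides
// the fact that some files were never transferred.
type RequestStatus struct {
	RequestID string       `json:"request_id"`
	Files     []FileStatus `json:"files"`
	Finished  bool         `json:"finished"` // all files attempted, not necessarily all succeeded
}
```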
Transfer files from multiple sites, e.g. 7 files on site A and 3 files on site B, and run the same test as point 3 (i.e. a complete dataset), moving all 10 files to site C. Is the system able to understand, if 10 files are needed at C, where they are and where to pick them up from? (Perhaps not yet in the code?)
sourcehost1:

| File Size (7 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun*) |
|---------------------|---------------|------------------|-------------------------|
| 7.7 GB              | 9.3           | 0.7              | 7:40.00                 |

\*Hun = hundredths of a second
sourcehost2:

| File Size (3 files) | CPU Usage (%) | Memory Usage (%) | CPU Time (Min:Sec.Hun) |
|---------------------|---------------|------------------|------------------------|
| 3.3 GB              | 21.6          | 0.0              | 5:00.00 (approx)       |
Note: the memory usage here is not actually zero; the machine has a large amount of RAM, so the percentage rounds to 0.0 even though our Go program uses a significant amount of memory.
Table Data: link
Network Rate: Peak rate: 828 Mb/s, Transferred data: 11 GB, Tool: iftop
Observation:
- It works based on this scheduling algorithm. [link]
- First it fetches all the files for the given dataset. Then it takes the union of these files and selects the agents that hold the given dataset (partially or fully).
- Then it sorts the selected agents by the prediction values of the ML model and serves requests from the highest prediction value to the lowest. For each request it checks whether the source agent is up; if it is, it submits the request to the destination, which performs a pull transfer. This flow is sketched after this list.
- The router can successfully transfer files from two sources to one destination. In this case there are two sources: the uibo-cms-03.cr.cnaf.infn.it site (3 files) and one VM (7 files).
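In outline, that selection loop could look like the sketch below. This is a paraphrase of the linked algorithm with hypothetical names (Agent, isAlive, submitPullRequest), not the router's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// Agent is an illustrative stand-in for a registered source agent.
type Agent struct {
	URL        string
	Files      []string // files of the requested dataset held by this agent
	Prediction float64  // ML model's score for this agent
}

// isAlive and submitPullRequest are stubs for the real status check and
// request submission.
func isAlive(url string) bool { return true }

func submitPullRequest(dst, src string, files []string) {
	fmt.Printf("pull %d files from %s to %s\n", len(files), src, dst)
}

func schedule(agents []Agent, destination string) {
	// 1. Keep only the agents that hold the dataset, partially or fully.
	var candidates []Agent
	for _, a := range agents {
		if len(a.Files) > 0 {
			candidates = append(candidates, a)
		}
	}
	// 2. Sort the candidates by the ML prediction value, best first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Prediction > candidates[j].Prediction
	})
	// 3. Serve from the highest prediction downwards; skip agents that
	//    are down, and let the destination pull from the rest.
	for _, a := range candidates {
		if isAlive(a.URL) {
			submitPullRequest(destination, a.URL, a.Files)
		}
	}
}

func main() {
	agents := []Agent{
		{"http://uibo-cms-03.cr.cnaf.infn.it:8989", []string{"f1", "f2", "f3"}, 0.9},
		{"http://source-server.cern.ch:8989", []string{"f4", "f5", "f6", "f7"}, 0.7},
	}
	schedule(agents, "http://uibo-cms-02.cr.cnaf.infn.it:8989")
}
```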
Same as 5, i.e. A+B transferring to C, but with A switched OFF (or all files there renamed). How does the system behave?
Observation:
- It only transfers the files that are available. Hence, in this case it transfers only the 3 files located on agent B.
Test how many simultaneous transfers a single node can sustain and how its throughput drops with each added transfer. Create N files of equal size and start transferring them one at a time, i.e. start one, then another, then another. Each time, measure how the throughput drops when a new transfer is added to the pool.
Conclusion:
- One thing we can conclude: if two requests share resources, their transfer rates eventually drop and become extremely low (around 20 Mb/s). In the beginning, when transferring two copies of the same file (a shared resource), the rate was around 20 Mb/s. While those two requests were being processed, we approved a request of a different kind; the transfer rate suddenly increased, reaching up to 450 Mb/s.
- With each new transfer request the throughput sometimes increases and sometimes decreases.
- Without shared resources, the node sends data at an average rate of 400-500 Mb/s.
Source: source-server.cern.ch (VM), RAM: 1.8 GB, Type: m2.small, Attached Volume: 15 GB
Destination: uibo-cms-03.cr.cnaf.infn.it
File size: 1 GB
Sequential Transfer: adding requests one by one.
| File Number | CPU Usage (%) | RAM Usage (%) | Throughput (Mb/s) |
|-------------|---------------|---------------|-------------------|
| Initially   | 0.7           | 0.7           | 0                 |
| 1           | 7.3           | 0.7           | 0                 |
| 2           | 9.3           | 0.7           | 300               |
| 3           | 7.0           | 0.7           | 535               |
| 4           | 8.6           | 0.7           | 620               |
| 5           | 7.6           | 0.7           | 551               |
| 6           | 6.0           | 0.7           | 624               |
| 7           | 7.3           | 0.8           | 605               |
| 8           | 7.6           | 0.8           | 610               |
| 9           | 7.3           | 0.8           | 627               |
| 10          | 6.6           | 0.8           | 676               |
| End         | 0.3           | 0.7           | decreased to 0    |
Parallel Transfer: submitting 1, 2, …, 10 transfer requests together. Here peak throughput refers to the throughput of the node, i.e. all files in transfer combined.
| Total Files (N) | Max CPU Usage (%) | Max RAM Usage (%) | Peak Throughput (Mb/s) | Errors |
|-----------------|-------------------|-------------------|------------------------|--------|
| 1               | 6.3               | 0.5               | 750                    |        |
| 2               | 4.3               | 0.6               | 485                    |        |
| 3               | 7.6               | 0.7               | 701                    |        |
| 4               | 8.0               | 0.7               | 661                    |        |
| 7               | 11.0              | 0.7               | 911                    | On main-agent: database is locked |
| 10              | 10.6              | 0.8               | 907                    | On main-agent: database is locked |
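The database is locked errors are SQLite's way of reporting write contention: it allows a single writer at a time, so several simultaneous requests updating the main agent's catalog can collide. Assuming the agents use the common mattn/go-sqlite3 driver (an assumption, not confirmed here), a mitigation sketch:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver, assumed for this sketch
)

func main() {
	// _busy_timeout makes SQLite wait (here 5 s) for a lock instead of
	// failing immediately with "database is locked"; WAL journaling lets
	// readers proceed while one writer is active.
	db, err := sql.Open("sqlite3", "file:catalog.db?_busy_timeout=5000&_journal_mode=WAL")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SQLite still permits only one writer at a time, so capping the pool
	// at a single connection serializes the agents' concurrent writes in
	// Go instead of letting them race at the database level.
	db.SetMaxOpenConns(1)
}
```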
Source:
- Host: source-server.cern.ch
- RAM: 1.8 GB
- CPUs: 1
- Type: m2.small
- Attached Volume: 15 GB

Destination:
- Host: uibo-cms-03.cr.cnaf.infn.it
- RAM: 36 GB
- Cores: 12

Data: 10 files of the same dataset; each file is 1 GB.
Code Version: v01.00.05
Note: The throughput below is measured in parallel mode; the overall throughput is greater than the throughput of any single file. In the pull model the data passes through the SSH tunnel (one SSH tunnel is required while pulling the data from the source agent), while in the push model the source agent pushes the data directly to the destination agent, i.e. without the SSH tunnel.
| File Name   | Start Time (sec) | End Time (sec) | Total Time (sec) | Throughput (MB/s) | Throughput (Mb/s) |
|-------------|------------------|----------------|------------------|-------------------|-------------------|
| file1.root  | 1509217078       | 1509217842     | 764              | 1.28              | 10.24             |
| file2.root  | 1509217078       | 1509217891     | 813              | 1.26              | 10.08             |
| file3.root  | 1509217078       | 1509217645     | 567              | 1.81              | 14.48             |
| file4.root  | 1509217078       | 1509217609     | 531              | 1.93              | 15.44             |
| file5.root  | 1509217078       | 1509217785     | 707              | 1.45              | 11.6              |
| file6.root  | 1509217078       | 1509217766     | 688              | 1.49              | 11.92             |
| file7.root  | 1509217078       | 1509217722     | 644              | 1.59              | 12.72             |
| file8.root  | 1509217078       | 1509217878     | 800              | 1.26              | 10.08             |
| file9.root  | 1509217078       | 1509217785     | 707              | 1.34              | 10.72             |
| file10.root | 1509217078       | 1509217757     | 679              | 1.51              | 12.08             |

Sum of throughput (Mb/s): 119.36
Note: We transferred files of different sizes one by one. In this case the throughput of each transfer is roughly equivalent to the overall system throughput.
| File Size | Start Time (sec) | End Time (sec) | Total Time (sec) | Throughput (MB/s) | Throughput (Mb/s) |
|-----------|------------------|----------------|------------------|-------------------|-------------------|
| 1 GB      | 1509271362       | 1509271477     | 115              | 8.95              | 71.6              |
| 3 GB      | 1509270793       | 1509271150     | 357              | 8.64              | 69.12             |
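For reference, the throughput columns follow directly from the timestamps. The small helper below reproduces the arithmetic (assuming 1 GB = 2^30 bytes and 1 MB/s = 8 Mb/s; small deviations from the table values come from intermediate rounding):

```go
package main

import "fmt"

// throughput converts a transfer's size in bytes and its Unix start/end
// timestamps into MB/s and Mb/s (1 MB = 1024*1024 bytes, 1 MB/s = 8 Mb/s).
func throughput(bytes, start, end int64) (mbps, mbitps float64) {
	seconds := float64(end - start)
	mbps = float64(bytes) / (1024 * 1024) / seconds
	return mbps, mbps * 8
}

func main() {
	// file3.root from the parallel-mode table: 1 GB in 567 seconds.
	mb, mbit := throughput(1<<30, 1509217078, 1509217645)
	fmt.Printf("%.2f MB/s = %.2f Mb/s\n", mb, mbit) // 1.81 MB/s = 14.45 Mb/s
}
```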