%
% This file is an example of formatting a master's thesis at the
% Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw (MIM UW).
% The skeleton of the commands used may be reused freely, e.g. to format your own thesis.
%
% The substantive content constitutes an original scientific achievement
% of Marcin Woliński. All rights reserved.
%
% Copyright (c) 2001 by Marcin Woliński <[email protected]>
% Corrections due to changes in the regulations - Marcin Szczuka, 1.10.2004
% Corrections due to changes in the regulations and unification
% - Seweryn Karłowicz, 05.05.2006
% Support for multiple authors and an English translation added - Kuba Pochrybniak, 29.11.2016
% add the [licencjacka] option for a bachelor's thesis
% add the [en] option for the English version (both may be combined: [licencjacka,en])
\documentclass[en]{pracamgr}
% Code listings
\usepackage{listings}
\usepackage{xcolor}
\usepackage{amsmath}
% Diagrams
\usepackage{tikz}
\usetikzlibrary{calc,positioning,arrows.meta,shapes,fit,backgrounds,spy}
% Plots
\usepackage{pgfplots}
\usepackage{pgfplotstable}
\usepgfplotslibrary{groupplots}
\pgfplotsset{width=10cm, compat=1.18}
\usepgfplotslibrary{statistics}
% \usepgfplotslibrary{colorbrewer}
\usepackage{filecontents}
\usepackage{float}
\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{1,1,1}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{magenta},
numberstyle=\tiny\color{codegray},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\lstset{basicstyle = \ttfamily, style=mystyle, frame=single, numbers=none, columns=fullflexible, upquote=true}
% End code listings
% Hrefs
\usepackage[colorlinks]{hyperref}
\hypersetup{
linkcolor=red,
citecolor=blue,
urlcolor=magenta,
}
% \usepackage{caption} % makes clicking figure~\ref navigate to the top of the figure instead of the caption (has some bug which makes the ref to the first figure in the appendix refer to Page 1, and others to be shifted by 1)
% End hrefs
\usepackage[backend=biber, dateabbrev=false,date=iso,urldate=iso]{biblatex}
\usepackage{csquotes}
% Tables
\usepackage{longtable}
\setlength{\LTcapwidth}{\textwidth} % make caption below longtable have the \textwidth
\usepackage{makecell}
\usepackage{multirow}
% Appendices
\usepackage[titletoc]{appendix}
\makeatletter
\def\@cline#1-#2\@nil{%
\omit
\@multicnt#1%
\advance\@multispan\m@ne
\ifnum\@multicnt=\@ne\@firstofone{&\omit}\fi
\@multicnt#2%
\advance\@multicnt-#1%
\advance\@multispan\@ne
\leaders\hrule\@height\arrayrulewidth\hfill
\cr
\noalign{\nobreak\vskip-\arrayrulewidth}}
%^^^^^^^^
\makeatother
\addbibresource{references.bib}
% Author's data:
\autor{Krzysztof Małysa}{394442}
\title{Sandbox for multi-process applications for unprivileged users on Linux}
\titlepl{Piaskownica dla aplikacji wieloprocesowych dla nieuprzywilejowanych użytkowników systemu Linux}
%\tytulang{An implementation of a difference blabalizer based on the theory of $\sigma$ -- $\rho$ phetors}
% field of study (kierunek):
%  - matematyka, informatyka, ...
%  - Mathematics, Computer Science, ...
\kierunek{Computer Science}
% informatyka - the specialization (zakres) is not specified (option commented out)
% matematyka - the specialization may remain unspecified,
%              but if it is to be specified for a master's thesis,
%              it takes one of the following values:
%              {metod matematycznych w finansach}
%              {metod matematycznych w ubezpieczeniach}
%              {matematyki stosowanej}
%              {nauczania matematyki}
% For a bachelor's thesis the following specialization value is also possible:
%              {Jednoczesnych Studiow Ekonomiczno--Matematycznych}
% \zakres{If needed, put one of the options listed above here}
% The thesis was written under the supervision of:
% (give the title/degree and the full name of the supervisor,
%  the institute, and optionally the faculty and university if not MIM UW)
\opiekun{dr Janina Mincer-Daszkiewicz\\Institute of Informatics}
% month and year:
\date{December 2023}
%Specify the subject area according to the Socrates-Erasmus classification:
\dziedzina{
%11.0 Mathematics, Computer Science:\\
%11.1 Mathematics\\
%11.2 Statistics\\
11.3 Informatics, Computer Science\\
%11.4 Artificial intelligence\\
%11.5 Actuarial science\\
%11.9 Other mathematical and computer sciences
}
%Thematic classification according to AMS (mathematics) or ACM (computer science)
\klasyfikacja{Security and privacy -- Systems security -- Operating systems security}
% Keywords:
\keywords{sandboxing, security, Linux, secure execution, arbitrary code execution, online judging system, Linux namespaces, cgroups, seccomp,
programming competitions}
% This is a good place for your own macros and environments:
% \newtheorem{defi}{Definicja}[section]
% end of definitions
\begin{document}
\renewcommand\textfraction{0}
\maketitle
% the abstract that goes on the front page
\begin{abstract}
Secure execution environments have evolved over time and are commonplace these days. Relatively recent features of the Linux kernel allow the creation of simple yet effective and efficient secure environments. We introduce a new sandbox for unprivileged users on Linux that requires no kernel modifications. Its primary use case is an online judge platform for carrying out algorithmic and programming contests. We use Linux namespaces for resource isolation, cgroups for resource limiting, and seccomp for restricting the allowed system calls.
The sandbox is versatile enough to securely run complex multi-process programs like the C++ compiler. It is optimized to run short-running programs with minimal overhead. The simplest short-running programs take less than 3 ms to execute in our configuration, more than four times faster than competing solutions.
The implementation uses a client-server architecture to optimize for short-running programs. It allows fine-grained configuration of Linux namespaces and resource limits, while providing execution statistics: real time, CPU time, and peak memory usage.
In this thesis we give an overview of sandboxing techniques, describe the design and implementation of our sandbox, and evaluate the performance of the implementation in the context of the online judging platform.
\end{abstract}
\tableofcontents
%\listoffigures
%\listoftables
\chapter{Introduction}\label{chapter:introduction}
\section{Background}
Secure execution environments are commonplace these days, from containers and virtual machines on servers to sandboxes on laptops and smartphones --- most of which run on Linux. They are used to securely execute untrusted code, as well as trusted programs, to prevent damage escalation in the event of unknown vulnerabilities. Their key features are isolation, limiting resource usage, and accounting for resource consumption.
The features of Linux allow the creation of simple yet effective and efficient secure environments. They work at application runtime, so in most cases existing software does not need to be adapted to use them. This makes them easily applicable and explains why their adoption is growing.
In this thesis, the most important application of sandboxing is online judge systems. Online judge systems play a beneficial role in programming education and competitive programming. They allow testing a user-provided solution to a specific problem. The solution is run on predefined test cases in order to check whether it is valid. In such platforms, isolating the compilation and execution of the tested program is essential for the security and robustness of the platform itself.
Historically, isolation techniques evolved together with online judge platforms. The most primitive (and insecure) approach was the use of \texttt{chroot(2)}~\cite{prevelakis2001sandboxing} to restrict access to part of the filesystem. To increase isolation, virtual machines were used~\cite{5635141}. Later, containerization became a new way to provide isolation~\cite{marevs2012new, SPACEK20151665}.
Online education platforms greatly facilitate teaching and learning programming. They provide quick feedback on the correctness of the code the user submits. They are used in schools and universities and provide great learning opportunities for all.
Moreover, a versatile sandbox has applications outside online judging platforms. For example, it can be used to sandbox the compilation of a PDF from \LaTeX~sources or to safely execute untrusted server-side scripts in web applications.
\section{Goal of the thesis}
The goal of the thesis is to design, implement, and integrate a new sandbox for the Sim project~\cite{sim_project}. The Sim project is an online platform for preparing people for and carrying out algorithmic contests. The project started in 2012 and has been developed by me since the beginning. It is used at the XIII High School in Szczecin and at programming camps to teach young people programming and algorithms. It has an online judge with a sandbox developed specifically for this use case. Over the years the sandbox became a limitation: it only allows running a single-threaded, statically linked executable built from programs written in C, C++, or Pascal. The new sandbox will allow supporting more programming languages and will improve the security of the compilation stage of the tested programs.
\subsection{Requirements}
The new sandbox needs to be optimized for running short-running programs as well as have minimal runtime overhead. Most of the test cases the tested program is run on are small, and the program completes them in less than 10 ms. The goal is to allow hundreds of such short-running runs per second, hence optimizing for short-running programs is important. Minimizing the overhead of the sandbox during the run is also important, i.e. if a program normally runs for X ms, the objective is that it will also run for approximately X ms inside the new sandbox.
The new sandbox needs to be versatile. It will be used to secure both the compilation and the running of the tested programs. Compilation is a complicated process that involves parsing, translating, optimizing, and linking the final program. For languages like C, C++, and Rust it involves running several executables in coordination, e.g. a compiler and a linker, i.e. more than one process at a time --- the sandbox needs to support that.
Sandboxing needs to have a low overhead. Apart from small tests where the tested program runs quickly (a matter of milliseconds), the tested program is almost always also run on big test cases, where it may need several seconds to solve the problem. Increasing this time as little as possible while the tested program runs inside the sandbox is one of the primary objectives.
Compilation often requires running several executables, e.g. a compiler and a linker, so allowing only a single process inside the sandbox is not enough. Sandboxing the tested program is simpler, because it is a single process, but since it is often short-running, the overhead needs to be minimal.
The sandbox needs to allow limiting resources. Real time, CPU time, and memory need to be limited not only for the robustness of the platform, but also because specific problems require different limits. The goal of some problems is to solve them with very restricted memory, e.g. find the missing integer in a random permutation of the integers $1, \ldots, n$ with one element removed, using only $O(1)$ memory.
The sandbox needs to account for resource usage. For every test, the user is presented with the memory and CPU time consumed by their solution. The sandbox needs to provide this information.
The last requirement is that the sandbox must not require any privileges. There is a tool called Sip~\cite{sip} for preparing problem packages for the Sim platform. One of the purposes of the tool is to run the solutions inside the same secure environment as on the Sim platform. The user should not need any privileges to run this tool, so the sandbox should not require them either.
\subsection{Existing solutions}
Approaches to forming a secure execution environment differ. One of them is virtualization or emulation, e.g. QEMU~\cite{qemu_website} with KVM~\cite{kvm_website}, VirtualBox~\cite{virtualbox_website}, or VMware Workstation~\cite{vmware_workstation_website}. Although powerful and effective, they come with an enormous overhead, i.e. booting up an entire operating system. Moreover, emulation noticeably slows down the runtime of an emulated application, rendering such solutions inapplicable.
Containers provide much lower overhead: setup on the order of milliseconds and negligible runtime overhead. However, Docker~\cite{Merkel:2014:DLL:2600239.2600241} and LXC~\cite{conf/cisis/BeserraMEBSF15} require root privileges to create a container, and systemd-nspawn~\cite{systemd_nspawn} requires root privileges to run.
Rootless containers~\cite{rootless_containers_rs} --- containers that can be created and run by an unprivileged user --- are an almost perfect solution to the problem. They provide almost all of the functionality of normal containers but without the need to engage a privileged user. However, they often use setuid binaries, which is undesirable~\cite{podman_rootless_containers_presentation}. Moreover, they are not optimized to run sequences of short-running programs. In this thesis we create a sandbox that uses the same techniques as rootless containers but is optimized for running sequences of short-running programs.
% \section{Scope}
% The sandbox is implemented in C++ and runs only on Linux. The implementation was tested on x86\_64 but should also work on x86 or arm64, due to no architecture specific parts. Only seccomp filters may need to be adapted for other architectures, but the sandbox itself uses user-provided seccomp filters so it should work out of the box.
% The sandbox uses linux namespaces to isolate the sandboxed program and cgroups to limit resources and provide resource usage accounting.
% A sandbox client in Rust language would be beneficial for the broader adoption of the sandbox, but is intentionally out of scope of this thesis.
\section{Structure of the Thesis}
Chapter~\ref{chapter:problem_overview} contains an overview of sandboxing techniques and existing implementations, together with a comparative analysis of them. Details of the design and architecture are described in Chapter~\ref{chapter:design}. The implementation is described in Chapter~\ref{chapter:implementation}. Chapter~\ref{chapter:performance} contains a performance evaluation of the final implementation and the impact of selected optimizations. Finally, Chapter~\ref{chapter:conclusion} contains conclusions and suggestions for further work. Appendix~\ref{appendix:tables_and_plots} contains tables and plots.
\chapter{Problem overview}\label{chapter:problem_overview}
\section{Overview of Sandboxing Techniques}
During the first programming competitions, human judges manually read and verified the source code of the contestants' solutions~\cite{tochev2010validating}. Over time this became infeasible and gave birth to automatic judge systems.
To prevent people from interfering with the normal workflow of the competition, e.g. by a denial-of-service attack exhausting memory resources, the automatic judge systems need a secure way to compile and execute a contestant's solution. This is where sandboxes come into play.
The first sandboxes required modification of the OS kernel~\cite{provos2003improving, garfinkel2004janus, garfinkel2004ostia, jana2011txbox, li2014minibox}. While they had little run-time overhead, some of them were limited to single-threaded applications~\cite{merry2009using}.
Later, as support for process tracing matured, \texttt{ptrace}-based sandboxes arose~\cite{marevs2007perspectives, kolstad2009infrastructure, kim2013practical}. The problem with those solutions is the overhead, which varies from around 75\%~\cite{merry2010performance} to 160\%~\cite{marevs2012new} for syscall-intensive programs. This overhead, however, does not affect the fairness of programming contests much~\cite{marevs2011fairness}. Supporting multi-threaded and multi-process programs with \texttt{ptrace} is tricky, but possible~\cite{kim2013practical}, because of the Time of Check/Time of Use (TOCTOU) problem~\cite{cwe_toctou}: a \texttt{ptrace}-based sandbox needs to inspect syscall arguments, and to do so it has to read them, but a multi-threaded or multi-process program can change an indirect argument after the read but before the kernel uses the argument. This creates a dangerous race condition that has to be addressed.
Finally, after the kernel support for containerization materialized, namespace- and cgroup-based sandboxes came into use~\cite{marevs2012new, netblue30/firejail, raknes2016nsroot, google/nsjail, flatpak, SPACEK20151665}. Contrary to \texttt{ptrace}-based sandboxes, namespace-based sandboxes have negligible runtime overhead~\cite{marevs2012new}. Moreover, they do not require modifications of the Linux kernel and work on major Linux distributions out of the box.
\section{Existing Implementations}
\subsection{Modifying OS kernel}
\subsubsection{Systrace}
Systrace~\cite{provos2003improving} intercepts all system calls in the kernel. It then decides whether a syscall is safe by first checking a static list of safe syscalls. This step exists to reduce the sandboxing performance overhead. If the syscall is not on the list, Systrace consults user space for a decision.
The system avoids the TOCTOU problem~\cite{cwe_toctou} by copying syscall arguments to kernel memory before asking user space for a decision.
\subsubsection{Janus}
Janus~\cite{garfinkel2004janus} adds a module that extends the Linux \texttt{ptrace} API. Policies are defined using configuration files. By default, all syscalls are denied. Each configuration directive refers to a policy module that provides the logic for deciding whether to allow a particular system call. For example, the \texttt{path} module can be used to restrict I/O on certain file paths.
\subsubsection{Ostia}
Ostia~\cite{garfinkel2004ostia}, instead of filtering system calls, delegates them to an external agent that performs syscalls on behalf of the sandboxed process. The authors emphasize that such an architecture simplifies the system and protects against TOCTOU problems~\cite{cwe_toctou}.
Ostia is implemented as two components: a small kernel module and a user space part. The module intercepts a syscall and copies its arguments via an IPC link to the user space agent. The agent decides whether the call should be allowed, executes it, and returns the results back over the IPC link. Notably, not all syscalls have to be delegated --- some can always be allowed while others are always denied.
\subsubsection{TxBox}
TxBox~\cite{jana2011txbox} introduces system-level transaction support. The impact of the untrusted, insecure code is limited by rolling back the system state after the execution. This provides strong isolation and works with arbitrary executables, but requires significant out-of-tree patches to the OS kernel.
\subsubsection{MiniBox}
MiniBox~\cite{li2014minibox} is a two-way sandbox that protects the operating system from the application as well as the application from the operating system. A modified version of the TrustVisor~\cite{mccune2010trustvisor} hypervisor runs the OS and the sandboxed application separately in a Mutually Isolated Execution Environment. The hypervisor is the only communication channel between the isolated application and the regular OS. This way, the application is protected from a malicious operating system. To protect the OS from the application, MiniBox uses Software Fault Isolation techniques from NaCl~\cite{yee2010native}.
\subsubsection{SACO sandbox}
The South African Computer Olympiad (SACO) sandbox~\cite{merry2009using} inserts a custom kernel module that hooks into the Linux Security Module infrastructure. Although it has negligible time and memory overhead, it only supports single-threaded programs.
\subsection{\texttt{ptrace}-based}
\subsubsection{MO sandbox}
The MO sandbox~\cite{marevs2007perspectives, kolstad2009infrastructure} allows only single-threaded programs. It simply inspects arguments using \texttt{ptrace} and uses \texttt{setrlimit}~\cite{man_getrlimit_setrlimit_prlimit} to limit resources. It is used by the USA Computing Olympiad (USACO).
\subsubsection{MBOX}
MBOX~\cite{kim2013practical} requires no superuser privileges. It makes use of seccomp BPF system call filtering to restrict the allowed syscalls. BPF filtering is effective only for non-indirect arguments. To address this issue, the installed BPF filter notifies the \texttt{ptrace} monitoring process if further argument inspection is necessary to make a decision. To avoid the TOCTOU problem~\cite{cwe_toctou}, MBOX allocates a read-only page to which it copies the indirect arguments before inspecting them, and rewrites the syscall to use the copied arguments. The copied arguments are protected against modification because changing page access permissions is impossible without a syscall.
\subsubsection{SIO2jail}
SIO2jail~\cite{sio2jail} uses \texttt{ptrace} to listen on \texttt{perf}~\cite{perf} performance counters. The \texttt{perf} performance counters are used to compute the number of instructions executed by the program. The Polish Olympiad in Informatics (POI) uses SIO2jail to measure the running time of the users' solutions in the number of executed instructions. As far as we are aware, this is unique to POI, since the rest of the world uses CPU time. The drawback of this method is that a REP instruction is counted as one instruction, no matter how many times it is repeated by the processor. SIO2jail requires no superuser privileges once access to performance counters is enabled in the system for all users.
\subsection{Using Linux namespaces}
\subsubsection{Firejail}
Firejail~\cite{netblue30/firejail} uses seccomp BPF system call filtering and mount namespaces to restrict filesystem access. Similarly, it uses PID namespaces to limit the view of running processes and network namespaces to restrict access to network devices. However, Firejail uses a \texttt{setuid}~\cite{man_setuid} helper binary to achieve that. It allows resource limiting through \texttt{prlimit}~\cite{man_getrlimit_setrlimit_prlimit}.
\subsubsection{nsjail} \label{subsubsection:nsjail}
nsjail~\cite{google/nsjail} uses Linux namespaces, seccomp BPF system call filtering, \texttt{setrlimit}~\cite{man_getrlimit_setrlimit_prlimit} and cgroups to limit resources. It does not require superuser privileges. However, it is not optimized for running short-running programs. Also, it does not provide statistics of the run.
\subsubsection{nsroot}
nsroot~\cite{raknes2016nsroot} does not support resource limiting. It only makes use of Linux namespaces to restrict the view of the file system, IPC, and network devices.
\subsubsection{Flatpak}
Flatpak~\cite{flatpak}, previously xdg-app, is a software packaging and sandboxing tool. Internally, it uses the Bubblewrap sandbox. Bubblewrap~\cite{bubblewrap} is a \texttt{setuid}~\cite{man_setuid} program that uses Linux namespaces and seccomp filters.
\subsubsection{New Contest Sandbox}
The New Contest Sandbox~\cite{marevs2012new} uses Linux namespaces and cgroups, but not seccomp filters. It is used by the Moe modular contest system (2012)~\cite{marevs2012new}. Linux namespaces and cgroups have negligible overhead compared to \texttt{ptrace}.
\subsubsection{APAC}
APAC (Automatic Programming Assignment Checker)~\cite{SPACEK20151665} uses Docker for sandboxing. It sets up a container for each run. Docker uses runC under the hood. While the runtime overhead of Docker is low, the setup phase is the primary source of overhead for short-running programs.
\subsubsection{runC}
runC~\cite{cochak2021runc} uses the same features of the Linux kernel as nsjail. However, its configuration is stored in files instead of being passed as command-line arguments. It has a special \texttt{rootless} mode which does not require superuser privileges. However, despite all of the above, it is not optimized for running short-running programs. runC is used internally by Docker.
\subsection{Other}
\subsubsection{Google Native Client}
Native Client (NaCl)~\cite{yee2010native} uses static analysis and Software Fault Isolation. After the static analysis, the program runs at native speed, but it requires recompilation with a special compiler and libraries. NaCl only works on the x86 architecture.
\section{Conclusion}
Many sandboxing solutions exist. Of all of the above, the closest to our requirements is nsjail (see Section~\ref{subsubsection:nsjail}). However, it is not optimized for running short-running programs. In fact, none of the above solutions is optimized to run hundreds of short-running programs per second.
Considering the similarities between nsjail and our solution, in the performance analysis we will compare our sandbox to nsjail.
% Overhead caused by context switches can be minimized by restricting the process or group of processes to a separate cpuset~\cite{merry2010performance}.
\chapter{Design and Architecture}\label{chapter:design}
\section{Client-server architecture}
From the start, the sandbox was based on a client-server architecture. This choice was made to minimize process-cloning overhead~\cite{redis-latency-generated-by-fork}, since cloning is a costly operation and happens for every sandboxing request: \texttt{fork}/\texttt{clone} needs to duplicate the whole address space of the process, and the client process may have a large address space. The server process, executed from a separate executable, has a minimal address space, therefore the cloning overhead is minimal for every request. Moreover, this architecture easily and safely allows sharing as much work as possible between sandboxing requests, which is the key to the low overhead of running short-running programs.
The client spawns the sandboxing server and sends sandboxing requests to it via a UNIX domain socket. This is illustrated in Figure~\ref{fig:server_waits_for_request}. The request contains the executable, arguments, namespace configuration, resource limits, seccomp BPF filters, and a pipe through which the result will be sent back.
At startup, the server creates the cgroup hierarchy and some namespaces so that they will not have to be created later, or so that their creation will be faster. Other utilities are also set up here once instead of for every request. Then the server starts accepting requests.
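
To make the client-server interaction concrete, the following is a minimal sketch of how a client could spawn the supervisor process and keep one end of a UNIX domain socket pair for sending requests. The binary name \texttt{sandbox-supervisor} and the convention of passing the socket as file descriptor 3 are assumptions made only for this illustration and are not the actual interface of the implementation.
\begin{lstlisting}[language=C++]
// Sketch: spawn the sandbox server and keep a UNIX socket connected to it.
// The binary name "sandbox-supervisor" and fd number 3 are assumptions.
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

int spawn_sandbox_server() {
    int sv[2];
    // Stream UNIX socket pair: sv[0] stays in the client, sv[1] goes to the server.
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        perror("socketpair");
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {                // child: becomes the supervisor process
        close(sv[0]);              // the client's end is not needed here
        dup2(sv[1], 3);            // assumed convention: request socket on fd 3
        execlp("sandbox-supervisor", "sandbox-supervisor", (char*)nullptr);
        perror("execlp");          // reached only if exec fails
        _exit(1);
    }
    close(sv[1]);                  // the parent keeps only its own end
    return sv[0];                  // requests are later written to this fd
}
\end{lstlisting}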
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, densely dotted] (errors) {anonymous file\\for fatal errors};
\node [draw, above left=1cm and 0.1cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (client) {client};
\node [draw, above right=1cm and 0.1cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (supervisor) {supervisor\\process};
\draw[->] (client.-2) -- node[above] {UNIX socket \\ for requests} (supervisor.-178);
\draw[<-] (client) |- node[below] {} (errors);
\draw[->] (supervisor) |- node[below] {\texttt{stderr}} (errors);
\begin{scope}[on background layer]
\node [draw, dash dot, fit=(supervisor)] (server) {};
\node [above=0.1cm of server] {Sandbox server};
\end{scope}
% Legend
\matrix [draw, right] at ($(current bounding box.north west)+(0, 0.5)$) {
\draw[<-] (0,0) -- (0.5,0); & \node[right] {information flow}; \\
\draw node[right, draw] {x}; & \node[right] {process}; \\
\draw node[right, draw, densely dotted] {y}; & \node[right] {resource}; \\
};
\end{tikzpicture}
\caption{Sandbox server waits for requests. The client sends requests through the UNIX socket. The sandbox server will die on a fatal error, leaving the error message for the client in the anonymous file.}
\label{fig:server_waits_for_request}
\end{figure}
\subsection{Sandboxing request handling}
For each request, the server process (also called the supervisor) spawns the PID~1 process of a new PID namespace. Then this init process sets up namespaces and some of the resource limits. Finally, the PID~1 process spawns the tracee process, which finishes the configuration and executes the requested executable. So for each sandboxing request we spawn exactly two processes. However, the executed program can spawn new processes --- each of them is referred to as a ``tracee process''. The PID~1 process is necessary for a couple of reasons:
\begin{itemize}
\item It reaps the zombie processes in the tracee PID namespace.
\item It allows locking mount points in the mount namespace. The tracee process is spawned in a new user and mount namespace. Mounts are performed by the PID~1 process, therefore all mounts become locked together and cannot be individually unmounted by the tracee~\cite{man_mount_namespaces}. These mounts cannot be performed by the supervisor process instead, because that would alter the mount namespace for subsequent requests.
\item Inside a PID namespace, sending signals to the PID~1 process is allowed only for signals for which the PID~1 process installed a signal handler. This could change the behavior of some programs, therefore a helper PID~1 process is needed.
\end{itemize}
This is shown in Figure~\ref{fig:server_handles_request_after_execve}.
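
As an illustration of this spawning step, the sketch below shows how a process in the role of the supervisor could create the PID~1 process with \texttt{clone3}, requesting new user, mount, and PID namespaces and placing the child directly into a cgroup with \texttt{CLONE\_INTO\_CGROUP}. It is a simplified sketch under the assumption of a recent kernel and glibc; the cgroup path and the body of the child are placeholders, not the actual implementation.
\begin{lstlisting}[language=C++]
// Sketch: spawn the PID 1 process in new user, mount and PID namespaces and
// place it directly into its cgroup via CLONE_INTO_CGROUP (Linux >= 5.7).
#include <fcntl.h>
#include <linux/sched.h>    // struct clone_args and the CLONE_* flags
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <csignal>
#include <cstring>

pid_t spawn_pid1(const char* pid1_cgroup_path) {   // e.g. the /pid1 cgroup directory
    int cgroup_fd = open(pid1_cgroup_path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (cgroup_fd < 0)
        return -1;

    clone_args args;
    std::memset(&args, 0, sizeof(args));
    args.flags = CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID | CLONE_INTO_CGROUP;
    args.exit_signal = SIGCHLD;
    args.cgroup = static_cast<__u64>(cgroup_fd);

    pid_t pid = static_cast<pid_t>(syscall(SYS_clone3, &args, sizeof(args)));
    if (pid == 0) {
        // Child: the PID 1 process of the new PID namespace. Here it would set
        // up mounts and resource limits, spawn the tracee and reap zombies.
        _exit(0);
    }
    close(cgroup_fd);
    return pid;    // > 0 in the parent, -1 on error
}
\end{lstlisting}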
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, densely dotted] (errors) {anonymous file\\for fatal errors};
\node [draw, above left=1cm and 0.1cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (client) {client};
\node [draw, above right=1cm and 0.2cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (supervisor) {supervisor\\process};
\node [draw, right=0cm and 0.9cm of supervisor, minimum height=1.4832815729996514cm, minimum width=2.4cm] (pid1) {PID~1\\process};
\node [draw, above right=-0.4cm and 0.5cm of pid1.east] (tracee_mid) {tracee\\process};
\node [draw, above=0.4cm of tracee_mid] (tracee_up) {tracee\\process};
\node [below=0.2cm of tracee_mid] (tracee_dots) {\ldots};
\node [draw, below=0.2cm of tracee_dots] (tracee_bottom) {tracee\\process};
\draw[->] (client.-2) -- node[above] {UNIX socket \\ for requests} (supervisor.-178);
\draw[<-] (client.-10) -- node[below] {response pipe} (supervisor.-170);
\draw[<-] (client) |- node[below] {} (errors);
\draw[->] (supervisor) |- node[below] {\texttt{stderr}} (errors);
\draw[<-, dashed] (supervisor) -- (pid1);
\draw[<-, dashed] (pid1) -- (tracee_mid);
\draw[<-, dashed] (pid1.20) -- (tracee_up);
\draw[<-, dashed] (pid1.-20) -- (tracee_bottom);
\begin{scope}[on background layer]
\draw [dash dot, rounded corners, blue!70, fill=blue!10]
($(pid1.north west)+(-0.3,0.3)$) .. controls +(0:3cm) and +(270:1.5cm) ..
($(tracee_up.north west)+(-0.3,0.3)$) --
($(tracee_up.north east)+(0.3,0.3)$) --
($(tracee_bottom.south east)+(0.3,-0.3)$) --
($(tracee_bottom.south west)+(-0.3,-0.3)$) .. controls +(90:1.2cm) and +(0:3cm) .. node[below, xshift=-8mm, pos=0.7, blue] {PID namespace}
($(pid1.south west)+(-0.3,-0.3)$) --
cycle;
\end{scope}
\begin{scope}[on background layer]
\coordinate (top_right) at ($(tracee_up.north east)+(0.4,0.4)$);
\coordinate (bottom_right) at ($(tracee_bottom.south east)+(0.4,-0.4)$);
\node [draw, dash dot, fit=(supervisor) (pid1) (top_right) (bottom_right)] (server) {};
\node [above=0.1cm of server] {Sandbox server};
\end{scope}
% Legend
\matrix [draw, right] at (current bounding box.north west) {
\draw[<-] (0,0) -- (0.5,0); & \node[right] {information flow}; \\
\draw[<-, dashed] (0,0) -- (0.5,0); & \node[right] {parentship}; \\
\draw node[right, draw] {x}; & \node[right] {process}; \\
\draw node[right, draw, densely dotted] {y}; & \node[right] {resource}; \\
};
\end{tikzpicture}
\caption{Sandbox server handles a request, at the moment after executing the requested executable. The sandbox server will die on a fatal error, leaving the error message for the client in the anonymous file. The sandbox server consists of the supervisor process and its child --- the PID~1 process that is spawned for each request. The PID~1 process performs the role of the init process in the PID namespace of the tracee processes.}
\label{fig:server_handles_request_after_execve}
\end{figure}
\section{Cgroups}
The server gains write access to the cgroup hierarchy by being executed through \texttt{systemd-run -{}-user -{}-scope -{}-property=Delegate=yes -{}-collect}. It then enables the \texttt{pids}, \texttt{memory}, and \texttt{cpu} controllers for the subgroups below.
At startup, the server process creates the cgroup v2 hierarchy that looks as follows:
\begin{itemize}
\item \texttt{/supervisor} --- cgroup of the supervisor process,
\item \texttt{/pid1} --- cgroup of the PID~1 process,
\item \texttt{/tracee} --- cgroup of the tracee processes.
\end{itemize}
After creating the hierarchy, it places the supervisor process in its cgroup. Subsequent processes are placed in their cgroups using the \texttt{CLONE\_INTO\_CGROUP} flag.
\newline
\texttt{/tracee} cgroup allows:
\begin{itemize}
\item Killing all tracee processes by writing 1 to the \texttt{/tracee/cgroup.kill} file.
\item Reading CPU user and CPU system time via the \texttt{/tracee/cpu.stat} file.
\item Reading peak memory usage via the \texttt{/tracee/memory.peak} file.
\item Setting the process/thread number limit by writing to \texttt{/tracee/pids.max}.
\item Setting the memory hard limit by writing to \texttt{/tracee/memory.max}.
\item Setting the CPU usage limit by writing to \texttt{/tracee/cpu.max}.
\item Disabling PSI accounting, to reduce the sandboxing overhead, by writing 0 to the \\\texttt{/tracee/cgroup.pressure} file.
\end{itemize}
The \texttt{/tracee} cgroup needs to be deleted and recreated after each request to reset the \texttt{/tracee/cpu.stat} and \texttt{/tracee/memory.peak} files.
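
To illustrate how the supervisor interacts with these control files, below is a minimal sketch of helper functions for writing a limit to and reading a statistic from the tracee cgroup. The path prefix is an assumption made for this illustration; the real hierarchy lives inside the delegated systemd scope.
\begin{lstlisting}[language=C++]
// Sketch: writing limits to and reading statistics from the tracee cgroup.
// The path prefix is an assumption made only for this illustration.
#include <fstream>
#include <string>

const std::string TRACEE_CGROUP = "/sys/fs/cgroup/delegated-scope/tracee/";

// Write a single value to a cgroup control file.
bool write_cgroup_file(const std::string& file, const std::string& value) {
    std::ofstream out(TRACEE_CGROUP + file);
    out << value;
    return static_cast<bool>(out);
}

// Read the first line of a cgroup statistics file.
std::string read_cgroup_file(const std::string& file) {
    std::ifstream in(TRACEE_CGROUP + file);
    std::string value;
    std::getline(in, value);
    return value;
}

// Example usage:
//   write_cgroup_file("pids.max", "32");          // process/thread limit
//   write_cgroup_file("memory.max", "268435456"); // 256 MiB hard limit
//   write_cgroup_file("cgroup.kill", "1");        // kill all tracee processes
//   read_cgroup_file("memory.peak");              // peak memory usage in bytes
\end{lstlisting}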
\section{Linux namespaces}
Linux allows unprivileged users to create only user namespaces. However, after entering a new user namespace, the process gains all privileges inside that namespace and can create other namespaces.
\newline
The supervisor process creates the following namespaces:
\begin{itemize}
\item user namespace --- in order to create other namespaces and hide user ID and group ID,
\item mount namespace --- to allow mounting detached cgroups v2 hierarchy,
\item cgroup namespace --- to allow mounting detached cgroups v2 hierarchy,
\item network namespace --- to disconnect every tracee from network devices, done once, as it is costly,
\item IPC namespace --- to isolate every tracee from other processes' IPC, done once, for optimization,
\item UTS namespace --- to isolate every tracee from host's hostname, done once, for optimization,
\item time namespace --- to isolate every tracee from host's time namespace, done once, for optimization.
\end{itemize}
The PID~1 process creates the following namespaces:
\begin{itemize}
\item user namespace --- in order to create other namespaces and hide user ID and group ID,
\item mount namespace --- to allow mounting requested mount-point hierarchy,
\item PID namespace --- to isolate tracee from accessing other processes.
\end{itemize}
The tracee process creates the following namespaces:
\begin{itemize}
\item user namespace --- in order to create other namespaces, hide the user ID and group ID, and lock the mount tree,
\item mount namespace --- in order to lock the mount tree created by PID~1 process.
\end{itemize}
The namespace hierarchy listed above is illustrated in Figure~\ref{fig:server_namespaces}.
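
The user namespaces above hide the real user and group IDs by mapping them to IDs chosen inside the namespace. The sketch below shows the standard procedure an unprivileged process can follow after \texttt{unshare(CLONE\_NEWUSER)}: write \texttt{uid\_map}, deny \texttt{setgroups}, then write \texttt{gid\_map}. Mapping to ID 0 is only an example; the sandbox maps IDs according to the request.
\begin{lstlisting}[language=C++]
// Sketch: map the current UID/GID to 0 inside a freshly created user namespace.
// Denying setgroups is required before an unprivileged process may write gid_map.
#include <sched.h>      // unshare(), CLONE_NEWUSER (g++ defines _GNU_SOURCE)
#include <unistd.h>
#include <fstream>

bool enter_user_namespace() {
    uid_t outer_uid = getuid();
    gid_t outer_gid = getgid();
    if (unshare(CLONE_NEWUSER) != 0)
        return false;

    // Each mapping file must be written with a single write; the buffered
    // streams below do exactly that when they are destroyed.
    std::ofstream("/proc/self/uid_map") << 0 << ' ' << outer_uid << ' ' << 1;
    std::ofstream("/proc/self/setgroups") << "deny";
    std::ofstream("/proc/self/gid_map") << 0 << ' ' << outer_gid << ' ' << 1;
    return true;
}
\end{lstlisting}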
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, minimum height=0.8652475842497965cm, minimum width=1.4cm] (client) {client};
\node [draw, below=0.7cm of client, minimum height=1.4832815729996514cm, minimum width=2.4cm] (supervisor) {supervisor\\process};
\node [draw, below=0.7cm of supervisor] (pid1) {PID~1 process};
\node [draw, below left=0.7cm and -1.7cm of pid1] (tracee_mid) {tracee\\process};
\node [draw, left=0.3cm of tracee_mid] (tracee_left) {tracee\\process};
\node [right=0.1cm of tracee_mid] (tracee_dots) {\ldots};
\node [draw, right=0.1cm of tracee_dots] (tracee_right) {tracee\\process};
\draw[<-, dashed] (supervisor) -- (pid1);
\draw[<-, dashed] (pid1) -- (tracee_mid);
\draw[<-, dashed] (pid1.-150) -- (tracee_left);
\draw[<-, dashed] (pid1.-30) -- (tracee_right);
\begin{scope}[on background layer]
\draw[dash dot, rounded corners, darkgray!70, fill=darkgray!10]
($(supervisor.north west)+(-0.4,0.4)$) --
($(supervisor.north east)+(0.4,0.4)$) -- node[above, xshift=12mm, pos=0.6, darkgray] {user, mount,\\cgroup, network,\\IPC, UTS, time\\namespaces}
($(tracee_right.north east)+(0.4,0.4)$) --
($(tracee_right.south east)+(0.4,-0.4)$) --
($(tracee_left.south west)+(-0.4,-0.4)$) --
($(tracee_left.north west)+(-0.4,0.4)$) --
cycle;
\end{scope}
\begin{scope}[on background layer]
\draw[dash dot, rounded corners, blue!70, fill=blue!10]
($(pid1.north west)+(-0.3,0.3)$) --
($(pid1.north east)+(0.3,0.3)$) --
($(tracee_right.north east)+(0.3,0.3)$) --
($(tracee_right.south east)+(0.3,-0.3)$) --
($(tracee_left.south west)+(-0.3,-0.3)$) --
($(tracee_left.north west)+(-0.3,0.3)$) -- node[above, xshift=-12mm, pos=0.3, blue] {user, mount, PID\\namespaces}
cycle;
\end{scope}
\begin{scope}[on background layer]
\draw[dash dot, rounded corners, red!70, fill=red!10]
($(tracee_right.north east)+(0.2,0.2)$) -- node[above, xshift=13mm, pos=0.8, red] {user, mount\\namespaces}
($(tracee_right.south east)+(0.2,-0.2)$) --
($(tracee_left.south west)+(-0.2,-0.2)$) --
($(tracee_left.north west)+(-0.2,0.2)$) --
cycle;
\end{scope}
% Legend
\matrix [draw, right] at ($(current bounding box.north west)+(-2, -1)$) {
\draw[<-, dashed] (0,0) -- (0.5,0); & \node[right] {parentship}; \\
\draw[-, dash dot] (0,0) -- (0.5,0); & \node[right] {namespace boundary}; \\
\draw node[right, draw] {x}; & \node[right] {process}; \\
};
\end{tikzpicture}
\caption{Namespaces hierarchy of the sandbox server processes.}
\label{fig:server_namespaces}
\end{figure}
\section{Inter-process communication}
The client sends requests to the supervisor process via a UNIX domain socket. The results are sent back via a pipe attached to the request. The pipe is attached to the request as a file descriptor using an \texttt{SCM\_RIGHTS} control message~\cite{man_unix}.
The supervisor, the PID~1, and the tracee processes communicate via a shared anonymous memory page. Figure~\ref{fig:server_handles_request_before_execveat} illustrates this communication. Such communication requires no syscalls and is fast and reliable. The page is automatically unmapped upon the \texttt{execveat} syscall~\cite{man_execveat} in the tracee process, so it is protected from access by the tracee.
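
The following sketch shows the essence of attaching a file descriptor, such as the response pipe, to a message sent over the UNIX socket with an \texttt{SCM\_RIGHTS} control message. The real request carries a serialized body as well; here only one placeholder byte is sent.
\begin{lstlisting}[language=C++]
// Sketch: send one file descriptor over a UNIX domain socket using SCM_RIGHTS.
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>

bool send_fd(int sock, int fd_to_send) {
    char payload = 0;                        // at least one byte of data must go along
    iovec iov{&payload, 1};

    alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))];
    std::memset(control, 0, sizeof(control));

    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;            // we are passing file descriptors
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1;      // one payload byte sent on success
}
\end{lstlisting}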
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, densely dotted] (errors) {anonymous file\\for fatal errors};
\node [draw, above left=1cm and 0.1cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (client) {client};
\node [draw, above right=1cm and 0.2cm of errors, minimum height=1.4832815729996514cm, minimum width=2.4cm] (supervisor) {supervisor\\process};
\node [draw, right=0cm and 0.7cm of supervisor, minimum height=1.4832815729996514cm, minimum width=2.4cm] (pid1) {PID~1\\process};
\node [draw, right=0cm and 0.7cm of pid1, minimum height=1.4832815729996514cm, minimum width=2.4cm] (tracee) {tracee\\process};
\node [draw, above=1cm of pid1, densely dotted] (shmem) {shared memory\\for result};
\draw[->] (client.-2) -- node[above] {UNIX socket \\ for requests} (supervisor.-178);
\draw[<-] (client.-10) -- node[below] {response pipe} (supervisor.-170);
\draw[<-] (client) |- node[below] {} (errors);
\draw[->] (supervisor) |- node[below] {\texttt{stderr}} (errors);
\draw[<-] (supervisor.north) |- (shmem);
\draw[<->] (shmem.south) -- (pid1.north);
\draw[<-] (shmem.east) -| (tracee.north);
\draw[<-, dashed] (supervisor) -- (pid1);
\draw[<-, dashed] (pid1) -- (tracee);
\begin{scope}[on background layer]
\node [draw, dash dot, fit=(supervisor) (pid1) (shmem) (tracee)] (server) {};
\node [above=0.1cm of server] {Sandbox server};
\end{scope}
% Legend
\matrix [draw, right] at (current bounding box.north west) {
\draw[<-] (0,0) -- (0.5,0); & \node[right] {information flow}; \\
\draw[<-, dashed] (0,0) -- (0.5,0); & \node[right] {parentship}; \\
\draw node[right, draw] {x}; & \node[right] {process}; \\
\draw node[right, draw, densely dotted] {y}; & \node[right] {resource}; \\
};
\end{tikzpicture}
\caption{Sandbox server handles a request, at the moment before executing the requested executable. The sandbox server will die on a fatal error, leaving the error message for the client in the anonymous file. The sandbox server consists of the supervisor process and its child --- the PID~1 process, which is spawned for each request --- and its child, the tracee process, which will execute the requested executable. The tracee process saves in the shared memory the time just before \texttt{execveat} and signals the PID~1 process. The PID~1 process reads the time saved in the shared memory and starts the real and CPU timers. After the tracee process dies, the PID~1 process writes the exit code and status of the tracee process and the time it died. Moreover, the shared memory is used to communicate fatal errors to the supervisor process.}
\label{fig:server_handles_request_before_execveat}
\end{figure}
\section{Capabilities}
The supervisor process drops all capabilities, sets securebits, and sets the \texttt{NO\_NEW\_PRIVS} flag. This ensures the minimal possible capabilities in its own and the ancestor user namespace and prevents gaining any new privileges.
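
As an illustration, the sketch below shows one way to perform such a lock-down with \texttt{prctl}: clearing the capability bounding set, locking the securebits, and setting the \texttt{NO\_NEW\_PRIVS} flag. It assumes the process still holds the capabilities required for these operations (e.g. inside its own user namespace), and the exact set of securebits used by the implementation may differ.
\begin{lstlisting}[language=C++]
// Sketch: clear the capability bounding set, lock securebits, set NO_NEW_PRIVS.
#include <sys/prctl.h>
#include <linux/capability.h>   // CAP_LAST_CAP
#include <linux/securebits.h>   // SECBIT_* constants

bool drop_privileges() {
    // Remove every capability known to the headers from the bounding set.
    for (int cap = 0; cap <= CAP_LAST_CAP; ++cap)
        if (prctl(PR_CAPBSET_DROP, cap, 0, 0, 0) != 0)
            return false;

    // Lock securebits so that privileges cannot be re-acquired via setuid/exec.
    unsigned long bits = SECBIT_NOROOT | SECBIT_NOROOT_LOCKED |
                         SECBIT_NO_SETUID_FIXUP | SECBIT_NO_SETUID_FIXUP_LOCKED |
                         SECBIT_KEEP_CAPS_LOCKED;
    if (prctl(PR_SET_SECUREBITS, bits, 0, 0, 0) != 0)
        return false;

    // Forbid gaining any new privileges in this process and its descendants.
    return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == 0;
}
\end{lstlisting}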
\section{Hardening}
The PID~1 process, after spawning the tracee, enters a new cgroup namespace to limit its view of other cgroups, i.e. if the tracee somehow takes control of the PID~1 process, it will not be able to raise its PID and memory limits. Moreover, a seccomp filter is installed to limit the allowed syscalls to only those needed for reaping the orphaned zombie processes, managing time limits, and exiting upon the tracee's death.
\section{Conclusion}
The client-server architecture allows time-performance optimizations. Furthermore, it allows more of the common work to be done once and simplifies the implementation. For instance, file descriptors do not leak to other processes because there are no threads that could fork a new process. Request handling requires the creation of two processes --- the PID~1 process and the tracee process that later executes the requested program. Resource limiting and accounting are mostly performed by cgroups. Isolation is achieved by deft usage of Linux namespaces.
\chapter{Implementation}\label{chapter:implementation}
The project is written in C++, as it is a low-overhead, low-level language, yet more convenient than C. Git is used as the version control system to track the incremental implementation. An invaluable tool during development was \texttt{strace}~\cite{strace}. It allowed easy inspection of system calls and their return values without any modification of the source code.
\section{Interface}
The client has to spawn the sandbox server --- the supervisor process. It then operates on a connection handle. Using the connection handle, the client can send requests to execute programs in the sandbox.
After sending a request, a request handle object is constructed. It can be used to obtain the result of the execution, cancel the execution, or kill the tracee processes. Canceling the execution is useful in case of errors in the client, where the result of the execution as well as the execution itself are no longer needed. Upon cancellation of the request, the tracee processes are immediately killed and the result of running the request is discarded. Canceling an already finished request discards the result. Canceling the request before the server started handling it causes it to be skipped. Killing an already finished request is a no-op. Killing the request before the server started handling it causes the tracee to be killed immediately after the \texttt{execveat}.
The client can then await the result of the request using the request handle. The result can either be a successful result or an error with a textual description. Such an error is not fatal to the supervisor process, i.e. new requests can still be sent. The successful result consists of the exit code and status, and runtime statistics:
\begin{itemize}
\item real time,
\item CPU user and CPU system time,
\item peak tracee cgroup memory usage.
\end{itemize}
Each request has a set of accompanying options:
\begin{itemize}
\item Optional \texttt{stdin}, \texttt{stdout}, and \texttt{stderr} file descriptors. If an optional descriptor is specified as empty, \texttt{/dev/null} is opened as that file descriptor.
\item Environment as an array view of string views.
\item Linux namespace configuration:
\begin{itemize}
\item user ID and group ID mapping,
\item mounts and new root mount,
\end{itemize}
\item Cgroup resource limits: process and thread limit, memory limit, CPU maximum bandwidth.
\item \texttt{prlimit} hard limits.
\item Real time limit.
\item CPU time limit.
\item Seccomp BPF filter as a file descriptor. Passing it as a file descriptor lowers the overhead of repeatedly using the same filter --- a common scenario in a judge system. Only the file descriptor needs to be sent with each request instead of the whole BPF filter content. This allows the filter to be compiled once and passed to multiple requests with minimal overhead. An alternative is to extend the API to save seccomp filters, but it was considered unnecessary given how small the overhead of passing a single file descriptor is. A sketch of preparing such a filter descriptor follows this list.
\end{itemize}
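
As an illustration of how a client might prepare such a filter descriptor, the sketch below uses libseccomp to build a filter, export the raw BPF program into a \texttt{memfd}, and reuse the resulting file descriptor for many requests. The use of libseccomp and \texttt{memfd\_create} is an assumption made for this example; how exactly the sandbox reads the filter from the descriptor is not shown here.
\begin{lstlisting}[language=C++]
// Sketch: build a seccomp filter with libseccomp, export the raw BPF program
// into a memfd and reuse the file descriptor for many requests (link: -lseccomp).
#include <seccomp.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cerrno>

int make_filter_fd() {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);   // allow by default
    if (!ctx)
        return -1;

    // Example policy: make socket(2) fail with EPERM instead of killing the tracee.
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(socket), 0);

    int fd = memfd_create("seccomp-bpf", MFD_CLOEXEC);
    if (fd >= 0 && seccomp_export_bpf(ctx, fd) == 0)
        lseek(fd, 0, SEEK_SET);   // rewind so the program can be read back
    seccomp_release(ctx);
    return fd;                    // attach this descriptor to sandboxing requests
}
\end{lstlisting}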
\section{Time limits}
The PID~1 process controls the time limits. The tracee process, just before \texttt{execveat}, saves the current real time from \texttt{CLOCK\_MONOTONIC\_RAW} and the CPU time from the \texttt{cpu.stat} tracee cgroup file to the shared memory (see Figure~\ref{fig:server_handles_request_before_execveat}). The problem with the \texttt{cpu.stat} file is that it is updated infrequently. For a young tracee process, this file often reports consumed CPU time equal to 0 microseconds instead of a few hundred microseconds. Fortunately, executing the \texttt{sched\_yield()} system call forces recalculation of the file and the values are no longer 0. This is why this syscall is required to be allowed in the seccomp BPF filter.
\subsection{Real time limit}
After saving the current real time, the tracee process signals the PID~1 process with \texttt{SIGUSR2}. The PID~1 process reads the saved real time and sets up a POSIX timer to expire at the saved time plus the real time limit. When the timer expires, \texttt{SIGUSR1} is sent by the kernel to the PID~1 process, which then terminates all tracee processes by writing 1 to the \texttt{cgroup.kill} file of the tracee cgroup.
\subsection{CPU time limit}
In case the tracee is not restricted to one process, the setup is analogous to the real time limit, except that there is no CPU timer for a cgroup of processes. Instead, we calculate the minimal period of time in which the CPU time limit could expire as $\frac{\text{remaining CPU time}}{\text{max parallelism}}$, where max parallelism equals $\min(\text{available threads}, \text{\texttt{process\_num\_limit}}, \text{\texttt{cpu\_max\_bandwidth} in threads})$. Upon timer expiration, the remaining CPU time is recalculated and the timer is rescheduled if the remaining CPU time is greater than 0. Timer expiration is signaled by the kernel with the \texttt{SIGXCPU} signal. To prevent polling, the timer expiration period is capped to a minimum value of 1 ms --- this gives at most 1000 checks per second.
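
A minimal sketch of this scheduling logic follows: it computes the re-check period from the remaining CPU time and the maximal parallelism, clamps it to at least 1 ms, and arms a POSIX timer that delivers \texttt{SIGXCPU}. Variable and function names are illustrative, not the names used in the implementation.
\begin{lstlisting}[language=C++]
// Sketch: arm a one-shot POSIX timer that fires SIGXCPU after the minimal
// period in which the tracee cgroup could exhaust its remaining CPU time.
#include <algorithm>
#include <signal.h>
#include <time.h>

timer_t cpu_check_timer;   // created once, see create_cpu_check_timer() below

void create_cpu_check_timer() {
    sigevent sev{};
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = SIGXCPU;             // handled by the PID 1 process
    timer_create(CLOCK_MONOTONIC, &sev, &cpu_check_timer);
}

void arm_cpu_limit_check(long remaining_cpu_ns, long max_parallelism) {
    // Minimal period in which the limit could be exceeded, capped at 1 ms
    // so that there are at most 1000 checks per second.
    long period_ns = std::max(remaining_cpu_ns / max_parallelism, 1'000'000L);

    itimerspec spec{};                     // it_interval stays zero: one-shot
    spec.it_value.tv_sec = period_ns / 1'000'000'000L;
    spec.it_value.tv_nsec = period_ns % 1'000'000'000L;
    timer_settime(cpu_check_timer, 0, &spec, nullptr);
}
\end{lstlisting}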
In case the tracee is restricted to one process, the setup is different. After saving the current CPU time, the tracee process signals the PID~1 process through a pipe. A signal cannot be used because the \texttt{timer\_create} syscall and the \texttt{clock\_getcpuclockid} library function are not marked async-signal-safe --- they are not specified to be safe to call inside a signal handler. An \texttt{eventfd} cannot be used either, because if the tracee dies before writing to the \texttt{eventfd}, the PID~1 process will wait indefinitely in the \texttt{read} syscall. With a pipe, the \texttt{read} syscall returns 0 when the other end becomes closed. With the limit of one process, we use the CPU timer of the tracee process and set up a timer to expire when the tracee exceeds the CPU time limit.
In both cases, when the CPU time limit is exceeded, the PID~1 process is signaled and it terminates all tracee processes by writing 1 to the \texttt{cgroup.kill} file of the tracee cgroup.
\section{Runtime statistics}
After the main tracee process (the first spawned process) exits, the PID~1 process saves the current real time and the exit status in the shared memory (unless the tracee set an error) and exits. The kernel kills all remaining tracee processes (because the PID namespace's init process died). After the PID~1 process exits, the supervisor process reads the shared memory (see Figure~\ref{fig:server_handles_request_before_execveat}). It checks whether the tracee or the PID~1 process reported an error. If there is one, it becomes the result of the request. If there is none, the supervisor process calculates:
\begin{itemize}
\item real time, using the formula $\text{time of tracee death} - \text{saved \texttt{execveat} real time}$,
\item CPU time, using the formula $\text{CPU time read from the \texttt{cpu.stat} file} - \text{saved \texttt{execveat} CPU time}$,
\item peak memory usage, by reading the \texttt{memory.peak} tracee cgroup file.
\end{itemize}
\section{Error handling}
Errors in the supervisor process are considered fatal and are reported by writing to \texttt{stderr}. After writing the errors, the supervisor process exits immediately. When the client tries to read the request result, the read will fail with \texttt{read} returning the unexpected value 0. The client then ensures the supervisor process is dead (in case the communication failed) and tries to read the error the supervisor wrote. If the client finds one, it throws an exception with this error; otherwise it throws an exception with the \texttt{read} error.
The PID~1 process and the tracee process write errors to the shared memory (see Figure~\ref{fig:server_handles_request_before_execveat}) and exit immediately. The supervisor process reports these errors as the request result.
\section{Request sending and receiving}
The request is sent via the UNIX domain socket (see Figure~\ref{fig:server_waits_for_request}). The request consists of a constant-length header with file descriptors and a variable-length body. The request header contains only the length of the request body. The request body contains all parameters of the request, serialized to a custom binary format.
\section{File descriptors}
The sandbox server closes all file descriptors except the UNIX socket fd and opens \texttt{/dev/null} as \texttt{stdin}, \texttt{stdout}, and \texttt{stderr}. This is a small optimization for the case when the request does not specify a standard file descriptor. For instance, if the request is to execute a program without \texttt{stdin}, the sandbox server has to set up the \texttt{stdin} of the tracee process to be \texttt{/dev/null} opened for reading. However, the tracee process inherits the file descriptors of the PID~1 process, which inherits the file descriptors of the supervisor process. The supervisor process has already opened \texttt{/dev/null} as the \texttt{stdin} file descriptor, therefore no action is needed in the PID~1 process or the tracee process for the \texttt{stdin} of the tracee process to be \texttt{/dev/null} opened for reading. The same principle applies to the \texttt{stdout} and \texttt{stderr} file descriptors.
All file descriptors are opened with the \texttt{O\_CLOEXEC} flag so that they will not leak to the executed process. A unit test checking that no file descriptor leaks to the sandboxed program is in the test suite.
The PID~1 process inherits all request standard file descriptors and passes them to the tracee process. It has to close them after spawning the tracee process. To see why, consider a pipe of which one end is passed as \texttt{stdin} to the sandboxed program. A pipe becomes broken when all file descriptors of one end become closed. If the PID~1 process did not close the standard file descriptors of the tracee, the pipe could not become broken until the PID~1 process died. This would change the semantics of the pipe when the program is run inside the sandbox, which is undesirable. Moreover, for hardening purposes, the PID~1 process closes all unnecessary file descriptors after spawning the tracee --- in case the tracee somehow takes control of the PID~1 process.
\section{Canceling the request}
The response to the request is passed via a pipe that is provided alongside the request. If the pipe becomes broken, i.e. the client closes the read end of the pipe, the request is considered cancelled: it is immediately discarded if it is currently being handled, and skipped otherwise.
\section{Killing the request}
An \texttt{eventfd} file descriptor is sent with the request. The supervisor process monitors this file descriptor, and if it becomes readable, i.e. the client writes a value to it, the tracee is killed immediately. To avoid false-positive errors (reporting that the tracee process was killed unexpectedly), if the kill arrives before the \texttt{execveat} syscall that executes the requested program, killing the tracee is delayed until that \texttt{execveat} call.
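A C sketch of how the supervisor could watch both the cancellation pipe (previous section) and the kill \texttt{eventfd}; the descriptor names and the return convention are illustrative:
\begin{verbatim}
#include <poll.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the request was cancelled, 2 if a kill was requested,
 * 0 if neither has happened yet. */
static int check_cancel_or_kill(int response_pipe_w, int kill_eventfd,
                                pid_t tracee_pid) {
    struct pollfd pfds[2] = {
        { .fd = response_pipe_w, .events = 0 },   /* POLLERR reported anyway */
        { .fd = kill_eventfd,    .events = POLLIN },
    };
    if (poll(pfds, 2, 0) <= 0)
        return 0;
    if (pfds[0].revents & POLLERR)       /* read end closed => cancelled */
        return 1;
    if (pfds[1].revents & POLLIN) {      /* client wrote to the eventfd */
        uint64_t v;
        (void)read(kill_eventfd, &v, sizeof(v));  /* consume the counter */
        kill(tracee_pid, SIGKILL);
        return 2;
    }
    return 0;
}
\end{verbatim}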
\section{Sandbox server upon client death}
The supervisor process monitors the UNIX socket through which the requests flow in. If the other end becomes closed for both reading and writing, the supervisor recognises this as the death of the client process and dies immediately. The PID~1 process is configured to die upon the supervisor process's death, and the kernel kills all tracee processes when the PID~1 process dies. Therefore, all server processes die.
\section{PID~1 process upon supervisor death}
The PID~1 process configures the kernel to kill it upon the supervisor process's death. This is done using \texttt{prctl}'s option \texttt{PR\_SET\_PDEATHSIG}. However, if the supervisor dies before the kernel configures the PID~1 process to die, the PID~1 process will stay alive and waste resources. To solve this, one could check whether \texttt{getppid()} returns the expected PID of the supervisor process. However, this does not work, since the PID~1 process is in a new PID namespace, and \texttt{getppid()} will always return 0. A reliable solution is to pass a pidfd file descriptor of the supervisor process and detect the supervisor's death by checking whether the pidfd became readable. This way, upon the supervisor process's death, the PID~1 process is either killed by the kernel or it detects the death of the supervisor process and kills itself.
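A C sketch of this combination, as seen from the PID~1 process; \texttt{supervisor\_pidfd} is assumed to have been opened by the supervisor (e.g. with \texttt{pidfd\_open}) before forking and inherited by the PID~1 process:
\begin{verbatim}
#include <poll.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Run early in the PID 1 process. */
static void die_with_supervisor(int supervisor_pidfd) {
    /* Ask the kernel to kill us when the supervisor dies... */
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0)
        _exit(EXIT_FAILURE);
    /* ...and cover the race where it died before the prctl() above:
     * a pidfd becomes readable once the referenced process terminates. */
    struct pollfd pfd = { .fd = supervisor_pidfd, .events = POLLIN };
    if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
        _exit(EXIT_FAILURE);
}
\end{verbatim}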
\section{Signals}
Signals are another way processes can communicate with each other. They have their own nuances and have to be isolated as well.
\subsection{Tracee signals}
The tracee can signal only the visible processes, and those are limited by the PID namespace. However, it can also send signals to its process group, which can span multiple PID namespaces. Figure~\ref{fig:pgid_and_pid_namespace} illustrates this situation. Therefore it is necessary to set a new process group for the tracee processes. Furthermore, as a hardening measure, a new process group is set for the PID~1 process as well, in case the tracee takes control of it.
However, it is better to set a new session id using the \texttt{setsid} syscall, rather than only a new process group id, to avoid vulnerabilities connected to the controlling terminal~\cite{bubblewrap_cve}.
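In code this amounts to a single call, executed in the tracee (and, as hardening, in the PID~1 process) before \texttt{execveat}; the sketch below is illustrative:
\begin{verbatim}
#include <sys/types.h>
#include <unistd.h>

/* A new session gives a new process group and detaches the process from
 * any controlling terminal. */
static void detach_from_session(void) {
    if (setsid() == (pid_t)-1)
        _exit(1);   /* the real code reports the error via the shared memory */
}
\end{verbatim}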
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw] (a) {process A};
\node [draw, right=0.7cm of a] (b) {process B};
\node [draw, below=0.5cm of b] (c) {process C};
\begin{scope}[on background layer]
\draw[dash dot, rounded corners, blue!70, fill=blue!10]
($(b.north west)+(-0.3,0.3)$) --
($(b.north east)+(0.3,0.3)$) -- node[above, xshift=14mm, pos=0.6, blue] {PID namespace}
($(c.south east)+(0.3,-0.3)$) --
($(c.south west)+(-0.3,-0.3)$) --
cycle;
\end{scope}
\begin{scope}[on background layer]
\draw[dashed, rounded corners]
($(b.north east)+(0.15,0.15)$) --
($(b.south east)+(0.15,-0.15)$) -- node[above, yshift=-5mm, xshift=-16mm, pos=0.6] {process group}
($(a.south west)+(-0.15,-0.15)$) --
($(a.north west)+(-0.15,0.15)$) --
cycle;
\end{scope}
\end{tikzpicture}
\caption{Process group can span multiple PID namespaces.}
\label{fig:pgid_and_pid_namespace}
\end{figure}
\subsection{\texttt{SIGPIPE} in the supervisor process}
Sending a response to the client may generate a \texttt{SIGPIPE} signal in the supervisor process if the client cancels the request at approximately the same moment. This signal has to be ignored in the supervisor process. However, this cannot be done with \texttt{SIG\_IGN}, because that disposition is not reset upon the \texttt{execveat} system call and would have to be reset manually. Instead, an empty signal handler is installed for \texttt{SIGPIPE}, so that the kernel automatically resets the disposition upon \texttt{execveat} in the tracee process.
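A C sketch of installing such a no-op handler:
\begin{verbatim}
#include <signal.h>
#include <string.h>

static void sigpipe_noop(int sig) { (void)sig; }

/* Installing a real handler (instead of SIG_IGN) means the kernel resets
 * the disposition to the default automatically on execveat(). */
static int install_sigpipe_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigpipe_noop;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}
\end{verbatim}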
\subsection{Undefined Behavior Sanitizer}
The code of the sandbox may be compiled with the Undefined Behavior Sanitizer (UBSan) enabled. UBSan installs signal handlers for the \texttt{SIGBUS}, \texttt{SIGFPE} and \texttt{SIGSEGV} signals. This is problematic, because the tracee could then send these signals to the PID~1 process: for the init process of a PID namespace, the kernel only allows sending signals for which the init process (here the PID~1 process) has installed signal handlers~\cite{man_pid_namespaces}. Therefore, if UBSan is used, the PID~1 process resets the dispositions of these signals to prevent the tracee from sending them to the PID~1 process.
\section{Running as superuser}
The sandbox is not safe to run as the superuser; if this is needed, one has to switch to an unprivileged user first. This is because many global system resources remain available even after dropping the capabilities, e.g. raising privileges still works. The check is done in the supervisor process at startup. To make it user namespace-proof, it checks whether \texttt{/dev/null} is the real null device and whether the effective user id of the process equals the owner of the \texttt{/dev/null} file.
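A C sketch of such a check; the exact logic in the sandbox may differ, but 1:3 are the standard device numbers of \texttt{/dev/null}:
\begin{verbatim}
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/* True if we appear to run as the superuser: /dev/null is the genuine null
 * device (character device 1:3) and our effective UID matches its owner. */
static bool looks_like_superuser(void) {
    struct stat st;
    if (stat("/dev/null", &st) != 0)
        return true;   /* be conservative on error and refuse to run */
    bool genuine_null = S_ISCHR(st.st_mode) &&
                        major(st.st_rdev) == 1 && minor(st.st_rdev) == 3;
    return genuine_null && geteuid() == st.st_uid;
}
\end{verbatim}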
\section{Performance optimizations}
Everything that can be done up front is done in the supervisor process at startup, before handling requests, e.g. creating cgroups, entering the network namespace, and opening \texttt{/dev/null} as the standard file descriptors. Sharing this work between requests keeps the per-request overhead minimal, i.e. it increases throughput (handled requests per second). Some of these optimizations are described in this section.
\subsection{Seccomp filter of the PID~1 process}
The seccomp filter of the PID~1 process is created and compiled in the supervisor process. Therefore it is done once instead of for every request.
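As an illustration of the idea (not the sandbox's actual code), a sketch assuming libseccomp: the supervisor compiles the filter to raw BPF once and stores it in a memfd, and each PID~1 process merely loads the precompiled program. The single rule shown is purely illustrative and error handling is omitted.
\begin{verbatim}
#define _GNU_SOURCE
#include <errno.h>
#include <seccomp.h>            /* libseccomp */
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Supervisor, once at startup: build the PID 1 filter and export it as
 * raw BPF into a memfd, so it is not recompiled for every request. */
static int compile_pid1_filter(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ptrace), 0);
    int bpf_fd = memfd_create("pid1-seccomp-bpf", 0);
    seccomp_export_bpf(ctx, bpf_fd);    /* ready-to-load BPF program */
    seccomp_release(ctx);
    return bpf_fd;
}

/* PID 1, per request: install the precompiled program read from the memfd. */
static int install_precompiled_filter(struct sock_filter *insns,
                                      unsigned short n) {
    struct sock_fprog prog = { .len = n, .filter = insns };
    /* Required unless the caller has CAP_SYS_ADMIN in its user namespace. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}
\end{verbatim}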
\subsection{Seccomp filter as file descriptor}
The seccomp filter in a request is sent as a file descriptor. This avoids unnecessary copies of the seccomp filter contents in case the filter is large. Moreover, the gain is more evident if the same filter is used for subsequent requests.
\subsection{Unsharing network, ipc, uts and time namespace}
Unsharing of the network, ipc, uts and time namespaces is done in the supervisor process, only once, at startup. This avoids doing it for every request in the PID~1 process and has a non-negligible impact on the performance (see Chapter~\ref{chapter:performance}).
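A minimal C sketch of this startup step, assuming a supervisor that already holds the required privileges in its user namespace:
\begin{verbatim}
#define _GNU_SOURCE
#include <sched.h>

/* Done once at supervisor startup. The new time namespace (Linux >= 5.6)
 * applies to children created afterwards, i.e. to every PID 1 and tracee. */
static int unshare_request_independent_namespaces(void) {
    return unshare(CLONE_NEWNET | CLONE_NEWIPC | CLONE_NEWUTS | CLONE_NEWTIME);
}
\end{verbatim}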
\section{Integration with Online Judge Platform}
To integrate the new sandbox with the Online Judge Platform, a suite for each supported language was needed. The suite sandboxes the compiler if the language is compiled, and sandboxes the runtime of the tested program. The following suites were implemented:
\begin{itemize}
\item C, C++, Pascal and Rust --- fully compiled languages; the suite has to sandbox both the compilation process and the resulting executable.
\item Python, Bash --- fully interpreted languages; the suite does not have a compilation stage, but it requires sandboxing the interpreter when it runs the solution.
\end{itemize}
The Bash language is used only for testing due to its short start-up time.
Each of the compilation and run stages requires creating a root file system and a seccomp BPF filter. The root file system has to include the following bind mounts (needed by dynamically linked executables):
\begin{itemize}
\item \texttt{/lib}
\item \texttt{/lib64}
\item \texttt{/usr/lib}
\item \texttt{/usr/lib64}
\end{itemize}
Additionally, the C and C++ compilers require \texttt{/usr/bin} and \texttt{/usr/include}.
The Pascal compiler requires \texttt{/usr/bin} and \texttt{/tmp}.
The Rust compiler requires \texttt{/usr/bin}, \texttt{/tmp}, and on Debian also \texttt{/proc}.
Bash and Python require no additional bind mounts.
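A C sketch of creating one such bind mount inside the new root; the read-only remount is an assumption about the desired setup, not necessarily what the sandbox does:
\begin{verbatim}
#include <sys/mount.h>

/* Bind-mount src at dst inside the new root. MS_RDONLY has no effect at
 * bind time, so a remount is needed to actually make the mount read-only.
 * Flags locked on the source mount (e.g. nosuid in a user namespace) may
 * additionally need to be preserved on the remount. */
static int bind_mount_ro(const char *src, const char *dst) {
    if (mount(src, dst, NULL, MS_BIND | MS_REC, NULL) != 0)
        return -1;
    return mount(NULL, dst, NULL, MS_REMOUNT | MS_BIND | MS_RDONLY, NULL);
}
/* e.g. bind_mount_ro("/usr/lib", "<new_root>/usr/lib");  (illustrative) */
\end{verbatim}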
\subsection{Interactive problems}
Interactive problems require the tested program to communicate with the checker program, i.e. the standard input of the tested program is connected to the standard output of the checker program, and the standard output of the tested program is connected to the standard input of the checker program. Figure~\ref{fig:intaractive_problem_communication_schema} illustrates this configuration. To accomplish this, we need two pipes, one for each communication channel. However, the judge needs to know which process died first to provide a reasonable verdict for the tested program on the test.
To see why, let's consider two examples. In the first one, the checker decides early that the tested program answered incorrectly and terminates with the message ``Wrong answer''. Then the pipe closes, and the tested program may get terminated by \texttt{SIGPIPE} for trying to write to the closed pipe. If this happens, the tested program's abnormal death is caused by the checker exiting early, so ``Wrong answer'' is the expected verdict. In the second example, an incorrect tested program terminates early and abnormally. In this case, the pipe closes after the tested program's death, and the checker sees the output of the tested program as incomplete and decides ``Wrong answer''. However, here the expected verdict would be ``Runtime error'', because it is the tested program's abnormal death that caused the checker to decide ``Wrong answer'', so the abnormal death takes precedence. If we do not know who died first in such cases, we cannot reliably deduce the primary cause and therefore cannot decide what is more important: the tested program's abnormal death or the checker's verdict.
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\coordinate (center) at (0,0);
\node [draw, left=0.5cm of center] (tested_program) {tested program};
\node [draw, right=0.5cm of center] (checker) {checker};
\draw[->] (checker.north) to [out=90, in=90] node[pos=0.05, above right] {\texttt{stdout}} node[pos=0.95, above left] {\texttt{stdin}} (tested_program.north);
\draw[->] (tested_program.south) to [out=-90, in=-90] node[pos=0.05, below left] {\texttt{stdout}} node[pos=0.95, below right] {\texttt{stdin}} (checker.south);
\end{tikzpicture}
\caption{Schema of the communication between the tested program and the checker in the interactive problem.}
\label{fig:intaractive_problem_communication_schema}
\end{figure}
To solve this, one could monitor the ends of the two pipes and see which end closes first, e.g. with the \texttt{poll} syscall. However, this is prone to a race condition, because the monitoring process may be scheduled only after both the checker and the tested program have exited and then see them as if they died at the same moment. For example, the tested program dies abnormally, then the checker decides ``Wrong answer'' and exits, and only then does \texttt{poll} return, reporting all pipe ends as closed without the information which closed first. To avoid this race condition and decide reliably, four pipes are needed, together with a process that glues both pairs of pipes together and detects which end is closed first. Figure~\ref{fig:interactive_problem_real_communication} illustrates this configuration. As long as the third process holds open the four inner ends of the pipe pairs, the tested program and the checker do not see their \texttt{stdin} and \texttt{stdout} as broken, so neither can react (by terminating, normally or not) to the other's death before the third process has observed it. This way we can reliably detect who dies first and give a correct verdict in the scenarios where one's death causes the other's death. To efficiently pass messages in the third process, the \texttt{splice} syscall is used.
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, minimum height=4cm] (communicator) {communication\\process};
\node [draw, left=0.5cm of communicator] (tested_program) {tested program};
\node [draw, right=0.5cm of communicator] (checker) {checker};
\node [draw, dotted, above=1cm of tested_program] (pipe_up_left) {pipe};
\node [draw, dotted, below=1cm of tested_program] (pipe_down_left) {pipe};
\node [draw, dotted, above=1cm of checker] (pipe_up_right) {pipe};
\node [draw, dotted, below=1cm of checker] (pipe_down_right) {pipe};
\node [draw, circle, fill=black, inner sep=0, minimum size=1mm] (up_dot) at ($(pipe_up_left)!.5!(pipe_up_right)$) {};
\node [draw, circle, fill=black, inner sep=0, minimum size=1mm] (down_dot) at ($(pipe_down_left)!.5!(pipe_down_right)$) {};
\draw[->] (tested_program) -- node[left] {\texttt{stdout}} (pipe_down_left);
\draw[->] (pipe_down_left) -- (down_dot);
\draw[->] (down_dot) -- (pipe_down_right);
\draw[->] (pipe_down_right) -- node[right] {\texttt{stdin}} (checker);
\draw[->] (checker) -- node[right] {\texttt{stdout}} (pipe_up_right);
\draw[->] (pipe_up_right) -- (up_dot);
\draw[->] (up_dot) -- (pipe_up_left);
\draw[->] (pipe_up_left) -- node[left] {\texttt{stdin}} (tested_program);
\end{tikzpicture}
\caption{This is how the communication between the tested program and the checker is implemented. Two pairs of pipes are used, which allows detecting which process dies first --- the tested program or the checker. Because the communication process does not close its ends of the pipes first, it can detect which process died first without causing the other to die because of a broken pipe. To efficiently pass messages in the communication process, the \texttt{splice} syscall is used.}
\label{fig:interactive_problem_real_communication}
\end{figure}
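The following C sketch shows how the communication process from Figure~\ref{fig:interactive_problem_real_communication} could be implemented; the descriptor names, the enum and the simplifications (no draining of buffered data, no error handling) are illustrative, not the sandbox's actual code:
\begin{verbatim}
#define _GNU_SOURCE            /* for splice() */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

enum first_dead { NOBODY = 0, TESTED_PROGRAM, CHECKER };

/* Forward at most one chunk from one pipe to another without copying the
 * data through user space. */
static void relay(int from_fd, int to_fd) {
    splice(from_fd, NULL, to_fd, NULL, 1 << 16, SPLICE_F_NONBLOCK);
}

/* prog_out_r / prog_in_w are the inner ends of the tested program's pipes,
 * chk_out_r / chk_in_w of the checker's. Returns who terminated first. */
static enum first_dead run_relay(int prog_out_r, int prog_in_w,
                                 int chk_out_r, int chk_in_w) {
    enum first_dead first = NOBODY;
    struct pollfd pfds[2] = {
        { .fd = prog_out_r, .events = POLLIN },  /* tested program -> checker */
        { .fd = chk_out_r,  .events = POLLIN },  /* checker -> tested program */
    };
    while (first == NOBODY) {
        if (poll(pfds, 2, -1) < 0)
            break;
        if (pfds[0].revents & POLLIN) relay(prog_out_r, chk_in_w);
        if (pfds[1].revents & POLLIN) relay(chk_out_r, prog_in_w);
        if (pfds[0].revents & POLLHUP)      first = TESTED_PROGRAM;
        else if (pfds[1].revents & POLLHUP) first = CHECKER;
    }
    /* Only now release the inner ends; the surviving process may then see
     * EOF or a broken pipe, but the order of deaths is already recorded. */
    close(prog_out_r); close(prog_in_w); close(chk_out_r); close(chk_in_w);
    return first;
}
\end{verbatim}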
\subsection{Non-interactive problems}
In non-interactive problems, the standard input has read-once semantics and the standard output has write-once semantics. To achieve this without disallowing \texttt{dup}'ing, \texttt{close}'ing, \texttt{mmap}'ping, \texttt{pread}'ing etc. of the standard input and output file descriptors, we use pipes, which are inherently read-once and write-once. The input file is piped to the \texttt{stdin} of the tested program, and the tested program's output is piped to the output file. Figure~\ref{fig:noninteractive_problem_communication} illustrates this configuration. To efficiently pass data between a pipe and a file, the \texttt{splice} syscall is used.
\begin{figure}[h]
\tikzset{>=latex} % set latex arrow tip
\centering
\begin{tikzpicture}[align=center]
\node [draw, minimum height=4cm] (communicator) {communication\\process};
\node [draw, right=0.5cm of communicator] (tested_program) {tested program};
\node [left=0.5cm of communicator] (left_node) {\phantom{tested program}};
\node [draw, dotted, above=1cm of left_node] (input_file) {input file};
\node [draw, dotted, below=1cm of left_node] (output_file) {output file};
\node [draw, dotted, above=1cm of tested_program] (pipe_up) {pipe};
\node [draw, dotted, below=1cm of tested_program] (pipe_down) {pipe};
\node [draw, circle, fill=black, inner sep=0, minimum size=1mm] (up_dot) at ($(input_file)!.5!(pipe_up)$) {};
\node [draw, circle, fill=black, inner sep=0, minimum size=1mm] (down_dot) at ($(output_file)!.5!(pipe_down)$) {};
\draw[->] (input_file) -- (up_dot);
\draw[->] (up_dot) -- (pipe_up);
\draw[->] (pipe_up) -- node[right] {\texttt{stdin}} (tested_program);
\draw[->] (tested_program) -- node[right] {\texttt{stdout}} (pipe_down);
\draw[->] (pipe_down) -- (down_dot);
\draw[->] (down_dot) -- (output_file);
\end{tikzpicture}
\caption{To provide the read-once semantics of the standard input and the write-once semantics of the standard output, pipes are used. The communication process passes the contents of the input file to the pipe connected to the tested program's \texttt{stdin}. The tested program's \texttt{stdout} is piped to the output file. To efficiently pass data between files and pipes in the communication process, the \texttt{splice} syscall is used.}
\label{fig:noninteractive_problem_communication}
\end{figure}
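A C sketch of one direction of this scheme, pumping the input file into the pipe behind the tested program's \texttt{stdin}, under the same caveats as before (illustrative names, no error handling):
\begin{verbatim}
#define _GNU_SOURCE            /* for splice() */
#include <fcntl.h>
#include <unistd.h>

/* Copy the whole input file into the pipe feeding the tested program's
 * stdin; closing the write end afterwards yields the read-once semantics. */
static void pump_file_to_pipe(int file_fd, int pipe_w) {
    for (;;) {
        ssize_t n = splice(file_fd, NULL, pipe_w, NULL, 1 << 16, 0);
        if (n <= 0)            /* 0 = end of file, -1 = error (e.g. EPIPE) */
            break;
    }
    close(pipe_w);
}
\end{verbatim}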
\subsection{Conclusion}
Apart from the above difficulties, the integration was rather easy. It required replacing uses of the old sandbox with the new sandbox's knobs. A lot of code was simplified along the way.
\section{Testing and validation}
To test and validate the new sandbox, a comprehensive set of unit tests was developed. The tests check that namespaces are used appropriately, the interface is implemented correctly, the limits are enforced, the runtime statistics are provided and correct, no file descriptor leaks to the tracee process, etc. Error reporting was also tested, e.g. that an unexpected supervisor death is reported and that the supervisor terminates as soon as the socket connection with the client becomes broken. These tests prevented regressions during the development and ultimately eased the development process.
\section{Challenges faced}
There were many challenges during the development of the sandbox. The first was to understand the semantics of the kernel's interfaces, i.e. namespaces, cgroups and capabilities, and the nuances in every one of them. One of the great achievements was to avoid \texttt{ptrace} and all its complexity while still controlling a group of processes. This reduced the overhead and simplified things a lot. It was possible thanks to Linux namespaces and cgroups.
Another challenge was to orchestrate everything to work together: resource limits, file descriptors, communication between processes, setting up namespaces and cgroups, dropping capabilities, etc. It all has to be done in the right order, and it was often not obvious how to do it correctly.
The hardest of all was debugging very obscure errors happening during the setup of the namespaces and cgroups. Configuration often failed with \texttt{EPERM} or \texttt{EINVAL}, and figuring out what was wrong from such vague information was difficult. For example, mounting the cgroup2 file system is not allowed without unsharing the cgroup namespace first. It also requires unsharing the mount namespace and the user namespace, but that is completely reasonable. Sections~\ref{execveat_returns_einval} and~\ref{execveat_returns_enoent} show two examples of such hard-to-debug errors.
\subsection{\texttt{execveat} returns \texttt{EINVAL}} \label{execveat_returns_einval}
Executing a dynamically linked executable may fail with the error \texttt{EINVAL}. The executable needs the dynamic linker, so one reason for the error is a missing dynamic linker in the new root file system; on x86\_64 it resides at \texttt{/lib64/ld-linux-x86-64.so.2}. Another reason may be a missing shared library, which usually resides in the directory \texttt{/usr/lib/}. It is important to bind mount all of these paths during the creation of the root file system for dynamically linked executables to work.
\subsection{\texttt{execveat} returns \texttt{ENOENT}} \label{execveat_returns_enoent}
Executing a file may fail with \texttt{ENOENT}, even though the file exists. This may happen when the file is a symbolic link whose destination does not exist, or when the symbolic link is chained (refers to another symbolic link) and one of the intermediate targets does not exist in the new root file system. One solution is to bind mount the executable without \texttt{AT\_SYMLINK\_NOFOLLOW}.
\section{Conclusion}
Despite the challenges and complexity, the sandbox was implemented and tested successfully. The usage of Linux namespaces provides isolation, while cgroups and prlimits limit the resources. The sandbox is versatile enough to be used both for running a tested program and for running the compiler. The goal of optimizing the implementation for short-running programs was achieved with several optimizations.
\chapter{Performance Evaluation}\label{chapter:performance}
All performance tests were made on a laptop with an Intel i5-8250U processor and 16 GB of RAM. The purpose of this chapter is to verify the efficiency of the sandbox in the context of the online judge platform and to briefly compare it with nsjail.
\section{In the context of the online judge}
For the testing we use tasks from the finals of the XXII Polish Olympiad in Informatics: Myjnie (myj), Tablice kierunkowe (tab), Modernizacja autostrady (mod), Wycieczki (wyc). The people who prepare the problem package for the Olympiad write various solution programs as part of creating the problem package. The solution programs are used to verify that the test cases can differentiate between different user solutions, e.g. those running in $O(n)$ and $O(n^2)$. The model solution program is the best solution program for the problem; in Polish it is called ``rozwiązanie wzorcowe''. The term ``solution'' can be misleading but is currently used in the English literature regarding Olympiads in Informatics~\cite{kolstad2009infrastructure, marevs2012new, merry2010performance, merry2009using}.
First, we compare the compilation times of all solution programs outside the sandbox, inside the sandbox, and inside the sandbox with seccomp BPF filters disabled. Then we compare the running times of the model solution program of each problem outside the sandbox and inside the sandbox.
For each solution program and configuration (outside the sandbox, inside the sandbox, inside the sandbox with seccomp BPF filters disabled) 10 runs were performed and both real time and CPU time were recorded. For each model solution program, test (from the respective problem package), and configuration (outside the sandbox, inside the sandbox) 10 runs were performed and both real time and CPU time were recorded.
The tables contain (for both real time and CPU time) the mean time over these 10 runs, i.e. $\mu = \frac{1}{N} \sum_{i=1}^N x_i$ (where $N = 10$ and $x_1, \ldots, x_N$ are the measured times), the standard deviation, i.e. $\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}$, and the standard error on the mean, i.e. $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$.
The plots illustrate the distribution of the measured times i.e. distribution of values $x_1, \ldots, x_{10}$ for each configuration. Elements of the plots are described in Figure~\ref{figure:plot_legend}. The x-axis is logarithmic to better illustrate both small values and large values and differences between similar values on a single plot. This is not ideal but helps considerably.
\subsection{Compilation}
Table~\ref{table:myj_compilation} contains the compilation times of the solution programs of the problem Myjnie. Figure~\ref{figure:myj_compilation_real_time} presents the real times and Figure~\ref{figure:myj_compilation_cpu_time} the CPU times of the compilations. Due to the small differences in runtime, only some compilations show a statistically significant difference: myj.cpp with a 6\% slowdown, myj2.cpp with 5\%, myjs5.cpp with 24\%, myjs7.cpp with 15\%, and myjs8.cpp with 19\%. In these cases, disabling the seccomp BPF filters eliminated the slowdown, except for myjs5.cpp, where it remains 15\%. For all other solution programs, the difference in the compilation time was statistically insignificant.
Table~\ref{table:tab_compilation} contains the compilation times of the solution programs of the problem Tablice kierunkowe. Figure~\ref{figure:tab_compilation_real_time} presents the real times and Figure~\ref{figure:tab_compilation_cpu_time} the CPU times of the compilations. Due to the high variance of the measurements compared to the small differences between them, only some compilations showed a statistically significant difference. Among these, the slowdown caused by the sandbox was highest for tabb3 with 15\% and tabb4.cpp with 16\%. Again, most of the slowdown was caused by the seccomp BPF filters.
Table~\ref{table:mod_compilation} contains the compilation times of the solution programs of the problem Modernizacja autostrady. Figure~\ref{figure:mod_compilation_real_time} presents the real times and Figure~\ref{figure:mod_compilation_cpu_time} the CPU times of the compilations. Here all compilation times show a statistically significant difference. The slowdown caused by the sandbox was highest for mod2.cpp with 13\% and mod5.cpp with 23\%. Again, most of the slowdown was caused by the seccomp BPF filters, except for mod5.cpp, where the slowdown without the filters is still 15\%.
Table~\ref{table:wyc_compilation} contains the compilation times of the solution programs of the problem Wycieczki. Figure~\ref{figure:wyc_compilation_real_time} presents the real times and Figure~\ref{figure:wyc_compilation_cpu_time} the CPU times of the compilations. Only some compilations showed a statistically significant difference: wycb1.cpp with an 8\% slowdown, wycb2.cpp with 8\%, wycb3.cpp with 8\%, and wycb5.cpp with 10\%. Disabling the seccomp BPF filters reduces the slowdown to an insignificant level.
From the measurements we see that most of the time the overhead comes solely from the seccomp BPF filters. Compilation in the sandbox can be up to 24\% slower, but most of the time it is no more than 10\% slower, and half of the time it is statistically no slower than compilation outside the sandbox.
Commands used in testing:
\begin{itemize}
\item C --- \texttt{/usr/bin/gcc -std=c11 -O2 -static -o exe source.c}. Compiler version: 13.2.1.
\item C++ --- \texttt{/usr/bin/g++ -std=c++17 -O2 -static -o exe source.cc}. Compiler version: 13.2.1.
\item Pascal --- \texttt{/usr/bin/fpc -O2 -XS -Xt -oexe source.pas}. Compiler version: 3.3.2.
\end{itemize}
\subsection{Model solutions' run times}
Table~\ref{table:myj_model_solution_runtimes} contains the run times of the model solution program of the problem Myjnie. Figure~\ref{figure:myj_model_solution_real_time} presents the real times and Figure~\ref{figure:myj_model_solution_cpu_time} the CPU times of the model solution program. Of all tests, only myj20d showed a statistically significant difference in runtime --- a 19\% slowdown. On the other tests, the variance of the measurements is too high relative to the differences to show anything.
Table~\ref{table:tab_model_solution_runtimes} contains the run times of the model solution program of the problem Tablice kierunkowe. Figure~\ref{figure:tab_model_solution_real_time} presents the real times and Figure~\ref{figure:tab_model_solution_cpu_time} the CPU times of the model solution program. Despite the high variance of the measurements, some tests showed a statistically significant difference, e.g. tab4d with a 44\% speed-up inside the sandbox, tab4e with 18\%, and tab4f with 45\%. For smaller tests the speed-up is more apparent, but it can be caused by the different methods of measuring the runtime: \texttt{perf stat} outside the sandbox vs. a real-time timer and the cgroup \texttt{cpu.stat} file inside it. Therefore the differences on smaller tests are not considered statistically significant. The next section compares run times of short-running programs.
Table~\ref{table:mod_model_solution_runtimes} contains the run times of the model solution program of the problem Modernizacja autostrady. Figure~\ref{figure:mod_model_solution_real_time} presents the real times and Figure~\ref{figure:mod_model_solution_cpu_time} the CPU times of the model solution program. Due to the high variance and small differences in the measurements, only the test mod7a shows a statistically significant difference, with an 8\% slowdown inside the sandbox.
Table~\ref{table:wyc_model_solution_runtimes} contains the run times of the model solution program of the problem Wycieczki. Figure~\ref{figure:wyc_model_solution_real_time} presents the real times and Figure~\ref{figure:wyc_model_solution_cpu_time} the CPU times of the model solution program. Here we have a similar situation to the model solution of the problem Tablice kierunkowe: the high variance and small differences in the measurements render most of the tests statistically indistinguishable. Some tests show a speed-up in the sandbox, e.g. wyc6d with 2\% and wyc6e with 4\%. For smaller tests the speed-up is more apparent, but it can be caused by the different method of measuring the runtime.
From the measurements we see that on large tests (those with a high runtime) the slowdown in the sandbox can be as high as 19\%. However, in the vast majority of tests there was no significant difference in runtime between running inside and outside the sandbox. This can be attributed to the low number of system calls performed by the solutions, which mainly compute in memory instead of performing I/O. Some medium tests showed a speed-up of up to 44\% when run inside the sandbox. This is unexpected; it may be caused by the simpler file system hierarchy inside the sandbox, but it requires further investigation. Differences on small tests were ignored due to the different methods of measuring the runtime: \texttt{perf stat} outside the sandbox vs. a real-time timer and the cgroup \texttt{cpu.stat} file inside it.
\section{Short-running programs and comparison with nsjail}
An example of a short-running program is \texttt{/bin/true} --- we will use it in the benchmark. The metric we will use is the round-trip time of the request to sandbox a program. Table~\ref{table:bin_true_times} contains the measurements of the time to handle a request to run the \texttt{/bin/true} program without the sandbox, with the sandbox, and using nsjail. From the data we see that nsjail is more than 4 times slower than our sandbox, even though it does not spawn a separate PID~1 process and does not provide runtime statistics. Our sandbox, while 2.39 times slower than running without a sandbox, still allows for hundreds of requests per second --- and that was the goal of the thesis.
\begin{small}
\begin{longtable}{|l|r|r|r|r|}
\hline
\makecell{Sandbox} & \makecell{Mean time} & \makecell{Std. dev.} & \makecell{Std. err.\\on the mean} & \makecell{Slowdown} \\
\hline
no sandbox & 0.893ms & 0.409ms (45.80\%) & 0.013ms (1.45\%) & 1x \\
sandbox & 2.348ms & 0.768ms (32.71\%) & 0.024ms (1.03\%) & 2.39x \\
nsjail & 10.393ms & 1.327ms (12.77\%) & 0.042ms (0.40\%) & 10.57x \\
\hline
\multicolumn{1}{c}{}\\ % adds vertical space between longtable and caption
\caption{Statistics for each row were collected from 1000 runs. Each row contains the real time it took to handle a request to sandbox the \texttt{/bin/true} program. While the slowdown of the sandbox is large (more than twofold), it still allows for hundreds of runs per second, which was the goal of this thesis; nsjail is more than 4 times slower than our sandbox.}
\label{table:bin_true_times}
\end{longtable}
\end{small}
nsjail does not provide any means to handle more than one program in one execution: one execution of the nsjail program equals one execution of the sandboxed program. Thus nsjail cannot share resources, e.g. namespaces, between runs and is therefore expected to be slower than our solution. Our solution executes the server program once, and the server can handle many requests for secure execution of a program, so it can and does share the resources. This gives our solution a lower overhead per execution of the sandboxed program.
As a result, the request handling time of our sandbox is drastically lower than that of nsjail, as presented in Table~\ref{table:bin_true_times}: nsjail can handle around 4.4 times fewer requests than our solution in a given time. Moreover, our sandbox allows easy collection of runtime statistics, which nsjail is incapable of. Command used for benchmarking nsjail: \texttt{/usr/bin/nsjail -q -Mo -{}-chroot / -{}-use\_cgroupv2 -{}- /bin/true}.
\section{Impact of some optimizations}
From Table~\ref{table:optimization_impact} it is clear that unsharing the namespaces once, instead of for every request, has a positive impact on the performance. The most significant is the network namespace: unsharing it for every request resulted in a 26\% performance degradation.
\begin{small}
\begin{longtable}{|l|r|r|r|r|}
\hline
\multicolumn{1}{|c|}{Benchmark} & \makecell{Mean\\request time} & Std. dev. & Std. err. on the mean & \multicolumn{1}{c|}{Slowdown} \\
\hline
Baseline & 2.348ms & 0.768ms (32.71\%) & 0.024ms (1.03\%) & 0.00\% \\
\hline
\makecell{New network namespace \\ for each request} & 2.970ms & 0.856ms (28.83\%) & 0.027ms (0.91\%) & 26.49\% \\
\hline
\makecell{New IPC namespace \\ for each request} & 2.522ms & 0.782ms (31.02\%) & 0.025ms (0.98\%) & 7.41\% \\
\hline
\makecell{New UTS namespace \\ for each request} & 2.478ms & 0.771ms (31.14\%) & 0.024ms (0.98\%) & 5.54\% \\
\hline
\multicolumn{1}{c}{}\\ % adds vertical space between longtable and caption
\caption{Statistics for each row were collected from 1000 runs. Each row contains the real time it took to handle a request to sandbox the \texttt{/bin/true} program.}
\label{table:optimization_impact}
\end{longtable}
\end{small}
\iffalse
\section{1'000'000 syscall benchmark}
A test program that performs an \texttt{lseek} syscall 1\,000\,000 times was used to simulate a program that performs a very large number of syscalls. Table~\ref{table:million_lseek} lists the time measurements. From the data, we see that running an executable inside the sandbox has a negligible impact on its runtime. Adding a seccomp BPF filter causes a slowdown below 4\% for this synthetic, syscall-intensive program. However, this benchmark is artificial: a typical tested program normally performs at most tens of thousands of syscalls, and the seccomp overhead affects only the syscalls, so the real overhead will be lower.
\begin{table}[h]
\centering
\begin{tabular}{|l|c|c|c|r|}