How BeanHub works, part 1: containing the danger of processing Beancount data with a sandbox
It has been more than two years since we launched BeanHub. Recently, we have been tirelessly releasing new features. Some of you may ask:
What were you busy with at the very beginning? Why wait until now to start adding new features?
Well, we spent most of our time at the very beginning building the infrastructure to move faster later. We have adopted and developed many interesting technologies in-house. Sandbox is one of the technologies we explored and adopted.
Isn’t BeanHub just a simple web app for Beancount accounting books? What’s so dangerous about it that you need sandboxing?
Actually, yes, there is some potential risk in processing Beancount files. Most Beancount accounting books are simple text files that look harmless, like this:
2024-04-21 open Assets:Cash
2024-04-21 open Expenses:Food

2024-04-22 * "Dinner"
  Assets:Cash    -20.00 USD
  Expenses:Food   20.00 USD
But when plugins are used, things get tricky. With a Beancount file plus a Python script in the same folder like this:
main.bean:

option "insert_pythonpath" "true"
plugin "my_plugins"

2024-04-21 open Assets:Cash
2024-04-21 open Expenses:Food

2024-04-22 * "Dinner"
  Assets:Cash    -20.00 USD
  Expenses:Food   20.00 USD
my_plugins.py:

__plugins__ = ["evil"]


def evil(entries, options):
    print('!!ALL YOUR ACCOUNTING BOOKS ARE BELONG TO US!!')
    return entries, []
Then you run a seemingly harmless Beancount command, such as printing the balance report:
bean-report main.bean bal
And boom!
!!ALL YOUR ACCOUNTING BOOKS ARE BELONG TO US!!
Assets:Cash -20.00 USD
Equity
Expenses:Food 20.00 USD
Income
Liabilities
The attacker just ran their script on your computer. So keep in mind: next time, instead of only being cautious about random people emailing you sexy photos to lure you into opening them, you also need to be careful about random people sending you their sexy Beancount accounting books and asking you to review them.
As you can see, a user can upload a Beancount file with a plugin pointing to a local Python file and then run anything in that plugin file. Moreover, because BeanHub is based on Git, there are many Git operations involved. If a zero-day exploit were found in Git, a user could take advantage of that as well.
Indeed, there is a real security risk here. While we made BeanHub easy for our users to use, the technology behind it is not that simple, and we put quite some effort into making it as secure as possible. Sandboxing is just one of the technologies we adopted. Today, we would like to share how we use a sandbox to make it safe to process user-uploaded Beancount files.
Container or sandbox?
What is a sandbox? A sandbox is an isolated computing environment designed to run untrusted software or process untrusted data. For BeanHub, we are building our sandbox on top of open-source container technologies. Some may argue that a container is different from a sandbox, and that’s true: most popular container tools do not focus solely on security. However, because there is a huge overlap in the Linux kernel technologies used by containers and sandboxes, we can leverage the container tools and harden them to meet the security requirements of a sandbox without reinventing the wheel.
Namespaces, cgroups, and Seccomp are just a few of the underlying Linux kernel technologies shared between a sandbox and a container.
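To get a quick taste of one of these shared building blocks, you can use the unshare tool that ships with util-linux to drop a command into its own network namespace. This is only an illustration, not part of BeanHub’s stack:

unshare --user --map-root-user --net ip addr

Inside that namespace, only a loopback interface in the DOWN state is visible; that kind of isolation is exactly what container runtimes and sandboxes alike are built on.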
It would take a whole book just to discuss container technologies, and that’s not the purpose of this article. If you want to learn more about them, you can read Container Security: Fundamental Technology Concepts that Protect Containerized Applications by Liz Rice. Today, we will focus on how our sandbox works.
Running Podman in rootless mode
If you have ever run Docker on your Linux machine, you may know that Docker runs as a daemon under the root user, and your docker command-line tool acts only as a client that talks to that daemon process. With Docker’s client-server architecture, running a container requires talking to a daemon running as root. While it’s possible to tell the Docker daemon to run your container as a non-root user, if there are any security vulnerabilities in the protocol, or if you simply misconfigure it, the container might still end up running as root. The fact that Docker requires a privileged daemon to run containers makes it harder for us to secure.
We looked at container security vulnerabilities found in the past, and many of them require the container or its parent process to run as root to work. To avoid risks like these, we think it’s better to run the container in rootless mode without a daemon, both from the host’s perspective and inside the container. Yet another reason we need this is that we use Kubernetes, and all of BeanHub’s services also run inside containers. We don’t want BeanHub’s service containers running with root privileges, and we don’t want to expose the Docker socket to the service containers either.
With these ideas in mind, we looked around and found Podman. Podman is an open-source alternative to Docker built with a rootless and daemon-less approach to containers, developed and released by Red Hat. To learn more about Podman, you can watch the video Rootless Containers with Podman by Steven Ellis from Red Hat.
If you want to learn more details on running rootless containers inside a container, you can also read How to use Podman inside of a container by Dan Walsh from Red Hat.
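If you want to see the rootless model in action on your own machine, one quick check is to print the user namespace mapping Podman sets up for your user (the exact ID ranges depend on your /etc/subuid and /etc/subgid configuration):

podman unshare cat /proc/self/uid_map

UID 0 inside the namespace maps to your own unprivileged UID on the host, and the remaining container UIDs map to a subordinate range, so even “root” inside one of these containers holds no root privileges on the host.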
We built and open-sourced container-helpers
Running a Docker or Podman command with many arguments is tedious and error-prone. To make it easier to control the container environment and avoid mistakes, we built a simple library that lets us configure the container with simple Python data structures. Here’s an example of running a fairly complex container with Podman using our library:
import asyncio
import asyncio.subprocess

from containers import Container
from containers import ImageMount
from containers import BindMount
from containers import ContainersService

service = ContainersService()


async def run():
    container = Container(
        image="alpine",
        command=("/path/to/my/exe", "arg0", "arg1"),
        mounts=[
            ImageMount(
                target="/data",
                source="my-data",
                read_write=True,
            ),
            BindMount(
                target="/artifacts",
                source="/var/tmp/artifacts",
                readonly=False,
                relabel="private",
                bind_propagation="rslave",
            ),
        ],
    )
    async with service.run(container, stdout=asyncio.subprocess.PIPE) as proc:
        stdout = await proc.stdout.read()
        code = await proc.wait()
        if code != 0:
            raise RuntimeError("Failed")
        # ...


asyncio.run(run())
We realized others might also find this library handy, so we open-sourced it here as container-helpers under the MIT license.
Seccomp and strace
A container-based sandbox mostly relies on features provided by the Linux kernel to ensure that a particular process can only do certain things and lives in an isolated environment. For example, one can use network namespaces to create an isolated networking environment for the sandbox process. However, given that the process in the sandbox can still make system calls directly to the kernel, a bug in the Linux kernel could let an attacker use a zero-day exploit to escape the isolation or escalate privileges. For example, a security vulnerability found in the Linux kernel in the past allowed the waitid syscall to overwrite kernel memory.
Put simply, the more syscalls the process in the sandbox can make, the more likely an attacker is to find a bug or two that lets them escape the container boundary. To reduce the attack surface, we apply the most restrictive Seccomp profile we can to the container, allowing only the minimal set of syscalls required to run the service.
What’s Seccomp, you may ask? It’s a Linux kernel feature that allows you to write a set of rules to allow or block certain system calls. The rules are compiled into BPF (Berkeley Packet Filter) programs and attached to the process, so the kernel checks every system call the process makes against them. You can think of it as a firewall between userspace and the Linux kernel that filters system calls based on rules. By default, Docker and Podman ship with a default Seccomp profile, like this one from Docker. A Seccomp profile for a container usually comes as a JSON file that looks like this:
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "defaultErrnoRet": 1,
  "syscalls": [
    {
      "names": [
        "accept",
        "accept4",
        // Omit ...
        "write",
        "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    },
    // Omit ...
  ]
}
As you can see, it provides simple rules such as SCMP_ACT_ALLOW and SCMP_ACT_ERRNO, each applied to a list of syscalls. You can copy the default Seccomp profile, remove the unwanted syscalls, and keep only the absolutely necessary ones. Once your Seccomp profile is ready, you can pass it with the --security-opt seccomp=/path/to/seccomp/profile.json argument when running your container.
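For example, running a container with the custom profile could look like the command below; the image and command are only placeholders, not what BeanHub actually runs:

podman run --rm --security-opt seccomp=/path/to/seccomp/profile.json alpine /path/to/my/exe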
To learn which syscalls are required to run your program, you can use the strace command-line tool. For example, to find the minimal set of syscalls required to run the git version command, you can run this:
sudo strace -o trace-result.txt -f -c git version
And here’s the result in the trace-result.txt file:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0         7           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         9           close
  0.00    0.000000           0        25           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         2           pread64
  0.00    0.000000           0         4         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getcwd
  0.00    0.000000           0         2         1 arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0        28        20 openat
  0.00    0.000000           0        24        12 newfstatat
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         1           prlimit64
  0.00    0.000000           0         1           getrandom
  0.00    0.000000           0         1           rseq
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000           0       121        36 total
There you go: by running strace like this, you now know which syscalls your program uses. However, keep in mind that you may need to design a set of tests that cover all the possible paths in your program to ensure every syscall it can make is captured. Otherwise, your program might crash or hit errors at runtime because the Seccomp profile blocks a system call you never observed.
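As a rough sketch of how you could bootstrap such a profile (this is not the exact tooling we use, and it assumes the trace-result.txt summary format shown above plus jq being installed), you can extract the syscall column and wrap it in a minimal allowlist:

awk '/^ *[0-9]/ && $NF != "total" { print $NF }' trace-result.txt | sort -u | \
  jq -R . | jq -s '{defaultAction: "SCMP_ACT_ERRNO", defaultErrnoRet: 1, syscalls: [{names: ., action: "SCMP_ACT_ALLOW"}]}' \
  > my-seccomp-profile.json

You would still want to review the generated my-seccomp-profile.json by hand and merge in syscalls needed by code paths your tests didn’t hit before passing it to --security-opt seccomp=.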
As we mentioned, allowing the container process to make direct syscalls to the kernel poses a risk. Besides Seccomp, you can check out projects such as gVisor if you want even stricter enforcement. It’s a container runtime open-sourced by Google that implements most syscalls in Go in userspace and forwards only a small number of system calls to the Linux kernel, further reducing the attack surface. But that’s beyond the scope of this article, and we will leave it to the reader to explore.
Disabling network
Allowing a process in the sandbox to access the internet or the local network increases the security risk by orders of magnitude. Think about it: if you have an internal server running in your cloud and the attacker knows its address, they can reach it from your container. The AWS EC2 metadata endpoint is a great example. Any process inside an EC2 instance with network access can easily visit the metadata URL at http://169.254.169.254/latest/meta-data/ and retrieve critical information about the instance. That’s why AWS now disables unauthenticated access to these URLs by default.
Besides the potential exposure of internal services, another risk is that the attacker can easily phone home to report a successful attack. Without access to the internet, the most likely way for an attacker to find out whether their attack worked is to change the output from the container and hope it gets picked up by our service and stored somewhere they can inspect. However, since our service checks and validates the output before processing it any further, we will see an error report if there is anything wrong with the container output. Trying to break out of the sandbox without a reliable way to evaluate what works is like driving blindfolded through a minefield while hoping not to blow up or get noticed.
Considering the risks, we made sure that none of BeanHub’s operations inside the sandbox need access to the internet or a local network. Our service communicates with the container via stdin, stdout, and file operations. Since there’s no need for a network, the containers all run with the --network none argument.
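For instance, even a simple probe of the EC2 metadata address from inside such a container fails right away, since the container has no interface other than loopback (the alpine image and wget command here are only for illustration):

podman run --rm --network none alpine wget -q -T 2 -O - http://169.254.169.254/latest/meta-data/

The command errors out instead of returning any instance metadata.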
Final thoughts: defense in depth
With all user-uploaded data processed inside a sandbox, it is now much harder to compromise BeanHub’s service. Unless the attacker finds a way to escape, the worst case is that they compromise the sandbox container and modify their own data.
We believe in defense in depth. Just because we use a sandbox to run and process user-uploaded data doesn’t necessarily mean our system is 100% secure. This article only covered a few of the security measures and interesting tricks we’ve developed while building BeanHub. We left out many other details, such as the --security-opt=no-new-privileges flag, to keep the article length under control. If you want to learn more about container security and best practices beyond the Container Security book, you can also read the Docker Security Cheat Sheet.
We’ve learned a lot from building BeanHub, and we think sharing our experiences would be great for the community. This is just part 1 of a series of articles we will publish, and I hope you enjoy it. Next, we will discuss our layer-based distributed Git repository system and explain how we made all Git changes auditable. Even if a user corrupts or overwrites the Git history, we can still trace back who did it, when it happened, and what the previous state was. Stay tuned by subscribing to our mailing list at the bottom of this page. See you next time!