We subject our clusters to a lot of automated tests in the widest sense – monitoring, health checks, load tests, penetration tests, vulnerability scans, the list goes on – but every so often I come across test cases that are not well served by any of them. They are usually specific to the way a cluster is used, or how the organisation operating it works. Sometimes there is no objectively correct or incorrect answer, no obvious expected value to specify in our assert statements. I will look at three examples to explain why I think these tests are worth your while.
The first test concerns the OpenShift default of letting all authenticated users create (or, more accurately, request) projects. Let’s say we want to deny non-admin users this power. How do we make sure we have complied with this rule?
Second, our architecture may require an application scaled to three pods to be distributed across three data centre zones for high availability. We need a test that shows that the built infrastructure matches the architectural requirement.
Third, let’s assume we have just experienced an unplanned downtime. Communication between two projects has failed. Clearly remediation comes first, but how would the administrator go about writing a test that makes sure the pod network is configured correctly?
The three scenarios have a number of things in common. Each requires direct access to the cluster state held in the master’s etcd database. That interaction alone ensures that these are not inexpensive tests in performance terms. Broadly speaking, these tests should run daily, preferably at a time of reduced load, not every hour of the day. Running them is perhaps most useful after cluster maintenance or upgrades. We will look at sample implementations of these tests in just a moment.
How much work will creating tests like these involve? Thankfully, very little. If we are unsure what to test, a quick glance at our operational guidelines or architecture documentation will help us get started. Writing tests will come naturally to anyone familiar with OpenShift, and should take no more than five minutes in most cases. Kubernetes gives us all the tools we need to implement our test runner.
Test setup
The CronJob object triggers nightly test runs. The payload is a lightweight single-container pod with Kate Ward’s unit test framework shUnit2, the oc client, and assorted tools (curl, psql, mysql, jq, awk). All test data is taken from a ConfigMap mounted at launch. The ConfigMap in turn is generated from a folder of test scripts in Git. We will return to the scripts in just a moment.
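If the scripts live in a folder called openshift-unit.d, the ConfigMap might be generated along the following lines (the ConfigMap and folder names are assumptions, not taken from the repository):
$ oc create configmap openshift-unit \
    --from-file=openshift-unit.d/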
For now the CronJob object waits for the appointed hour, then triggers a test run. shunit2 processes the test suite (consisting of all test scripts in /etc/openshift-unit.d) and then reports results. Due to a limitation of the CronJob API prior to Kubernetes 1.8, the pod reports success (zero) even in case of errors, as returning an error leads to constant redeployments and considerable load on the cluster.
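The runner inside the container might look roughly like this (a sketch only; the paths, the .sh suffix and the shunit2 location are assumptions, and the actual script in the repository may differ):
#!/bin/sh
# sketch of a test runner for the mounted test scripts

suite() {
  # each sourced script defines its test functions and registers them
  # with suite_addTest
  for script in /etc/openshift-unit.d/*.sh; do
    . "${script}"
  done
}

# run shUnit2 in a subshell so that a failing suite cannot end the runner
# with a non-zero exit code (see the CronJob limitation described above)
( . /usr/bin/shunit2 )

exit 0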
From a permissions point of view, administrator access is required to create the project initially, but from that point onward the service account is read-only and the container runs with the ‘restricted’ security context constraint and a non-privileged security context.
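Setting up that read-only access might look something like this (project, service account and role names are assumptions; cluster-reader is one way to grant read access to nodes and netnamespaces, which the sample tests query):
$ oc new-project openshift-unit
$ oc create serviceaccount openshift-unit
$ oc adm policy add-cluster-role-to-user cluster-reader \
    -z openshift-unit -n openshift-unit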
Logs are written to standard output and so managed by the existing log server. Using only the default suite of tests, the test pod reports the following:
test_nodes_ready
test_nodes_no_warnings
test_project_quotas
test_cluster_admin_bindings
test_container_resources
test_security_context_privileged
test_high_availability
test_anyuid
test_self_provisioner
Ran 9 tests.
OK
These tests are just placeholders, however. The tests that matter are the ones that reflect your organisation’s individual rules, guidelines and decisions.
Roles and permissions
Let’s return to the first example mentioned in the introduction, that is, the self-provisioner rule. It ensures that the administrator has taken the corresponding cluster role from the groups system:authenticated and system:authenticated:oauth, which usually means that an administrator has issued the following command:
$ oc adm policy remove-cluster-role-from-group \
    self-provisioner \
    system:authenticated system:authenticated:oauth
cluster role "self-provisioner" removed: ["system:authenticated" "system:authenticated:oauth"]
Using the oc tool, verifying that this has not been forgotten or reversed at a later point is as straightforward as asking who is entitled to create (the verb) projectrequests (the resource):
test_self_provisioner() {
  count_self_provisioner=`oc adm policy who-can \
    create projectrequests | \
    grep -c system:authenticated`
  assertEquals " non-admin users may not create project requests;" \
    0 ${count_self_provisioner}
}
suite_addTest test_self_provisioner
The shUnit2 framework intrudes only very slightly on the code here. The utility function suite_addTest allows the framework to combine many files in a test suite with a single return value. The test code must reside in a function whose name contains the word test. The writer also needs to be familiar with the framework’s assert functions, assertEquals in this case. Placing a space at the start of the string and a semicolon at the end are conventions that make error messages more legible:
test_self_provisioner
ASSERT: non-admin users may not create project requests; expected: but was:
Whereas most infrastructure tests strive for objectivity and test coverage, cluster tests like this one are unrepentantly subjective and selective. A comparison with rspec-puppet tests is instructive. Here is a brief excerpt from a Puppet manifest with a matching rspec-puppet test:
class bastion::install {
  file { '/home/ec2-user/config.json':
    ensure  => file,
    owner   => 'ec2-user',
    mode    => '0644',
    content => template('bastion/config.json.erb'),
  }
}
The test asserts the following:
context 'in class Install' do
  it {
    should contain_file('/home/ec2-user/config.json')
      .with_ensure('file')
      .with_owner('ec2-user')
      .with_mode('0644')
  }
end
This approach makes it much harder to argue that some properties (e.g. users with basic-user credentials are allowed to create projects) are more important than others (e.g. there’s a JSON file which is read-only unless you are the owner). If we place the two tests side by side, we are reminded that rspec-puppet strives for full map coverage, whereas we are focused on points of interest. These points of interest may seem arbitrary, but so, perhaps, are the decisions and operational guidelines they support and reinforce.
Architecture
How to test for high availability, the second example outlined in the introduction? Anti-affinity rules give us fine-grained control over placement on nodes, but unless we only have one node per zone, we cannot rely on the scheduler alone here. One alternative approach is to identify the nodes and examine the zone label:
test_high_availability() {
  for svc in docker-registry router; do
    nodes=`oc get po --all-namespaces -o wide | grep ${svc} | \
      awk '{ print $8 }'`
    zones=""
    for node in ${nodes}; do
      zones="${zones} `oc get node/${node} -L zone | awk '{print $6}' | tail -n +2`"
    done
    zone_count=`echo ${zones} | tr ' ' '\n' | sort -u | wc -l`
    ha=false
    if [ "${zone_count}" -gt "2" ]; then
      ha=true
    fi
    assertTrue " ${svc} must be distributed across three zones;" ${ha}
  done
}
suite_addTest test_high_availability
As before, we start with plain oc requests and refine the output using basic command line tools. The label zone expresses anti-affinity, the label region affinity: services are spread out across zones and concentrated in regions. We fetch the nodes first (note the use of the wide switch), then extract the zone from the node definition before counting the number of unique zones. The expected number is three.
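To inspect the labels the test relies on, the nodes can be listed with their zone and region columns:
$ oc get nodes -L zone -L region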
Post-incident review
So far, operational guidelines and architectural decisions have directed our test selection. Incidents are another valuable guide. Making sure they occur only once trumps trying to anticipate weaknesses in our infrastructure.
For example, our multi-tenant cluster might contain a project alice which accesses a project bob using a pod network join:
$ oc adm pod-network join-projects --to=bob alice
Let’s assume that the join between the two projects has been lost. Perhaps an additional join from alice to eve was created. The fact that one (the original join is gone) does not intuitively follow from the other (an apparently unrelated new join was created) makes this all the more likely. Affected services then run into timeouts and stop processing requests.
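Such a join would have been created with a command of the same form as above:
$ oc adm pod-network join-projects --to=eve alice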
The problem is quickly diagnosed and fixed, but having suffered one service failure, we really ought to write a test that alerts us should the join disappear again:
test_join_alice_bob() {
  count_net_ids=`oc get netnamespace | \
    grep 'alice\|bob' | \
    awk '{ print $2 }' | \
    sort -u | \
    wc -l`
  assertEquals " join between alice and bob is broken;" \
    1 ${count_net_ids}
}
suite_addTest test_join_alice_bob
To follow the test, we need to appreciate what happens when oc adm pod-network join-projects is called: the source project’s network ID is changed to that of the destination project. Once the two projects share a network ID, they can communicate with each other. (Hence the unfortunate side-effect of creating an additional join from project alice to eve: alice receives the network ID of eve and can no longer reach services in project bob.) The test only has to fetch the network IDs of alice and bob, de-duplicate and count lines. If the join is still in place, the line count will be one.
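The same check can be performed by hand: if the join is intact, both netnamespaces show the same network ID:
$ oc get netnamespace alice bob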
Choosing a language
In case you are wondering why this is not a Go application, I have to confess to some library envy. Clearly the command line component would have been much more elegant, for example, and there is more repetition in the tests than I would like. The exports script seeks to address this by bundling frequently used queries such as ‘list all projects created by users’, but that does not make up for the fact that we give up the luxury of one-line web servers, Bootstrap reports adorned with canvas charts, parallel execution for oc and non-oc test cases, and so on.
Those quibbles, however, hardly justify switching to a different language. If we were to do so, which language should we choose? Go? JavaScript? Python? Ruby? Each of these choices would exclude many users who happen to have prioritised other languages. Shell scripting is familiar to most OpenShift users and a natural extension of the way they interact with OpenShift anyway. Nearly everything of substance in our tests, moreover, relies on oc calls; no standard library can abstract away the fundamental awkwardness of building an application around system calls. They only feel entirely natural in a shell environment.
Shorter paths, fewer destinations
Many tests are essential. The ones we have considered here, strictly speaking, are not. It comes down to an individual assessment of risk and usefulness. Personally, I am much more willing to grant anyuid powers to a service account if I know the next nightly test will fail should I forget to remove them later. Sometimes safety nets get in the way, but they can also have a liberating effect.
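A sketch of what the test_anyuid placeholder from the default suite might grow into (the grep pattern and the policy that no service account may hold anyuid are assumptions):
test_anyuid() {
  # count service accounts added to the anyuid security context constraint;
  # in a default installation none are listed
  count_anyuid=`oc get scc anyuid -o yaml | \
    grep -c 'system:serviceaccount'`
  assertEquals " no service account may be granted anyuid;" \
    0 ${count_anyuid}
}
suite_addTest test_anyuid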
This approach allows us to specify test conditions at the appropriate level and above all quickly, with minimal investment in infrastructure and training. The goal is the shortest path to a small number of valuable points of interest, not comprehensive map coverage: sightseeing, not cartography.
You may also find that colleagues can express almost anything more succinctly and elegantly than you thought possible, especially in the world of Bash. Learning from other people’s tests is, for me, perhaps the most enjoyable aspect of it all.
For those still undeterred, log into your administrator’s account and set the timer:
$ git clone https://github.com/gerald1248/openshift-unit.git
$ make -C openshift-unit