Understanding Why JFrog Products Cannot Be Joined If One Is in a Kubernetes Cluster

I was intrigued by a recent update (around mid-February) to JFrog Xray’s Helm installation documentation, which states:

Currently, it is not possible to connect a JFrog product (e.g., Xray) that is within a Kubernetes cluster with another JFrog product (e.g., Artifactory) that is outside of the cluster, as this is considered a separate network. Therefore, JFrog products cannot be joined together if one of them is in a cluster.

https://www.jfrog.com/confluence/display/JFROG/Installing+Xray#InstallingXray-HelmInstallation.1

Upon seeing this, there was only one logical thing to do – understand the underlying reason 🙂

Setup

For convenience, I chose AWS EKS as my Kubernetes cluster to try to figure this one out.

Next, I installed Artifactory 7 on a vanilla EC2 instance outside the cluster. Lastly, I grabbed the join key and attempted to deploy Xray 3 (via the jfrog-platform Helm chart) inside the EKS cluster.

Issue #1 – postgres-setup-init Init Container Expects Kubernetes DNS to Resolve Artifactory 7 Instance

The Xray 3 pod consists of the following containers:

  • Init Containers
    • postgres-setup-init
    • copy-system.yaml
  • Containers
    • router
    • xray-server
    • xray-analysis
    • xray-indexer
    • xray-persist

The postgres-setup-init init container uses the init script found in jfrog-platform/templates/_helpers.tpl (lines 70 to 83). For easy reference, I have extracted it here:

- name: postgres-setup-init
  image: {{ .Values.global.database.initContainerSetupDBImage }}
  imagePullPolicy: {{ .Values.global.database.initContainerImagePullPolicy }}
  securityContext:
    runAsUser: 0
  command:
    - '/bin/bash'
    - '-c'
    - >
      {{- if and (ne .Chart.Name "artifactory-ha") (ne .Chart.Name "artifactory") }}
      until nc -z -w 5 {{ .Release.Name }}-artifactory-ha 8082 || nc -z -w 5 {{ .Release.Name }}-artifactory 8082; do echo "Waiting for artifactory to start"; sleep 10; done;
      {{- end }}
      echo "Running init db scripts";
      su postgres -c "bash /scripts/setupPostgres.sh"

Of interest is the hostname used for the netcat check, which verifies that the Artifactory service has started at {{ .Release.Name }}-artifactory-ha or {{ .Release.Name }}-artifactory. It relies on the DNS service within the cluster to resolve the hostname to the IP address of the Artifactory pod.

However, as Artifactory exists outside the cluster, the cluster DNS is not able to resolve that domain name.
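
To see the failure for yourself, you can run a one-off pod and attempt the same lookup that the init container performs (a minimal sketch, assuming the Helm release is named jfrog-platform):

# One-off BusyBox pod; the lookup fails because no Service named
# jfrog-platform-artifactory exists in the cluster.
kubectl run dns-check -it --rm --restart=Never --image=busybox -- \
  nslookup jfrog-platform-artifactory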

There are two possible workarounds:

  • Create a Service of type ExternalName that maps to the external address (see the sketch after this list)
  • Replace the hostname in the chart template with the actual address/IP of the Artifactory instance
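
The first workaround could look something like this (a minimal sketch, assuming the Helm release is named jfrog-platform and Artifactory is reachable at the hypothetical hostname artifactory.mycorp.internal):

# Map the in-cluster name the chart expects to the external Artifactory host.
# Release name and external hostname below are assumptions for illustration.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: jfrog-platform-artifactory
spec:
  type: ExternalName
  externalName: artifactory.mycorp.internal
EOF

With this in place, the nc check in the init script resolves jfrog-platform-artifactory to a CNAME pointing at the external host.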

Assuming you applied one of the above workarounds (or another one not listed), the init containers will be able to detect that Artifactory 7 has started and complete their tasks. Once they complete successfully, the main containers are started.

Issue #2 – Artifactory 7 Initiates a New Connection to Xray 3 as Part of Join Cluster Process

All five containers are started simultaneously, but the router container has to initialize successfully before xray-server (and the others) can finish their own initialization steps. The dependency order between the containers is as follows (from left to right):

router --> xray-server --> xray-[analysis|indexer|persist]

One of the router container’s initialization steps is to reach out to Artifactory and attempt to join the cluster using the provided joinKey. This JFrog system architecture diagram helps to better understand the relationship between the two products. Based on my observations, the join process goes something like this:

  1. The Xray 3 router container initiates the join cluster request
  2. Artifactory 7 attempts to create a new connection back to Xray 3 using the request’s source IP (sketched after this list)
  3. If the connection back to Xray 3 succeeds, Artifactory reports the cluster join as successful; otherwise, it responds with an error
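
Judging by the Check-url that appears in the logs below, step 2 is roughly equivalent to Artifactory running something like the following from its own host (the pod IP is a placeholder):

# Approximation of Artifactory's callback in step 2; 10.0.1.23 stands in for
# the Xray pod IP. The connect timeout mirrors the behaviour seen in the logs
# when the security group drops the connection.
curl --connect-timeout 5 http://10.0.1.23:8082/router/api/v1/system/ping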

During my attempt, the pod went into a CrashLoopBackOff, and the router container had the following logs:

...
[jfrou][INFO][...][bootstrap.go:72][main] Router (jfrou) service initialization started. Version 7.11.2-3 Revision ...
[jfrou][INFO][...][bootstrap.go:75] [main] JFrog Router IP: (Xray-Pod-IP)
[jfrou][INFO][...][bootstrap.go:175] [main] System configuration encryption report:
...
[jfrou][INFO][...][bootstrap.go:81] [main] JFrog Router Node ID: jfrog-platform-xray-0
[jfrou][INFO][...][http_client_holder.go:155] [main] System cert pool contents were loaded as trusted CAs for TLS communication
[jfrou][INFO][...][join_executor.go:118] [main] Cluster join: Trying to rejoin the cluster
[jfrou][INFO][...][bootstrap.go:101] [main] Could not join access, err: Cluster join: Failed joining the cluster; Error: Error response from service registry, status code: 400; message: Could not validate router Check-url: http://(Xray-Pod-IP):8082/router/api/v1/system/ping; detail: I/O error on GET request for "http://(Xray-Pod-IP):8082/router/api/v1/system/ping": Connect to (Xray-Pod-IP):8082 [/(Xray-Pod-IP)] failed: connect timed out; nested exception is org.apache.http.conn.ConnectTimeoutException: Connect to (Xray-Pod-IP):8082 [/(Xray-Pod-IP)] failed: connect timed out
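
(These logs can be retrieved from the crashing pod with a command like the one below; the pod name comes from the Node ID in the log, and -p shows the previous container instance after the restart.)

# Fetch router logs from the previous (crashed) container instance.
kubectl logs jfrog-platform-xray-0 -c router -p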

The connect timed out error above is identical to the one in this github issue, in which the OP correctly identified that the error message was coming from Artifactory. Looking at Artifactory’s console.log, we see the following error:

[jfac ][ERROR][...][.j.a.s.s.r.JoinServiceImpl:268] - Could not validate router Check-url: http://(Xray-Pod-IP):8082/router/api/v1/system/ping

I checked that the Xray-Pod-IP was indeed listed as a secondary private IP address on the elastic network interface attached to my EKS worker node, but realized that the node’s security group ingress rules only allowed traffic from 1) ELBs belonging to Services in the cluster, or 2) sources within the EKS cluster.
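
For the record, one way to find the ENI that carries a given pod IP, along with its attached security groups, is the following (the IP is a placeholder):

# Find the ENI holding the pod IP as a (secondary) private address and list
# the security groups attached to it. 10.0.1.23 is a hypothetical pod IP.
aws ec2 describe-network-interfaces \
  --filters Name=addresses.private-ip-address,Values=10.0.1.23 \
  --query 'NetworkInterfaces[].Groups'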

Although Artifactory 7 could not connect to Xray 3, the error message could propagate back to the Xray 3 router container because security groups are stateful:

Security groups are stateful — if you send a request from your instance, the response traffic for that request is allowed to flow in regardless of inbound security group rules. Responses to allowed inbound traffic are allowed to flow out, regardless of outbound rules.

https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html

What Happened

The JFrog Helm chart assumes that all components in the suite are deployed in the same Kubernetes cluster (Issue #1). Giving the components a way to locate Artifactory without relying on the cluster DNS addresses this.

The next blocker is the cluster join process, where I presume that Artifactory 7 initiates a new connection back to Xray 3 (instead of reusing the existing join request connection) to test connectivity for its hooks, which can trigger a scan on newly uploaded artifacts.

The security group rules prevented this new “callback” connection from completing, which caused the cluster join process to fail.

What Can Be Done?

The easiest option is to heed JFrog’s notice and adopt an all-or-nothing approach: deploy all the JFrog products inside your Kubernetes cluster, or none of them.

Though I did not attempt this, another possible approach might be to open up the security group inbound rules to allow traffic from the IP addresses of components outside the cluster. However, this may expose the EKS worker node to the wider network/Internet. Last but not least, note that such manual workarounds may not work consistently across deployments/environments.
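
For example, a rule scoped to just the Artifactory host might look like this (the security group ID and CIDR are placeholders):

# Allow only the Artifactory EC2 host (10.0.2.15) to reach port 8082 on the
# worker-node security group. The group ID and address are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8082 \
  --cidr 10.0.2.15/32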
