Kubernetes-based Microservice Observability with Istio Service Mesh: Part 1
Gary A. Stafford
Mar 10
In this two-part post, we will explore the set of observability tools which are part of the latest version of Istio Service Mesh.
These tools include Zipkin, Jaeger, Kiali, Service Graph, Prometheus, and Grafana.
To assist in our exploration, we will deploy a Go-based microservices reference platform to Google Kubernetes Engine on Google Cloud Platform.
What is Observability?
Similar to blockchain, serverless, AI and ML, chatbots, cybersecurity, and service meshes, observability is a hot buzzword in the IT industry right now.
According to Wikipedia, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
Logs, metrics, and traces are often known as the three pillars of observability.
These are the external outputs of the system, which we may observe.
The O’Reilly book, Distributed Systems Observability, by Cindy Sridharan, does an excellent job of detailing ‘The Three Pillars of Observability’ in Chapter 4.
I recommend reading this free online excerpt before continuing.
A second great resource for information on observability is honeycomb.io, a developer of observability tools for production systems, led by well-known industry thought-leader, Charity Majors.
The honeycomb.io site includes articles, blog posts, whitepapers, and podcasts on observability.
As modern distributed systems grow ever more complex, the ability to observe those systems demands equally modern tooling that was designed with this level of complexity in mind.
Traditional logging and monitoring systems often struggle with today’s hybrid and multi-cloud, polyglot language-based, event-driven, container-based and serverless, infinitely-scalable, ephemeral-compute platforms.
Tools like Istio Service Mesh attempt to solve the observability challenge by offering native integrations with several best-of-breed, open-source telemetry tools.
Istio’s integrations include Zipkin and Jaeger for distributed tracing, Kiali and Service Graph for distributed system visualization, and Prometheus and Grafana for metric collection, monitoring, and alerting.
Combined with cloud platform-native monitoring and logging services, such as Stackdriver for Google Kubernetes Engine (GKE) on Google Cloud Platform (GCP), we have a complete observability platform for modern, distributed applications.
A Reference Microservices Platform
To demonstrate the observability tools integrated with the latest version of Istio Service Mesh, we will deploy a reference microservices platform, written in Go, to GKE on GCP.
I developed the reference platform to demonstrate concepts such as API management, Service Meshes, Observability, DevOps, and Chaos Engineering.
The platform comprises (14) components, including (8) Go-based microservices, labeled generically as Service A through Service H; (1) Angular 7, TypeScript-based front-end; (4) MongoDB databases; and (1) RabbitMQ queue for event queue-based communications.
The platform and all its source code are free and open source.
The reference platform is designed to generate service-to-service, service-to-database (MongoDB), and service-to-queue-to-service (RabbitMQ) IPC (inter-process communication).
Service A calls Service B and Service C, Service B calls Service D and Service E, Service D produces a message on a queue that Service F consumes, and so on.
These distributed communications can be observed using Istio’s observability tools when the system is deployed to Kubernetes.
Each Go microservice contains /ping and /health endpoints.
The /health endpoint can be used to configure Kubernetes Liveness and Readiness Probes.
Additionally, the edge service, Service A, is configured for Cross-Origin Resource Sharing (CORS) using the access-control-allow-origin: * response header.
CORS allows the Angular UI, running in the end user’s web browser, to call the Service A /ping endpoint, which resides on a different subdomain (origin) from the UI.
The complete Go source code for Service A is available in the k8s-istio-observe-backend project repository, described in the Source Code section.
For this demonstration, the MongoDB databases will be hosted externally to the services on GCP, on MongoDB Atlas, a cloud-based MongoDB-as-a-Service platform.
Similarly, the RabbitMQ queues will be hosted on CloudAMQP, a cloud-based RabbitMQ-as-a-Service platform.
I have used both of these SaaS providers in several previous posts.
Using external services will help us understand how Istio and its observability tools collect telemetry for communications between the Kubernetes cluster and external systems.
Service F consumes messages from the RabbitMQ queue, placed there by Service D, and writes the messages to MongoDB. Service F’s Go source code is in the k8s-istio-observe-backend project repository.
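The RabbitMQ client and MongoDB driver are not reproduced here. As a rough, self-contained illustration of the decoupled producer/consumer pattern, the Go sketch below replaces the queue with a buffered channel and MongoDB with an in-memory slice. The Message type, its fields, and the consume helper are hypothetical, not Service F’s actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// Message is a hypothetical payload a producer (playing Service D) publishes.
type Message struct {
	TraceID string `json:"trace_id"`
	Body    string `json:"body"`
}

// Store is an in-memory stand-in for the MongoDB collection Service F writes
// to; the mutex makes it safe if several consumers share it.
type Store struct {
	mu   sync.Mutex
	docs []Message
}

func (s *Store) Insert(m Message) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.docs = append(s.docs, m)
}

func (s *Store) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.docs)
}

// consume plays Service F's role: read raw bodies until the queue is closed,
// decode each one, and persist it. Malformed bodies are counted and skipped.
func consume(queue <-chan []byte, store *Store) (rejected int) {
	for raw := range queue {
		var m Message
		if err := json.Unmarshal(raw, &m); err != nil {
			rejected++
			continue
		}
		store.Insert(m)
	}
	return rejected
}

func main() {
	queue := make(chan []byte, 10) // buffered channel standing in for the RabbitMQ queue
	store := &Store{}

	// The producer publishes and moves on without waiting for the consumer.
	for i := 0; i < 3; i++ {
		b, _ := json.Marshal(Message{TraceID: fmt.Sprintf("t-%d", i), Body: "hello"})
		queue <- b
	}
	queue <- []byte("not json")
	close(queue)

	rejected := consume(queue, store)
	fmt.Println(store.Len(), rejected) // 3 1
}
```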
Service Response Traces
On the reference platform, each upstream service responds to requests from downstream services by returning a small informational JSON payload, which I have termed a trace.
These response traces should not be confused with transactional traces, a basic construct of distributed transaction monitoring with tools like Zipkin and Jaeger.
The response traces are aggregated across the service call chain, resulting in an array of service response traces being returned to the edge service and on to the Angular-based UI, running in the end user’s web browser.
The response trace feature is simply used to confirm that the service-to-service communications, Istio components, and the telemetry tools are working properly.
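To make the aggregation concrete, this Go sketch models the call chain described earlier (Service A calls Services B and C; Service B calls Services D and E). The Trace struct, its JSON field names, and the respond helper are hypothetical; the real payload format is defined in the project’s source. The queue-based hop from Service D to Service F is left out, on the assumption that an asynchronous consumer contributes no response trace.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trace is a hypothetical shape for one service response trace.
type Trace struct {
	ID          string `json:"id"`
	ServiceName string `json:"serviceName"`
	Greeting    string `json:"greeting"`
}

// respond builds a service's reply: its own trace first, followed by the
// traces returned by each upstream service it called, in call order.
func respond(self Trace, upstream ...[]Trace) []Trace {
	out := []Trace{self}
	for _, ts := range upstream {
		out = append(out, ts...)
	}
	return out
}

func main() {
	// Leaf services reply with only their own trace.
	d := respond(Trace{ID: "4", ServiceName: "Service-D", Greeting: "Hello from Service-D"})
	e := respond(Trace{ID: "5", ServiceName: "Service-E", Greeting: "Hello from Service-E"})
	c := respond(Trace{ID: "3", ServiceName: "Service-C", Greeting: "Hello from Service-C"})

	// Service B calls D and E; the edge service, A, calls B and C.
	b := respond(Trace{ID: "2", ServiceName: "Service-B", Greeting: "Hello from Service-B"}, d, e)
	a := respond(Trace{ID: "1", ServiceName: "Service-A", Greeting: "Hello from Service-A"}, b, c)

	payload, _ := json.Marshal(a)
	fmt.Println(len(a)) // 5
	fmt.Println(string(payload))
}
```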
Source Code
All source code for this post is available on GitHub in two projects.
The Go-based microservices source code, all Kubernetes resources, and all deployment scripts are located in the k8s-istio-observe-backend project repository.
The Angular UI TypeScript-based source code is located in the k8s-istio-observe-frontend project repository.
Docker images referenced in the Kubernetes Deployment resource files, for the Go services and UI, are all available on Docker Hub.
The Go microservice Docker images were built using the official Golang Alpine base image on Docker Hub, containing Go version 1.12.0.
Using the Alpine image to compile the Go source code helps keep the containers small, with a minimal attack surface.
System Requirements
To follow along with the post, you will need the gcloud CLI (part of the Google Cloud SDK), Helm, and Istio 1.0.6 installed and configured locally or on your build machine.
The code in this post has not been tested with any of the Istio 1.1.x pre-release versions.
Set-up and Installation
To deploy the microservices platform to GKE, we will proceed in the following order:
1. Create the MongoDB Atlas database cluster
2. Create the CloudAMQP RabbitMQ cluster
3. Modify the Kubernetes resources and scripts for your own environments
4. Create the GKE cluster on GCP
5. Create DNS records for the platform’s exposed resources
6. Deploy Istio to the GKE cluster, using Helm
7. Deploy the Go-based microservices, Angular UI, and associated resources to GKE
8. Test and troubleshoot the platform
9. Observe the results in Part Two!

MongoDB Atlas Cluster
MongoDB Atlas is a fully-managed MongoDB-as-a-Service, available on AWS, Azure, and GCP.
Atlas, a mature SaaS product, offers high-availability, guaranteed uptime SLAs, elastic scalability, cross-region replication, enterprise-grade security, LDAP integration, a BI Connector, and much more.
MongoDB Atlas currently offers four pricing plans: Free, Basic, Pro, and Enterprise.
Plans range from the smallest, M0-sized MongoDB cluster, with shared RAM and 512 MB storage, up to the massive M400 MongoDB cluster, with 488 GB of RAM and 3 TB of storage.
For this post, I have created an M2-sized MongoDB cluster in GCP’s us-central1 (Iowa) region, with a single user database account for this demo.
The account will be used to connect from four of the eight Go-based microservices, running on GKE.
Originally, I started with an M0-sized cluster, but the compute resources were insufficient to support the volume of calls from the Go-based microservices.
I suggest an M2-sized cluster or larger.
CloudAMQP RabbitMQ Cluster
CloudAMQP provides fully-managed RabbitMQ clusters on all major cloud and application platforms.
RabbitMQ will support a decoupled, eventually consistent, message-based architecture for a portion of our Go-based microservices.
For this post, I have created a RabbitMQ cluster in GCP’s us-central1 (Iowa) region, the same as our GKE cluster and MongoDB Atlas cluster.
I chose a minimally-configured free version of RabbitMQ.
CloudAMQP also offers robust, multi-node RabbitMQ clusters for Production use.
Modify Configurations
There are a few configuration settings you will need to change in the GitHub project’s Kubernetes resource files and Bash deployment scripts.
Istio ServiceEntry for MongoDB Atlas
Modify the Istio ServiceEntry file, external-mesh-mongodb-atlas.yaml, adding your MongoDB Atlas host address.
This file allows egress traffic from four of the microservices on GKE to the external MongoDB Atlas cluster.
    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: mongodb-atlas-external-mesh
    spec:
      hosts:
      - {{ your_host_goes_here }}
      ports:
      - name: mongo
        number: 27017
        protocol: MONGO
      location: MESH_EXTERNAL
      resolution: NONE

Istio ServiceEntry for CloudAMQP RabbitMQ
Modify the Istio ServiceEntry file, external-mesh-cloudamqp.yaml, adding your CloudAMQP host address.
This file allows egress traffic from two of the microservices to the CloudAMQP cluster.
    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: cloudamqp-external-mesh
    spec:
      hosts:
      - {{ your_host_goes_here }}
      ports:
      - name: rabbitmq
        number: 5672
        protocol: TCP
      location: MESH_EXTERNAL
      resolution: NONE

Istio Gateway and VirtualService Resources
There are numerous strategies you may use to route traffic into the GKE cluster via Istio.
I am using a single domain for the post, example-api.com, and four subdomains.
One set of subdomains is for the Angular UI, in the dev Namespace (ui.dev.example-api.com) and the test Namespace (ui.test.example-api.com).
The other set of subdomains is for the edge API microservice, Service A, which the UI calls (api.dev.example-api.com and api.test.example-api.com).
Traffic is routed to specific Kubernetes Service resources, based on the URL.
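A rough Go model of that host-based routing follows. Istio’s Envoy proxies do the real matching; this only illustrates the lookup. It assumes each subdomain maps to the Service FQDNs used elsewhere in this post; the angular-ui destination for the test Namespace follows the dev example’s naming pattern and is an assumption, as is the resolveRoute helper.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// routes maps each public host to the in-cluster Service FQDN its
// VirtualService forwards to (the test UI entry is assumed by pattern).
var routes = map[string]string{
	"ui.dev.example-api.com":   "angular-ui.dev.svc.cluster.local",
	"ui.test.example-api.com":  "angular-ui.test.svc.cluster.local",
	"api.dev.example-api.com":  "service-a.dev.svc.cluster.local",
	"api.test.example-api.com": "service-a.test.svc.cluster.local",
}

// resolveRoute strips any :port suffix from a Host header, lower-cases it,
// and looks up the destination. ok is false when no route matches.
func resolveRoute(hostHeader string) (dest string, ok bool) {
	host := hostHeader
	if h, _, err := net.SplitHostPort(hostHeader); err == nil {
		host = h
	}
	dest, ok = routes[strings.ToLower(host)]
	return dest, ok
}

func main() {
	for _, h := range []string{"api.dev.example-api.com", "ui.test.example-api.com:80", "unknown.example.com"} {
		dest, ok := resolveRoute(h)
		fmt.Printf("%-28s -> %q (matched=%v)\n", h, dest, ok)
	}
}
```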
According to Istio, the Gateway describes a load balancer operating at the edge of the mesh, receiving incoming or outgoing HTTP/TCP connections.
Modify the Istio ingress Gateway, inserting your own domains or subdomains in the hosts section.
These are the hosts on port 80 that will be allowed into the mesh.
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: demo-gateway
    spec:
      selector:
        istio: ingressgateway
      servers:
      - port:
          number: 80
          name: http
          protocol: HTTP
        hosts:
        - ui.dev.example-api.com
        - ui.test.example-api.com
        - api.dev.example-api.com
        - api.test.example-api.com

According to Istio, a VirtualService defines a set of traffic routing rules to apply when a host is addressed.
A VirtualService is bound to a Gateway to control the forwarding of traffic arriving at a particular host and port.
Modify the project’s four Istio VirtualServices, inserting your own domains or subdomains.
Here is an example of one of the four VirtualServices, in the istio-gateway.yaml file.

    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: angular-ui-dev
    spec:
      hosts:
      - ui.dev.example-api.com
      gateways:
      - demo-gateway
      http:
      - match:
        - uri:
            prefix: /
        route:
        - destination:
            port:
              number: 80
            host: angular-ui.dev.svc.cluster.local

Kubernetes Secret
The project contains a Kubernetes Secret, go-srv-demo.yaml, with two values.
One is for the MongoDB Atlas connection string and one is for the CloudAMQP connection string.
Remember, values in a Kubernetes Secret’s data field must be base64-encoded.
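If you prefer to do the encoding in Go, the standard library’s encoding/base64 package (StdEncoding, with padding) produces the same strings as the echo -n … | base64 commands shown further on, without the risk of a stray trailing newline. The helper names are hypothetical, and the credentials are this post’s placeholders.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeSecretValue returns the standard, padded base64 form expected in a
// Secret's data field.
func encodeSecretValue(plain string) string {
	return base64.StdEncoding.EncodeToString([]byte(plain))
}

// decodeSecretValue reverses the encoding, handy for checking a Secret.
func decodeSecretValue(encoded string) (string, error) {
	b, err := base64.StdEncoding.DecodeString(encoded)
	return string(b), err
}

func main() {
	// Placeholder connection string; never commit real credentials.
	conn := "amqp://username:password@rmq.cloudamqp.com/cluster"
	enc := encodeSecretValue(conn)
	fmt.Println(enc) // YW1xcDovL3VzZXJuYW1lOnBhc3N3b3JkQHJtcS5jbG91ZGFtcXAuY29tL2NsdXN0ZXI=

	dec, err := decodeSecretValue(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(dec == conn) // true
}
```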
    apiVersion: v1
    kind: Secret
    metadata:
      name: go-srv-config
    type: Opaque
    data:
      mongodb.conn: {{ your_base64_encoded_secret }}
      rabbitmq.conn: {{ your_base64_encoded_secret }}

On Linux and Mac, you can use the base64 program to encode the connection strings.
    > echo -n "mongodb+srv://username:password@atlas-cluster.gcp.mongodb.net/test?retryWrites=true" | base64
    bW9uZ29kYitzcnY6Ly91c2VybmFtZTpwYXNzd29yZEBhdGxhcy1jbHVzdGVyLmdjcC5tb25nb2RiLm5ldC90ZXN0P3JldHJ5V3JpdGVzPXRydWU=

    > echo -n "amqp://username:password@rmq.cloudamqp.com/cluster" | base64
    YW1xcDovL3VzZXJuYW1lOnBhc3N3b3JkQHJtcS5jbG91ZGFtcXAuY29tL2NsdXN0ZXI=

On Linux, GNU base64 wraps its output at 76 characters by default; add the -w 0 option to keep longer values, such as the MongoDB connection string, on a single line.

Bash Scripts Variables
The bash script, part3_create_gke_cluster.sh, contains a series of environment variables.
At a minimum, you will need to change the PROJECT variable in all scripts to match your GCP project name.
    # Constants - CHANGE ME!
    readonly PROJECT='{{ your_gcp_project_goes_here }}'
    readonly CLUSTER='go-srv-demo-cluster'
    readonly REGION='us-central1'
    readonly MASTER_AUTH_NETS='72.231.208.0/24'
    readonly NAMESPACE='dev'
    readonly GKE_VERSION='1.12.5-gke.5'
    readonly MACHINE_TYPE='n1-standard-2'

The bash script, part4_install_istio.sh, includes the ISTIO_HOME variable.
The value should correspond to your local path to Istio 1.0.6.
On my local Mac, this value is shown below.

    readonly ISTIO_HOME='/Applications/istio-1.0.6'

Deploy GKE Cluster
Next, deploy the GKE cluster using the included bash script, part3_create_gke_cluster.sh.
This will create a regional, multi-zone, 3-node GKE cluster, using the latest version of GKE at the time of this post, 1.12.5-gke.5.
The cluster will be deployed to the same region as the MongoDB Atlas and CloudAMQP clusters, GCP’s us-central1 (Iowa) region.
Planning where your cloud resources will reside, for both SaaS providers and primary cloud providers, can be critical to minimizing latency for network I/O-intensive applications.
Modify DNS Records
Instead of using IP addresses to route traffic to the GKE cluster and its applications, we will use DNS.
As explained earlier, I have chosen a single domain for the post, example-api.com, and four subdomains.
One set of subdomains is for the Angular UI, in the dev Namespace and the test Namespace.
The other set of subdomains is for the edge microservice, Service A, which the UI calls.
Traffic is routed to specific Kubernetes Service resources, based on the URL.
Creating the GKE cluster also triggers the creation of a Google Load Balancer, four IP addresses, and all required firewall rules.
One of the four IP addresses, the one associated with the forwarding rule, will be assigned to the front-end of the load balancer.
Below, we see the new load balancer, with the front-end IP address and the backend VM pool of the GKE cluster’s three worker nodes.
Each node is assigned one of the IP addresses, as shown above.
As shown below, using Google Cloud DNS, I have created the four subdomains and assigned the IP address of the load balancer’s front-end to all four subdomains.
Ingress traffic to these addresses will be routed through the Istio ingress Gateway and the four Istio VirtualServices, to the appropriate Kubernetes Service resources.
Use your choice of DNS management tools to create the four DNS A records.
Deploy Istio using Helm
With the GKE cluster and associated infrastructure in place, deploy Istio.
For this post, I have chosen to install Istio using Helm, as recommended by Istio.
To deploy Istio using Helm, use the included bash script, part4_install_istio.sh.
The script installs Istio using the Helm Chart in the local Istio 1.0.6 install/kubernetes/helm/istio directory, which you installed as a requirement for this demonstration.
The Istio install script overrides several default values in the Istio Helm Chart using the --set flag.
The list of available configuration values is detailed in the Istio Chart’s GitHub project.
The options enable Istio’s observability features, which we will explore in part two.
Features include Kiali, Grafana, Prometheus, Service Graph, Zipkin, and Jaeger.
    helm install $ISTIO_HOME/install/kubernetes/helm/istio \
      --name istio \
      --namespace istio-system \
      --set global.mtls.enabled=true \
      --set grafana.enabled=true \
      --set kiali.enabled=true \
      --set kiali.createDefaultSecret=true \
      --set prometheus.enabled=true \
      --set servicegraph.enabled=true \
      --set telemetry-gateway.grafanaEnabled=true \
      --set telemetry-gateway.prometheusEnabled=true \
      --set tracing.enabled=true \
      --set tracing.provider=jaeger

Below, we see the Istio-related Workloads running on the cluster, including the observability tools.
Below, we see the corresponding Istio-related Service resources running on the cluster.
Deploy the Reference Platform
Next, deploy the eight Go-based microservices, the Angular UI, and the associated Kubernetes and Istio resources to the GKE cluster.
To deploy the platform, use the included bash deploy script, part5a_deploy_resources.sh.
If anything fails and you want to remove the existing resources and re-deploy, without destroying the GKE cluster or Istio, you can use the part5b_delete_resources.sh delete script.
The deploy script deploys all the resources to two Kubernetes Namespaces, dev and test.
This will allow us to see how we can differentiate between Namespaces when using the observability tools.
Below, we see the Istio-related resources, which we just deployed.
They include the Istio Gateway, four Istio VirtualService resources, and two Istio ServiceEntry resources.
Below, we see the platform’s Workloads (Kubernetes Deployment resources), running on the cluster.
Here we see two Pods for each Workload, a total of 18 Pods, running in the dev Namespace.
Each Pod contains both the deployed microservice or UI component and a copy of Istio’s Envoy Proxy.
Below, we see the corresponding Kubernetes Service resources running in the dev Namespace.
Below, a similar view of the Deployment resources running in the test Namespace.
Again, there are two Pods for each Deployment, and each Pod contains both the deployed microservice or UI component and a copy of Istio’s Envoy Proxy.
Test the Platform
We want to ensure the platform’s eight Go-based microservices and Angular UI are working properly, communicating with each other, and communicating with the external MongoDB Atlas and CloudAMQP RabbitMQ clusters.
The easiest way to test the cluster is by viewing the Angular UI in a web browser.
The UI requires you to input the host domain of Service A, the API’s edge service.
Since you cannot use my subdomain, and the JavaScript code runs locally in your web browser, this option allows you to provide your own host domain.
This is the same domain or domains you inserted into the two Istio VirtualServices for Service A.
These domains route your API calls to the FQDN (fully qualified domain name) of the Service A Kubernetes Service running in either the dev Namespace, service-a.dev.svc.cluster.local, or the test Namespace, service-a.test.svc.cluster.local.
You can also use performance testing tools to load-test the platform.
Many issues will not show up until the platform is under load.
I recently started using hey, a modern load generator tool, as a replacement for Apache Bench (ab). Unlike ab, hey supports HTTP/2 endpoints, which is required to test the Go-based platform.
Below, I am running hey directly from Google Cloud Shell.
The tool is simulating 10 concurrent users, generating a total of 1,000 HTTP/2-based GET requests to Service A.
Troubleshooting
If, for some reason, the UI fails to display, or the call from the UI to the API fails, and assuming all Kubernetes and Istio resources are running on the GKE cluster (all green), the most common explanation is a misconfiguration of one of the following resources:
- Your four Cloud DNS records are not correct; they do not point to the load balancer’s front-end IP address.
- You did not configure the four Istio VirtualService resources with the correct subdomains.
- Your Go-based microservices cannot reach the external MongoDB Atlas and CloudAMQP RabbitMQ clusters. Likely, the Kubernetes Secret is constructed incorrectly, or the two ServiceEntry resources contain the wrong host information for those external clusters.
I suggest starting the troubleshooting by calling Service A, the API’s edge service, directly, using cURL or Postman.
You should see a JSON response payload containing the array of service response traces.
If you do, the issue is likely with the UI, not the API.
Next, confirm that the four MongoDB databases were created for Service D, Service F, Service G, and Service H.
Also, confirm that new documents are being written to the databases’ collections.
Next, confirm the new RabbitMQ queue was created, using the CloudAMQP RabbitMQ Management Console.
Service D produces messages, which Service F consumes from the queue.
Lastly, review the Stackdriver logs to see if there are any obvious errors.
Part Two
In part two of this post, we will explore each observability tool, and see how they can help us manage our GKE cluster and the reference platform running in the cluster.
Since the cluster only takes minutes to create and deploy resources to, you can tear down the GKE cluster when you are finished by running the part6_tear_down.sh script.
All opinions expressed in this post are my own and not necessarily the views of my current or past employers or their clients.
Originally published at programmaticponderings.com on March 11, 2019.