The Apache Kafka project includes a Streams Domain-Specific Language (DSL) built on top of the lower-level Stream Processor API. This DSL provides developers with simple abstractions for performing data processing operations. However, how one builds a stream processing pipeline in a containerized environment with Kafka isn’t clear. This second article in a two-part series uses the basics from the previous article to build an example application using Red Hat AMQ Streams.

Now let’s create a multi-stage pipeline that operates on real-world data, and then consume and visualize that data.

The system architecture

In this article, we build a solution that follows the architecture diagram below. It may be worth referring back here for each new component:

The dataset and problem

The data we chose for this example is the New York City taxi journey information from 2013, which was used for the ACM Distributed Event-Based Systems (DEBS) Grand Challenge in 2015. You can find a description of the data source here.

This example’s dataset is provided as a CSV file, with the columns detailed below:

Column | Description
medallion | an md5sum of the identifier of the taxi (vehicle bound)
hack_license | an md5sum of the identifier for the taxi license
pickup_datetime | time when the passenger(s) were picked up
dropoff_datetime | time when the passenger(s) were dropped off
trip_time_in_secs | duration of the trip in seconds
trip_distance | trip distance in miles
pickup_longitude | longitude coordinate of the pickup location
pickup_latitude | latitude coordinate of the pickup location
dropoff_longitude | longitude coordinate of the drop-off location
dropoff_latitude | latitude coordinate of the drop-off location
payment_type | the payment method (credit card or cash)
fare_amount | fare amount in dollars
surcharge | surcharge in dollars
mta_tax | tax in dollars
tip_amount | tip in dollars
tolls_amount | bridge and tunnel tolls in dollars
total_amount | total paid amount in dollars

Source: DEBS 2015 Grand Challenge.

We can explore interesting avenues within this dataset, such as following specific taxis to calculate:

  • The money earned from one taxi throughout the course of a day.
  • The distance from the last drop-off to the next pick-up to find out whether they travel far without a passenger.
  • The average speed of the taxi’s trip by using the distance and time; we then use the pick-up and drop-off coordinates to guess the amount of traffic encountered.
The example

The processing we chose for this example is relatively straightforward. We calculate the total amount of money (fare_amount + tip_amount) earned within a particular area of the city, based on journeys starting there. This calculation involves splitting the input data into a grid of cells and then summing the total amount of money taken for every journey that originates in each cell. To accomplish this, we have to split up the processing in a way that ensures our output is correct.

We will build this example step-by-step using what we learned in Part 1.

Send data into Apache Kafka

First, we need to make our dataset accessible from the cluster. This task would normally involve connecting to a service that lets us poll live data in real time, but the data we chose is historical, so we have to emulate the real-time behavior.

Kafka Connect is ideal for this sort of function and is exactly what we will use in the end. However, for now, we avoid the additional complexity involved and use a Kafka Producer application as discussed in Part 1. We accomplish this by including the (smaller) data file in the JAR and reading it line-by-line to send to Kafka. See the TaxiProducerExample for an example of how this works. To start, we set the TOPIC environment variable in the deployment configuration so that the data is written to taxi-source-topic.
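For illustration, here is a minimal sketch of what that producer loop might look like, assuming the CSV file is bundled as a classpath resource named sorteddata.csv and reusing the producer, topic, and delay variables from Part 1 (error handling is omitted; the actual implementation is in TaxiProducerExample):

try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        TaxiProducerExample.class.getResourceAsStream("/sorteddata.csv")))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // Each CSV line becomes one record on the taxi-source-topic
        producer.send(new ProducerRecord<>(topic, line));
        Thread.sleep(delay); // emulate the original real-time arrival rate
    }
}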

Now let’s check that the data is streaming to the topic:

$ oc run kafka-consumer -ti --image=registry.access.redhat.com/amq7/amq-streams-kafka:1.1.0-kafka-2.1.1 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic taxi-source-topic --from-beginning
  07290D3599E7A0D62097A346EFCC1FB5,E7750A37CAB07D0DFF0AF7E3573AC141,2013-01-01 00:00:00,2013-01-01 00:02:00,120,0.44,-73.956528,40.716976,-73.962440,40.715008,CSH,3.50,0.50,0.50,0.00,0.00,4.50
  22D70BF00EEB0ADC83BA8177BB861991,3FF2709163DE7036FCAA4E5A3324E4BF,2013-01-01 00:02:00,2013-01-01 00:02:00,0,0.00,0.000000,0.000000,0.000000,0.000000,CSH,27.00,0.00,0.50,0.00,0.00,27.50
  0EC22AAF491A8BD91F279350C2B010FD,778C92B26AE78A9EBDF96B49C67E4007,2013-01-01 00:01:00,2013-01-01 00:03:00,120,0.71,-73.973145,40.752827,-73.965897,40.760445,CSH,4.00,0.50,0.50,0.00,0.00,5.00
  ...
Create the Kafka Streams operations

Now that we have a topic with the String data, we can start creating our application logic. First, let’s set up the Kafka Streams application’s configuration options. We do this in a separate config class—see TripConvertConfig—that uses the same method of reading from environment variables described in Part 1. Note that we use this same method of providing configuration for each new application we build.

We can now generate the configuration options:

TripConvertConfig config = TripConvertConfig.fromMap(System.getenv());
Properties props = TripConvertConfig.createProperties(config);

As described in the basic example, the actions we perform will be to read from one topic, perform some kind of operation on the data, and write out to another topic.

Let’s create the source stream in the same method we saw before:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(config.getSourceTopic());

The data we receive is currently in a format that is not easy to use. Our long CSV data is represented as a String, and we do not have access to the individual fields.

To perform operations on the data, we need to convert these raw events into a type we know. For this purpose, we created a Plain Old Java Object (POJO) representing the Trip data type, and an enum, TripFields, representing the data columns. The function constructTripFromString takes each line of CSV data and converts it into a Trip. This function is implemented in the TripConvertApp class.
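As a rough sketch of the shape of that conversion (the setter names and enum constants here are assumptions; see TripConvertApp for the actual code):

private static Trip constructTripFromString(String line) {
    String[] fields = line.split(",");
    Trip trip = new Trip();
    // Look up each column by its position in the TripFields enum
    trip.setPickupDatetime(fields[TripFields.PICKUP_DATETIME.ordinal()]);
    trip.setDropoffDatetime(fields[TripFields.DROPOFF_DATETIME.ordinal()]);
    trip.setFareAmount(Double.parseDouble(fields[TripFields.FARE_AMOUNT.ordinal()]));
    trip.setTipAmount(Double.parseDouble(fields[TripFields.TIP_AMOUNT.ordinal()]));
    // ... remaining columns are parsed the same way
    return trip;
}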

The Kafka Streams DSL makes it easy to perform this function for every new record we receive:

KStream<String, Trip> mapped = source
                .map((key, value) -> new KeyValue<>(key, constructTripFromString(value)));

We could now write the mapped stream out to the sink topic. However, the serializer/deserializer (SerDes) for our value field is no longer the Serdes.String() we configured in the TripConvertConfig class. Because our Trip type is custom-built for our application, we must create our own SerDes implementation.

This is where the JsonObjectSerde class comes into play. We use this class to handle converting our custom objects to and from JSON, letting the Vertx JsonObject class do the heavy lifting. A few annotations are required on the object’s constructor, which can be easily seen in the Location class.
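Vert.x uses Jackson for its JSON mapping, so the constructor annotations are likely of the following form (a sketch only; the actual annotations can be seen in the Location class):

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;

public class Location {
    private final double latitude;
    private final double longitude;

    // Tell Jackson which constructor parameters map to which JSON fields
    @JsonCreator
    public Location(@JsonProperty("latitude") double latitude,
                    @JsonProperty("longitude") double longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    public double getLatitude() { return latitude; }
    public double getLongitude() { return longitude; }
}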

We are now ready to output to our sinkTopic, using the following code:

final JsonObjectSerde<Trip> tripSerde = new JsonObjectSerde<>(Trip.class);
mapped.to(config.getSinkTopic(), Produced.with(Serdes.String(), tripSerde));
Generate application-specific information

The intention of our application is to calculate the total amount of money received by all journeys originating from any particular cell. We, therefore, must perform calculations using the journey origin’s latitude and longitude, to determine which cell it belongs to. We use the logic laid out in the DEBS Grand Challenge for defining grid specifics. See the figure below for an example:

We must set the grid’s origin (blue point), which represents the center of grid cell (1,1), and a size in meters for every cell in the grid. Next, we convert the cell size into a latitude and longitude distance, dy and dx respectively, and calculate the grid’s top left position (red point). For any new arrival point, we can easily count how many dy and dx away the coordinates are, and therefore in the example above (yellow point), we can determine that the journey originates from cell (3,4).
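A sketch of that lookup, assuming the grid origin and the dy/dx distances are held as constants (the names here are illustrative):

public static Cell cellForCoordinates(double latitude, double longitude) {
    // Top-left corner of cell (1,1): half a cell up and left of the origin
    double topLeftLat = ORIGIN_LAT + (DY / 2);
    double topLeftLon = ORIGIN_LON - (DX / 2);
    // Count how many whole cells away the point is from the top-left corner;
    // latitude decreases as the row number grows southwards
    int row = 1 + (int) Math.floor((topLeftLat - latitude) / DY);
    int col = 1 + (int) Math.floor((longitude - topLeftLon) / DX);
    return new Cell(row, col);
}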

The additional application logic in the Cell class and TripConvertApp performs this calculation, and we set the new record’s key as the Cell type. To write to the sinkTopic, we need a new SerDes, created in an identical fashion to the one we made before.

Because we use the default partitioning strategy, records are partitioned by the value of their key, so this change ensures that every Trip corresponding to a particular pick-up Cell is distributed to the same partition. When we perform processing downstream, the same processing node receives all records corresponding to the same pick-up cell, ensuring the operations’ correctness and reproducibility.

Aggregate the data

We have now converted all of the incoming data to a type of <Cell, Trip>, and can perform an aggregation operation. Our intention is to calculate the sum of the fare_amount + tip_amount for every journey originating from one pick-up cell, across a set time period.

Because our data is historical, the time window that we use should be in relation to the original time that the events occurred, rather than the time that the event entered the Kafka system. To do this, we need to provide a method of extracting this information from each record: a class that implements TimestampExtractor. The Trip fields already contain this information for both pick-up and drop-off times, and so the implementation is straightforward—see the implementation in TripTimestampExtractor for details.
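The interface requires a single extract() method; a sketch of the idea, assuming Trip exposes its pick-up time as epoch milliseconds via a hypothetical accessor, looks like this:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class TripTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        // Window records by when the trip actually happened,
        // not by when the record reached Kafka
        Trip trip = (Trip) record.value();
        return trip.getPickupTimeMillis(); // assumed accessor on the Trip POJO
    }
}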

Even though the topic we read from is already partitioned by cell, there are many more cells than partitions, so each of our replicas will process the data for more than one cell. To ensure that the windowing and aggregation are performed on a cell-by-cell basis, we call the groupByKey() function first, followed by a subsequent windowing operation. As seen below, the window size is easily changeable, but for now, we opted for a window of 15 minutes.

The data can now be aggregated to generate the output metric we want. Doing this is as simple as providing an accumulator value and the operation to perform for each record. The output is of type KTable, where each key represents one particular window, and the value is the output of our aggregation operation. We use the toStream() function to convert that output back to a Kafka stream so that it can be written to the sink topic.

KStream<Cell, Trip> source = builder.stream(config.getSourceTopic(), Consumed.with(cellSerde, tripSerde));
KStream<Windowed<Cell>, Double> windowed = source
        .groupByKey(Serialized.with(cellSerde, tripSerde))
        .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(15)))
        .aggregate(
                () -> (double) 0,
                (key, value, profit) -> profit + value.getFareAmount() + value.getTipAmount(),
                Materialized.<Cell, Double, WindowStore<Bytes, byte[]>>as("profit-store")
                        .withValueSerde(Serdes.Double()))
        .toStream();

As we do not require the information about which window the values belong to, we reset the cell as the record’s key and round the value to two decimal places.

KStream<Cell, Double> rounded = windowed
                .map((window, profit) -> new KeyValue<>(window.key(), (double) Math.round(profit*100)/100));

Finally, the data can now be written to the output topic using the same method as defined before:

rounded.to(config.getSinkTopic(), Produced.with(cellSerde, Serdes.Double()));
Consume and visualize the data

We now have the windowed cell-based metric being output to the last topic, so the final step is to consume and visualize the data. For this step, we use the Vertx Kafka Client to read the data from our topic and stream it to a JavaScript dashboard using the Vertx EventBus and SockJS (WebSockets). See TripConsumerApp for the implementation.

This consumer application registers a handler that converts arriving records into a readable JSON format and publishes the output over an outbound EventBus channel. The JavaScript connects to this channel and registers a handler for all incoming messages that performs the relevant actions to visualize the data:

KafkaConsumer<String, Double> consumer = KafkaConsumer.create(vertx, props, String.class, Double.class);
consumer.handler(record -> {
    JsonObject json = new JsonObject();
    json.put("key", record.key());
    json.put("value", record.value());
    vertx.eventBus().publish("dashboard", json);
});
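For the browser to receive those messages, the EventBus has to be bridged over SockJS with the dashboard address explicitly permitted. A minimal sketch of that wiring (API details vary slightly between Vert.x versions; see TripConsumerApp for the real setup) might be:

Router router = Router.router(vertx);
// Allow messages published to "dashboard" to flow out to the browser
BridgeOptions options = new BridgeOptions()
        .addOutboundPermitted(new PermittedOptions().setAddress("dashboard"));
router.route("/eventbus/*").handler(SockJSHandler.create(vertx).bridge(options));
vertx.createHttpServer().requestHandler(router).listen(8080);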

We log the raw metric information in a window so it can be seen, and use a geographical mapping library (Leaflet) to draw the original cells, modifying the opacity based on the metric’s value:

By modifying the starting latitude and longitude—or the cell size—in both index.html and TripConvertApp, you can change the grid you are working with. You can also adjust the aggregate function’s logic to calculate alternative metrics from the data:

Create a Kafka connector

Up until now, the producer we have been using has been sufficient, even though the JAR (and image) are bloated by the additional data file. However, if we wanted to process the full 12GB dataset, what we have is not an ideal solution.

The example connector we built relies on hosting the file on an FTP server, but there are existing connectors for several different file stores. We picked an FTP server because it allows our connector to easily communicate with a file external to the cluster. For convenience, we use the Python library pyftpdlib to host the file, with the username and password set to amqstreams. However, hosting the file on any publicly accessible FTP server is sufficient.
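For reference, pyftpdlib can serve a directory straight from the command line; something like the following should work (the flag names may vary between versions, so check python -m pyftpdlib --help):

$ pip install --user pyftpdlib
$ python -m pyftpdlib -p 2121 -u amqstreams -P amqstreams -d /path/to/data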

A Kafka connector consists of the connector itself plus one or more tasks (also known as workers) that perform the data retrieval through calls to the poll() function. The connector passes configuration over to the workers, and several workers can be invoked according to the tasks.max parameter. For this purpose, we created an FTPConnection class, which provides the functions we require from the Apache Commons FTPClient. On each call to poll(), we retrieve the next line from the file and publish this record to the topic provided in the configuration.
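To make that split of responsibilities concrete, here is a simplified sketch of what such a task might look like (FTPConnection and its nextLine() helper are our own classes, named here as assumptions; offset tracking and error handling are omitted):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class TaxiSourceTask extends SourceTask {
    private FTPConnection connection;
    private String topic;

    @Override
    public void start(Map<String, String> props) {
        topic = props.get("connect.ftp.topic");
        // connection = new FTPConnection(props); // connect using the passed-in config
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String line = connection.nextLine(); // hypothetical helper: next CSV line
        if (line == null) {
            return Collections.emptyList(); // nothing new to publish yet
        }
        return Collections.singletonList(
                new SourceRecord(null, null, topic, Schema.STRING_SCHEMA, line));
    }

    @Override
    public void stop() {
        // close the FTP connection
    }

    @Override
    public String version() {
        return "1.0-SNAPSHOT";
    }
}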

We now need to add our connector plugin to the existing amq-streams-kafka-connect Docker image, which is done by adding the JAR to the plugins folder, as described in the Red Hat AMQ Streams documentation. We can then deploy the Kafka Connect cluster using the instructions from the default KafkaConnect example, but adding the spec.image field to our kafka-connect.yaml instead, pointing to the image containing our plugin.
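The image build itself can be as small as a two-line Dockerfile. The base image name/tag and the JAR name below are assumptions based on the conventions in the AMQ Streams documentation and may differ in your environment:

FROM registry.access.redhat.com/amq7/amq-streams-kafka-connect:1.1.0
COPY ./target/taxi-source-connector-1.0-SNAPSHOT.jar /opt/kafka/plugins/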

Kafka Connect is exposed as a RESTful resource, so to check which connector plugins are present, we can run the following GET request:

$ oc exec -c kafka -i my-cluster-kafka-0 -- curl -s -X GET \
    http://my-connect-cluster-connect-api:8083/connector-plugins
  [{"class":"io.strimzi.TaxiSourceConnector","type":"source","version":"1.0-SNAPSHOT"},{"class":"org.apache.kafka.connect.file.FileStreamSinkConnector","type":"sink","version":"2.1.0"},{"class":"org.apache.kafka.connect.file.FileStreamSourceConnector","type":"source","version":"2.1.0"}]

Similarly, to create a new connector we can POST the JSON configuration, as shown in the example below. This new connector instance establishes an FTP connection to the server and streams the data to the taxi-source-topic. For this process to work, the following configuration options must be set correctly:

  • connect.ftp.address – FTP connection URL host:port.
  • connect.ftp.filepath – Path to file on remote FTP server from root.

If needed, add this optional configuration:

  • connect.ftp.attempts – Maximum number of attempts to retrieve a valid FTP connection (default: 3).
  • connect.ftp.backoff.ms – Back-off time in milliseconds between connection attempts (default: 10000ms).
$ oc exec -c kafka -i my-cluster-kafka-0 -- curl -s -X POST \
    -H "Accept:application/json" \
    -H "Content-Type:application/json" \
    http://my-connect-cluster-connect-api:8083/connectors -d @- <<'EOF'

{
    "name": "taxi-connector",
    "config": {
        "connector.class": "io.strimzi.TaxiSourceConnector",
        "connect.ftp.address": "<ip-address>",
        "connect.ftp.user": "amqstreams",
        "connect.ftp.password": "amqstreams",
        "connect.ftp.filepath": "sorteddata.csv",
        "connect.ftp.topic": "taxi-source-topic",
        "tasks.max": "1",
        "value.converter": "org.apache.kafka.connect.storage.StringConverter"
    }
}
EOF

The Apache Kafka project includes a Streams Domain-Specific Language (DSL) built on top of the lower-level Stream Processor API. This DSL provides developers with simple abstractions for performing data processing operations. However, how to build a stream processing pipeline in a containerized environment with Kafka isn’t clear. This two-part article series describes the steps required to build your own Apache Kafka Streams application using Red Hat AMQ Streams.

In this first part of the series, we provide a simple example to use as a building block for part two. First things first: to set up AMQ Streams, follow the instructions in the AMQ Streams getting started documentation to deploy a Kafka cluster. The rest of this article assumes you followed these steps.

Create a producer

To get started, we need a flow of data streaming into Kafka. This data can come from a variety of different sources, but for the purposes of this example, let’s generate sample data using Strings sent with a delay.

Next, we need to configure the Kafka producer so that it talks to the Kafka brokers (see this article for a more in-depth explanation), and provide the topic name to write to along with other information. As our application will be containerized, we can abstract this process away from the internal logic and read from environment variables. This reading could be done in a separate configuration class, but for this simple example, the following is sufficient:

private static final String DEFAULT_BOOTSTRAP_SERVERS = "localhost:9092";
private static final int DEFAULT_DELAY = 1000;
private static final String DEFAULT_TOPIC = "source-topic";

private static final String BOOTSTRAP_SERVERS = "BOOTSTRAP_SERVERS";
private static final String DELAY = "DELAY";
private static final String TOPIC = "TOPIC";

private static final String ACKS = "1";

String bootstrapServers = System.getenv().getOrDefault(BOOTSTRAP_SERVERS, DEFAULT_BOOTSTRAP_SERVERS);
long delay = Long.parseLong(System.getenv().getOrDefault(DELAY, String.valueOf(DEFAULT_DELAY)));
String topic = System.getenv().getOrDefault(TOPIC, DEFAULT_TOPIC);

We can now create appropriate properties for our Kafka producer using the environment variables (or the defaults we set):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
props.put(ProducerConfig.ACKS_CONFIG, ACKS);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");

Next, we create the producer with the configuration properties:

KafkaProducer<String,String> producer = new KafkaProducer<>(props);

We can now stream the data to the provided topic, assigning the required delay between each message:

int i = 0;
while (true) {
    String value = String.format("hello world %d", i);
    ProducerRecord<String, String> record = new ProducerRecord<>(topic, value);
    log.info("Sending record {}", record);
    producer.send(record);
    i++;
    try {
        Thread.sleep(delay);
    } catch (InterruptedException e) {
        // sleep interrupted, continue
    }
}

Now that our application is complete, we need to package it into a Docker image and push to Docker Hub. We can then create a new deployment in our Kubernetes cluster using this image and pass the environment variables via YAML:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: basic-example
  name: basic-producer
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: basic-producer
    spec:
      containers:
      - name: basic-producer
        image: <docker-user>/basic-producer:latest
        env:
          - name: BOOTSTRAP_SERVERS
            value: my-cluster-kafka-bootstrap:9092
          - name: DELAY
            value: "1000"
          - name: TOPIC
            value: source-topic
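
Assuming the YAML above is saved as basic-producer.yaml (the filename is our choice), the build-and-deploy loop might look like this:

$ docker build -t <docker-user>/basic-producer:latest .
$ docker push <docker-user>/basic-producer:latest
$ oc apply -f basic-producer.yaml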

We can check that data is arriving at our Kafka topic by consuming from it:

$ oc run kafka-consumer -ti --image=registry.access.redhat.com/amq7/amq-streams-kafka:1.1.0-kafka-2.1.1 --rm=true --restart=Never -- bin/kafka-console-consumer.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic source-topic --from-beginning

The output should look like this:

  hello world 0
  hello world 1
  hello world 2
  ...
Create a Streams application

A Kafka Streams application typically reads/sources data from one or more input topics, and writes/sends data to one or more output topics, acting as a stream processor. We can set up the properties and configuration the same way as before, but this time we need to specify a SOURCE_TOPIC and a SINK_TOPIC.
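As a sketch, the Streams configuration might look like the following (the application ID shown is an arbitrary choice; bootstrapServers is read from the environment as before):

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "basic-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
// Default SerDes used when reading from and writing to topics
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

String sourceTopic = System.getenv().getOrDefault("SOURCE_TOPIC", "source-topic");
String sinkTopic = System.getenv().getOrDefault("SINK_TOPIC", "sink-topic");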

To start, we create the source stream:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream(sourceTopic);

We can now perform an operation on this source stream. For example, we can create a new stream, called filtered, which only contains records with an even-numbered index:

KStream<String, String> filtered = source
        .filter((key, value) -> {
            int i = Integer.parseInt(value.split(" ")[2]);
            return (i % 2) == 0;
        });

This new stream can then output to the sinkTopic:

filtered.to(sinkTopic);

We have now created the topology defining our stream application’s operations, but we do not have it running yet. Getting it running requires creating the streams object, setting up a shutdown handler, and starting the stream:

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
final CountDownLatch latch = new CountDownLatch(1);

// attach shutdown handler to catch control-c
Runtime.getRuntime().addShutdownHook(new Thread("streams-shutdown-hook") {
    @Override
    public void run() {
        streams.close();
        latch.countDown();
    }
});

try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}
System.exit(0);

There you go. It’s as simple as this to get your Streams application running. Build the application into a Docker image, deploy in a similar way to the producer, and you can watch the SINK_TOPIC for the output.

The post Building Apache Kafka Streams applications using Red Hat AMQ Streams: Part 1 appeared first on Red Hat Developer Blog.


Red Hat Software Collections supply the latest, stable versions of development tools for Red Hat Enterprise Linux via two release trains per year. As part of the latest Software Collections 3.3 release, we are pleased to share that Ruby 2.6 is now generally available and supported on Red Hat Enterprise Linux 7.

The new Ruby 2.6.2 introduces several performance improvements, bug fixes, and new features:

  • Notable enhancements include:
    • Constant names are now allowed to begin with a non-ASCII capital letter.
    • Support for an endless range has been added.
    • A new Binding#source_location method has been provided.
    • $SAFE is now a process global state and it can be set back to 0.
  • The following performance improvements have been implemented:
    • The Proc#call and block.call processes have been optimized.
    • A new garbage collector managed heap, Transient heap (theap), has been introduced.
    • Native implementations of coroutines for individual architectures have been introduced.

Package name: rh-ruby26

Container image: rhscl/ruby-26-rhel7

System support: RHEL 7 for x86_64, s390x, aarch64, ppc64le
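
Installing and enabling the collection follows the usual Software Collections pattern:

$ sudo yum install rh-ruby26
$ scl enable rh-ruby26 bash
$ ruby --version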

About Software Collections

Twice a year, Red Hat distributes new versions of compiler toolsets, scripting languages, open source databases, web tools, etc. so that application developers will have access to the latest, stable versions.

These Red Hat supported offerings are packaged as Red Hat Software Collections, Red Hat Developer Toolset with GCC, and the recently added compiler toolsets Clang/LLVM, Go, and Rust. All are yum installable and are included in most Red Hat Enterprise Linux subscriptions and all Red Hat Enterprise Linux Developer Subscriptions.

Most Red Hat Software Collections and Red Hat Developer Toolset components are also available as Linux container images for hybrid cloud development across Red Hat Enterprise Linux, Red Hat OpenShift Container Platform, etc.


The post Ruby 2.6 now available on Red Hat Enterprise Linux 7 appeared first on Red Hat Developer Blog.


Since the general release of Red Hat Enterprise Linux 8, we’ve had great response from those of you who have downloaded the product and used our complimentary RHEL 8 resources. RHEL 8 is the most developer-friendly version ever, but you may still have questions.

Join us on June 18 for our comprehensive virtual event: Conquer complexity with Red Hat Enterprise Linux 8. In this event, experts John Gantz, Senior Vice President, IDC, and Ron Pacheco, Director, Product Management Global, Red Hat, will explain what RHEL 8 can do for your organization.

In addition to development, topics will include management, scalability, performance, workloads and migration, security, and deploying to a hybrid cloud.

This event includes four tracks:

Track 1 – Developing Applications Quickly and with Confidence
Track 2 – Deploying and Managing the Intelligent OS
Track 3 – The Value of RHEL 8: A Foundation for Modern IT Environments
Track 4 – The Bedrock of Security

View the complete agenda for more details and register today.

The post Virtual event: Conquer complexity with Red Hat Enterprise Linux 8 appeared first on Red Hat Developer Blog.


In the fifth and final part of this series, we will look at exposing Apache Kafka in Strimzi using Kubernetes Ingress. This article will explain how to use Ingress controllers on Kubernetes, how Ingress compares with Red Hat OpenShift routes, and how it can be used with Strimzi and Kafka. Off-cluster access using Kubernetes Ingress is available only from Strimzi 0.12.0. (Links to previous articles in the series can be found at the end.)

Note: Productized and supported versions of the Strimzi and Apache Kafka projects are available as part of the Red Hat AMQ product.

Kubernetes Ingress

Ingress is a Kubernetes API for managing external access to HTTP/HTTPS services, which was added in Kubernetes 1.1. Ingress is the Kubernetes counterpart to Red Hat OpenShift routes, which we discussed previously. It acts as a Layer 7 load balancer for HTTP or HTTPS traffic. Ingress resources define the rules for routing the traffic to different services and pods, while an Ingress controller takes care of the actual routing. For more information about Ingress, check out the Kubernetes website.

Ingress is a sort of strange part of the Kubernetes API. The Ingress API itself is part of every Kubernetes cluster, but the Ingress controller that would do the routing is not part of core Kubernetes. So, while you may be able to create the Ingress resources, there may be nothing to actually route the traffic.

For the Ingress resources to actually do something, you need to make sure an Ingress controller is installed. There are many different Ingress controllers. The Kubernetes project itself maintains two: the NGINX Ingress Controller and the GCE Ingress Controller.

Many additional controllers are created and maintained by different communities and companies. A list of different controllers can be found on the Kubernetes website.

Most of the controllers rely on a load balancer or node port service, which gets the external traffic to the controller. Once the traffic reaches the controller, the controller routes it to the different services and pods based on the rules specified in the Ingress resource. The controller itself usually also runs as yet another application inside the Kubernetes cluster.

Some of the controllers are tailored for a specific public cloud. For example, the AWS ALB Ingress Controller provisions the AWS Application Load Balancer to do the routing instead of doing it inside a pod in your Kubernetes cluster.

Ingress offers a lot of functionality for HTTP applications such as:

  • TLS termination
  • Redirecting from HTTP to HTTPS
  • Routing based on HTTP request path

Some of the controllers, such as the NGINX controller, also offer TLS passthrough, which is a feature we use in Strimzi.

Using Ingress in Strimzi

Ingress support in Strimzi has been added in Strimzi 0.12.0. It uses TLS passthrough and was tested with the NGINX Ingress Controller. Before using it, please make sure that the TLS passthrough is enabled in the controller. Note that Ingress support in Strimzi 0.12.0 is experimental. If you have any feedback or want to help make it work with different Ingress controllers, you can get in touch with us through Slack, our mailing list, or GitHub.
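In the NGINX Ingress Controller, TLS passthrough is off by default and is enabled with a controller argument; the relevant excerpt of the controller’s Deployment looks roughly like this:

spec:
  containers:
  - name: nginx-ingress-controller
    args:
    - /nginx-ingress-controller
    - --enable-ssl-passthrough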

Although some Ingress controllers also support working directly with TCP connections, TLS passthrough seems to be more widely supported. Therefore, we decided to prefer TLS passthrough in Strimzi. That also means that when using Ingress, TLS encryption will always be enabled.

One of the main differences of Ingress, as compared with OpenShift routes, is that for Ingress you must specify the host address in your Kafka custom resource. The router in OpenShift will automatically assign a host address based on the route name and the project. In Ingress, however, the host address must be specified in the Ingress resource. You also have to take care that DNS resolves the host address to the Ingress controller. Strimzi cannot generate it for you, because it does not know which DNS addresses are configured for the Ingress controller.

If you want to try it, for example, on Minikube or in other environments where you don’t have any managed DNS service to add the hosts for the Kafka cluster, you can use a wildcard DNS service, such as nip.io or xip.io, and set it to point to the IP address of your Ingress controller. For example, you can do: broker-0.<minikube-ip-address>.nip.io.

The way Strimzi uses Ingress to expose Apache Kafka should be familiar to you from the previous articles. We create one service as a bootstrap service and additional services for individual access to each of the Kafka brokers in the cluster. For each of these services, we will also create one Ingress resource with the corresponding TLS passthrough rule.
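To give an idea of what Strimzi generates, a per-broker Ingress resource will look roughly like the following sketch (resource names and the host are illustrative; the ssl-passthrough annotation is specific to the NGINX controller):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-cluster-kafka-0
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
spec:
  tls:
  - hosts:
    - broker-0.192.168.64.46.nip.io
  rules:
  - host: broker-0.192.168.64.46.nip.io
    http:
      paths:
      - backend:
          serviceName: my-cluster-kafka-0
          servicePort: 9094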

When configuring Strimzi to use Ingress, you must specify the type of the external listener as ingress and specify the Ingress hosts used for the different brokers, as well as for bootstrap, in the configuration field:

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    listeners:
      # ...
      external:
        type: ingress
        configuration:
          bootstrap:
            host: bootstrap.192.168.64.46.nip.io
          brokers:
          - broker: 0
            host: broker-0.192.168.64.46.nip.io
          - broker: 1
            host: broker-1.192.168.64.46.nip.io
          - broker: 2
            host: broker-2.192.168.64.46.nip.io
    # ...

Using Ingress in your clients is very similar to OpenShift routes. Because it always uses TLS encryption, you have to first download the server certificate (replace my-cluster with the name of your cluster):

kubectl get secret my-cluster-cluster-ca-cert -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
keytool -import -trustcacerts -alias root -file ca.crt -keystore truststore.jks -storepass password -noprompt

Once you have the TLS certificate, you can use the bootstrap host you specified in the Kafka custom resource and connect to the Kafka cluster. Because Ingress uses TLS passthrough, you always have to connect on port 443. The following example uses the kafka-console-producer.sh utility, which is part of Apache Kafka:

bin/kafka-console-producer.sh --broker-list <bootstrap-address>:443 --producer-property security.protocol=SSL --producer-property ssl.truststore.password=password --producer-property ssl.truststore.location=./truststore.jks --topic <your-topic>

For example:

bin/kafka-console-producer.sh --broker-list bootstrap.192.168.64.46.nip.io:443 --producer-property security.protocol=SSL --producer-property ssl.truststore.password=password --producer-property ssl.truststore.location=./truststore.jks --topic <your-topic>

For more details, see the Strimzi documentation.

Customizations

DNS annotations

Many users employ additional tools, such as ExternalDNS, to automatically manage DNS records for their load balancers. ExternalDNS uses annotations on Ingress resources to manage their DNS names. It also supports many different DNS services, such as Amazon AWS Route53, Google Cloud DNS, Azure DNS, etc.

Strimzi lets you assign these annotations through the Kafka custom resource using a field called dnsAnnotations. Using the DNS annotations is simple:

# ...
listeners:
  external:
    type: ingress
    configuration:
      bootstrap:
        host: kafka-bootstrap.mydomain.com
      brokers:
      - broker: 0
        host: kafka-broker-0.mydomain.com
      - broker: 1
        host: kafka-broker-1.mydomain.com
      - broker: 2
        host: kafka-broker-2.mydomain.com
    overrides:
      bootstrap:
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-bootstrap.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      brokers:
      - broker: 0
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-0.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      - broker: 1
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-1.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      - broker: 2
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-2.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
# ...

Again, Strimzi lets you configure the annotations directly. That gives you more freedom and makes this feature usable even when you use a tool other than ExternalDNS. It also lets you configure options other than just the DNS names, such as the time-to-live of the DNS records, and so on.

Pros and cons

Kubernetes Ingress is not always easy to use because you have to install the Ingress controller, and you have to configure the hosts. It is also available only with TLS encryption because of the TLS passthrough functionality that Strimzi uses. However, it can offer an interesting option for clusters where node ports are not an option, for example, for security reasons and cases in which using load balancers would be too expensive.

When using the Strimzi Kafka operator with Ingress, you must also consider performance. The Ingress controller usually runs inside your cluster as yet another deployment and adds an additional step through which your data has to flow between your clients and the brokers. You need to scale it properly to ensure it will not be a bottleneck for your clients.

Thus, Ingress might not be the best option when most of your applications using Kafka are outside of your Kubernetes cluster and you need to handle tens or hundreds of megabytes of throughput per second. However, especially in situations when most of your applications are inside your cluster and only a minority are outside and when the throughput you need is not so high, Ingress might be a convenient option.

The Ingress API and the Ingress controllers can usually be installed on OpenShift clusters as well, but they do not offer any advantages over the OpenShift routes. So, on OpenShift, you will likely want to use OpenShift routes instead.

What’s next?

This was, for now, the last post in this series about accessing Strimzi Kafka clusters. In the five articles, we covered all the mechanisms the Strimzi operator supports for accessing Apache Kafka from both inside and outside of your Kubernetes or Red Hat OpenShift cluster. We will, of course, keep posting articles on other topics, and if something about accessing Kafka changes in the future, we will add to this series.

If you liked this series, star us on GitHub and follow Strimzi on Twitter to stay up to date with the project.

Read more

The post Accessing Apache Kafka in Strimzi: Part 5 – Ingress appeared first on Red Hat Developer Blog.


Container-native development is primarily about consistency, flexibility, and scalability. Legacy Application Lifecycle Management (ALM) tooling often is not, leading to situations where it:

  • Places artificial barriers on development speed, and therefore time to value,
  • Creates single points of failure in the infrastructure, and
  • Stifles innovation through inflexibility.

Ultimately, developers are expensive, but they are the domain experts in what they build. With development teams often being treated as product teams (who own the entire lifecycle and support of their applications), it becomes imperative that they control the end-to-end process on which they rely to deliver their applications into production. This means decentralizing both the ALM process and the tooling that supports that process. In this article, we’ll explore this approach and look at a couple of implementation scenarios.

Figure 1: Move from centralized ALM to decentralized ALM.

Pipelines

Although this approach places more control in the hands of developers, it also means that they are directly responsible for what they ship. To cope with this, the automation of application delivery becomes even more critical than in a non-containerized world. It turns the domain of trust on its head—as an organization, you should now trust the process of container delivery, rather than the content of the container itself.

This is a more factory-oriented approach that allows organizations to scale their application delivery without incurring a significant governance overhead when applied to every single project running in the container platform. Red Hat has found through previous container-related projects that an efficient way to address this is via pipelines, although the same net result can be achieved with many popular build automation tools.

To this end, we often recommend the creation of a Development Community of Practice (name subject to local influences) that would own pipeline development within the container platform. The Development Community of Practice would consist of representatives of development teams working on the platform and seek to drive standards around technologies and approaches, while also serving as a forum for knowledge transfer and enablement.

The Development Community of Practice would create a library of pipeline steps (A Shared Library in Jenkins terms) that could be used to create either technology-specific (Java, .NET, Node.js, etc.) reference pipelines for users with limited interest in directly engaging with the platform or bespoke pipelines that would cater for specific use cases.

This Shared Library would be derived in the following manner:

  • Capture the critical steps on the application delivery pathway for a given technology stack. Undertake this activity with Platform, Development, Business, and Security stakeholders to agree on a mutual definition of “minimal good.”
  • Create Steps in the Shared Library to meet all of the requirements captured as part of the discovery assessment.
  • Actively audit those steps to prove compliance to interested parties. This could include things like automated test result capture, container CVE scanning, code coverage/quality assessments, and automated approvals.
  • Create reference pipelines that meet the definition of “minimal good” using this library of steps.
  • Open/Inner Source the Shared Library to the wider development community within the organization to allow stakeholders to extend, customize, and contribute repeatable steps and further reference pipelines that increase the capabilities of the Library within the environment.
  • Ensure both steps and reference pipelines are documented. Good documentation of the Shared Library is critical to driving adoption. A perfectly implemented, poorly documented solution is of no practical use to anyone. The steps are now part of the platform infrastructure and should be treated as such.

The use of pipelines in this manner allows platform providers to drive two distinct behaviors:

  1. Users who have no interest or requirement to interact directly with the container platform can utilize build automation directly. A pipeline allows them to take source code and deliver production-ready container images via the requisite governance step gates with near-zero interaction with the container platform.
  2. Users who have an understanding of pipelines and containerization technologies who want or need to add their own bespoke steps on top of the core steps are perfectly capable of doing so. They must ensure that these steps also meet the governance requirements set out by the Development Community of Practice and associated stakeholders.

These are not one-time actions. Management and maintenance of the pipeline lifecycle are just as critical as management and maintenance of the applications themselves.

  • The Development Community of Practice must continuously evaluate the requirements of the development community and improve and refine the pipelines and steps as required. The development community should also be permitted to fork, adapt, and push changes back to the Shared Library.
  • For Platform providers, the responsibility then becomes more about providing the technologies and capabilities that these pipelines rely on in a containerized manner. They must also ensure these capabilities are kept up-to-date, and the lifecycle of those containers is managed accordingly.
  • The Business must manage changing application requirements effectively and understand and accept the dependencies these changes may create in the automation solution.
  • Security teams must continuously assess new practices, requirements, standards, and technologies, and work with the Business, the Developers, and the Platform providers to implement these as required, in a sensible and controlled manner.

All of these aspects rely on constant communication and a continuous feedback cycle between all stakeholders of understanding the environment, implementing changes, and reviewing both the effects of the changes on the pipelines and the use of the Shared Library as a whole.

Figure 2: Shared Library lifecycle.

Ending up in a “Conway’s Law” situation is a total waste of time and effort for all concerned. However, committing to standards-based good practice around pipelines and container-native development provides developers with a path of least resistance between their source code and production and allows every stakeholder to recognize the benefits of containerization quickly.

Disconnected Environments

In a disconnected environment, it is advisable to follow the principles of decentralized ALM as much as is practical. However, compromises will always be made. A key compromise often centers around Dependency Management—how do you ensure that an application has all of its build and runtime dependencies available to it in a container platform with no direct connection to the public internet?

As with fully connected environments, it is good practice to use a dependency management solution (e.g., Sonatype Nexus, or JFrog Artifactory) to present dependencies to automated build processes in disconnected environments.

Once the content is in the central instance, developers can then stand up their own dependency management solutions for their projects, talking back to it. This approach allows them to skirt the obvious pitfalls of a centralized single source of truth for dependencies in the container platform.

Typically, we talk about a “hard disconnect” (whereby there is no physical connection at all to the public internet) or a “soft disconnect” (whereby access to the public internet is heavily restricted to certain hosts or protocols). In either scenario, no direct curation of content should be required. Step gates built into the pipeline would ideally be configured to automatically scan application dependencies and fail in the case of vulnerabilities, errata, or license concerns being discovered.

Hard disconnect scenario

In this scenario, content is downloaded on a public internet connection and then uploaded to the chosen dependency management solution in the disconnected environment, where it can be resolved by the automation tooling within the environment. This process is likely to be manual and time-consuming.

Figure 3: Hard disconnect scenario.

Soft disconnect scenario

In this scenario, the dependency management solution is allowed to proxy through or is whitelisted to directly allow access to the public internet. This process permits a single controlled connection to the repositories containing the dependencies. This scenario is a lot more flexible, as no manual interaction is required to provide content to the environment.

Figure 4: Soft disconnect scenario.

Conclusion

Sorry, I lied. There is no conclusion. You are creating the building blocks on which your container-native developments will rely for their entire lifecycle, and that lifecycle should be under constant review.

However, by decentralizing your automation dependencies and open sourcing the means by which you interact with those dependencies, you are giving the development communities the means to scale their activities to meet the ever-changing needs of all stakeholders concerned, without falling victim to the legacy of traditional approaches.

The post Application lifecycle management for container-native development appeared first on Red Hat Developer Blog.


In this fourth article of our series about accessing Apache Kafka clusters in Strimzi, we will look at exposing Kafka brokers using load balancers. (See links to previous articles at end.) This article will explain how to use load balancers in public cloud environments and how they can be used with Apache Kafka.

Note: Productized and supported versions of the Strimzi and Apache Kafka projects are available as part of the Red Hat AMQ product.

Load balancers

Load balancers automatically distribute incoming traffic across multiple targets. Different implementations do traffic distribution on different levels:

  • Layer 7 load balancers can distribute the traffic on the level of individual requests (e.g., HTTP requests).
  • Layer 4 load balancers distribute the TCP connections.

Load balancers are available in most public and private clouds. Examples of load balancers are Elastic Load Balancing services from Amazon AWS, Azure Load Balancer in Microsoft Azure public cloud, and Google Cloud Load Balancing service from Google. Load balancing services are also available in OpenStack. If you run your Kubernetes or Red Hat OpenShift cluster on bare metal, you might not have load balancers available on demand. In that case, using node ports, OpenShift routes, or Ingress might be a better option for you.

There’s no need to be intimidated by the long list of different load balancing services. Most of them are well integrated with Kubernetes. When a Kubernetes Service is configured with the type LoadBalancer, Kubernetes will automatically create the load balancer through the cloud provider, which understands the different services offered by a given cloud. Thanks to that, Kubernetes applications, including Strimzi, do not need to understand the differences and should work everywhere the cloud infrastructure and Kubernetes are properly integrated.
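For illustration, this is all it takes to request a load balancer from the cloud provider with a plain Kubernetes Service (the names here are examples; Strimzi creates the equivalent services for you):

apiVersion: v1
kind: Service
metadata:
  name: example-loadbalancer
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
  - port: 9094
    targetPort: 9094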

Using load balancers in Strimzi

None of the common load balancing services supports the Kafka protocol, so Strimzi always uses Layer 4 load balancing. Because Layer 4 works on the TCP level, the load balancer always takes the whole TCP connection and directs it to one of the targets. This approach has some advantages; you can, for example, decide whether TLS encryption should be enabled or disabled.

To give Kafka clients access to the individual brokers, Strimzi creates a separate service with type=LoadBalancer for each broker. As a result, each broker gets a separate load balancer. Note that despite the Kubernetes service being of the load balancer type, the load balancer is still a separate entity managed by the infrastructure/cloud. A Kafka cluster with N brokers will need N+1 load balancers.

At the beginning of this article, we defined a load balancer as something that distributes incoming traffic across multiple targets. However, as you can see from the diagram above, the per-broker load balancers have only one target and are technically not load balancing. That is true, but, in most cases, the actual implementation is a bit more complicated.

When Kubernetes creates the load balancer, it usually targets all the nodes of your Kubernetes cluster, not just the nodes where your application is actually running. Thus, although the TCP connection will always end on the same node and in the same broker, it might be routed through the other nodes of your cluster.

When the connection is sent by the load balancer to the node that does not host the Kafka broker, the kube-proxy component of Kubernetes will forward it to the right node where the broker runs. This can lead to delays because some connections might be routed through more hops than absolutely necessary.

The only exception is the bootstrap load balancer that is distributing the connections to all brokers in your Kafka cluster.

You can easily configure Strimzi Kafka operator to expose your Kafka cluster using load balancers by selecting the loadbalancer type in the external listener:

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    listeners:
      # ...
      external:
        type: loadbalancer
    # ...

Load balancers, in common with the node port external listener, have TLS enabled by default. If you don’t want to use TLS encryption, you can easily disable it:

    # ...
    listeners:
      external:
        type: loadbalancer
        tls: false
    # ...

After Strimzi creates the load balancer type Kubernetes services, the load balancers will be automatically created. Most clouds will automatically assign the load balancer some DNS name and IP addresses. These will be automatically propagated into the status section of the Kubernetes service. Strimzi will read it from there and use it to configure the advertised address in the Kafka brokers.

When available, Strimzi currently prefers the DNS name over the IP address. The reason is that IP addresses are often volatile, whereas the DNS name is fixed for the lifetime of the load balancer (this applies at least to Amazon AWS ELB load balancers). If the load balancer has only an IP address, Strimzi will, of course, use it.

As a user, you should always use the bootstrap load balancer address for the initial connection. You can get the address from the status section with following command (replace my-cluster with the name of your cluster):

kubectl get service my-cluster-kafka-external-bootstrap -o=jsonpath='{.status.loadBalancer.ingress[0].hostname}{"\n"}'

If no hostname is set, you can also try the IP address (replace my-cluster with the name of your cluster):

kubectl get service my-cluster-kafka-external-bootstrap -o=jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'

The DNS or IP address returned by one of these commands can be used in your clients as the bootstrap address. The load balancers always use port 9094 to expose Apache Kafka. The following example uses the kafka-console-producer.sh utility, which is part of Apache Kafka, to connect to the cluster:

bin/kafka-console-producer.sh --broker-list <load-balancer-address>:9094 --topic <your-topic>

For more details, see the Strimzi documentation.

Customizations

Advertised hostnames and ports

In the section above, I explained how Strimzi always prefers to use the DNS name over the IP address when configuring the advertised listener address in Kafka brokers. Sometimes, this can be a problem, for example, when for whatever reason the DNS resolution doesn’t work for your Kafka clients. In that case, you can override the advertised hostnames in the Kafka custom resource.

# ...
listeners:
  external:
    type: loadbalancer
    overrides:
      brokers:
      - broker: 0
        advertisedHost: 216.58.201.78
      - broker: 1
        advertisedHost: 104.215.148.63
      - broker: 2
        advertisedHost: 40.112.72.205
# ...

I hope that in one of the future versions we will give users a more comfortable option to choose between the IP address and hostname. But this feature might also be useful for handling different kinds of network configurations and translations. If needed, you can use it to override the advertised port numbers as well.

# ...
listeners:
  external:
    type: loadbalancer
    authentication:
      type: tls
    overrides:
      brokers:
      - broker: 0
        advertisedHost: 216.58.201.78
        advertisedPort: 12340
      - broker: 1
        advertisedHost: 104.215.148.63
        advertisedPort: 12341
      - broker: 2
        advertisedHost: 40.112.72.205
        advertisedPort: 12342
# ...

Keep in mind that the advertisedPort option doesn’t really change the port used in the load balancer itself. It changes only the port number used in the advertised.listeners Kafka broker configuration parameter.

Internal load balancers

Many cloud providers differentiate between public and internal load balancers. The public load balancers will get a public IP address and DNS name, which will be accessible from the whole internet. On the other hand, the internal load balancers will only use private IP addresses and hostnames and will be available only from certain private networks (e.g., from other machines in the same Amazon AWS VPC).

You may want to share your Kafka cluster managed by Strimzi with applications running outside of your Kubernetes or OpenShift cluster but not necessarily with the whole world. In such cases, the internal load balancers might be handy.

Kubernetes will usually try to create a public load balancer by default, and users can use special annotations to indicate that a given Kubernetes service with the load balancer type should have its load balancer created as internal.

For example:

  • For Google Cloud, use the annotation cloud.google.com/load-balancer-type: "Internal".
  • On Microsoft Azure, use service.beta.kubernetes.io/azure-load-balancer-internal: "true".
  • Amazon AWS uses service.beta.kubernetes.io/aws-load-balancer-internal: 0.0.0.0/0.
  • And, OpenStack uses service.beta.kubernetes.io/openstack-internal-load-balancer: "true".

As you can see, most of these are completely different. So, instead of integrating all of these into Strimzi, we decided to give you the option to specify the annotations for the services that Strimzi creates. Thanks to that, you can use these annotations with cloud providers we’ve never even heard of. The annotations can be specified in the template property in Kafka.spec.kafka. The following example shows the OpenStack annotations:

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    template:
      externalBootstrapService:
        metadata:
          annotations:
            service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
      perPodService:
        metadata:
          annotations:
            service.beta.kubernetes.io/openstack-internal-load-balancer: "true"
    # ...

You can specify different annotations for the bootstrap and the per-broker services. After you specify these annotations, they will be passed by Strimzi to the Kubernetes services, and the load balancers will be created accordingly.
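
If you want to double-check that the annotations were applied, you can read them back from the generated services. A minimal sketch, assuming the example cluster name my-cluster used throughout this series:

kubectl get service my-cluster-kafka-external-bootstrap -o=jsonpath='{.metadata.annotations}{"\n"}'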

DNS annotations

This feature will be available from Strimzi 0.12.0.

Many users employ additional tools, such as ExternalDNS, to automatically manage DNS records for their load balancers. ExternalDNS uses annotations on load balancer type services (and Ingress resources—more about that next time) to manage their DNS names. It supports many different DNS services, such as Amazon AWS Route53, Google Cloud DNS, Azure DNS, etc.

Strimzi lets you assign these annotations through the Kafka custom resource using a field called dnsAnnotations. The main difference from the template annotations mentioned previously is that dnsAnnotations lets you configure the annotations per broker, whereas the perPodService option of the template field sets the annotations on all services.

Using the DNS annotations is simple:

# ...
listeners:
  external:
    type: loadbalancer
    overrides:
      bootstrap:
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-bootstrap.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      brokers:
      - broker: 0
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-0.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      - broker: 1
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-1.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
      - broker: 2
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-2.mydomain.com.
          external-dns.alpha.kubernetes.io/ttl: "60"
# ...

Again, Strimzi lets you configure the annotations directly, which gives you more freedom and hopefully makes this feature usable even when you use a tool other than ExternalDNS. It also lets you configure other options besides the DNS names, such as the time-to-live of the DNS records, etc.
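
Once ExternalDNS has synced the records, you can verify them with ordinary DNS tooling. A quick check, assuming the example hostnames above have actually been created:

dig +short kafka-bootstrap.mydomain.com
dig +short kafka-broker-0.mydomain.com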

Note that the addresses used in the annotations will not be added to the TLS certificates or configured in the advertised listeners of the Kafka brokers. To do so, you need to combine them with the advertised name configuration described in one of the previous sections:

# ...
listeners:
  external:
    type: loadbalancer
    overrides:
      bootstrap:
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-bootstrap.mydomain.com.
        address: kafka-bootstrap.mydomain.com
      brokers:
      - broker: 0
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-0.mydomain.com.
        advertisedHost: kafka-broker-0.mydomain.com
      - broker: 1
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-1.mydomain.com.
        advertisedHost: kafka-broker-1.mydomain.com
      - broker: 2
        dnsAnnotations:
          external-dns.alpha.kubernetes.io/hostname: kafka-broker-2.mydomain.com.
        advertisedHost: kafka-broker-2.mydomain.com
# ...

Pros and cons

The integration of load balancers into Kubernetes and Red Hat OpenShift is convenient and makes them easy to use. Strimzi can use the power of Kubernetes to provision load balancers on many different public and private clouds. Thanks to the TCP routing, you can freely decide whether you want to use TLS encryption or not.

Load balancers stand between the applications and the nodes of the Kubernetes cluster. They minimize the attack surface and, for this reason, many admins would prefer load balancers over node ports.

Load balancers usually deliver very good performance. The typical load balancer is a service that runs outside of your cluster, so you do not need to worry about the resources it needs, how much load it will put on your cluster, and so on. However, there are some considerations to keep in mind:

  • In most cases, Kubernetes will configure them to load balance across all cluster nodes. So, although there's only one broker where the traffic will ultimately arrive, different connections might be routed through different cluster nodes and forwarded by kube-proxy to the node where the Kafka broker actually runs. This would not happen, for example, with node ports, where the advertised address points directly to the node where the broker is running.
  • The load balancer itself is yet another service that the connection needs to go through, which might add a bit of latency.

Another aspect to consider is the price. In public clouds, load balancers are typically not free. Usually, you have to pay a fixed fee per instance, which depends only on how long the load balancer exists plus some fee for every transferred gigabyte. Strimzi always requires N+1 load balancers (where N is the number of brokers), one for each broker plus one for the bootstrapping. That means you will always need multiple load balancers, and the fees add up. Additionally, N of those load balancers don’t even balance any load, because there is only a single broker behind them.

The post Accessing Apache Kafka in Strimzi: Part 4 – Load Balancers appeared first on Red Hat Developer Blog.


In the third part of this article series (see links to previous articles below), we will look at how Strimzi exposes Apache Kafka using Red Hat OpenShift routes. This article will explain how routes work and how they can be used with Apache Kafka. Routes are available only on OpenShift, but if you are a Kubernetes user, don’t be sad; a forthcoming article in this series will discuss using Kubernetes Ingress, which is similar to OpenShift routes.

Note: Productized and supported versions of the Strimzi and Apache Kafka projects are available as part of the Red Hat AMQ product.

Red Hat OpenShift routes

Routes are an OpenShift concept for exposing services to the outside of the Red Hat OpenShift platform. Routes handle both data routing and DNS resolution. DNS resolution is usually handled using wildcard DNS entries, which allow OpenShift to assign each route its own DNS name based on the wildcard entry. Users don't have to do anything special to handle the DNS records. If you don't own any domains where you can set up the wildcard entries, OpenShift can use services such as nip.io for the wildcard DNS routing. Data routing is done using the HAProxy load balancer, which serves as the router behind the domain names.

The main use case of the router is HTTP(S) routing. The routes are able to do path-based routing of HTTP and HTTPS (with TLS termination) traffic. In this mode, the HTTP requests will be routed to different services based on the request path. However, because the Apache Kafka protocol is not based on HTTP, the HTTP features are not very useful for Strimzi and Kafka brokers.

Luckily, the routes can also be used for TLS passthrough. In this mode, the router uses TLS Server Name Indication (SNI) to determine the service to which the traffic should be routed and passes the TLS connection to the service (and eventually to the pod backing the service) without decrypting it. This mode is what Strimzi uses to expose Kafka.

If you want to learn more about OpenShift routes, check the OpenShift documentation.

Exposing Kafka using OpenShift routes

Exposing Kafka using OpenShift routes is probably the easiest of all the available listener types. All you need to do is to configure it in the Kafka custom resource.

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    listeners:
      # ...
      external:
        type: route
    # ...

The Strimzi Kafka Operator and OpenShift will take care of the rest. To provide access to the individual brokers, we use the same tricks we used with node ports, as described in the previous article. We create a dedicated service for each broker, which is used to address that broker directly. Apart from that, we also use one service for bootstrapping the clients. This service round-robins across all available Kafka brokers.

Unlike when using node ports, these services will be only regular Kubernetes services of the clusterIP type. The Strimzi Kafka operator will also create a Route resource for each of these services, which will expose them using the HAProxy router. The DNS addresses assigned to these routes will be used by Strimzi to configure the advertised addresses in the different Kafka brokers.

Kafka clients will connect to the bootstrap route, which will route them through the bootstrap service to one of the brokers. From this broker, clients will get the metadata that will contain the DNS names of the per-broker routes. The Kafka clients will use these addresses to connect to the routes dedicated to the specific broker, and the router will again route it through the corresponding service to the right pod.
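
To see the bootstrap and per-broker routes that Strimzi created, you can list them by the cluster label (a sketch, assuming a cluster named my-cluster and Strimzi's usual strimzi.io labels):

oc get routes -l strimzi.io/cluster=my-cluster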

As explained in the previous section, the router's main use case is routing HTTP(S) traffic. Therefore, it always listens on ports 80 and 443. Because Strimzi uses the TLS passthrough functionality, the following will be true:

  • The port will always be 443 as the port used for HTTPS.
  • The traffic will always use TLS encryption.

Getting the address to connect to with your client is easy. As mentioned previously, the port will always be 443. This can cause problems when users try to connect to port 9094 instead of 443, but 443 is always the correct port number with OpenShift routes. You can find the host in the status of the Route resource (replace my-cluster with the name of your cluster):

oc get routes my-cluster-kafka-bootstrap -o=jsonpath='{.status.ingress[0].host}{"\n"}'

By default, the DNS name of the route will be based on the name of the service it points to and on the name of the OpenShift project. For example, for my Kafka cluster named my-cluster running in project named myproject, the default DNS name will be my-cluster-kafka-bootstrap-myproject.<router-domain>.

Because the traffic will always use TLS, you must always configure TLS in your Kafka clients. This includes getting the TLS certificate from the broker and configuring it in the client. You can use the following commands to get the CA certificate used by the Kafka brokers and import it into a Java keystore file, which can be used with Java applications (replace my-cluster with the name of your cluster):

oc extract secret/my-cluster-cluster-ca-cert --keys=ca.crt --to=- > ca.crt
keytool -import -trustcacerts -alias root -file ca.crt -keystore truststore.jks -storepass password -noprompt

With the certificate and address, you can connect to the Kafka cluster. The following example uses the kafka-console-producer.sh utility, which is part of Apache Kafka:

bin/kafka-console-producer.sh --broker-list <route-address>:443 --producer-property security.protocol=SSL --producer-property ssl.truststore.password=password --producer-property ssl.truststore.location=./truststore.jks --topic <topic-name>
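
Alternatively, you can collect the TLS settings in a properties file and pass it with the --producer.config option. A minimal sketch, using the truststore created above (client-ssl.properties is a hypothetical file name):

# client-ssl.properties
security.protocol=SSL
ssl.truststore.location=./truststore.jks
ssl.truststore.password=password

bin/kafka-console-producer.sh --broker-list <route-address>:443 --producer.config client-ssl.properties --topic <topic-name>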

For more details, see the Strimzi documentation.

Customizations

As explained in the previous section, by default, the routes get automatically assigned DNS names based on the name of your cluster and namespace. However, you can customize this and specify your own DNS names:

# ...
listeners:
  external:
    type: route
    authentication:
      type: tls
    overrides:
      bootstrap:
        host: bootstrap.myrouter.com
      brokers:
      - broker: 0
        host: broker-0.myrouter.com
      - broker: 1
        host: broker-1.myrouter.com
      - broker: 2
        host: broker-2.myrouter.com
# ...

The customized names still need to match the DNS configuration of the OpenShift router, but you can give them a friendlier name. The custom DNS names (as well as the names automatically assigned to the routes) will, of course, be added to the TLS certificates and your Kafka clients can use TLS hostname verification.

Pros and cons

Routes are only available on Red Hat OpenShift. So, if you are using Kubernetes, this is clearly a deal-breaking disadvantage. Another potential disadvantage is that routes always use TLS encryption. You will always have to deal with the TLS certificates and encryption in your Kafka clients and applications.

You will also need to carefully consider performance. The OpenShift HAProxy router will act as a middleman between your Kafka clients and brokers. This approach can add latency and can also become a performance bottleneck. Applications using Kafka often generate a lot of traffic—hundreds or even thousands of megabytes per second. Keep this in mind and make sure that the Kafka traffic will still leave some capacity for other applications using the router. Luckily, the OpenShift router is scalable and highly configurable so you can fine-tune its performance and, if needed, even set up a separate instance of the router for the Kafka routes.

The main advantage of using Red Hat OpenShift routes is that they are so easy to get working. Unlike the node ports discussed in the previous article, which are often tricky to configure and require a deeper knowledge of Kubernetes and the infrastructure, OpenShift routes work very reliably out of the box on any OpenShift installation.

The post Accessing Apache Kafka in Strimzi: Part 3 – Red Hat OpenShift routes appeared first on Red Hat Developer Blog.


This article series explains how Apache Kafka and its clients work and how Strimzi makes it accessible for clients running outside of Kubernetes. In the first article, we provided an introduction to the topic, and here we will look at exposing an Apache Kafka cluster managed by Strimzi using node ports.

Specifically, in this article, we’ll look at how node ports work and how they can be used with Kafka. We also will cover the different configuration options available to users and the pros and cons of using node ports.

Note: Productized and supported versions of the Strimzi and Apache Kafka projects are available as part of the Red Hat AMQ product.

Node ports

A NodePort is a special type of Kubernetes service. When such a service is created, Kubernetes will allocate a port on all nodes of the Kubernetes cluster and make sure that all traffic to this port is routed to the service and eventually to the pods behind this service.

The routing of the traffic is done by the kube-proxy Kubernetes component. It doesn’t matter which node your pod is running on. The node ports will be open on all nodes, and the traffic will always reach your pod. So, your clients need to connect to the node port on any of the nodes of the Kubernetes cluster and let Kubernetes handle the rest.

The node port is selected from the port range 30000-32767 by default, but this range can be changed in the Kubernetes configuration (see the Kubernetes docs for more details about configuring the node port range).
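
On clusters where you control the control plane, the range is configured through a kube-apiserver flag. A sketch, shown with the default value:

kube-apiserver --service-node-port-range=30000-32767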

How do we use NodePort services in Strimzi to expose Apache Kafka?

Exposing Kafka using node ports

As a user, you can easily expose Kafka using node ports. All you need to do is to configure it in the Kafka custom resource.

apiVersion: kafka.strimzi.io/v1beta1
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    listeners:
      # ...
      external:
        type: nodeport
        tls: false
    # ...

What happens after you configure it, however, is a bit more complicated.

The first thing we need to address is how the clients will access individual brokers. As explained in the previous article, having a service that will round-robin across all brokers in the cluster will not work with Kafka. The clients need to be able to reach each of the brokers directly.

Inside the Kubernetes cluster, we addressed this by using the pod DNS names as the advertised addresses. The pod hostnames or IP addresses are not recognized outside of Kubernetes, however, so we cannot use them. How does Strimzi solve this?

Instead of using the pod hostnames or IP addresses, we create additional services, one for each Kafka broker. So, in a Kafka cluster with N brokers, we will have N+1 node port services:

  • One can be used by the Kafka clients as the bootstrap service for the initial connection and for receiving the metadata about the Kafka cluster.
  • Another N services—one for each broker—can address the brokers directly.

All of these services are created with the type NodePort. Each of these services will be assigned a different node port so that the traffic for the different brokers can be distinguished.
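
You can see all of these services by listing them with the cluster label (assuming a cluster named my-cluster). You should see the external bootstrap service (my-cluster-kafka-external-bootstrap) and one service per broker (my-cluster-kafka-0, and so on), all with type NodePort:

kubectl get services -l strimzi.io/cluster=my-cluster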

Since Kubernetes 1.9, every pod in a stateful set is automatically labeled with statefulset.kubernetes.io/pod-name, which contains the name of the pod. Using this label in the pod selector inside the Kubernetes service definition allows us to target only the individual Kafka brokers and not the whole Kafka cluster.

The following YAML snippet shows how the service created by Strimzi targets just one pod from the stateful set by using the statefulset.kubernetes.io/pod-name label in the selector:

apiVersion: v1
kind: Service
metadata:
  name: my-cluster-kafka-0
  # ...
spec:
  # ...
  selector:
    statefulset.kubernetes.io/pod-name: my-cluster-kafka-0
    strimzi.io/cluster: my-cluster
    strimzi.io/kind: Kafka
    strimzi.io/name: my-cluster-kafka
  type: NodePort
  # ...

The node port services are just the infrastructure that can route the traffic to the brokers. We still need to configure the Kafka brokers to advertise the right address, so that the clients use this infrastructure. With node ports, the client connecting to the broker needs to connect to:

  • the address of one of the Kubernetes nodes
  • the node port assigned to the service

Strimzi needs to gather these details and configure them as the advertised addresses in the broker configuration. Strimzi uses separate listeners for external and internal access. That means any applications running inside the Kubernetes or Red Hat OpenShift cluster will still use the old services and DNS names as described in the first article.
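
In the Kafka custom resource, these show up as separate entries under listeners. A minimal sketch with both internal and external listeners, using the v1beta1 field names seen throughout this article:

# ...
listeners:
  plain: {}
  tls: {}
  external:
    type: nodeport
    tls: false
# ...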

Although node port services can route the traffic to the broker from all Kubernetes nodes, we can use only a single address, which will be advertised to the clients. Using the address of the actual node where the broker is running will mean less forwarding, but every time the broker restarts, the node might change. Therefore, Strimzi uses an init container, which is run every time the Kafka broker pod starts. It collects the address of the node and uses it to configure the advertised address.

To get the node address, the init container must talk with the Kubernetes API and get the node resource. In the status of the node resource, the address is normally listed as one of the following:

  • External DNS
  • External IP
  • Internal DNS
  • Internal IP
  • Hostname

Sometimes, only some of these are listed in the status. The init container will try to get one of them in the same order as they are listed above and will use the first one it finds.

Once the address is configured, the client can use the bootstrap node port service to make the initial connection. From there, the client will get the metadata containing the addresses of the individual brokers and start sending and receiving messages.

After Strimzi configures everything inside the Kubernetes and Kafka clusters, you need to do just two things:

  • Get the node port number of the external bootstrap service (replace my-cluster with the name of your cluster):
kubectl get service my-cluster-kafka-external-bootstrap -o=jsonpath='{.spec.ports[0].nodePort}{"\n"}'
  • Get the address of one of the nodes in your Kubernetes cluster (replace node-name with the name of one of your nodes – use kubectl get nodes to list all nodes):
kubectl get node node-name -o=jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}'

The node address and node port number give you all that is needed to connect to your cluster. The following example uses the kafka-console-producer.sh utility, which is part of Apache Kafka:

bin/kafka-console-producer.sh --broker-list <node-address>:<node-port> --topic <topic-name>

For more details, see the Strimzi documentation.

Troubleshooting node ports

Node ports are fairly complex to configure, and there are many things that can go wrong.

A common problem is that the address presented to Strimzi by the Kubernetes API in the node resource is not accessible from outside. This can happen, for example, because the DNS name or IP address used is only internal and cannot be reached by the client. This problem can happen with production-grade clusters as well as local development tools, such as Minikube or Minishift. In such cases, you might get the following errors from your client:

[2019-04-22 21:04:11,976] WARN [Consumer clientId=consumer-1, groupId=console-consumer-42133] Connection to node 1 (/10.0.2.15:31301) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

or

[2019-04-22 21:11:37,295] WARN [Producer clientId=console-producer] Connection to node -1 (/10.0.2.15:31488) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

When you see one of these errors, you can compare the following addresses:

  • The address under which you expect your nodes to be reachable by your Kafka clients (with Minikube, that should be the IP address returned by the minikube ip command).
  • The address of the nodes where the Kafka pods are running
kubectl get node <node-name> -o=jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}'
  • The address advertised by the Kafka broker
kubectl exec my-cluster-kafka-0 -c kafka -it -- cat /tmp/strimzi.properties | grep advertised

If these addresses are different, they are probably causing the problem, but Strimzi can help. You can use the override options to change the addresses advertised by the Kafka pods. Instead of reading the address from the Kubernetes API from the node resources, you can configure them in the Kafka custom resource. For example:

# ...
listeners:
  external:
    type: nodeport
    tls: false
    overrides:
      brokers:
      - broker: 0
        advertisedHost: XXX.XXX.XXX.XXX
      - broker: 1
        advertisedHost: XXX.XXX.XXX.XXX
      - broker: 2
        advertisedHost: XXX.XXX.XXX.XXX
# ...

The overrides can specify either DNS names or IP addresses. This solution is not ideal, because you need to maintain the addresses in the custom resource and remember to update them every time you are, for example, scaling up your Kafka cluster. In many cases, however, you might not be able to change the addresses reported by the Kubernetes APIs. So, this approach at least gives you a way to get it working.

Another thing that might make using node ports complicated is the presence of a firewall. If your client cannot connect, you should check whether the Kubernetes node and the port are reachable using simple tools such as telnet or ping. In public clouds, such as Amazon AWS, you will also need to enable access to the nodes/node ports in the security groups.
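
A quick way to test reachability from the client machine, assuming you have netcat available (replace the placeholders with your node address and node port):

nc -zv <node-address> <node-port>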

TLS support

Strimzi supports TLS when exposing Kafka using node ports. For historical reasons, TLS encryption is enabled by default, but you can disable it if you want.

# ...
listeners:
  external:
    type: nodeport
    tls: false
# ...

When exposing Kafka using node ports with TLS, Strimzi currently doesn’t support TLS hostname verification. The main reason is that, with node ports, it is hard to pin down the addresses that will be used and add them to the TLS certificates. This is mainly because:

  • The node where the broker runs may change every time the pod or node is restarted.
  • The nodes in the cluster might sometimes change frequently, and we would need to refresh the TLS certificates every time nodes are added or removed and addresses change.

Customizations

Strimzi aims to make node ports work out of the box, but there are several options you can use to customize the Kafka cluster and its node port services.

Preconfigured node port numbers

By default, the node port numbers are generated/assigned by the Kubernetes controllers. That means that every time you delete your Kafka cluster and deploy a new one, a new set of node ports will be assigned to the Kubernetes services created by Strimzi. So, after every redeployment, you have to reconfigure all your applications using the node ports with the new node port of the bootstrap service.

Strimzi allows you to customize the node ports in the Kafka custom resource:

# ...
listeners:
  external:
    type: nodeport
    tls: true
    authentication:
      type: tls
    overrides:
      bootstrap:
        nodePort: 32100
      brokers:
      - broker: 0
        nodePort: 32000
      - broker: 1
        nodePort: 32001
      - broker: 2
        nodePort: 32002
# ...

The example above requests node port 32100 for the bootstrap service and ports 32000, 32001, and 32002 for the per-broker services. This allows you to redeploy your cluster without changing the node port numbers in all your applications.

Note that Strimzi doesn’t do any validation of the requested port numbers, so you have to make sure that they are:

  • within the range assigned for node ports in the configuration of your Kubernetes cluster, and
  • not used by any other service.

You do not have to configure all the node ports. You can decide to configure only some of them, for example, only the one for the external bootstrap service.

Configuring advertised hosts and ports

Strimzi also allows you to customize the advertised hostname and port that will be used in the configuration of the Kafka pods:

# ...
listeners:
  external:
    type: nodeport
    authentication:
      type: tls
    overrides:
      brokers:
      - broker: 0
        advertisedHost: example.hostname.0
        advertisedPort: 12340
      - broker: 1
        advertisedHost: example.hostname.1
        advertisedPort: 12341
      - broker: 2
        advertisedHost: example.hostname.2
        advertisedPort: 12342
# ...

The advertisedHost field can contain either a DNS name or an IP address. You can, of course, also decide to customize just one of these.

Changing the advertised port will only change the advertised port in the Kafka broker configuration. It will have no impact on the node port that is assigned by Kubernetes. To configure the node port numbers used by the Kubernetes services, use the nodePort option described above.

Overriding the advertised hosts is something we used in the troubleshooting section above, when the node address provided by the Kubernetes API was not the correct one. It can be useful in other situations as well, such as when your network does some network address translation.

Another example might be when you don’t want the clients to connect directly to the nodes where the Kafka pods are running. You can have only selected Kubernetes nodes exposed to the clients and use the advertisedHost option to configure the Kafka brokers to use these nodes.

Pros and cons

Exposing your Kafka cluster to the outside using node ports can give you a lot of flexibility. It can also deliver very good performance. Compared to other solutions, such as load balancers, routes, or ingress, there is no middleman to be a bottleneck or add latency. Your client connections will go to your Kafka broker in the most direct way possible. However, there is also a price to pay for this. Node ports are a very low-level solution. Often, you will run into problems with the detection of the advertised addresses, as described in the sections above. Additionally, node ports may require you to expose your Kubernetes nodes to the clients, which is often seen as a security risk by administrators.

The post Accessing Apache Kafka in Strimzi: Part 2 – Node ports appeared first on Red Hat Developer Blog.


Strimzi is an open source project that provides container images and operators for running Apache Kafka on Kubernetes and Red Hat OpenShift. Scalability is one of the flagship features of Apache Kafka. It is achieved by partitioning the data and distributing the partitions across multiple brokers. Such data sharding also has a big impact on how Kafka clients connect to the brokers. This is especially visible when Kafka is running within a platform like Kubernetes but is accessed from outside of that platform.

This article series will explain how Kafka and its clients work and how Strimzi makes it accessible for clients running outside of Kubernetes.

Note: Productized and supported versions of the Strimzi and Apache Kafka projects are available as part of the Red Hat AMQ product.

It would, of course, be insufficient just to shard the data into partitions. The ingress and egress data traffic also needs to be properly handled. The clients writing to or reading from a given partition have to connect directly to the leader broker that hosts it. Because the clients connect directly to the individual brokers, the brokers don't need to do any forwarding of data between the clients and other brokers. That significantly reduces the amount of work the brokers have to do and the amount of traffic flowing around within the cluster.

The only data traffic between the different brokers is due to replication, when the follower brokers are fetching data from the lead broker for a given partition. That makes the data shards independent of each other, which also makes Kafka scale so well.

How do the clients know where to connect?

Kafka’s discovery protocol

Kafka has its own discovery protocol. When a Kafka client is connecting to the Kafka cluster, it first connects to any broker that is a member of the cluster and asks it for metadata for one or more topics. The metadata contains the information about the topics, their partitions, and the brokers that host these partitions. All brokers should have the data for the whole cluster because they are all synced through ZooKeeper. Therefore, it doesn't matter which broker the client connects to first; all of them will give it the same response.

Once the client gets the metadata, it will use that data to figure out where to connect when it wants to write to or read from a given partition. The broker addresses used in the metadata will be either created by the broker itself based on the hostname of the machine where the broker runs, or it can be configured by the user using the advertised.listeners option.
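
In plain Kafka terms, these are standard broker options. A minimal sketch of a server.properties fragment, with a hypothetical public hostname:

# the address the broker binds to
listeners=PLAINTEXT://0.0.0.0:9092
# the address the broker tells clients to connect to
advertised.listeners=PLAINTEXT://broker-0.example.com:9092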

The client will use the addresses from the metadata to open one or more new connections to the brokers that host the particular partitions it is interested in. Even when the metadata points to the same broker the client is already connected to and received the metadata from, it will still open a second connection. These connections will be used to produce or consume data.

Note that this description of the Kafka protocol is intentionally simplified for the purposes of this article.

What does this mean for Kafka on Kubernetes?

So, what does all this mean for running Kafka on Kubernetes? If you are familiar with Kubernetes, you probably know that the most common way to expose some application is using a Kubernetes Service. Kubernetes services work as layer 4 load balancers. They provide a stable DNS address, where the clients can connect, and they forward the connections to one of the pods that is backing the service.

This approach works reasonably well with most stateless applications, which just want to connect randomly to one of the back ends behind the service. But it can get a lot trickier if your application requires some kind of stickiness because of state associated with a particular pod. This can be session stickiness, for example, where a client needs to connect to the same pod as last time because of session information that the pod already has. Or it can be data stickiness, where a client needs to connect to a particular pod because it contains some particular data.

This is also the case with Kafka. A Kubernetes service can be used for the initial connection only—it will take the client to one of the brokers within the cluster where it can get the metadata. However, the subsequent connections cannot be done through that service because it would route the connection randomly to one of the brokers in the cluster instead of leading it to one particular broker.

How does Strimzi deal with this? There are two general ways to solve this problem:

  • Write your own proxy/load balancer, which would do more intelligent routing on the application layer (layer 7). Such a proxy could, for example, abstract the architecture of the Kafka cluster from the client and pretend that the cluster has just one big broker running everything and just route the traffic to the different brokers in the background. Kubernetes already does this for the HTTP traffic using the Ingress resource.
  • Make sure you use the advertised.listeners option in the broker configuration in a way that allows the clients to connect directly to the broker.

In Strimzi, we currently support the second option.

Connecting from inside the same Kubernetes cluster

Doing this for clients running inside the same Kubernetes cluster as the Kafka cluster is quite simple. Each pod has its own IP address, which other applications can use to connect directly to it. This is normally not used by regular Kubernetes applications. One reason for this is that Kubernetes doesn't offer a nice way to discover these IP addresses. To find the IP address, you need to use the Kubernetes API to find the right pod and its IP address, and you would need the appropriate rights to do so. Instead, Kubernetes uses the services with their stable DNS names as the main discovery mechanism.

With Kafka, this is not an issue, because it has its own discovery protocol. We do not need the clients to figure out the address from the Kubernetes API. We just need to configure it as the advertised address, and the clients will discover it through the Kafka metadata.

There is an even better option, which is used by Strimzi. For StatefulSets (which Strimzi uses to run the Kafka brokers), you can use a Kubernetes headless service to give each of the pods a stable DNS name. Strimzi uses these DNS names as the advertised addresses for the Kafka brokers. So, with Strimzi:

  • The initial connection is done using a regular Kubernetes service to get the metadata.
  • The subsequent connections are opened using the DNS names given to the pods by another headless Kubernetes service. The diagram below shows how it looks with an example Kafka cluster named my-cluster.
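
The advertised address of each broker follows the usual StatefulSet DNS pattern. A sketch, assuming Strimzi's default naming for a cluster named my-cluster in the myproject namespace:

# pattern: <pod-name>.<headless-service-name>.<namespace>.svc:<port>
my-cluster-kafka-0.my-cluster-kafka-brokers.myproject.svc:9092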

Both approaches have pros and cons. Using the DNS can sometimes cause problems with cached DNS information. When the underlying IP addresses of the pods change (e.g., during rolling updates), the clients connecting to the brokers need to have the latest DNS information. However, we found that using IP addresses causes even worse problems, because sometimes Kubernetes re-uses them very aggressively, and a new pod gets an IP address that was used just a few seconds before by some other Kafka node.

Connecting from the outside

Although access for clients running inside the same Kubernetes cluster is relatively simple, it gets a bit harder from the outside. There are some tools for joining the Kubernetes network with the regular network outside of Kubernetes, but most Kubernetes clusters run on their own network, which is separated from the outside world. That means things like pod IP addresses or DNS names are not resolvable by clients running outside the cluster. Because of this, we clearly need to use a separate Kafka listener for access from inside and from outside of the cluster, because the advertised addresses will need to be different.

Kubernetes and Red Hat OpenShift have many different ways of exposing applications, such as node ports, load balancers, or routes. Strimzi supports all of these to let users find the way that best suits their use case. We will look at them in more detail in the subsequent articles in this series.

The post Accessing Apache Kafka in Strimzi: Part 1 – Introduction appeared first on Red Hat Developer Blog.
