SOLR indexes tend to be larger than the documents they index.

Compare the size of a data store with the size of the SOLR index built over that data, and you will usually find the index is larger than the data it indexes. This may seem counter-intuitive at first, but it actually makes perfect sense.

In order to understand why, it’s helpful to create a simplified version of a document store and an index. Consider the following collection of 7 documents. Each row consists of a reference number and a document. This data store is 91 bytes in size.

[0,"ABC DEF"]
[1,"AB DEF"]
[2,"AC DEF"]
[3,"BC DEF"]
[4,"ABC DE"]
[5,"ABC EF"]
[6,"ABC DF"]

The index is intended to provide a quick lookup by letter combination. Instead of scanning all the documents to identify the ones containing a given text fragment, we simply look up the fragment and receive a list of documents containing that character sequence.

"A"=[0,1,2,4,5,6]
"AB"=[0,1,4,5,6]
"ABC"=[0,4,5,6]
"AC"=[2]
"B"=[0,1,3,4,5,6]
"BC"=[0,3,4,5,6]
"C"=[0,2,3,4,5,6]
"D"=[0,1,2,3,4,6]
"DE"=[0,1,2,3,4]
"DEF"=[0,1,2,3]
"DF"=[6]
"E"=[0,1,2,3,4,5]
"EF"=[0,1,2,3,5]
"F"=[0,1,2,3,5,6]

The index is 225 bytes. That’s more than twice the size of our document store.
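
To make the toy model concrete, here is a minimal Scala sketch (a simplification for illustration, not SOLR's actual Lucene data structures) that builds the index above by mapping every contiguous substring of every word to the documents containing it. Printed this way, it reproduces the 14 entries shown:

val docs = Vector("ABC DEF", "AB DEF", "AC DEF", "BC DEF", "ABC DE", "ABC EF", "ABC DF")

// Emit a (substring, docId) pair for every contiguous substring of every word.
val pairs = for {
  (doc, id) <- docs.zipWithIndex
  word      <- doc.split(" ")
  from      <- 0 until word.length
  to        <- (from + 1) to word.length
} yield (word.substring(from, to), id)

// Group the pairs into the posting lists shown above.
val index: Map[String, List[Int]] =
  pairs.groupBy(_._1).map { case (k, v) => (k, v.map(_._2).distinct.sorted.toList) }

index.toSeq.sortBy(_._1).foreach { case (k, ids) =>
  println("\"" + k + "\"=[" + ids.mkString(",") + "]")
}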

Weirdness when every function returns a Column: Chained when (Spark)

When when is chained, the chain breaks at the first test that returns true: every later when clause is ignored.

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, when}

val isTrue = lit(true)

def getWithChainedWhen(): Column = {
  when(isTrue, "1st").when(isTrue, "2nd").when(isTrue, "3rd")
}

// In spark-shell, where spark.implicits._ is already in scope:
val df = sc.parallelize(List("A")).toDF("a")
  .withColumn("chained", getWithChainedWhen())
df.show(false)
+---+-------+
|a  |chained|
+---+-------+
|A  |1st    |
+---+-------+

Moral: Use when with otherwise, not when with when.
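
As a sketch of the idiom the moral recommends: distinct tests, evaluated in order, with the default supplied by otherwise. The conditions on col("a") are illustrative, reusing the toy DataFrame above:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

// Distinct tests, evaluated in order; otherwise supplies the default.
def getWithOtherwise(): Column = {
  when(col("a") === "A", "matched A")
    .when(col("a") === "B", "matched B")
    .otherwise("no match")
}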

Stream IIS logs to Kafka

Introducing KafkaTailer

Kafka is a game-changer.  As a powerful, centralized messaging tool, it performs extraordinarily well compared to other messaging applications. Long popular in the JVM and Unix/Linux worlds, Kafka can now be fed from your favorite Microsoft IIS application as well.  Using the best open-source libraries available, KafkaTailer can stream your IIS logs to any Kafka topic.

The flexibility of this tool comes from the simplicity of its approach: it simply tails standard log files.  Combining the Apache Commons IO Tailer, the latest Kafka Producer, and the Apache Commons Daemon, KafkaTailer watches your IIS log directory and sends the log messages up to a Kafka server.  Within minutes, it’s possible to start sending your IIS logs out to Kafka.
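
KafkaTailer’s own source is the definitive reference; as a rough illustration of the approach, here is a minimal Scala sketch along the same lines, combining the Commons IO Tailer with a Kafka producer. It tails a single file, where KafkaTailer itself watches the whole directory and handles rotation; the file path and topic name are placeholders:

import java.io.File
import java.util.Properties
import org.apache.commons.io.input.{Tailer, TailerListenerAdapter}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TailToKafka {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "127.0.0.1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Forward each line appended to the log file to the Kafka topic.
    val listener = new TailerListenerAdapter {
      override def handle(line: String): Unit =
        producer.send(new ProducerRecord[String, String]("a-topic", line))
    }

    // Poll the file for new lines once per second, starting from the current end.
    Tailer.create(new File("C:\\iis-logs\\W3SVC1\\u_ex.log"), listener, 1000, true)
  }
}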

Quick Start

  1. Open the administration console of your IIS instance.
  2. Configure the logs that interest you to log to a dedicated directory.
  3. If you don’t have a Java Virtual Machine on your Windows machine, you will need to install one.
  4. Set up a Kafka instance to publish your logs to, if you don’t have one already.
  5. Go to https://github.com/johnmpage/KafkaTailer. Read the summary.
  6. Download the latest release of KafkaTailer (currently v2.1).
  7. Configure the Kafka Producer with a kafka-producer.properties file. The minimal set of values would be as follows:
    bootstrap.servers=127.0.0.1:9092
    value.serializer=org.apache.kafka.common.serialization.StringSerializer
    key.serializer=org.apache.kafka.common.serialization.StringSerializer
    
  8. Open up a command prompt and type the following command, substituting values that reflect your environment as needed:
    java -classpath kafka-tailer-2.1-jar-with-dependencies.jar net.johnpage.kafka.KafkaTailer directoryPath=C:\\iis-logs\\W3SVC1\\ producerPropertiesPath=C:\\iis-logs\\kafka-producer.properties kafkaTopic=a-topic
  9. Open your browser and navigate to your IIS website.
  10. KafkaTailer reports its operations in the command prompt. Confirm that it has started up successfully and no exceptions are being thrown.
  11. Monitor your Kafka topic and confirm that the log messages are being added to it, for example with the console consumer shown below.
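
One simple way to monitor the topic is Kafka’s own console consumer, assuming a Kafka distribution is unpacked on the Windows machine (adjust the broker address and topic to match step 8):

    bin\windows\kafka-console-consumer.bat --bootstrap-server 127.0.0.1:9092 --topic a-topic --from-beginning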

Once the basic setup is working, you will probably want to configure your Kafka Producer to use SSL, refine which fields the IIS logs include, and run KafkaTailer as a Windows Service.
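
For SSL, for example, a few extra entries go into kafka-producer.properties. A sketch with placeholder paths and passwords:

    security.protocol=SSL
    ssl.truststore.location=C:\\certs\\kafka.client.truststore.jks
    ssl.truststore.password=changeit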

Setting up a Windows Service

Included in the KafkaTailer project is the skeleton of a Microsoft Windows Service.  If you’ve run Tomcat on Windows, the GUI will be familiar to you.  Apache Tomcat uses the same Daemon project as KafkaTailer does.

The Windows service is included in the winsrvc directory. The install.bat script and the kafka-producer.properties file will require customization to reflect your environment.

Note

Please report any bugs or issues!


Why We Tag


Alternate Title: The Linchpin of Safe Patch Releases

Some would argue that Patch Releases to Production are inherently risky. In fact, with the right approach, the risk involved in a Patch Release can be small.  The key to managing this risk is Continuous Integration and a disciplined release process.

When a bug is discovered in code that was released 2 or 3 weeks ago, developers may have already begun work on the next big release. Ambitious new features that touch sensitive business logic may be partially implemented. Developers sometimes struggle to navigate this moment when there seems to be no firm ground to stand on. If they release the code in the state it is in, they have untested changes and stand a good chance of introducing new bugs into the Production system.

What is required is a “snapshot” of the code exactly as it appeared the last time it was thoroughly tested and reviewed… that is… the last time it was released. If the bug fix can be applied to well-tested code, the risks of introducing new bugs are greatly reduced.

Why We Tag Releases

In the diagram, we can see how tagged code makes patch releases fast and low-risk. In the first section, the normal loop of development occurs. The source code is in a state of flux.  Developers are making changes and releasing to the shared server, where the work is visible to business stakeholders. Product owners, testers, and stakeholders are reviewing the work, providing their thoughts, and identifying issues. When the new features and improvements are completed to the developers’ satisfaction, a release is prepared. The release is tagged. Tagging a release takes minutes, but it is crucial to managing risk: it marks the code in time. The release that goes into production is built from this snapshot.
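
The process is agnostic about the version-control system. Assuming Git, the whole ceremony is two commands at release time and one when the patch is needed (the tag and branch names are illustrative):

    # At release time: mark the exact commit that was tested and approved.
    git tag -a v1.4.0 -m "Release 1.4.0"
    git push origin v1.4.0

    # Weeks later: branch from that exact snapshot to build the patch.
    git checkout -b patch-1.4.1 v1.4.0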

This tested, reviewed, and approved code is released. Once again, new features begin to take shape. The code base is in flux. Two weeks into this new effort, a bug report comes in. The schema changes have had an unexpected effect on another feature. An important client is unhappy. Business raises the issue to the highest level. This bug must be patched.

The developers can proceed calmly and with confidence. In seconds they can pull down a snapshot of the code exactly as it looked when it was released. The bug was missed, but the fix is simple. Testing is completed in a short time: only the places that might be influenced by the one new line of code need to be re-checked. The team knows they are working with a body of source code that was tested thoroughly during the last major release.

The patch can go out quickly. The developers are relaxed. Business is ecstatic.