
Posts Tagged ‘APM’

A Post-Mortem on a Docker Compatibility Break

Have you ever had a perfectly working Docker Compose stack that mysteriously stopped working after a routine software update? It’s a frustrating experience that can consume hours of debugging. This post is a chronicle of just such a problem, involving a local Elastic Stack, Docker’s recent versions, and a simple, yet critical, configuration oversight.

The stack in question was a straightforward setup for local development, enabling a quick start for Elasticsearch, Kibana, and the APM Server. The key to its simplicity was the environment variable xpack.security.enabled=false, which effectively disabled security for a seamless, local-only experience.

The configuration looked like this:

version: "3.9"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.16.1
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9600:9600"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always

  kibana:
    image: docker.elastic.co/kibana/kibana:8.16.1
    container_name: kibana
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - xpack.apm.enabled=true
    ports:
      - "5601:5601"
    restart: always

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.16.1
    container_name: apm-server
    depends_on:
      - elasticsearch
    environment:
      - APM_SERVER_LICENSE=trial
      - X_PACK_SECURITY_USER=elastic
      - X_PACK_SECURITY_PASSWORD=changeme
    ports:
      - "8200:8200"
    restart: always

This setup worked flawlessly for months. But after a hiatus and a few Docker updates, the stack refused to start. Countless hours were spent trying different versions, troubleshooting network issues, and even experimenting with new configurations like Fleet and health checks—all without success. The solution, it turned out, was to roll back to a four-year-old version of Docker (20.10.x), which immediately got the stack running again.

The question was: what had changed?

The Root Cause: A Subtle Security Misalignment

The culprit wasn’t a major Docker bug but a subtle configuration incompatibility that newer Docker versions handle differently. The issue lay in the apm-server configuration.

Even though security was explicitly disabled in the elasticsearch service with xpack.security.enabled=false, the apm-server was still configured to use authentication with X_PACK_SECURITY_USER=elastic and X_PACK_SECURITY_PASSWORD=changeme.

In older Docker versions, the APM server’s attempt to authenticate against an unsecured Elasticsearch instance might have failed silently or been handled gracefully, allowing the stack to proceed. Recent versions of Docker and the Elastic Stack are stricter: the APM server’s failure to authenticate against the unsecured Elasticsearch instance became a fatal startup error, halting the entire stack.

The Solution: A Simple YAML Fix

The solution is to simply align the security settings across all services. Since Elasticsearch is running without security, the APM server should also be configured to connect without authentication.

By removing the authentication environment variables from the apm-server service, the stack starts correctly on the latest Docker versions.

Here is the corrected docker-compose.yml:

version: "3.9"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.16.1
    container_name: elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    ports:
      - "9200:9200"
      - "9600:9600"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: always

  kibana:
    image: docker.elastic.co/kibana/kibana:8.16.1
    container_name: kibana
    depends_on:
      - elasticsearch
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - xpack.apm.enabled=true
    ports:
      - "5601:5601"
    restart: always

  apm-server:
    image: docker.elastic.co/apm/apm-server:8.16.1
    container_name: apm-server
    depends_on:
      - elasticsearch
    # The fix is here: remove security environment variables
    environment:
      - APM_SERVER_LICENSE=trial
    ports:
      - "8200:8200"
    restart: always

This experience highlights an important lesson in development: what works today may not work tomorrow due to underlying changes in a platform’s behavior. While a quick downgrade can get you back on track, a deeper investigation into the root cause often leads to a more robust and forward-compatible solution.

How to Bypass Elasticsearch’s 10,000-Result Limit with the Scroll API

If you’ve ever worked with the Elasticsearch API, you’ve likely run into its infamous 10,000-result limit. It’s a default cap that can feel like a brick wall when you’re dealing with large datasets—think log analysis, report generation, or bulk data exports. Fortunately, there’s a slick workaround: the Scroll API. In this post, I’ll walk you through why this limit exists, how the Scroll API solves it, and share practical examples to get you started.

Why the 10,000-Result Limit Exists

Elasticsearch caps standard search results at 10,000 to protect performance. Fetching millions of records in one shot with from and size parameters can strain memory and slow things down. But what if you need all that data? That’s where the Scroll API shines—it’s designed for deep pagination, letting you retrieve everything in manageable chunks.
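You can see the wall for yourself: a standard search that pages past the cap with from and size is rejected outright. Against a hypothetical logs index:

```
GET /logs/_search
{
  "from": 10000,
  "size": 10
}
```

Elasticsearch refuses this with an error along the lines of “Result window is too large, from + size must be less than or equal to: [10000]”, and the message itself points you toward the Scroll API.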

What Is the Scroll API?

Unlike a typical search, the Scroll API maintains a temporary “scroll context” on the server. You grab a batch of results, get a scroll_id, and use it to fetch the next batch—no need to rerun your query. It’s efficient, scalable, and perfect for big data tasks.

How to Use the Scroll API: Step by Step

Let’s break it down with examples you can try yourself.

Step 1: Start the Scroll

Kick things off with a search request. Add the scroll parameter (like 1m for a 1-minute timeout) and set size to control your batch size. Here’s a basic example:
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
This pulls the first 1,000 hits and returns a `scroll_id`—a long, encoded string you’ll need for the next step.

Step 2: Fetch More Results

Using that `scroll_id`, request the next batch. You don’t need to repeat the query—just send the ID and timeout:
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}
Loop this call until you’ve retrieved all your data. Each response includes a new `scroll_id` (sometimes the same, depending on the version), so keep updating it.

Step 3: Clean Up

When you’re done, delete the scroll context to free up server resources. It’s a small but critical step:
DELETE /_search/scroll/c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE...

Skip this, and you’ll leave dangling contexts that could bog down your cluster.
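One extra convenience worth knowing: if you’ve lost track of individual IDs, every open scroll context can be released in a single call:

```
DELETE /_search/scroll/_all
```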

A Real-World Example

Let’s say you’re sifting through millions of logs for a specific error. Here’s a targeted scroll query:
GET /logs/_search?scroll=2m
{
  "size": 500,
  "query": {
    "match": {
      "error_message": "timeout"
    }
  }
}

Then, use the Scroll API to paginate through every matching log entry. It’s way cleaner than hacking around with `from` and `size`.
Tips for Scroll API Success
  • Batch Size: Stick to a `size` like 500–1000. Too large, and you’ll strain memory; too small, and you’ll make too many requests.
  • Timeout Tuning: Set the scroll duration (e.g., `1m`, `5m`) based on how fast your script processes each batch. Too short, and the context expires mid-run.
  • Automation: Use a script to handle the loop. Python’s `elasticsearch` library makes this straightforward (it also ships a `helpers.scan` convenience function that wraps the same pattern):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Initial search opens the scroll context and returns the first batch
resp = es.search(index="logs", scroll="2m", size=500,
                 query={"match": {"error_message": "timeout"}})
scroll_id = resp["_scroll_id"]

# Keep fetching until a batch comes back empty
while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_source"])  # Process each document
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]  # Always use the most recent scroll_id

es.clear_scroll(scroll_id=scroll_id)  # Release the scroll context

Why Scroll Beats the Alternatives

You could tweak `index.max_result_window` to raise the limit, but that’s a performance gamble. Export tools or aggregations might work for summaries, but for raw data retrieval, Scroll is king—efficient and built for the job.
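For completeness, raising the cap is a one-line settings change (shown against a hypothetical logs index); just remember that every from + size page up to the new ceiling is assembled in memory on each request, which is exactly the performance gamble:

```
PUT /logs/_settings
{
  "index": {
    "max_result_window": 50000
  }
}
```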

Conclusion

The Scroll API has been a game-changer for my Elasticsearch projects, especially when wrestling with massive indices. It’s simple once you get the hang of it, and the payoff is huge.

Elastic APM: When to Use @CaptureSpan vs. @CaptureTransaction?

If you’re working with Elastic APM in a Java application, you might wonder when to use `@CaptureSpan` versus `@CaptureTransaction`. Both are powerful tools for observability, but they serve different purposes.
🔹 `@CaptureTransaction`:
Use this at the entry point of a request, typically at a controller, service method, or a background job. It defines the start of a transaction and allows you to trace how a request propagates through your system.
🔹 `@CaptureSpan`:
Use this to track sub-operations within a transaction, such as database queries, HTTP calls, or specific business logic. It helps break down execution time and pinpoint performance bottlenecks inside a transaction.
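As a rough sketch of how the two fit together (class and method names are hypothetical, and it assumes the `apm-agent-api` dependency on the classpath plus the Elastic APM Java agent attached to the JVM):

```java
import co.elastic.apm.api.CaptureSpan;
import co.elastic.apm.api.CaptureTransaction;

public class OrderProcessor {

    // Entry point of the unit of work: one APM transaction per processed order
    @CaptureTransaction("process-order")
    public void processOrder(String orderId) {
        validate(orderId);
        persist(orderId);
    }

    // Sub-operation: appears as a span nested inside the transaction above
    @CaptureSpan("validate-order")
    void validate(String orderId) {
        // business checks
    }

    @CaptureSpan("persist-order")
    void persist(String orderId) {
        // database write
    }
}
```

Because the agent works by bytecode instrumentation rather than bean proxies, the annotated methods are captured even when invoked from within the same class, which proxy-based AOP (e.g., Spring’s) would miss.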

📌 Best Practices:

✅ Apply @CaptureTransaction at the highest-level method handling a request.
✅ Use @CaptureSpan for key internal operations you want to monitor.
✅ Avoid excessive spans—instrument only critical code paths to reduce overhead.

By balancing these annotations effectively, you can get detailed insights into your app’s performance while keeping APM overhead minimal.