Posts Tagged ‘Elastic’
How to Bypass Elasticsearch’s 10,000-Result Limit with the Scroll API
Why the 10,000-Result Limit Exists
What Is the Scroll API?
How to Use the Scroll API: Step by Step
Step 1: Start the Scroll
```
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
```
Step 2: Fetch More Results
```
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}
```
Step 3: Clean Up
```
DELETE /_search/scroll/c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE...
```
A Real-World Example
```
GET /logs/_search?scroll=2m
{
  "size": 500,
  "query": {
    "match": {
      "error_message": "timeout"
    }
  }
}
```
- Batch Size: Stick to a `size` like 500–1000. Too large, and you’ll strain memory; too small, and you’ll make too many requests.
- Timeout Tuning: Set the scroll duration (e.g., `1m`, `5m`) based on how fast your script processes each batch. Too short, and the context expires mid-run.
- Automation: Use a script to handle the loop. Python’s `elasticsearch` client, for instance, makes the scroll loop straightforward:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Open the scroll context and fetch the first batch
scroll = es.search(index="logs", scroll="2m", size=500,
                   body={"query": {"match": {"error_message": "timeout"}}})
scroll_id = scroll["_scroll_id"]

while len(scroll["hits"]["hits"]):
    print(scroll["hits"]["hits"])  # Process this batch
    scroll = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = scroll["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # Cleanup
```
Why Scroll Beats the Alternatives
Conclusion
Elastic APM: When to Use @CaptureSpan vs. @CaptureTransaction?
If you’re working with Elastic APM in a Java application, you might wonder when to use `@CaptureSpan` versus `@CaptureTransaction`. Both are powerful tools for observability, but they serve different purposes.
🔹 `@CaptureTransaction`:
Use this at the entry point of a request, typically a controller method, a service entry point, or a background job. It marks the start of a transaction and lets you trace how a request propagates through your system.
🔹 `@CaptureSpan`:
Use this to track sub-operations within a transaction, such as database queries, HTTP calls, or specific business logic. It helps break down execution time and pinpoint performance bottlenecks inside a transaction.
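To make the split concrete, here is a minimal sketch: a hypothetical `OrderService` (invented for illustration, not from any particular codebase) with the entry point annotated `@CaptureTransaction` and its sub-operations annotated `@CaptureSpan`. It assumes the `co.elastic.apm:apm-agent-api` dependency on the classpath and the Elastic APM Java agent attached to the JVM, since the annotations do nothing without the agent’s instrumentation.

```java
import co.elastic.apm.api.CaptureSpan;
import co.elastic.apm.api.CaptureTransaction;

// Hypothetical service, used only to illustrate annotation placement.
public class OrderService {

    // Entry point: the agent starts a transaction named "process-order"
    // when this method is invoked outside an existing transaction.
    @CaptureTransaction("process-order")
    public void processOrder(String orderId) {
        validateOrder(orderId);
        persistOrder(orderId);
    }

    // Sub-operation: recorded as a span inside the enclosing transaction.
    @CaptureSpan("validate-order")
    void validateOrder(String orderId) {
        // business validation logic
    }

    // Another span, e.g. the database write you want to time separately.
    @CaptureSpan("persist-order")
    void persistOrder(String orderId) {
        // database write
    }
}
```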
📌 Best Practices:
✅ Apply `@CaptureTransaction` at the highest-level method handling a request.
✅ Use `@CaptureSpan` for key internal operations you want to monitor.
✅ Avoid excessive spans—instrument only critical code paths to reduce overhead.
By balancing these annotations effectively, you can get detailed insights into your app’s performance while keeping APM overhead minimal.
[DevoxxUK2024] Is It (F)ake?! Image Classification with TensorFlow.js by Carly Richmond
Carly Richmond, a Principal Developer Advocate at Elastic, captivated the DevoxxUK2024 audience with her engaging exploration of image classification using TensorFlow.js. Inspired by her love for the Netflix show Is It Cake?, Carly embarked on a project to build a model distinguishing cakes disguised as everyday objects from their non-cake counterparts. Despite her self-professed lack of machine learning expertise, Carly’s journey through data gathering, pre-trained models, custom model development, and transfer learning offers a relatable and insightful narrative for developers venturing into AI-driven JavaScript applications.
Gathering and Preparing Data
Carly’s project begins with the critical task of data collection, a foundational step in machine learning. To source images of cakes resembling other objects, she leverages Playwright, a JavaScript-based automation framework, to scrape images from bakers’ websites and Instagram galleries. For non-cake images, Carly utilizes the Unsplash API, which provides royalty-free photos with a rate-limited free tier. She queries categories like reptiles, candles, and shoes to align with the deceptive cakes from the show. However, Carly acknowledges limitations, such as inadvertently including biscuits or company logos in the dataset, highlighting the challenges of ensuring data purity with a modest set of 367 cake and 174 non-cake images.
Exploring Pre-Trained Models
To avoid building a model from scratch, Carly initially experiments with TensorFlow.js’s pre-trained models, Coco SSD and MobileNet. Coco SSD, trained on the Common Objects in Context (COCO) dataset, excels in object detection, identifying bounding boxes and classifying objects like cakes with reasonable accuracy. MobileNet, designed for lightweight classification, struggles with Carly’s dataset, often misclassifying cakes as cups or ice cream due to visual similarities like frosting. CORS issues further complicate browser-based MobileNet deployment, prompting Carly to shift to a Node.js backend, where she converts images into tensors for processing. These experiences underscore the trade-offs between model complexity and practical deployment.
Building and Refining a Custom Model
Undeterred by initial setbacks, Carly ventures into crafting a custom convolutional neural network (CNN) using TensorFlow.js. She outlines the CNN’s structure, which includes convolution layers to extract features, pooling layers to reduce dimensionality, and a softmax activation for binary classification (cake vs. not cake). Despite her efforts, the model’s accuracy languishes at 48%, plagued by issues like tensor shape mismatches and premature tensor disposal. Carly candidly admits to errors, such as mislabeling cakes as non-cakes, illustrating the steep learning curve for non-experts. This section of her talk resonates with developers, emphasizing perseverance and the iterative nature of machine learning.
Leveraging Transfer Learning
Recognizing the limitations of her dataset and custom model, Carly pivots to transfer learning, using MobileNet’s feature vectors as a foundation. By adding a custom classification head with ReLU and softmax layers, she achieves a significant improvement, with accuracy reaching 100% by the third epoch and correctly classifying 319 cakes. While not perfect, this approach outperforms her custom model, demonstrating the power of leveraging pre-trained models for specialized tasks. Carly’s comparison of human performance—90% accuracy by the DevoxxUK audience versus her model’s results—adds a playful yet insightful dimension, highlighting the gap between human intuition and machine precision.
[DevoxxGR2024] Butcher Virtual Threads Like a Pro at Devoxx Greece 2024 by Piotr Przybyl
Piotr Przybyl, a Java Champion and developer advocate at Elastic, captivated audiences at Devoxx Greece 2024 with a dynamic exploration of Java 21’s virtual threads. Through vivid analogies, practical demos, and a touch of humor, Piotr demystified virtual threads, highlighting their potential and pitfalls. His talk, rich with real-world insights, offered developers a guide to leveraging this transformative feature while avoiding common missteps. As a seasoned advocate for technologies like Elasticsearch and Testcontainers, Piotr’s presentation was a masterclass in navigating modern Java concurrency.
Understanding Virtual Threads
Piotr began by contextualizing virtual threads within Java’s concurrency evolution. Introduced in Java 21 under Project Loom, virtual threads address the limitations of traditional platform threads, which are costly to create and limited in number. Unlike platform threads, virtual threads are lightweight, managed by a scheduler that mounts and unmounts them from carrier threads during I/O operations. This enables a thread-per-request model, scaling applications to handle millions of concurrent tasks. Piotr likened virtual threads to taxis in a busy city like Athens, efficiently transporting passengers (tasks) without occupying resources during idle periods.
However, virtual threads are not a universal solution. Piotr emphasized that they do not inherently speed up individual requests but improve scalability by handling more concurrent tasks. Their API remains familiar, aligning with existing thread practices, making adoption seamless for developers accustomed to Java’s threading model.
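None of this requires learning a new API to try out. A minimal, self-contained sketch (not Piotr’s demo code) of the thread-per-task style in Java 21:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the thread-per-task model enabled by virtual threads.
public class VirtualThreadsSketch {
    public static void main(String[] args) {
        // One virtual thread per task; the JDK scheduler multiplexes them
        // onto a small pool of carrier (platform) threads.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    try {
                        // Simulated blocking I/O: the virtual thread unmounts
                        // from its carrier while sleeping, freeing the carrier.
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // executor.close() waits for all submitted tasks to finish
    }
}
```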
Common Pitfalls and Pinning
A central theme of Piotr’s talk was “pinning,” a performance issue where a virtual thread remains tied to its carrier thread, negating the benefits. Pinning occurs during I/O or native calls within synchronized blocks, akin to keeping a taxi running during a lunch break. Piotr demonstrated this with a legacy Elasticsearch client, using Testcontainers and Toxiproxy to simulate slow network calls. By enabling tracing with `-Djdk.tracePinnedThreads=full`, he identified and resolved pinning issues, replacing synchronized methods with modern, non-blocking clients.
Piotr cautioned against misuses like pooling or reusing virtual threads, which defeat their cheap, disposable design. He advocated careful monitoring with JFR events to verify that threads remain unpinned, keeping performance optimal in production environments.
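The mechanics are easy to reproduce. The sketch below (hypothetical code, not from Piotr’s demo) shows the classic trap: before the pinning fixes in newer JDKs, a blocking call inside a `synchronized` block pins the virtual thread to its carrier, whereas a `ReentrantLock` lets it unmount. Running it with `-Djdk.tracePinnedThreads=full` prints a stack trace whenever pinning occurs.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustration of pinning; run with -Djdk.tracePinnedThreads=full.
public class PinningDemo {

    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();

    // Blocking while holding a monitor pins the virtual thread: the carrier
    // cannot be released to run other virtual threads during the sleep.
    void pinned() throws InterruptedException {
        synchronized (monitor) {
            Thread.sleep(1_000);
        }
    }

    // With an explicit lock, the virtual thread unmounts while blocked,
    // freeing its carrier.
    void unpinned() throws InterruptedException {
        lock.lock();
        try {
            Thread.sleep(1_000);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        var demo = new PinningDemo();
        Thread.ofVirtual().start(() -> {
            try {
                demo.pinned();   // triggers a jdk.tracePinnedThreads report
                demo.unpinned(); // no report
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).join();
    }
}
```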
Structured Concurrency and Scoped Values
Piotr explored structured concurrency, a preview feature in Java 21, designed to eliminate thread leaks and cancellation delays. By creating scopes that manage forks, developers can ensure tasks complete or fail together, simplifying error handling. He demonstrated a shutdown-on-failure scope, where a single task failure cancels all others, contrasting this with the complexity of managing interdependent futures.
Scoped Values, another preview feature, offer an immutable, one-way alternative to thread-locals, preventing bugs like data leaking between pooled threads. Piotr illustrated their use in maintaining request context, warning against mutability to preserve reliability. These features, he argued, complement virtual threads, fostering robust, maintainable concurrent applications.
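As a rough illustration of the shutdown-on-failure scope Piotr demonstrated (the `Order` record and fetch methods here are invented for the example; the API is a preview in Java 21, so compile and run with `--enable-preview`):

```java
import java.util.concurrent.StructuredTaskScope;

// Sketch of a shutdown-on-failure scope: both subtasks succeed together,
// or the first failure cancels the sibling and propagates.
public class ScopeSketch {

    record Order(String user, String cart) {}

    Order loadOrder() throws Exception {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var user = scope.fork(this::fetchUser); // each fork runs in its own virtual thread
            var cart = scope.fork(this::fetchCart);

            scope.join();          // wait for both subtasks to complete
            scope.throwIfFailed(); // if either failed, the other was cancelled

            return new Order(user.get(), cart.get());
        } // leaving the scope guarantees no forked task outlives it
    }

    String fetchUser() { return "user-42"; }
    String fetchCart() { return "cart-7"; }
}
```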
Practical Debugging and Best Practices
Through live coding, Piotr showcased how debugging with logging can inadvertently introduce I/O, unmounting virtual threads and degrading performance. He compared this to a concert where logging scatters tasks, reducing completion rates. To mitigate this, he recommended avoiding I/O in critical paths and using structured concurrency for monitoring.
Piotr’s best practices included using framework-specific annotations (e.g., Quarkus, Spring) to enable virtual threads and ensuring tasks are interruptible. He urged developers to test thoroughly, leveraging tools like Testcontainers to simulate real-world conditions. His blog post on testing unpinned threads provides further guidance for practitioners.
Conclusion
Piotr’s presentation was a clarion call to embrace virtual threads with enthusiasm and caution. By understanding their mechanics, avoiding pitfalls like pinning, and leveraging structured concurrency, developers can unlock unprecedented scalability. His engaging analogies and practical demos made complex concepts accessible, empowering attendees to modernize Java applications responsibly. As Java evolves, Piotr’s insights ensure developers remain equipped to navigate its concurrency landscape.
[DevoxxPL2022] Did Anyone Say SemVer? • Philipp Krenn
Philipp Krenn, a developer advocate at Elastic, captivated audiences at Devoxx Poland 2022 with a witty and incisive exploration of semantic versioning (SemVer). Drawing from Elastic’s experiences with Elasticsearch, Philipp dissected the nuances of versioning, revealing why SemVer often ignites passionate debates. His talk navigated the ambiguities of defining APIs, the complexities of breaking changes, and the cultural dynamics of open-source versioning, offering a pragmatic lens for developers grappling with version management.
Decoding Semantic Versioning
Philipp introduced SemVer, as formalized on semver.org, with its major.minor.patch structure, where patch fixes bugs, minor adds features, and major introduces breaking changes. This simplicity, however, belies complexity in practice. He posed a sorting challenge with version strings like alpha.-, 2.-, and 11.-, illustrating SemVer’s arcane precedence rules, humorously cautioning against such obfuscation unless “trolling users.” Philipp noted that SemVer’s focus on APIs raises fundamental questions: what constitutes an API? For Elasticsearch, the REST API is sacrosanct, warranting major version bumps for changes, whereas plugin APIs, exposing internal Java packages, tolerate frequent breaks, sparking user frustration when plugins fail.
The Ambiguity of Breaking Changes
The definition of a breaking change varies by perspective, Philipp argued. Upgrading a supported JDK version, for instance, divides opinions—some view it as a system-altering break, others as an implementation detail. Security fixes further muddy the waters, as seen in Elastic’s handling of unintended insecure usage, where API “fixes” disrupted user workflows. Philipp cited the Log4j2 vulnerability, where maintainers supported multiple JDK versions across minor releases, avoiding major version increments. Accidental breaks, common in open-source projects, and asymmetric feature additions—easy to add, hard to remove—compound SemVer’s challenges, often leading to user distrust when expectations misalign.
Cultural and Practical Dilemmas
Philipp explored why SemVer debates are so heated, attributing it to differing interpretations of “correct” versioning. He critiqued version ranges, prevalent in npm but rare in Java, for introducing instability due to transitive dependency updates, advocating for tools like Dependabot to manage updates explicitly. Experimental APIs, marked as unstable, offer an escape hatch for breaking changes without major version bumps, though they demand diligent release note scrutiny. Pre-1.0 versions, dubbed the “Wild West,” lack SemVer guarantees, enabling unfettered changes but risking user confusion. Philipp contrasted SemVer with alternatives like calendar versioning, used by Ubuntu, noting its decline as SemVer dominates modern ecosystems.