Posts Tagged ‘Elastic’
How to Bypass Elasticsearch’s 10,000-Result Limit with the Scroll API
Why the 10,000-Result Limit Exists
What Is the Scroll API?
How to Use the Scroll API: Step by Step
Step 1: Start the Scroll
```
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
```
Step 2: Fetch More Results
```
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}
```
Step 3: Clean Up
```
DELETE /_search/scroll/c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE...
```
A Real-World Example
```
GET /logs/_search?scroll=2m
{
  "size": 500,
  "query": {
    "match": {
      "error_message": "timeout"
    }
  }
}
```
- Batch Size: Stick to a `size` like 500–1000. Too large, and you’ll strain memory; too small, and you’ll make too many requests.
- Timeout Tuning: Set the scroll duration (e.g., `1m`, `5m`) based on how fast your script processes each batch. Too short, and the context expires mid-run.
- Automation: Use a script to handle the loop. Python’s `elasticsearch` client, for instance, makes the scroll loop straightforward:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Open the scroll context and fetch the first batch
scroll = es.search(index="logs", scroll="2m", size=500,
                   body={"query": {"match": {"error_message": "timeout"}}})
scroll_id = scroll["_scroll_id"]

while len(scroll["hits"]["hits"]):
    print(scroll["hits"]["hits"])  # Process this batch
    scroll = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = scroll["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # Cleanup
```
Why Scroll Beats the Alternatives
Conclusion
Elastic APM: When to Use @CaptureSpan vs. @CaptureTransaction?
If you’re working with Elastic APM in a Java application, you might wonder when to use `@CaptureSpan` versus `@CaptureTransaction`. Both are powerful tools for observability, but they serve different purposes.
🔹 `@CaptureTransaction`:
Use this at the entry point of a request, typically a controller method, a service entry point, or a background job. It marks the start of a transaction and lets you trace how a request propagates through your system.
🔹 `@CaptureSpan`:
Use this to track sub-operations within a transaction, such as database queries, HTTP calls, or specific business logic. It helps break down execution time and pinpoint performance bottlenecks inside a transaction.
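To make the split concrete, here is a minimal sketch: a hypothetical `OrderService` (invented for illustration, not from any particular codebase) with the entry point annotated `@CaptureTransaction` and its sub-operations annotated `@CaptureSpan`. It assumes the `co.elastic.apm:apm-agent-api` dependency on the classpath and the Elastic APM Java agent attached to the JVM, since the annotations do nothing without the agent’s instrumentation.

```java
import co.elastic.apm.api.CaptureSpan;
import co.elastic.apm.api.CaptureTransaction;

// Hypothetical service, used only to illustrate annotation placement.
public class OrderService {

    // Entry point: the agent starts a transaction named "process-order"
    // when this method is invoked outside an existing transaction.
    @CaptureTransaction("process-order")
    public void processOrder(String orderId) {
        validateOrder(orderId);
        persistOrder(orderId);
    }

    // Sub-operation: recorded as a span inside the enclosing transaction.
    @CaptureSpan("validate-order")
    void validateOrder(String orderId) {
        // business validation logic
    }

    // Another span, e.g. the database write you want to time separately.
    @CaptureSpan("persist-order")
    void persistOrder(String orderId) {
        // database write
    }
}
```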
📌 Best Practices:
✅ Apply `@CaptureTransaction` at the highest-level method handling a request.
✅ Use `@CaptureSpan` for key internal operations you want to monitor.
✅ Avoid excessive spans—instrument only critical code paths to reduce overhead.
By balancing these annotations effectively, you can get detailed insights into your app’s performance while keeping APM overhead minimal.
[DevoxxUK2024] Is It (F)ake?! Image Classification with TensorFlow.js by Carly Richmond
Carly Richmond, a Principal Developer Advocate at Elastic, captivated the DevoxxUK2024 audience with her engaging exploration of image classification using TensorFlow.js. Inspired by her love for the Netflix show Is It Cake?, Carly embarked on a project to build a model distinguishing cakes disguised as everyday objects from their non-cake counterparts. Despite her self-professed lack of machine learning expertise, Carly’s journey through data gathering, pre-trained models, custom model development, and transfer learning offers a relatable and insightful narrative for developers venturing into AI-driven JavaScript applications.
Gathering and Preparing Data
Carly’s project begins with the critical task of data collection, a foundational step in machine learning. To source images of cakes resembling other objects, she leverages Playwright, a JavaScript-based automation framework, to scrape images from bakers’ websites and Instagram galleries. For non-cake images, Carly utilizes the Unsplash API, which provides royalty-free photos with a rate-limited free tier. She queries categories like reptiles, candles, and shoes to align with the deceptive cakes from the show. However, Carly acknowledges limitations, such as inadvertently including biscuits or company logos in the dataset, highlighting the challenges of ensuring data purity with a modest set of 367 cake and 174 non-cake images.
Exploring Pre-Trained Models
To avoid building a model from scratch, Carly initially experiments with TensorFlow.js’s pre-trained models, Coco SSD and MobileNet. Coco SSD, trained on the Common Objects in Context (COCO) dataset, excels in object detection, identifying bounding boxes and classifying objects like cakes with reasonable accuracy. MobileNet, designed for lightweight classification, struggles with Carly’s dataset, often misclassifying cakes as cups or ice cream due to visual similarities like frosting. CORS issues further complicate browser-based MobileNet deployment, prompting Carly to shift to a Node.js backend, where she converts images into tensors for processing. These experiences underscore the trade-offs between model complexity and practical deployment.
Building and Refining a Custom Model
Undeterred by initial setbacks, Carly ventures into crafting a custom convolutional neural network (CNN) using TensorFlow.js. She outlines the CNN’s structure, which includes convolution layers to extract features, pooling layers to reduce dimensionality, and a softmax activation for binary classification (cake vs. not cake). Despite her efforts, the model’s accuracy languishes at 48%, plagued by issues like tensor shape mismatches and premature tensor disposal. Carly candidly admits to errors, such as mislabeling cakes as non-cakes, illustrating the steep learning curve for non-experts. This section of her talk resonates with developers, emphasizing perseverance and the iterative nature of machine learning.
Leveraging Transfer Learning
Recognizing the limitations of her dataset and custom model, Carly pivots to transfer learning, using MobileNet’s feature vectors as a foundation. By adding a custom classification head with ReLU and softmax layers, she achieves a significant improvement, with accuracy reaching 100% by the third epoch and correctly classifying 319 cakes. While not perfect, this approach outperforms her custom model, demonstrating the power of leveraging pre-trained models for specialized tasks. Carly’s comparison of human performance—90% accuracy by the DevoxxUK audience versus her model’s results—adds a playful yet insightful dimension, highlighting the gap between human intuition and machine precision.
[DevoxxGR2024] Butcher Virtual Threads Like a Pro at Devoxx Greece 2024 by Piotr Przybyl
Piotr Przybyl, a Java Champion and developer advocate at Elastic, captivated audiences at Devoxx Greece 2024 with a dynamic exploration of Java 21’s virtual threads. Through vivid analogies, practical demos, and a touch of humor, Piotr demystified virtual threads, highlighting their potential and pitfalls. His talk, rich with real-world insights, offered developers a guide to leveraging this transformative feature while avoiding common missteps. As a seasoned advocate for technologies like Elasticsearch and Testcontainers, Piotr’s presentation was a masterclass in navigating modern Java concurrency.
Understanding Virtual Threads
Piotr began by contextualizing virtual threads within Java’s concurrency evolution. Introduced in Java 21 under Project Loom, virtual threads address the limitations of traditional platform threads, which are costly to create and limited in number. Unlike platform threads, virtual threads are lightweight, managed by a scheduler that mounts and unmounts them from carrier threads during I/O operations. This enables a thread-per-request model, scaling applications to handle millions of concurrent tasks. Piotr likened virtual threads to taxis in a busy city like Athens, efficiently transporting passengers (tasks) without occupying resources during idle periods.
However, virtual threads are not a universal solution. Piotr emphasized that they do not inherently speed up individual requests but improve scalability by handling more concurrent tasks. Their API remains familiar, aligning with existing thread practices, making adoption seamless for developers accustomed to Java’s threading model.
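None of this requires learning a new API to try out. A minimal, self-contained sketch (not Piotr’s demo code) of the thread-per-task style in Java 21:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the thread-per-task model enabled by virtual threads.
public class VirtualThreadsSketch {
    public static void main(String[] args) {
        // One virtual thread per task; the JDK scheduler multiplexes them
        // onto a small pool of carrier (platform) threads.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                executor.submit(() -> {
                    try {
                        // Simulated blocking I/O: the virtual thread unmounts
                        // from its carrier while sleeping, freeing the carrier.
                        Thread.sleep(100);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // executor.close() waits for all submitted tasks to finish
    }
}
```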
Common Pitfalls and Pinning
A central theme of Piotr’s talk was “pinning,” a performance issue where a virtual thread remains tied to its carrier thread, negating the benefits. Pinning occurs during I/O or native calls within synchronized blocks, akin to keeping a taxi running during a lunch break. Piotr demonstrated this with a legacy Elasticsearch client, using Testcontainers and Toxiproxy to simulate slow network calls. By enabling tracing with `-Djdk.tracePinnedThreads=full`, he identified and resolved pinning issues, replacing synchronized methods with modern, non-blocking clients.
Piotr cautioned against misuses like pooling or reusing virtual threads, which defeat their cheap, disposable design. He advocated careful monitoring with JFR events to verify that threads remain unpinned, keeping performance optimal in production environments.
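The mechanics are easy to reproduce. The sketch below (hypothetical code, not from Piotr’s demo) shows the classic trap: before the pinning fixes in newer JDKs, a blocking call inside a `synchronized` block pins the virtual thread to its carrier, whereas a `ReentrantLock` lets it unmount. Running it with `-Djdk.tracePinnedThreads=full` prints a stack trace whenever pinning occurs.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustration of pinning; run with -Djdk.tracePinnedThreads=full.
public class PinningDemo {

    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();

    // Blocking while holding a monitor pins the virtual thread: the carrier
    // cannot be released to run other virtual threads during the sleep.
    void pinned() throws InterruptedException {
        synchronized (monitor) {
            Thread.sleep(1_000);
        }
    }

    // With an explicit lock, the virtual thread unmounts while blocked,
    // freeing its carrier.
    void unpinned() throws InterruptedException {
        lock.lock();
        try {
            Thread.sleep(1_000);
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        var demo = new PinningDemo();
        Thread.ofVirtual().start(() -> {
            try {
                demo.pinned();   // triggers a jdk.tracePinnedThreads report
                demo.unpinned(); // no report
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).join();
    }
}
```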
Structured Concurrency and Scoped Values
Piotr explored structured concurrency, a preview feature in Java 21, designed to eliminate thread leaks and cancellation delays. By creating scopes that manage forks, developers can ensure tasks complete or fail together, simplifying error handling. He demonstrated a shutdown-on-failure scope, where a single task failure cancels all others, contrasting this with the complexity of managing interdependent futures.
Scoped Values, another preview feature, offer an immutable, one-way alternative to thread-locals, preventing bugs like data leaking between pooled threads. Piotr illustrated their use in maintaining request context, warning against mutability to preserve reliability. These features, he argued, complement virtual threads, fostering robust, maintainable concurrent applications.
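As a rough illustration of the shutdown-on-failure scope Piotr demonstrated (the `Order` record and fetch methods here are invented for the example; the API is a preview in Java 21, so compile and run with `--enable-preview`):

```java
import java.util.concurrent.StructuredTaskScope;

// Sketch of a shutdown-on-failure scope: both subtasks succeed together,
// or the first failure cancels the sibling and propagates.
public class ScopeSketch {

    record Order(String user, String cart) {}

    Order loadOrder() throws Exception {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            var user = scope.fork(this::fetchUser); // each fork runs in its own virtual thread
            var cart = scope.fork(this::fetchCart);

            scope.join();          // wait for both subtasks to complete
            scope.throwIfFailed(); // if either failed, the other was cancelled

            return new Order(user.get(), cart.get());
        } // leaving the scope guarantees no forked task outlives it
    }

    String fetchUser() { return "user-42"; }
    String fetchCart() { return "cart-7"; }
}
```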
Practical Debugging and Best Practices
Through live coding, Piotr showcased how debugging with logging can inadvertently introduce I/O, unmounting virtual threads and degrading performance. He compared this to a concert where logging scatters tasks, reducing completion rates. To mitigate this, he recommended avoiding I/O in critical paths and using structured concurrency for monitoring.
Piotr’s best practices included using framework-specific annotations (e.g., Quarkus, Spring) to enable virtual threads and ensuring tasks are interruptible. He urged developers to test thoroughly, leveraging tools like Testcontainers to simulate real-world conditions. His blog post on testing unpinned threads provides further guidance for practitioners.
Conclusion
Piotr’s presentation was a clarion call to embrace virtual threads with enthusiasm and caution. By understanding their mechanics, avoiding pitfalls like pinning, and leveraging structured concurrency, developers can unlock unprecedented scalability. His engaging analogies and practical demos made complex concepts accessible, empowering attendees to modernize Java applications responsibly. As Java evolves, Piotr’s insights ensure developers remain equipped to navigate its concurrency landscape.
[DevoxxPL2022] Did Anyone Say SemVer? • Philipp Krenn
Philipp Krenn, a developer advocate at Elastic, captivated audiences at Devoxx Poland 2022 with a witty and incisive exploration of semantic versioning (SemVer). Drawing from Elastic’s experiences with Elasticsearch, Philipp dissected the nuances of versioning, revealing why SemVer often ignites passionate debates. His talk navigated the ambiguities of defining APIs, the complexities of breaking changes, and the cultural dynamics of open-source versioning, offering a pragmatic lens for developers grappling with version management.
Decoding Semantic Versioning
Philipp introduced SemVer, as formalized on semver.org, with its major.minor.patch structure, where patch fixes bugs, minor adds features, and major introduces breaking changes. This simplicity, however, belies complexity in practice. He posed a sorting challenge with version strings like alpha.-, 2.-, and 11.-, illustrating SemVer’s arcane precedence rules, humorously cautioning against such obfuscation unless “trolling users.” Philipp noted that SemVer’s focus on APIs raises fundamental questions: what constitutes an API? For Elasticsearch, the REST API is sacrosanct, warranting major version bumps for changes, whereas plugin APIs, exposing internal Java packages, tolerate frequent breaks, sparking user frustration when plugins fail.
The Ambiguity of Breaking Changes
The definition of a breaking change varies by perspective, Philipp argued. Upgrading a supported JDK version, for instance, divides opinions—some view it as a system-altering break, others as an implementation detail. Security fixes further muddy the waters, as seen in Elastic’s handling of unintended insecure usage, where API “fixes” disrupted user workflows. Philipp cited the Log4j2 vulnerability, where maintainers supported multiple JDK versions across minor releases, avoiding major version increments. Accidental breaks, common in open-source projects, and asymmetric feature additions—easy to add, hard to remove—compound SemVer’s challenges, often leading to user distrust when expectations misalign.
Cultural and Practical Dilemmas
Philipp explored why SemVer debates are so heated, attributing it to differing interpretations of “correct” versioning. He critiqued version ranges, prevalent in npm but rare in Java, for introducing instability due to transitive dependency updates, advocating for tools like Dependabot to manage updates explicitly. Experimental APIs, marked as unstable, offer an escape hatch for breaking changes without major version bumps, though they demand diligent release note scrutiny. Pre-1.0 versions, dubbed the “Wild West,” lack SemVer guarantees, enabling unfettered changes but risking user confusion. Philipp contrasted SemVer with alternatives like calendar versioning, used by Ubuntu, noting its decline as SemVer dominates modern ecosystems.