How to Bypass Elasticsearch’s 10,000-Result Limit with the Scroll API
If you’ve ever worked with the Elasticsearch API, you’ve likely run into its infamous 10,000-result limit. It’s a default cap that can feel like a brick wall when you’re dealing with large datasets—think log analysis, report generation, or bulk data exports. Fortunately, there’s a slick workaround: the Scroll API. In this post, I’ll walk you through why this limit exists, how the Scroll API solves it, and share practical examples to get you started.
Why the 10,000-Result Limit Exists
Elasticsearch caps standard search results at 10,000 hits by default (the `index.max_result_window` setting) to protect performance. Deep pagination with the `from` and `size` parameters forces each shard to build and sort the entire result window in memory, which gets expensive fast. But what if you need all that data? That's where the Scroll API shines: it's designed for deep pagination, letting you retrieve everything in manageable chunks.
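You can see the wall for yourself. Page past 10,000 with a standard request (using a hypothetical `my_index`) and Elasticsearch refuses with an error along the lines of "Result window is too large":

```
GET /my_index/_search
{
  "from": 10000,
  "size": 100,
  "query": {
    "match_all": {}
  }
}
```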
What Is the Scroll API?
Unlike a typical search, the Scroll API maintains a temporary “scroll context” on the server. You grab a batch of results, get a scroll_id, and use it to fetch the next batch—no need to rerun your query. It’s efficient, scalable, and perfect for big data tasks.
How to Use the Scroll API: Step by Step
Let’s break it down with examples you can try yourself.
Step 1: Start the Scroll
Kick things off with a search request. Add the `scroll` parameter (like `1m` for a one-minute timeout) and set `size` to control your batch size. Here's a basic example:
```
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
```
This pulls the first 1,000 hits and returns a `_scroll_id`, a long, encoded string you'll need for the next step.
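The response looks roughly like this (trimmed for readability; the exact shape of `hits.total` varies by Elasticsearch version, and your `_scroll_id` will differ):

```
{
  "_scroll_id": "c2NhbjsxMDAwO...",
  "took": 12,
  "timed_out": false,
  "hits": {
    "total": { "value": 10000, "relation": "gte" },
    "hits": [ ...the first 1,000 documents... ]
  }
}
```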
Step 2: Fetch More Results
Using that `scroll_id`, request the next batch. You don’t need to repeat the query—just send the ID and timeout:
```
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}
```
Loop this call until you've retrieved all your data; you're done when a response comes back with an empty `hits.hits` array. Each response includes a `_scroll_id` (often identical between calls, depending on the version), so keep updating the one you send.
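The final response, signaling the end of your data, looks like this (again trimmed):

```
{
  "_scroll_id": "c2NhbjsxMDAwO...",
  "hits": {
    "hits": []
  }
}
```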
Step 3: Clean Up
When you’re done, delete the scroll context to free up server resources. It’s a small but critical step:
```
DELETE /_search/scroll/c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE...
```
Skip this, and you’ll leave dangling contexts that could bog down your cluster.
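Two variations worth knowing: you can pass the ID in the request body instead of the URL, and you can wipe every open scroll context at once with `_all` (handy in development, but be careful with it on a shared cluster):

```
DELETE /_search/scroll
{
  "scroll_id": "c2NhbjsxMDAwO...YOUR_SCROLL_ID_HERE..."
}

DELETE /_search/scroll/_all
```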
A Real-World Example
Let’s say you’re sifting through millions of logs for a specific error. Here’s a targeted scroll query:
```
GET /logs/_search?scroll=2m
{
  "size": 500,
  "query": {
    "match": {
      "error_message": "timeout"
    }
  }
}
```
Then, use the Scroll API to paginate through every matching log entry. It’s way cleaner than hacking around with `from` and `size`.
Tips for Scroll API Success
- Batch Size: Stick to a `size` of 500–1000. Too large, and you'll strain memory; too small, and you'll make too many requests.
- Timeout Tuning: Set the scroll duration (e.g., `1m`, `5m`) based on how fast your script processes each batch. Too short, and the context expires mid-run.
- Automation: Use a script to handle the loop. With Python's `elasticsearch` library, for instance, the whole open-fetch-clean-up cycle fits in a few lines:
```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Open the scroll with the initial search
scroll = es.search(index="logs", scroll="2m", size=500, body={"query": {"match": {"error_message": "timeout"}}})
scroll_id = scroll["_scroll_id"]

# Keep fetching until a batch comes back empty
while scroll["hits"]["hits"]:
    print(scroll["hits"]["hits"])  # Process this batch
    scroll = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = scroll["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)  # Clean up the scroll context
```
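And if you'd rather not hand-roll the loop at all, the library ships a `scan` helper in `elasticsearch.helpers` that wraps this exact pattern, cleanup included. Here's a minimal sketch, assuming the same local cluster and `logs` index as above:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])

# scan() opens the scroll, pages through every hit, and clears the context for you
for doc in scan(es, index="logs", query={"query": {"match": {"error_message": "timeout"}}}, scroll="2m", size=500):
    print(doc["_source"])  # Process one document at a time
```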
Why Scroll Beats the Alternatives
You could raise `index.max_result_window` to push the limit higher, but that just trades the cap for memory pressure: every deep page still has to be built and sorted on each shard. Export tools or aggregations might work for summaries, but for raw data retrieval, Scroll is king: efficient and built for the job.
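For reference, raising the cap is a one-line dynamic settings change (shown here against the same hypothetical `my_index`); just remember that every deep page you then request gets buffered on each shard:

```
PUT /my_index/_settings
{
  "index": {
    "max_result_window": 50000
  }
}
```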
Conclusion
The Scroll API has been a game-changer for my Elasticsearch projects, especially when wrestling with massive indices. It’s simple once you get the hang of it, and the payoff is huge.