Incrementally updating Search

Context

Content crawled by Sitecore Search needs to be updated regularly, but we rarely need the index to be updated for all website content at the same time. Sites are indexed periodically via crawlers and need to be incrementally updated when a content editor modifies a page, without re-indexing the entire site.

This article discusses how to integrate Sitecore Search with XM Cloud. If you need to integrate a different search or content provider, please consider these techniques when deciding if they can be applied to your scenario.

Any functionality that depends on incremental updates, including the following, must have Snapshot Publishing configured in XM Cloud.

The provided code is intended as a guideline and must be tailored to your specific implementation requirements. An end-to-end implementation needs to be set up with your full authoring lifecycle and requirements in mind. Please ensure thorough end-to-end testing is conducted to validate its functionality and performance in your environment.

Execution

Let’s consider that the website pages are indexed by using the following metadata: og:title, og:description, og:id and og:category.

The og:title is populated from the page Title, while og:description uses the page’s Content field. The property og:id is the page item ID, while og:category is a static value that depends on the website section - for example, the page /our-products/men/t-shirt-abc could have the og:category value Product.
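To make the mapping concrete, here is a minimal Python sketch that renders the four meta tags from the page fields. The function name `build_meta_tags` and the use of the `property` attribute are assumptions made for illustration; in a real XM Cloud head application the tags would be emitted by the front-end framework.

```python
from html import escape


def build_meta_tags(title: str, content: str, item_id: str, category: str) -> str:
    """Render the four meta tags used to index a page (hypothetical helper)."""
    properties = {
        "og:title": title,          # page Title field
        "og:description": content,  # page Content field
        "og:id": item_id,           # page item ID, used as the document id in Search
        "og:category": category,    # static value per website section
    }
    # Escape values so quotes or ampersands in content cannot break the markup.
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in properties.items()
    )


print(build_meta_tags("T-Shirt ABC", "A soft cotton t-shirt.",
                      "02818D14-7B0B-45FB-A91A-B5E94C6AF773", "Product"))
```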

The site is indexed initially via a crawler, and from then on, once a page is modified it is to be indexed individually.

High Level Architecture, Sitecore products and a middleware

To meet the requirement of indexing a page as soon as it is modified, we have to implement incremental update indexing. This relies on a Search web crawler that accepts incremental updates: the modified page is not re-crawled; instead, its document is updated via an API PUT or PATCH request and re-indexed in the next available batch.

The following process adds an update operation to the Search index build queue. Updates may be slightly delayed while that queue is processed, but the delay is negligible compared to waiting for the next scheduled website crawl (e.g. 24 hours). Index updates are therefore not ‘immediate’ but ‘near-immediate’.

XM Cloud & Search setup

First, we need to make sure that the metadata is added to the page header in XM Cloud, for example:

The field og:id is used as the document ID in Search.

Create a web crawler in Search. The Document Extractor could be something like:

It is important to enable Incremental Updates in the source:

Incremental Updates in Sitecore Search Crawler

If the crawler is going to be scheduled, which is the case most of the time, it is recommended to schedule it daily or weekly.

Middleware

Now that XM Cloud and Search are set up, we need to move on to the Middleware to set up the incremental update. In this recipe, we focus on content updates to existing pages; see Considerations and Insights for more information on Create/Delete and handling large updates. The Middleware can be implemented in any technology you prefer. The main steps to implement are always the same:

Receive the webhook notification from Experience Edge - The webhook payload is JSON, in the format of the snippet named “Webhook notification format”. The number of entries in updates equals the number of items that XM Cloud sends to Edge, so a single large publishing operation can generate more than one webhook notification, forming a chain. The chain is identified by the field invocation_id and ends when the field continues is false. You can process one notification at a time, or wait for the whole chain from Experience Edge (the final batch has continues: false) before processing it; this is an implementation choice.
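If you choose to wait for the whole chain, the buffering could look like the following sketch. The class name `NotificationChain` and the in-memory dictionary are assumptions made for illustration; a serverless middleware would need durable storage instead. Only the payload fields named above (invocation_id, updates, continues) are used.

```python
from collections import defaultdict


class NotificationChain:
    """Collects the updates of webhook notifications sharing an invocation_id."""

    def __init__(self):
        # invocation_id -> updates received so far (in-memory, illustration only)
        self._pending = defaultdict(list)

    def add(self, notification: dict):
        """Store one notification; return all updates once the chain ends, else None."""
        chain_id = notification["invocation_id"]
        self._pending[chain_id].extend(notification.get("updates", []))
        if notification.get("continues", False):
            return None  # more notifications will follow for this chain
        return self._pending.pop(chain_id)


chain = NotificationChain()
chain.add({"invocation_id": "abc", "updates": [{"identifier": "A"}], "continues": True})
complete = chain.add({"invocation_id": "abc", "updates": [{"identifier": "B"}], "continues": False})
print(complete)
```

Processing each notification as it arrives avoids keeping state, at the cost of more calls to Experience Edge.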

Filter the updates array by selecting the entries whose entity_definition equals LayoutData.
It is important to remove the ‘-layout’ token from the identifier and format it as a GUID, otherwise the GraphQL query may not work.
For each identifier, retrieve the Title and the Content from Experience Edge via GraphQL (the GraphQL query and the fields to retrieve can change based on your requirements; in this recipe, we are going to patch the title and the content of the document).

This can be done in a single query; for example, the query below retrieves more than one item in a single call. As you can see, one or more template IDs can be used to make sure only the expected content is retrieved.
Because we are updating existing pages, the Category field is not retrieved: it relates to the section of the website and is not expected to change.
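The filtering and identifier clean-up described in the steps above can be sketched as follows. The function name `layout_item_ids` is hypothetical, and the uppercase dashed GUID output is an assumption; adjust the format to whatever your GraphQL query expects.

```python
import uuid


def layout_item_ids(updates: list) -> list:
    """Return item IDs, formatted as GUIDs, for the LayoutData entries only."""
    ids = []
    for update in updates:
        if update.get("entity_definition") != "LayoutData":
            continue  # ignore entries that are not layout data
        # Remove the '-layout' token before parsing the identifier.
        raw = update["identifier"].replace("-layout", "")
        # uuid.UUID accepts 32 hex digits with or without dashes and braces.
        ids.append(str(uuid.UUID(raw)).upper())
    return ids


updates = [
    {"identifier": "02818D147B0B45FBA91AB5E94C6AF773-layout", "entity_definition": "LayoutData"},
    {"identifier": "02818D147B0B45FBA91AB5E94C6AF773", "entity_definition": "Item"},
]
print(layout_item_ids(updates))
```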

The GraphQL result is JSON, as in the snippet below. Loop over it to patch Search one document at a time. The GraphQL result could contain more items than expected, so it is important to compare the id in results with the identifier in the webhook notification.
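A minimal sketch of that comparison is shown below. The shape of each result (a flat dictionary with `id`, `title` and `content`) is an assumption for illustration; the real shape depends on your GraphQL query. Both sides are normalized through `uuid.UUID` so that differences in dashes, braces or letter case do not cause false mismatches.

```python
import uuid


def canonical(item_id: str) -> str:
    """Normalize an item ID to a lowercase dashed GUID string."""
    return str(uuid.UUID(item_id))


def select_requested(results: list, requested_ids: list) -> dict:
    """Keep only the results whose id was present in the webhook notification."""
    wanted = {canonical(i) for i in requested_ids}
    selected = {}
    for item in results:
        key = canonical(item["id"])
        if key in wanted:
            selected[key] = {"title": item.get("title"), "content": item.get("content")}
    return selected


results = [
    {"id": "02818D147B0B45FBA91AB5E94C6AF773", "title": "T-Shirt ABC", "content": "Soft cotton."},
    {"id": "11111111111111111111111111111111", "title": "Unrelated", "content": "Skip me."},
]
print(select_requested(results, ["02818D14-7B0B-45FB-A91A-B5E94C6AF773"]))
```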

At the time of writing, batch updates are not supported in Search: a single request updates a single document. Pay attention to the following when generating the PATCH operation:

- The Search Ingestion API endpoint could be geo specific.
- The URL has to contain the documentID that, in this case, is the itemID (og:id) in lowercase format.
- The URL has to contain the entityType that, in this case, is content, but it could be different if you are using a multi-entity tenant.
- The URL has to contain the locale that, in this case, is en_us, but it could be different if you are using a multi-locale source.

For example, to patch {02818D14-7B0B-45FB-A91A-B5E94C6AF773}, the URL is as follows - where 1234567890 is the domain ID and 09876543 is the source ID:
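The points above can be combined in a small URL builder. This is a sketch under stated assumptions: the host passed as `base_url` and the `/ingestion/v1/...` path layout are assumptions that should be verified against the Ingestion API reference for your region, and `build_document_url` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode


def build_document_url(base_url: str, domain_id: str, source_id: str, item_id: str,
                       entity: str = "content", locale: str = "en_us") -> str:
    """Build the document URL for a PATCH or PUT call (path layout is assumed)."""
    # The document ID is the item ID in lowercase; braces are dropped here,
    # which assumes og:id was emitted without them.
    document_id = item_id.strip("{}").lower()
    path = (
        f"/ingestion/v1/domains/{quote(domain_id)}/sources/{quote(source_id)}"
        f"/entities/{quote(entity)}/documents/{quote(document_id)}"
    )
    return f"{base_url.rstrip('/')}{path}?{urlencode({'locale': locale})}"


# The host below is an assumed, possibly geo-specific endpoint.
print(build_document_url("https://discover.sitecorecloud.io", "1234567890", "09876543",
                         "{02818D14-7B0B-45FB-A91A-B5E94C6AF773}"))
```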

An example of PATCH operation is the following snippet:
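As a hedged illustration, the request could be prepared with Python's standard library as below. The body envelope (`document` containing `fields`) and the use of the API key in the `Authorization` header are assumptions to check against the Ingestion API reference; the attribute names name and description are the ones used in this recipe. The helper `build_patch_request` is hypothetical and only builds the request without sending it.

```python
import json
import urllib.request


def build_patch_request(url: str, api_key: str, title: str, content: str) -> urllib.request.Request:
    """Prepare a PATCH request that updates only the name and description attributes."""
    body = {"document": {"fields": {"name": title, "description": content}}}  # assumed envelope
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="PATCH",
    )


request = build_patch_request("https://example.invalid/documents/abc?locale=en_us",
                              "my-api-key", "T-Shirt ABC", "A soft cotton t-shirt.")
print(request.get_method(), request.full_url)
```

Sending it with `urllib.request.urlopen(request)` raises `urllib.error.HTTPError` for a 404 response, which is useful when deciding whether the document still has to be created.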

Experience Edge

Once the Middleware is set up, it needs to be registered via webhook as OnUpdate:

Considerations

When setting this up end to end, there are other areas that need to be considered based on your requirements:

Consider the creation of new pages in XM Cloud (documents in Search) in comparison to updates. It could be implemented by executing a create document if the patch operation returned 404. Another alternative is using update - the PUT (update) and PATCH (partially update) methods differ:
- Use PUT (update) to replace all fields with the values you pass and delete fields you do not pass. If the document does not exist, the PUT operation returns 200 and creates the document.
- Use PATCH to replace the fields you pass and leave other fields as-is. If the document does not exist, the PATCH operation returns 404.
For this reason, in our case, if you implement PUT (update) then you need to pass the attributes name, description and category.
In this recipe, we use the meta data to index a page - the page content could be indexed by retrieving the rendered field in the GraphQL query.
Regarding the page locale, only en is used. The locale is part of the webhook notification, in the field entity_culture. You can leverage it to work with a multi-locale source in Search.
The Middleware needs to handle the removal of existing pages. In this case you should receive a notification from Experience Edge with operation: 'Delete', allowing you to follow the same pattern to implement a delete operation for Search. As per regular publishing practices, the parent of the deleted item needs to be published with its subitems to trigger the unpublish.
Web pages generated by the front-end application often use multiple, external datasources in addition to the content inside of XM Cloud, in which case you would need to retrieve this additional information from the middleware and update the documents in Search accordingly.
Updates are always coarse because it could be very complex to selectively identify from the source which attributes have been modified for each document.
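The first consideration above, creating the document when the patch returns 404, can be sketched as follows. The function `patch_or_create` and its two callables are hypothetical: `patch` and `create` stand in for whatever HTTP calls your middleware makes, and each is assumed to return the HTTP status code.

```python
from typing import Callable

# Attributes sent on a partial update; category is only needed when creating.
PATCHED_ATTRIBUTES = ("name", "description")


def patch_or_create(document_id: str, attributes: dict,
                    patch: Callable[[str, dict], int],
                    create: Callable[[str, dict], int]) -> int:
    """Try a partial update first; create the document if it does not exist."""
    partial = {k: attributes[k] for k in PATCHED_ATTRIBUTES if k in attributes}
    status = patch(document_id, partial)
    if status == 404:
        # The document is unknown to Search, so pass the full attribute set.
        return create(document_id, attributes)
    return status


attributes = {"name": "T-Shirt ABC", "description": "Soft cotton.", "category": "Product"}
# Stub callables simulate a missing document followed by a successful create.
print(patch_or_create("abc", attributes, lambda d, f: 404, lambda d, f: 200))
```

This keeps the common path to a single request and only pays for a second call when a page is new.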

Insights

It makes sense to ask if and when to use the incremental approach. First of all we must consider that Sitecore Search’s ingestion API has an upper limit on daily requests. Furthermore, as mentioned, updates are not immediate, so this approach may not be ideal for cases when indexing large portions of the site or the entire site. It is seen as a good compromise to use it to update content promptly without necessarily having to re-index the entire site.

A basic solution is to define a new boolean page property in XM Cloud named Reindex:

When it is ‘true’, the middleware processes the updates. This property could be ‘false’ by default and set to ‘true’ when needed:

The Reindex property editable in Explorer

The middleware could use a query like the following to leverage the Reindex attribute:

If the item has Reindex set to ‘true’, the middleware sends the incremental update to Search. If it is ‘false’, the middleware skips the item.
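A minimal sketch of that decision is shown below. How the Reindex value arrives (a boolean, or a string such as "1" or "true") depends on your GraphQL query, so the accepted representations here are assumptions, as are the helper names.

```python
def wants_reindex(item: dict, field_name: str = "Reindex") -> bool:
    """Return True when the item's Reindex flag is set (accepted values are assumed)."""
    raw = item.get(field_name)
    if isinstance(raw, bool):
        return raw
    # A missing field becomes "none" and is therefore treated as false.
    return str(raw).strip().lower() in {"1", "true"}


def items_to_update(items: list) -> list:
    """Keep only the items the middleware should send to Search."""
    return [item for item in items if wants_reindex(item)]


items = [
    {"id": "a", "Reindex": "1"},
    {"id": "b", "Reindex": ""},
    {"id": "c"},
    {"id": "d", "Reindex": True},
]
print([item["id"] for item in items_to_update(items)])
```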

Similarly, a more advanced middleware could manage an additional page property named ReindexDescendants that, when ‘true’, causes the middleware to also update the page’s descendants in Search when the page’s identifier appears in a notification.

Consider that these page attributes do not change the way XM Cloud publishes, nor how Experience Edge notifies the middleware via webhook; they are just attributes like name and description. Reindex and ReindexDescendants could be used by the middleware to make a decision on how to perform incremental update against Search.

How to protect from large updates

The Reindex property is one way to control incremental updates, but it may not be enough. For example, a user could set it to ‘true’ on all pages, or misuse the ReindexDescendants property, and then republish the site. In this case, Search would be overwhelmed by individual operations.

A smarter middleware could instead take a stateful approach and use XM Cloud webhook notifications in addition to Experience Edge notifications. The middleware could use the publish:begin event (an XM Cloud notification) to determine whether a full site publish is in progress:

or if it is not:

Single publish event

If it is a full site publish operation, the middleware suspends updates to Search until an OnEnd notification (an Experience Edge notification) arrives. All notifications between publish:begin and OnEnd are dropped.
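The suspend-and-drop behaviour could be sketched as a small state holder. The class `PublishGate` and its method names are hypothetical, the state is kept in memory for illustration only, and deciding whether a publish:begin event describes a full site publish is left to the caller because it depends on the event payload.

```python
class PublishGate:
    """Drops Experience Edge updates while a full site publish is in progress."""

    def __init__(self):
        self.suspended = False

    def on_publish_begin(self, full_site: bool) -> None:
        # Called for the XM Cloud publish:begin event.
        if full_site:
            self.suspended = True

    def on_edge_update(self, notification: dict):
        """Return the notification if it should be processed, or None if dropped."""
        return None if self.suspended else notification

    def on_end(self) -> None:
        # Called for the Experience Edge OnEnd notification.
        self.suspended = False


gate = PublishGate()
gate.on_publish_begin(full_site=True)
dropped = gate.on_edge_update({"invocation_id": "abc"})
gate.on_end()
processed = gate.on_edge_update({"invocation_id": "def"})
print(dropped, processed)
```

After a dropped full site publish, the next scheduled crawl is what brings the index back in line with the site.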
