Working with XPaths
How to control what content is indexed and used in search results
We made the Site Search 360 crawler as intelligent as possible when it comes to analyzing your website and picking the right title, image, and content for your search results.
Nonetheless, it might still be necessary to fine-tune your indexing rules by pointing the crawler directly to the desired content or by excluding unwanted pieces of information from being indexed and, therefore, used in search.
This can be done via XPath expressions placed in the Site Search 360 control panel. You can set up the general rules on the Crawler Settings page and content group-specific rules on the Content Groups page (if you're using any).
In the following video, we show you how to configure your search engine to only index the content that your visitors are searching for.
Check out these steps:
First, search for and install the Google Chrome extension called "XPath Helper". It will allow you to easily define XPaths right from your own site.
Navigate to one of your website's pages. Press the XPath Helper icon in the top right corner of your browser to open the black overlay which will reveal the currently selected XPath expression.
Now we want to extract the main content. After opening the XPath Helper, hold the [Shift] key and hover your mouse over your website's elements.
You will see how the extension highlights them in yellow while displaying the XPath query in the black overlay box. As you move your mouse, this XPath query will change. Try to get all the content you are targeting highlighted in yellow.
The Result half of the black overlay box allows you to preview the targeted content.
Tweak your XPath expression by shortening it. There are two ways of shortening an XPath query: you can remove steps from the end to select a higher-level element that contains more child nodes, or you can keep the tail and cut off the head to make it match more generally. Make sure your XPath always starts with // when shortening from the front. You can find a live example of this step in the video.
Copy the XPath query over to the Site Search 360 control panel and place it under Indexing Control > Crawler Settings in the appropriate XPath section: Title XPaths, Image XPaths, Include and Exclude Content XPaths, or Search Snippet XPath.
Press the "Test" button and enter your webpage URL to test the XPath query. If everything is fine, you will see the extracted content, headline, or image URL below. You can also re-index a specific URL to check what's going to be extracted from this page all at once.
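The two ways of shortening an XPath described in the steps above can be tried out offline. The following is a minimal sketch in Python, assuming a small hypothetical page and using the standard library's ElementTree, which supports only a subset of XPath (a leading // is written as .// there). It is not how the Site Search 360 crawler itself evaluates expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div id="wrapper">
      <main>
        <article>
          <h1>Working with XPaths</h1>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </article>
      </main>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)


def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]


# Long, position-dependent path, similar to what a helper tool might show first.
full_path = texts("./body/div/main/article/p")

# Head cut off: the leading // (.// in ElementTree) matches at any depth,
# so the expression keeps working if the wrapper elements change.
short_from_front = texts(".//article/p")

# Tail cut off: selecting the parent <article> captures all of its child nodes.
article = root.find(".//article")
short_from_end = [child.tag for child in article]

print(full_path)
print(short_from_front)
print(short_from_end)
```

Both the long and the front-shortened expression select the same two paragraphs here, while the end-shortened one selects the whole article, heading included.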
Default XPaths and common strategies for your search results
You can use XPath expressions (one per line) for:
Title XPaths pointing to the main title of the page. Default is //h1, i.e. the crawler takes your <h1> heading. Other common scenarios include //title, to pick up the page title tag content, or sometimes //h2. Change it according to your site structure.
Title Regexp allows you to apply a regular expression condition on the extracted titles, if you need even more control. For example, you might have your brand or company name repeated in every page title:
<title>Working with XPaths – Site Search 360</title>
To only use the "Working with XPaths" part as a search result title, use //title as your Title XPath and add ([^–])+ as Title Regexp, and the "– Site Search 360" part will be cut off.
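To see what the expression ([^–])+ matches, you can test it locally before saving it. The sketch below uses Python's re module. The helper name apply_title_regexp is hypothetical, and taking the first match and trimming surrounding whitespace is an assumption about how the result is used, not a description of the crawler's internals.

```python
import re

# The expression from the example: one or more characters that are not an en dash.
TITLE_REGEXP = r"([^–])+"


def apply_title_regexp(title, pattern=TITLE_REGEXP):
    """Return the first match of the pattern in the title, trimmed.

    Falls back to the unchanged title when nothing matches (assumed behaviour).
    """
    match = re.search(pattern, title)
    return match.group(0).strip() if match else title


print(apply_title_regexp("Working with XPaths – Site Search 360"))
```

The match stops at the first en dash, so everything from the dash onwards is dropped. Note that the dash in the expression must be the same character as the one in your page titles: a plain hyphen (-) in the title would not be cut off by this pattern.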
Image XPaths pointing to the main picture on your page. These images, if available, are automatically shown as search result thumbnails. Leave this field empty if our default crawler settings work well for your site or adjust to point to a specific image instead. For example:
If your images are lazy-loaded, try something similar to the following pattern:
You can also tell the crawler to ignore all images by toggling "Extract Images" off. Alt texts and captions can be indexed separately.
Default Image XPath pointing to the default image to be used when no other image is found. For example,
Include Content XPaths pointing to the content blocks that should be indexed. One XPath per line. Leave empty if everything should be indexed.
Exclude Content XPaths pointing to the content blocks that should be ignored by the crawler. One XPath per line. Leave empty if everything should be indexed.
Search Snippet XPath pointing to the content that you want to display in the search results. By default, we show the content around the terms matching the search query.
Another common strategy would be using your page meta descriptions instead. That's why //meta[@name="description"]/@content is pre-filled for you. To start showing meta descriptions in your search snippets, go to Search Settings and change the Search Snippet Source.
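As a rough illustration of what //meta[@name="description"]/@content selects, here is a minimal Python sketch on a hypothetical, well-formed page. Python's standard ElementTree cannot select attributes with /@content, so the sketch finds the meta element first and then reads its content attribute, which yields the same value. The page markup and the function name snippet_from_meta are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the meta tag is self-closed so that it parses as XML.
PAGE = """
<html>
  <head>
    <title>Working with XPaths – Site Search 360</title>
    <meta name="description" content="How to control what content is indexed."/>
  </head>
  <body>
    <h1>Working with XPaths</h1>
  </body>
</html>
""".strip()


def snippet_from_meta(page_source):
    """Return the meta description of a page, or None if there is none."""
    root = ET.fromstring(page_source)
    # ElementTree equivalent of //meta[@name="description"]
    meta = root.find(".//meta[@name='description']")
    # Equivalent of the trailing /@content step
    return meta.get("content") if meta is not None else None


print(snippet_from_meta(PAGE))
```

If the function returns None for one of your pages, that page has no meta description, and a snippet based on this XPath would have nothing to show for it.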