Adding custom search text extractors for files

Custom extractors allow you to use the attachment search feature of page indexes for non-default file types.

To define a custom search text extractor:

Create a class that implements the ISearchTextExtractor interface (CMS.Search namespace).
- You can add the class into your web project inside the App_Code folder (or CMSApp_AppCode -> Old_App_Code on web application projects) or as part of a custom assembly.
Define the ExtractContent method.
The ExtractContent method has the following parameters:
- BinaryData data - contains the data of the attachment file.
- ExtractionContext context - provides information about the page whose attachment is being indexed. Use the context.Culture property to get the culture code of the page’s language version.
The method must return an XmlData object. The system adds the object’s data to page search indexes for the given attachment. Use the XmlData.SetValue method to enter values into individual search index fields.
Register the extractor for a file extension by calling the SearchTextExtractorManager.RegisterExtractor method.
```
SearchTextExtractorManager.RegisterExtractor("extension", new CustomExtractorClass());
```
Note: You need to call the RegisterExtractor method at the start of your application (when the application and modules are being initialized).
Rebuild your page search indexes.

The search uses the custom extractor to index page attachments with the specified extension.
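As an illustration of the context parameter, the following minimal sketch reads legacy (non-Unicode) text attachments using the ANSI code page of the page's culture. It is not a definitive implementation. The class name LegacyTextExtractor is hypothetical. The sketch assumes that the attachment is plain text, that the culture code is a valid .NET culture name, that data.Stream is a standard System.IO.Stream, and that the site runs on the .NET Framework, where code page encodings are available by default.

```csharp
using CMS.Search;
using CMS.Base;
using CMS.Core;
using CMS.Helpers;
using CMS.DataEngine;

// Hypothetical extractor for legacy (non-Unicode) plain text attachments
public class LegacyTextExtractor : ISearchTextExtractor
{
    public XmlData ExtractContent(BinaryData data, ExtractionContext context)
    {
        if (data.Stream == null)
        {
            return null;
        }

        // context.Culture holds the culture code of the page's language version, e.g. "cs-CZ"
        System.Globalization.CultureInfo culture =
            System.Globalization.CultureInfo.GetCultureInfo(context.Culture);

        // Assumption: the file was saved in the ANSI code page matching the page culture
        System.Text.Encoding encoding =
            System.Text.Encoding.GetEncoding(culture.TextInfo.ANSICodePage);

        // Reads the whole attachment using the culture-specific encoding
        // (assumes data.Stream is a System.IO.Stream)
        data.Stream.Position = 0;
        string text = new System.IO.StreamReader(data.Stream, encoding).ReadToEnd();

        // Adds the text into the CONTENT search index field
        XmlData contentData = new XmlData();
        contentData.SetValue(SearchFieldsConstants.CONTENT, text);

        return contentData;
    }
}
```

The framework types are fully qualified so that they cannot be confused with similarly named classes in the CMS namespaces.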

Modifying custom extractors

The system stores the text content extracted from page attachments in the database. When rebuilding page indexes, the search loads the “cached” attachment text from the database. The text extraction process only runs for attachments that do not have any search content saved.

If you change the functionality of a custom extractor, you need to clear the attachment content from the database:

Log in to the Kentico administration interface and open the System application.
Select the Files -> Attachments tab.
Click Clear attachment search cache.

You can then Rebuild your page indexes, which updates the attachment content according to your extractor’s new functionality.

Example - Implementing a search text extractor

The following example demonstrates how to create a very basic extractor for .txt files:

Open your web project in Visual Studio.
Create a new class in the App_Code folder (or CMSApp_AppCode -> Old_App_Code on web application projects). For example, name the class CustomSearchTextExtractor.cs.

Add the following using statements to the class:

```
using CMS.Search;
using CMS.Base;
using CMS.Core;
using CMS.Helpers;
using CMS.IO;
using CMS.DataEngine;
```

Make the class implement the ISearchTextExtractor interface.

```
public class CustomSearchTextExtractor : ISearchTextExtractor
```

Define the ExtractContent method:

```
public XmlData ExtractContent(BinaryData data, ExtractionContext context)
{
    if (data.Stream != null)
    {
        // Reads the text file's data stream
        data.Stream.Position = 0;
        string resultString = StreamReader.New(data.Stream).ReadToEnd();

        // Adds text into the CONTENT search index field
        XmlData contentData = new XmlData();
        contentData.SetValue(SearchFieldsConstants.CONTENT, resultString);

        return contentData;
    }

    return null;
}
```

Extend the CMSModuleLoader partial class (outside of the CustomSearchTextExtractor class), and add a custom attribute class derived from CMSLoaderAttribute that registers your extractor for the .txt extension:

```
[RegisterCustomExtractors]
public partial class CMSModuleLoader
{
    private class RegisterCustomExtractorsAttribute : CMSLoaderAttribute
    {
        // The system automatically calls the Init method when the application starts
        public override void Init()
        {
            // Registers the CustomSearchTextExtractor class for the .txt extension
            SearchTextExtractorManager.RegisterExtractor("txt", new CustomSearchTextExtractor());
        }
    }
}
```

Save the CustomSearchTextExtractor.cs file.
Log in to the Kentico administration interface.
Open the Smart search application and Rebuild your page search indexes.

The example is only intended as a demonstration — the system already contains a default extractor for .txt files. You can use the same approach to create extractors for other file types.
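For instance, the following sketch applies the same approach to .srt subtitle files, where only the spoken text is worth indexing. It is an illustration under assumptions, not a tested production extractor. The SubtitleTextExtractor class and its StripSubtitleMarkup helper are hypothetical names. The helper simply drops counter lines and timestamp lines, which is a simplification of the subtitle format.

```csharp
using System;
using System.Linq;

using CMS.Search;
using CMS.Base;
using CMS.Core;
using CMS.Helpers;
using CMS.IO;
using CMS.DataEngine;

// Hypothetical extractor for .srt subtitle files
public class SubtitleTextExtractor : ISearchTextExtractor
{
    public XmlData ExtractContent(BinaryData data, ExtractionContext context)
    {
        if (data.Stream == null)
        {
            return null;
        }

        // Reads the subtitle file's data stream
        data.Stream.Position = 0;
        string raw = StreamReader.New(data.Stream).ReadToEnd();

        // Adds only the spoken text into the CONTENT search index field
        XmlData contentData = new XmlData();
        contentData.SetValue(SearchFieldsConstants.CONTENT, StripSubtitleMarkup(raw));

        return contentData;
    }

    // Hypothetical helper: removes numeric counters and timestamp lines,
    // then joins the remaining subtitle text with single spaces
    public static string StripSubtitleMarkup(string srt)
    {
        var spokenLines = srt
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0
                           && !line.Contains("-->")
                           && !line.All(char.IsDigit));

        return string.Join(" ", spokenLines);
    }
}
```

To use such a class, register it in the same Init method as the .txt extractor, for example with SearchTextExtractorManager.RegisterExtractor("srt", new SubtitleTextExtractor()), and then rebuild your page search indexes.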