How to Configure SharePoint Document Libraries in Azure AI Search (Preview)

A step-by-step guide on integrating SharePoint with Azure AI Search.


Adding SharePoint Online as a Data Source to AI Search

Azure AI Search now supports indexing document libraries from SharePoint Online, although the feature is still in preview.

Note: There are several important limitations to be aware of:

  • File extension restrictions apply
  • SharePoint lists are not supported
  • ASPX files are excluded, among other limitations

I found the process of adding SharePoint to Azure AI Search somewhat challenging, so I’ve documented the steps here for reference.

Procedure

Here’s an overview of the main steps:

  1. Create a SharePoint site for search indexing (Step 0)
  2. Create an Azure AI Search resource (Step 1)
  3. Register an application in Entra ID (Step 2)
  4. Set up the data source, index, and indexer (Steps 3 to 5)

Detailed instructions below.

Step 0: Creating a SharePoint Site

First, create a SharePoint site that will serve as the search target.

Step 1: Creating an Azure AI Search Resource

Open the Azure portal and select [Create] for AI Search.
Enter a service name and choose a region, then create the resource. Note: We’re using the Free tier here because this is not for production use.
Leave all other settings at their default values.
Once the resource is created, go to the [Identity] tab of AI Search and enable [System-assigned managed identity]. Note: You don’t need to record the Object ID.

Step 2: Register an Application in Entra ID

Next, open Entra ID.
Choose [App registrations] and select [New registration].
Since we’re targeting a SharePoint Online site within the same tenant, select “Accounts in this organizational directory only” and register.
After registration, note down the [Application ID] from the [Overview] tab (Note 1).
Go to [API permissions] of the registered app and select [Add a permission].
Select [Microsoft Graph].
Under [Application permissions], select “Sites.Read.All” and “Files.Read.All”, then click [Add permissions].
Grant [Admin consent].
In [Authentication], set [Enable the following mobile and desktop flows] to “Yes” and save.
Click [Add a platform] and select [Mobile and desktop applications].
Select the “https://login.microsoftonline~~~” option and click [Configure].
Finally, go to the [Certificates & secrets] tab and add a [New client secret] with your preferred expiration date.
Note down the generated client secret value (Note 2).

This completes the Entra ID registration process.
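
Before pasting Note 1 and Note 2 into AI Search, it can help to confirm that the app ID and secret actually work. The sketch below builds a standard OAuth 2.0 client-credentials token request against the Microsoft identity platform. It is only a sanity check and not part of the portal procedure; the tenant ID argument is an assumption, since this walkthrough never records it (it is shown on the app’s [Overview] tab).

```python
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret):
    """Build (without sending) a client-credentials token request for Microsoft Graph.

    tenant_id is an assumption: it is not recorded anywhere in this walkthrough.
    client_id is Note 1 and client_secret is Note 2.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned request to `urllib.request.urlopen` should yield a JSON body containing an `access_token` if the registration is correct; an HTTP 401 usually points to a wrong or expired secret.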

Step 3: Adding a Data Source in Azure AI Search

Return to AI Search and select [Add data source (JSON)] from the [Data sources] tab.
Configure the following JSON:

{
    "name": "【Your desired data source name (example: sharepoint-datasource)】",
    "type": "sharepoint",
    "credentials": {
        "connectionString": "SharePointOnlineEndpoint=【SPO site URL (up to ~/sites/site-name)】;ApplicationId=【App ID (Note 1)】;ApplicationSecret=【Secret (Note 2)】;"
    },
    "container": {
        "name": "【Target document library (explained below)】"
    }
}
■type: Specify “sharepoint” for SharePoint data sources
■container/name: Specify one of the following values

  • defaultSiteLibrary: Indexes all content in the site’s default document library
  • allSiteLibraries: Indexes all content across all document libraries within the site
  • useQuery: Only indexes content defined in the “query” parameter

※For more information about queries, refer to the official Azure AI Search documentation for the SharePoint Online indexer.
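
Because the connection string is easy to mistype, here is a minimal Python sketch that assembles the same data source JSON from its parts. The function name and argument names are my own (hypothetical); the output mirrors the JSON shown above. The rule that “useQuery” needs a “query” value follows from the description of that option, and the query syntax itself is not validated here.

```python
import json

# The three container names described above.
CONTAINER_NAMES = {"defaultSiteLibrary", "allSiteLibraries", "useQuery"}


def build_datasource(name, site_url, app_id, app_secret,
                     container="defaultSiteLibrary", query=None):
    """Return the data source definition as a dict (hypothetical helper)."""
    if container not in CONTAINER_NAMES:
        raise ValueError(f"unknown container name: {container}")
    container_def = {"name": container}
    if container == "useQuery":
        if not query:
            raise ValueError("useQuery requires a query string")
        container_def["query"] = query
    connection_string = (
        f"SharePointOnlineEndpoint={site_url};"
        f"ApplicationId={app_id};"        # Note 1
        f"ApplicationSecret={app_secret};"  # Note 2
    )
    return {
        "name": name,
        "type": "sharepoint",
        "credentials": {"connectionString": connection_string},
        "container": container_def,
    }


if __name__ == "__main__":
    # Placeholder values only; replace with your own site URL, Note 1, and Note 2.
    definition = build_datasource(
        "sharepoint-datasource",
        "https://example.sharepoint.com/sites/site-name",
        "<application-id>",
        "<client-secret>",
    )
    print(json.dumps(definition, indent=4))
```

The printed JSON can be pasted into [Add data source (JSON)] as-is.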

Step 4: Creating the Index

Next, define the index (columns and attributes used for searching).
Select [Add index (JSON)] from the [Indexes] section and specify the desired columns.
For this demonstration, we’ll use minimal settings (avoiding vector search for now as embeddings incur additional costs).

{
    "name" : "sharepoint-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
    ]
}
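
If you would rather not paste JSON into the portal, the same definition can be sent to the Azure AI Search REST API with a PUT request. The following is a sketch, not what I actually did for this article: the function name is hypothetical, and the `api_version` value is deliberately left for you to supply, because the SharePoint data source is a preview feature and needs a preview API version that you should look up in the official documentation.

```python
import json
import urllib.parse
import urllib.request


def build_put_request(service_name, collection, definition, admin_key, api_version):
    """Build (without sending) a PUT request that creates or updates one object.

    collection is "datasources", "indexes", or "indexers"; definition is the
    same JSON you would paste into the portal, loaded as a dict.
    """
    name = urllib.parse.quote(definition["name"], safe="")
    url = (f"https://{service_name}.search.windows.net/"
           f"{collection}/{name}?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json", "api-key": admin_key},
    )
```

For example, `urllib.request.urlopen(build_put_request("my-service", "indexes", index_definition, admin_key, api_version))` would create the index, where `index_definition` is the JSON above loaded with `json.loads`. The admin key is listed under [Keys] on the AI Search resource.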

Step 5: Creating the Indexer

Finally, create an indexer. Select [Add indexer (JSON)] from the [Indexers] section and configure the following settings before saving. Note: As this is a sample, we’re not using skillsets (chunking was already done on the Python side).

{
    "name" : "sharepoint-indexer",
    "dataSourceName" : "sharepoint-datasource",
    "targetIndexName" : "sharepoint-index",
    "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
            "indexedFileNameExtensions" : ".txt, .pdf",
            "excludedFileNameExtensions" : ".png, .jpg",
            "dataToExtract": "contentAndMetadata",
            "failOnUnsupportedContentType" : false,
            "failOnUnprocessableDocument" : false
        }
    },
    "fieldMappings" : [
        {
            "sourceFieldName" : "metadata_spo_site_library_item_id",
            "targetFieldName" : "id",
            "mappingFunction" : {
                "name" : "base64Encode"
            }
        }
    ]
}
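
The fieldMappings entry in the indexer definition above runs the SharePoint item ID through base64Encode before using it as the document key. Document keys in Azure AI Search may only contain letters, digits, underscores, dashes, and equal signs, so presumably the raw ID can contain characters outside that set. The sketch below illustrates the idea with Python’s URL-safe Base64; the service’s exact handling of padding may differ, so treat the output as an approximation rather than the literal key the indexer produces.

```python
import base64
import re

# Characters allowed in an Azure AI Search document key.
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")


def encode_key(raw_id):
    """URL-safe Base64 of the raw ID (approximation of base64Encode)."""
    return base64.urlsafe_b64encode(raw_id.encode("utf-8")).decode("ascii")


def is_valid_key(key):
    """True if the key only uses characters a document key may contain."""
    return bool(VALID_KEY.match(key))
```

A raw value containing characters such as “>” or “?” fails `is_valid_key`, while `encode_key` of that same value passes, which is why the mapping function is applied to the key field.
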
  • indexedFileNameExtensions: File extensions to index (here, only .txt and .pdf files)
  • excludedFileNameExtensions: File extensions to exclude from indexing (here, .png and .jpg image files)
  • failOnUnsupportedContentType: When set to false, the indexer skips documents with an unsupported content type instead of stopping
  • failOnUnprocessableDocument: When set to false, the indexer skips documents it cannot process instead of stopping
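
The two extension settings above can be read as a filter applied to each file name. The sketch below is my own model of that filter, not the service’s code. It assumes the inclusion list is checked first and the exclusion list second, which is how the behavior is documented for other Azure AI Search indexers; I have not verified this for the SharePoint preview.

```python
import os


def parse_extensions(csv):
    """Split a setting such as ".txt, .pdf" into a set of lowercase extensions."""
    return {ext.strip().lower() for ext in csv.split(",") if ext.strip()}


def would_index(file_name, indexed=".txt, .pdf", excluded=".png, .jpg"):
    """Return True if the file passes both extension settings (assumed order)."""
    ext = os.path.splitext(file_name)[1].lower()
    included = parse_extensions(indexed)
    # An empty inclusion list is treated as "no restriction".
    if included and ext not in included:
        return False
    return ext not in parse_extensions(excluded)
```

With the settings used here, a .pdf or .txt file passes, while a .docx file is skipped because it is not in the inclusion list, even though it is not explicitly excluded.
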
Once the indexer is created successfully, it performs an initial crawl, after which the content becomes searchable.
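
To see whether the initial crawl has finished without refreshing the portal, you can poll the indexer status endpoint of the REST API. This is a minimal sketch: the function names are hypothetical, `api_version` is again left for you to supply, and only the `lastResult` fields I rely on are read from the response.

```python
import urllib.request


def build_status_request(service_name, indexer_name, admin_key, api_version):
    """Build (without sending) a GET request for the indexer's status."""
    url = (f"https://{service_name}.search.windows.net/"
           f"indexers/{indexer_name}/status?api-version={api_version}")
    return urllib.request.Request(url, method="GET", headers={"api-key": admin_key})


def summarize_status(payload):
    """Condense the parsed status JSON into one line."""
    last = payload.get("lastResult") or {}
    return (f"{last.get('status', 'unknown')}: "
            f"{last.get('itemsProcessed', 0)} processed, "
            f"{last.get('itemsFailed', 0)} failed")
```

Send the request with `urllib.request.urlopen`, parse the body with `json.load`, and pass the result to `summarize_status`. A non-zero failed count is a hint to check the extension and fail-on settings above.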

Testing the Implementation

Testing shows that the search functionality delivers fairly accurate results.
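
For reference, a query like the ones I tried in the portal can also be issued from Python through the Search Documents REST endpoint. The sketch below only builds the request; the function name is hypothetical, the field names come from the index defined in Step 4, and `api_version` is left for you to supply.

```python
import json
import urllib.request


def build_search_request(service_name, index_name, query_key, api_version, text, top=5):
    """Build (without sending) a full-text search request against the index."""
    url = (f"https://{service_name}.search.windows.net/"
           f"indexes/{index_name}/docs/search?api-version={api_version}")
    body = {
        "search": text,
        # Return only the file name and path for each hit.
        "select": "metadata_spo_item_name,metadata_spo_item_path",
        "top": top,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "api-key": query_key},
    )
```

The matching documents come back in the `value` array of the response, each with an `@search.score` alongside the selected fields.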

I plan to use this search functionality to experiment with various RAG (Retrieval-Augmented Generation) implementations.
