How to Configure SharePoint Document Libraries in Azure AI Search (Preview)

A step-by-step guide on integrating SharePoint with Azure AI Search.

Adding SharePoint Online as a Data Source to AI Search
Procedure
Testing the Implementation

Adding SharePoint Online as a Data Source to AI Search

Azure AI Search now supports indexing document libraries from SharePoint Online, although the feature is still in preview.

Note: There are several important limitations to be aware of:

File extension restrictions apply
SharePoint lists are not supported
ASPX files are excluded, among other limitations

I found the process of adding SharePoint to Azure AI Search somewhat challenging, so I’ve documented the steps here for reference.

Procedure

Here’s an overview of the main steps:

Create a SharePoint site for search indexing
Create an Azure AI Search resource
Register an application in Entra ID
Set up data source, index, and indexer

Detailed instructions below.

Step 0: Creating a SharePoint Site

First, create a SharePoint site that will serve as the search target.

In the document library, I’ve stored Wikipedia information about BLEACH, NARUTO, and ONE PIECE, which was scraped and chunked using Python.

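Python: Extract article content from a Wikipedia URL, split it into chunks, and save it as text

I wanted data for trying out RAG, so I wrote code that takes a Wikipedia URL, splits the content into chunks, and saves them as txt files. Required libraries: pip install langchain langchain_c...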
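For readers who only need test files for the document library, here is a minimal standard-library sketch of the chunk-and-save idea. It is not the code from the linked post (which uses langchain); the function names, the character-based chunk size, and the overlap are all assumptions for illustration.

```python
from pathlib import Path


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def save_chunks(chunks: list[str], out_dir: str, prefix: str = "chunk") -> list[Path]:
    """Write each chunk to its own UTF-8 .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out / f"{prefix}_{i:04d}.txt"
        path.write_text(chunk, encoding="utf-8")
        paths.append(path)
    return paths
```

The resulting .txt files can then be uploaded to the site's document library by hand.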

Step 1: Creating an Azure AI Search Resource

Open the Azure portal and select [Create] for AI Search.

Enter a service name, choose a region, and create the resource. Note: We’re using the Free tier here as this is not for production use.

Leave all other settings at their default values.

Once the resource is created, go to the [Identity] tab of AI Search and enable [System-assigned managed identity]. Note: You don’t need to record the Object ID

Step 2: Register an Application in Entra ID

Next, select Entra ID

Choose [App registrations] and select [New registration]

Since we’re targeting a SharePoint Online site within the same tenant, select “Accounts in this organizational directory only” and register.

After registration, note down the [Application (client) ID] shown on the [Overview] tab (Note 1).

Go to [API permissions] of the registered app and select [Add a permission]

Select [Microsoft Graph]

Under [Application permissions], select “Sites.Read.All” and “Files.Read.All”, then click [Add permissions]

Grant [Admin consent]

In [Authentication], set [Enable the following mobile and desktop flows] to “Yes” and save.

Click [Add a platform] and select [Mobile and desktop applications].

Check the “https://login.microsoftonline~~~” redirect URI option and select [Configure].

Finally, go to [Certificates & secrets] tab and add a [New client secret] with your preferred expiration date

Note down the generated client secret (Note 2)

This completes the Entra ID registration process.
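Before moving on, it can be worth confirming that the Application ID (Note 1) and client secret (Note 2) actually work. The sketch below builds a standard OAuth 2.0 client credentials request against the Microsoft identity platform token endpoint. The tenant ID is an assumption here: the steps above do not record it, so you would need to look it up on the app's [Overview] tab. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build a client credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,          # Note 1
        "client_secret": client_secret,  # Note 2
        "scope": "https://graph.microsoft.com/.default",
    }
    data = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def fetch_token(request: urllib.request.Request) -> str:
    """Send the request; raises urllib.error.HTTPError if the credentials are rejected."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["access_token"]
```

If `fetch_token(build_token_request(...))` returns a token, the registration and secret are usable. It does not by itself prove that admin consent for the Graph permissions was granted.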

Step 3: Azure AI Search: Adding a Data Source

Return to AI Search and select [Add data source (JSON)] from the [Data sources] tab

Configure the following JSON:

{
 "name": "【Your desired data source name (example: sharepoint-datasource)】",
 "type": "sharepoint",
 "credentials": {
    "connectionString":"SharePointOnlineEndpoint=【SPO site URL (up to ~/sites/site-name)】;ApplicationId=【App ID (Note 1)】;ApplicationSecret=【Secret (Note 2)】;"
 }, 
 "container": {
     "name": "【Target document library (details explained later)】" 
 }
}

■type: Specify “sharepoint” for SharePoint data sources
■container/name: Specify one of the following values

defaultSiteLibrary: Indexes all content in the site’s default document library
allSiteLibraries: Indexes all content across all document libraries within the site
useQuery: Only indexes content defined in the “query” parameter

※For more information about queries, please refer to this documentation
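If you would rather script this step than paste JSON into the portal, the following sketch builds the same data source definition and wraps it in a REST request. It assumes the Azure AI Search REST endpoint for data sources and an admin API key; the function names are hypothetical, and the API version is left as a parameter because the SharePoint indexer needs a preview version (for example 2023-10-01-Preview, which you should check against the current documentation).

```python
import json
import urllib.request

# The three container names described above.
VALID_CONTAINERS = {"defaultSiteLibrary", "allSiteLibraries", "useQuery"}


def build_connection_string(site_url: str, app_id: str, app_secret: str) -> str:
    """Assemble the connection string in the same shape as the JSON above."""
    return (
        f"SharePointOnlineEndpoint={site_url};"
        f"ApplicationId={app_id};"
        f"ApplicationSecret={app_secret};"
    )


def build_data_source(name, site_url, app_id, app_secret,
                      container="defaultSiteLibrary", query=None) -> dict:
    """Return the data source definition as a dict ready for json.dumps."""
    if container not in VALID_CONTAINERS:
        raise ValueError(f"unknown container name: {container}")
    if container == "useQuery" and not query:
        raise ValueError("useQuery requires a query string")
    body = {
        "name": name,
        "type": "sharepoint",
        "credentials": {
            "connectionString": build_connection_string(site_url, app_id, app_secret)
        },
        "container": {"name": container},
    }
    if query:
        body["container"]["query"] = query
    return body


def build_put_request(service_name, admin_key, body, api_version) -> urllib.request.Request:
    """Wrap the definition in a PUT request (create or update by name)."""
    url = (f"https://{service_name}.search.windows.net/datasources/"
           f"{body['name']}?api-version={api_version}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` sends it. Nothing is sent until you do that, so the payload can be printed and inspected first.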

Step 4: Creating the Index

Next, define the index (columns and attributes used for searching).
Select [Add index (JSON)] from the [Indexes] section and specify the desired columns.
For this demonstration, we’ll use minimal settings (avoiding vector search for now as embeddings incur additional costs).

{
    "name" : "sharepoint-index",
    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
        { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
        { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
        { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
        { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
    ]
}
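A common reason an index definition gets rejected is a problem with the key field. As a rough pre-flight check, the sketch below looks for the two rules I am sure of (exactly one key field, and that key must be Edm.String) plus duplicate field names. The function name is hypothetical and this is not a full validator; the service applies many more rules.

```python
def find_index_problems(index: dict) -> list[str]:
    """Return human-readable problems found in an index definition."""
    problems = []
    fields = index.get("fields", [])

    # An index needs exactly one key field, and it must be a string.
    keys = [f for f in fields if f.get("key")]
    if len(keys) != 1:
        problems.append(f"expected exactly one key field, found {len(keys)}")
    for f in keys:
        if f.get("type") != "Edm.String":
            problems.append(f"key field '{f.get('name')}' must be Edm.String")

    # Field names must be unique within the index.
    names = [f.get("name") for f in fields]
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"duplicate field name '{name}'")

    return problems
```

For the definition above, `find_index_problems` returns an empty list, since "id" is the only key and is an Edm.String.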

Step 5: Creating the Indexer

Finally, create an indexer. Select [Add indexer (JSON)] from the [Indexers] section and configure the following settings before saving. Note: As this is a sample, we’re not using skillsets (chunking was already done on the Python side).

{
    "name" : "sharepoint-indexer",
    "dataSourceName" : "sharepoint-datasource",
    "targetIndexName" : "sharepoint-index",
    "parameters": {
        "batchSize": null,
        "maxFailedItems": null,
        "maxFailedItemsPerBatch": null,
        "base64EncodeKeys": null,
        "configuration": {
            "indexedFileNameExtensions" : ".txt, .pdf",
            "excludedFileNameExtensions" : ".png, .jpg",
            "dataToExtract": "contentAndMetadata",
            "failOnUnsupportedContentType" : false,
            "failOnUnprocessableDocument" : false
        }
    },
    "fieldMappings" : [
        {
            "sourceFieldName" : "metadata_spo_site_library_item_id",
            "targetFieldName" : "id",
            "mappingFunction" : {
                "name" : "base64Encode"
            }
        }
    ]
}

indexedFileNameExtensions: File extensions to index (in this case, only .txt and .pdf files)
excludedFileNameExtensions: File extensions to exclude from indexing (in this case, .png and .jpg image files)
failOnUnsupportedContentType: When set to false, the indexer skips documents with an unsupported content type instead of stopping
failOnUnprocessableDocument: When set to false, the indexer continues when it encounters a document it cannot process instead of stopping
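To get a feel for which files in a library the two extension settings would let through, here is a small local approximation. It is only a sketch: the function names are hypothetical, and the assumptions that matching is case-insensitive and that an excluded extension wins over an included one are mine, not something the indexer configuration above states.

```python
from pathlib import Path


def parse_extensions(value: str) -> set[str]:
    """Turn a string like '.txt, .pdf' into {'.txt', '.pdf'}."""
    return {ext.strip().lower() for ext in value.split(",") if ext.strip()}


def would_be_indexed(filename: str,
                     indexed: str = ".txt, .pdf",
                     excluded: str = ".png, .jpg") -> bool:
    """Approximate the include/exclude filter for a single file name."""
    ext = Path(filename).suffix.lower()
    # Assumption: exclusion takes priority over inclusion.
    if ext in parse_extensions(excluded):
        return False
    included = parse_extensions(indexed)
    # An empty include list is treated as "no restriction".
    return not included or ext in included
```

With the settings used in this article, a .txt or .pdf file passes the filter, while .png, .jpg, and anything else such as .docx does not.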

Once the indexer is created successfully, it will perform an initial crawl and enable searching.
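To check on the crawl without opening the portal, the indexer status can be fetched over REST. The sketch below assumes the "Get Indexer Status" endpoint and an admin API key; the function names are hypothetical, the API version is a parameter you need to supply, and the summary only reads a few fields of the response (status, lastResult.status, itemsProcessed, itemsFailed).

```python
import json
import urllib.request


def build_status_request(service_name: str, admin_key: str,
                         indexer_name: str, api_version: str) -> urllib.request.Request:
    """Build a GET request for the indexer's current status."""
    url = (f"https://{service_name}.search.windows.net/indexers/"
           f"{indexer_name}/status?api-version={api_version}")
    return urllib.request.Request(url, headers={"api-key": admin_key}, method="GET")


def summarize_status(status: dict) -> str:
    """Condense a status response into a single line."""
    last = status.get("lastResult")
    if not last:
        return f"indexer={status.get('status')} (no completed run yet)"
    return (f"indexer={status.get('status')} last_run={last.get('status')} "
            f"processed={last.get('itemsProcessed', 0)} "
            f"failed={last.get('itemsFailed', 0)}")


def fetch_status(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `summarize_status(fetch_status(build_status_request(...)))` every so often is enough to see when the initial crawl has finished and whether any items failed.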

Testing the Implementation

Testing shows that the search functionality delivers fairly accurate results.
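The same kind of test can be run from a script, which is handy for later RAG experiments. This is a sketch assuming the "Search Documents" REST endpoint with a POST body; the function names are hypothetical, the API version must be supplied, and the selected fields are simply two of the ones defined in the index above.

```python
import json
import urllib.request


def build_search_request(service_name: str, api_key: str, index_name: str,
                         query: str, api_version: str, top: int = 5) -> urllib.request.Request:
    """Build a full-text search request against the index."""
    url = (f"https://{service_name}.search.windows.net/indexes/"
           f"{index_name}/docs/search?api-version={api_version}")
    body = {
        "search": query,
        # Return only the file name and path for each hit.
        "select": "metadata_spo_item_name,metadata_spo_item_path",
        "top": top,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


def run_search(request: urllib.request.Request) -> list[dict]:
    """Send the request and return the list of matching documents."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp).get("value", [])
```

For example, `run_search(build_search_request("my-service", key, "sharepoint-index", "NARUTO", api_version))` would return the top matching chunks, where "my-service", `key`, and `api_version` are placeholders for your own values.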

I plan to use this search functionality to experiment with various RAG (Retrieval-Augmented Generation) implementations.

Official Documentation:

SharePoint Online indexer (preview) - Azure AI Search

Set up a SharePoint Online indexer to automate indexing of document library content in Azure AI Search.