MongoDB Indexes Cheat Sheet

Codylillyw · 9 min read · Dec 16, 2024


Index vs Search Index

This post was written with the help of ChatGPT, an AI language model developed by OpenAI.

Standard Index

Regular indexes (B-tree indexes, hashed indexes, etc.) improve the performance of common database queries that rely on exact matches or range queries.

Use Case:

  • Equality queries (e.g., db.collection.find({ field: value }))
  • Range queries (e.g., db.collection.find({ field: { $gte: value } }))
  • Sorting (e.g., db.collection.find().sort({ field: 1 }))
  • Aggregation pipelines (when performing operations like $match or $sort on indexed fields)

Using explain() you can see whether a suitable index is used for a given query. An index should exist for every query the backend performs.
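
For example, a quick check in mongosh (the collection and field names here are just placeholders):

db.orders.find({ customerId: "abc123" }).explain("executionStats")
// An IXSCAN stage in the winning plan means an index was used;
// a COLLSCAN means the whole collection was scanned.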

In the Codebase: You can define regular indexes directly in your codebase through MongoDB’s drivers using methods like createIndex() (e.g., in JavaScript with the MongoDB Node.js driver).
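
A minimal sketch with the MongoDB Node.js driver (the connection string, database, collection, and field names are placeholders):

const { MongoClient } = require("mongodb");

async function ensureIndexes() {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();
  const orders = client.db("shop").collection("orders");

  // Single-field index for equality and range queries on customerId
  await orders.createIndex({ customerId: 1 });

  // Compound index supporting a filter on status plus a sort on createdAt
  await orders.createIndex({ status: 1, createdAt: -1 });

  await client.close();
}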

In MongoDB Atlas: Regular indexes can also be created through the Atlas UI or the Atlas API by navigating to the Indexes tab of your collection and defining the index.

Search Index

Search indexes enable more sophisticated search capabilities than the built-in text indexes, including advanced full-text search with relevance scoring, autocomplete, and fuzzy matching. They are used with the $search stage in an aggregation pipeline.
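
A sketch of a $search query (run in mongosh; the index name "default", the collection, and the field paths are assumptions):

db.movies.aggregate([
  {
    $search: {
      index: "default",
      text: {
        query: "space adventure",
        path: "title",
        fuzzy: { maxEdits: 1 }
      }
    }
  },
  { $limit: 5 },
  // Include the relevance score alongside each result
  { $project: { title: 1, score: { $meta: "searchScore" } } }
])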

Where to find a search index: In Atlas, select the collection you want to inspect and open the Search Indexes tab.

How to see the history of queries and their latency: go to Query Insights -> Query Profiler, choose a time window, and check the queries with the highest latency there.

Search Analyzers

Both analyzer and searchAnalyzer typically use the same set of options because they rely on the same set of underlying text-processing tools.

While it’s common to use the same analyzer for both indexing and querying, there are scenarios where different analyzers are advantageous.

An analyzer is composed of:

  • 0+ character filters (e.g., remove punctuation)
  • exactly 1 tokenizer
  • 0+ token filters (e.g., remove common English words)

You can add a custom analyzer in your search index:
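
For example, a sketch of a search index definition that declares a custom analyzer (the analyzer name and the specific tokenizer and filter choices are illustrative):

{
  "mappings": { "dynamic": true },
  "analyzers": [
    {
      "name": "myCustomAnalyzer",
      "charFilters": [],
      "tokenizer": { "type": "whitespace" },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ]
}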

Tokenizers and Token Filters

Tokenizers are responsible for breaking the text into individual tokens (words or terms) that can be indexed and searched.

  1. standard - The default tokenizer, which splits the text into tokens based on word boundaries, removing punctuation.
  2. whitespace - Splits the text into tokens based on whitespace, preserving punctuation as part of the token.
  3. keyword - Treats the entire input string as a single token without breaking it up.
  4. letter - Tokenizes based on alphabetic characters only, ignoring digits and punctuation.
  5. path - Used for splitting paths, typically useful for fields that represent paths or URLs.
  6. edgeNGram - Tokenizes by breaking text into tokens of increasing length from the start, useful for prefix-based searches.
  7. ngram - Tokenizes by breaking the text into fixed-length substrings (ngrams), which helps in fuzzy matching or partial word search.
  8. delimited - Splits the text into tokens based on delimiters such as punctuation marks or other custom separators.
  9. custom - Allows for custom tokenization based on a user-defined set of rules.

When creating a custom token filter in MongoDB Atlas Search, you can use several keys to control how tokens are modified after they are created by the tokenizer. These filters apply operations such as removing stopwords, stemming words, or normalizing tokens.

Keys for Custom Token Filters

type:

Specifies the type of token filter.

Common types include lowercase, stopwords, edgeNGram, synonym, stemmer, length, patternReplace, and others.

Example: "type": "lowercase", "type": "stopwords".

stopwords (for stopwords type):

A list of words to be removed from the token stream.

Example: "stopwords": ["the", "and", "or"].

language (for language-specific filters):

Specifies the language to use for filters like stemming or stopword removal.

Example: "language": "english" (for language-specific token filters like stemming).

minGram (for edgeNGram type):

Defines the minimum length of tokens to be generated by an edgeNGram tokenizer.

Example: "minGram": 2 (to generate tokens starting with at least two characters).

maxGram (for edgeNGram type):

Defines the maximum length of tokens to be generated by an edgeNGram tokenizer.

Example: "maxGram": 5 (to generate tokens with a maximum length of five characters).

pattern (for patternReplace type):

A regular expression pattern used to match and replace parts of tokens.

Example: "pattern": "[^a-zA-Z]", "replacement": " " (replaces non-alphabetic characters with spaces).

replacement (for patternReplace type):

The string to replace matched tokens in the patternReplace filter.

Example: "replacement": "-".

stemmer (for stemmer type):

Specifies the language to use for stemming.

Example: "stemmer": "english" (for stemming English words).

synonyms (for synonym type):

A list of synonym rules, where a word can be replaced by another word or set of words.

Example: "synonyms": [["quick", "fast"], ["jumps", "leaps"]].

length (for length type):

Specifies the minimum and maximum token lengths to include.

Example: "min": 3, "max": 10 (tokens must be between 3 and 10 characters long).

caseInsensitive (for synonym or stopwords types):

A flag that specifies whether the filter should be case-insensitive.

Example: "caseInsensitive": true.

preserveOriginal (for edgeNGram and other filters):

A flag that specifies whether the original token should be preserved alongside the filtered version.

Example: "preserveOriginal": true.

Example of a Custom Token Filter Definition:

{
  "tokenFilters": [
    {
      "type": "stopwords",
      "stopwords": ["the", "and", "or"]
    },
    {
      "type": "patternReplace",
      "pattern": "[^a-zA-Z]",
      "replacement": " "
    },
    {
      "type": "synonym",
      "synonyms": [["quick", "fast"], ["jumps", "leaps"]]
    }
  ]
}

In this example:

  • stopwords removes common words like "the", "and", and "or".
  • patternReplace replaces non-alphabetic characters with spaces.
  • synonym replaces "quick" with "fast" and "jumps" with "leaps".

Character Filters

Character filters modify or transform characters before the text is tokenized. They are generally used to handle things like case normalization, stemming, and stop words.

  1. lowercase - Converts all characters to lowercase before tokenization, ensuring that searches are case-insensitive.
  2. uppercase - Converts all characters to uppercase.
  3. asciifolding - Converts characters like accented characters (e.g., é to e) to their ASCII equivalents.
  4. stopwords - Removes common stopwords (like "and", "the", etc.) from the text before tokenization. This helps to reduce noise in searches.
  5. stemming - Reduces words to their root form (e.g., "running" to "run") to ensure that variations of a word are treated the same.
  6. synonyms - Allows the inclusion of synonyms to ensure that searches for one word can also match related terms.
  7. custom - Custom character filters can be applied for more specialized transformations.

Keys for Custom Character Filters

type:

Specifies the type of character filter.

Example: "type": "asciiFolding", "type": "patternReplace", etc.

pattern (for patternReplace type):

A regular expression pattern used to match characters to be replaced.

Example: "pattern": "[^a-zA-Z0-9]" (matches non-alphanumeric characters).

replacement (for patternReplace type):

The string to replace the matched characters.

Example: "replacement": " " (replaces matched characters with a space).

stopwords (for stopwords type):

A list of stopwords to remove from the input text.

Example: "stopwords": ["the", "and", "or"].

language (for stopwords and other language-specific filters):

The language to use for specific filters like stopwords or stemming.

Example: "language": "english" (used for stopwords, stemming, etc.).

replace (for asciifolding or patternReplace):

Defines the replacement behavior, often used for ASCII folding or pattern replacement operations.

preserveOriginal (for some character filters):

A flag that indicates whether the original text should be preserved alongside the modified version.

Example: "preserveOriginal": true.

Example Custom Analyzer

{
  "analyzers": [
    {
      "name": "lowercaser",
      "charFilters": [],
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ]
}
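
To actually use this analyzer, reference it by name from a field mapping in the same index definition. A sketch (the title field and the dynamic setting are assumptions):

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": { "type": "string", "analyzer": "lowercaser" }
    }
  },
  "analyzers": [
    {
      "name": "lowercaser",
      "charFilters": [],
      "tokenizer": { "type": "keyword" },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ]
}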

Analyzers that come with MongoDB:

Lucene.Standard

  • Description: This is the default analyzer. It tokenizes text based on Unicode text segmentation rules, removes most punctuation, and lowercases terms.
  • Strengths: Suitable for general-purpose search. It handles multi-language input and removes unnecessary symbols, ensuring consistent tokenization.
  • Example: Searching “Hello, World!” produces tokens hello and world.
  • Best Use Cases: General search functionality for multi-language datasets, simple document search without specific linguistic rules.

Lucene.English

  • Description: Designed for English text, it performs stemming (reducing words to their root forms) and removes stop words (common words like “and,” “the,” etc.).
  • Strengths: Enhances search relevance for English text by normalizing words and ignoring non-informative terms.
  • Example: Searching “running quickly” produces tokens run and quick.
  • Best Use Cases: English-only datasets, where precision and relevance are critical, such as document repositories, blogs, or e-commerce.

Lucene.Simple

  • Description: Tokenizes text by splitting on non-letter characters and lowercases all tokens.
  • Strengths: A basic analyzer with no language-specific processing. It ignores numbers and special characters.
  • Example: Searching “123 Apples & Bananas” produces tokens apples and bananas.
  • Best Use Cases: Simplistic search requirements without the need for language rules or punctuation sensitivity.

Lucene.Whitespace

  • Description: Splits text only on whitespace and does not modify the text (e.g., no lowercasing, no stemming).
  • Strengths: Preserves case sensitivity and punctuation, making it useful for structured data or technical content.
  • Example: Searching “MongoDB Atlas Search” produces tokens MongoDB, Atlas, and Search.
  • Best Use Cases: Search for technical documentation, programming code, or datasets where case and special characters are meaningful.

Lucene.Keyword

  • Description: Does not tokenize the text; treats the entire input as a single token.
  • Strengths: Maintains the exact input, useful for fields requiring exact matches.
  • Example: Searching “User-Agent: Mozilla/5.0” keeps the entire string as a single token.
  • Best Use Cases: Metadata searches, IDs, URLs, or any structured data requiring exact match searches.

Lucene.French — Language specific

  • Description: Similar to Lucene.English, but tailored for the French language. It removes stop words and applies French-specific stemming rules.
  • Strengths: Improves search relevance by normalizing French words, taking into account unique grammatical rules.
  • Example: Searching “chats courants” produces tokens chat and courant.
  • Best Use Cases: French-only datasets, such as news articles, product catalogs, or academic papers.

Making a Search Index

minGram: This defines the minimum size of n-grams for n-gram tokenization. It’s particularly useful for partial matching. For example, if minGram is set to 2, terms are broken into pieces of at least two characters.
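
As a sketch, here is an edgeGram tokenizer inside a custom analyzer with minGram and maxGram set (the analyzer name and the values are illustrative):

{
  "analyzers": [
    {
      "name": "prefixAnalyzer",
      "charFilters": [],
      "tokenizer": { "type": "edgeGram", "minGram": 2, "maxGram": 7 },
      "tokenFilters": [{ "type": "lowercase" }]
    }
  ]
}

With these settings, a term like “mongodb” would be indexed as mo, mon, mong, mongo, mongod, and mongodb, which is what makes prefix and partial matching work.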

type: Specifies the type of field or mapping in the index.

  • document - The default type for documents in a collection, where the entire document is indexed.
  • string - For fields that hold string values, such as names, titles, etc.
  • number - For fields that hold numeric values, such as prices, quantities, etc.
  • boolean - For fields that hold boolean values (true or false).
  • date - For fields that hold date values, such as timestamps.
  • object - For fields that hold sub-documents or nested fields.
  • array - For fields that hold arrays of values (strings, numbers, etc.).
  • geo - For geographic coordinates, typically used in geospatial queries.
  • text - For fields that are meant for full-text search, where text analysis and tokenization can be applied.
  • keyword - For fields where exact matches are needed, often used for tags, IDs, or other non-analyzed fields.

tokenization: Defines how text is broken into smaller units (tokens). Tokenizers can be standard, edge n-gram, or custom, each tailored to specific use cases like word-based or partial matching.

  • keyword: Treats the entire input as a single token.
  • whitespace: Splits input text into tokens based on whitespace.
  • standard: Uses the Lucene Standard Tokenizer, which splits text based on language-specific rules.
  • edgeGram: Generates tokens of increasing lengths, starting from the beginning of a word.
  • nGram: Generates tokens of specified lengths (n-grams) from the input text.
  • simple: Splits text into tokens by non-letter characters.
  • letter: Splits text into tokens by non-letter characters, but removes all non-letter characters from the output.
  • classic: A legacy Lucene tokenizer that splits text similarly to standard but handles special cases like email addresses differently.
  • uaxUrlEmail: Tokenizes text and treats URLs and email addresses as single tokens.

fields: This parameter is used in static mappings to specify which fields are included in the index. You can define static mappings for individual fields or use dynamic mappings to index all fields automatically.

Analyzer: Determines how text is processed during indexing and querying. You can use built-in or custom analyzers.

Search Analyzer: Controls how the query text is processed at query time; it can differ from the analyzer used at index time.

Store Option: If enabled, stores a copy of indexed data, which can be retrieved without fetching the original document.

Dynamic or Static Mappings: Configures whether all fields (dynamic) or selected fields (static) are indexed.

Highlighting: Configures fields for highlighting query matches.

Score Modifiers: Adjusts scoring of results for relevance tuning.
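
Putting several of these options together, a sketch of a static search index definition (the field names, analyzer choices, and the storedSource block corresponding to the Store Option above are assumptions):

{
  "analyzer": "lucene.standard",
  "searchAnalyzer": "lucene.standard",
  "mappings": {
    "dynamic": false,
    "fields": {
      "title": { "type": "string", "analyzer": "lucene.english" },
      "price": { "type": "number" },
      "createdAt": { "type": "date" }
    }
  },
  "storedSource": { "include": ["title"] }
}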

Considerations when creating a search index:

  • Understand Your Data
    Identify the fields you want to make searchable, including full-text fields, numeric fields, dates, and arrays.
  • Define Mappings
    Choose appropriate types (string, number, date, document, etc.).
  • Choose the Right Analyzer
    Use lucene.standard for general use.
    Use lucene.english for English text with stemming and stopword removal.
    Use lucene.keyword for exact matches.
  • Optimize Tokenization
  • Handle Fields and Nested Documents
  • Set Weights for Fields using score
  • Test Your Index using common use cases
  • Monitor and Refine
    Use the $indexStats aggregation stage to analyze how the collection’s indexes are used (see the example below)
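
For example, to check usage counters for the regular indexes on a collection (run in mongosh; the collection name is a placeholder):

db.orders.aggregate([{ $indexStats: {} }])
// Returns one document per index, with an "accesses.ops" counter
// showing how many operations have used that index.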

Watch this video for more information on search indexes: https://www.youtube.com/watch?v=2oJuXx6mceE
