Preprocessing

The preprocessing pipeline applies minimal transformations:

  • Unicode normalization (NFKC): puts text into a consistent code-point form so that equivalent characters match during search (see the sketch after this list).
  • Lowercasing for case-insensitive matching: queries match regardless of letter case.
  • Basic normalization for Japanese text: handles characteristics specific to Japanese text.
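
A minimal sketch of the first two steps, assuming a plain string-to-string helper (the function name is illustrative, not the library's API):

```ts
// Illustrative only: NFKC normalization followed by lowercasing.
function preprocess(text: string): string {
  // NFKC unifies compatibility forms, e.g. full-width "Ｓｅａｒｃｈ" -> "Search".
  const normalized = text.normalize("NFKC");
  // Lowercase so matching is case-insensitive.
  return normalized.toLowerCase();
}

preprocess("Ｓｅａｒｃｈ Ｅｎｇｉｎｅ"); // "search engine"
```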

For inverted indexes, stopwords and stemming are not applied, so that no language is favored over another. Additionally, for graphemes composed of multiple code points (e.g., emoji), only the first code point is extracted for indexing.
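
The first-code-point behaviour can be pictured with Intl.Segmenter in grapheme mode; this is a sketch of the behaviour described above, not the library's actual code:

```ts
// Illustrative only: keep just the first code point of each grapheme.
function firstCodePointPerGrapheme(text: string): string[] {
  const graphemes = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const out: string[] = [];
  for (const { segment } of graphemes.segment(text)) {
    const cp = segment.codePointAt(0);
    if (cp !== undefined) out.push(String.fromCodePoint(cp));
  }
  return out;
}

firstCodePointPerGrapheme("👨‍👩‍👧"); // ["👨"] — only the first code point of the ZWJ sequence is kept
```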

Importance of Preprocessing

Preprocessing ensures that the search engine handles varied text forms and languages consistently. Unicode normalization avoids mismatches between different code-point representations of the same character, and lowercasing keeps matching case-insensitive regardless of how a query is typed.
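
For example, a precomposed character and its decomposed equivalent compare as different strings until they are normalized:

```ts
const precomposed = "é";      // U+00E9
const decomposed = "e\u0301"; // "e" + combining acute accent (U+0301)

precomposed === decomposed;                                     // false
precomposed.normalize("NFKC") === decomposed.normalize("NFKC"); // true
```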

Additional Preprocessing Steps for HybridTrieBigramInvertedIndex

For HybridTrieBigramInvertedIndex, additional preprocessing steps are performed:

  • Symbols and punctuation are removed and used as delimiters.
  • Tokenization by whitespace.
  • Classification of text into languages that are hard to segment into words (e.g., Japanese) and those that segment naturally on whitespace (e.g., English).

As a result, standalone symbols cannot be searched, and URLs are tokenized into components. Ideally, a universal language-agnostic tokenization mechanism would be available, but existing solutions remain incomplete. Intl.Segmenter() provides partial support for languages like Japanese and Chinese, and future improvements are anticipated.
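
Putting these steps together, a hedged sketch of the tokenization might look as follows; the helper name, the script-detection regex, and the use of Intl.Segmenter are assumptions for illustration, not the library's actual implementation:

```ts
// Illustrative only: punctuation/symbols as delimiters, whitespace tokenization,
// and Intl.Segmenter as a partial fallback for hard-to-segment scripts.
function tokenize(text: string): string[] {
  // Symbols and punctuation are removed and act as delimiters, so standalone
  // symbols yield no tokens and "https://example.com/path" becomes
  // ["https", "example", "com", "path"].
  const chunks = text.split(/[\p{P}\p{S}\s]+/u).filter((c) => c.length > 0);

  const words = new Intl.Segmenter("ja", { granularity: "word" });
  const tokens: string[] = [];
  for (const chunk of chunks) {
    if (/[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/u.test(chunk)) {
      // Hard-to-segment text (e.g., Japanese): rely on Intl.Segmenter,
      // which, as noted above, offers only partial support.
      for (const seg of words.segment(chunk)) {
        if (seg.isWordLike) tokens.push(seg.segment);
      }
    } else {
      // Whitespace-delimited languages (e.g., English) keep the chunk as a token.
      tokens.push(chunk);
    }
  }
  return tokens;
}
```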