Splitter - semantic

Overview

Semantic splitter is an implementation of the Document Transformer interface that splits long documents based on semantic similarity. It follows Eino: Document Transformer Guide.

How It Works

First split the document into initial fragments using basic separators (newline, period, etc.)
Generate an embedding vector for each fragment
Compute cosine similarity between adjacent fragments
Decide split points by a similarity threshold percentile
Merge fragments smaller than the minimum size

Usage

Initialization

Initialize via NewSplitter with configuration:

splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
    Embedding:    embedder,                        // required: embedder to generate vectors
    BufferSize:   2,                               // optional: context buffer size
    MinChunkSize: 100,                             // optional: minimum chunk size
    Separators:   []string{"\n", ".", "?", "!"}, // optional: separator list
    Percentile:   0.9,                             // optional: split threshold percentile
    LenFunc:      nil,                             // optional: custom length func
})

Parameters:

Embedding: required embedder instance
BufferSize: include more context for similarity computation
MinChunkSize: merge fragments smaller than this size
Separators: ordered list used for initial split
Percentile: 0–1; higher means fewer splits
LenFunc: custom length function, default len()

Complete Example

package main

import (
    "context"

    "github.com/cloudwego/eino-ext/components/document/transformer/splitter/semantic"
    "github.com/cloudwego/eino/components/embedding"
    "github.com/cloudwego/eino/schema"
)

func main() {
    ctx := context.Background()

    embedder := &embedding.SomeEmbeddingImpl{} // eg: openai embedding

    splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
        Embedding:    embedder,
        BufferSize:   2,
        MinChunkSize: 100,
        Separators:   []string{"\n", ".", "?", "!"},
        Percentile:   0.9,
    })
    if err != nil { panic(err) }

    docs := []*schema.Document{{
        ID: "doc1",
        Content: `This is the first paragraph with important info.
This is the second paragraph, semantically related to the first.
This is the third paragraph, the topic has changed.
This is the fourth paragraph, continuing the new topic.`,
    }}

    results, err := splitter.Transform(ctx, docs)
    if err != nil { panic(err) }
    for i, doc := range results { println("fragment", i+1, ":", doc.Content) }
}

Advanced Usage

Custom length function:

splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
    Embedding: embedder,
    LenFunc: func(s string) int { return len([]rune(s)) }, // unicode length
})

Adjust granularity:

splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
    Embedding:    embedder,
    Percentile:   0.95,  // fewer split points
    MinChunkSize: 200,    // avoid too-small fragments
})

Optimize semantic judgment:

splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
    Embedding:  embedder,
    BufferSize: 10,                              // more context
    Separators: []string{"\n\n", "\n", "。", "！", "？", "，"}, // custom priority
})

References

Feedback

Was this page helpful?

Please tell us how we can improve.

Last modified December 11, 2025 : feat(eino): sync zh documents (#1474) (958594401a)