There's Coffee In That Nebula. Part 3: Preparing the knowledge

Written by

Mariano Cigliano

Published on

January 26, 2024

TL;DR

Continue the journey into the innovative world of Mobegí, our AI-powered Slack bot designed to streamline office queries. In the third chapter of the series, we're exploring its data ingestion pipeline, a crucial aspect of its architecture that ensures overall effectiveness and reliability. Continue reading for the detailed overview.

Author

Mariano Cigliano

R&D Tech Leader

In the third chapter of our series, we delve deeper into the intricate world of Mobegí – our groundbreaking AI-powered conversational assistant designed to enhance company knowledge accessibility. You'll learn about the sophisticated mechanisms behind Mobegí's data ingestion pipeline, a crucial aspect of its architecture that ensures overall effectiveness and reliability.

We'll explain the complexities of data collection, transformation, and anonymization, highlighting the challenges and solutions encountered. Additionally, we'll discuss the sophisticated techniques used for entity recognition, optimizing vector embeddings, and determining the ideal chunk sizes for data processing.

Keep reading to discover details of the meticulous processes that empower Mobegí.

The pipeline

Our pipeline is composed of two sub-pipelines:

World – transforms raw data into versioned world data
Knowledge – transforms versioned world data into versioned knowledge data

World

The world ingestion pipeline is a sequence of steps that allows us to:

Gather raw data
Process it
Store a snapshot of the result

It involves components like loaders, transformers, and ingesters. All of them belong to the knowledge module.

Loading data from Confluence

The PAT Token quest

After a first attempt at dumping all Confluence data and then parsing it, we realized API access was the better option. It required a PAT token, and getting one turned into an interesting moment that may resonate with many developers.

You’re in uncharted waters, dealing with a non-deterministic system with uncertain literature about how to make it work. You think that’s the difficult bit. Then, one day, you need a token to access an API – how hard could that be?

It turns out that an API token and a PAT token are two different concepts, and the documentation is unclear about where to create the PAT one. Despite that, you think you have found the right place, except it is listed as an "API Token," and it doesn’t work.

Eventually, a couple of hours and a few coffees later, you find out the token is right, but you need to compose it with your email and encode it to base64.
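For readers facing the same quest, here is a minimal sketch of that last step. It assumes the standard HTTP Basic scheme, where the email and the token are joined with a colon before encoding; the helper name and the sample credentials are made up for illustration.

```python
import base64


def build_basic_auth_header(email, api_token):
    """Build an HTTP Basic Authorization header from an email and a token."""
    # Assumption: standard Basic scheme, i.e. base64("email:token").
    credentials = f"{email}:{api_token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Hypothetical credentials, for illustration only.
headers = build_basic_auth_header("dev@example.com", "my-token")
```

The resulting dictionary can be passed as request headers by whichever HTTP client you use.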

Of course, this is not about Atlassian or a specific product; it’s more about the developer's life and why we deserve empathy.

Parsing

LangChain has a component for that, and we also investigated frameworks such as unstructured.io, but we needed more control over which metadata would decorate the chunk of text in the vectorstore, so eventually, we wrote our own.

When it comes to parsing a page, the Confluence API gives you several representation options; the most common are:

view – renders the page content in HTML format, including macros, markup, etc. This is the default.
export_view – renders to HTML without any Confluence macros or markup. Plain HTML content.
storage – gives the storage format, the XHTML-based markup Confluence uses internally to store the content.

It took a bit of trial and error because there is no perfect choice. Eventually, ‘view’ was the representation that worked best for us. We wanted to get as close as possible to pure text while keeping the links that would be useful to include in the responses.
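The idea of getting close to pure text while keeping links can be sketched with Python's standard `html.parser`. This is not our actual parser: the class name, the set of block tags, and the "label (url)" rendering are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that should end a line of text.
BLOCK_TAGS = {"p", "li", "div", "h1", "h2", "h3", "h4", "tr"}


class TextWithLinks(HTMLParser):
    """Collect visible text and render <a href> as 'label (url)'."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href:
                self.parts.append(f" ({self._href})")
            self._href = None
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextWithLinks()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


sample = '<h1>Office</h1><p>See the <a href="https://example.com/wifi">Wi-Fi guide</a>.</p>'
text = html_to_text(sample)
```

Keeping the URL next to its label means a chunk of text still carries the link when it ends up in a response.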

We used a particular page's content to gather all people's roles in the company — the Team page. This was not an ideal choice for several reasons. Most importantly, it was unreliable over time; any changes to the Confluence documentation could alter the page's content. As a result, the roles list might no longer remain centralized on that one page.

We have plans to address this issue on our roadmap. That said, it was a safe choice, given the context. The implementation worked nicely, and there were no immediate plans to alter the Confluence documentation. As a small team, we were also focused on addressing anonymization, which was the top priority at the time.

Sometimes attention to detail simply means making a note that something is not quite how you know it should be and revisiting it after getting to a functional, testable feature. The important aspect was keeping the code debt low. All the components related to people’s data gathering and processing are readable, maintainable, and decoupled, letting us seamlessly switch solutions later.

Transforming

In the previous step, we gathered 150 pages and 402 people. Before storing them, we want to be sure the data is as clean as possible.

We mark a page as redundant if:

Its content is empty (this happens easily with index pages)
Its content is included in another page

We don’t store people who are not mentioned on any page.
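As a rough sketch of these cleaning rules, the snippet below treats "included in another page" as a plain substring check and "mentioned" as an exact full-name match. Both are simplifying assumptions, and the `Page` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    content: str


def is_redundant(index, pages):
    """A page is redundant if it is empty or contained in another page."""
    body = pages[index].content.strip()
    if not body:
        return True  # empty content, e.g. an index page
    for other_index, other in enumerate(pages):
        other_body = other.content.strip()
        if other_index == index or body not in other_body:
            continue
        # Contained in a longer page, or an exact duplicate of an earlier one.
        if len(other_body) > len(body) or other_index < index:
            return True
    return False


def clean(pages, people):
    kept_pages = [p for i, p in enumerate(pages) if not is_redundant(i, pages)]
    kept_people = [name for name in people
                   if any(name in p.content for p in kept_pages)]
    return kept_pages, kept_people


pages = [
    Page("Index", ""),
    Page("Wi-Fi", "The guest network is open."),
    Page("Office", "Welcome. The guest network is open. Ask Ada Lovelace."),
]
kept_pages, kept_people = clean(pages, ["Ada Lovelace", "Alan Turing"])
```

In this toy run only the "Office" page survives, and only the person mentioned on it is kept.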

When it comes to mentions, we discovered that people are sometimes referred to by first name only, but only after they have already been mentioned by their full name. When that happens, we replace every occurrence of the first name with the full name, mainly for anonymization purposes.
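A minimal sketch of that replacement, assuming the first name is simply the first word of the full name (the function name is hypothetical):

```python
import re


def expand_first_names(text, full_name):
    """Replace first-name-only mentions that follow a full-name mention."""
    first_name = full_name.split()[0]
    start = text.find(full_name)
    if start == -1:
        return text  # never introduced by full name: leave the text untouched
    head_end = start + len(full_name)
    head, tail = text[:head_end], text[head_end:]
    # Match the first name only when it is not already followed by
    # the remainder of the full name.
    rest = full_name[len(first_name):]
    pattern = re.compile(rf"\b{re.escape(first_name)}\b(?!{re.escape(rest)})")
    return head + pattern.sub(full_name, tail)


text = ("Ask Ada Lovelace about badges. Ada is in on Mondays. "
        "Ada Lovelace also runs the tour.")
expanded = expand_first_names(text, "Ada Lovelace")
```

After this step every mention carries the full name, so a single entry in the anonymization map covers all of them.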

At this stage, we also enrich the people collection by:

Adding their role
Caching the paragraphs of the pages where they are mentioned with metadata

After transforming, we ended up with 147 pages and 89 people – clean and solid data we can ingest into our knowledge collections.

Preparing anonymization map

Once the data is ready, we use a LangChain wrapper around Microsoft Presidio to prepare the anonymization map. We don’t anonymize data at this stage; we just analyze it and prepare the map to be stored.

Presidio utilizes spaCy underneath to power aspects of its entity analysis and data obfuscation capabilities.

By default, Presidio:

Supports the English language for its classifiers
Downloads a large natural language processing model to enable this analysis

It can identify and categorize entities such as persons, emails, URLs, and organizations within the input text. It generates mappings of these potentially sensitive values to placeholders or hashes in order to anonymize data. However, as Microsoft states in their own Presidio documentation, "Every PII identification logic has its errors, requiring a trade-off between false positives (falsely detected text) and false negatives (undetected PII entities)."

We found that analyzing data in multiple passes significantly improved accuracy over a single-pass approach that attempted to classify all data types at once. With a single pass, around 30% of emails were falsely classified as URLs. Applying a two-pass analysis eliminated this issue completely.

Additionally, multi-pass analysis provided flexibility to optimize memory usage. Presidio defaults to spaCy's en_core_web_lg entity recognition model, which requires 560 MB. By assessing entities in two stages, we were able to switch to the more compact 40 MB en_core_web_md model without sacrificing accuracy.
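To illustrate why the order of passes matters, here is a toy sketch. It deliberately uses two naive regular expressions as stand-ins for real recognizers; it is not Presidio's API, and the split of entities between the passes is an assumption made for the example.

```python
import re

# Naive, hypothetical detectors standing in for real recognizers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL = re.compile(r"(?:https?://)?[\w-]+(?:\.[\w-]+)+(?:/\S*)?")


def single_pass(text):
    """Run every detector over the same raw text; overlapping hits all count."""
    hits = [("EMAIL", m.group()) for m in EMAIL.finditer(text)]
    hits += [("URL", m.group()) for m in URL.finditer(text)]
    return hits


def two_pass(text):
    """Detect emails first, blank them out, then look for URLs in the rest."""
    hits = [("EMAIL", m.group()) for m in EMAIL.finditer(text)]
    masked = EMAIL.sub(lambda m: " " * len(m.group()), text)
    hits += [("URL", m.group()) for m in URL.finditer(masked)]
    return hits


text = "Write to anna.nowak@example.com or visit https://example.com/docs"
one = single_pass(text)
two = two_pass(text)
```

In the single pass, the URL detector also fires on fragments of the email address; once emails are masked in a first pass, the second pass only sees the genuine URL.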

What’s in a name?

We know Romeo was not thinking of data anonymization or us — sorry, Juliet, no scene-stealing intended — but his plea touches on our own struggles detecting personal entities.

Detecting Polish names specifically posed early challenges, as they were primarily false negatives.

Our first attempts – switching to the large spaCy model and adding a supplemental Polish model – seemed to resolve the issues.

However, overall person entity detection reached 88% accuracy — still short of the 100% precision needed for complete anonymization. Ironically, leveraging AI to improve recognition was not an option, as the very goal was obscuring names from those same language models.

After adjusting the acceptance threshold without success and reviewing the guidelines on maximizing accuracy, we built a custom pattern-based recognizer tailored to our needs, an option the Presidio documentation describes for exactly this kind of situation.

This finally delivered 100% precision, securing complete anonymity and, most importantly, avoiding Romeo's unfortunate end.
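One way a pattern-based recognizer can reach this level of precision is by matching against names that are already known. The sketch below assumes the pattern is built from the full names in the people collection; that is an assumption for illustration, not a description of our recognizer, and the function names and the `<PERSON_n>` placeholder format are hypothetical.

```python
import re


def build_person_pattern(full_names):
    """Compile one pattern that matches any of the known full names."""
    # Longest names first, so a longer name wins over a shorter prefix.
    ordered = sorted(set(full_names), key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # Lookarounds keep matches on whole words, including non-ASCII letters.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


def build_anonymization_map(texts, full_names):
    """Map every known name found in the texts to a stable placeholder."""
    pattern = build_person_pattern(full_names)
    mapping = {}
    for text in texts:
        for match in pattern.finditer(text):
            name = match.group()
            if name not in mapping:
                mapping[name] = f"<PERSON_{len(mapping) + 1}>"
    return mapping


texts = [
    "Łukasz Wiśniewski runs onboarding.",
    "Ask Anna Nowak or Łukasz Wiśniewski.",
]
known = ["Anna Nowak", "Łukasz Wiśniewski", "Jan Kowalski"]
mapping = build_anonymization_map(texts, known)
```

Because the pattern is derived from a list of real names rather than from a statistical model, names with Polish diacritics are matched as reliably as any other.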

Storing world data

With everything in place, we can finally ingest a new version of world data. The version ID can optionally be passed to the pipeline; otherwise, it is uniquely generated.

We store people and pages under the Firestore collection:

ROOT/data/world/[version_id]

And the anonymizer map under:

ROOT/data/anonymizer_map/[version_id]
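The versioning logic can be sketched as follows. Using a UUID for the generated ID is an assumption, since any unique generator would do, and the helper names are hypothetical; `ROOT` is kept as the same placeholder used in the paths above.

```python
import uuid

ROOT = "ROOT"  # placeholder for the real root, as in the paths above


def resolve_version_id(version_id=None):
    """Use the caller's version ID if given, otherwise generate a unique one."""
    return version_id or uuid.uuid4().hex


def world_paths(version_id):
    """Build the storage paths for one version of world data."""
    return {
        "world": f"{ROOT}/data/world/{version_id}",
        "anonymizer_map": f"{ROOT}/data/anonymizer_map/{version_id}",
    }


paths = world_paths(resolve_version_id("v1"))
generated = resolve_version_id()
```

Keeping both paths keyed by the same version ID is what later lets the knowledge pipeline load pages, people, and the anonymization map from a single input.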

Knowledge

This pipeline uses a particular version of world data to create a version of knowledge data, which is the data the application actually uses.

Once again, the involved components are loaders, transformers, and ingesters from the knowledge module.

Loading world data

The pipeline accepts a world data version ID as input and uses it to load:

Pages
People
Anonymization map

Transforming

Both people and pages collections are anonymized using the versioned map. Additionally, pages are chunked and prepared for the vectorstore ingestion.
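Applying a stored map to a piece of text can be sketched like this. The map layout (original value to placeholder) and the placeholder names are assumptions made for the example.

```python
import re


def anonymize(text, mapping):
    """Replace every mapped value in the text with its placeholder."""
    if not mapping:
        return text
    # Longest values first, so overlapping values resolve to the longest match.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(value) for value in ordered))
    return pattern.sub(lambda m: mapping[m.group()], text)


# Hypothetical map, as it might be loaded for a given world version.
mapping = {
    "Anna Nowak": "<PERSON_1>",
    "anna.nowak@example.com": "<EMAIL_1>",
}
result = anonymize("Anna Nowak (anna.nowak@example.com) manages badges.", mapping)
```

Because the map is versioned together with the world data, the same input always produces the same placeholders.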

The chunking strategy

When chunking documents for indexing, determining the optimal segment size poses an inherent tradeoff.

Undoubtedly, the embedding model affects the choice because of its own token capacity, but it is mainly about the balance between representation accuracy and contextual significance.

Excessively large chunks may not reliably capture conceptual granularity in vector representations, while small chunks risk stripping inter-concept connections within a narrative, and topics can straddle boundaries.

There is no universal recipe, and ultimately, the choice of the chunking strategy depends on the nature of the content being indexed and its intended application context.
Are our documents long or short? Is there a hierarchy to preserve? What is the length and complexity of user queries?

We explored a multi-pass strategy: ingesting the corpus into three vector stores, each with a different chunk size. Each query would then be run against all three in parallel, and the aggregated results would be merged and re-ranked.

Despite promising accuracy, two primary drawbacks emerged:

Latency suffered with the added coordination complexity
Economic costs tripled through replicated indexing
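For context, the merge step of such a multi-store strategy could look like the sketch below. It uses reciprocal rank fusion, which is one common way to combine ranked lists and is an assumption here, not the method we used; the identifiers stand for hypothetical source passages.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from three stores with different chunk sizes.
small_chunks = ["a", "b", "c"]
medium_chunks = ["b", "a", "d"]
large_chunks = ["b", "c", "a"]
fused = reciprocal_rank_fusion([small_chunks, medium_chunks, large_chunks])
```

Even in this small form, the coordination cost is visible: three retrievals have to complete before any ranking can start.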

Ultimately, we adopted an approach from LlamaIndex called Metadata Replacement + Node Sentence Window. This technique splits documents into very small, sentence-level units during ingestion and stores the surrounding context as metadata. At retrieval time, the isolated sentence that matched is not returned on its own; instead, the stored metadata supplies a wider passage of sentences that provides the crucial context.

In practice, Node Sentence Windows enhanced accuracy over a single chunk size while simultaneously reducing latency and costs by sidestepping full duplication. This method delivered an optimal balance between precision and efficiency.
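The mechanics can be sketched in a few lines of plain Python. This is a simplified stand-in rather than the LlamaIndex implementation: the sentence splitter is naive, the node layout is hypothetical, and the window is assumed to count sentences on each side of the matched one.

```python
import re


def build_sentence_nodes(text, window_size=2):
    """Split text into sentence nodes, each carrying its surrounding window."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,  # what gets embedded and matched
            "metadata": {"window": " ".join(sentences[start:end])},
        })
    return nodes


def replace_with_window(node):
    """After retrieval, hand over the window instead of the lone sentence."""
    return node["metadata"]["window"]


text = ("Badges are issued on Mondays. Ask the front desk. "
        "Bring an ID. Visitors need a host.")
nodes = build_sentence_nodes(text, window_size=1)
context = replace_with_window(nodes[1])
```

The embedding stays precise because it covers a single sentence, while the passage handed to the language model stays readable because it includes the neighbours.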

After our implementation, we became aware of other possible approaches. Among them, we plan to investigate the hypothetical questions approach, and we will share more details in our roadmap.

Embeddings tests

Apart from the obvious OpenAIEmbeddings(text-embedding-ada-002), we tested alternatives, keeping an eye on the MTEB leaderboard at https://huggingface.co/spaces/mteb/leaderboard.

Eventually, we decided to stay with the first one and put our investigation on hold until we had a solid evaluation framework.

Storing knowledge data

In addition to a world version ID, a knowledge version ID can optionally be passed to the pipeline; otherwise, it is uniquely generated.

Using the latter, we store the processed people collection in Firestore, under:

ROOT/data/knowledge/[knowledge version id]/people

Finally, we index the pages collection using the knowledge configuration.

At the time of writing, we use:
- OpenAIEmbeddings(text-embedding-ada-002)
- Chroma vectorstore

Under the collection:

pages_[knowledge version id]

Storing knowledge data snapshot

Once the knowledge data has been processed and ingested, the pipeline records an immutable "snapshot" containing run metadata such as:

The knowledge base version identifier
The world base version identifier, used as a source
Knowledge configuration parameters (which are the ones we use for indexing)

At query time, the snapshot can configure additional downstream processes to match settings used during ingestion.

As two examples:

The memory module utilizes the snapshot data to instantiate a retriever aligned to the original vectorization scheme and content embeddings. This ensures high relevance without requiring custom coordination logic.
The anonymization techniques in the skillset module reference the snapshot’s world version identifier when constructing mappings to replace sensitive data. Again, the complete context gets encapsulated in a single configuration object.

In summary, the snapshot allows consistency without sacrificing modularity across components. More details about our approach are available in the configuration section of the previous chapter.

At the time of writing, this is what the knowledge configuration snapshot looks like when serialized:

{
  "id": [knowledge version id],
  "world_version": [world version id],
  "knowledge_configuration": {
    "vector_store": {
      "pages": {
        "mode": "chroma",
        "metadata": {
          "collection": "pages"
        }
      }
    },
    "embeddings": {
      "pages": {
        "mode": "openai",
        "metadata": {
          "model": "text-embedding-ada-002"
        }
      }
    },
    "chunking": {
      "pages": {
        "mode": "sentence_window",
        "metadata": {
          "window_size": 2
        }
      }
    }
  }
}

And it is stored in Firestore, under:

ROOT/configuration_snapshots/knowledge/[knowledge version id]
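To show how a downstream module might consume such a snapshot at query time, here is a small sketch. The sample IDs (`k-001`, `w-001`) are made up, only part of the configuration is included, and combining the collection name with the knowledge version ID is an assumption based on the naming shown earlier.

```python
import json

# Hypothetical snapshot with made-up IDs and a subset of the configuration.
SNAPSHOT = json.loads("""
{
  "id": "k-001",
  "world_version": "w-001",
  "knowledge_configuration": {
    "vector_store": {"pages": {"mode": "chroma", "metadata": {"collection": "pages"}}},
    "chunking": {"pages": {"mode": "sentence_window", "metadata": {"window_size": 2}}}
  }
}
""")


def query_time_settings(snapshot):
    """Derive the settings other modules need from a single snapshot."""
    config = snapshot["knowledge_configuration"]
    store = config["vector_store"]["pages"]
    chunking = config["chunking"]["pages"]
    return {
        "vector_store": store["mode"],
        "collection": f'{store["metadata"]["collection"]}_{snapshot["id"]}',
        "chunking": chunking["mode"],
        "window_size": chunking["metadata"]["window_size"],
        "anonymizer_map_path": f'ROOT/data/anonymizer_map/{snapshot["world_version"]}',
    }


settings = query_time_settings(SNAPSHOT)
```

Everything a retriever or an anonymizer needs comes from one object, so no module has to guess which settings were used at ingestion time.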

Coming next

Thank you for exploring Mobegí’s data ingestion pipeline! We covered critical challenges like person entity recognition for anonymity and optimizing vector embeddings and chunk sizes.

If you’re interested, our next chapter unpacks the architecture of our conversational system, outlining its core components for responsible and effective dialog.