Vector databases

chromadb home page

It is called trychroma for some reason.

home page

install error chromadb 0.3.29 depends on onnxruntime>=1.14.1

  1. Too many packages
  2. Fails on a lot of dependencies
  3. Will try Pinecone cloud instead
  1. Seems to work
  2. That was the latest in the 3.11 series

Usage guide, Guide

Chromadb docs at hugging face

Chromadb at hugging face

sentence embedding models at hugging face

embeddings from an API

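Getting started with embeddings: HF

import chromadb
# The log, datasetutils, and fileutils modules used below are the author's own helpers (not shown).
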
"""
**********************************************
Examine the database
**********************************************
"""
def examineCollection(col: chromadb.Collection):

    #Print collection name
    log.ph1(f"Examining collection: {col.name}")

    # Peek at the collection (returns the first 10 items by default)
    result = col.peek()
    
    #
    # result: 
    # Dictionary of ids, metadatas, documents, embeddings, uris
    #
    log.dprint(f"ids:{result['ids']}")
    log.dprint(f"\nMetadatas: {result['metadatas']}")
    log.dprint(f"\nuris: {result['uris']}")

    # Documents and embeddings are not printed because they could be too large
    return

"""
**********************************************
Populate the database
**********************************************
"""

def addASonnet(col: chromadb.Collection, sonnet: str, roman: str):
    log.validate_not_null_or_empty(col, sonnet, roman)
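    # Use the sonnet number as the document id (avoids shadowing the built-in id())
    sonnet_number = datasetutils.roman_to_int(roman)
    id_str = str(sonnet_number)
    col.upsert(
        ids=[id_str],
        documents=[sonnet],
        metadatas=[{"Sonnet number": id_str, "Sonnet Roman Numeral": roman}])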
    return

def addFromASonnetDictionary(col: chromadb.Collection, sonnetsDict: dict):
    #key: roman numeral
    #value: sonnet
    for roman, sonnet in sonnetsDict.items():
        log.dprint(f"Adding sonnet:{roman}")
        addASonnet(col, sonnet, roman)
    return

def addSonnetDatasetToChromadb():
    log.ph1("Adding 20 sonnets to chromadb")

    #get sonnets
    log.dprint("Getting 20 sonnet dictionary")
    sonnetDict = datasetutils.get20SonnetsDictionary()
    log.summarizeDictionary(sonnetDict)

    #get a chromadb collection
    log.dprint("Get chromadb collection")
    col = getOrCreateATestCollection()

    log.dprint("Add sonnets")
    addFromASonnetDictionary(col,sonnetDict)

    log.ph1("Done with creating all sonnets")
    return

def populateDatabase():
    addSonnetDatasetToChromadb()

def getChromaClientPath():
    path = fileutils.getTempDataRoot()
    filename = "chromadb1"
    new_db_path = fileutils.pathjoin(path,filename)
    return new_db_path

def getChromaDBClient():
    chromadb_path = getChromaClientPath()
    chromadbClient = chromadb.PersistentClient(chromadb_path)
    return chromadbClient

def getOrCreateCollection(client: chromadb.ClientAPI, name: str):
    return client.get_or_create_collection(name)

def getOrCreateATestCollection() -> chromadb.Collection:
    #create chroma db
    log.ph1("Creating chromadb")
    chromaClient = getChromaDBClient()
    col = chromaClient.get_or_create_collection("Test_Collection")
    log.dprint("Get/Created test collection")
    return col
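
addASonnet above calls datasetutils.roman_to_int, which is not shown in these notes. A minimal sketch of what such a helper could look like, assuming standard subtractive Roman numerals; the body here is a guess at the behavior, not the author's actual code:

```python
# Hypothetical stand-in for datasetutils.roman_to_int (the original helper is not shown).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(roman: str) -> int:
    """Convert a Roman numeral such as 'XVIII' to an integer."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the one to its right is subtracted.
    for symbol in reversed(roman.strip().upper()):
        value = ROMAN_VALUES[symbol]  # raises KeyError on an invalid symbol
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


print(roman_to_int("XVIII"))
```

The result is turned into a string before it is used as a Chroma id, since ids are passed to upsert as strings.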

Examples at Chromadb Github

Github for this code


collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
  1. query_texts is a collection of inputs
  2. Each input is vectorized and compared against the stored embeddings
  3. $contains is an operator that qualifies how the search string is used
  4. Similarly, there are a number of these operators for the where clause and the where_document clause
  5. See the Usage Guide document
  1. docs
  2. metadata
  3. embeddings etc.
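
To make the query notes above concrete, here is a toy model of the flow in plain Python: filter by metadata, filter by document text, vectorize each query, rank by similarity. This is not Chroma's implementation. The bag-of-words "embedding", the cosine ranking, and the names vectorize, cosine, and toy_query are all illustrative assumptions, and only equality for where and $contains for where_document are modelled.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_query(records, query_texts, n_results=10, where=None, where_document=None):
    """Return one ranked list of ids per query text (toy model, not Chroma's code)."""
    all_ids = []
    for query in query_texts:
        query_vector = vectorize(query)
        scored = []
        for record in records:
            # where: every listed metadata field must equal the given value
            if where and any(record["metadata"].get(k) != v for k, v in where.items()):
                continue
            # where_document: only the $contains operator is modelled here
            if where_document and where_document["$contains"] not in record["document"]:
                continue
            scored.append((cosine(query_vector, vectorize(record["document"])), record["id"]))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        all_ids.append([record_id for _, record_id in scored[:n_results]])
    return all_ids


records = [
    {"id": "1", "document": "From fairest creatures we desire increase",
     "metadata": {"Sonnet Roman Numeral": "I"}},
    {"id": "2", "document": "When forty winters shall besiege thy brow",
     "metadata": {"Sonnet Roman Numeral": "II"}},
    {"id": "18", "document": "Shall I compare thee to a summer's day?",
     "metadata": {"Sonnet Roman Numeral": "XVIII"}},
]

print(toy_query(records, ["a summer's day", "forty winters"], n_results=1))
print(toy_query(records, ["thee"], where_document={"$contains": "thee"}))
```

As with collection.query, the result holds one list per query text, so two inputs give two ranked lists.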

Chunking plays a massive role in locating similar content!!
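
One simple way to chunk a sonnet before adding it to a collection is by groups of lines, optionally with overlap. A minimal sketch; the function name chunk_lines and the default sizes are assumptions for illustration, not something taken from the Chroma docs:

```python
def chunk_lines(text: str, size: int = 4, overlap: int = 1) -> list[str]:
    """Split text into chunks of `size` lines; consecutive chunks share `overlap` lines."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    # Drop blank lines so stanza breaks do not produce empty chunk entries
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # the last window already reached the end of the text
    return chunks


sample = "\n".join(f"line {i}" for i in range(1, 15))  # 14 lines, like a sonnet
for number, chunk in enumerate(chunk_lines(sample, size=4, overlap=0), start=1):
    print(f"chunk {number}: {chunk.splitlines()[0]} ... {chunk.splitlines()[-1]}")
```

With size=4 and overlap=0 a 14-line sonnet splits into three quatrains and the closing couplet. Each chunk would then need its own id when upserted, for example the sonnet number plus a chunk index.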


python -m spacy download en_core_web_sm

Spacy segmentation

How come spacy doesn't break English sentences at periods and question marks out of the box?
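
A likely reason: in spaCy's trained pipelines such as en_core_web_sm, sentence boundaries normally come from the dependency parse, not from punctuation rules. For purely punctuation-based splitting, spaCy ships a rule-based sentencizer component. A minimal sketch, assuming spaCy v3 (where add_pipe takes the component name as a string); split_sentences is an illustrative name:

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components,
# so this runs without downloading en_core_web_sm.
nlp = spacy.blank("en")
# The rule-based sentencizer starts a new sentence after sentence-final punctuation.
nlp.add_pipe("sentencizer")


def split_sentences(text: str) -> list[str]:
    """Return the sentences of `text` as plain strings."""
    return [sentence.text.strip() for sentence in nlp(text).sents]


text = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."
for sentence in split_sentences(text):
    print(sentence)
```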

How do I reset a chromadb collection


def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    client.delete_collection(sonnet_collection_name)

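# Safer variant: return early when the collection does not exist
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    log.ph("Deleting Collection",sonnet_collection_name)

    try:
        client.get_collection(sonnet_collection_name)
    except ValueError:
        # The original code expects ValueError for a missing collection;
        # the exception type may differ across chromadb releases
        log.dprint("test collection doesn't exist")
        return

    client.delete_collection(sonnet_collection_name)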

Chromadb embeddings guide

Chromadb GitHub Jupyter notebooks on embeddings

Chroma embedding functions are here in github

  1. chromadb/utils/embedding_functions.py
  2. chromadb/api/types.py

Chroma base class def
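
Chroma embedding functions are callables that take a list of documents and return one vector per document. Below is a toy sketch of that shape in plain Python. The class name HashingEmbeddingFunction, the dimensions parameter, and the word-hashing scheme are illustrative assumptions, not Chroma code; newer Chroma releases may also expect extra methods on custom embedding functions, so check the base class definition for the version in use.

```python
import hashlib
import math
import re


class HashingEmbeddingFunction:
    """Toy embedding function: hashes words into a fixed number of buckets."""

    def __init__(self, dimensions: int = 64):
        self.dimensions = dimensions

    def __call__(self, input: list[str]) -> list[list[float]]:
        # The parameter is named `input` to mirror Chroma's embedding function signature.
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> list[float]:
        vector = [0.0] * self.dimensions
        for word in re.findall(r"[a-z']+", text.lower()):
            # hashlib gives a stable hash across runs (the built-in hash() is salted)
            digest = hashlib.sha256(word.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dimensions
            vector[bucket] += 1.0
        # L2-normalize so vectors of different-length texts are comparable
        norm = math.sqrt(sum(v * v for v in vector))
        return [v / norm for v in vector] if norm else vector


embed = HashingEmbeddingFunction(dimensions=16)
vectors = embed(["Shall I compare thee to a summer's day?", "When forty winters shall besiege thy brow"])
print(len(vectors), len(vectors[0]))
```

An instance like this would be handed to Chroma through the embedding_function argument of get_or_create_collection. A real setup would use one of the sentence embedding models instead of word hashing.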