Vector databases

satya - 1/24/2024, 2:05:08 PM

chromadb home page

satya - 1/24/2024, 2:08:40 PM

It is called trychroma for some reason.

satya - 1/24/2024, 2:09:01 PM

home page

satya - 1/24/2024, 2:21:58 PM

install error chromadb 0.3.29 depends on onnxruntime>=1.14.1

satya - 1/24/2024, 3:35:39 PM

Installation of chromadb fails dramatically on Python 3.12

  1. Too many packages
  2. Fails on a lot of dependencies
  3. Will try Pinecone cloud instead

satya - 1/29/2024, 12:06:30 PM

I have installed Python 3.11.7

  1. Seems to work
  2. That was the latest in the 3.11 series

satya - 1/29/2024, 3:35:54 PM

Usage guide, Guide

satya - 1/29/2024, 3:59:58 PM

Chromadb docs at hugging face

satya - 1/29/2024, 4:05:18 PM

Chromadb at hugging face

satya - 1/30/2024, 1:42:16 PM

sentence embedding models at hugging face

satya - 1/30/2024, 1:54:36 PM

embeddings from an API

satya - 1/30/2024, 1:55:57 PM

Getting started with embeddings: HF

satya - 1/30/2024, 4:20:49 PM

Examine the database


"""
**********************************************
Examine the database
**********************************************
"""
import chromadb

def examineCollection(col: chromadb.Collection):

    #Print collection name
    log.ph1(f"Examining collection: {col.name}")

    #Get the result (peek returns the first 10 items by default)
    result = col.peek()
    
    #
    # result: 
    # Dictionary of ids, metadatas, documents, embeddings, uris
    #
    log.dprint(f"ids:{result['ids']}")
    log.dprint(f"\nMetadatas: {result['metadatas']}")
    log.dprint(f"\nuris: {result['uris']}")

    #Not printing documents for they could be too large on a print
    #Same with embeddings
    return

satya - 1/30/2024, 4:21:33 PM

Populating a database


"""
**********************************************
Populate the database
**********************************************
"""

def addASonnet(col: chromadb.Collection, sonnet: str, roman: str):
    log.validate_not_null_or_empty(col, sonnet, roman)
    #Chroma ids are strings: use the sonnet number as the id
    id_str = str(datasetutils.roman_to_int(roman))
    col.upsert(
        ids=[id_str],
        documents=[sonnet],
        metadatas=[{"Sonnet number": id_str, "Sonnet Roman Numeral": roman}])
    return

def addFromASonnetDictionary(col: chromadb.Collection, sonnetsDict: dict):
    #key: roman numeral
    #value: sonnet
    for roman, sonnet in sonnetsDict.items():
        log.dprint(f"Adding sonnet:{roman}")
        addASonnet(col,sonnet,roman)
    return

def addSonnetDatasetToChromadb():
    log.ph1("Adding 20 sonnets to chromadb")

    #get sonnets
    log.dprint("Getting 20 sonnet dictionary")
    sonnetDict = datasetutils.get20SonnetsDictionary()
    log.summarizeDictionary(sonnetDict)

    #get a chromadb collection
    log.dprint("Get chromadb collection")
    col = getOrCreateATestCollection()

    log.dprint("Add sonnets")
    addFromASonnetDictionary(col,sonnetDict)

    log.ph1("Done with creating all sonnets")
    return

def populateDatabase():
    addSonnetDatasetToChromadb()
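
The populate code above relies on datasetutils.roman_to_int, which is not shown in these notes. A minimal sketch of what such a helper might look like follows; the implementation and the sonnet_id wrapper are assumptions for illustration, not the actual datasetutils code.

```python
# Hypothetical stand-in for datasetutils.roman_to_int (the real helper is not shown).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(roman: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    prev = 0
    # Walk right to left: a smaller value standing before a larger one is subtracted.
    for ch in reversed(roman.strip().upper()):
        value = ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total

def sonnet_id(roman: str) -> str:
    """Hypothetical wrapper: Chroma ids are strings, so return the number as text."""
    return str(roman_to_int(roman))
```

Using the sonnet number as the id keeps upsert idempotent: adding the same sonnet twice updates the existing record instead of creating a second one.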

satya - 1/30/2024, 4:23:42 PM

Creating the database


def getChromaClientPath():
    path = fileutils.getTempDataRoot()
    filename ="chromadb1"
    new_db_path = fileutils.pathjoin(path,filename)
    return new_db_path

def getChromaDBClient():
    chromadb_path = getChromaClientPath()
    chromadbClient = chromadb.PersistentClient(chromadb_path)
    return chromadbClient

def getOrCreateCollection(client: chromadb.ClientAPI, name: str):
    return client.get_or_create_collection(name)

def getOrCreateATestCollection() -> chromadb.Collection:
    #create chroma db
    log.ph1("Creating chromadb")
    chromaClient = getChromaDBClient()
    col = chromaClient.get_or_create_collection("Test_Collection")
    log.dprint("Get/Created test collection")
    return col

satya - 1/30/2024, 4:29:43 PM

Examples at Chromadb Github

satya - 1/30/2024, 4:36:19 PM

Github for this code

satya - 1/31/2024, 3:55:38 PM

querying by general text


collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

satya - 1/31/2024, 4:00:26 PM

Key elements

  1. query_texts is a list of query inputs
  2. Each input is embedded (vectorized) and compared against the stored embeddings; results come back per input
  3. $contains is an operator that qualifies how the search string is matched against the document text
  4. Similarly, there are a number of such operators for the where clause and for the where_document clause
  5. See the Usage Guide document
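
To make the key elements above concrete, here is a toy model of a query in plain Python. It is only a sketch of the idea and not how Chroma is implemented: the word-count "embedding", the cosine comparison, and the toy_query function with its where/contains parameters are all assumptions made for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real embedding model returns dense float vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_query(records, query_texts, n_results=2, where=None, contains=None):
    """records: list of (id, document, metadata). Returns one list of ids per query text."""
    results = []
    for q in query_texts:
        qv = embed(q)
        candidates = [
            (cosine(qv, embed(doc)), rid)
            for rid, doc, meta in records
            # where: every listed metadata field must equal the given value
            if all(meta.get(k) == v for k, v in (where or {}).items())
            # contains: plays the role of where_document={"$contains": ...}
            and (contains is None or contains in doc)
        ]
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        results.append([rid for _, rid in candidates[:n_results]])
    return results

records = [
    ("1", "shall i compare thee to a summer day", {"author": "shakespeare"}),
    ("2", "thus spake zarathustra to the people", {"author": "nietzsche"}),
    ("3", "a summer day in the mountains", {"author": "nietzsche"}),
]
hits = toy_query(records, ["summer day", "zarathustra"], n_results=1)
filtered = toy_query(records, ["summer day"], where={"author": "nietzsche"}, contains="mountains")
```

The point to notice is the shape of the result: one ranked list per query input, with the filters narrowing the candidates before the similarity ranking is applied.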

satya - 1/31/2024, 4:01:40 PM

You can also constrain which fields are returned, among

  1. docs
  2. metadata
  3. embeddings etc.

satya - 2/1/2024, 8:46:23 AM

Chunking plays a massive role in locating similar content!!
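
Each chunk gets its own embedding, so the chunk boundaries decide what text a query can match. Below is a minimal sketch of one common approach, fixed-size word windows with overlap; the function name and the default sizes are assumptions, not values taken from these notes.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of chunk_size words; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_words("one two three four five six seven", chunk_size=4, overlap=2)
```

The overlap means a phrase that straddles a boundary still appears whole in at least one chunk, as long as it is no longer than the overlap.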

satya - 2/1/2024, 9:01:38 AM

For spaCy you have to run this in the terminal


python -m spacy download en_core_web_sm

satya - 2/1/2024, 10:30:22 AM

Spacy segmentation

satya - 2/1/2024, 10:43:55 AM

How come spacy doesn't break English sentences at periods and question marks out of the box?
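
One possible explanation, as far as I understand it: the pretrained English pipelines take sentence boundaries from the dependency parser rather than from punctuation alone. spaCy also ships a rule-based sentencizer component that splits on punctuation. A minimal sketch, assuming spaCy v3; the sample text is made up.

```python
import spacy

# A blank English pipeline has only a tokenizer, so no model download is needed.
nlp = spacy.blank("en")
# The sentencizer marks sentence starts after punctuation such as ".", "!" and "?".
nlp.add_pipe("sentencizer")

text = "Shall I compare thee to a summer's day? Thou art more lovely and more temperate."
sentences = [sent.text for sent in nlp(text).sents]
```

For verse, where lines often end without a period, neither approach is guaranteed to give the boundaries you want, so it is worth printing the sentences and checking.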

satya - 2/1/2024, 12:21:26 PM

How do I reset a chromadb collection

satya - 2/1/2024, 12:26:51 PM

deleting a collection


def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    client.delete_collection(sonnet_collection_name)

satya - 2/1/2024, 12:42:04 PM

A better one


def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    log.ph("Deleting Collection",sonnet_collection_name)

    #Look the collection up first: get_collection raises if it doesn't exist
    try:
        client.get_collection(sonnet_collection_name)
    except ValueError:
        log.dprint("test collection doesn't exist")
        return

    client.delete_collection(sonnet_collection_name)

satya - 2/8/2024, 4:55:57 PM

Chromadb embeddings guide

satya - 2/8/2024, 5:03:27 PM

Chromadb github jupy notebooks on embeddings

satya - 2/8/2024, 5:05:20 PM

Chroma embedding functions are here in github

satya - 2/8/2024, 5:06:19 PM

Useful modules

  1. chromadb.utils.embedding_functions
  2. chromadb.api.types

satya - 2/8/2024, 5:09:38 PM

Chroma base class def
