Vector databases
satya - 1/24/2024, 2:08:40 PM
It is called trychroma for some reason.
satya - 1/24/2024, 2:21:58 PM
install error chromadb 0.3.29 depends on onnxruntime>=1.14.1
satya - 1/24/2024, 3:35:39 PM
Installation of chromadb fails dramatically on Python 3.12
- Too many packages
- Fails on a lot of dependencies
- Will try Pinecone cloud instead
satya - 1/29/2024, 12:06:30 PM
I have installed Python 3.11.7
- 1. seems to work
- 2. that was the latest in the 3.11 series
satya - 1/29/2024, 3:59:58 PM
Chromadb docs at Hugging Face
satya - 1/30/2024, 1:42:16 PM
Sentence embedding models at Hugging Face
satya - 1/30/2024, 1:55:57 PM
Getting started with embeddings: HF
satya - 1/30/2024, 4:20:49 PM
Examine the database
"""
**********************************************
Examine the database
**********************************************
"""
def examineCollection(col: chromadb.Collection):
    #Print collection name
    log.ph1(f"Examining collection: {col.name}")
    #Get the first 10 items (peek's default limit)
    result = col.peek()
    
    #
    # result: 
    # Dictionary of ids, metadatas, documents, embeddings, uris
    #
    log.dprint(f"ids:{result['ids']}")
    log.dprint(f"\nMetadatas: {result['metadatas']}")
    log.dprint(f"\nuris: {result['uris']}")
    #Not printing documents, for they could be too large to print
    #Same with embeddings
    return
satya - 1/30/2024, 4:21:33 PM
Populating a database
"""
**********************************************
Populate the database
**********************************************
"""
def addASonnet(col: chromadb.Collection, sonnet: str, roman: str):
    log.validate_not_null_or_empty(col, sonnet, roman)
    #Avoid shadowing the built-in id()
    sonnet_number = datasetutils.roman_to_int(roman)
    id_str = str(sonnet_number)
    col.upsert(
        ids=[id_str],
        documents=[sonnet],
        metadatas=[{"Sonnet number": id_str, "Sonnet Roman Numeral": roman}])
    return
def addFromASonnetDictionary(col: chromadb.Collection, sonnetsDict: dict):
    #key: roman numeral
    #value: sonnet
    for roman, sonnet in sonnetsDict.items():
        log.dprint(f"Adding sonnet:{roman}")
        addASonnet(col,sonnet,roman)
    return
def addSonnetDatasetToChromadb():
    log.ph1("Adding 20 sonnets to chromadb")
    #get sonnets
    log.dprint("Getting 20 sonnet dictionary")
    sonnetDict = datasetutils.get20SonnetsDictionary()
    log.summarizeDictionary(sonnetDict)
    #get a chromadb collection
    log.dprint("Get chromadb collection")
    col = getOrCreateATestCollection()
    log.dprint("Add sonnets")
    addFromASonnetDictionary(col,sonnetDict)
    log.ph1("Done with creating all sonnets")
    return
def populateDatabase():
    addSonnetDatasetToChromadb()
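The datasetutils.roman_to_int helper used above is not shown in these notes. A minimal sketch of what such a helper might look like, assuming plain uppercase or lowercase Roman numerals as dictionary keys (this is a hypothetical stand-in, not the actual datasetutils code):

```python
def roman_to_int(roman: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer (hypothetical stand-in)."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    prev = 0
    # Walk right to left: a symbol smaller than the largest seen so far is subtracted
    for ch in reversed(roman.strip().upper()):
        value = values[ch]  # raises KeyError on an invalid symbol
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


print(roman_to_int("XVIII"))
```

The string form of the result is what ends up as the Chroma id, so two sonnets with the same numeral would overwrite each other on upsert.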
satya - 1/30/2024, 4:23:42 PM
Creating the database
def getChromaClientPath():
    path = fileutils.getTempDataRoot()
    filename = "chromadb1"
    new_db_path = fileutils.pathjoin(path,filename)
    return new_db_path
def getChromaDBClient():
    chromadb_path = getChromaClientPath()
    chromadbClient = chromadb.PersistentClient(chromadb_path)
    return chromadbClient
def getOrCreateCollection(client: chromadb.ClientAPI, name: str):
    return client.get_or_create_collection(name)
def getOrCreateATestCollection() -> chromadb.Collection:
    #create chroma db
    log.ph1("Creating chromadb")
    chromaClient = getChromaDBClient()
    col = chromaClient.get_or_create_collection("Test_Collection")
    log.dprint("Get/Created test collection")
    return col
satya - 1/31/2024, 3:55:38 PM
querying by general text
collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
satya - 1/31/2024, 4:00:26 PM
Key elements
- query_texts is a collection of inputs; results come back per input
- Each input is vectorized and compared against the stored embeddings
- $contains is an operator that qualifies how the search string is matched against the document text
- Similarly, there are a number of these operators for the where clause and also the where_document clause
- See the Usage Guide document
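The where and where_document operators can be combined into one set of query arguments. A minimal sketch, assuming the "Sonnet Roman Numeral" metadata key from the populate code; build_sonnet_query is a hypothetical helper, and only the $eq, $or and $contains operators are used:

```python
def build_sonnet_query(texts, romans=None, must_contain=None, n_results=5):
    """Assemble keyword arguments for collection.query (hypothetical helper)."""
    args = {"query_texts": list(texts), "n_results": n_results}
    if romans:
        clauses = [{"Sonnet Roman Numeral": {"$eq": r}} for r in romans]
        # A single clause is passed as-is; several clauses are OR-ed together
        args["where"] = clauses[0] if len(clauses) == 1 else {"$or": clauses}
    if must_contain:
        # Restrict matches to documents whose text contains this string
        args["where_document"] = {"$contains": must_contain}
    return args


args = build_sonnet_query(["summer's day"], romans=["XVIII", "XIX"], must_contain="summer")
print(args)
```

The resulting dictionary can then be passed along as collection.query(**args).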
satya - 1/31/2024, 4:01:40 PM
You can also constrain what is returned (the include argument of query), choosing among
- docs
- metadata
- embeddings etc.
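Each field of a query result holds one inner list per query text, and fields left out of include come back as None. A minimal sketch of turning that shape into rows; rows_for_query is a hypothetical helper, and the sample dictionary is hand-written illustrative data, not real query output:

```python
def rows_for_query(result: dict, query_index: int = 0) -> list:
    """Zip the parallel per-query lists of a query result into row dicts."""
    ids = result["ids"][query_index]
    rows = [{"id": item_id} for item_id in ids]
    for field in ("documents", "metadatas", "distances"):
        values = result.get(field)
        if values is None:  # field was not requested via include
            continue
        for row, value in zip(rows, values[query_index]):
            row[field] = value
    return rows


# Illustrative, hand-written data in the shape of a query result
sample = {
    "ids": [["18", "19"]],
    "documents": [["text of sonnet 18", "text of sonnet 19"]],
    "metadatas": None,
    "distances": [[0.42, 0.57]],
}
for row in rows_for_query(sample):
    print(row)
```

A real call would look like collection.query(query_texts=[...], include=["documents", "distances"]), with the result passed to the helper.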
satya - 2/1/2024, 8:46:23 AM
Chunking plays a massive role in locating similar content!!
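How a document is split decides what a query can match, since each chunk gets its own embedding. A minimal sketch of one possible scheme, overlapping windows of lines (roughly quatrain-sized for a sonnet); chunk_lines and its defaults are assumptions for illustration, not the chunking used in these notes:

```python
def chunk_lines(text: str, size: int = 4, overlap: int = 1) -> list:
    """Split text into chunks of `size` non-empty lines, sharing `overlap` lines."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append("\n".join(lines[start:start + size]))
        if start + size >= len(lines):
            break  # this window already reached the end of the text
    return chunks


demo = "\n".join(f"line {i}" for i in range(1, 8))
for chunk in chunk_lines(demo):
    print(repr(chunk))
```

Each chunk would then be upserted under its own id, for example the sonnet number plus a chunk index.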
satya - 2/1/2024, 9:01:38 AM
For spaCy, you have to download the English model from the terminal
python -m spacy download en_core_web_sm
satya - 2/1/2024, 10:43:55 AM
How come spacy doesn't break English sentences at periods and question marks out of the box?
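One likely reason: trained pipelines such as en_core_web_sm take sentence boundaries from the dependency parser rather than from punctuation alone. spaCy also provides a rule-based sentencizer component for punctuation-based splitting. The sketch below is not spaCy; it is a plain-Python stand-in, with a hypothetical split_sentences helper, that shows the punctuation rule itself:

```python
import re

# Split at whitespace that follows a period, question mark, or exclamation mark
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text: str) -> list:
    """Rule-based sentence split on ., ? and ! (no handling of abbreviations)."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


print(split_sentences("Is it so? Yes. Fine!"))
```

A rule this simple breaks on abbreviations such as "Dr." and on decimals followed by a space, which is part of why a trained segmenter behaves differently.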
satya - 2/1/2024, 12:21:26 PM
How do I reset a chromadb collection
satya - 2/1/2024, 12:26:51 PM
deleting a collection
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    client.delete_collection(sonnet_collection_name)
satya - 2/1/2024, 12:42:04 PM
A better one
def deleteTestCollection():
    client = getChromaDBClient()
    log.ph("Deleting Collection", sonnet_collection_name)
    try:
        # Probe for the collection; only the exception matters here
        client.get_collection(sonnet_collection_name)
    except ValueError:
        # Assumption: a missing collection raises ValueError in the chromadb
        # version used here; other versions may raise a different type
        log.dprint("test collection doesn't exist")
        return
    client.delete_collection(sonnet_collection_name)
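For the earlier question of resetting a collection, deleting and recreating can be folded into one step. A minimal sketch, assuming only the delete_collection and get_or_create_collection client calls already used in these notes; reset_collection is a hypothetical helper:

```python
def reset_collection(client, name: str):
    """Drop the named collection if it exists and return a new, empty one."""
    try:
        client.delete_collection(name)
    except Exception as exc:
        # Assumption: deleting a missing collection raises, and the exception
        # type differs between chromadb versions, hence the broad catch
        print(f"nothing to delete for {name!r}: {exc}")
    return client.get_or_create_collection(name)
```

Usage would be col = reset_collection(getChromaDBClient(), sonnet_collection_name). The broad except also hides unrelated failures, so narrowing it to the exception type of the installed version is the safer choice once that type is known.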
satya - 2/8/2024, 5:03:27 PM
Chromadb GitHub Jupyter notebooks on embeddings
satya - 2/8/2024, 5:05:20 PM
Chroma embedding functions are here on GitHub
satya - 2/8/2024, 5:06:19 PM
Useful packages
- chromadb.utils.embedding_functions.py
- chromadb.api.types.py
