Vector databases
satya - 1/24/2024, 2:08:40 PM
It is called trychroma for some reason.
satya - 1/24/2024, 2:21:58 PM
install error chromadb 0.3.29 depends on onnxruntime>=1.14.1
satya - 1/24/2024, 3:35:39 PM
Installation of chromadb fails dramatically on Python 3.12
- Too many packages
- Fails on a lot of dependencies
- Will try Pinecone cloud instead
satya - 1/29/2024, 12:06:30 PM
I have installed Python 3.11.7
- 1. Seems to work
- 2. That was the latest in the 3.11 series
satya - 1/29/2024, 3:59:58 PM
Chromadb docs at hugging face
satya - 1/30/2024, 1:42:16 PM
sentence embedding models at hugging face
satya - 1/30/2024, 1:55:57 PM
Getting started with embeddings: HF
satya - 1/30/2024, 4:20:49 PM
Examine the database
"""
**********************************************
Examine the database
**********************************************
"""
def examineCollection(col: chromadb.Collection):
    # Print the collection name
    log.ph1(f"Examining collection: {col.name}")

    # peek() returns the first few items (10 by default)
    result = col.peek()

    #
    # result:
    # Dictionary of ids, metadatas, documents, embeddings, uris
    #
    log.dprint(f"ids:{result['ids']}")
    log.dprint(f"\nMetadatas: {result['metadatas']}")
    log.dprint(f"\nuris: {result['uris']}")

    # Not printing documents because they could be too large to print
    # Same with embeddings
    return
satya - 1/30/2024, 4:21:33 PM
Populating a database
"""
**********************************************
Populate the database
**********************************************
"""
def addASonnet(col: chromadb.Collection, sonnet: str, roman: str):
    log.validate_not_null_or_empty(col, sonnet, roman)
    # The sonnet number (as a string) doubles as the record id
    sonnet_id = datasetutils.roman_to_int(roman)
    id_str = f"{sonnet_id}"
    col.upsert(
        ids=[id_str],
        documents=[sonnet],
        metadatas=[{"Sonnet number": id_str, "Sonnet Roman Numeral": roman}])
    return

def addFromASonnetDictionary(col: chromadb.Collection, sonnetsDict: dict):
    # key: roman numeral
    # value: sonnet
    for roman, sonnet in sonnetsDict.items():
        log.dprint(f"Adding sonnet:{roman}")
        addASonnet(col, sonnet, roman)
    return

def addSonnetDatasetToChromadb():
    log.ph1("Adding 20 sonnets to chromadb")

    # Get the sonnets
    log.dprint("Getting 20 sonnet dictionary")
    sonnetDict = datasetutils.get20SonnetsDictionary()
    log.summarizeDictionary(sonnetDict)

    # Get a chromadb collection
    log.dprint("Get chromadb collection")
    col = getOrCreateATestCollection()

    log.dprint("Add sonnets")
    addFromASonnetDictionary(col, sonnetDict)
    log.ph1("Done with creating all sonnets")
    return

def populateDatabase():
    addSonnetDatasetToChromadb()
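The datasetutils.roman_to_int helper used above is not shown in these notes. Below is a minimal sketch of what such a helper might look like, assuming well-formed Roman numerals. The name and behavior are guesses for illustration, not the actual implementation.

```python
# Hypothetical stand-in for datasetutils.roman_to_int (the real helper is not shown).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(roman: str) -> int:
    """Convert a well-formed Roman numeral such as 'XIV' to an integer."""
    total = 0
    previous = 0
    # Walk right to left: a smaller numeral to the left of a larger one is subtracted.
    for symbol in reversed(roman.strip().upper()):
        value = ROMAN_VALUES[symbol]  # raises KeyError on an unknown symbol
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total


# The string form of the number is what would be used as the record id.
sonnet_id = str(roman_to_int("XVIII"))
```

Using the number as the id means upserting the same sonnet twice overwrites the record rather than duplicating it.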
satya - 1/30/2024, 4:23:42 PM
Creating the database
def getChromaClientPath():
    path = fileutils.getTempDataRoot()
    filename = "chromadb1"
    new_db_path = fileutils.pathjoin(path, filename)
    return new_db_path

def getChromaDBClient():
    # PersistentClient stores the database on disk at the given path
    chromadb_path = getChromaClientPath()
    chromadbClient = chromadb.PersistentClient(chromadb_path)
    return chromadbClient

def getOrCreateCollection(client: chromadb.ClientAPI, name: str):
    return client.get_or_create_collection(name)

def getOrCreateATestCollection() -> chromadb.Collection:
    # Create chroma db
    log.ph1("Creating chromadb")
    chromaClient = getChromaDBClient()
    col = chromaClient.get_or_create_collection("Test_Collection")
    log.dprint("Get/Created test collection")
    return col
satya - 1/31/2024, 3:55:38 PM
Querying by general text
collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"}
)
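The where and where_document arguments are plain dictionaries, so they can be built and inspected before being passed to query(). A minimal sketch, assuming the sonnet metadata keys used earlier in these notes. The helper name build_sonnet_query is made up for illustration, and $ne is one of the comparison operators listed in the Chroma usage guide.

```python
def build_sonnet_query(query_text: str, must_contain: str, exclude_roman: str) -> dict:
    """Assemble keyword arguments for collection.query() (hypothetical helper)."""
    return {
        "query_texts": [query_text],
        "n_results": 3,
        # Metadata filter: skip one sonnet by its Roman numeral.
        "where": {"Sonnet Roman Numeral": {"$ne": exclude_roman}},
        # Document filter: keep only documents containing the substring.
        "where_document": {"$contains": must_contain},
    }


kwargs = build_sonnet_query("the passing of time", "summer", "XVIII")
# With a real collection this would be: results = collection.query(**kwargs)
```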
satya - 1/31/2024, 4:00:26 PM
Key elements
- query_texts is a list of inputs
- Each input is vectorized and compared against the stored embeddings
- $contains is an operator that qualifies how the search string is used
- Similarly, there are a number of these operators for the where clause and also the where_document clause
- See the Usage Guide document
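To make "vectorized and compared" concrete, here is a toy sketch in plain Python. It assumes a bag-of-words count as the "embedding" and cosine similarity as the comparison. Real embedding models produce dense vectors, and the distance metric Chroma actually uses depends on how the collection is configured, so treat this only as an illustration of the idea.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: list[str]) -> list[tuple[float, str]]:
    """Score every document against the query, most similar first."""
    q = embed(query)
    scored = [(cosine_similarity(q, embed(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ["shall i compare thee to a summer day", "thus spake zarathustra"]
ranked = rank("summer day", docs)
```

Each entry in query_texts would be ranked independently this way, which is why the results come back as one list per input.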
satya - 1/31/2024, 4:01:40 PM
You can also constrain what is returned, choosing among
- docs
- metadata
- embeddings etc.
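In Chroma this is done with the include argument of query() and get(). A small sketch that validates the list before use. The allowed values below are the ones I am sure of from the usage guide (ids are always returned), and build_include is a made-up helper name.

```python
# Assumed subset of the values accepted by Chroma's "include" argument.
ALLOWED_INCLUDE = {"documents", "metadatas", "embeddings", "distances"}


def build_include(*fields: str) -> list[str]:
    """Return the requested fields as a list, rejecting unknown names early."""
    unknown = set(fields) - ALLOWED_INCLUDE
    if unknown:
        raise ValueError(f"Unsupported include fields: {sorted(unknown)}")
    return list(fields)


include = build_include("documents", "distances")
# With a real collection:
# collection.query(query_texts=["summer"], n_results=5, include=include)
```

Leaving out embeddings keeps the result small, which matters when printing it.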
satya - 2/1/2024, 8:46:23 AM
Chunking plays a massive role in locating similar content!!
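A minimal sketch of one chunking approach: overlapping windows of sentences. The regex splitter is deliberately naive (it breaks after ., ! or ? followed by whitespace), and the size and overlap defaults are arbitrary assumptions to tune per dataset.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(text: str, size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into windows of `size`, sharing `overlap` sentences."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    sentences = split_sentences(text)
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


chunks = chunk_sentences("A one. B two? C three! D four. E five.", size=3, overlap=1)
```

Each chunk would then be stored as its own document, so a query matches a focused passage rather than a whole text. The overlap keeps a sentence that straddles a boundary findable from either side.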
satya - 2/1/2024, 9:01:38 AM
For spaCy you have to run this in the terminal
    python -m spacy download en_core_web_sm
satya - 2/1/2024, 10:43:55 AM
How come spaCy doesn't break English sentences at periods and question marks out of the box?
satya - 2/1/2024, 12:21:26 PM
How do I reset a chromadb collection?
satya - 2/1/2024, 12:26:51 PM
Deleting a collection
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    client.delete_collection(sonnet_collection_name)
satya - 2/1/2024, 12:42:04 PM
A better one
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    log.ph("Deleting Collection", sonnet_collection_name)
    # Check that the collection exists first; in the chromadb version used here,
    # get_collection raises ValueError when it does not
    try:
        client.get_collection(sonnet_collection_name)
    except ValueError:
        log.dprint("test collection doesn't exist")
        return
    client.delete_collection(sonnet_collection_name)
satya - 2/8/2024, 5:03:27 PM
Chromadb GitHub Jupyter notebooks on embeddings
satya - 2/8/2024, 5:05:20 PM
Chroma embedding functions are here on GitHub
satya - 2/8/2024, 5:06:19 PM
Useful packages
- chromadb.utils.embedding_functions.py
- chromadb.api.types.py