Vector databases
satya - 1/24/2024, 2:08:40 PM
It is called trychroma for some reason.
satya - 1/24/2024, 2:21:58 PM
Install error: chromadb 0.3.29 depends on onnxruntime>=1.14.1
satya - 1/24/2024, 3:35:39 PM
Installation of chromadb fails dramatically on Python 3.12
satya - 1/29/2024, 12:06:30 PM
I have installed Python 3.11.7 instead
satya - 1/29/2024, 3:59:58 PM
Chromadb docs at Hugging Face
satya - 1/30/2024, 1:42:16 PM
Sentence embedding models at Hugging Face
satya - 1/30/2024, 1:55:57 PM
Getting started with embeddings: HF
satya - 1/30/2024, 4:20:49 PM
Examine the database
"""
**********************************************
Examine the database
**********************************************
"""
import chromadb

# "log" is the author's own logging helper module (not shown in these notes)
def examineCollection(col: chromadb.Collection):
    # Print the collection name
    log.ph1(f"Examining collection: {col.name}")
    # Get a sample of the collection (peek() returns the first 10 items by default)
    result = col.peek()
    #
    # result:
    # Dictionary of ids, metadatas, documents, embeddings, uris
    #
    log.dprint(f"ids:{result['ids']}")
    log.dprint(f"\nMetadatas: {result['metadatas']}")
    log.dprint(f"\nuris: {result['uris']}")
    # Not printing documents because they could be too large to print
    # Same with embeddings
    return
satya - 1/30/2024, 4:21:33 PM
Populating a database
"""
**********************************************
Populate the database
**********************************************
"""
def addASonnet(col: chromadb.Collection, sonnet: str, roman: str):
    log.validate_not_null_or_empty(col, sonnet, roman)
    # Use the sonnet number (as a string) as the record id
    sonnet_id = datasetutils.roman_to_int(roman)
    id_str = f"{sonnet_id}"
    # upsert: insert the record, or overwrite it if the id already exists
    col.upsert(
        ids=[id_str],
        documents=[sonnet],
        metadatas=[{"Sonnet number": id_str, "Sonnet Roman Numeral": roman}])
    return

def addFromASonnetDictionary(col: chromadb.Collection, sonnetsDict: dict):
    # key: roman numeral
    # value: sonnet
    for roman, sonnet in sonnetsDict.items():
        log.dprint(f"Adding sonnet:{roman}")
        addASonnet(col, sonnet, roman)
    return

def addSonnetDatasetToChromadb():
    log.ph1("Adding 20 sonnets to chromadb")
    # Get sonnets
    log.dprint("Getting 20 sonnet dictionary")
    sonnetDict = datasetutils.get20SonnetsDictionary()
    log.summarizeDictionary(sonnetDict)
    # Get a chromadb collection
    log.dprint("Get chromadb collection")
    col = getOrCreateATestCollection()
    log.dprint("Add sonnets")
    addFromASonnetDictionary(col, sonnetDict)
    log.ph1("Done with adding all sonnets")
    return

def populateDatabase():
    addSonnetDatasetToChromadb()
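The populate code above relies on datasetutils.roman_to_int, which is not shown in these notes. A minimal sketch of what such a helper might look like follows; the body is an assumption, not the actual datasetutils code, and it does no validation beyond rejecting non-Roman characters.

```python
def roman_to_int(roman: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer (hypothetical helper)."""
    values = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
    total = 0
    prev = 0
    # Walk right to left: a symbol smaller than the largest one seen so far
    # is subtracted (the I in IV), otherwise it is added.
    for ch in reversed(roman.strip().upper()):
        value = values[ch]  # raises KeyError on a non-Roman character
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total
```

With this, roman_to_int("XVIII") gives the integer that addASonnet turns into the record id string.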
satya - 1/30/2024, 4:23:42 PM
Creating the database
import chromadb

# "fileutils" and "log" are the author's own helper modules (not shown in these notes)
def getChromaClientPath():
    path = fileutils.getTempDataRoot()
    filename = "chromadb1"
    new_db_path = fileutils.pathjoin(path, filename)
    return new_db_path

def getChromaDBClient():
    # PersistentClient stores the database on disk at the given path
    chromadb_path = getChromaClientPath()
    chromadbClient = chromadb.PersistentClient(path=chromadb_path)
    return chromadbClient

def getOrCreateCollection(client: chromadb.ClientAPI, name: str):
    return client.get_or_create_collection(name)

def getOrCreateATestCollection() -> chromadb.Collection:
    # Create chroma db
    log.ph1("Creating chromadb")
    chromaClient = getChromaDBClient()
    col = chromaClient.get_or_create_collection("Test_Collection")
    log.dprint("Get/Created test collection")
    return col
satya - 1/31/2024, 3:55:38 PM
Querying by general text
collection.query(
    query_texts=["doc10", "thus spake zarathustra", ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)
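To make the roles of n_results, where, and where_document concrete, here is a toy, dependency-free sketch of "filter, then rank by distance". It is not how Chroma is implemented; the record layout, the toy_query name, the two-dimensional vectors, and the use of Euclidean distance are all assumptions made for illustration.

```python
import math

def toy_query(records, query_vec, n_results=10, where=None, contains=None):
    """Filter records, then return the ids of the n_results closest to query_vec."""
    hits = []
    for rec in records:
        # Metadata filter: every key in `where` must match exactly.
        if where and any(rec["metadata"].get(k) != v for k, v in where.items()):
            continue
        # Document filter: plain substring test, in the spirit of {"$contains": ...}.
        if contains and contains not in rec["document"]:
            continue
        hits.append((math.dist(rec["embedding"], query_vec), rec["id"]))
    hits.sort()  # smallest distance first
    return [rec_id for _, rec_id in hits[:n_results]]

# Made-up records with hand-picked vectors (real embeddings have hundreds of dimensions)
records = [
    {"id": "1", "document": "Shall I compare thee to a summer's day?",
     "metadata": {"author": "Shakespeare"}, "embedding": [0.9, 0.1]},
    {"id": "2", "document": "When I do count the clock that tells the time",
     "metadata": {"author": "Shakespeare"}, "embedding": [0.2, 0.8]},
    {"id": "3", "document": "Thus spake Zarathustra",
     "metadata": {"author": "Nietzsche"}, "embedding": [0.85, 0.15]},
]

nearest_two = toy_query(records, [1.0, 0.0], n_results=2)
only_shakespeare = toy_query(records, [1.0, 0.0], where={"author": "Shakespeare"})
mentions_clock = toy_query(records, [1.0, 0.0], contains="clock")
```

The point of the sketch is that the filters narrow the candidate set, and n_results then caps how many of the nearest survivors come back.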
satya - 1/31/2024, 4:00:26 PM
Key elements
satya - 1/31/2024, 4:01:40 PM
You can also constrain what is returned (among documents, metadatas, distances, and embeddings) with the include parameter of query()
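A query result comes back as a dictionary of parallel lists, one inner list per query text, and fields that were not requested through include come back as None. A small sketch of turning that into one row per hit follows; flatten_query_result is a hypothetical helper, and the sample dictionary is hand-written to mimic the result shape, not real output.

```python
def flatten_query_result(result: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists for one query text into one dict per hit."""
    # Map the plural result keys to singular per-row keys.
    fields = {"documents": "document", "metadatas": "metadata", "distances": "distance"}
    rows = []
    for i, item_id in enumerate(result["ids"][query_index]):
        row = {"id": item_id}
        for key, row_key in fields.items():
            values = result.get(key)
            if values is not None:  # skip fields that were not included
                row[row_key] = values[query_index][i]
        rows.append(row)
    return rows

# Hand-written sample shaped like a result where distances were not requested
sample = {
    "ids": [["18", "12"]],
    "documents": [["Shall I compare thee to a summer's day?",
                   "When I do count the clock that tells the time"]],
    "metadatas": [[{"Sonnet number": "18"}, {"Sonnet number": "12"}]],
    "distances": None,
}
rows = flatten_query_result(sample)
```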
satya - 2/1/2024, 8:46:23 AM
Chunking plays a massive role in locating similar content!!
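One simple way to experiment with chunk size is a sliding window over words with some overlap, so that a phrase cut at one chunk boundary still appears whole in the next chunk. This is only a sketch of one possible strategy; chunk_words and its default sizes are assumptions, not something these notes or Chroma prescribe.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of chunk_size words, sharing `overlap` words between neighbours."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk would then be added to the collection as its own document, so the chunk size decides how much text one embedding has to summarize.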
satya - 2/1/2024, 9:01:38 AM
For spaCy, you have to download the English model from the terminal first
python -m spacy download en_core_web_sm
satya - 2/1/2024, 10:43:55 AM
How come spacy doesn't break English sentences at periods and question marks out of the box?
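A likely reason, as far as I can tell: the trained pipelines such as en_core_web_sm take sentence boundaries from the dependency parser, not from punctuation rules, and spaCy's rule-based sentencizer component is the usual route when punctuation-only splitting is wanted. The sketch below is not spaCy at all; it is a dependency-free regex stand-in (split_sentences is a hypothetical name) that shows the punctuation-only behavior, and it will wrongly split on abbreviations such as "Mr.".

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '?' or '!' when whitespace follows."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]

sentences = split_sentences(
    "Shall I compare thee to a summer's day? Thou art more lovely. Rough winds do shake!"
)
```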
satya - 2/1/2024, 12:21:26 PM
How do I reset a chromadb collection?
satya - 2/1/2024, 12:26:51 PM
Deleting a collection
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    client.delete_collection(sonnet_collection_name)
satya - 2/1/2024, 12:42:04 PM
A better one
def deleteTestCollection():
    client = getChromaDBClient()
    global sonnet_collection_name
    log.ph("Deleting Collection", sonnet_collection_name)
    try:
        # Existence check: get_collection raises if the collection is missing
        # (a ValueError in the chromadb version used here)
        client.get_collection(sonnet_collection_name)
    except ValueError:
        log.dprint("test collection doesn't exist")
        return
    client.delete_collection(sonnet_collection_name)
satya - 2/8/2024, 5:03:27 PM
Chromadb github jupy notebooks on embeddings
satya - 2/8/2024, 5:05:20 PM
Chroma embedding functions are here in github
satya - 2/8/2024, 5:06:19 PM
Useful packages