Using Docling synthetic data generation capabilities.
Introduction
In the rapidly evolving landscape of generative AI, the demand for high-quality, diverse training data keeps growing. However, acquiring and annotating large amounts of real-world data can be time-consuming, expensive, and often privacy-sensitive. This is where synthetic data generation (SDG) comes in. Docling SDG (the docling-sdg package) provides a set of tools designed to create artificial data directly from existing documents, combining generative AI models with Docling’s parsing capabilities.
By generating synthetic datasets, we can accelerate the development and evaluation of AI applications, overcome data scarcity, make models more robust by exposing them to varied scenarios, and safeguard sensitive information. This approach streamlines workflows, allowing developers and researchers to iterate faster.
Code and Implementation
To implement and test the full workflow, begin by setting up your environment: install docling-sdg using pip install docling-sdg, then download and install Ollama, and pull the granite3.3 model using ollama pull granite3.3. Once these prerequisites are met, run the data generation script (generate-data.py), which uses docling-sdg to sample passages from a specified URL and writes them to a .jsonl file. Finally, run the ollama-rag-app Python script, ensuring the generated .jsonl file is in the same directory, to start a chat interface where you can query the granite3.3 model, with its responses informed by the generated data.
The steps are provided below.
- Pull Granite using Ollama.
# assuming ollama is installed! you can test the LLM locally
ollama run granite3.3
###
ollama run granite3.3
pulling manifest
pulling 77bcee066a76: 100% ▕████████████████▏ 4.9 GB
pulling 3da071a01bbe: 100% ▕████████████████▏ 6.6 KB
pulling 4a99a6dd617d: 100% ▕████████████████▏ 11 KB
pulling 122661774644: 100% ▕████████████████▏ 417 B
verifying sha256 digest
writing manifest
success
>>> /exit
- Build a virtual environment with all the requirements.
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling # not mandatory for this test
pip install docling-sdg
pip install ollama
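Before moving on, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of docling-sdg or the Ollama client: it assumes Ollama is serving on its default local address (http://localhost:11434) and uses its /api/tags endpoint to list pulled models. The helper names (installed_versions, has_model, fetch_tags) are hypothetical and chosen only for this illustration.

```python
import json
import urllib.request
from importlib import metadata
from typing import Dict, List, Optional

# Assumption: Ollama is running locally on its default port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def has_model(tags_payload: dict, model: str) -> bool:
    """True if `model` matches a listed model name, with or without a ':tag' suffix."""
    names = [entry.get("name", "") for entry in tags_payload.get("models", [])]
    return any(name == model or name.split(":")[0] == model for name in names)


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server which models are available."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    for name, version in installed_versions(["docling-sdg", "ollama"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
    try:
        print("granite3.3 pulled:", has_model(fetch_tags(), "granite3.3"))
    except OSError as exc:  # covers connection errors and timeouts
        print(f"Could not reach Ollama at {OLLAMA_TAGS_URL}: {exc}")
```

Run it inside the activated virtual environment. A missing package shows up as NOT INSTALLED, and an unreachable server is reported instead of raising.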
- The data generation code (the code provided is based entirely on the sample provided by Docling-SDG).
# generate-data.py
import json
import os

from docling_sdg.qa.sample import PassageSampler


def generate_synthetic_data(source_url: str, output_filename: str = "docling_sdg_sample.jsonl"):
    """
    Samples passages from a given source URL using docling-sdg
    and saves them to a JSONL file.

    Args:
        source_url (str): The URL of the document to sample passages from.
        output_filename (str): The name of the file to save the generated data.
            Defaults to "docling_sdg_sample.jsonl".
    """
    print("Initializing PassageSampler...")
    passage_sampler = PassageSampler()
    print(f"Attempting to sample passages from: {source_url}")
    try:
        # The 'sample' method by default exports results to 'docling_sdg_sample.jsonl'.
        # We let it write to the default and then rename if a custom output_filename is specified.
        default_output_file = "docling_sdg_sample.jsonl"

        # Ensure the output directory exists if the filename includes a path
        output_dir = os.path.dirname(output_filename)
        if output_dir:
            os.makedirs(output_dir, exist_ok=True)

        # The 'source' argument expects a list of URLs.
        passage_sampler.sample(source=[source_url])

        # The default file is assumed to be written to the current working directory.
        # Resolve both paths once so the output directory is never joined twice.
        generated_default_path = os.path.abspath(default_output_file)
        desired_output_path = os.path.abspath(output_filename)

        if not os.path.exists(generated_default_path):
            print(f"Error: Default output file ({default_output_file}) not found after sampling.")
            return

        # Rename the default file only if a different output_filename is desired
        if generated_default_path != desired_output_path:
            os.replace(generated_default_path, desired_output_path)
        print(f"Successfully generated data and saved to: {desired_output_path}")

        # Optional: read and print the first few lines to confirm content
        with open(desired_output_path, "r", encoding="utf-8") as f:
            print("\n--- First 3 passages from the generated file: ---")
            for i, line in enumerate(f):
                if i >= 3:
                    break
                try:
                    print(json.dumps(json.loads(line.strip()), indent=2))
                except json.JSONDecodeError:
                    print(f"Invalid JSON line: {line.strip()}")
            print("--------------------------------------------------")
    except Exception as e:
        print(f"An error occurred during data generation: {e}")
        print("Please ensure the source URL is accessible and valid and docling-sdg is correctly installed.")


if __name__ == "__main__":
    # Example usage:
    # You can change the source_url and output_filename as needed.
    # Note: For this to run successfully, you need internet access to fetch the URL.
    sample_source_url = "https://en.wikipedia.org/wiki/Duck"
    custom_output_file = "my_duck_data.jsonl"
    print(f"Starting data generation for source: {sample_source_url}")
    generate_synthetic_data(sample_source_url, custom_output_file)
    # You can add more examples here:
    # generate_synthetic_data("https://en.wikipedia.org/wiki/Artificial_intelligence", "ai_data.jsonl")
- Run the code above to generate the synthetic data.
python generate-data.py
#####
> python generate-data.py
Starting data generation for source: https://en.wikipedia.org/wiki/Duck
Initializing PassageSampler...
Attempting to sample passages from: https://en.wikipedia.org/wiki/Duck
Successfully generated data and saved to: /Users/xxxxx/Devs/doclig-sdg/my_duck_data.jsonl
{"text": "Tagalog\nதமிழ்\nTaqbaylit\nТатарча / tatarça\nతెలుగు\nไทย\nTürkçe\nУкраїнська\nئۇيغۇرچە / Uyghurche\nVahcuengh\nTiếng Việt\nWalon\n文言\nWinaray\n吴语\n粵語\nŽemaitėška\n中文\nJaku Iban\nArticle\nTalk\nRead\nView source\nView history\nTools\nActions\nRead\nView source\nView history\nGeneral\nWhat links here\nRelated changes\nUpload file\nPermanent link\nPage information\nCite this page\nGet shortened URL\nDownload QR code\nPrint/export\nDownload as PDF\nPrintable version\nIn other projects\nWikimedia Commons\nWikiquote\nWikidata item\nAppearance\nFrom Wikipedia, the free encyclopedia\nCommon name for many species of bird\nThis article is about the bird. For duck as a food, see . For other uses, see .\n\"Duckling\" redirects here. For other uses, see .", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/166", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/167", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/168", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/169", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/170", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/171", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/172", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/173", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": 
"#/texts/174", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/175", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/176", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/177", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/178", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/179", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/180", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/181", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/182", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/183", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/184", "parent": {"cref": "#/groups/29"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/185", "parent": {"cref": "#/groups/30"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/186", "parent": {"cref": "#/groups/30"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/187", "parent": {"cref": "#/groups/32"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/188", "parent": {"cref": "#/groups/32"}, 
"children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/189", "parent": {"cref": "#/groups/32"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/190", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/191", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/192", "parent": {"cref": "#/groups/33"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/193", "parent": {"cref": "#/groups/33"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/194", "parent": {"cref": "#/groups/33"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/195", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/196", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/197", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/198", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/199", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/200", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/201", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/202", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": 
[]}, {"self_ref": "#/texts/203", "parent": {"cref": "#/groups/34"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/204", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/205", "parent": {"cref": "#/groups/35"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/206", "parent": {"cref": "#/groups/35"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/207", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/208", "parent": {"cref": "#/groups/36"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/209", "parent": {"cref": "#/groups/36"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/210", "parent": {"cref": "#/groups/36"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/211", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/212", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/213", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/214", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/215", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "03dc6562e00f426b3575e4e7a16ba45c9dca3128567d1a1008115828a9c7c2c8", 
"doc_id": "4194132580746041524"}}
{"text": ", Duck = . Bufflehead\n(Bucephala albeola), Duck = Bufflehead\n(Bucephala albeola). Scientific classification, Duck = Scientific classification. Domain:, Duck = Eukaryota. Kingdom:, Duck = Animalia. Phylum:, Duck = Chordata. Class:, Duck = Aves. Order:, Duck = Anseriformes. Superfamily:, Duck = Anatoidea. Family:, Duck = Anatidae. Subfamilies, Duck = Subfamilies. See text, Duck = See text\nDuck is the common name for numerous species of waterfowl in the family Anatidae. Ducks are generally smaller and shorter-necked than swans and geese, which are members of the same family. Divided among several subfamilies, they are a form taxon; they do not represent a monophyletic group (the group of all descendants of a single common ancestral species), since swans and geese are not considered ducks. Ducks are mostly aquatic birds, and may be found in both fresh water and sea water.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/tables/0", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "table", "prov": []}, {"self_ref": "#/texts/216", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "1ec21b78f7ed71f3fc12c1c078aafb1177ec8491ddfc58f90da2679f3ef8d769", "doc_id": "4194132580746041524"}}
{"text": "Ducks are sometimes confused with several types of unrelated water birds with similar forms, such as loons or divers, grebes, gallinules and coots.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/217", "parent": {"cref": "#/texts/45"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "197f256e2d72092b7f5d5bc419b0a03840da26799f8f90d9c04e2f0cbad52f4c", "doc_id": "4194132580746041524"}}
{"text": "The word duck comes from Old English dūce 'diver', a derivative of the verb *dūcan 'to duck, bend down low as if to get under something, or dive', because of the way many species in the dabbling duck group feed by upending; compare with Dutch duiken and German tauchen 'to dive'.\nPacific black duck displaying the characteristic upending \"duck\"\nThis word replaced Old English ened /ænid 'duck', possibly to avoid confusion with other words, such as ende 'end' with similar forms. Other Germanic languages still have similar words for duck, for example, Dutch eend, German Ente and Norwegian and. The word ened /ænid was inherited from Proto-Indo-European; cf. Latin anas \"duck\", Lithuanian ántis 'duck', Ancient Greek νῆσσα /νῆττα (nēssa /nētta) 'duck', and Sanskrit ātí 'water bird', among others.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/219", "parent": {"cref": "#/texts/218"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/220", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/221", "parent": {"cref": "#/texts/218"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Etymology"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "7d6807e5234c545b42278b594e62e8bd8e139a2e3d88c787079de20a7d5e3ffb", "doc_id": "4194132580746041524"}}
{"text": "A duckling is a young duck in downy plumage[1] or baby duck,[2] but in the food trade a young domestic duck which has just reached adult size and bulk and its meat is still fully tender, is sometimes labelled as a duckling.\nA male is called a drake and the female is called a duck, or in ornithology a hen.[3][4]\nMale mallard.\nWood ducks.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/222", "parent": {"cref": "#/texts/218"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/223", "parent": {"cref": "#/texts/218"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/224", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/225", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}], "headings": ["Duck", "Etymology"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "225d764dda3bd786f8351b4ff093ea7f7257fe39717af6ba99bd7d8c18bc2cfc", "doc_id": "4194132580746041524"}}
{"text": "All ducks belong to the biological order Anseriformes, a group that contains the ducks, geese and swans, as well as the screamers, and the magpie goose.[5] All except the screamers belong to the biological family Anatidae.[5] Within the family, ducks are split into a variety of subfamilies and 'tribes'. The number and composition of these subfamilies and tribes is the cause of considerable disagreement among taxonomists.[5] Some base their decisions on morphological characteristics, others on shared behaviours or genetic studies.[6][7] The number of suggested subfamilies containing ducks ranges from two to five.[8][9] The significant level of hybridisation that occurs among wild ducks complicates efforts to tease apart the relationships between various species.[9]\nMallard landing in approach", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/227", "parent": {"cref": "#/texts/226"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/228", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}], "headings": ["Duck", "Taxonomy"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "de5c10dd68fdbdbe49d25b7aa1dbd34f85cd92010c01f9daf88c2c54f5793c4a", "doc_id": "4194132580746041524"}}
{"text": "In most modern classifications, the so-called 'true ducks' belong to the subfamily Anatinae, which is further split into a varying number of tribes.[10] The largest of these, the Anatini, contains the 'dabbling' or 'river' ducks – named for their method of feeding primarily at the surface of fresh water.[11] The 'diving ducks', also named for their primary feeding method, make up the tribe Aythyini.[12] The 'sea ducks' of the tribe Mergini are diving ducks which specialise on fish and shellfish and spend a majority of their lives in saltwater.[13] The tribe Oxyurini contains the 'stifftails', diving ducks notable for their small size and stiff, upright tails.[14]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/229", "parent": {"cref": "#/texts/226"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Taxonomy"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "510d74e8cae678db635f9320a11f5694d359b4584c2ae877e40c3503a5641750", "doc_id": "4194132580746041524"}}
{"text": "A number of other species called ducks are not considered to be 'true ducks', and are typically placed in other subfamilies or tribes. The whistling ducks are assigned either to a tribe (Dendrocygnini) in the subfamily Anatinae or the subfamily Anserinae,[15] or to their own subfamily (Dendrocygninae) or family (Dendrocyganidae).[9][16] The freckled duck of Australia is either the sole member of the tribe Stictonettini in the subfamily Anserinae,[15] or in its own family, the Stictonettinae.[9] The shelducks make up the tribe Tadornini in the family Anserinae in some classifications,[15] and their own subfamily, Tadorninae, in others,[17] while the steamer ducks are either placed in the family Anserinae in the tribe Tachyerini[15] or lumped with the shelducks in the tribe Tadorini.[9] The perching ducks make up in the tribe Cairinini in the subfamily Anserinae in some classifications, while that tribe is eliminated in other classifications and its members assigned to the", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/230", "parent": {"cref": "#/texts/226"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Taxonomy"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "30861bdc9de3a9ca0de0a50adf7cbc661b9b20346375c88a80d6db1cdfd86b92", "doc_id": "4194132580746041524"}}
{"text": "tribe Anatini.[9] The torrent duck is generally included in the subfamily Anserinae in the monotypic tribe Merganettini,[15] but is sometimes included in the tribe Tadornini.[18] The pink-eared duck is sometimes included as a true duck either in the tribe Anatini[15] or the tribe Malacorhynchini,[19] and other times is included with the shelducks in the tribe Tadornini.[15]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/230", "parent": {"cref": "#/texts/226"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Taxonomy"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "d5ffa4b275ac2458e9d38ea3186634f68d199254b2128d51fdf69297b0293bb6", "doc_id": "4194132580746041524"}}
{"text": "Male Mandarin duck\n, 1 = This section does not cite any sources. Please help improve this section by adding citations to reliable sources. Unsourced material may be challenged and removed. (October 2024) (Learn how and when to remove this message)\nThe overall body plan of ducks is elongated and broad, and they are also relatively long-necked, albeit not as long-necked as the geese and swans. The body shape of diving ducks varies somewhat from this in being more rounded. The bill is usually broad and contains serrated pectens, which are particularly well defined in the filter-feeding species. In the case of some fishing species the bill is long and strongly serrated. The scaled legs are strong and well developed, and generally set far back on the body, more so in the highly aquatic species, which typically feature webbed feet. The wings are very strong and are generally short and pointed, and the flight of ducks requires fast continuous strokes, requiring in turn strong wing muscles. Three species of steamer duck are almost flightless, however. Many species of duck are temporarily flightless while moulting; they seek out protected habitat with good food supplies during this period. 
This moult typically precedes migration.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/232", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/tables/1", "parent": {"cref": "#/texts/231"}, "children": [], "content_layer": "body", "label": "table", "prov": []}, {"self_ref": "#/texts/233", "parent": {"cref": "#/texts/231"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Morphology"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "8926752b1f396c7a76b47a5decabb8787b8ecbff6f19c151846674f5a8a7ce5f", "doc_id": "4194132580746041524"}}
{"text": "The drakes of northern species often have extravagant plumage, but that is moulted in summer to give a more female-like appearance, the \"eclipse\" plumage. Southern resident species typically show less sexual dimorphism, although there are exceptions such as the paradise shelduck of New Zealand, which is both strikingly sexually dimorphic and in which the female's plumage is brighter than that of the male. The plumage of juvenile birds generally resembles that of the female. Female ducks have evolved to have a corkscrew shaped vagina to prevent forced copulations.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/234", "parent": {"cref": "#/texts/231"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Morphology"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "e1cdf53309eb03de1621b68de9d0c5dcd101e791a1f8f8293df6633f63dd33ed", "doc_id": "4194132580746041524"}}
{"text": "Flying steamer ducks in Ushuaia, Argentina\nDucks have a cosmopolitan distribution, and are found on every continent except Antarctica.[5] Several species manage to live on subantarctic islands, including South Georgia and the Auckland Islands.[20] Ducks have reached a number of isolated oceanic islands, including the Hawaiian Islands, Micronesia and the Galápagos Islands, where they are often vagrants and less often residents.[21][22] A handful are endemic to such far-flung islands.[21]\nFemale mallard in Cornwall, England\nSome duck species, mainly those breeding in the temperate and Arctic Northern Hemisphere, are migratory; those in the tropics are generally not. Some ducks, particularly in Australia where rainfall is erratic, are nomadic, seeking out the temporary lakes and pools that form after localised heavy rain.[23]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/236", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/237", "parent": {"cref": "#/texts/235"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/238", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/239", "parent": {"cref": "#/texts/235"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Distribution and habitat"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "9ddb7b1e022499575a0c47b4de5f2033cb5097c3ab9bba449008181b17e47127", "doc_id": "4194132580746041524"}}
{"text": "Pecten along the bill\nMallard duckling preening\nDucks eat food sources such as grasses, aquatic plants, fish, insects, small amphibians, worms, and small molluscs.\nDabbling ducks feed on the surface of water or on land, or as deep as they can reach by up-ending without completely submerging.[24] Along the edge of the bill, there is a comb-like structure called a pecten. This strains the water squirting from the side of the bill and traps any food. The pecten is also used to preen feathers and to hold slippery food items.\nDiving ducks and sea ducks forage deep underwater. To be able to submerge more easily, the diving ducks are heavier than dabbling ducks, and therefore have more difficulty taking off to fly.\nA few specialized species such as the mergansers are adapted to catch and swallow large fish.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/242", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/243", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/244", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/245", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/246", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/247", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Behaviour", "Feeding"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "3bff4564bd278fd8d1b20bf0029e75a61c39d75a94067dd2b3948f7b10908546", "doc_id": 
"4194132580746041524"}}
{"text": "The others have the characteristic wide flat bill adapted to dredging-type jobs such as pulling up waterweed, pulling worms and small molluscs out of mud, searching for insect larvae, and bulk jobs such as dredging out, holding, turning head first, and swallowing a squirming frog. To avoid injury when digging into sediment it has no cere, but the nostrils come out through hard horn.\nThe Guardian published an article advising that ducks should not be fed with bread because it damages the health of the ducks and pollutes waterways.[25]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/248", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/249", "parent": {"cref": "#/texts/241"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Behaviour", "Feeding"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "600ce4f9ee8d5f8245a847fe5987ae86dadbbecb2a2a9121f4c14dfb3ccf052a", "doc_id": "4194132580746041524"}}
{"text": "A Muscovy duckling\nDucks generally only have one partner at a time, although the partnership usually only lasts one year.[26] Larger species and the more sedentary species (like fast-river specialists) tend to have pair-bonds that last numerous years.[27] Most duck species breed once a year, choosing to do so in favourable conditions (spring/summer or wet seasons). Ducks also tend to make a nest before breeding, and, after hatching, lead their ducklings to water. Mother ducks are very caring and protective of their young, but may abandon some of their ducklings if they are physically stuck in an area they cannot get out of (such as nesting in an enclosed courtyard) or are not prospering due to genetic defects or sickness brought about by hypothermia, starvation, or disease. Ducklings can also be orphaned by inconsistent late hatching where a few eggs hatch after the mother has abandoned the nest and led her ducklings to water.[28]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/251", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/252", "parent": {"cref": "#/texts/250"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Behaviour", "Breeding"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "654bc02b8f0971aeb3e92a6eb73331373c29ed95c8936cc75ff17331a7c50715", "doc_id": "4194132580746041524"}}
{"text": "Female mallard ducks (as well as several other species in the genus Anas, such as the American and Pacific black ducks, spot-billed duck, northern pintail and common teal) make the classic \"quack\" sound while males make a similar but raspier sound that is sometimes written as \"breeeeze\",[29][self-published source?] but, despite widespread misconceptions, most species of duck do not \"quack\".[30] In general, ducks make a range of calls, including whistles, cooing, yodels and grunts. For example, the scaup – which are diving ducks – make a noise like \"scaup\" (hence their name). Calls may be loud displaying calls or quieter contact calls.\nA common urban legend claims that duck quacks do not echo; however, this has been proven to be false. This myth was first debunked by the Acoustics Research Centre at the University of Salford in 2003 as part of the British Association's Festival of Science.[31] It was also debunked in one of the earlier episodes of the popular Discovery Channel television show MythBusters.[32]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/254", "parent": {"cref": "#/texts/253"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/255", "parent": {"cref": "#/texts/253"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Behaviour", "Communication"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "9ff2a732d1289756f89c4c5aba94e4af174f82e25d5e666d1ab4f7152072f76e", "doc_id": "4194132580746041524"}}
{"text": "Ringed teal\nDucks have many predators. Ducklings are particularly vulnerable, since their inability to fly makes them easy prey not only for predatory birds but also for large fish like pike, crocodilians, predatory testudines such as the alligator snapping turtle, and other aquatic hunters, including fish-eating birds such as herons. Ducks' nests are raided by land-based predators, and brooding females may be caught unaware on the nest by mammals, such as foxes, or large birds, such as hawks or owls.\nAdult ducks are fast fliers, but may be caught on the water by large aquatic predators including big fish such as the North American muskie and the European pike. In flight, ducks are safe from all but a few predators such as humans and the peregrine falcon, which uses its speed and strength to catch ducks.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/257", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/258", "parent": {"cref": "#/texts/256"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/259", "parent": {"cref": "#/texts/256"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Behaviour", "Predators"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "80c1edd2941bcbde6f31a3151ec789c742c2a1383051f689120ea15f9518a45b", "doc_id": "4194132580746041524"}}
{"text": "Humans have hunted ducks since prehistoric times. Excavations of middens in California dating to 7800 – 6400 BP have turned up bones of ducks, including at least one now-extinct flightless species.[33] Ducks were captured in \"significant numbers\" by Holocene inhabitants of the lower Ohio River valley, suggesting they took advantage of the seasonal bounty provided by migrating waterfowl.[34] Neolithic hunters in locations as far apart as the Caribbean,[35] Scandinavia,[36] Egypt,[37] Switzerland,[38] and China relied on ducks as a source of protein for some or all of the year.[39] Archeological evidence shows that Māori people in New Zealand hunted the flightless Finsch's duck, possibly to extinction, though rat predation may also have contributed to its fate.[40] A similar end awaited the Chatham duck, a species with reduced flying capabilities which went extinct shortly after its island was colonised by Polynesian settlers.[41] It is probable that duck eggs were gathered by Neolithic hunter-gathers as well, though hard evidence of this is uncommon.[35][42]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/262", "parent": {"cref": "#/texts/261"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Hunting"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "c6a69a758cd7d9591b5e0e1c660ca65f055891d92f78c09da84f2d387a4a7f48", "doc_id": "4194132580746041524"}}
{"text": "In many areas, wild ducks (including ducks farmed and released into the wild) are hunted for food or sport,[43] by shooting, or by being trapped using duck decoys. Because an idle floating duck or a duck squatting on land cannot react to fly or move quickly, \"a sitting duck\" has come to mean \"an easy target\". These ducks may be contaminated by pollutants such as PCBs.[44]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/263", "parent": {"cref": "#/texts/261"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Hunting"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "3cb9131c2e05687dbb604b1c4cda5a04a028da576274e0a6e009c88cdddf3115", "doc_id": "4194132580746041524"}}
{"text": "Indian Runner ducks, a common breed of domestic ducks\nDucks have many economic uses, being farmed for their meat, eggs, and feathers (particularly their down). Approximately 3 billion ducks are slaughtered each year for meat worldwide.[45] They are also kept and bred by aviculturists and often displayed in zoos. Almost all the varieties of domestic ducks are descended from the mallard (Anas platyrhynchos), apart from the Muscovy duck (Cairina moschata).[46][47] The Call duck is another example of a domestic duck breed. Its name comes from its original use established by hunters, as a decoy to attract wild mallards from the sky, into traps set for them on the ground. The call duck is the world's smallest domestic duck breed, as it weighs less than 1 kg (2.2 lb).[48]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/265", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/266", "parent": {"cref": "#/texts/264"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Domestication"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "47d610c659e7b079505b380e8d577379ac7d120d3ebc5add41fd8bd5d847757f", "doc_id": "4194132580746041524"}}
{"text": "Three black-colored ducks in the coat of arms of Maaninka[49]\nDucks appear on several coats of arms, including the coat of arms of Lubāna (Latvia)[50] and the coat of arms of Föglö (Åland).[51]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/268", "parent": {"cref": "#/body"}, "children": [], "content_layer": "body", "label": "caption", "prov": []}, {"self_ref": "#/texts/269", "parent": {"cref": "#/texts/267"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Heraldry"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "bc074b3c913b6b9d3e51deccf470fb2e33a825e269001c8d73a4eb75fe9a403f", "doc_id": "4194132580746041524"}}
{"text": "In 2002, psychologist Richard Wiseman and colleagues at the University of Hertfordshire, UK, finished a year-long LaughLab experiment, concluding that of all animals, ducks attract the most humor and silliness; he said, \"If you're going to tell a joke involving an animal, make it a duck.\"[52] The word \"duck\" may have become an inherently funny word in many languages, possibly because ducks are seen as silly in their looks or behavior. Of the many ducks in fiction, many are cartoon characters, such as Walt Disney's Donald Duck, and Warner Bros.' Daffy Duck. Howard the Duck started as a comic book character in 1973[53][54] and was made into a movie in 1986.", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/271", "parent": {"cref": "#/texts/270"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Cultural references"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "420bd76c535db5373569db93474f043bcd44f9588b60c99e2263d71fb0333466", "doc_id": "4194132580746041524"}}
{"text": "The 1992 Disney film The Mighty Ducks, starring Emilio Estevez, chose the duck as the mascot for the fictional youth hockey team who are protagonists of the movie, based on the duck being described as a fierce fighter. This led to the duck becoming the nickname and mascot for the eventual National Hockey League professional team of the Anaheim Ducks, who were founded with the name the Mighty Ducks of Anaheim.[citation needed] The duck is also the nickname of the University of Oregon sports teams as well as the Long Island Ducks minor league baseball team.[55]", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/272", "parent": {"cref": "#/texts/270"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "Relationship with humans", "Cultural references"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "b5aa95b218e16612330ade20bc13e710d047432d71a8aa8c4774dabe938a6e69", "doc_id": "4194132580746041524"}}
{"text": "Definitions from Wiktionary\nMedia from Commons\nQuotations from Wikiquote\nRecipes from Wikibooks\nTaxa from Wikispecies\nData from Wikidata\nlist of books (useful looking abstracts)\nDucks on postage stamps Archived 2013-05-13 at the Wayback Machine\nDucks at a Distance, by Rob Hines at Project Gutenberg - A modern illustrated guide to identification of US waterfowl\nNational, Authority control databases = United StatesFranceBnF dataJapanLatviaIsrael. Other, Authority control databases = IdRef\nRetrieved from \"\"\n:\nDucks\nGame birds\nBird common names\nHidden categories:", "meta": {"schema_name": "docling_core.transforms.chunker.DocMeta", "version": "1.0.0", "doc_items": [{"self_ref": "#/texts/360", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/361", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/362", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/363", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/364", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/365", "parent": {"cref": "#/groups/41"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/366", "parent": {"cref": "#/groups/42"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/367", "parent": {"cref": "#/groups/42"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/368", "parent": {"cref": "#/groups/42"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/tables/2", "parent": {"cref": "#/texts/359"}, "children": [], "content_layer": "body", "label": "table", "prov": []}, {"self_ref": "#/texts/369", "parent": {"cref": "#/texts/359"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/370", "parent": {"cref": "#/texts/359"}, "children": [], "content_layer": "body", "label": "text", "prov": []}, {"self_ref": "#/texts/371", "parent": {"cref": "#/groups/43"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/372", "parent": {"cref": "#/groups/43"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/373", "parent": {"cref": "#/groups/43"}, "children": [], "content_layer": "body", "label": "list_item", "prov": []}, {"self_ref": "#/texts/374", "parent": {"cref": "#/texts/359"}, "children": [], "content_layer": "body", "label": "text", "prov": []}], "headings": ["Duck", "External links"], "captions": null, "origin": {"mimetype": "text/html", "binary_hash": 4194132580746041524, "filename": "Duck", "uri": null}, "chunk_id": "e0cceaba67fa9dde1d8171886ab42f957e54f2f936f5d3c6d1e6c524b67ecbb1", "doc_id": "4194132580746041524"}}
- Next, build an application that uses Granite 3.3 to chat with the synthetically generated data.
# ollama_chat_app.py
import json
import os
import requests # Used for making requests to the Ollama API
def load_passages_from_jsonl(filepath: str) -> list[str]:
    """
    Loads passages from a .jsonl file, extracting the 'text' content.
    Args:
        filepath (str): The path to the .jsonl file.
    Returns:
        list[str]: A list of text passages.
    """
    passages = []
    if not os.path.exists(filepath):
        print(f"Error: File not found at {filepath}")
        return passages
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    data = json.loads(line.strip())
                    if 'text' in data:
                        passages.append(data['text'])
                    else:
                        print(f"Warning: 'text' key not found in line: {line.strip()}")
                except json.JSONDecodeError:
                    print(f"Warning: Skipping invalid JSON line: {line.strip()}")
        print(f"Loaded {len(passages)} passages from {filepath}")
    except Exception as e:
        print(f"An error occurred while loading passages: {e}")
    return passages
def ask_ollama(prompt: str, model: str = "granite3.3:latest", context: str = "") -> str:
    """
    Sends a prompt to the local Ollama server and returns the response.
    Args:
        prompt (str): The user's query.
        model (str): The Ollama model to use (e.g., "granite3.3:latest").
        context (str): Additional context to provide to the model for RAG.
    Returns:
        str: The generated response from the LLM.
    """
    ollama_url = "http://localhost:11434/api/generate"  # Default Ollama API endpoint
    full_prompt = f"Using the following information, answer the question. If the answer is not in the provided information, state that you don't know.\n\nInformation:\n{context}\n\nQuestion: {prompt}\nAnswer:"
    payload = {
        "model": model,
        "prompt": full_prompt,
        "stream": False  # We want the full response at once
    }
    try:
        response = requests.post(ollama_url, json=payload)
        response.raise_for_status()  # Raise an HTTPError for bad responses (4xx or 5xx)
        result = response.json()
        return result.get("response", "No response generated.")
    except requests.exceptions.ConnectionError:
        return "Error: Could not connect to Ollama server. Please ensure Ollama is running and the model is pulled."
    except requests.exceptions.HTTPError as err:
        return f"Error: HTTP error occurred: {err} - {response.text}"
    except Exception as e:
        return f"An unexpected error occurred while communicating with Ollama: {e}"
def run_chat_application(data_filepath: str, ollama_model: str = "granite3.3:latest"):
    """
    Runs a simple chat application that uses Ollama to answer questions
    based on the content loaded from the specified .jsonl file.
    Args:
        data_filepath (str): The path to the .jsonl file containing passages.
        ollama_model (str): The Ollama model to use (e.g., "granite3.3:latest").
    """
    print(f"Loading data from {data_filepath}...")
    passages = load_passages_from_jsonl(data_filepath)
    if not passages:
        print("No passages loaded. Exiting chat application.")
        return
    # Combine all passages into a single context string.
    # For very large files, consider a more sophisticated retrieval mechanism
    # (e.g., semantic search on passages) before passing to the LLM.
    # For now, we'll concatenate for simplicity.
    full_context = "\n\n".join(passages)
    print("\nChat application started. Type 'exit' to quit.")
    # Report the model actually in use rather than a hard-coded name
    print(f"Please ensure your Ollama server is running and the '{ollama_model}' model is downloaded.")
    while True:
        user_query = input("\nYour question: ").strip()
        if user_query.lower() == 'exit':
            print("Exiting chat application.")
            break
        print("Generating response (this may take a moment)...")
        response = ask_ollama(prompt=user_query, model=ollama_model, context=full_context)
        print(f"\nOllama: {response}")
if __name__ == "__main__":
    # Specify the path to your generated .jsonl file
    # Make sure this matches the output_filename from your docling-sdg application
    jsonl_data_file = "my_duck_data.jsonl"  # Or "docling_sdg_sample.jsonl" if you used the default
    # Ensure Ollama is running and 'granite3.3:latest' model is pulled
    # You can check by running 'ollama list' in your terminal
    ollama_model_name = "granite3.3:latest"
    run_chat_application(jsonl_data_file, ollama_model_name)
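The comments in run_chat_application point out that concatenating every passage may not scale to larger files. Below is a minimal sketch of one simple alternative, assuming plain word overlap is good enough for a first pass. It is lexical matching, not the semantic search the comment mentions, and tokenize and select_passages are hypothetical helper names that are not part of the app above.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_passages(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    """Return up to top_k passages that share the most distinct words with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(p)), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties keep the original order. Passages with no overlap are dropped.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [passages[i] for _, i in ranked[:top_k]]


# Illustrative passages (shortened stand-ins, not the full generated chunks)
passages = [
    "Ducks have many predators.",
    "Dabbling ducks feed on the surface of water.",
    "Howard the Duck started as a comic book character in 1973.",
]
context = "\n\n".join(select_passages("What predators do ducks have?", passages, top_k=1))
print(context)
```

In the chat loop, a call such as select_passages(user_query, passages) could stand in for full_context, so that only the best-matching passages are sent to the model with each question.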
- Run the application.
python ollama_chat_app.py
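The app expects the Ollama server to be running with the model already pulled. Here is a minimal sketch of a pre-flight check, assuming Ollama's /api/tags endpoint returns a JSON object with a "models" list whose entries carry a "name" field. The helper names are hypothetical, and the payload at the bottom is an illustrative stand-in rather than real server output.

```python
import json
import urllib.request


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response (assumed shape)."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def model_is_installed(model: str, tags_payload: dict) -> bool:
    """Check whether the given model name appears in the server's model list."""
    return model in installed_models(tags_payload)


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server for its model list (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline example with a hypothetical payload in the assumed response shape
example_payload = {"models": [{"name": "granite3.3:latest"}]}
print(model_is_installed("granite3.3:latest", example_payload))
```

With the server running, model_is_installed("granite3.3:latest", fetch_tags()) performs the same check against the live model list. Running ollama list in a terminal gives the same information.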
- Test the generated data.
> python ollama_chat_app.py
Loading data from my_duck_data.jsonl...
Loaded 24 passages from my_duck_data.jsonl
Chat application started. Type 'exit' to quit.
Please ensure your Ollama server is running and the 'granite3.3:latest' model is downloaded.
Your question: where does all ducks belong to
Generating response (this may take a moment)...
Ollama: All ducks belong to the order Anseriformes, which is a group of birds that includes ducks, geese, and swans. Within this order, ducks are part of the family Anatidae, which also contains geese and some swan species. This family is further divided into several genera, including Anas (mallards), Cairina (Muscovy ducks), and various other genera for different types of ducks found around the world.
Et voilà 🎯
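To see which generated passages could support an answer like the one above, it can help to search the .jsonl file for a key term and list the heading path of each matching chunk. This is a minimal sketch, assuming each record has the "text" and "meta" fields shown in the sample output. find_passages is a hypothetical helper, and the two sample records are shortened stand-ins.

```python
import json


def find_passages(lines, term: str) -> list[dict]:
    """Return the heading path and a short chunk_id for each record whose text mentions term."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if term.lower() in record.get("text", "").lower():
            meta = record.get("meta", {})
            hits.append({
                "headings": " > ".join(meta.get("headings") or []),
                "chunk_id": meta.get("chunk_id", "")[:8],
            })
    return hits


# Hypothetical two-record sample in the same shape as the generated file
sample_lines = [
    json.dumps({"text": "Ducks have many predators.",
                "meta": {"headings": ["Duck", "Behaviour", "Predators"],
                         "chunk_id": "80c1edd2941b"}}),
    json.dumps({"text": "Humans have hunted ducks since prehistoric times.",
                "meta": {"headings": ["Duck", "Relationship with humans", "Hunting"],
                         "chunk_id": "c6a69a758cd7"}}),
]
print(find_passages(sample_lines, "predators"))

# Against the real file:
# with open("my_duck_data.jsonl", encoding="utf-8") as f:
#     print(find_passages(f, "Anseriformes"))
```

An empty result for a term that appears in the model's answer suggests that part of the answer came from the model's own knowledge rather than from the supplied passages.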
Conclusion
In conclusion, synthetic data generation with tools like Docling SDG is a practical way to test and validate RAG pipelines. With diverse, representative datasets, we can assess how well a RAG system retrieves and synthesizes information and identify biases or gaps before it reaches real-world use. Because Docling SDG generates contextually relevant data directly from documents using generative AI, it is a handy tool for developers and researchers who want to build robust, high-performing generative AI solutions.
Links
- Docling-SDG: https://github.com/docling-project/docling-sdg
- Ollama Granite 3.3: https://ollama.com/library/granite3.3
- Docling documentation: https://docling-project.github.io/docling/examples/