Generate Privacy-Safe Synthetic Data
LlamaIndex pack that generates differentially private synthetic datasets from sensitive data, preserving the original attributes while minimizing performance loss.
Why it matters
Create differentially private synthetic data from sensitive datasets, enabling privacy-safe downstream processing and LLM consumption without additional privacy costs.
Outcomes
What it gets done
Generate synthetic data examples with differential privacy.
Obscure source data while preserving original attributes.
Prepare privacy-safe datasets for LLM prompt ingestion.
Integrate with LLMs that produce LogProbs for generation.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/li-pack-packs-diff-private-simple-dataset | bash
Steps
Steps in the chain
Create a DiffPrivateSimpleDatasetPack object with the following parameters:
1) an LLM (must return CompletionResponse)
2) its associated tokenizer
3) a PromptBundle object containing the parameters for prompting the LLM
4) a LabelledSimpleDataset
5) [Optional] sephamore_counter_size, to reduce the chance of a RateLimitError
6) [Optional] sleep_time_in_seconds, to reduce the chance of a RateLimitError
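The constructor call described above can be sketched as follows. This is a minimal sketch, not the pack's documented usage: the import path llama_index.packs.diff_private_simple_dataset is an assumption, and pack_kwargs and build_pack are hypothetical helper names. The keyword names llm, tokenizer, prompt_bundle, simple_dataset, sephamore_counter_size, and sleep_time_in_seconds are taken from this page as written.

```python
def pack_kwargs(llm, tokenizer, prompt_bundle, simple_dataset,
                sephamore_counter_size=None, sleep_time_in_seconds=None):
    """Collect constructor arguments, leaving out optional ones that are unset."""
    kwargs = {
        "llm": llm,                        # must return CompletionResponse
        "tokenizer": tokenizer,            # tokenizer that matches the LLM
        "prompt_bundle": prompt_bundle,    # PromptBundle with prompting parameters
        "simple_dataset": simple_dataset,  # LabelledSimpleDataset with the source data
    }
    # Both optional settings exist to reduce the chance of a RateLimitError.
    if sephamore_counter_size is not None:
        kwargs["sephamore_counter_size"] = sephamore_counter_size
    if sleep_time_in_seconds is not None:
        kwargs["sleep_time_in_seconds"] = sleep_time_in_seconds
    return kwargs


def build_pack(**kwargs):
    """Instantiate the pack (hypothetical helper)."""
    # Assumed import path; imported lazily so the helper above works
    # even when the package is not installed.
    from llama_index.packs.diff_private_simple_dataset import (
        DiffPrivateSimpleDatasetPack,
    )
    return DiffPrivateSimpleDatasetPack(**pack_kwargs(**kwargs))
```

Leaving unset optional arguments out of the call lets the pack apply its own defaults, whatever they are in the installed version.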
Download the DiffPrivateSimpleDatasetPack as a template using the download_llama_pack() function. You can then customize the pack for your needs before instantiating it with your llm, tokenizer, prompt_bundle, and simple_dataset.
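A minimal sketch of the download step, assuming llama-index-core is installed and network access is available. The target directory and the helper name download_pack_template are hypothetical; download_llama_pack is imported from llama_index.core.llama_pack and takes the pack class name and a download directory.

```python
# Hypothetical local directory for the downloaded pack source.
DOWNLOAD_DIR = "./diff_private_simple_dataset_pack"


def download_pack_template(download_dir=DOWNLOAD_DIR):
    """Fetch the pack source into download_dir and return the pack class."""
    # Imported lazily: the call needs llama-index-core and network access.
    from llama_index.core.llama_pack import download_llama_pack

    return download_llama_pack("DiffPrivateSimpleDatasetPack", download_dir)
```

After the download, the files under the chosen directory can be edited before the returned class is instantiated.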
Call the run() function, a light wrapper around query_engine.query(). It takes parameters including t_max (the maximum number of tokens) and processes the dataset to generate privacy-safe synthetic examples.
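The run step can be sketched as below. Only t_max is named on this page, so any other run() arguments are passed through untouched and are not spelled out here. EchoPack is a hypothetical stand-in that lets the sketch run without the package installed; generate_synthetic is likewise a hypothetical helper name.

```python
class EchoPack:
    """Stand-in for a constructed pack; it only echoes the arguments it receives."""

    def run(self, **kwargs):
        return kwargs


def generate_synthetic(pack, t_max, **run_kwargs):
    """Call pack.run() with a validated token limit."""
    if t_max <= 0:
        raise ValueError("t_max must be a positive number of tokens")
    # Further run() parameters depend on the installed pack version
    # and are forwarded as given.
    return pack.run(t_max=t_max, **run_kwargs)


# With a real pack, replace EchoPack() with the constructed pack object.
result = generate_synthetic(EchoPack(), t_max=10)
```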
Overview
LlamaIndex Packs: `DiffPrivateSimpleDatasetPack`
What it does
A LlamaIndex pack implementing differential privacy techniques to create synthetic datasets from sensitive source data, designed for safe use in LLM workflows.
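To make "differential privacy techniques" more concrete, here is a generic illustration of one common approach to private token generation: average the next-token probability vectors obtained from disjoint splits of the private data, add Gaussian noise, and renormalize. This is an assumption for illustration only; this page does not confirm that the pack works this way, and the function names and the sigma parameter are hypothetical. Calibrating sigma to a formal privacy budget is outside the scope of the sketch.

```python
import random


def noisy_mean_distribution(distributions, sigma, rng):
    """Average per-split next-token distributions, add Gaussian noise, renormalize."""
    n = len(distributions)
    vocab = len(distributions[0])
    # Mean probability of each token across the disjoint data splits.
    mean = [sum(d[i] for d in distributions) / n for i in range(vocab)]
    # Add independent Gaussian noise per token and clip negative values to zero.
    noisy = [max(p + rng.gauss(0.0, sigma), 0.0) for p in mean]
    total = sum(noisy)
    if total == 0.0:
        # Everything was clipped away: fall back to a uniform distribution.
        return [1.0 / vocab] * vocab
    return [p / total for p in noisy]


def pick_token(distribution):
    """Return the index of the most probable token."""
    return max(range(len(distribution)), key=distribution.__getitem__)


splits = [[0.5, 0.5], [1.0, 0.0]]  # toy distributions over a two-token vocabulary
private_dist = noisy_mean_distribution(splits, sigma=0.1, rng=random.Random(0))
```

Because each private record influences only one split, averaging limits how much any single record can shift the result, and the added noise masks what remains.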
How it connects
Use when you need to generate privacy-preserving synthetic examples from labeled datasets containing sensitive information that will be passed to LLMs for downstream processing.