Translate LaTeX Books from Slovenian to English
A multi-step prompt workflow that translates a LaTeX book from Slovenian to English while preserving LaTeX commands, using GPT-4o to process the book in chunks.
Why it matters
Translate a book written in LaTeX from Slovenian to English while preserving all LaTeX commands. This asset processes the book in chunks, translates the text content, and reassembles it.
Outcomes
What it gets done
Split book into manageable chunks.
Translate text content of each chunk without altering LaTeX commands.
Reassemble translated chunks into a complete English version of the book.
Install
Add it to your toolbox
Run in your project directory:
curl -fsSL https://spark.entire.vc/get/oai-translatelatexbook | bash
Steps
Steps in the chain
Begin by reading in the LaTeX book data that needs to be translated from Slovenian to English.
Split the book into chunks using double newlines as separators to avoid breaking text flow. Group shorter chunks into larger chunks of approximately 15,000 tokens to increase coherence. Ensure no individual chunk exceeds the model's token limit (gpt-4o: 16,384 tokens).
Structure the prompt with: (1) High-level instruction to translate only text, not LaTeX commands, (2) A sample untranslated command showing what needs translation, (3) The chunk of text to translate, (4) The translated sample command as a reference for the model.
Process all chunks sequentially through the model to translate them from Slovenian to English while preserving all LaTeX commands intact. This process will take approximately 2-3 hours.
Overview
Translate a book written in LaTeX from Slovenian into English
What it does
This workflow translates the text content of a book written in LaTeX from Slovenian to English, ensuring that all LaTeX commands remain untouched. The book is split into manageable chunks, each chunk is translated individually, and the translated chunks are then reassembled.
How it connects
This is ideal for translating academic or technical LaTeX documents from Slovenian to English where preserving the original LaTeX structure and commands is crucial. It is not suitable for translating the LaTeX commands themselves or for non-LaTeX source material.
Source README
Translate a book written in LaTeX from Slovenian into English
With permission of the author, we will demonstrate how to translate the book Euclidean Plane Geometry, written by Milan Mitrović, from Slovenian into English without modifying any of the LaTeX commands.
To achieve this, we will first split the book into chunks, each roughly a page long, then translate each chunk into English, and finally stitch them back together.
1. Read in the data
1.1 Count the tokens in each chunk
A double newline turns out to be a good separator in this case, because it does not break the flow of the text. No individual chunk is larger than 1211 tokens. The model we will use is gpt-4o, which has an output limit of 16,384 tokens, so we don't need to worry about breaking the chunks down further.
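The splitting and counting step can be sketched as follows. This is a minimal illustration, not the notebook's original code: `rough_token_count` is a hypothetical stand-in that counts whitespace-separated words, and the sample text is invented. For real token counts, a tokenizer matching the model (for example one from the tiktoken library) would be passed in as `count_tokens` instead.

```python
from typing import Callable, List


def split_into_chunks(text: str) -> List[str]:
    """Split LaTeX source on double newlines, i.e. at paragraph breaks."""
    return text.split("\n\n")


def rough_token_count(chunk: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(chunk.split())


def max_chunk_tokens(
    chunks: List[str],
    count_tokens: Callable[[str], int] = rough_token_count,
) -> int:
    """Return the token count of the largest chunk (0 for an empty list)."""
    return max((count_tokens(c) for c in chunks), default=0)


# Invented sample text, for illustration only.
sample = "\\chapter{Uvod}\n\nPrvi odstavek besedila.\n\nDrugi odstavek."
chunks = split_into_chunks(sample)
print(len(chunks), max_chunk_tokens(chunks))
```

Checking the largest chunk first shows whether any single chunk needs to be broken down further before grouping.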
We will group the shorter chunks into chunks of around 15000 tokens, to increase the coherence of the text, and decrease the frequency of breaks within the text.
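One way to do this grouping is a greedy pass that keeps adding chunks to the current group until the next chunk would push it past the budget. The sketch below assumes token counts have already been computed per chunk. The name `group_chunks` and its parameters are illustrative, and the small cost of the separators between chunks is ignored.

```python
from typing import List


def group_chunks(
    chunks: List[str],
    token_counts: List[int],
    max_tokens: int = 15000,
) -> List[str]:
    """Greedily merge consecutive chunks into groups of at most max_tokens.

    A single chunk that is larger than max_tokens still becomes its own
    group; this sketch does not split it further.
    """
    groups: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for chunk, n_tokens in zip(chunks, token_counts):
        # Close the current group if adding this chunk would exceed the budget.
        if current and current_tokens + n_tokens > max_tokens:
            groups.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(chunk)
        current_tokens += n_tokens
    if current:
        groups.append("\n\n".join(current))
    return groups


# Tiny budget so the grouping is visible on toy data.
example_groups = group_chunks(["a", "b", "c"], [6, 5, 4], max_tokens=10)
print(example_groups)
```

Because chunks are rejoined with the same double newline used for splitting, joining the groups again reproduces the original text.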
Notice that adding a sample untranslated and translated first command, where only the content of the chapter name needs to be translated, helps to get more consistent results.
The format of the prompt sent to the model consists of:
- A high-level instruction to translate only the text into the desired language, leaving the commands unchanged
- A sample untranslated command, where only the content of the chapter name needs to be translated
- The chunk of text to be translated
- The translated version of that sample command, which shows the model how the translation begins
The expected output is the translated chunk of text.
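The four-part prompt described above could be assembled roughly like this. The exact wording of the instruction, the delimiters, and the sample command pair (`SAMPLE_SOURCE`, `SAMPLE_TRANSLATION`) are assumptions made for illustration and are not taken from the book or the original notebook.

```python
# Hypothetical sample command pair: only the chapter name is translated.
SAMPLE_SOURCE = r"\chapter{Osnove geometrije}"
SAMPLE_TRANSLATION = r"\chapter{The Basics of Geometry}"


def build_prompt(chunk: str, target_language: str = "English") -> str:
    """Assemble instruction, untranslated sample, chunk, and translated sample."""
    instruction = (
        "Translate only the text from the following LaTeX document into "
        f"{target_language}. Leave all LaTeX commands unchanged."
    )
    # The untranslated sample precedes the chunk inside the quoted block;
    # the translated sample comes last so the model continues from it.
    return (
        f'{instruction}\n\n"""\n{SAMPLE_SOURCE}\n{chunk}\n"""\n\n'
        f"{SAMPLE_TRANSLATION}\n"
    )


prompt = build_prompt("Prvi odstavek besedila.")
print(prompt)
```

Ending the prompt with the translated sample command gives the model a concrete starting point, which is the consistency benefit noted above.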
In this particular chunk, the model translated only the text and left the LaTeX commands intact.
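A quick automated check can complement reading the output by eye. The sketch below is an added suggestion rather than part of the original workflow. It compares the multiset of command names (a backslash followed by letters) in the source and in the translation. It does not inspect command arguments, so it would not notice a changed label.

```python
import re
from collections import Counter

# Matches LaTeX control words such as \chapter or \label (letters only).
COMMAND_RE = re.compile(r"\\[A-Za-z]+")


def commands_preserved(source: str, translated: str) -> bool:
    """Return True if both texts contain the same commands the same number of times."""
    return Counter(COMMAND_RE.findall(source)) == Counter(COMMAND_RE.findall(translated))


# Invented example strings, for illustration only.
ok = commands_preserved(
    r"\chapter{Uvod} \label{uvod}",
    r"\chapter{Introduction} \label{uvod}",
)
broken = commands_preserved(r"\textbf{krog}", "circle")
print(ok, broken)
```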
Let's now translate all the chunks in the book - this will take 2-3 hours, as we're processing requests sequentially.
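The sequential loop and the final stitching step can be sketched as below. To keep the example self-contained, the model call is passed in as a callable. `translate_chunk` is a hypothetical parameter that, in a real run, would wrap a request to gpt-4o, and `fake_translate` is a stand-in used only to show the flow.

```python
from typing import Callable, List


def translate_book(groups: List[str], translate_chunk: Callable[[str], str]) -> str:
    """Translate each group in order and stitch the results back together."""
    translated: List[str] = []
    for index, group in enumerate(groups, start=1):
        # Requests are processed one at a time, so progress output is useful.
        print(f"Translating chunk {index} of {len(groups)}")
        translated.append(translate_chunk(group))
    # Rejoin with the same double newline that was used for splitting.
    return "\n\n".join(translated)


def fake_translate(chunk: str) -> str:
    """Stand-in for the model call; replaces one word for demonstration."""
    return chunk.replace("Uvod", "Introduction")


book = translate_book(["\\chapter{Uvod}", "\\section{Uvod}"], fake_translate)
print(book)
```

Keeping the model call behind a plain function makes it possible to test the loop without network access and to add retries or caching later without touching the stitching logic.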