Dataset: LLM Token Usage in Everyday Office Tasks
This dataset provides realistic token usage estimates for 64 common AI tasks in enterprise office environments, grouped into six categories: Communication, Coding, Analysis, Planning, Document Processing, and Multimodal (Vision, Audio, Mixed).
About This Dataset
The dataset compares token consumption between standard models (like GPT-4o) and reasoning models (like OpenAI o1). Reasoning models generate additional hidden tokens to “think” through a problem step by step before responding. This typically improves accuracy on complex tasks, but adds cost with little benefit on simple ones.
This dataset has been compiled from real-world use cases in our enterprise AI implementation projects, then validated and extended with data from authoritative sources listed below. The token estimates reflect actual production workloads and can be used for modeling theoretical scenarios.
Key Insights:
- Simple tasks (e.g., “Hello World”) use ~300-500 additional tokens for reasoning
- Complex tasks (math, debugging, logic) benefit most: reasoning models use 10-20x more tokens but deliver significantly better accuracy
- Multimodal tasks (images, audio) have high base costs before any reasoning
- Audio is extremely token-dense: ~1,000-1,200 tokens per minute
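The multipliers behind these insights can be recomputed from individual rows of the table. The following is a minimal sketch using three rows copied from the dataset; the `Row` class and its property names are illustrative and not part of the dataset itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One dataset row, reduced to the three token columns (illustrative type)."""
    task: str
    input_tokens: int
    output_normal: int
    output_reasoning: int

    @property
    def reasoning_overhead(self) -> int:
        # Extra output tokens the reasoning model spends on the same task.
        return self.output_reasoning - self.output_normal

    @property
    def output_multiplier(self) -> float:
        # Ratio of reasoning output tokens to normal output tokens.
        return self.output_reasoning / self.output_normal

    @property
    def total_multiplier(self) -> float:
        # Ratio of total tokens (input + output) between the two model types.
        return (self.input_tokens + self.output_reasoning) / (
            self.input_tokens + self.output_normal
        )


# Values copied from the "Hello World Script", "Math Word Problem",
# and "Debugging Code" rows of the table.
ROWS = [
    Row("Hello World Script", 30, 20, 350),
    Row("Math Word Problem", 100, 50, 2500),
    Row("Debugging Code", 600, 200, 4500),
]

for row in ROWS:
    print(
        f"{row.task}: +{row.reasoning_overhead} tokens, "
        f"{row.output_multiplier:.1f}x output, {row.total_multiplier:.1f}x total"
    )
```

Which multiplier is meant matters: for the math word problem, output tokens alone grow 50x, while total tokens grow about 17x, which appears to be the reading behind the "15x+" note in the table.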
Token Usage Dataset
The complete dataset is available for download in CSV file format.
| Category | Task | Description | Input Tokens | Output Tokens (Normal) | Output Tokens (Reasoning) | Images | Audio (min) | Insight / Complexity Factor |
|---|---|---|---|---|---|---|---|---|
| Communication | Drafting a Short Email | Write a sick leave email to boss | 50 | 100 | 450 | 0 | 0 | Low complexity. Reasoning model overthinks simple politeness. |
| Communication | Polite Rejection | Decline a wedding invitation politely | 60 | 120 | 500 | 0 | 0 | Social nuance requires minimal reasoning overhead (~300 hidden tokens). |
| Communication | Rewrite for Tone | Make paragraph sound more professional | 150 | 150 | 800 | 0 | 0 | Reasoning model checks multiple variations internally. |
| Communication | Cover Letter Generation | Write cover letter for Sales role | 200 | 400 | 1800 | 0 | 0 | Reasoning model plans structure and maps skills to job requirements. |
| Communication | Replying to a Text | Give 3 witty replies to text message | 40 | 60 | 500 | 0 | 0 | Creativity task; reasoning model brainstorms options before selecting. |
| Communication | Grammar Check | Fix grammar in 200-word memo | 200 | 150 | 900 | 0 | 0 | Standard model is sufficient; reasoning adds unnecessary verification. |
| Coding | Hello World Script | Write a Python Hello World script | 30 | 20 | 350 | 0 | 0 | Even simple tasks use ~300 reasoning tokens for quality verification. |
| Coding | Excel Formula Help | Formula to VLOOKUP column A in Sheet 2 | 50 | 50 | 1200 | 0 | 0 | Reasoning model verifies syntax and edge cases (e.g., exact match). |
| Coding | Regex Generation | Regex to validate email address | 80 | 70 | 1500 | 0 | 0 | High reasoning gain; regex is error-prone, model self-corrects heavily. |
| Coding | SQL Query Generation | Select top 5 users by spend from tables | 100 | 100 | 1800 | 0 | 0 | Reasoning model plans joins and filtering logic carefully. |
| Coding | Debugging Code | Find error in 50-line Python function | 600 | 200 | 4500 | 0 | 0 | Major reasoning win. Model ‘traces’ execution mentally to find bugs. |
| Coding | Code Refactoring | Rewrite code to be more efficient | 700 | 300 | 5000 | 0 | 0 | Complex planning required to preserve logic while changing structure. |
| Coding | Explain Error Log | What does this stack trace mean? | 350 | 150 | 2500 | 0 | 0 | Reasoning model analyzes causal chain of the error. |
| Analysis | Summarize Article | Summarize 1000-word article | 1400 | 200 | 3500 | 0 | 0 | Reading cost is dominant. Reasoning adds synthesis overhead. |
| Analysis | Extract Data | List all dates and names from text | 1000 | 200 | 2800 | 0 | 0 | Reasoning model double-checks missed items (higher recall). |
| Analysis | Math Word Problem | If train leaves Chicago at 60mph… | 100 | 50 | 2500 | 0 | 0 | Massive multiplier (15x+). Standard models guess; reasoning models calculate. |
| Analysis | Logic Riddle Solving | Solve two doors two guards riddle | 120 | 80 | 2000 | 0 | 0 | Pure logic task. Reasoning model simulates scenarios. |
| Analysis | Financial Analysis | Analyze CSV rows for trends | 600 | 200 | 3500 | 0 | 0 | Reasoning model performs multi-step numerical comparison. |
| Analysis | Sentiment Analysis | Is customer review positive? | 80 | 70 | 600 | 0 | 0 | Low complexity. Standard model is usually sufficient. |
| Planning | Meal Plan | Create healthy 3-day meal plan | 150 | 350 | 1500 | 0 | 0 | Reasoning model balances constraints (nutrition, variety) better. |
| Planning | Trip Itinerary | Plan 3-day weekend in Tokyo | 200 | 600 | 2500 | 0 | 0 | Reasoning model checks logistics and travel times between spots. |
| Planning | Brainstorm Titles | 10 catchy titles for AI blog | 100 | 100 | 800 | 0 | 0 | Creative task. Reasoning overhead is mostly ‘filtering’ bad ideas. |
| Planning | Write a Haiku | Write haiku about the ocean | 30 | 30 | 400 | 0 | 0 | Syllable counting requires ‘thinking’ steps for accuracy. |
| Planning | Gift Ideas | Gift ideas for dad who likes golf | 100 | 200 | 1000 | 0 | 0 | Reasoning model models the ‘persona’ of the recipient. |
| Planning | Roleplay Scenario | Pretend you are a career coach | 150 | 450 | 1500 | 0 | 0 | Maintains character consistency via hidden reasoning tokens. |
| Document Processing | Extract Invoice Data | Extract vendor, total, and date from 2-page invoice | 900 | 300 | 2800 | 0 | 0 | High layout complexity. Reasoning model traces field locations. |
| Document Processing | Summarize Contract | Summarize key terms from 10-page legal contract | 4000 | 500 | 9000 | 0 | 0 | Heavy reading load. Reasoning critical for interpreting legal clauses. |
| Document Processing | Resume Screening | Extract relevant skills from 2-page resume | 1100 | 400 | 3200 | 0 | 0 | Reasoning model infers implicit skills from experience descriptions. |
| Document Processing | Translate Document | Translate 5-page Spanish document to English | 2300 | 700 | 6500 | 0 | 0 | Reasoning model preserves nuance and idioms better than direct translation. |
| Document Processing | Format Markdown | Convert 500-word Word doc to structured Markdown | 1200 | 600 | 4000 | 0 | 0 | Formatting is tedious; reasoning model checks consistency. |
| Document Processing | Parse JSON Schema | Validate and fix malformed JSON document | 500 | 300 | 2200 | 0 | 0 | Reasoning model traces brace matching and structure errors. |
| Document Processing | CSV to SQL | Convert 100-row CSV to INSERT statements | 1200 | 800 | 4500 | 0 | 0 | Repetitive task. Reasoning model ensures data type correctness. |
| Document Processing | Extract Table Data | Extract and restructure table from PDF (500 rows) | 2800 | 700 | 7000 | 0 | 0 | High token usage due to dense data. Reasoning helps with row alignment. |
| Document Processing | Compare Versions | Identify changes between 2 versions of 5-page doc | 1700 | 500 | 5500 | 0 | 0 | Reasoning model performs semantic diffing, not just text diffing. |
| Document Processing | Review Code PR | Review 200-line code pull request for bugs | 1300 | 500 | 4500 | 0 | 0 | Reasoning model simulates runtime to find subtle logic bugs. |
| Document Processing | Generate API Docs | Create documentation from 50-function source file | 1800 | 700 | 5500 | 0 | 0 | Reasoning model infers function purpose from code logic. |
| Multimodal (Vision) | Describe Image | Describe content of single photograph | 800 | 150 | 2200 | 1 | 0 | Tokens = Image patches (765+) + Output. Reasoning adds visual analysis. |
| Multimodal (Vision) | OCR Document | Extract text from image of handwritten note | 850 | 200 | 2400 | 1 | 0 | Handwriting recognition requires ‘guessing’ and verifying context. |
| Multimodal (Vision) | Analyze Chart | Interpret data trends from bar chart image | 950 | 350 | 3000 | 1 | 0 | Reasoning model maps visual bars to approximate numerical values. |
| Multimodal (Vision) | Screenshot Analysis | Debug UI from application screenshot | 900 | 350 | 3800 | 1 | 0 | High reasoning: model must correlate UI elements with code logic. |
| Multimodal (Vision) | Identify Objects | List all objects in image of warehouse | 800 | 300 | 2800 | 1 | 0 | Scanning task. Reasoning model performs systematic grid search mentally. |
| Multimodal (Vision) | Compare Images | Find differences between 2 product photos | 1600 | 600 | 4500 | 2 | 0 | Double image cost (~1,500+ tokens). Reasoning compares feature by feature. |
| Multimodal (Vision) | Read Whiteboard | Transcribe equation written on whiteboard photo | 800 | 250 | 2600 | 1 | 0 | Math + Vision. Reasoning model validates the math syntax extracted. |
| Multimodal (Audio) | Transcribe Audio | Transcribe 5-minute audio interview | 5000 | 800 | 11000 | 0 | 5 | Includes ~5k billed audio tokens. High base cost. |
| Multimodal (Audio) | Extract Meeting Notes | Generate summary and action items from 30-min meeting | 30000 | 1000 | 58000 | 0 | 30 | ~30k audio tokens! Extremely expensive task due to audio density. |
| Multimodal (Audio) | Identify Speaker | Identify speaker and emotion in 2-min audio clip | 2000 | 300 | 4800 | 0 | 2 | Audio analysis adds cost. Reasoning infers emotion from tone. |
| Multimodal (Audio) | Translate Audio | Transcribe and translate 10-min German audio to English | 10000 | 1000 | 21000 | 0 | 10 | Double load: Audio tokens (10k) + Translation reasoning. |
| Multimodal (Mixed) | Document + Image | Match text document to related photos | 1500 | 1000 | 5500 | 2 | 0 | Cross-modal reasoning (Text vs Image) is token-heavy. |
| Multimodal (Mixed) | Video Description | Describe content from 2-min video (frames + audio) | 2300 | 2200 | 9500 | 3 | 2 | Video = Audio tokens + Sampled Image Frames. Very high data density. |
| Multimodal (Mixed) | Multi-Image Comparison | Compare changes across 5 product design mockups | 4200 | 600 | 9500 | 5 | 0 | 5 images @ ~800 tokens each. High base cost before any reasoning. |
| Document Processing | Summarize 50-page Technical Report | Summarize key findings from 50-page technical PDF without images | 20000 | 1200 | 26000 | 0 | 0 | Very heavy reading. Reasoning model builds global mental map of the report. |
| Document Processing | Extract KPIs from 50-page Annual Report | Extract revenue, profit, and growth KPIs from 50-page annual report | 22000 | 1500 | 28000 | 0 | 0 | Reasoning model scans tables and narrative to consolidate financial metrics. |
| Document Processing | Summarize 100-page Regulatory Filing | Create executive summary of 100-page regulatory filing (10-K/10-Q) | 40000 | 2000 | 52000 | 0 | 0 | Extreme reading load. Reasoning ensures compliance-critical points are retained. |
| Document Processing | Compare Two 50-page Contracts | Identify differences and risks between two 50-page legal contracts | 38000 | 2500 | 60000 | 0 | 0 | Semantic diff across 100 pages. Reasoning model aligns clauses and flags conflicts. |
| Document Processing | Audit 5k-line Codebase File | Review a 5000-line single code file for bugs and architecture issues | 35000 | 3000 | 70000 | 0 | 0 | Reasoning model must track long-range dependencies and patterns across thousands of lines. |
| Multimodal (Vision) | Process 20-page Scanned PDF | OCR and structure 20-page scanned PDF (image-only) | 16000 | 2000 | 30000 | 20 | 0 | 20 page images (~800 tokens each). Reasoning aligns detected text into pages and sections. |
| Multimodal (Mixed) | Analyze 50-page Report with Charts | Summarize 50-page PDF containing text plus 10 chart images | 23000 | 2000 | 32000 | 10 | 0 | Combination of long text and visual charts; reasoning fuses numeric insight with narrative. |
| Multimodal (Audio) | Transcribe 60-min Podcast | Full transcription of a 60-minute podcast episode | 60000 | 3000 | 75000 | 0 | 60 | ~60k audio tokens. Reasoning may cluster topics or speakers if requested. |
| Multimodal (Audio) | Summarize 90-min University Lecture | Generate structured notes and sections from a 90-minute lecture recording | 90000 | 4000 | 90000 | 0 | 90 | Massive audio load; reasoning organizes content into topics, subtopics, and key definitions. |
| Multimodal (Audio) | Analyze 2-hour Support Call Log | Extract issues, sentiments, and escalation points from 2-hour support call | 120000 | 5000 | 110000 | 0 | 120 | ~120k audio tokens plus reasoning to classify intents and sentiment over a long horizon. |
| Multimodal (Mixed) | Describe 10-min Product Demo Video | Summarize features and UX from 10-minute demo video (screen + narration) | 18000 | 3000 | 22000 | 10 | 10 | Combines ~10k audio tokens with ~8-10 key frames; reasoning links UI steps with spoken explanations. |
| Multimodal (Mixed) | Summarize 45-min Webinar with Slides | Generate structured summary from 45-min webinar audio plus 30 slide images | 75000 | 4000 | 80000 | 30 | 45 | Slide deck (30 images) plus long audio track; reasoning aligns slide content with spoken narrative. |
| Multimodal (Mixed) | Review 60-min Security Camera Footage | Identify key events in 60-min silent security recording | 48000 | 2500 | 52000 | 40 | 0 | Dozens of sampled frames; reasoning tracks motion and anomalies across time. |
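To work with the table programmatically, the rows can be aggregated per category. The sketch below parses a small excerpt of the table in CSV form with Python's standard `csv` module. The header names mirror the table columns; the header of the actual downloadable CSV file may differ, so treat the column names and the `totals_by_category` helper as assumptions.

```python
import csv
import io
from collections import defaultdict

# A small excerpt of the table in CSV form. Header names mirror the table
# columns; the real CSV file's header may differ (assumption).
SAMPLE_CSV = """Category,Task,Input Tokens,Output Tokens (Normal),Output Tokens (Reasoning)
Communication,Drafting a Short Email,50,100,450
Communication,Grammar Check,200,150,900
Coding,Debugging Code,600,200,4500
Coding,Code Refactoring,700,300,5000
Multimodal (Audio),Transcribe Audio,5000,800,11000
"""


def totals_by_category(csv_text: str) -> dict[str, dict[str, int]]:
    """Sum total tokens (input + output) per category for both model types."""
    totals: dict[str, dict[str, int]] = defaultdict(
        lambda: {"normal": 0, "reasoning": 0}
    )
    for row in csv.DictReader(io.StringIO(csv_text)):
        input_tokens = int(row["Input Tokens"])
        bucket = totals[row["Category"]]
        bucket["normal"] += input_tokens + int(row["Output Tokens (Normal)"])
        bucket["reasoning"] += input_tokens + int(row["Output Tokens (Reasoning)"])
    return dict(totals)


summary = totals_by_category(SAMPLE_CSV)
for category, tokens in summary.items():
    print(f"{category}: normal={tokens['normal']}, reasoning={tokens['reasoning']}")
```

To run this against the full dataset, replace `SAMPLE_CSV` with the contents of the downloaded file and adjust the column names if needed.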
Data Sources & Methodology
This dataset was compiled from the following authoritative sources:
| Source Category | Source Name | URL | Key Insight / Data Point |
|---|---|---|---|
| General Tokenizer | OpenAI Tokenizer | https://platform.openai.com/tokenizer | Standard text tokenization rule: 1 word ≈ 1.3 tokens (1000 tokens ≈ 750 words). |
| Reasoning Models | OpenAI: Learning to Reason with LLMs | https://openai.com/index/learning-to-reason-with-llms/ | Reasoning tokens are hidden output tokens used by the model to “think” before answering. Can range from hundreds to tens of thousands depending on complexity. |
| Reasoning Models | PromptLayer Analysis (o1 vs GPT-4o) | https://blog.promptlayer.com/an-analysis-of-openai-models-o1-vs-gpt-4o/ | Reasoning models often use 3-10x more tokens for complex tasks like coding or math due to internal chain-of-thought generation. |
| Reasoning Models | Reddit Community Analysis (Hidden Tokens) | https://www.reddit.com/r/OpenAI/comments/1hrhdbp/o1_models_hidden_reasoning_tokens/ | User benchmarks showing simple tasks might use ~300 hidden tokens, while complex coding tasks can exceed 5,000 hidden tokens. |
| Reasoning Models | arXiv: Comparative Study on Reasoning Patterns | https://arxiv.org/html/2410.13639v1 | Comparative benchmarks showing reasoning models consuming 10x-20x more tokens on complex logical tasks. |
| Reasoning Models | Clarifai Reasoning Model Comparison | https://www.clarifai.com/blog/best-reasoning-model-apis/ | Benchmarks for hard math/logic problems showing reasoning token usage often exceeding 30,000 for difficult queries. |
| Reasoning Models | Databricks: Long Context RAG & o1 | https://www.databricks.com/blog/long-context-rag-capabilities-openai-o1-and-google-gemini | Highlights that reasoning models can fail or hit output limits when reasoning over very large contexts (e.g., 100+ pages). |
| Vision Tasks | OpenAI Vision Documentation | https://platform.openai.com/docs/guides/images-vision | Images are processed in 512x512 tiles. High-detail mode costs ~85 tokens base + 170 tokens per tile. A standard 1080p image is often ~765-1105 tokens. |
| Vision Tasks | Cursor IDE Blog (GPT-4o Image Costs) | https://www.cursor-ide.com/blog/gpt4o-image-api-pricing-guide-2025 | Practical breakdown of image costs: Low detail is fixed at 85 tokens. High detail scales with resolution. |
| Audio Tasks | OpenAI Pricing (Audio) | https://openai.com/api/pricing/ | Audio inputs are billed separately from text. GPT-4o Audio input is ~$0.06/min (Realtime). |
| Audio Tasks | Microsoft Azure AI Blog (Audio Tokens) | https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/real-time-speech-intelligence-for-global-scale-gpt-4o-transcribe-/4403091 | Audio tokenization is dense. Approximately 1 minute of audio ≈ 1,000-1,200 audio tokens for billing purposes. |
| Audio Tasks | OpenAI GPT-4o Audio Guide | https://platform.openai.com/docs/guides/audio | Technical details on how audio is tokenized and processed, confirming the distinction between input audio tokens and output text tokens. |
| Document Processing | arXiv: Chain of Draft | https://arxiv.org/html/2502.18600v2 | Discusses token efficiency in reasoning models for drafting and document tasks, highlighting the overhead of “thinking” steps. |
| General Tasks | Awesome LLM Tasks (GitHub) | https://github.com/ozbekburak/awesome-llm-tasks | Curated list of practical LLM tasks used to derive the common daily task list categories. |
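The rules of thumb in the table above (1 word ≈ 1.3 tokens, 85 base tokens plus 170 per 512x512 tile, roughly 1,000-1,200 audio tokens per minute) can be combined into a rough input-token estimator. The sketch below is an approximation under stated assumptions: the two resizing steps in `image_tokens` (fit within 2048x2048, then shrink the shortest side to 768 px) follow OpenAI's published description of high-detail image handling as we understand it, and the function names are our own. Other models and providers may count differently.

```python
import math

TOKENS_PER_WORD = 1.3            # "1 word ≈ 1.3 tokens" from the table above
AUDIO_TOKENS_PER_MINUTE = 1000   # lower end of the 1,000-1,200 range above


def text_tokens(words: int) -> int:
    """Approximate token count for English text of the given word count."""
    return round(words * TOKENS_PER_WORD)


def image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate image token count using the tile rule from the table above."""
    if detail == "low":
        return 85  # low detail is a fixed cost
    # Step 1 (assumption): scale down to fit within 2048 x 2048.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2 (assumption): scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count 512 x 512 tiles, 170 tokens each, plus 85 base tokens.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def audio_tokens(minutes: float, per_minute: int = AUDIO_TOKENS_PER_MINUTE) -> int:
    """Approximate billed audio input tokens for a recording."""
    return round(minutes * per_minute)


print(text_tokens(1000))          # 1000-word article
print(image_tokens(1024, 1024))   # square image
print(image_tokens(1920, 1080))   # 1080p image
print(audio_tokens(30))           # 30-minute meeting
```

Under these assumptions a 1024x1024 image comes to 765 tokens and a 1080p image to 1,105, matching the range quoted in the vision row above.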
Use Cases
- Cost Estimation: Calculate expected API costs for your AI applications
- Model Selection: Choose between standard and reasoning models based on task complexity
- Budgeting: Plan AI infrastructure costs for production workloads
- Research: Benchmark and compare token efficiency across different task types
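As an illustration of the cost estimation use case, the sketch below prices a single task from the table under both model types. The per-million-token prices are placeholders, not quotes from any provider, and the `Pricing` class, `task_cost` function, and run frequency are hypothetical. It assumes hidden reasoning tokens are billed as output tokens, in line with the description of reasoning tokens in the sources table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1 million tokens. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


def task_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of one task run; reasoning tokens count as output tokens."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Hypothetical prices: substitute your provider's current rates.
STANDARD = Pricing(input_per_million=2.0, output_per_million=8.0)
REASONING = Pricing(input_per_million=10.0, output_per_million=40.0)

# "Debugging Code" row: 600 input, 200 normal output, 4500 reasoning output.
standard_cost = task_cost(600, 200, STANDARD)
reasoning_cost = task_cost(600, 4500, REASONING)

# Assumed workload: 20 runs per working day, 22 working days per month.
runs_per_month = 20 * 22

print(f"Standard:  ${standard_cost:.4f} per run, "
      f"${standard_cost * runs_per_month:.2f} per month")
print(f"Reasoning: ${reasoning_cost:.4f} per run, "
      f"${reasoning_cost * runs_per_month:.2f} per month")
```

Because both the token count and the unit price are typically higher for reasoning models, the per-task cost gap can be much larger than the token multiplier alone suggests.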
Related Resources
Want to see how these tasks translate to real-world workloads? Check out our detailed analysis:
AI Costs by Office Role - We use this dataset to calculate typical daily token consumption for different business roles (Executive Assistant, Recruiter, Financial Analyst, Corporate Counsel, Software Engineer) and reveal what drives AI costs in your organization.
Citation
If you use this dataset in your research or applications, please cite:
onprem.ai Research (2025). Real-World LLM Token Usage Dataset.
Retrieved from https://onprem.ai/knowhow/llm-token-usage-dataset
Last Updated: December 2025
Version: 1.0
License: Creative Commons BY 4.0