Databricks Releases 15K Record Training Corpus for Instruction Tuning LLMs

Summary

Blog post: Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM

databricks-dolly-15k is an open source dataset of instruction-following records used in training databricks/dolly-v2-12b. It was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Supported Tasks:

  • Training LLMs
  • Synthetic Data Generation
  • Data Augmentation

Languages: English
Version: 1.0

Owner: Databricks, Inc.

Dataset Overview

databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories, including the seven outlined in the InstructGPT paper, as well as an open-ended free-form category. The contributors were instructed to avoid using information from any source on the web with the exception of Wikipedia (for particular subsets of instruction categories), and explicitly instructed to avoid using generative AI in formulating instructions or responses. Examples of each behavior were provided to motivate the types of questions and instructions appropriate to each category.

Halfway through the data generation process, contributors were given the option of answering questions posed by other contributors. They were asked to rephrase the original question and to select only questions they could reasonably be expected to answer correctly.

For certain categories, contributors were asked to provide reference texts copied from Wikipedia. Reference text (stored in the context field of the dataset) may contain bracketed Wikipedia citation numbers (e.g. [42]), which we recommend users remove for downstream applications.
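The citation cleanup recommended above can be sketched with a regular expression. This is a minimal illustration, not an official preprocessing step: the function name `strip_wikipedia_citations` is hypothetical, and the pattern assumes citation markers are purely numeric, as in the [42] example.

```python
import re

# Matches bracketed numeric citation markers such as [42] or [7].
# Assumption: markers contain digits only; bracketed words are left alone.
_CITATION = re.compile(r"\[\d+\]")


def strip_wikipedia_citations(text: str) -> str:
    """Remove bracketed citation numbers and collapse leftover double spaces."""
    cleaned = _CITATION.sub("", text)
    return re.sub(r" {2,}", " ", cleaned).strip()


sample = "Paris is the capital of France.[42] It lies on the Seine.[7][8]"
print(strip_wikipedia_citations(sample))
```

Applying a function like this to the context field before training keeps the markers from leaking into model outputs.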

Intended Uses

While immediately valuable for instruction fine-tuning large language models, as a corpus of human-generated instruction prompts, this dataset also presents a valuable opportunity for synthetic data generation using the methods outlined in the Self-Instruct paper. For example, contributor-generated prompts could be submitted as few-shot examples to a large open language model to generate a corpus of millions of examples of instructions in each of the respective InstructGPT categories.
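As a rough sketch of the few-shot idea, the snippet below assembles a prompt from a handful of contributor-written instructions in one category. The helper `build_few_shot_prompt`, the prompt wording, and the record keys `instruction` and `category` are assumptions made for illustration; the call to an actual language model is deliberately left out.

```python
from typing import Dict, List


def build_few_shot_prompt(examples: List[Dict[str, str]], category: str, k: int = 3) -> str:
    """Build a prompt that shows up to k instructions and asks for one more."""
    picked = [ex for ex in examples if ex["category"] == category][:k]
    lines = [f"Write a new instruction in the category '{category}'.", ""]
    for i, ex in enumerate(picked, start=1):
        lines.append(f"Example {i}: {ex['instruction']}")
    # The trailing, unfinished example is what the model would complete.
    lines.append(f"Example {len(picked) + 1}:")
    return "\n".join(lines)


# Toy records standing in for rows of the dataset.
toy_records = [
    {"instruction": "Why is the sky blue?", "category": "open_qa"},
    {"instruction": "List five uses for a brick.", "category": "brainstorming"},
    {"instruction": "Who wrote Hamlet?", "category": "open_qa"},
]
print(build_few_shot_prompt(toy_records, "open_qa", k=2))
```

Sampling a different subset of examples for each prompt would likely give more varied generations than reusing the same few.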

Likewise, both the instructions and responses present fertile ground for data augmentation. A paraphrasing model might be used to restate each prompt or short response, with the resulting text associated with the respective ground-truth sample. Such an approach might provide a form of regularization on the dataset that could allow for more robust instruction-following behavior in models derived from these synthetic datasets.
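One way to picture this augmentation: each record is kept, and a copy is added whose instruction has been restated while the original response is preserved as the ground truth. In the sketch below, `paraphrase` is a caller-supplied function standing in for a real paraphrasing model, and the `source_instruction` key is a hypothetical field added only to link the copy back to its origin.

```python
from typing import Callable, Dict, List


def augment_with_paraphrases(
    records: List[Dict[str, str]],
    paraphrase: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return the original records plus one paraphrased copy of each."""
    augmented = []
    for rec in records:
        augmented.append(rec)
        new_rec = dict(rec)  # shallow copy so the original stays untouched
        new_rec["instruction"] = paraphrase(rec["instruction"])
        new_rec["source_instruction"] = rec["instruction"]  # hypothetical link field
        augmented.append(new_rec)
    return augmented


def toy_paraphrase(text: str) -> str:
    """Trivial stand-in for a paraphrasing model."""
    return "Please " + text[0].lower() + text[1:]


rows = [{"instruction": "Name three fruits.", "response": "Apple, pear, plum."}]
for row in augment_with_paraphrases(rows, toy_paraphrase):
    print(row["instruction"], "->", row["response"])
```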

Dataset

Purpose of Collection

As part of our continuing commitment to open source, Databricks developed what is, to the best of our knowledge, the first open source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT. Unlike other datasets that are limited to non-commercial use, this dataset can be used, modified, and extended for any purpose, including academic or commercial applications.

Sources

  • Human-generated data: Databricks employees were invited to create prompt / response pairs in each of eight different instruction categories.
  • Wikipedia: For instruction categories that require an annotator to consult a reference text (information extraction, closed QA, summarization), contributors selected passages from Wikipedia. No guidance was given to annotators as to how to select the target passages.
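Because only some categories are expected to carry a reference passage, a quick consistency check can flag records where it is missing. This is a sketch under assumptions: the category labels `closed_qa`, `information_extraction`, and `summarization` and the keys `category` and `context` are assumed spellings, and should be checked against the actual data before use.

```python
from typing import Dict, List

# Assumed label spellings for the categories that rely on a Wikipedia passage.
REFERENCE_CATEGORIES = {"closed_qa", "information_extraction", "summarization"}


def missing_reference_text(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return records in reference-based categories whose context is empty."""
    return [
        rec
        for rec in records
        if rec["category"] in REFERENCE_CATEGORIES
        and not rec.get("context", "").strip()
    ]


toy_rows = [
    {"category": "summarization", "context": "Some passage.", "instruction": "Summarize."},
    {"category": "closed_qa", "context": "", "instruction": "Who is named?"},
    {"category": "brainstorming", "context": "", "instruction": "Ideas for a party?"},
]
print(len(missing_reference_text(toy_rows)))
```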

Annotator Guidelines

To create a record, employees were given a brief description of the annotation task as well as examples of the types of prompts typical of each annotation task. Guidelines were succinct by design so as to encourage a high task completion rate, possibly at the cost of rigorous compliance with an annotation rubric that concretely and reliably operationalizes the specific task. Caveat emptor.

The annotation guidelines for each of the categories are as follows:

  • Creative Writing: Write a question or instruction that requires a creative, open-ended written response. The instruction should be reasonable to ask of a person with general world knowledge and should not require searching. In this task, your prompt should give very specific instructions to follow. Constraints, instructions, guidelines, or requirements all work, and the more of them the better.
  • Closed QA: Write a question or instruction that requires a factually correct response based on a passage of text from Wikipedia. The question can be complex and can involve human-level reasoning capabilities, but should not require special knowledge. To create a question for this task, include both the text of the question as well as the reference text in the form.
  • Open QA: Write a question that can be answered using general world knowledge or at most a single search. This task asks for opinions and facts about the world at large and does not provide any reference text for consultation.
  • Summarization: Give a summary of a paragraph from Wikipedia. Please don’t ask questions that will require more than 3-5 minutes to answer. To create a question for this task, include both the text of the question as well as the reference text in the form.
  • Information Extraction: These questions involve reading a paragraph from Wikipedia and extracting information from the passage. Everything required to produce an answer (e.g. a list, keywords, etc.) should be included in the passage. To create a question for this task, include both the text of the question as well as the reference text in the form.
  • Classification: These prompts contain lists or examples of entities to be classified, e.g. movie reviews, products, etc. In this task the text or list of entities under consideration is contained in the prompt (i.e. there is no reference text). You can choose any categories for classification you like; the more diverse the better.
  • Brainstorming: Think up lots of examples in response to a question asking to brainstorm ideas.

Personal or Sensitive Data

This dataset contains public information (e.g., some information from Wikipedia). To our knowledge, it contains no personal identifiers of private individuals and no sensitive information.

Language

American English

Known Limitations

  • Wikipedia is a crowdsourced corpus and the contents of this dataset may reflect the bias, factual errors and topical focus found in Wikipedia
  • Some annotators may not be native English speakers
  • Annotator demographics and subject matter may reflect the makeup of Databricks employees

License/Attribution

Copyright (2023) Databricks, Inc.
This dataset was developed at Databricks (https://www.databricks.com) and its use is subject to the CC BY-SA 3.0 license.

Certain categories of material in the dataset include materials from the following sources, licensed under the CC BY-SA 3.0 license:

Wikipedia (various pages) – https://www.wikipedia.org/
Copyright © Wikipedia editors and contributors.

Sharie Byron
