Publishing Data Packages - Best practice patterns

This page summarizes the best practice patterns to follow when creating a Data Package. It covers:

The Data Package name,
The resource and data file names,
The descriptor datapackage.json,
The Data Package folder names and structure,
The README file,
Validation and preview,
Examples of well-structured packages.

Complete specifications are available at http://dataprotocols.org/data-packages.

Data Package Name

The Data Package name is used in the name field of the datapackage.json.

This name is also frequently used for the folder/directory in which the Data Package is stored.

As per the Data Package spec, the name SHOULD be:

lower-case
use '-' for word separators
reasonably concise (3-4 words)
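The three rules above can be checked mechanically. The following is a minimal sketch, not part of the specification: the function name `check_package_name`, the regular expression, and the reading of "3-4 words" as an upper bound of four are all assumptions made for illustration.

```python
import re

# Assumed pattern: lower-case letters and digits, with single '-' separators.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
MAX_WORDS = 4  # assumption: "reasonably concise (3-4 words)" treated as a maximum


def check_package_name(name: str) -> list:
    """Return a list of problems found in a candidate Data Package name."""
    problems = []
    if name != name.lower():
        problems.append("name is not lower-case")
    # Test the lower-cased name so that case is only reported once.
    if not NAME_PATTERN.match(name.lower()):
        problems.append("name should use only letters, digits and '-' separators")
    if len(name.split("-")) > MAX_WORDS:
        problems.append("name has more than %d words" % MAX_WORDS)
    return problems


if __name__ == "__main__":
    for candidate in ("gdp-us", "GDP_US", "corruption-perceptions-index"):
        print(candidate, check_package_name(candidate))
```

An empty list means the name passes all three checks; anything else is a human-readable hint about what to change.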

Naming conventions

For country-specific datasets:

{topic}                  # e.g. gdp
{topic}-{2-letter-iso}   # e.g. gdp-us

For time series data:

[...-]year
[...-]quarter
[...-]month
[...-]day

Resource and File Names

As with Data Package names, resource and file names SHOULD be:

lower-case
use '-' for word separators

Resource names SHOULD usually match the name of the associated file on disk, without the file extension, e.g.:

gdp-quarterly     # resource name
gdp-quarterly.csv # on disk
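As an illustration of this convention, the sketch below compares each resource name in a descriptor with the stem of its file name. It assumes the descriptor has already been loaded into a Python dict and that each resource carries `name` and `path` keys; the helper name `mismatched_resources` is hypothetical.

```python
from pathlib import Path


def mismatched_resources(descriptor: dict) -> list:
    """Return (resource name, expected name) pairs that break the convention.

    Assumes each resource is a dict with "name" and "path" keys; resources
    without a "path" (e.g. inline or remote data) are skipped.
    """
    mismatches = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        if path is None:
            continue
        # Path.stem drops only the final extension: "gdp-quarterly.csv" -> "gdp-quarterly".
        expected = Path(path).stem
        if resource.get("name") != expected:
            mismatches.append((resource.get("name"), expected))
    return mismatches


if __name__ == "__main__":
    example = {
        "resources": [
            {"name": "gdp-quarterly", "path": "data/gdp-quarterly.csv"},
            {"name": "gdp", "path": "data/gdp-year.csv"},
        ]
    }
    print(mismatched_resources(example))
```

Since the guideline says "usually", a mismatch is a prompt to double-check rather than an error.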

File names follow the same conventions as Data Package names for country and time series facets.

Descriptor `datapackage.json`

Alignment

In JSON, data is nested using curly and square brackets. The alignment of these structures does not matter to computer programs, but consistent alignment makes the file much easier for a human reader to follow.

Good alignment:

{
  "name": "corruption-perceptions-index",
  "title": "Corruption Perceptions Index (CPI)",
  "sources": [
    {
      "name": "Transparency International",
      "web": "http://www.transparency.org/research/cpi/overview"
    }
  ],
...
}

Bad alignment:

{
  "name": "corruption-perceptions-index","title": "Corruption Perceptions Index (CPI)",
  "sources":
  [{
    "name": "Transparency International",
    "web": "http://www.transparency.org/research/cpi/overview"}]
    ,
...
}

Please make sure your datapackage.json is well structured so that the content of your Data Package is easy to understand. The Online DataPackage.json Creator can help you create the general structure.
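If a descriptor is valid JSON but badly aligned, it can also be re-indented automatically. The sketch below uses Python's standard `json` module with a two-space indent to match the "good alignment" example; the function name `format_descriptor` is an assumption for illustration.

```python
import json


def format_descriptor(text: str) -> str:
    """Re-indent a JSON descriptor with two spaces, keeping the key order."""
    data = json.loads(text)  # raises json.JSONDecodeError if the input is not valid JSON
    # ensure_ascii=False keeps non-ASCII characters readable instead of \u escapes.
    return json.dumps(data, indent=2, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    messy = (
        '{"name": "corruption-perceptions-index","title": "Corruption Perceptions Index (CPI)",\n'
        ' "sources":\n [{"name": "Transparency International"}]}'
    )
    print(format_descriptor(messy), end="")
```

Only whitespace changes; the parsed content of the descriptor stays the same.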

Contributors field

Add the 'contributors' field, which records the original author of the package (see http://dataprotocols.org/data-packages), if you wish to retain credit for the package.

Data Package Folder Names and Structure

It is standard practice to use the Data Package name (from the datapackage.json) for the name of the folder/directory in which the Data Package is kept.

If the Data Package is stored in e.g. git or GitHub, this would also be the name of the repository.

If you include scripts that automate the data extraction process, these should be stored in a script folder/directory.
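The folder-name convention can be verified with a few lines of code. This is a minimal sketch that assumes datapackage.json sits at the top level of the package folder; the function name `folder_matches_name` is hypothetical.

```python
import json
from pathlib import Path


def folder_matches_name(package_dir: str) -> bool:
    """Check that the folder name equals the "name" field in datapackage.json."""
    folder = Path(package_dir).resolve()
    descriptor_text = (folder / "datapackage.json").read_text(encoding="utf-8")
    descriptor = json.loads(descriptor_text)
    return descriptor.get("name") == folder.name


if __name__ == "__main__":
    import sys

    # Usage: python check_folder.py path/to/package
    print(folder_matches_name(sys.argv[1] if len(sys.argv) > 1 else "."))
```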

README

A README is a text file giving (human-readable) information about your dataset.

Data Packages SHOULD have a README.

Formatting

The README SHOULD be a plain text file (not Word, rich text, etc.) and SHOULD use Markdown for formatting.

File Name

If Markdown is used, the file SHOULD be named README.md; otherwise it SHOULD be named README.txt.

Sections

You can include anything you like in your README. It is standard practice to include some (if possible all) of the following sections: Introduction, Data, Preparation, License.

The README SHOULD NOT include the title of the Data Package at the top.

Each section other than the introduction should be headed with its name as a level 2 Markdown heading. For example, the Data section would begin with the following Markdown:

## Data
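To check a README against the recommended sections, one option is to scan for level 2 headings. The sketch below is an illustration only: it assumes headings are written in the `## Name` form, and the function name `missing_sections` is hypothetical. The introduction is left out of the list because it has no heading.

```python
import re

# Sections recommended above; the introduction carries no heading.
RECOMMENDED_SECTIONS = ("Data", "Preparation", "License")

# A level 2 heading: exactly two '#' characters, then spaces or tabs, then the title.
HEADING_PATTERN = re.compile(r"^##[ \t]+(.+?)[ \t]*$", flags=re.MULTILINE)


def missing_sections(readme_text: str) -> list:
    """Return the recommended section names that have no level 2 heading."""
    found = {match.group(1) for match in HEADING_PATTERN.finditer(readme_text)}
    return [name for name in RECOMMENDED_SECTIONS if name not in found]


if __name__ == "__main__":
    sample = "Short description.\n\n## Data\n\nText.\n\n## License\n\nText.\n"
    print(missing_sections(sample))
```

Because the sections are recommended rather than required, the result is best treated as a reminder list.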

Introduction

Start with a short description of the dataset (the first sentence and first paragraph should be extractable to provide short standalone descriptions).

Unlike other sections, this section SHOULD NOT have a heading, since it starts the README (i.e. you do not need the heading ## Introduction).
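To see what "extractable" could mean in practice, here is a rough sketch that pulls the first sentence and first paragraph out of a README. The splitting rules are assumptions: paragraphs are separated by blank lines, and a sentence ends at the first '.', '!' or '?' followed by whitespace, so abbreviations such as "e.g." will cut the sentence short. The function name `summary_from_readme` is hypothetical.

```python
import re


def summary_from_readme(readme_text: str) -> tuple:
    """Return (first_sentence, first_paragraph) from the start of a README."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", readme_text.strip())
    # Collapse line breaks inside the paragraph into single spaces.
    first_paragraph = " ".join(paragraphs[0].split())
    # Naive sentence boundary: first '.', '!' or '?' followed by whitespace or end of text.
    match = re.match(r"(.+?[.!?])(\s|$)", first_paragraph)
    first_sentence = match.group(1) if match else first_paragraph
    return first_sentence, first_paragraph


if __name__ == "__main__":
    sample = "Yearly GDP figures by country.\nValues are in current US dollars.\n\n## Data\n\nDetails."
    sentence, paragraph = summary_from_readme(sample)
    print(sentence)
    print(paragraph)
```

Writing the opening so that this kind of simple extraction yields a sensible standalone description is a practical way to satisfy the guideline.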

Data

Put specific information about the data in a Data section, for example the source of the data, its structure, and how missing values are handled.

Preparation

Put information on preparing the data in a Preparation section. In particular, instructions for running any preparation and processing scripts that generate the data should go here.

License

Put additional information on the permissions and licensing of the data in the Data Package in the License section.

Since data producers often do not state licensing terms clearly, the guideline here is to license the Data Package under the Public Domain Dedication and License, and then to add any relevant information or disclaimers regarding the source data.

See for example

See also the following thread: https://discuss.okfn.org/t/copyright-on-data-sources/189.

Validate and preview your Data Package

Use the Online validator to check that your datapackage.json and Data Package are good to go. Enter the URL of your Data Package in the input box and press Validate. If everything is fine, Status: Valid is returned.

Then use the Online Data Package viewer app to preview your Data Package.

Examples

For examples of well-structured Data Packages, see:

For tabular data: http://data.okfn.org/data/core/corruption-perceptions-index
For geospatial data: http://data.okfn.org/data/core/geo-nuts-administrative-boundaries