Data preparation

In this tutorial, we explain how to make your data compatible with the move-dl commands.

For this tutorial we will work with a dataset taken from Walters et al. (2008) [1]. In their work, they report soil microbiome census data along with environmental data (e.g., temperature and precipitation) of different cultivars of maize.

We will start by downloading the files corresponding to their OTU table and metadata.

Formatting omics data

The move-dl pipeline requires continuous omics input to be formatted as a TSV file with one column per feature and one row per feature.

If we load the microbiome OTU table from the maize rhizosphere dataset, it will look something like this:

Original OTU table

otuids

11116.C02A66.1194587

11116.C06A63.1195666

11116.C08A61.1197689

4479944

70

8

18

513055

2

16

1

519510

22

15

12

810959

5

0

3

849092

5

2

1

We have columns corresponding to samples and rows corresponding to features (OTUs), so we need to transpose this table for MOVE.

Transposed OTU table

sampleids

4479944

513055

519510

810959

849092

11116.C02A66.1194587

70

2

22

5

5

11116.C06A63.1195666

8

16

15

0

2

11116.C08A61.1197689

18

1

12

3

1

Now, we can save our table as a TSV and we are ready to go. No need to do any further processing.

Formatting other continuous data

Other non-omics continuous data is formatted in a similar way.

For this tutorial, we are going to extract some continuous data from the maize metadata table. Let us load the table and take a peek:

Original metadata table

X.SampleID

Precipitation3Days

INBREDS

Maize_Line

Description1

11116.C02A66.1194587

0.14

Oh7B

Non_Stiff_Stalk

rhizosphere

11116.C06A63.1195666

0.14

P39

Sweet_Corn

rhizosphere

11116.C08A61.1197689

0.14

CML333

Tropical

rhizosphere

11116.C08A63.1196825

0.14

CML333

Tropical

rhizosphere

11116.C12A64.1197667

0.14

Il14H

Sweet_Corn

rhizosphere

The original metadata table contains both categorical (e.g., Maize_Line) and continuous data (e.g., Precipitation3Days). We need to separate these into different files.

In this example, we select three columns: age, Precipitation3Days, and Temperature.

Extracted continuous data

X.SampleID

age

Temperature

Precipitation3Days

11116.C02A66.1194587

12

76

0.14

11116.C06A63.1195666

12

76

0.14

11116.C08A61.1197689

12

76

0.14

11116.C08A63.1196825

12

76

0.14

11116.C12A64.1197667

12

76

0.14

Once again, we can save this table as a TSV, and we are ready to continue.

Formatting categorical data

Categorical data like binary variables (e.g., with/without treatment) or discrete categories needs to be formatted in individual files.

The metadata table contains several discrete variables that can be useful for classification, such as maize line, cultivar, and type of soil. For each one of these, we need to create a separate TSV file that will look something like:

Extracted maize line data

X.SampleID

Maize_Line

11116.C02A66.1194587

Non_Stiff_Stalk

11116.C06A63.1195666

Sweet_Corn

11116.C08A61.1197689

Tropical

11116.C08A63.1196825

Tropical

11116.C12A64.1197667

Sweet_Corn

Creating a data config file

We are missing two components to make our data compatible with move-dl. First, we need to create an additional text file with all the sample IDs (one ID per line, see example below). This file tells MOVE which samples to use, so the IDs in this file must be present in all the other input files.

Maize sample IDs
11116.C02A66.1194587
11116.C06A63.1195666
11116.C08A61.1197689
11116.C08A63.1196825
11116.C12A64.1197667

Finally, we need to create a data config YAML file. The purpose of this file is to tell MOVE which files to load, where to find them, and where to save any output files.

The data config file for this tutorial would look like this:

defaults:
  - base_data

raw_data_path: maize/data/
interim_data_path: maize/interim_data/
results_path: maize/results/

sample_names: maize_ids

categorical_inputs:
  - name: maize_field
  - name: maize_line
  - name: maize_variety

continuous_inputs:
  - name: maize_metadata
  - name: maize_microbiome

Here we break down the fields of this file:

  • defaults indicates this file is a config file. It should be left intact.

  • raw_data_path points to the raw data location (i.e., the files we created in this tutorial).

  • interim_data_path points to the directory where intermediary files will be deposited.

  • results_path points to the folder where results will be saved.

  • sample_names is the file name of the file containing all valid sample IDs. This file must have a txt extension.

  • categorical_inputs is a list of file names containing categorical data. Each element of the list should have a name name and may optionally have a weight. All referenced files should have a tsv extension.

  • continuous_inputs lists the continuous data files. Same format as categorical_inputs.

The data config file can have any name, but it must be saved in config/data directory. The final workspace structure should look like this::

tutorial/
│
├── maize/
│   └── data/
│       ├── maize_field.tsv       <- Type of soil data
│       ├── maize_ids.txt         <- Sample IDs
│       ├── maize_line.tsv        <- Maize line data
│       ├── maize_metadata.tsv    <- Age, temperature, precipitation data
│       ├── maize_microbiome.tsv  <- OTU table
│       └── maize_variety.tsv     <- Maize variety data
│
└── config/
    └── data/
        └── maize.yaml            <- Data configuration file

With your data formatted and ready, we can continue to run MOVE and exploring the associations between the different variables in your datasets. Have a look at our introductory tutorial for more information on this.

References