
Good coding and dataset storage practices

You know that feeling when you open the folder of a student after they’ve spent a few months in the lab… and all you see are files named test1.mat, ramantest_powerlaser3.wdf, nirtest.raw, untitled2.m, new_script2_final.py, and copy_of_new_script_FINALdefinitely.txt?

Honestly, we can all agree this should be illegal (maybe even punished with jail time).

Jokes aside 😄, clean code and organized datasets also show professionalism and respect for others. They prove that you care about quality and understand that teamwork (and science!) works better when we can actually read each other’s functions and data.

Lovelace’s Square wants to build a repository of code and data that is easy to use, well-organized, and powerful. Since we are building a shared repository, and because this repository may later support Ada or other machines, it’s even more important to keep everything clear, clean, and consistent. Well-structured code and datasets help Ada give better answers and ensure that both people and machines can navigate the content without confusion.

By following some guidelines, you’ll not only make your fellow coders smile, but also help keep our shared space neat and useful. Importantly, although these guidelines are not mandatory, we strongly encourage you to follow them.


General Principles

Before going into details, it is important to understand some basic principles that will help us keep both code and datasets clear, useful, and easy to share.

  • Be consistent: Use a consistent structure and naming style across your files. This makes it easier to navigate, reduces confusion, and improves collaboration.

  • Make it clear and readable: Choose names that describe the content. Use clear function and variable names in code. In datasets, label columns properly and include units if needed. Avoid vague names like x, temp, or data1.

  • Stay organized: Keep related items together. Use folders for different types of files (e.g. raw/, processed/, scripts/, results/). Name your files so others can understand what they are without opening them.

  • Document your work: Always include a short explanation. For code, use comments and headers to describe what the function does and how to use it. For datasets, include a README.txt or metadata file to explain where the data comes from and what each column means.

  • Be efficient, but not at the cost of clarity: Write clean, simple code and avoid unnecessary complexity. For data, avoid large, unused files. Always balance performance with readability and usability.

  • Use open and common formats: Save data in formats like .txt, .csv, or .xlsx. Write code using standard tools (e.g. MATLAB, Python, R) and avoid formats that are hard to open or that require specialized, proprietary software.

  • Follow FAIR principles: Add clear descriptions, use standard formats, and ensure others can use your work without needing to ask you directly.

FAIR stands for:

  • Findable: Others can find your work easily thanks to good naming, documentation, and metadata.
  • Accessible: Your code and data are available to others (open source, no hidden folders or password-protected ZIPs).
  • Interoperable: Your work should be usable with other tools, platforms, and programming languages. This means using open, standard, or widely supported formats (like .csv, .py, .xml, or .m) and writing code that follows common practices. Whether you’re working in MATLAB, Python, R, or elsewhere, aim to make your code and data easy to integrate with other systems, tools, or workflows.
  • Reusable: Your code should be understandable, well-documented, and clearly licensed, so others can actually use it, not just today, but in future projects or by different teams. Don’t be afraid to add comments, examples, or explanations. Clear code is kind code.

FAIR principles help turn your one-time analysis into something bigger: a resource others can build on, verify, or even teach with.


So with all that in mind, we’ve collected some key tips to help you write code that not only works well, but also still makes sense a few months from now. The examples in this documentation are written in MATLAB, but the same ideas extend to other languages.

Tips for code

Documentation

Good documentation is one of the most valuable parts of a project. It helps others understand your code and makes it easier to reuse, share, and improve.


Function headers

We strongly recommend using this template for your function headers. It gives users everything they need to understand how the function works.

function [output1, output2] = FunctionName(input1, input2)
% FUNCTIONNAME Brief description of the function
%
% Authors: Your Name
% Date Created: YYYY-MM-DD
% License: Specify your license here
% Version: Specify the version.
% Reviewed by Lovelace's Square team: Yes/No
%
% Detailed function description:
% Here's where you can really shine! Explain what your function does,
% any algorithms it uses, and any quirks or special features.
%
% Args:
% input1 (type): Description of the first input parameter.
% input2 (type): Description of the second input parameter.
%
% Returns:
% output1 (type): Description of the first output parameter.
% output2 (type): Description of the second output parameter.
%
% Example:
% [result1, result2] = FunctionName(arg1, arg2)
%
% See also: RELATEDFUNCTION1, RELATEDFUNCTION2

    % Your brilliant code goes here
end
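
As a quick illustration, here is the template applied to a small, hypothetical normalization function (the function name, authorship details, and contents are invented for this example):

function [normData, stats] = NormalizeSpectrum(spectrum)
% NORMALIZESPECTRUM Normalize a spectrum to zero mean and unit variance
%
% Authors: Ada R. Lovelace
% Date Created: 2025-03-10
% License: MIT
% Version: 1.0
% Reviewed by Lovelace's Square team: No
%
% Detailed function description:
% Applies a z-score normalization so that spectra recorded under
% different conditions can be compared on a common scale.
%
% Args:
% spectrum (double vector): Raw intensity values.
%
% Returns:
% normData (double vector): Normalized intensity values.
% stats (struct): Mean and standard deviation used for the scaling.
%
% Example:
% [normSpec, stats] = NormalizeSpectrum(rawSpectrum)

    stats.mean = mean(spectrum);
    stats.std  = std(spectrum);
    normData   = (spectrum - stats.mean) / stats.std;
end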

README files

Every project or main folder should include a README.txt file. It explains what the code does, how to use it, and how to install or run it.

  • myProject/
    • README.txt
    • mainScript.m
    • functions/
      • calculateSomething.m
      • plotResults.m

Version control

Version control is your time machine for code. It lets you track changes, collaborate smoothly, and bring back older versions when things go wrong.

  • Commit regularly: Make small, frequent commits with clear and descriptive messages. This helps you and your collaborators understand the history of changes.

  • Use branches: Create separate branches for features, fixes, or experiments. This keeps the main branch clean and reduces the risk of breaking important code.

  • Test before pushing: Make sure your code runs correctly before pushing to shared repositories. Avoid adding broken or incomplete code to the main branch.
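
As a minimal sketch of such a cycle (assuming git is installed and on your path; the branch, file, and folder names below are invented for illustration), you could run it straight from the MATLAB command window using shell-escape commands:

% Work on a separate branch so the main branch stays clean
!git checkout -b feature/baseline-correction

% Make sure the code still runs before sharing it
results = runtests('tests');   % assumes your tests live in a tests/ folder

% Commit a small, focused change with a descriptive message
!git add BaselineCorrect.m
!git commit -m "Add baseline correction for FTIR spectra"
!git push origin feature/baseline-correction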


Code style

Indentation

Use consistent indentation: Indentation makes your code easier to follow. Use 4 spaces per level. This helps show where blocks of code start and end.

if isValid
    result = processData(data);
else
    result = [];
end

Line length

Keep lines under 75 characters. Break long lines using ... to improve readability.

summaryText = ['The analysis shows a strong correlation ' ...
               'between variables.'];

Spaces around operators

Add spaces around operators to make your code easier to read.

result = (a + b) / 2;
count = count + 1;

Use blank lines

Separate logical sections with blank lines to improve clarity.

data = load('datafile.mat');

% Normalize the data
normData = (data - mean(data)) / std(data);

% Plot the result
plot(normData);

Align similar lines to make code easier to scan.

maxIterations = 100;    % maximum number of iterations
maxDepth      = 20;     % maximum depth
minError      = 0.01;   % minimum error threshold

Use parentheses to clarify logic

Use parentheses in complex expressions to show order of operations clearly.

result = (a * b) + c - (sqrt(d) * e);

Commenting

Use comments to explain what your code is doing, especially for complex logic.

% Normalize the input values
normData = (data - mean(data)) / std(data);
% Check for outliers
isOutlier = normData > 3;

Descriptive names

Avoid using single letters or vague terms. Use names that explain what the variable or function does.

velocity = 12.5;
result = calculateMean(data);

Functions and classes: UpperCamelCase

Start each word with a capital letter. Do not use underscores.

MyAwesomeFunction()
ChemometricAnalyzer()

Variables and properties: lowerCamelCase

Start with a lowercase letter. Capitalize the first letter of each new word.

myVariable = 5;
dataMatrix = rand(10);

Constants: UPPERCASE

Write constants in all caps to show they are fixed values.

MAX_ITERATIONS = 100;
DEFAULT_TIMEOUT = 30;

Functions and scripts

Functions and scripts are the basic parts of your code. Here is how to use them well:

One function per file. Each function should be in its own file. This makes it easier to find, understand, and update.

  • NormalizeData.m
  • CalculateMean.m
  • LoadAndPlot.m

Use functions instead of scripts. Functions are more reliable because they keep their own variables. This avoids conflicts and makes the code cleaner and easier to test.

function result = CalculateMean(data)
    result = sum(data) / numel(data);
end

Use sections in long scripts. If your script is long, break it into sections using %%. This helps you run or debug parts of the code without running everything at once. It also makes your script easier to read and follow.

%% Load Data
data = load('datafile.mat');

%% Process Data
normData = (data - mean(data)) / std(data);

%% Plot Results
plot(normData);

Group related functions in packages or toolboxes. When functions are related, put them in the same folder or package. This keeps your project organized and makes it easier to reuse or share.

  • +preprocessing/
    • normalizeData.m
    • scaleData.m
  • +analysis/
    • calculateMean.m
    • calculatePCA.m
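
For reference, functions stored in a + package folder are called with the package name as a prefix. Here is a minimal sketch using the folders above (rawSpectra is an invented variable for illustration):

% Call package functions with the package name as a prefix
normSpectra = preprocessing.normalizeData(rawSpectra);
avgSpectrum = analysis.calculateMean(normSpectra);

% Alternatively, import a package once and drop the prefix
import preprocessing.*
normSpectra = normalizeData(rawSpectra);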

Tips for datasets

Datasets are as important as code. Here is how to prepare and store data so it is clear, reusable, and reliable.


Organization

Keep your data files in a clear folder structure. Group related files together and name folders meaningfully.

  • data/
    • raw/
      • sample1.csv
      • sample2.csv
    • processed/
      • sample1_clean.csv
      • sample2_clean.csv
    • docs/
      • metadata.json
      • README.txt

Documentation & Metadata

Every dataset should include clear documentation and metadata so users know what the data is, where it came from, and how to use it. At minimum, each data folder needs:

  1. README file (e.g. README.txt or README.md) – An overview of the dataset:

    • Title and Brief description of the data
    • Source: Where the data originated (e.g. instrument, study, or publicly available repository)
    • Directory structure: What files or subfolders exist and what they contain
    • Usage: How to load or process the data, including any required software or dependencies
    • Contact: Who to reach if there are questions
  2. Metadata file (e.g. metadata.json or metadata.txt) – A machine-readable record:

    • File names and column descriptions (with units)
    • Date collected and author/institution
    • Format and version of the dataset
    • License or usage terms

Example README.txt:

# Infrared Spectra Dataset
## Description
This dataset contains raw and processed infrared spectra of chemical samples, collected using Fourier-transform infrared (FTIR) spectroscopy. It was created as part of a study on solvent concentration and spectral variation in analytical chemistry.
## Source
- Instrument: FTIR Model X
- Location: Lovelace Lab, Lovelace University
- Date Collected: March 2025
- Experiment: Solvent concentration and spectral profile study
## Authors
- Dr. Ada R. Lovelace (ada.lovelace@lovelacesquare.org)
- Clara Babbage
- Team Lovelace's Square – Spectroscopy & Chemometrics Unit
## Publication
If you use this dataset, please cite:
A. R. Lovelace, C. Babbage. "Spectral Response to Solvent Concentration in FTIR." Journal of Chemometric Data, 2025.
DOI: 10.1234/jcd.2025.0456
## Structure
data/
├── raw/
│   ├── sample1.csv            # unprocessed FTIR spectra
│   └── sample2.csv
├── processed/
│   ├── sample1_clean.csv      # baseline corrected, smoothed
│   └── sample2_clean.csv
└── docs/
    ├── README.txt
    └── metadata.json
## Format
- Files are in .csv format with two columns:
  - wavenumber (cm⁻¹)
  - intensity (absorbance)
## Usage
raw = readtable('data/raw/sample1.csv');
clean = readtable('data/processed/sample1_clean.csv');
plot(raw.wavenumber, raw.intensity);
## License
This dataset is licensed under CC BY 4.0 – you are free to use, modify, and share it, as long as you give proper credit.
## Contact
Questions? Email lab@lovelacesquare.org

Example metadata.json:

{
  "sample1.csv": {
    "date_collected": "2025-03-10",
    "instrument": "FTIR Model X",
    "location": "Lovelace Lab, Lovelace University",
    "experiment": "Solvent concentration and spectral profile study",
    "columns": {
      "wavenumber": "cm^-1",
      "intensity": "absorbance"
    },
    "notes": "Raw spectral data collected without preprocessing."
  },
  "sample1_clean.csv": {
    "date_processed": "2025-03-12",
    "processing_steps": [
      "baseline correction",
      "smoothing",
      "normalization"
    ],
    "software": "MATLAB R2025a",
    "columns": {
      "wavenumber": "cm^-1",
      "intensity": "absorbance (processed)"
    },
    "notes": "Processed version of sample1.csv ready for analysis."
  }
}
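
If you want to use this metadata programmatically, MATLAB can read it directly. A minimal sketch (the path assumes the folder structure shown above):

% Read the machine-readable metadata into a MATLAB struct
metadata = jsondecode(fileread('data/docs/metadata.json'));

% jsondecode converts JSON keys into valid MATLAB identifiers,
% so "sample1.csv" becomes a field name such as sample1_csv
entries = fieldnames(metadata);
disp(entries)

% Inspect the column units recorded for the first entry
disp(metadata.(entries{1}).columns)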

FAIR for Data

Apply FAIR principles to your datasets:

  • Findable: Add clear names, README, and metadata.
  • Accessible: Use open formats and avoid paywalled locations.
  • Interoperable: Follow common standards (e.g. CSV, JSON, .py, .m).
  • Reusable: Include license information and usage examples.

Let’s keep Lovelace’s Square neat and useful!

Writing clean code and keeping data organized isn’t just about following rules. It shows respect to your team, helps you find your work easily later, and encourages others to build on what you’ve done. When we all do our part, Lovelace’s Square becomes a space where great science happens easily.