OLIVER SERANG

PROGRAMMING STYLE GUIDE

Use two-space indentation.

In C/C++, do not use braces to enclose a single line.

for (unsigned long i=0; i<N; ++i)
  // do something

In C/C++, use walk-like-an-Egyptian braces:

for (unsigned long i=0; i<N; ++i) {
  // do some stuff
  // do some more stuff
}

Always use long or unsigned long for iteration unless you know you will need fewer possible values.
Class names should be upper camel-case (e.g., SparseVector). Function names should be lowercase underscored (e.g., pack_dense_vector_to_sparse).
Use C++11 iteration notation where possible: for (const std::string & peptide : all_peptides).
In C++, only use auto if (1) typing is clear from context and (2) writing the type explicitly is cumbersome (e.g., when assigning a lambda function).

Name variables and functions specifically to describe their task. fix_string should instead be named replace_k_amino_acid_with_r_amino_acid. Your code should be able to be read like a story:

class Girl:
  def sleep_in_bed_from_bear(self, bear_name):
    if bear_name == 'Papa':
      result = 'too hard'
    elif bear_name == 'Mama':
      result = 'too soft'
    elif bear_name == 'Baby':
      result = 'just right'
    else:
      assert(False)
    print(bear_name + "'s bed was " + result)

goldilocks = Girl()
for bed_for_which_bear in ['Papa', 'Mama', 'Baby']:
  goldilocks.sleep_in_bed_from_bear(bed_for_which_bear)

Comment any code whose task is not immediately clear. Place each comment above the code it describes and end it with a colon. Place whitespace above the comment to show that the comment is paired with the code below it.

There are four types of comments:

General comment:
These types of comments communicate information about your code.

// Find the argmin:
long min_index = -1;
double min_val = inf;
for (long i=0; i<N; ++i) {
  double val = arr[i];
  if (val < min_val) {
    min_index = i;
    min_val = val;
  }
}

Note:
These types of comments are for illustrative purposes, e.g., for improvements that may later be helpful, but which are not planned.

// Note: currently sorts to find the median;
// this will be slower than O(n) median-of-medians,
// but will not yet limit performance:
std::sort(vec.begin(), vec.end());
double median;
if (N % 2 == 1)
  median = vec[N/2];
else
  median = (vec[N/2-1] + vec[N/2]) / 2.;

todo:
These comments are used for improvements that are planned but that do not alter the correctness of the code.

// todo: replace this with an O(n log(n)) sort:
sort_naive(vec);

fixme:
These comments are used for changes that are necessary for correctness (or even necessary to compile the code).

// fixme: currently assumes N is odd:
median = vec[N/2];

Use 2. instead of 2.0 or 2 as the format to distinguish floating point types; however, when the magnitude is <1, use 0.7 rather than .7.
Where two functions do the same task but with a different mechanism, name the functions with the task first and then the mechanism: e.g., bit_reverse_naive_bitwise, bit_reverse_bytewise_table. This is because the user of this function gives primacy to the task at hand; the method for accomplishing it is less important than the problem it solves.
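
As a rough sketch of this naming scheme, the two functions below both reverse the lowest num_bits bits of an integer, one bit at a time and one byte at a time. The bodies and the table name BYTE_TO_REVERSED_BYTE are illustrative assumptions, not a reference implementation; both assume value fits in num_bits bits.

```python
def bit_reverse_naive_bitwise(value, num_bits):
  # Move the lowest bit of value onto the end of result, one bit at a time:
  result = 0
  for _ in range(num_bits):
    result = (result << 1) | (value & 1)
    value >>= 1
  return result


# Precompute the reversal of every possible byte once:
BYTE_TO_REVERSED_BYTE = [bit_reverse_naive_bitwise(byte, 8) for byte in range(256)]


def bit_reverse_bytewise_table(value, num_bits):
  # Reverse whole bytes via the table, then discard the padding bits:
  num_bytes = (num_bits + 7) // 8
  result = 0
  for _ in range(num_bytes):
    result = (result << 8) | BYTE_TO_REVERSED_BYTE[value & 0xFF]
    value >>= 8
  return result >> (8*num_bytes - num_bits)
```

A caller searching for bit reversal finds both functions next to one another under the shared bit_reverse prefix, and the suffix says how each one works.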
Use the following a_to_b syntax for maps/dictionaries: std::map<std::string, std::vector<std::string> > protein_to_peptides. Note the specific use of plurals: a value is pluralized only when multiple things go in or come out; a dictionary of one husband to one wife should not be named ~~husbands_to_wives~~.
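
The same a_to_b convention might look as follows in Python. The protein and peptide strings are made-up placeholders, and peptide_to_proteins and peptide_to_length are hypothetical names chosen only to show when each side is pluralized.

```python
# One protein -> multiple peptides, so only the value is plural:
protein_to_peptides = {
  'protein_a': ['PEPTIDEK', 'SAMPLER'],
  'protein_b': ['SAMPLER'],
}

# One peptide -> multiple proteins, built by inverting the map above:
peptide_to_proteins = {}
for protein, peptides in protein_to_peptides.items():
  for peptide in peptides:
    peptide_to_proteins.setdefault(peptide, []).append(protein)

# One peptide -> one length, so both sides are singular:
peptide_to_length = {peptide: len(peptide) for peptide in peptide_to_proteins}
```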
Don't repeat yourself. Duplicate code should be refactored out ASAP. Often, the abstract base class (ABC) design pattern is useful for this. Once again: don't repeat yourself.
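
A minimal sketch of this kind of refactoring, assuming "ABC" here refers to an abstract base class: the median logic is written once in the base class, and each derived class supplies only the sorting mechanism. All class names below are hypothetical.

```python
from abc import ABC, abstractmethod


class MedianFinder(ABC):
  # Shared logic lives once in the base class:
  def median(self, values):
    ordered = self.sort(list(values))
    n = len(ordered)
    if n % 2 == 1:
      return ordered[n//2]
    return (ordered[n//2 - 1] + ordered[n//2]) / 2.

  # Each derived class supplies only the part that differs:
  @abstractmethod
  def sort(self, values):
    pass


class MedianFinderBuiltinSort(MedianFinder):
  def sort(self, values):
    return sorted(values)


class MedianFinderNaiveSort(MedianFinder):
  def sort(self, values):
    # Selection sort: repeatedly move the smallest remaining value to the result:
    remaining = list(values)
    result = []
    while remaining:
      smallest = min(remaining)
      remaining.remove(smallest)
      result.append(smallest)
    return result
```

Without the base class, the odd/even median logic would be copied into both classes, and a fix to one copy could easily miss the other.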
Whitespace should help someone to read your code, just as pauses in conversation help introduce a change of topic.
Use symmetric whitespace. if (lhs ==rhs) should be replaced with if (lhs==rhs) or if (lhs == rhs).
Performant and concise code are both good, but readable code is more important. Optimize for readability first, then performance and brevity.
Unless constexpr types are necessary, do not update a variable (e.g., problem size) and recompile and run to test different problem sizes. Instead, accept command-line parameters.
All command line tools should have a usage:... statement and exit(1) when no arguments are provided. The usage statement should explain how to run the tool.
Keep your functions short (< ≈40 lines) wherever possible. Corollary: do not put substantial code in main. Call other functions that will do what needs to be done.
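
The preceding three guidelines can be sketched together in a small Python tool. The task (a sum of squares) and every function name here are hypothetical; the point is only the shape: the problem size comes from the command line, a usage statement and exit(1) handle missing arguments, and main delegates to short functions.

```python
import sys


def usage_statement(tool_name):
  return 'usage: ' + tool_name + ' <problem_size>'


def problem_size_from_arguments(arguments):
  # arguments[0] is the tool name; the problem size must follow it:
  if len(arguments) != 2:
    print(usage_statement(arguments[0]))
    sys.exit(1)
  return int(arguments[1])


def sum_of_squares(problem_size):
  return sum(i*i for i in range(problem_size))


def main(arguments):
  problem_size = problem_size_from_arguments(arguments)
  print(sum_of_squares(problem_size))
```

In a script, the last line would call main(sys.argv), so that a different problem size can be tested without editing the source.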
In Python, import numpy as np and import pylab as P. Other libraries should be imported with their names (e.g., import sympy) or surgically using only the functions you need (e.g., from scipy.signal import fftconvolve).
In C++, do not write ~~using namespace std;~~.
Separate concerns so that each function and object has one, atomic task. Avoid performing other tasks.
In C++, when calling a function inherited from a base class, use the function name directly. If this does not work because of templated objects, use the explicit version BaseClassName::function_name() instead of the implicit version ~~this->function_name()~~.
If a variable name is plural, name it with an s in the appropriate place. If the plurality refers to multiple objects together, use _s:

// one protein -> multiple peptides:
std::unordered_map<std::string, std::vector<std::string> > protein_to_peptides;

// several proteins:
std::vector<std::string> proteins_sharing_peptide;

// (protein,length) x several:
std::vector<std::pair<std::string, unsigned> > protein_and_length_s;

If a variable name is possessive, name it without an s and without an apostrophe: e.g., car_transmission.
Unless it would cause great trouble, use one class per file.
Compile with -O3 -march=native -std=c++17.
Code should compile without warnings when using -Wall.

SCIENTIFIC PRACTICE

Test ideas in Python; implement the long-term version in object-oriented C++.
Every project should have this structure: /src/, /doc/papers/, and /doc/presentations/, omitting any directory that would be empty. Each paper should have its own subdirectory, e.g., /doc/papers/inference-is-subquadratic and /doc/papers/practically-fast-solutions-to-knapsack. Directories should be named in lowercase, with hyphens in place of whitespace. A source file that contains a class should be named in upper camel-case with the same name as the class (i.e., SparseVector.cpp should contain the definition of the class SparseVector). A source file containing only one function should be named with the function's name, which should be lowercase with underscores in place of whitespace (e.g., sort_naive). A source file containing multiple functions should be named descriptively with the theme of the functions in the file (e.g., bit_reversal). Functions should not be in the same file if they do not share a theme. Figures included in paper subdirectories should be located in /doc/papers/paper-name/figures. If scripts are needed to build a particular figure, they should be included in a subdirectory named /doc/papers/paper-name/figures/figure-description. Include a Makefile for each paper and in source code where necessary.
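
The directory naming rules above can be sketched as a pair of small helpers. The function names are hypothetical and exist only to illustrate the lowercase, hyphenated convention; nothing here is required tooling.

```python
import re


def lowercase_hyphenated(text):
  # Lowercase the text and join its words with hyphens:
  return '-'.join(re.findall(r'[a-z0-9]+', text.lower()))


def paper_title_to_directory(paper_title):
  return '/doc/papers/' + lowercase_hyphenated(paper_title)


def figure_description_to_directory(paper_title, figure_description):
  paper_directory = paper_title_to_directory(paper_title)
  return paper_directory + '/figures/' + lowercase_hyphenated(figure_description)
```

For example, paper_title_to_directory('Inference is subquadratic') gives /doc/papers/inference-is-subquadratic.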
Do not optimize an implementation until you are confident it is doing what it is supposed to. It is far better to have something slow that works than something fast that doesn't (I can give you a wrong answer very fast!).
Be honest with yourself and others: if something isn't yet reliable, don't put your head in the sand. Work to fix it. Do not simply paint over rust.
Confidence that something is working comes from making unit tests that solve problems with known solutions. These known solutions will often come from another (slower or less concise) existing implementation.
If you're comparing things, use an odd number of replicates so that you will not have ties. Three is generally too small, so five replicates is a good starting place.
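
A small sketch of why an odd count helps, assuming each replicate produces a clear winner between two hypothetical methods a and b: with an odd number of replicates, one method must hold a strict majority.

```python
def winner_from_replicates(a_wins_per_replicate):
  # With an odd number of replicates, one side must win a strict majority:
  number_of_replicates = len(a_wins_per_replicate)
  assert number_of_replicates % 2 == 1
  wins_for_a = sum(a_wins_per_replicate)
  if 2*wins_for_a > number_of_replicates:
    return 'a'
  return 'b'
```

With five replicates, the possible splits are 5-0, 4-1, 3-2, and their mirrors; none of them is a tie.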
Test for memory errors early on with valgrind: valgrind --tool=memcheck --leak-check=full ./a.out.
Before optimizing, profile with valgrind: valgrind --tool=callgrind ./a.out and view the results with kcachegrind. Do not leave around temporary files output by callgrind; by default, the most recent file will be loaded by kcachegrind, and you may accidentally load an incorrect file.
In general, keep your workspace clean. Files should either be checked in or deleted as unnecessary as soon as they've served their purpose.
During prototyping, code can be messier. But that code should be crystalized into organized, object-oriented code ASAP.
Git commits should have detailed comments. Do not commit unless you're sure everything is working!
Only check into git files that are human generated or that have been deliberately cached (e.g., a text file with a series, which a script may turn into a figure). Never use ~~git add *~~.
If a binary file has been committed and pushed, it can be removed from the git history using the following steps. First, use git reset <hash> where <hash> is the hash of the latest commit without the undesired file. Then use git add to add all the files you want to keep that were affected by the reset. In the commit message, make note that you are rewriting the commit history. Then use git push -f to force an update. Finally, make sure you let everyone with write access to the repository know that the history has been rewritten and that they should reclone the repo (this prevents them from accidentally merging back in the unwanted binary file).
Write your code to do exactly the job for which it is written.
Put your projects in ~/git-projects.

SCIENTIFIC WRITING & PRESENTATION

Write with the goal of making the idea as accessible as possible to anyone reading.
In presentations, use analogies wherever it helps to make the ideas accessible.
Generally, a couple means 2, a few means 3 to 5, several means >5.
Unless it would lead to multiple semicolons in one sentence, almost always use the following pattern for using the words "however" and "therefore":
I have a tree; therefore, I can make firewood.
I have a tree; however, I would like to have a forest.
After a colon, begin with a capital letter if multiple sentences follow from the idea used in the colon. Use a lowercase letter if only one sentence follows closely.
Living in the mountains is not trivial: There is no electricity. And each winter brings at least three feet of snow.
Keep ordering consistent: if a paper introduces methods in the order sort_selection, sort_insertion, sort_merge, then the subsections for these methods should be listed in the same order.
Use an Oxford comma:
~~A, B and C.~~ A, B, and C.
When writing a series, write the most complex, most lengthy items last.
~~caboodle and kit~~ kit and caboodle
~~my dog, some cat, which is probably a stray, and your dog~~ my dog, your dog, and some cat, which is probably a stray
When listing several values, a colon should follow an independent clause.
We have several flavors of ice cream: vanilla, chocolate, strawberry, coconut, huckleberry, and liver disaster.
In the body of text, write the word for numeric values <10 (e.g., "seven"); for larger values, you may write them numerically (e.g., 13). In a numeric or more scientific context, these values can all be written numerically. Try to use the one that you think is the most intuitive for people to understand and recreate what you're doing.
A pronoun refers to the last relevant object. In the passage below, "that algorithm" refers to merge sort.
"Insertion sort is a quadratic sorting algorithm, whereas merge sort is subquadratic. That algorithm is..."
Avoid pronouns where possible.
~~Insertion sort is a quadratic sorting algorithm, whereas merge sort is subquadratic. That algorithm is...~~ Insertion sort is a quadratic sorting algorithm, whereas merge sort is subquadratic. Merge sort is...
``I.e.'' means ``that is''; ``e.g.'' means ``for example''. Use i.e. when one idea naturally follows. Use e.g. when multiple ideas could follow, but you've chosen one particular example.
~~Prepositions are things you should try not to end sentences with.~~ You should try not to end sentences with prepositions.
If you make an assumption for the sake of simple explanation, but the idea generalizes so that the assumption is not necessary, write "w.l.o.g.":
"We will prove that any two nonnegative values, x and y, must have a maximum value ≤x+y. W.l.o.g., let x≥y; thus max(x,y)=x≤x+y, because y≥0." This means that we will assume that x≥y, but that the proof would hold if we swapped the values into y'=x and x'=y so that y'≥x'.
``Which'' is used to elaborate on an already chosen object. It is preceded by a comma: ``I like my truck, which is primer gray.'' ``That'' is used to narrow the discourse to specific objects. It is not preceded by a comma: ``Do you have any Christmas trees that are tall?''
Isotopes are copies of the same element with differing numbers of neutrons. Oxygen has three (stable) isotopes: ¹⁶O has eight neutrons, ¹⁷O has nine neutrons, and ¹⁸O has ten neutrons. Subisotopologues are all the copies of a single element in a chemical compound. Bovine insulin, C₂₅₄H₃₇₇N₆₅O₇₅S₆, has five subisotopologues: C₂₅₄, H₃₇₇, N₆₅, O₇₅, and S₆. Isotopologues are chemical compounds that differ only in their isotopic makeup. ¹H₂¹⁶O, ¹H₂¹⁷O, and ¹H₁²H₁¹⁶O are three of the nine stable isotopologues of water.
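
To make the subisotopologue definition concrete, the sketch below splits a chemical formula into one entry per element. It assumes a flat formula written in plain ASCII (no parentheses, each element listed once), and the function name is hypothetical.

```python
import re


def formula_to_subisotopologues(formula):
  # An element symbol is one capital letter, optionally followed by a
  # lowercase letter; a missing count means a single atom:
  subisotopologues = []
  for element, count in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
    subisotopologues.append(element + (count if count else '1'))
  return subisotopologues
```

Applied to the bovine insulin formula written as 'C254H377N65O75S6', this yields the five subisotopologues listed above.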

Figure 1: Be amazing.