Monolithic Repository

Warning: Do not confuse a monolithic repository with a monolithic application. They are entirely different things.

Unless you're at a FAANG-sized company, or simply dislike the idea, there are few technical reasons, if any, against using the monorepo approach to organise a project that has multiple moving parts. Go projects in particular can benefit from it.

Moreover, teams are likely to benefit from this technique, as there are several strong arguments in its favour:

  • simplified workflows for CI/CD and testing
  • reduced burden with managing internal dependencies
  • significantly less chores with permissions/access
  • unified tooling
  • easier to package/ship and so forth.

Of course, there are some limitations and downsides, but they don't come into effect until at least a couple of hundred thousand SLOC have been committed across a couple of million commits.

A valid concern may be access control, but that's only partially true. If the security policies at a company make sense and are implemented and respected well, there is little to worry about. The experience of companies for which security is the biggest asset suggests that attitude and mindset are what matter. When everyone is accountable for what they're doing, there is no need to introduce artificial barriers.

The decision on whether to use a monorepo is up to you, the reader. The goal of this section is not to push you towards this approach, although it makes sense more often than not. This work covers various aspects of software development, is based on experience, and provides practical advice and guidelines. Several services will be introduced in the third module (quite far from this point), and the materials do use a monolithic repository.

With that in mind, let's talk about how to structure a monorepo in a way that makes it better to work with. As there are several ways to organise a monorepo, it's worth getting familiar with some of them to understand what makes a good one. Finally, one reasonable approach will be introduced, and we'll learn more about it.

There is no official classification of monolithic repositories. Yet some common traits can be derived from investigating and working with different projects. Experience shows that most of the time a monorepo is of one of the following types:

  • naive
  • based on services
  • focused on entities.

Let's have a quick look at these and consider their pros and cons. Then we introduce an alternative, the structured approach.

Naive Monorepo

A naive monorepo is exactly what its name suggests. It's what usually happens when a team decides to use a monorepo without taking the time to learn the approach and give proper thought to how to make it work:

  • a new repository is created
  • then each of the existing services is copied over to the new repo, as is
  • done
  • alternatively, everything can be moved into one of the existing repositories, but this doesn't make much of a difference.

As a result, the team gets just a rather large directory with lots of semi-independent parts and lots of duplicated scripts, files, and whatnot. One of the immediate benefits seems to be that now something from service A can be imported by service B, but this quickly becomes troublesome, especially if the original repositories did not follow suggestions such as those listed above, which is highly likely.

This approach has many downsides. Besides being messy and full of duplication, it is the shortest path to creating dependency cycles. This often leads either to a poor hierarchy and structure created just for the sake of breaking a cycle, or even to concluding that the monorepo was a mistake and should be avoided. Neither of those leads to productive and efficient collaboration.

It's hard to justify going this way. Instead, plan the migration, and then execute the plan iteratively. You need to come up with a good structure that suits your project and will serve its needs for years. Don't rush: think, plan, try, and only then execute. Not the other way around.

After realising that the new arrangement is no better than what existed before the move, the team may decide to organise the repository somehow. One obvious way is to use the service as a boundary and splitting criterion. This is how a naive monorepo may evolve into a service-based one.

Service-Based Monorepo

The service-based approach is a slight improvement over the naive one. The main difference is that some of the components duplicated among services are unified, e.g. CI and build routines, but the codebase continues to use services as package boundaries. Put simply, each folder at the root level contains a service along with everything in that service's scope - data types, business and transport logic, etc. When a service needs something that's already implemented elsewhere, it just imports it. New functionality is developed within the boundaries of a service.

While it might work for some time, such a repository still has exactly the same major downside as the naive one - it's too easy to end up with a dependency cycle, especially when you try to re-use code with business logic. Also, there isn't much order, since data, logic and utility code are spread across the entire codebase.

A few other serious downsides enter the stage at this point, caused by importing different parts of various services:

  • increased sizes of binaries
  • increased compilation times
  • not always clear what to do with tests.

As the project evolves, it might seem natural to group code based on the entities it belongs to. Here is how a monorepo may transform into an entity-focused one.

Entity-Focused Monorepo

The entity-focused technique organises code around particular entities. Packages are often created for different units of the project's business domain, such as user, photo, library and so forth. Developers add logic to the appropriate packages, and then use it in services.

This approach is a bit better than the previous two. If implemented correctly, it allows working on a service's parts separately from the business logic.

Still, there are two potential problems:

  • different levels of representation and responsibility could be mixed together, such as data types, methods for accessing storage, business logic and transport details
  • a major risk of creating cycles, specifically at three levels:
    • entity - when several entities depend on each other
    • business logic - when a business process depends on the logic of several entities
    • transport and representation - when representations of several entities/processes depend on each other.

The second issue comes from the fact that it's rare for entities, business logic and their representations to be independent from each other, i.e. (a sketch follows the list):

  • entities often aggregate or are parts of other entities
  • business logic for one process depends on, or is included in, another process
  • representation for one area of the domain requires other parts.
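
To make the risk concrete, here is a minimal sketch of two entity packages locked in a cycle; the import paths and field names are illustrative:

// user/user.go
package user

import "example.com/project/photo" // hypothetical module path

// User reaches into the photo package for its avatar...
type User struct {
    ID     int64
    Avatar photo.Photo
}

// photo/photo.go
package photo

import "example.com/project/user"

// ...while Photo reaches back for its owner, closing the loop.
// The Go compiler rejects this with "import cycle not allowed".
type Photo struct {
    ID    int64
    Owner user.User
}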

Is there a solution that addresses the problems described above? How do we organise a repository so that the risk of dependency cycles is reduced to a minimum (or better, to zero)? How do we make it easy to re-use entities and logic in different services? How do we make developers happier, and services better organised?

The first problem is addressed by making better package and architectural design decisions. The former is the subject of Unit 2 in this module, the latter is the topic of Module 3. Some suggestions will be given shortly, in the very next section.

There is no ultimate solution, of course. But there is an option which, if implemented carefully and with everyone respecting the process, can help achieve better efficiency and maintainability.

The Structured Monorepo

The structured approach is based on grouping code by responsibility and by the level at which objects play their roles. In other words, things are put together by what they do and where they belong:

  • data layer (models)
  • database layer (repositories)
  • business processes (controllers)
  • transport representations (handlers) and so on.

By organising code this way, we avoid the problems described above, and gain some additional benefits (a sketch follows the list):

  • at the model level, any model can safely relate to another
  • at the business process level, any process is free to do the following, without the risk of introducing a cycle:
    • use any model or combine multiple
    • include any other business process, or be included in another business process, as a step
    • interact with database representations for the models it works with
  • similarly, at the transport level, any service can use or combine various business processes, and more.
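
A minimal sketch of the import direction under this approach; the module path and names are illustrative, not prescriptive:

// models/user.go
package models

// User is a pure data type at the model level; it depends on nothing above it.
type User struct {
    ID   int64
    Name string
}

// controllers/greeting.go
package controllers

import "example.com/project/models" // hypothetical module path

// Greet is a business process: it may use any model, or call other
// controllers, without ever forming an import cycle.
func Greet(u models.User) string {
    return "Hello, " + u.Name
}

// handlers/greeting.go would sit one level higher still: the transport
// layer imports controllers and models, never the other way around.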

The insightful reader might have already noticed that this has much in common with properly designing and structuring a good single service. Moreover, if a single service has followed this approach, adding another service wouldn't require doing anything at all, since the project, and hence the repo, is already prepared to accommodate as many applications as needed.

The Layout of a Structured Monorepo

Everything that has been discussed about different layouts so far comes together and applies to a monolithic repository. Taken into account, implemented, and followed carefully, these practices establish the foundation for a good monorepo.

This is what a project may look like at this point (an illustrative tree follows the list):

  • the documentation is provided and up to date
  • the list of elements at any level is of a reasonable size
  • all maintenance scripts and other non-code files are organised
  • the entry points to services are located in the cmd directory
  • binaries are automatically placed in and picked up from the bin directory
  • code that implements the project's data and logic is grouped by responsibilities and roles.
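
Put together, the top level of such a repository might take the following shape; the service and directory names are purely illustrative:

├── bin
├── cmd
│   ├── api
│   └── worker
├── models
├── controllers
├── handlers
├── lib
└── scripts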

A few questions inevitably arise while working on a reasonably large project within one or a couple of teams:

  • Where do we put code that is meant to be used by many services?
  • Where should utility code go?
  • How to gradually and safely introduce something new or a breaking change?

There is no simple and direct answer. This is also where good planning and careful thought are needed, and where some exceptions apply. With that in mind, let's consider the following suggestions that help keep things in the right places:

  • use a package's position in the file tree to reflect its importance
  • organise utility code as your own small standard library
  • keep breaking changes in a sandbox.

Use an Appropriate Position in the File Tree

Place packages in the file tree of a monolithic repository so that their position reflects their importance and nature.

What does this mean, and what properties can we use to determine the right placement of a package? There's no hard set of rules, but the following heuristics help in understanding where a package should be placed:

  • The approximate position of a package in the dependency graph. A package that is imported by many other packages should be located closer to the root of the hierarchy. A rarely imported package is most likely an implementation detail, and should go deeper in the tree.
  • The frequency of use. The closer a package sits to the root, the more frequently it should be used.
  • The importance of a package. Something unique that provides and implements an important piece of functionality should be placed closer to the top.
  • The level of abstraction and role. The higher the abstraction, the higher the level at which a package should be placed.

For example, a package that converts errors from their internal representation to the external one, and which is used by most of the packages implementing application functionality, should be placed at the top level of the structure.

Another example is the packages that define the business logic - packages with models, controllers, and the database layer.

On the other hand, a set of middleware handlers for an API should be located deeper in the tree, as it's used only in the context of an API, and only by an API instance. Similarly, routines for data validation, migrations, etc. are better placed at the second or third level of the tree.
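
These heuristics can be pictured as a fragment of a file tree; the names are illustrative:

├── errors                 ← imported by most packages: top level
├── models                 ← business-domain definitions: top level
├── controllers
├── handlers
│   └── api
│       └── middleware     ← used only by the API: deeper in the tree
└── database
    └── migrations         ← an implementation detail: second level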

More on this to come in later units covering package design and architecture of a service.

Organise Utility Code as Your Own Standard Library

Organise, treat, and maintain all utility code as a small private extension to the standard library. If possible, consider releasing it as open source.

This recommendation sounds controversial and needs further explanation. To understand it better, we first need to clarify what differentiates utility code from other code, and then go deeper into how to apply the advice in real life.

Utility Code

Utility code is code that implements the technical details of a process, and is independent of the business logic of a project. This independence from the business logic is the crucial trait that distinguishes utility code from any other.

The following traits are common for utility code:

  • it's independent of any other code besides the standard library and/or itself
  • most of its functions operate on built-in types, or on types from the standard library
  • it provides common and frequently used routines
  • it can be extracted as a separate library
  • it can be open sourced
  • it provides functionality that does not exist in the standard library.

Here are some examples of code that can be part of a private extension to the standard library:

  • managing the lifecycle of a process, including proper signal handling, graceful and forced shutdown (sketched after this list)
  • archiving/compressing directories
  • extended configuration and methods for the http.Client, such as custom transports, file downloading, etc
  • handling multipart uploads
  • advanced string manipulation
  • some standard and generic data structures and algorithms, such as queues, graphs, lists, and so forth
  • concurrency patterns
  • file tree traversal utilities, and so forth.
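
As an illustration, here is a minimal sketch of the first item - a hypothetical lib/process package built only on the standard library; the Run signature is an assumption, not a prescription:

// Package process manages the lifecycle of a program.
package process

import (
    "context"
    "os/signal"
    "syscall"
    "time"
)

// Run executes fn with a context that is cancelled on SIGINT or SIGTERM.
// On a signal, shutdown (which is expected to make fn return) is given
// at most grace to finish.
func Run(fn, shutdown func(ctx context.Context) error, grace time.Duration) error {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    errc := make(chan error, 1)
    go func() { errc <- fn(ctx) }()

    select {
    case err := <-errc:
        return err // fn finished or failed before any signal arrived
    case <-ctx.Done():
        stop() // restore default handling: a second signal kills the process
        sctx, cancel := context.WithTimeout(context.Background(), grace)
        defer cancel()
        return shutdown(sctx)
    }
}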

Having clarified what utility code is and what is not, we can discuss what to do with it.

Details and Discussion

To begin with, it's worth recalling one of the most frequently repeated mantras in the Go community:

Prefer duplication over the wrong abstraction.

– Sandi Metz, and many gophers.

This is true, and this section should not be read as arguing the opposite. The author is among those who respect and follow this advice.

Nonetheless, many rules have exceptions, and it's about finding what works. What is good for a small project or service can be a poor choice when applied to multiple services. What's good at a small scale can significantly complicate things at a larger one, and vice versa.

The definition of utility code given above implicitly prohibits building additional abstractions on top of the standard library; such code can only be considered an extension of it.

Duplication works well at a small scale, when a small team maintains a couple of medium-sized services. In other words, duplication suits well when it's used moderately, rarely, and in isolation.

Things are different in a project supported by one large team or several teams, when it's a monorepo, and when the number of services grows. The duplication approach does not scale. Employing it at a larger scale leads to unnecessary duplication, a mess in the codebase, and a lack of guarantees, and makes the project prone to errors. The quality of testing decreases.

One of the biggest strengths of Go as a programming language is that there is mostly one way of accomplishing a task. This becomes almost a requirement when working with many services. There should be only one way to download a file, zip or unzip a directory, parse an authentication header, and so forth. Failing to acknowledge this can turn into a major source of obscure and nasty bugs at later stages of a project.

So when a project is a monorepo, and has more than one service, the following is true about utility code:

  • it will inevitably occur
  • it should be reliable and trustworthy
  • it should be tested
  • it should be maintained
  • it should be standardised.

One of the ways to provide these guarantees is to keep such code in one place, and to be conscious of and responsible for it.

In Practice

The guideline can be employed simply, like this:

  • at the root level, have the lib directory as home for utility code
  • put packages inside lib
  • for packages whose names clash with those from the standard library, add a prefix, for example x: xstrings
  • alternatively, keep package names as is, but have an agreement on how a custom package should be named when imported alongside a standard library package of the same name, as in the example below. Never use custom names for the standard packages themselves.
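
For instance, the second option may look like this in code; the module path and the Capitalize helper are hypothetical:

package handlers

import (
    "strings"

    xstrings "example.com/project/lib/strings" // team-agreed alias for the in-house extension
)

// NormalizeTitle trims with the standard library and capitalises with the
// in-house helper; both packages keep clear, predictable names.
func NormalizeTitle(s string) string {
    return xstrings.Capitalize(strings.TrimSpace(s))
}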

This approach also works especially well when the implementations of some routines differ between the platforms your project supports (an example follows the tree below).

As a result, the file tree may look like this:

├── bin
├── cmd
└── lib
    ├── process
    │   ├── process.go
    │   ├── process_darwin.go
    │   ├── process_linux.go
    │   ├── process_posix_test.go
    │   ├── process_windows.go
    │   └── process_windows_test.go
    ├── files
    │   ├── files_posix.go
    │   ├── files_test.go
    │   └── files_windows.go
    ├── http
    │   ├── http.go
    │   └── http_test.go
    ├── os
    │   ├── os.go
    │   ├── os_posix_test.go
    │   └── os_windows_test.go
    └── strings
        ├── strings.go
        └── strings_test.go
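
The per-platform files in such a tree are selected with build constraints. Here is a minimal sketch of one such pair; the Hidden helper is purely illustrative:

// files/files_posix.go
//go:build !windows

package files

import (
    "path/filepath"
    "strings"
)

// Hidden reports whether path is hidden, using the Unix dotfile convention.
func Hidden(path string) bool {
    return strings.HasPrefix(filepath.Base(path), ".")
}

// files/files_windows.go
//go:build windows

package files

// Hidden would instead consult the file's attributes via the Windows API.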

When applied and followed carefully, this advice helps consolidate and maintain low-level code, providing one way of accomplishing a particular task, giving more guarantees about the quality and correctness of the implementation, and reducing the maintenance cost.

Have a Sandbox for Breaking Changes

Another aspect that differs when working with a monolithic repository is the introduction of breaking changes, or of entirely new code that has not proved to be stable. Within a monorepo, a piece of code may be relied upon in hundreds of places, so it's better to test a change in isolation, and only then use it elsewhere. How do we go about this in a monorepo?

In a monorepo, have a special place for adding potentially unstable code. An experimental or unstable directory at the root level is a good choice. Inside that directory, follow a structure similar to the one at the root level.

Details and Discussion

In the classic scenario, dependency management tools solve this problem. A dependency that is used in many places is updated, and the change is then gradually rolled out. Go modules and/or vendoring are handy tools here, and this is one of the main reasons for their existence.

However, these tools are no longer available for addressing this problem, as all the code lives in a single repository. At least, not directly: it's impossible to vendor a part of a repository, or to use a separate import path for a package that is part of a larger module.

A solution to this problem has existed for many years, and is in use by various kinds of software, from private projects to the Linux kernel and many established Linux distributions. A common name for it is "experimental".

What do we mean by "experimental"?

Of course, there are different stages in release processes, such as development, alpha and beta versions, release candidates and so on. Somewhere between development and beta there is usually an experimental branch or stage. After reaching some level of stability, the project transitions to the next stage, usually testing or beta. Many projects follow this model, yet it's not exactly what the current advice is about.

This guideline is about having experimental code included in the stable version, and made available for conscious use. If the reader has ever configured and compiled a Linux kernel, they will recall the EXPERIMENTAL label on drivers that are included in a stable release but still in active development, offered for use as is, without any guarantees.

Similarly, even the most stable version of Debian, like many other Linux distributions, has an experimental section in its repositories, which contains early releases of software. Such sources are turned off by default, but the user is free to enable them.

So here's what to do in a monorepo. When introducing a breaking change or something entirely new, where the code is not guaranteed to work, or to work correctly (hence it's not for general use), consider the following:

  • add a package for new code, if it doesn't exist yet, to the experimental or unstable directory
  • add new code to the package
  • use it, test it, change it, prove it works
  • once confirmed working:
    • for new code, move it to the stable tree
    • for changes to existing code, depending on the situation
      • move changes to the stable tree and make necessary adjustments
      • alternatively, use type aliases to redirect uses from old code to new (sketched after this list)
        • test and prove it works
        • move changes to the stable tree.
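
The alias trick deserves a short sketch. The stable package temporarily points at the experimental implementation, so every existing import keeps compiling while the new code is exercised; the paths and names are illustrative:

// models/user.go - the stable tree
package models

import expmodels "example.com/project/experimental/models" // hypothetical path

// User temporarily aliases the experimental type, so all existing callers
// build unchanged while exercising the new implementation.
type User = expmodels.User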

An important note about the process: the experimental/unstable tree must be kept in good order at all times. As with utility code, without discipline it's too easy to let these places become junk yards. Keep them clear of clutter by making sure everyone on the team follows the conventions and the boy scout rule; move working code to the stable tree, and eliminate unused and incorrect code.