A command-line tool for crate registry backup/export https://shipyard.rs

registry-backup

Command line utilities for backup, export, and migration of a Rust private crate registry.

Use cases:

  • Backup: retrieve a registry server's files for backup storage
  • Export: pull the files so you can host them at another registry server
  • Migration: publish downloaded .crate files to a new private registry, including modifying the Cargo.toml manifests of each published crate version to make it compatible with the destination registry

Tools

There are two binaries in the repo:

  • registry-backup: for downloading all .crate files hosted by a Cargo registry server
  • publish: for publishing the .crate files downloaded by registry-backup to a different registry

registry-backup

registry-backup is a tool to download all of the .crate files hosted by a Cargo registry server.

Example Usage

Specify the registry index either as a local path (--index-path)...

$ git clone https://github.com/rust-lang/crates.io-index.git
$ RUST_LOG=info registry-backup \
    --index-path crates.io-index \
    --output-path crates.io-crate-files \
    --requests-per-second 10

...or as an --index-url instead:

$ RUST_LOG=info registry-backup \
    --index-url ssh://git@ssh.shipyard.rs/shipyard-rs/crate-index.git \
    --output-path shipyard-rs-crate-files \
    --auth-token ${AUTH_TOKEN} # for private registry, need auth

Install

$ cargo install registry-backup --git https://git.shipyard.rs/jstrong/registry-backup.git

Runtime Options

$ ./target/release/registry-backup --help

registry-backup 0.5.0-beta.1
Jonathan Strong <jstrong@shipyard.rs>
Download all .crate files from a registry server

USAGE:
    registry-backup [OPTIONS]

OPTIONS:
        --index-url <URL>
            URL of the registry index we are downloading .crate files from. The program expects that
            it will be able to clone the index to a local temporary directory; the user must handle
            authentication if needed

        --index-path <PATH>
            instead of an index url, just point to a local path where the index is already cloned

    -a, --auth-token <TOKEN>
            If registry requires authorization (i.e. "auth-required" key is set to `true` in the
            `config.json` file), the token to include using the Authorization HTTP header

    -o, --output-path <PATH>
            Directory where downloaded .crate files will be saved to
            
            [default: output]

        --overwrite-existing
            Download files even if a .crate file already exists in the output dir for a given crate
            version, and overwrite the existing file with the new one. Default behavior is to skip
            downloading if the .crate file already exists

        --output-format <FORMAT>
            What format to use for the output filenames. Works the same as Cargo's registry syntax
            for the "dl" key in the `config.json` file in a registry index. See [Cargo
            docs](https://doc.rust-lang.org/cargo/reference/registries.html#index-format) for
            additional details. Not specifying this field is equivalent to specifying
            "{crate}/{version}/download", the default.
            
            The resulting path specified by the format should be relative; it will be joined with
            the --output-path. (i.e. it should not start with "/".)

    -U, --user-agent <USER_AGENT>
            Value of user-agent HTTP header
            
            [default: registry-backup/v0.5.0-beta.1]

    -R, --requests-per-second <INT>
            Requests to registry server will not exceed this rate
            
            [default: 100]

    -M, --max-concurrent-requests <INT>
            Independent of the requests per second rate limit, no more than
            `max_concurrent_requests` will be in flight at any given moment
            
            [default: 50]

    -c, --config-file <PATH>
            Specify configuration values using the provided TOML file, instead of via command line
            flags. The values in the config file will override any values passed as command line
            flags. See config.toml.sample for syntax of the config file

        --filter-crates <REGEX>
            Only crates with names that match the --filter-crates regex will be downloaded

        --dry-run
            Don't actually download the .crate files, just list files which would be downloaded.
            Note: --requests-per-second and --max-concurrent-requests are still enforced even in
            --dry-run mode!

    -h, --help
            Print help information

    -V, --version
            Print version information
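
To make the --output-format option concrete, here is a minimal Python sketch of how a format string could expand into an output path. The function names (index_prefix, expand_output_format) are hypothetical and this is not the tool's actual code. The {prefix} and {lowerprefix} markers follow the rules in the Cargo docs linked in the help text, and it is an assumption that registry-backup supports every marker Cargo does.

```python
from pathlib import PurePosixPath


def index_prefix(name: str) -> str:
    """Directory prefix Cargo derives from a crate name (per the Cargo registry docs)."""
    if len(name) == 1:
        return "1"
    if len(name) == 2:
        return "2"
    if len(name) == 3:
        return f"3/{name[0]}"
    return f"{name[:2]}/{name[2:4]}"


def expand_output_format(fmt: str, crate: str, version: str, output_path: str = "output") -> str:
    """Expand a "dl"-style format string and join the result with the output directory."""
    relative = (
        fmt.replace("{crate}", crate)
        .replace("{version}", version)
        .replace("{prefix}", index_prefix(crate))
        .replace("{lowerprefix}", index_prefix(crate.lower()))
    )
    if relative.startswith("/"):
        # The help text says the formatted path should be relative.
        raise ValueError("output format must produce a relative path")
    return str(PurePosixPath(output_path) / relative)
```

With the default format, expand_output_format("{crate}/{version}/download", "serde", "1.0.0") gives output/serde/1.0.0/download.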

Configuration File

A TOML configuration file may be used instead of command line flags. A sample file (config.toml.sample) is included. From the sample file:

dry-run = false
filter-crates = "^."

[registry]
index-url = "ssh://git@ssh.shipyard.rs/shipyard-rs-public/crate-index.git"
# alternatively, specify a local dir
# index-path = "/path/to/cloned/index"
auth-token = "xxx"

[http]
user-agent = "registry-backup/v0.1.0"
requests-per-second = 100
max-concurrent-requests = 50

[output]
path = "output"
overwrite-existing = false
format = "{crate}/{version}/download"
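
The two [http] limits are independent of each other: requests-per-second caps how quickly new requests are started, while max-concurrent-requests caps how many are in flight at once. The following Python sketch shows one way two such limits can be combined. It is an illustration with hypothetical names (run_throttled, jobs), not a description of how registry-backup is implemented.

```python
import asyncio


async def run_throttled(jobs, requests_per_second, max_concurrent):
    """Start jobs no faster than the rate limit, with a cap on jobs in flight.

    `jobs` is a list of zero-argument async callables (hypothetical stand-ins
    for individual download requests).
    """
    interval = 1.0 / requests_per_second
    slots = asyncio.Semaphore(max_concurrent)

    async def guarded(job):
        try:
            return await job()
        finally:
            slots.release()  # free the slot once the request has finished

    tasks = []
    for job in jobs:
        await slots.acquire()  # concurrency cap: wait for a free slot
        tasks.append(asyncio.create_task(guarded(job)))
        await asyncio.sleep(interval)  # rate cap: space out request starts
    return await asyncio.gather(*tasks)
```

A caller would run it with something like asyncio.run(run_throttled(jobs, requests_per_second=100, max_concurrent=50)). With slow responses the concurrency cap is the binding limit; with fast responses the rate cap is.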

Build From Source

$ git clone https://git.shipyard.rs/jstrong/registry-backup.git
$ cd registry-backup
$ just release-build # alternatively, cargo build --bin registry-backup --release
# ./target/release/registry-backup --help
# cp target/release/registry-backup ~/.cargo/bin/

publish

publish is a tool to publish all of the crate versions from a source registry to a second, destination registry.

Usage Overview

publish differs from registry-backup in that it requires several steps, including the use of a Python script.

In general, migrating all of the crate versions to another registry is relatively complex, compared to just downloading the .crate files. Migrating to a new registry involves the following (big picture) steps:

  1. extracting the order that crate versions were published to the source registry from the git history of the crate index repository
  2. extracting the source files, including Cargo.toml manifests, from the downloaded .crate files
  3. modifying the Cargo.toml manifests for each crate version so the crate will be compatible with the destination registry
  4. publishing the crate versions, in the right order and using the modified Cargo.toml manifests, to the destination registry

Background Context: cargo publish, .crate Files, and Cargo.toml.orig

When you run the cargo publish command to publish a crate version to a registry server, it generates an alternate Cargo.toml manifest based on the contents of the original Cargo.toml in combination with the configured settings with which the command was invoked.

For example, if you had configured a private registry in ~/.cargo/config.toml:

# ~/.cargo/config.toml

[registries.my-private-registry]
index = "ssh://git@ssh.shipyard.rs/my-private-registry/crate-index.git"

And then added a dependency from that registry in a Cargo.toml for a crate:

# Cargo.toml
[package]
name = "foo"
publish = ["my-private-registry"]

[dependencies]
bar = { version = "1.0", registry = "my-private-registry" }

...cargo publish would convert the dependency into one with a hard-coded registry-index field that points to the specific index URL that was configured at the time it was invoked:

# cargo publish-generated Cargo.toml
[package]
name = "foo"
publish = ["my-private-registry"]

[dependencies]
bar = { version = "1.0", registry-index = "ssh://git@ssh.shipyard.rs/my-private-registry/crate-index.git" }

cargo publish includes the original Cargo.toml file at the path Cargo.toml.orig in the .crate file (actually a .tar.gz archive).
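
Because a .crate file is a gzipped tar archive, the original manifest can be read with standard tools. Here is a minimal Python sketch, assuming the usual layout in which the packaged files sit under a single top-level <name>-<version>/ directory. The function name read_original_manifest is hypothetical.

```python
import tarfile


def read_original_manifest(crate_path):
    """Return the text of Cargo.toml.orig stored inside a .crate (tar.gz) file."""
    with tarfile.open(crate_path, mode="r:gz") as archive:
        for member in archive:
            parts = member.name.split("/")
            # Expect "<name>-<version>/Cargo.toml.orig" at the top of the archive.
            if member.isfile() and len(parts) == 2 and parts[1] == "Cargo.toml.orig":
                return archive.extractfile(member).read().decode("utf-8")
    raise FileNotFoundError(f"no Cargo.toml.orig found in {crate_path}")
```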

Since the registry-index entries generated by cargo publish point to the specific URL of the source registry, publishing the .crate file as-is to the destination registry will not suffice. To resolve this problem, publish takes the Cargo.toml.orig file contained in the .crate file, modifies its dependency entries according to the settings of the destination registry, and publishes the crate version to the destination registry using cargo publish. In other words, it discards the cargo publish-generated Cargo.toml and relies instead on the modified Cargo.toml.orig, in combination with runtime settings provided to cargo as env vars.
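
As a rough illustration of the manifest rewrite, the Python sketch below swaps the registry name on dependency entries in the text of a Cargo.toml.orig. It is a deliberately simplified, text-level version with a hypothetical function name. It is an assumption that a plain substitution is enough for your manifests; the publish tool itself may well parse the TOML properly, and other fields (such as the publish list under [package]) are not handled here.

```python
import re


def retarget_registry(manifest_text, old_name, new_name):
    """Replace `registry = "<old_name>"` with `registry = "<new_name>"`.

    Keys such as `registry-index` are left untouched, because the pattern
    requires `registry` to be followed directly by `=`.
    """
    pattern = re.compile(r'(\bregistry\s*=\s*)"' + re.escape(old_name) + r'"')
    return pattern.sub(lambda m: f'{m.group(1)}"{new_name}"', manifest_text)
```

For example, the line bar = { version = "1.0", registry = "my-old-registry" } becomes bar = { version = "1.0", registry = "my-new-registry" }.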

The Global Dependency Graph of a Registry and publish-log.csv

Once we have solved how to take a .crate file from the source registry and publish it to the destination registry, there is still the issue of the order in which the crate versions should be published. If crate a version 1.2.3 depends on crate b version 2.3.4, then crate b version 2.3.4 needs to have already been published to the destination registry at the time crate a version 1.2.3 is published; otherwise, crate a would depend on a crate version that does not (yet) exist there. If you try to publish crates with cargo publish without respecting this global dependency graph, it will exit with an error, and publishing out of order by some other means is not a good idea, either.
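
The ordering constraint amounts to a topological ordering of crate versions. The toy Python sketch below illustrates the constraint using the a/b example from the paragraph above and the standard library's graphlib module (Python 3.9+). It only shows the constraint; it is not how publish determines the order, and the function name is hypothetical.

```python
from graphlib import TopologicalSorter


def dependency_safe_order(dependencies):
    """Return crate versions ordered so each one comes after its dependencies.

    `dependencies` maps a (crate, version) pair to the set of
    (crate, version) pairs it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


toy_registry = {
    ("a", "1.2.3"): {("b", "2.3.4")},  # a 1.2.3 depends on b 2.3.4
    ("b", "2.3.4"): set(),             # b 2.3.4 has no dependencies
}
order = dependency_safe_order(toy_registry)  # b 2.3.4 is listed before a 1.2.3
```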

Building a dependency graph for the entire registry is certainly possible, theoretically. However, in practice it is tedious to do, mainly because it requires mirroring cargo's dependency resolution process, just to be able to identify the full set of dependencies that would end up in the Cargo.lock file. That, in turn, requires using cargo (i.e. via the cargo metadata command), which is slow for large registries (only a single cargo metadata command can run at a time due to the use of lock files), and quite involved in terms of parsing the programmatically-generated outputs (wow it is amazing how many different forms crate metadata is represented in various cargo/registry contexts!).

To shortcut these complexities, publish relies on the use of a Python script to extract the order in which crate versions were published to a registry using the git history of the crate index repository.

The tool (script/get-publish-history.py) was based on an open source script that uses the GitPython library to traverse the commit history of a repo. In a few minutes' work, we were able to modify the script to extract the publish order of all the crate versions appearing in the crate index repository. And, as much as we love Rust (and do not share the same passion for Python), porting the code to Rust using the git2 crate looked like quite a tedious project in itself.
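
The idea behind the script can be illustrated without git at all. Each publish adds a JSON line (with "name" and "vers" fields) to the crate's file in the index, so the first snapshot in which a version appears tells you where it falls in the publish order. The Python sketch below works over a list of index-file snapshots, oldest first. It is a simplified stand-in with a hypothetical function name, not the actual script, which walks the real commit history with GitPython.

```python
import json


def publish_order_from_snapshots(snapshots):
    """Return (name, version) pairs in the order they first appear.

    `snapshots` is a list of index-file contents, oldest first; each
    snapshot holds one JSON object per line, as in a Cargo registry index.
    """
    seen = set()
    order = []
    for text in snapshots:
        for line in text.splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            key = (entry["name"], entry["vers"])
            if key not in seen:
                seen.add(key)
                order.append(key)
    return order
```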

To generate a .csv file with the order in which crates were published, first clone the crate index repository, e.g.:

$ git clone ssh://git@ssh.shipyard.rs/my-private-registry/crate-index.git

Then run the script (it has two dependencies, GitPython and pandas, both of which can be pip installed or otherwise acquired using whatever terrible Python package manager you want):

$ python script/get-publish-history.py path/to/crate-index > publish-log.csv

You will need a publish-log.csv generated from the source registry to use publish.

(You might be wondering why we are relying on git history to reconstruct the publishing order. The primary reason is that the crate index metadata (or any other metadata universally available from a crate registry) does not include any information about when each crate version was published.)

Detailed Usage Example

1) Clone the source registry crate index repository:

$ mkdir source-registry
$ git clone <source registry crate index repo url> source-registry/crate-index

2) Use registry-backup to download all the .crate files from the source registry:

$ cargo install registry-backup --git https://git.shipyard.rs/jstrong/registry-backup.git # or build from source
$ RUST_LOG=info registry-backup \
    --index-path source-registry/crate-index \
    --output-path source-registry/crate-files

3) Use the get-publish-history.py script to extract the crate version publish history:

$ . ../virtualenvs/my-env/activate # or whatever you use
$ pip install GitPython
$ pip install pandas
$ python3 script/get-publish-history.py source-registry/crate-index > source-registry/publish-log.csv

4) Create a configuration file:

# publish-config.toml

# source registry config
[src]
index-dir = "source-registry/crate-index" # <- see step 1
crate-files-dir = "source-registry/crate-files" # <- see step 2
publish-history-csv = "source-registry/publish-log.csv" # <- see step 3
registry-name = "my-old-registry" # <- whatever label the source registry was given in Cargo.toml files
index-url = "https://github.com/my-org/crate-index.git" # <- index url, i.e. same as one provided in ~/.cargo/config.toml

# destination registry config
[dst]
index-url = "ssh://git@ssh.shipyard.rs/my-new-registry/crate-index.git"
registry-name = "my-new-registry" # can be same as old name or a different name
auth-token = "xxx" # auth token for publishing to the destination registry

5) Build publish:

$ cargo build --bin publish --features publish --release

6) Validate your config file (optional):

$ ./target/release/publish --config-file publish-config.toml --validate

7) Publish to the destination registry using publish:

$ RUST_LOG=info ./target/release/publish --config-file publish-config.toml

Expected Runtime

As an example, using publish, it took us about 50 minutes to migrate a registry with 77 crates and 937 versions. Results may vary based on the machine used to run publish as well as the performance of the destination registry server.
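
For a back-of-the-envelope estimate, the figures above work out to roughly 3.2 seconds per crate version. The short Python sketch below does that arithmetic and extrapolates linearly to other registry sizes. The linear scaling and the function name are assumptions, so treat the result as a rough guide only.

```python
MIGRATION_MINUTES = 50    # total runtime reported above
VERSIONS_PUBLISHED = 937  # crate versions migrated in that run

# Average time spent per published crate version, in seconds.
seconds_per_version = MIGRATION_MINUTES * 60 / VERSIONS_PUBLISHED


def estimate_minutes(version_count):
    """Linear extrapolation of total runtime (assumes similar per-version cost)."""
    return version_count * seconds_per_version / 60


print(f"{seconds_per_version:.1f} seconds per crate version")
print(f"{estimate_minutes(5000):.0f} minutes for 5,000 versions")
```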

Building publish (Full Example)

$ git clone https://git.shipyard.rs/jstrong/registry-backup.git
$ cd registry-backup
$ just release-build-publish # alternatively, cargo build --bin publish --features publish --release

Note: a publish binary built with --release really does run quite a bit faster, at least for larger registries.

Configuration File

Annotated example configuration file:

# optional field for providing a regex-based filter
# to limit which crates are published to the destination
# registry. only crates with names matching the regex will
# be published.
#
filter-crates = "^."

# do everything except actually publish to the destination registry
dry-run = false

# source registry config
[src]
index-dir = "path/to/crate-index/repo" # git clone of crate index repository
crate-files-dir = "path/to/crate/files" # i.e. files downloaded by registry-backup tool
publish-history-csv = "path/to/publish-log.csv" # see docs above
registry-name = "my-old-registry" # whatever label the source registry was given in Cargo.toml files
index-url = "https://github.com/my-org/crate-index.git" # index url, i.e. same as one provided in ~/.cargo/config.toml

# destination registry config
[dst]
index-url = "ssh://git@ssh.shipyard.rs/my-new-registry/crate-index.git" # index url of new registry
registry-name = "my-new-registry" # can be same as old name or a different name
auth-token = "xxx" # auth token for publishing to the destination registry
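
To show how a filter-crates value narrows the set of crates, here is a small Python sketch. It assumes the regex is applied as an unanchored search against each crate name (so "^." matches every non-empty name); that matching behavior and the function name are assumptions, and Python's regex syntax is not identical to that of Rust's regex crate.

```python
import re


def select_crates(names, filter_crates):
    """Keep only the crate names that match the filter regex."""
    pattern = re.compile(filter_crates)
    return [name for name in names if pattern.search(name)]


crates = ["foo", "foo-utils", "bar"]
everything = select_crates(crates, "^.")      # sample value: matches all names
foo_family = select_crates(crates, "^foo")    # names starting with "foo"
only_foo = select_crates(crates, "^foo$")     # exactly "foo"
```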

Runtime Options

$ ./target/release/publish --help

registry-backup 0.5.0-beta.1
Jonathan Strong <jstrong@shipyard.rs>

USAGE:
    publish [OPTIONS] --config-file <PATH>

OPTIONS:
    -c, --config-file <PATH>       Config file with source directories and destination registry info
        --dry-run                  Perform all the work of generating `cargo publish` payloads, but
                                   don't send them to the destination registry server
        --validate                 Load config file, validate the settings, and display the final
                                   loaded content to stdout, then exit
        --filter-crates <REGEX>    Use to limit which crates from the source registry are published
                                   to the destination registry. Expects a regular expression which
                                   will be matched against the names of crates. Only crates with
                                   names that match the regex will be published. This field may also
                                   be specified at the top level of the config file
    -h, --help                     Print help information
    -V, --version                  Print version information

Running Tests

$ just test # alternatively, cargo test

Justfile

The repository includes a justfile with functionality for building, testing, etc.

Included commands:

$ just --list

Available recipes:
    cargo +args=''                 # cargo wrapper; executes a cargo command using the settings in justfile (RUSTFLAGS, etc.)
    check +args=''                 # cargo check wrapper
    debug-build +args=''           # cargo build wrapper - builds registry-backup in debug mode
    debug-build-publish +args=''   # cargo build wrapper - builds publish tool in debug mode
    generate-readme                # generate updated README.md
    get-crate-version
    install                        # cargo install registry-backup via git dep
    pre-release                    # check, run tests, check non-error output for clippy, run rustfmt
    release                        # release version (regenerate docs, git tag v0.0.0)
    release-build +args=''         # cargo build --release wrapper - builds registry-backup in release mode
    release-build-publish +args='' # cargo build --release wrapper - builds publish tool in release mode
    release-prep                   # get everything all ready for release
    show-build-env                 # diagnostic command for viewing value of build variables at runtime
    test +args=''                  # cargo test wrapper
    update-readme                  # re-generate README.md and overwrite existing file with output
    update-readme-and-commit       # re-generate, overwrite, stage, and commit
    update-readme-and-stage        # re-generate, overwrite, and stage changes
    verify-clean-git               # verify no uncommitted changes

The commands that mirror cargo commands (e.g. just test) are included for convenience, so that various options (e.g. RUSTFLAGS='-C target-cpu=native') can be included without typing them out each time.

Generating README.md

This file is generated using a template (doc/README.tera.md) rendered using updated outputs of the CLI menu, config sample, and other values.

This version of README.md was generated at Fri, 10 Nov 2023 01:30:48 +0000 based on git commit 4c2a9e5f.

To (re-)generate the README.md file, use the justfile command:

$ just generate-readme