Deploying a Local Mirror of the Tsinghua Anaconda Open-Source Repository


Anaconda is a very convenient tool for managing Python virtual environments, but installing Python packages through anaconda is not very friendly to users in mainland China. Fortunately, several big domestic companies provide open-source mirrors for us; as a research dog, the one I use most is Tsinghua's. The Tsinghua mirror is not always stable, though, especially for people with poor network speeds, and installs often fail with network timeout errors, which is a real headache.

Luckily, the Tsinghua open-source mirror team provides an open-source script for downloading the anaconda-related repositories to a local machine. This makes it possible to deploy the mirror locally for personal use, a huge convenience for research dogs, such as me!!! Hehe

This post explains how to sync the official anaconda mirror repositories to your own machine and deploy a personal mirror on top of them to speed up downloads.

First, thanks to Tsinghua for open-sourcing the anaconda mirror download script: the open-source script

1. Downloading mirror files from the official mirror repository

The Tsinghua script downloads all repositories and packages by default, but each of us has a fixed research area and a fixed set of commonly used Python packages, and downloading everything would take an enormous amount of storage, so we can comment out the repository entries we will never use.

Below is my modified version, in which only the linux-64 and win-64 packages are downloaded. Beyond that, when there is no existing anaconda environment on a machine, we normally download an anaconda or miniconda installer manually from the web and install it before the conda environment manager can be used at all. Downloading the conda installers under the archive and miniconda paths is therefore unnecessary: those packages are only used when upgrading conda from the command line inside an already-installed conda environment.

For example:

conda update conda
conda update anaconda

The installers are only needed when running commands like these.

Also, from my own testing (really because I never expected these repos to be so large o(╥﹏╥)o), the packages in the archive and miniconda repositories come to 135.9G and 21G respectively, and conda-forge is 900+G (as of 2021.05.18). If your disk is not big enough, you may want to sync only the free and main repositories plus the pytorch channel.
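If you are unsure whether a channel will fit on your disk, you can estimate its size before syncing by summing the package sizes recorded in its repodata.json. The following is only a minimal sketch: the helper estimate_channel_size is hypothetical, and the URL points at the public Tsinghua mirror purely as an example; substitute whatever channel you plan to sync.

# estimate_size.py - sketch: estimate a channel's disk usage before syncing it.
# Assumes the standard conda repodata layout ("packages" plus optional
# "packages.conda" sections); the mirror URL below is only an example.
import requests

def estimate_channel_size(repo_url: str) -> int:
    """Sum the sizes (in bytes) of all packages listed in repodata.json."""
    repodata = requests.get(repo_url + "/repodata.json", timeout=(7, 10)).json()
    packages = dict(repodata.get("packages", {}))
    packages.update(repodata.get("packages.conda", {}))
    return sum(meta["size"] for meta in packages.values())

if __name__ == "__main__":
    url = "https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/linux-64"
    print("pytorch/linux-64: %.2f GiB" % (estimate_channel_size(url) / 1024 ** 3))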

The modified code is as follows:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# @File : anaconda.py
# @Time : 2021/5/16 19:52
# @Author : www.mlzhilu.com
# @Version : 1.0
# @Descriptions : downloading python packages from anaconda repositories

# here put the import lib

import hashlib
import json
import logging
import os
import random
import shutil
import subprocess as sp
import tempfile
from email.utils import parsedate_to_datetime
from pathlib import Path

import requests
from pyquery import PyQuery as pq

DEFAULT_CONDA_REPO_BASE = "https://repo.continuum.io"
DEFAULT_CONDA_CLOUD_BASE = "https://conda.anaconda.org"

# NB: "COULD" below is a typo inherited from the upstream tunasync script;
# the env var name is kept as-is for compatibility with existing setups.
CONDA_REPO_BASE_URL = os.getenv("CONDA_REPO_URL", DEFAULT_CONDA_REPO_BASE)
CONDA_CLOUD_BASE_URL = os.getenv("CONDA_COULD_URL", DEFAULT_CONDA_CLOUD_BASE)

WORKING_DIR = os.getenv("TUNASYNC_WORKING_DIR")

CONDA_REPOS = ("main", "free")
CONDA_ARCHES = (
    "noarch", "linux-64", "win-64"
)

# CONDA_ARCHES = (
#     "noarch", "linux-64", "linux-32", "linux-armv6l", "linux-armv7l",
#     "linux-ppc64le", "osx-64", "osx-32", "osx-arm64", "win-64", "win-32"
# )

CONDA_CLOUD_REPOS = (
    #"conda-forge/linux-64", "conda-forge/win-64", "conda-forge/noarch",
    # "conda-forge/osx-64", "conda-forge/osx-arm64",
    "msys2/linux-64", "msys2/win-64", "msys2/noarch",
    # "rapidsai/linux-64", "rapidsai/noarch",
    # "bioconda/linux-64", "bioconda/win-64", "bioconda/noarch",
    # "bioconda/osx-64",
    "menpo/linux-64", "menpo/win-64", "menpo/noarch",
    # "menpo/osx-64", "menpo/win-32",
    "pytorch/linux-64",  "pytorch/win-64", "pytorch/noarch",
    # "pytorch/osx-64", "pytorch/win-32",
    # fastai simplifies training fast and accurate neural networks using modern best practices
    # "fastai/linux-64", "fastai/win-64", "fastai/noarch",
    # "fastai/osx-64",
    # SimpleITK is a simplified interface to ITK for medical image processing; it is easier to use and has bindings for multiple languages
    # "simpleitk/linux-64", "simpleitk/linux-32", "simpleitk/osx-64", "simpleitk/win-64", "simpleitk/win-32", "simpleitk/noarch",
    # "caffe2/linux-64", "caffe2/osx-64", "caffe2/win-64", "caffe2/noarch",
    # "plotly/linux-64", "plotly/linux-32", "plotly/osx-64", "plotly/win-64", "plotly/win-32", "plotly/noarch",
    # "intel/linux-64", "intel/linux-32", "intel/osx-64", "intel/win-64", "intel/win-32", "intel/noarch",
    # "auto/linux-64", "auto/linux-32", "auto/osx-64", "auto/win-64", "auto/win-32", "auto/noarch",
    # "ursky/linux-64", "ursky/osx-64", "ursky/noarch",
    # "matsci/linux-64", "matsci/osx-64", "matsci/win-64", "matsci/noarch",
    # Psi4 is an open-source quantum chemistry software package
    # "psi4/linux-64", "psi4/osx-64", "psi4/win-64", "psi4/noarch",
    "Paddle/linux-64", "Paddle/win-64", "Paddle/noarch",
    # "Paddle/linux-32", "Paddle/osx-64", "Paddle/win-32",
    # DeepModeling: a methodology and toolset combining machine learning, physical modeling, and cutting-edge computing
    # "deepmodeling/linux-64", "deepmodeling/noarch",
    # "numba/linux-64", "numba/linux-32", "numba/osx-64", "numba/win-64", "numba/win-32", "numba/noarch",
    # "numba/label/dev/win-64", "numba/label/dev/noarch",
    # "pyviz/linux-64", "pyviz/linux-32", "pyviz/win-64", "pyviz/win-32", "pyviz/osx-64", "pyviz/noarch",
    # DGL, a framework for graph neural networks and graph machine learning
    # "dglteam/linux-64", "dglteam/win-64", "dglteam/osx-64", "dglteam/noarch",
    # RDKit is a powerful open-source cheminformatics package for Python
    # "rdkit/linux-64", "rdkit/win-64", "rdkit/osx-64", "rdkit/noarch",
    # "mordred-descriptor/linux-64", "mordred-descriptor/win-64", "mordred-descriptor/win-32", "mordred-descriptor/osx-64", "mordred-descriptor/noarch",
    # "ohmeta/linux-64", "ohmeta/osx-64", "ohmeta/noarch",
    # QIIME 2 is a powerful, extensible, decentralized microbiome analysis package
    # "qiime2/linux-64", "qiime2/osx-64", "qiime2/noarch",
    # "biobakery/linux-64", "biobakery/osx-64", "biobakery/noarch",
    # "c4aarch64/linux-aarch64", "c4aarch64/noarch",
    # "pytorch3d/linux-64", "pytorch3d/noarch",
    # "idaholab/linux-64", "idaholab/noarch",
)

EXCLUDED_PACKAGES = (
    "pytorch-nightly", "pytorch-nightly-cpu", "ignite-nightly",
)

# connect and read timeout value
TIMEOUT_OPTION = (7, 10)

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] [%(levelname)s] %(message)s",
)

def sizeof_fmt(num, suffix='iB'):
    for unit in ['','K','M','G','T','P','E','Z']:
        if abs(num) < 1024.0:
            return "%3.2f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.2f%s%s" % (num, 'Y', suffix)

def md5_check(file: Path, md5: str = None):
    m = hashlib.md5()
    with file.open('rb') as f:
        while True:
            buf = f.read(1*1024*1024)
            if not buf:
                break
            m.update(buf)
    return m.hexdigest() == md5


def curl_download(remote_url: str, dst_file: Path, md5: str = None):
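    # curl options: -sL follows redirects silently, --remote-time keeps the
    # server's modification time on the file, --fail turns HTTP errors into
    # a non-zero exit, --retry 10 retries transient failures, and
    # --speed-time/--speed-limit abort the transfer if it stays below
    # 5000 B/s for 15 s.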
    sp.check_call([
        "curl", "-o", str(dst_file),
        "-sL", "--remote-time", "--show-error",
        "--fail", "--retry", "10", "--speed-time", "15",
        "--speed-limit", "5000", remote_url,
    ])

    if md5 and (not md5_check(dst_file, md5)):
        return "MD5 mismatch"


def sync_repo(repo_url: str, local_dir: Path, tmpdir: Path, delete: bool):
    logging.info("Start syncing {}".format(repo_url))
    local_dir.mkdir(parents=True, exist_ok=True)

    repodata_url = repo_url + '/repodata.json'
    bz2_repodata_url = repo_url + '/repodata.json.bz2'
    # https://docs.conda.io/projects/conda-build/en/latest/release-notes.html
    # "current_repodata.json" - like repodata.json, but only has the newest version of each file
    current_repodata_url = repo_url + '/current_repodata.json'

    tmp_repodata = tmpdir / "repodata.json"
    tmp_bz2_repodata = tmpdir / "repodata.json.bz2"
    tmp_current_repodata = tmpdir / 'current_repodata.json'

    curl_download(repodata_url, tmp_repodata)
    curl_download(bz2_repodata_url, tmp_bz2_repodata)
    try:
        curl_download(current_repodata_url, tmp_current_repodata)
    except sp.CalledProcessError:
        # current_repodata.json may be missing for some channels; ignore
        pass

    with tmp_repodata.open() as f:
        repodata = json.load(f)

    remote_filelist = []
    total_size = 0
    packages = repodata['packages']
    if 'packages.conda' in repodata:
        packages.update(repodata['packages.conda'])
    for filename, meta in packages.items():
        if meta['name'] in EXCLUDED_PACKAGES:
            continue

        file_size, md5 = meta['size'], meta['md5']
        total_size += file_size

        pkg_url = '/'.join([repo_url, filename])
        dst_file = local_dir / filename
        dst_file_wip = local_dir / ('.downloading.' + filename)
        remote_filelist.append(dst_file)

        if dst_file.is_file():
            stat = dst_file.stat()
            local_filesize = stat.st_size

            if file_size == local_filesize:
                logging.info("Skipping {}".format(filename))
                continue

            dst_file.unlink()

        for retry in range(3):
            logging.info("Downloading {}".format(filename))
            try:
                err = curl_download(pkg_url, dst_file_wip, md5=md5)
                if err is None:
                    dst_file_wip.rename(dst_file)
            except sp.CalledProcessError:
                err = 'CalledProcessError'
            if err is None:
                break
            logging.error("Failed to download {}: {}".format(filename, err))


    shutil.move(str(tmp_repodata), str(local_dir / "repodata.json"))
    shutil.move(str(tmp_bz2_repodata), str(local_dir / "repodata.json.bz2"))
    if tmp_current_repodata.is_file():
        shutil.move(str(tmp_current_repodata), str(
            local_dir / "current_repodata.json"))

    if delete:
        local_filelist = []
        delete_count = 0
        for i in local_dir.glob('*.tar.bz2'):
            local_filelist.append(i)
        for i in local_dir.glob('*.conda'):
            local_filelist.append(i)
        for i in set(local_filelist) - set(remote_filelist):
            logging.info("Deleting {}".format(i))
            i.unlink()
            delete_count += 1
        logging.info("{} files deleted".format(delete_count))

    logging.info("{}: {} files, {} in total".format(
        repodata_url, len(remote_filelist), sizeof_fmt(total_size)))
    return total_size

def sync_installer(repo_url, local_dir: Path):
    logging.info("Start syncing {}".format(repo_url))
    local_dir.mkdir(parents=True, exist_ok=True)
    full_scan = random.random() < 0.1 # Do full version check less frequently

    def remote_list():
        r = requests.get(repo_url, timeout=TIMEOUT_OPTION)
        d = pq(r.content)
        for tr in d('table').find('tr'):
            tds = pq(tr).find('td')
            if len(tds) != 4:
                continue
            fname = tds[0].find('a').text
            md5 = tds[3].text
            yield (fname, md5)

    for filename, md5 in remote_list():
        pkg_url = "/".join([repo_url, filename])
        dst_file = local_dir / filename
        dst_file_wip = local_dir / ('.downloading.' + filename)

        if dst_file.is_file():
            r = requests.head(pkg_url, allow_redirects=True, timeout=TIMEOUT_OPTION)
            len_avail = 'content-length' in r.headers
            if len_avail:
                remote_filesize = int(r.headers['content-length'])
            remote_date = parsedate_to_datetime(r.headers['last-modified'])
            stat = dst_file.stat()
            local_filesize = stat.st_size
            local_mtime = stat.st_mtime

            # Do content verification on ~5% of files (see issue #25)
            if (not len_avail or remote_filesize == local_filesize) and remote_date.timestamp() == local_mtime and \
                    (random.random() < 0.95 or md5_check(dst_file, md5)):
                logging.info("Skipping {}".format(filename))

                # Stop the scanning if the most recent version is present
                if not full_scan:
                    logging.info("Stop the scanning")
                    break

                continue

            logging.info("Removing {}".format(filename))
            dst_file.unlink()

        for retry in range(3):
            logging.info("Downloading {}".format(filename))
            err = ''
            try:
                err = curl_download(pkg_url, dst_file_wip, md5=md5)
                if err is None:
                    dst_file_wip.rename(dst_file)
            except sp.CalledProcessError:
                err = 'CalledProcessError'
            if err is None:
                break
            logging.error("Failed to download {}: {}".format(filename, err))


def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--working-dir", default=WORKING_DIR)
    parser.add_argument("--delete", action='store_true',
                        help='delete unreferenced package files')
    args = parser.parse_args()

    if args.working_dir is None:
        raise Exception("Working Directory is None")

    working_dir = Path(args.working_dir)
    size_statistics = 0
    random.seed()

    logging.info("Syncing installers...")
	
    # To also download the archive and miniconda installers, uncomment the block below.
    # for dist in ("archive", "miniconda"):
    #     remote_url = "{}/{}".format(CONDA_REPO_BASE_URL, dist)
    #     local_dir = working_dir / dist
    #     try:
    #         sync_installer(remote_url, local_dir)
    #         size_statistics += sum(
    #             f.stat().st_size for f in local_dir.glob('*') if f.is_file())
    #     except Exception:
    #         logging.exception("Failed to sync installers of {}".format(dist))

    for repo in CONDA_REPOS:
        for arch in CONDA_ARCHES:
            remote_url = "{}/pkgs/{}/{}".format(CONDA_REPO_BASE_URL, repo, arch)
            local_dir = working_dir / "pkgs" / repo / arch

            tmpdir = tempfile.mkdtemp()
            try:
                size_statistics += sync_repo(remote_url,
                                             local_dir, Path(tmpdir), args.delete)
            except Exception:
                logging.exception("Failed to sync repo: {}/{}".format(repo, arch))
            finally:
                shutil.rmtree(tmpdir)

    for repo in CONDA_CLOUD_REPOS:
        remote_url = "{}/{}".format(CONDA_CLOUD_BASE_URL, repo)
        local_dir = working_dir / "cloud" / repo

        tmpdir = tempfile.mkdtemp()
        try:
            size_statistics += sync_repo(remote_url,
                                         local_dir, Path(tmpdir), args.delete)
        except Exception:
            logging.exception("Failed to sync repo: {}".format(repo))
        finally:
            shutil.rmtree(tmpdir)

    print("Total size is", sizeof_fmt(size_statistics, suffix=""))


if __name__ == "__main__":
    main()
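To run the sync, save the script as anaconda.py, install its two third-party dependencies (requests and pyquery, e.g. pip install requests pyquery), and point it at a directory with enough free space. The working directory comes from the TUNASYNC_WORKING_DIR environment variable or the --working-dir flag; the path below is only an illustration:

python3 anaconda.py --working-dir /opt/conda-mirror/anaconda --delete

The --delete flag removes local package files that the upstream repodata no longer references, which keeps the mirror from accumulating stale packages across repeated runs.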

2. Building the index

Your download directory will contain two folders, pkgs and cloud. In their parent directory, run:

conda index pkgs/*
conda index cloud/*

These commands take a long time to run while they build an index of the local files. (conda index is provided by the conda-build package, so install conda-build first if the command is missing.)
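Once indexing completes, every repo/arch directory should contain a repodata.json. Below is a small sanity-check sketch; the mirror root path is an assumption, so substitute your own download directory:

# check_index.py - sketch: list every <repo>/<arch> directory under the
# mirror root and report whether it has a repodata.json.
from pathlib import Path

working_dir = Path("/opt/conda-mirror/anaconda")  # assumption: your download dir
for arch_dir in sorted(p for p in working_dir.glob("*/*/*") if p.is_dir()):
    status = "ok" if (arch_dir / "repodata.json").is_file() else "MISSING repodata.json"
    print("{}: {}".format(arch_dir.relative_to(working_dir), status))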

3. Deploying a local file server

Put plainly, this step makes the download directory browsable over HTTP.

I deploy it here with nginx running in docker, since I have been using docker a lot lately and it avoids a pile of fiddly environment configuration. Compared with installing nginx directly, the docker route uses slightly more resources, so if docker is already on your machine you can use docker-nginx directly; otherwise you may prefer to install and deploy nginx by hand to avoid the overhead.

Or, if you like, just follow the steps below.

Installing docker

If docker is not installed yet, see: installing docker on ubuntu

Installing docker-nginx

Pull the nginx image

docker pull nginx

Starting nginx

Create the docker-compose-nginx.yml configuration file

mkdir -p ~/docker/nginx
cd ~/docker/nginx
vim docker-compose-nginx.yml

Write the following into docker-compose-nginx.yml:

# docker-compose-nginx.yml
version: '2.2' # compose file format version
services:
  nginx:
    restart: always # restart the container automatically when docker restarts
    container_name: nginx # container name
    image: nginx # image to run
    ports:
      - 80:80 # map port 80, the http port
      - 443:443 # 443 is the https port
    volumes:
      - ~/docker/nginx/conf/nginx.conf:/etc/nginx/nginx.conf # mount the main config file
      - ~/docker/nginx/conf/conf.d:/etc/nginx/conf.d # mount the host config directory into the container
      - ~/docker/nginx/logs:/var/log/nginx # mount the host log directory so container logs can be read directly on the host
      - ~/docker/nginx/html:/data/docker/nginx/html # mount the host html directory so html assets can be updated from the host
      - /opt/conda-mirror/anaconda:/opt/anaconda # mount the anaconda mirror; change /opt/conda-mirror/anaconda to your download directory
    environment:
      - TZ=Asia/Shanghai # set the timezone inside the nginx container
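One pitfall worth noting: create the host-side paths in volumes before starting the container. Docker creates a directory for any missing bind-mount source, and a directory mounted over the /etc/nginx/nginx.conf file path will stop nginx from starting. You can seed ~/docker/nginx/conf/nginx.conf by copying the default config out of the image first (docker cp from a temporary container works).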

Start nginx

docker-compose-nginx.yml同级目录下运行:

docker-compose -f docker-compose-nginx.yml up -d

Configuring the anaconda directory mapping

~/docker/nginx/conf/conf.d下创建conda.conf,然后写入如下内容:

server {
    listen       80;
    server_name  localhost;

    location /anaconda {
        autoindex on; # enable automatic directory listing; without it directory requests fail
        root   /opt/;
    }

}
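How the mapping works: with root /opt/ and location /anaconda, a request like /anaconda/pkgs/main/linux-64/repodata.json resolves to /opt/anaconda/pkgs/main/linux-64/repodata.json inside the container, which is exactly where the compose file mounted the download directory. autoindex on generates the browsable directory listings; without it, requests for a directory are refused.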

Restarting the container

This step is really about restarting nginx: restarting the docker container also restarts the nginx inside it, so we can simply restart the container.

docker restart nginx

Alternatively, run the following to reload nginx inside the container without restarting it:

docker exec -it nginx nginx -s reload

The directory listing should now be reachable at http://192.168.1.120/anaconda (use your own server's address).

4. Using the mirror

Add the channels of our newly deployed mirror with the following commands:

conda config --add channels http://192.168.1.120/anaconda/cloud/pytorch/
conda config --add channels http://192.168.1.120/anaconda/pkgs/free/
conda config --add channels http://192.168.1.120/anaconda/pkgs/main/
conda config --set show_channel_urls yes
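Note that conda config --add prepends to the channel list, so the last channel added (main here) ends up with the highest priority. Before testing with a real conda install, you can confirm the mirror actually serves channel metadata with a small sketch like the one below; the address and channel list mirror the config above, so adjust them to your setup:

# check_mirror.py - sketch: verify the local mirror answers for each channel.
import requests

base = "http://192.168.1.120/anaconda"  # assumption: your mirror server
for channel in ("pkgs/main", "pkgs/free", "cloud/pytorch"):
    url = "{}/{}/linux-64/repodata.json".format(base, channel)
    r = requests.head(url, timeout=(7, 10))
    print("{}: HTTP {}".format(channel, r.status_code))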